Dtool Solutions

Organise and process biological data more easily and cost-effectively

The problem with mixing data

You are running out of space on your computer and you need to delete files from old projects. If this prospect fills you with fear of boredom, or worse, fear of losing data, then read on. Here is some practical advice on how to structure files and directories to make this easier in the future.

The problem

You have run out of disk space. You need to delete unnecessary files from an old project. Below is a tree view of its files.

$ tree some_project/
some_project/
├── analysis.py
├── cleaned_data.csv
├── data.csv
├── more_data.csv
├── multiple_file_analysis.py
├── output.csv
└── pre_process.py

What files can you safely delete?

What a mess

This toy example may seem artificial, but it is based on a very common working pattern.

When one starts working on a project it is convenient to keep scripts close to the raw data. A project may therefore start out along the lines shown below, with a data file (data.csv) and an analysis script (analysis.py).

$ tree some_project/
some_project/
├── analysis.py
└── data.csv

The analysis script works on the data to produce an output file (output.csv).

$ tree some_project/
some_project/
├── analysis.py
├── data.csv
└── output.csv
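
To make the example concrete, analysis.py might look something like the minimal sketch below. The column name "value" and the derived statistic are invented for illustration; the real script could be doing anything from curve fitting to sequence analysis.

# analysis.py - minimal sketch; the "value" column is a made-up example.
import csv

# Read the raw data.
with open("data.csv") as fh:
    rows = list(csv.DictReader(fh))

# Write a derived number per row to the output file.
with open("output.csv", "w", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerow(["value", "doubled"])
    for row in rows:
        value = float(row["value"])
        writer.writerow([value, value * 2])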

As it turns out, the data needs some pre-processing to standardise it. One therefore creates a script to do this pre-processing (pre_process.py), which generates cleaned data (cleaned_data.csv).

$ tree some_project/
some_project/
├── analysis.py
├── cleaned_data.csv
├── data.csv
├── output.csv
└── pre_process.py
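
The pre-processing itself could be almost anything; as a hedged sketch, suppose the standardisation amounts to lower-casing the header and dropping blank rows:

# pre_process.py - sketch; the clean-up rules are assumptions.
import csv

with open("data.csv") as fh:
    reader = csv.reader(fh)
    # Standardise the header (invented rule: lower-case names).
    header = [name.strip().lower() for name in next(reader)]
    # Drop rows that are entirely blank (invented rule).
    rows = [row for row in reader if any(cell.strip() for cell in row)]

with open("cleaned_data.csv", "w", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerow(header)
    writer.writerows(rows)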

Then more data (more_data.csv) arrives and one realises that the analysis script needs to be able to work on more than one input file. However, rather than editing the existing script, one writes a new one that works on multiple files (multiple_file_analysis.py).

$ tree some_project/
some_project/
├── analysis.py
├── cleaned_data.csv
├── data.csv
├── more_data.csv
├── multiple_file_analysis.py
├── output.csv
└── pre_process.py
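
A sketch of how multiple_file_analysis.py might accept any number of input files, assuming they are passed on the command line (the per-file row count stands in for the real analysis):

# multiple_file_analysis.py - sketch of a multi-input analysis.
# Hypothetical usage: python multiple_file_analysis.py data.csv more_data.csv
import csv
import sys

with open("output.csv", "w", newline="") as out_fh:
    writer = csv.writer(out_fh)
    writer.writerow(["source_file", "num_rows"])
    for fname in sys.argv[1:]:
        # Stand-in analysis: count the data rows in each input file.
        with open(fname) as in_fh:
            num_rows = sum(1 for _ in csv.DictReader(in_fh))
        writer.writerow([fname, num_rows])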

Nine months later you run out of disk space. You need to delete unnecessary files from an old project. Which files can you safely delete?

A better way

Now suppose that you were faced with the same challenge of deleting unnecessary files, but the files were organised into the directory structure below. Which file(s) would you delete?

$ tree some_project/
some_project/
├── final_results
│   └── output.csv
├── intermediate_data
│   └── cleaned_data.csv
├── raw_data
│   ├── data.csv
│   └── more_data.csv
└── scripts
    ├── analysis.py
    ├── multiple_file_analysis.py
    └── pre_process.py

The directory names suggest that the intermediate file cleaned_data.csv is a good candidate for deletion. If the whole analysis pipeline were automated, one might also consider deleting the final result (output.csv), depending on how long the analysis takes to run.
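
If you already have a flat project like the one above, the move to this layout can be scripted. Below is a minimal sketch, assuming the file names from this post and the file-to-directory mapping implied by the directory names:

# reorganise.py - sketch; the mapping below is an assumption
# based on the directory names used in this post.
from pathlib import Path

layout = {
    "raw_data": ["data.csv", "more_data.csv"],
    "intermediate_data": ["cleaned_data.csv"],
    "final_results": ["output.csv"],
    "scripts": ["analysis.py", "multiple_file_analysis.py", "pre_process.py"],
}

project = Path("some_project")
for dirname, filenames in layout.items():
    target = project / dirname
    target.mkdir(exist_ok=True)
    for filename in filenames:
        # Move each file into its new home.
        (project / filename).rename(target / filename)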

The reorganised setup is an improvement. However, it is not ideal. What if you, or one of your colleagues, wanted to use the raw data for a different analysis? How would you enable that?

Avoiding data duplication

There are several reasons to want to avoid duplicating data. One is that every duplicate costs disk space. Another is that having multiple copies makes it ambiguous which one is the canonical source of the data.

A better solution is therefore to have a canonical place to store your raw data. For example, something along the lines of the structure below.

$ tree research-group-share/
research-group-share/
├── projects
│   ├── cure_cancer
│   │   ├── final_results
│   │   ├── intermediate_data
│   │   └── scripts
│   └── feed_the_world
│       ├── final_results
│       ├── intermediate_data
│       └── scripts
└── raw_data
    ├── raw_data_1.csv
    ├── raw_data_2.csv
    └── raw_data_3.csv

In the above, all the raw data is potentially available to all projects, and it is up to each project to specify which raw data files to include in its analysis.
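
For example, a script in one of the projects might declare which shared files it consumes, along the lines of the sketch below (the mount point, the chosen input files, and the row count are assumptions for illustration):

# projects/cure_cancer/scripts/pre_process.py - hypothetical sketch.
import csv
from pathlib import Path

# Where the share is mounted is an assumption.
RAW_DATA = Path("/mnt/research-group-share/raw_data")
# The raw data files this particular project uses.
INPUTS = ["raw_data_1.csv", "raw_data_3.csv"]

for fname in INPUTS:
    with open(RAW_DATA / fname) as fh:
        num_rows = sum(1 for _ in csv.DictReader(fh))
    print(f"{fname}: {num_rows} rows")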

Having the data in a canonical location helps avoid data duplication. It also has the added benefit of making it easier to find data.

Summary

The key take home messages are:

  1. Keep raw data separate from derived/intermediate data
  2. Keep raw data in a consistent location
  3. Standardise the structure of your projects so that it is easy to identify files that can be re-generated