I’m sure you’re all familiar with the situation in figure 1.1. Research (like writing) tends to be incremental. Project files start out as random notes here, a bit of code there, some data over there… and eventually a project thesis emerges.
But the situation in figure 1.1 is bad. It’s not clear which order the documents were created in, or which files contain which changes. There are many reasons why we should always avoid working in a haphazard manner, or naming files like those in figure 1.2.
Regardless of your project topic, you are going to spend a lot of time working on a computer. Your relationship with your computer is going to be one of the most important ones in your career. It’s worth making the effort to ensure it is a healthy and productive one.
It’s possible to spend a lot of time chasing down old file versions, and a lot of that pain can be avoided by using good project management principles.
All projects are different, to some extent, so there is no one-size-fits-all advice. It’s important to find something that works for you, but it’s also important that it works for collaborators.
It’s very tempting to imagine that you will always understand your current project management structure. We have all learned from experience, though, that “me” in six months is quite different from “me” right now. A useful rule of thumb for assessing whether your data management is working is to ask whether “me in six months” would understand what these files are, why they are here, and what was done with them, which is more or less the same thing as asking whether a collaborator who had never seen the project could make sense of it.
You will probably never know more about the files you’re creating than at the time you create them. Similarly, you’ll likely never understand what you did in an analysis better than when you’re running that analysis, so:
Your goal is to make the analysis clear enough to be repeated.
In general, if you name files and folders so that they are self-describing, it makes everyone’s life easier, including your own. A file called analysis.doc could contain anything, but if the file was called 2021-08-21_geldoc_sampleABC.doc you could probably guess that it was a geldoc image for sample ABC, taken on 21st August 2021. A similar principle works for naming folders. A folder called mass_spec_data suggests it contains mass spectrometry data, but 2021-05-15_ms_sampleABC makes it clearer which sample was run, and when.
This approach does not cover all circumstances, and it’s not a good idea to make file or folder names arbitrarily long. So it is good practice to include a plain text README.txt file in each folder. This file should describe the files in the folder briefly, but more importantly should describe what the data represents and why it is important (e.g. the research goal and which project it belongs to). There should also be a contact name/email in the file.
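For example, a README.txt for the 2021-05-15_ms_sampleABC folder mentioned above might look something like this (the project details and contact are invented for illustration):

    This folder contains mass spectrometry data for sample ABC, run on 15 May 2021.
    The data were collected to check sample purity as part of the [project name] project.

    Contact: A. Student, a.student@university.ac.uk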
These small steps will make the project data much more understandable to others you might share it with, and to yourself if you return to the project later.
If you keep all relevant files together in a single folder, it is easy to package up (e.g. as a .zip archive), move around, share and publish alongside your paper. Having all relevant files in one place means you will spend less time hunting around your hard drive to find files.
Name your project’s working directory after the project.
This is only one of many ways to structure a working directory.
It’s a good starting point, but something else might be more appropriate for your own work.
The intent behind each directory in figure 2.1 is:
WORKING_DIR/ is the root directory of the project.
data/ is a subdirectory for storing data (this can be subdivided further, e.g. data/raw, data/intermediate, data/processed, etc.).
data_output/ could be a place to write analysis output.
documents/ is a directory that could contain notes, draft papers, supporting material, etc.
fig_output/ is a directory that could be used to store graphical output from the analysis.
scripts/ is a directory where you could store executable code that automates your analysis.
Include a README.txt file in each folder and subfolder to explain its contents.
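If you find yourself setting up this kind of layout for every new project, a short script can create it consistently. The sketch below is one possible approach in Python; the folder names simply mirror figure 2.1, and nothing about this script is required for the workshops:

    # create_project.py: sketch of a helper that builds the layout in figure 2.1.
    from pathlib import Path

    SUBDIRS = ["data/raw", "data_output", "documents", "fig_output", "scripts"]

    def create_project(root: str) -> None:
        """Create an empty project skeleton, with a placeholder README in each folder."""
        for sub in SUBDIRS:
            folder = Path(root) / sub
            folder.mkdir(parents=True, exist_ok=True)   # create missing parents, ignore existing
            (folder / "README.txt").touch()             # placeholder README to fill in later

    if __name__ == "__main__":
        # Name the working directory after the project.
        create_project("my_project")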
Keeping your raw data in its original state establishes provenance and enables reanalysis. This means that you will always have the original data (and be certain that it’s the original data!), and you’ll be able to return to it to repeat your analysis or carry out a different analysis as needed.
You should never modify your raw data files in-place. If you do so, it can be impossible to recover their original state, rendering the original data collection suspect and compromising the integrity of the entire project.
Keep raw data in a separate, clearly labelled subfolder of the project working directory.
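A simple extra safeguard, if you work on Linux or macOS, is to make raw data files read-only as soon as they are copied into the project, so they cannot be edited in place by accident. Here is a minimal Python sketch; it assumes the data/raw folder from figure 2.1, and note that these permission bits have only a limited effect on Windows:

    # protect_raw.py: mark everything under data/raw as read-only,
    # so raw files cannot be modified in place by accident.
    import stat
    from pathlib import Path

    READ_ONLY = stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH   # owner/group/other: read only

    for path in Path("data/raw").rglob("*"):
        if path.is_file():
            path.chmod(READ_ONLY)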
Most of the time, scientific data is dirty, by which we mean it is not exactly in the form necessary for analysis³. Cleaning scientific data is part of the analysis process, and can require quite sophisticated choices. Should you, for instance, remove missing values or record them as Null? All of these decisions are part of your analysis.
Manual modification and cleaning of data tends not to be reproducible. It can be difficult to remember (or even express) the complex operations involved in removing outliers, fixing typographical errors, or the many other steps that go into cleaning a dataset. For that reason, it is far preferable to automate the process as much as possible.
We will cover one of the tools available for automated and reproducible data-cleaning (OpenRefine) in the workshop in week 4.
Data is like chicken.
Cleaned data should always be kept separate from raw data.
Place cleaned data in its own subfolder.
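If you already use Python, one way to make cleaning both automated and reproducible is to record every cleaning decision in a short script that reads from the raw data folder and writes its output to a separate, cleaned-data folder. The sketch below is purely illustrative: the file names, column name, and cleaning choices are all invented.

    # clean_data.py: a recorded, re-runnable cleaning step.
    # File names, column names, and thresholds are invented for illustration;
    # it assumes the data/processed/ folder already exists.
    import pandas as pd

    raw = pd.read_csv("data/raw/2021-05-15_ms_sampleABC.csv")

    # Decision 1: drop rows with no intensity measurement at all.
    cleaned = raw.dropna(subset=["intensity"]).copy()

    # Decision 2: flag, rather than silently delete, suspiciously large values.
    cleaned["possible_outlier"] = cleaned["intensity"] > cleaned["intensity"].quantile(0.99)

    # Write cleaned data to its own folder, leaving data/raw untouched.
    cleaned.to_csv("data/processed/2021-05-15_ms_sampleABC_clean.csv", index=False)

Re-running the script reproduces exactly the same cleaned file, and the script itself documents what was done and why.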
As with data cleaning, manual analysis can sometimes be⁴ not entirely reproducible. Especially when dealing with large datasets, it can be very difficult to be consistent with manual analysis. For very large datasets, manual analysis may not be possible.
Automating your analyses is outwith the scope of these workshops, but there are many very good online resources that can help you get started.
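As a small taste of what this can look like, even a short script that reads the cleaned data and writes a summary into the output folders counts as an automated, repeatable analysis. The sketch below reuses the invented file and column names from the cleaning example, and assumes the data_output/ folder exists:

    # analyse.py: a minimal automated analysis step.
    # Reads the cleaned data and writes a summary table to data_output/.
    import pandas as pd

    cleaned = pd.read_csv("data/processed/2021-05-15_ms_sampleABC_clean.csv")

    # Summarise intensities, separately for flagged and unflagged rows.
    summary = cleaned.groupby("possible_outlier")["intensity"].describe()

    summary.to_csv("data_output/2021-05-15_ms_sampleABC_summary.csv")

Running the script again repeats the analysis exactly, which is the goal stated earlier: an analysis clear enough to be repeated.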
As you will have seen from figure 1.1, it is easy to get into a situation where we have multiple nearly-identical versions of the same document. That can happen because we share document versions via email, or make updates in one file and not another for all kinds of reasons.
You may be familiar with Microsoft Word’s Track Changes, Google Docs’ Version History or LibreOffice’s Recording and Displaying Changes. All of these options allow you to snapshot different versions of the same document, and to preferentially accept or reject individual changes. This is a kind of version control - active management of several versions and the history of a document. Version control can help avoid the file naming issues in figure 1.1 and make your workflow much easier to manage.
A detailed account of version control is outwith the scope of these workshops, but there are many very good online resources that can help you get started.
If a file’s name is meaningful and self-explanatory, then you don’t need to open it to see what’s in it. That will save you time.
Avoid using terms like “draft” or “final” for different versions of the same document. It’s easy to lose track of successive drafts and “final” versions.
If you aren’t going to use some kind of version control (see above), then a quick way to get some of its benefits is to date versions of a document with ISO 8601 dates. For example:
2021-09-02_thesis.docx
2021-08-31_thesis.docx
2021-08-16_thesis.docx
2021-08-09_thesis.docx
Because the date comes first, in year-month-day order, sorting these files by name in your file explorer automatically puts them in date order.
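If your output files are produced by a script, these dated names can be generated automatically. A small Python sketch (the base name is just an example):

    # dated_filename.py: build an ISO 8601-dated file name, e.g. 2021-09-02_thesis.docx
    from datetime import date

    def dated_name(base: str, extension: str) -> str:
        """Return a file name prefixed with today's date in YYYY-MM-DD format."""
        return f"{date.today().isoformat()}_{base}.{extension}"

    print(dated_name("thesis", "docx"))   # e.g. 2021-09-02_thesis.docx if run on 2 September 2021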
Each project should have a data management plan, which is developed through discussion between the project leader (e.g. your supervisor) and those working on the project. This should help ensure continuation of the project in case of hardware failure or other catastrophe, which means, in part, considering where you should store your data.
There is no one-size-fits-all data management plan, but there are some good principles to follow for how and where to store your data, so that you are at less risk of accidental (or malicious) loss.
Keep backups. This can be automated: macOS has Time Machine built in, which makes continuous incremental backups, and cloud services such as OneDrive can link in with your laptop to ensure that data is continuously synchronised to cloud storage.
A common rule of thumb is to keep at least three copies of any data you cannot afford to lose. Those three places could be: your laptop, an external hard drive in your office, and a cloud service. It is important to have a copy of the key data at two physically-separated sites, in case of physical accident or disaster at one of those sites. Cloud services distribute their storage and can often be considered safer than on-site storage.
It might seem sufficient to keep backups, and put your data in the cloud. But consider how your project partners (and supervisor) will get access if, for some reason, you are not available. Would it help to have shared access to the same cloud resource?
The University of Strathclyde uses Microsoft’s OneDrive to securely store and share data in the cloud.
OneDrive can integrate with your operating system, providing fine-grained permissions for access to shared project documents and data. Users get 1TB of storage space.
We will work with OpenRefine as a tool for data cleaning, in the workshop in week 4.
³ It is sometimes argued that ≈80% of data analysis is spent on the process of cleaning and preparing data.
⁴ Some would say frequently is.