OpenRefine
can automate and simplify data cleaning, and provide a clear log of the actions that were taken.
OpenRefine
can automate this task and make it reproducible.OpenRefine
specifically requires that you write cleaned data to a new file. It aims to maintain data integrity and will not allow you to overwrite the original datafile.To participate in the workshop, or to work through the example in your own time, you will need the following on your own computer:
OpenRefine
software, installed on your computeresoph.tab
, which can be downloaded to your computer using this linkIf you do not already have the OpenRefine
software installed, please see the configuration notebook for detailed instructions how to obtain and install the tool on your operating system.
Research data can be collected in a wide variety of ways. Sometimes it is handwritten after human measurement; sometimes it is an electronic file produced by a machine; or a hardcopy printout from some device. These outputs of primary data collection are often termed “raw” data. Raw data may not be in a format that can be analysed directly, for all sorts of different reasons. To make it usable for analysis, we must first “clean” it. In this part of the workshop, we will go through a process of “cleaning” data prior to analysis.
We have a table of data in electronic format (esoph.tab
) that has been collected but is not yet in a form that can be analysed in the second half of the workshop. We will use a program called OpenRefine
to standardise and clean this data.
OpenRefine
OpenRefine
is a Java program that runs locally on your own machine. It runs in your web browser, but no internet connection is required. No data or other information is sent to any remote servers. OpenRefine
preserves data integrity and does not modify or overwrite your data file; all information is stored in a project file, which you can return to at any time, or use to export a new “clean” dataset.
The actions recorded in your data cleaning activity can be “replayed” on a different dataset. In this way, OpenRefine
is helpful for standardising and cleaning data for all similar datafiles in a project.
OpenRefine is very powerful and can do much more than we will cover today. You can find out more about OpenRefine
at https://openrefine.org/.
If you would like more practice using OpenRefine
, a longer workshop (from which some of this material was adapted) is provided by Data Carpentry at https://datacarpentry.org/OpenRefine-ecology-lesson.
esoph.tab
It is always good practice to know exactly where your dataset comes from, and what it represents. The nature and content of your dataset should guide your cleaning and analysis.
The file esoph.tab
describes data from a case-control study of oesophageal cancer in Ille-et-Vilaine, France. The dataset is built-in dataset in R
, and is derived from the publication below:
Breslow, N. E. and Day, N. E. (1980) Statistical Methods in Cancer Research. Volume 1: The Analysis of Case-Control Studies. IARC Lyon / Oxford University Press.
The columns of the data describe the following:
agegp
: age group of each patient, in six age-range categories (units: years)alcgp
: daily alcohol consumption of each patient, in four categories (units: gm/day)tobgp
: daily tobacco consumption of each patient, in four categories (units: gm/day)ncases
: number of oesophageal cancer cases observed in each group (count data, discrete)ncontrols
: number of controls (non-cancer individuals) observed in each group (count data, discrete)The dataset was further modified by LP to introduce commonly-encountered issues when cleaning data, and can be downloaded from this link
OpenRefine
programOpenRefine
should start in your browser. If it does not, try one of the following links in your browser: http://127.0.0.1:3333/ or http://localhost:3333/, and see the configuration page.Browse to the location of the esoph.tab
file.
If you have not done so already, you will need to download the dataset from this link.
Start a new project
Click Create Project
.
Select Get data from
\(\rightarrow\) This Computer
.
Browse
and select the file esoph.tab
.Open
or double-click the esoph.tab
filename.Next>>
At this point, OpenRefine
shows a preview of the data. This is a good point to pause and inspect the data to see if OpenRefine
understood the columns, rows and headers correctly.
OpenRefine
?OpenRefine
handle the column header row?If you inspect the esoph.tab
file in Notepad
or some other application, you will see that line 28 is incorrectly formatted, because there are two tab characters between the first and second columns:
"35-44" "40-79" "20-29" 1 14
"35-44" "40-79" "30+" 0 8
"35-44" "80-119" "0-9g/day" 0 11
"35-44" "80-119" "10-19" 0 6
"35-44" "80-119" "20-29" 0 2
"35-44" "80-119" "30+" 0 1
OpenRefine
can detect problems like this automatically, and correct them in the preview, as it does here for esoph.tab
.
This is your first chance to change some of OpenRefine
’s settings to control how it processes (or parses) your data.
To change the options, modify the settings in Parse data as
\(\rightarrow\) CSV / TSV / separator-based files
and click Update Preview
(at the bottom right).
If everything looks OK, you can create the project. But something doesn’t quite look right.
There are blank rows in the file.
These rows carry no data and do not help the downstream analysis. They have possibly been introduced as a guide to the reader’s eye, so they can see the gaps between age divisions. They are, however, not meaningful in the analysis we want to perform, and should be removed.
In general, data tables should not contain cosmetic features like font differences, colours, or blank rows or columns for the sake of “readability” or to carry information. Such decoration should be cleaned from the dataset.
Store blank rows
Update Preview
OpenRefine
now show?OpenRefine
preview does not show the quotes that surround data in cells like "30+"
. Why do you think this is?Create Project>>
(at the top right).OpenRefine
now presents its main window with an overview of your data.
Facets are a central feature of OpenRefine
and allow you to apply multiple filters to inspect your data, prior to cleaning, and to select subsets of the data for cleaning. Facets group all like values that appear in a column, and let you filter and edit the values that appear in that column.
OpenRefine
tries to help you with your work. The left panel often includes links to information and videos about how to use the software.
agegp
column
agegp
and choose Facet
\(\rightarrow\) Text facet
.agegp
column, and a count of the number of times they occur.agegp
column?agegp
column
45_54
values look to be in error. An underscore (_
) has been used in place of a dash (-
).45_54
category in the left panel. Two options appear: edit
and include
. Click edit
.45_54
. Change this text to 45-54
and click Apply
.65- 74
that has an extra space, and this appears to be in error.65- 74
, and click edit
.65- 74
to 65-74
and click Apply
agegp
column, now?In OpenRefine
, clustering means “finding groups of different values that might be alternative representations of the same thing.” For instance, data might be inconsistently capitalised or abbreviated in a table but refer to the same thing, like “E. coli”, “Escherichia coli”, and “Escherichia Coli”. Technically, these are all different strings so would not be recognised as the same by most analysis tools. OpenRefine
can detect similarities between these strings, and propose to cluster them all together with the same text.
alcgp
column
alcgp
and choose Facet
\(\rightarrow\) Text Facet
.To make space, you may have to close the agegp
facet
alcgp
column?alcgp
column
Cluster
button in the alcgp
facetOpenRefine
has grouped similar terms. Has it grouped all the similar terms correctly?Merge?
box for any terms that should be merged, and click Merge Selected & Re-Cluster
0-39
gm/day, so we need to change the settings.nearest neighbour
and levenshtein
clustering method, with radius 10
and 4
block chars.Merge?
box for the 0-39
gm/day entriesMerge Selected & Close
alcgp
column?alcgp
column clean yet?Notice that only one category has a set of units. For consistency (which is always desirable) could choose here to have units in all categories or none. My preference would be to have no units indicated here, but instead to place units in the header, or in an accompanying metadata file.
0-39g/day
and click edit
Apply
tobgp
There are several issues with data in tobgp
, and we can fix them all using faceting and clustering.
tobgp
nearest neighbour
, ppm
with radius 5
and 4
block chars`20-29
and 20-29 a day
) and click Merge Selected & Re-Cluster
nearest neighbour
, ppm
with radius 5
and 3
block chars`0-9g/day
, 0-9g per day
, 0-9g/DAY
and 0-9g/Day
) and click Merge Selected & Close
Again, notice that only one category has a set of units. As above, we prefer to place units in the header or an accompanying metadata file.
0-9g/day
and click edit
alcgp
column?tobgp
column clean yet?OpenRefine
does not modify your data directly. Instead it builds an ordered series of instructions - a script - in a “language” called JSON
1 that describes the processes applied to the input dataset. This can be “replayed” live within OpenRefine
using the Undo/Redo
options in the left panel.
Undo/Redo
You can move forwards and backwards in the cleaning process to see the state of the data each time, by clicking on the steps in the left panel.
The instructions in your script can be saved to a file. This can be applied to other files in OpenRefine. If you had several files with similar kinds of errors, you could open the saved file in OpenRefine
and apply it to the dataset, so that each file is cleaned in exactly the same way.
You can attach the file to a publication, report or thesis, as supplementary information, so that all processes in analysing the data for a project are open.
Extract...
OpenRefine
saves your project as you go. If you close and reopen OpenRefine
, your projects will be listed and you can resume any of the projects at any time.
Your project includes the raw data and cleaning steps that are applied. You can share this with collaborators, or attach to a paper as supplementary information.
Export
button (top right) and select OpenRefine project archive to file
.tar.gz
fileThe .tar.gz
file format may not be familiar to you, but it is a widely-used format for compressing and archiving data.
As OpenRefine
does not modify your raw data, you will need to export your cleaned data, to take it forward for analysis with other tools.
Export
button (top right) and select the file type you want to use to export the dataBe sure to save your data as both tab-separated format and comma-separated format, as you will need both formats for the second part of the workshop.
Remember that using widely-supported, plain text, open non-proprietary formats like .tsv
/.tab
(tab-separated value) or .csv
(comma-separated value) makes it easier for yourself and others to use your data.
When you export your cleaned data and project files, remember to follow good data management principles for naming and saving these files. You should be able to come back to your computer in six months and still be able to tell which file contains the cleaned data.