OpenRefine can automate and simplify data cleaning, and provide a clear log of the actions that were taken.
OpenRefine can automate this task and make it reproducible.OpenRefine specifically requires that you write cleaned data to a new file. It aims to maintain data integrity and will not allow you to overwrite the original datafile.To participate in the workshop, or to work through the example in your own time, you will need the following on your own computer:
OpenRefine software, installed on your computeresoph.tab, which can be downloaded to your computer using this linkIf you do not already have the OpenRefine software installed, please see the configuration notebook for detailed instructions how to obtain and install the tool on your operating system.
Research data can be collected in a wide variety of ways. Sometimes it is handwritten after human measurement; sometimes it is an electronic file produced by a machine; or a hardcopy printout from some device. These outputs of primary data collection are often termed “raw” data. Raw data may not be in a format that can be analysed directly, for all sorts of different reasons. To make it usable for analysis, we must first “clean” it. In this part of the workshop, we will go through a process of “cleaning” data prior to analysis.
We have a table of data in electronic format (esoph.tab) that has been collected but is not yet in a form that can be analysed in the second half of the workshop. We will use a program called OpenRefine to standardise and clean this data.
OpenRefineOpenRefine is a Java program that runs locally on your own machine. It runs in your web browser, but no internet connection is required. No data or other information is sent to any remote servers. OpenRefine preserves data integrity and does not modify or overwrite your data file; all information is stored in a project file, which you can return to at any time, or use to export a new “clean” dataset.
The actions recorded in your data cleaning activity can be “replayed” on a different dataset. In this way, OpenRefine is helpful for standardising and cleaning data for all similar datafiles in a project.
OpenRefine is very powerful and can do much more than we will cover today. You can find out more about OpenRefine at https://openrefine.org/.
If you would like more practice using OpenRefine, a longer workshop (from which some of this material was adapted) is provided by Data Carpentry at https://datacarpentry.org/OpenRefine-ecology-lesson.
esoph.tabIt is always good practice to know exactly where your dataset comes from, and what it represents. The nature and content of your dataset should guide your cleaning and analysis.
The file esoph.tab describes data from a case-control study of oesophageal cancer in Ille-et-Vilaine, France. The dataset is built-in dataset in R, and is derived from the publication below:
Breslow, N. E. and Day, N. E. (1980) Statistical Methods in Cancer Research. Volume 1: The Analysis of Case-Control Studies. IARC Lyon / Oxford University Press.
The columns of the data describe the following:
agegp: age group of each patient, in six age-range categories (units: years)alcgp: daily alcohol consumption of each patient, in four categories (units: gm/day)tobgp: daily tobacco consumption of each patient, in four categories (units: gm/day)ncases: number of oesophageal cancer cases observed in each group (count data, discrete)ncontrols: number of controls (non-cancer individuals) observed in each group (count data, discrete)The dataset was further modified by LP to introduce commonly-encountered issues when cleaning data, and can be downloaded from this link
OpenRefine programOpenRefine should start in your browser. If it does not, try one of the following links in your browser: http://127.0.0.1:3333/ or http://localhost:3333/, and see the configuration page.Figure 3.1: OpenRefine landing page.
Browse to the location of the esoph.tab file.
If you have not done so already, you will need to download the dataset from this link.
Start a new project
Click Create Project.
Select Get data from \(\rightarrow\) This Computer.
Figure 3.2: OpenRefine data file selection.
Browse and select the file esoph.tab.Open or double-click the esoph.tab filename.Next>>At this point, OpenRefine shows a preview of the data. This is a good point to pause and inspect the data to see if OpenRefine understood the columns, rows and headers correctly.
Figure 3.3: OpenRefine data preview.
OpenRefine?OpenRefine handle the column header row?If you inspect the esoph.tab file in Notepad or some other application, you will see that line 28 is incorrectly formatted, because there are two tab characters between the first and second columns:
"35-44" "40-79" "20-29" 1 14
"35-44" "40-79" "30+" 0 8
"35-44" "80-119" "0-9g/day" 0 11
"35-44" "80-119" "10-19" 0 6
"35-44" "80-119" "20-29" 0 2
"35-44" "80-119" "30+" 0 1
OpenRefine can detect problems like this automatically, and correct them in the preview, as it does here for esoph.tab.
This is your first chance to change some of OpenRefine’s settings to control how it processes (or parses) your data.
To change the options, modify the settings in Parse data as \(\rightarrow\) CSV / TSV / separator-based files and click Update Preview (at the bottom right).
If everything looks OK, you can create the project. But something doesn’t quite look right.
There are blank rows in the file.
These rows carry no data and do not help the downstream analysis. They have possibly been introduced as a guide to the reader’s eye, so they can see the gaps between age divisions. They are, however, not meaningful in the analysis we want to perform, and should be removed.
In general, data tables should not contain cosmetic features like font differences, colours, or blank rows or columns for the sake of “readability” or to carry information. Such decoration should be cleaned from the dataset.
Store blank rowsUpdate PreviewFigure 3.4: OpenRefine data preview with cleaning on import.
OpenRefine now show?OpenRefine preview does not show the quotes that surround data in cells like "30+". Why do you think this is?Create Project>> (at the top right).OpenRefine now presents its main window with an overview of your data.
Figure 3.5: OpenRefine data overview.
Facets are a central feature of OpenRefine and allow you to apply multiple filters to inspect your data, prior to cleaning, and to select subsets of the data for cleaning. Facets group all like values that appear in a column, and let you filter and edit the values that appear in that column.
OpenRefine tries to help you with your work. The left panel often includes links to information and videos about how to use the software.
agegp column
agegp and choose Facet \(\rightarrow\) Text facet.agegp column, and a count of the number of times they occur.Figure 4.1: OpenRefine faceting view.
agegp column?agegp column
45_54 values look to be in error. An underscore (_) has been used in place of a dash (-).45_54 category in the left panel. Two options appear: edit and include. Click edit.45_54. Change this text to 45-54 and click Apply.65- 74 that has an extra space, and this appears to be in error.65- 74, and click edit.65- 74 to 65-74 and click Applyagegp column, now?In OpenRefine, clustering means “finding groups of different values that might be alternative representations of the same thing.” For instance, data might be inconsistently capitalised or abbreviated in a table but refer to the same thing, like “E. coli”, “Escherichia coli”, and “Escherichia Coli”. Technically, these are all different strings so would not be recognised as the same by most analysis tools. OpenRefine can detect similarities between these strings, and propose to cluster them all together with the same text.
alcgp column
alcgp and choose Facet \(\rightarrow\) Text Facet.To make space, you may have to close the agegp facet
alcgp column?alcgp column
Cluster button in the alcgp facetOpenRefine has grouped similar terms. Has it grouped all the similar terms correctly?Merge? box for any terms that should be merged, and click Merge Selected & Re-Cluster0-39 gm/day, so we need to change the settings.nearest neighbour and levenshtein clustering method, with radius 10 and 4 block chars.Merge? box for the 0-39 gm/day entriesMerge Selected & Closealcgp column?alcgp column clean yet?Notice that only one category has a set of units. For consistency (which is always desirable) could choose here to have units in all categories or none. My preference would be to have no units indicated here, but instead to place units in the header, or in an accompanying metadata file.
0-39g/day and click editApplytobgpThere are several issues with data in tobgp, and we can fix them all using faceting and clustering.
tobgpnearest neighbour, ppm with radius 5 and 4 block chars`20-29 and 20-29 a day) and click Merge Selected & Re-Clusternearest neighbour, ppm with radius 5 and 3 block chars`0-9g/day, 0-9g per day, 0-9g/DAY and 0-9g/Day) and click Merge Selected & CloseAgain, notice that only one category has a set of units. As above, we prefer to place units in the header or an accompanying metadata file.
0-9g/day and click editalcgp column?tobgp column clean yet?OpenRefine does not modify your data directly. Instead it builds an ordered series of instructions - a script - in a “language” called JSON1 that describes the processes applied to the input dataset. This can be “replayed” live within OpenRefine using the Undo/Redo options in the left panel.
Undo/RedoFigure 7.1: OpenRefine Undo/Redo view.
You can move forwards and backwards in the cleaning process to see the state of the data each time, by clicking on the steps in the left panel.
The instructions in your script can be saved to a file. This can be applied to other files in OpenRefine. If you had several files with similar kinds of errors, you could open the saved file in OpenRefine and apply it to the dataset, so that each file is cleaned in exactly the same way.
You can attach the file to a publication, report or thesis, as supplementary information, so that all processes in analysing the data for a project are open.
Extract...OpenRefine saves your project as you go. If you close and reopen OpenRefine, your projects will be listed and you can resume any of the projects at any time.
Your project includes the raw data and cleaning steps that are applied. You can share this with collaborators, or attach to a paper as supplementary information.
Export button (top right) and select OpenRefine project archive to file.tar.gz fileThe .tar.gz file format may not be familiar to you, but it is a widely-used format for compressing and archiving data.
As OpenRefine does not modify your raw data, you will need to export your cleaned data, to take it forward for analysis with other tools.
Export button (top right) and select the file type you want to use to export the dataBe sure to save your data as both tab-separated format and comma-separated format, as you will need both formats for the second part of the workshop.
Remember that using widely-supported, plain text, open non-proprietary formats like .tsv/.tab (tab-separated value) or .csv (comma-separated value) makes it easier for yourself and others to use your data.
When you export your cleaned data and project files, remember to follow good data management principles for naming and saving these files. You should be able to come back to your computer in six months and still be able to tell which file contains the cleaned data.