• Data is often messy and in a state where it cannot be analysed directly without “cleaning” it.
  • It is important for reproducibility that what you do with your data is recorded.
  • Using tools like OpenRefine can automate and simplify data cleaning, and provide a clear log of the actions that were taken.
    • Logs of data cleaning are valuable supplementary information.
  • Data cleaning is often a repetitive task, for single steps (e.g. removing quotes), and for complete data-cleaning pipelines (e.g. cleaning the output from a specific machine). Tools like OpenRefine can automate this task and make it reproducible.
  • OpenRefine specifically requires that you write cleaned data to a new file. It aims to maintain data integrity and will not allow you to overwrite the original datafile.

1 Setup

To participate in the workshop, or to work through the example in your own time, you will need the following on your own computer:

  1. The OpenRefine software, installed on your computer
  2. The example dataset esoph.tab, which can be downloaded to your computer using this link

If you do not already have the OpenRefine software installed, please see the configuration notebook for detailed instructions how to obtain and install the tool on your operating system.

2 Introduction

Research data can be collected in a wide variety of ways. Sometimes it is handwritten after human measurement; sometimes it is an electronic file produced by a machine; or a hardcopy printout from some device. These outputs of primary data collection are often termed “raw” data. Raw data may not be in a format that can be analysed directly, for all sorts of different reasons. To make it usable for analysis, we must first “clean” it. In this part of the workshop, we will go through a process of “cleaning” data prior to analysis.

We have a table of data in electronic format (esoph.tab) that has been collected but is not yet in a form that can be analysed in the second half of the workshop. We will use a program called OpenRefine to standardise and clean this data.

2.1 OpenRefine

OpenRefine is a Java program that runs locally on your own machine. It runs in your web browser, but no internet connection is required. No data or other information is sent to any remote servers. OpenRefine preserves data integrity and does not modify or overwrite your data file; all information is stored in a project file, which you can return to at any time, or use to export a new “clean” dataset.

The actions recorded in your data cleaning activity can be “replayed” on a different dataset. In this way, OpenRefine is helpful for standardising and cleaning data for all similar datafiles in a project.

OpenRefine is very powerful and can do much more than we will cover today. You can find out more about OpenRefine at https://openrefine.org/.

If you would like more practice using OpenRefine, a longer workshop (from which some of this material was adapted) is provided by Data Carpentry at https://datacarpentry.org/OpenRefine-ecology-lesson.

2.2 Dataset - esoph.tab

It is always good practice to know exactly where your dataset comes from, and what it represents. The nature and content of your dataset should guide your cleaning and analysis.

The file esoph.tab describes data from a case-control study of oesophageal cancer in Ille-et-Vilaine, France. The dataset is built-in dataset in R, and is derived from the publication below:

Breslow, N. E. and Day, N. E. (1980) Statistical Methods in Cancer Research. Volume 1: The Analysis of Case-Control Studies. IARC Lyon / Oxford University Press.

The columns of the data describe the following:

  1. agegp: age group of each patient, in six age-range categories (units: years)
  2. alcgp: daily alcohol consumption of each patient, in four categories (units: gm/day)
  3. tobgp: daily tobacco consumption of each patient, in four categories (units: gm/day)
  4. ncases: number of oesophageal cancer cases observed in each group (count data, discrete)
  5. ncontrols: number of controls (non-cancer individuals) observed in each group (count data, discrete)

The dataset was further modified by LP to introduce commonly-encountered issues when cleaning data, and can be downloaded from this link

3 Create a New Project

OpenRefine landing page.

Figure 3.1: OpenRefine landing page.

  • Browse to the location of the esoph.tab file.

  • If you have not done so already, you will need to download the dataset from this link.

  • Start a new project

  • Click Create Project.

  • Select Get data from \(\rightarrow\) This Computer.

OpenRefine data file selection.

Figure 3.2: OpenRefine data file selection.

  • Load in the data
  • Click Browse and select the file esoph.tab.
  • Click Open or double-click the esoph.tab filename.
  • Click Next>>

At this point, OpenRefine shows a preview of the data. This is a good point to pause and inspect the data to see if OpenRefine understood the columns, rows and headers correctly.

OpenRefine data preview.

Figure 3.3: OpenRefine data preview.

  1. How many lines of data are identified by OpenRefine?
  2. How does OpenRefine handle the column header row?

If you inspect the esoph.tab file in Notepad or some other application, you will see that line 28 is incorrectly formatted, because there are two tab characters between the first and second columns:

"35-44" "40-79" "20-29" 1   14
"35-44" "40-79" "30+"   0   8
"35-44" "80-119"    "0-9g/day"  0   11
"35-44"   "80-119"  "10-19" 0   6
"35-44" "80-119"    "20-29" 0   2
"35-44" "80-119"    "30+"   0   1

OpenRefine can detect problems like this automatically, and correct them in the preview, as it does here for esoph.tab.

This is your first chance to change some of OpenRefine’s settings to control how it processes (or parses) your data.

To change the options, modify the settings in Parse data as \(\rightarrow\) CSV / TSV / separator-based files and click Update Preview (at the bottom right).

If everything looks OK, you can create the project. But something doesn’t quite look right.

There are blank rows in the file.

These rows carry no data and do not help the downstream analysis. They have possibly been introduced as a guide to the reader’s eye, so they can see the gaps between age divisions. They are, however, not meaningful in the analysis we want to perform, and should be removed.

In general, data tables should not contain cosmetic features like font differences, colours, or blank rows or columns for the sake of “readability” or to carry information. Such decoration should be cleaned from the dataset.

  • Remove blank rows from the dataset
    • Uncheck the box next to Store blank rows
    • The preview may update automatically, but if it does not, click Update Preview
OpenRefine data preview with cleaning on import.

Figure 3.4: OpenRefine data preview with cleaning on import.

  1. How many rows of data does OpenRefine now show?
  2. The OpenRefine preview does not show the quotes that surround data in cells like "30+". Why do you think this is?
  • Create the project
    • If everything looks OK, click Create Project>> (at the top right).

OpenRefine now presents its main window with an overview of your data.

OpenRefine data overview.

Figure 3.5: OpenRefine data overview.

4 Faceting data

Facets are a central feature of OpenRefine and allow you to apply multiple filters to inspect your data, prior to cleaning, and to select subsets of the data for cleaning. Facets group all like values that appear in a column, and let you filter and edit the values that appear in that column.

OpenRefine tries to help you with your work. The left panel often includes links to information and videos about how to use the software.

4.1 Cleaning data columns, using facets

  • Inspecting the agegp column
    • Click the down arrow next to agegp and choose Facet \(\rightarrow\) Text facet.
    • The left panel now contains every unique value in the agegp column, and a count of the number of times they occur.
OpenRefine faceting view.

Figure 4.1: OpenRefine faceting view.

  1. How many unique values are there in the agegp column?
  2. Do you notice any problems with the data? If so, what are they, and how would you fix them?
  • Fixing the agegp column
    • The 45_54 values look to be in error. An underscore (_) has been used in place of a dash (-).
    • Hover the mouse over the 45_54 category in the left panel. Two options appear: edit and include. Click edit.
    • A window appears, where you can edit the text 45_54. Change this text to 45-54 and click Apply.
    • There is a single instance of 65- 74 that has an extra space, and this appears to be in error.
    • Hover the mouse over 65- 74, and click edit.
    • Change the text in the new window from 65- 74 to 65-74 and click Apply
  1. How many unique values are there in the agegp column, now?

5 Clustering data

In OpenRefine, clustering means “finding groups of different values that might be alternative representations of the same thing.” For instance, data might be inconsistently capitalised or abbreviated in a table but refer to the same thing, like “E. coli”, “Escherichia coli”, and “Escherichia Coli”. Technically, these are all different strings so would not be recognised as the same by most analysis tools. OpenRefine can detect similarities between these strings, and propose to cluster them all together with the same text.

5.1 Cleaning data columns, using clustering

  • Inspecting the alcgp column
    • Click the down arrow next to alcgp and choose Facet \(\rightarrow\) Text Facet.

To make space, you may have to close the agegp facet

  1. How many categories are there in the alcgp column?
  2. Do you notice any problems with the data? If so, what are they, and how would you fix them?
  • Fixing the alcgp column
    • Click on the Cluster button in the alcgp facet
    • Note how OpenRefine has grouped similar terms. Has it grouped all the similar terms correctly?
    • Check the Merge? box for any terms that should be merged, and click Merge Selected & Re-Cluster
    • The default settings do not correctly cluster the two entries for 0-39 gm/day, so we need to change the settings.
    • Choose the nearest neighbour and levenshtein clustering method, with radius 10 and 4 block chars.
    • Check the Merge? box for the 0-39 gm/day entries
    • Click Merge Selected & Close
  1. How many categories are there in the alcgp column?
  2. Is the data in the alcgp column clean yet?

Notice that only one category has a set of units. For consistency (which is always desirable) could choose here to have units in all categories or none. My preference would be to have no units indicated here, but instead to place units in the header, or in an accompanying metadata file.

  • Removing the units
    • Hover the mouse over 0-39g/day and click edit
    • Delete the units and click Apply

6 Cleaning tobgp

There are several issues with data in tobgp, and we can fix them all using faceting and clustering.

  • Create a text facet for tobgp
  • Cluster the categories using nearest neighbour, ppm with radius 5 and 4 block chars`
  • Select the rows that should be merged (20-29 and 20-29 a day) and click Merge Selected & Re-Cluster
  • Cluster the categories using nearest neighbour, ppm with radius 5 and 3 block chars`
  • Select the rows that should be merged (0-9g/day, 0-9g per day, 0-9g/DAY and 0-9g/Day) and click Merge Selected & Close

Again, notice that only one category has a set of units. As above, we prefer to place units in the header or an accompanying metadata file.

  • Removing the units
    • Hover the mouse over 0-9g/day and click edit
    • Delete the units and click `Apply
  1. How many categories are there in the alcgp column?
  2. Is the data in the tobgp column clean yet?

7 Reproducible cleaning

OpenRefine does not modify your data directly. Instead it builds an ordered series of instructions - a script - in a “language” called JSON1 that describes the processes applied to the input dataset. This can be “replayed” live within OpenRefine using the Undo/Redo options in the left panel.

  • Viewing the cleaning script
    • Click on Undo/Redo
  1. How many steps are applied in the script?
  2. Which columns were modified?
  3. How many cells were modified each time?
OpenRefine Undo/Redo view.

Figure 7.1: OpenRefine Undo/Redo view.

You can move forwards and backwards in the cleaning process to see the state of the data each time, by clicking on the steps in the left panel.

7.1 Exporting the cleaning script

The instructions in your script can be saved to a file. This can be applied to other files in OpenRefine. If you had several files with similar kinds of errors, you could open the saved file in OpenRefine and apply it to the dataset, so that each file is cleaned in exactly the same way.

You can attach the file to a publication, report or thesis, as supplementary information, so that all processes in analysing the data for a project are open.

  • Viewing your cleaning script
    • In the left panel, click Extract...
    • A new window appears. The script is in the right-hand panel
    • Individual steps can be included or excluded by selecting or deselecting the check boxes to the left

8 Saving and exporting your project and cleaned data

OpenRefine saves your project as you go. If you close and reopen OpenRefine, your projects will be listed and you can resume any of the projects at any time.

8.1 Exporting your project

Your project includes the raw data and cleaning steps that are applied. You can share this with collaborators, or attach to a paper as supplementary information.

  • Exporting the project
    • Click the Export button (top right) and select OpenRefine project archive to file
    • The standard file download dialogue will appear for your browser
    • The project will be saved as a .tar.gz file

The .tar.gz file format may not be familiar to you, but it is a widely-used format for compressing and archiving data.

  • On macOS and Linux, the extraction tools are available by default. On macOS, double-clicking the file in Finder should extract its contents.
  • On Windows, you may need to use additional software such as 7-zip or WinZip.

8.2 Exporting your cleaned data

As OpenRefine does not modify your raw data, you will need to export your cleaned data, to take it forward for analysis with other tools.

  • Exporting cleaned data
    • Click the Export button (top right) and select the file type you want to use to export the data
    • The standard file download dialogue will appear for your browser

Be sure to save your data as both tab-separated format and comma-separated format, as you will need both formats for the second part of the workshop.

Remember that using widely-supported, plain text, open non-proprietary formats like .tsv/.tab (tab-separated value) or .csv (comma-separated value) makes it easier for yourself and others to use your data.

When you export your cleaned data and project files, remember to follow good data management principles for naming and saving these files. You should be able to come back to your computer in six months and still be able to tell which file contains the cleaned data.