R provides more flexible (and better) options for visualisation than tools like Excel.
The amount of data being generated and processed in biological sciences is always increasing, seemingly at an ever-faster pace. As a scientist you will have to learn to work with this data without being overwhelmed by its sheer volume and complexity. Data visualisation is a tool that allows us to summarise, understand, and explore our - sometimes incredibly large - datasets intuitively. In this workshop, we will focus on common approaches by which statistical data can and should be summarised visually, with an emphasis on best practice.
Data visualisation is a broad topic that touches on elements of colour theory, psychology, graphic design, and several other fields that we sadly don’t have time to explore here. But we will provide links to help you learn more about this area in your own time.
Data visualisation is rarely just a simple rendering of the data. Fundamentally, data visualisation is a method of communication, and it is worth thinking of it as a form of storytelling, similar to writing. As scientists, we may have specific goals for data exploration, such as highlighting patterns that may indicate correlation between variables: telling ourselves a story from the data. When writing reports and papers, we want to share what we see in the data with our reader. We want to tell them a story in pictures, to share our mental model with them.
There are strong similarities to storytelling with words. For instance, there is a grammar to visual information (called gestalt principles), just as there is a grammar to written language. As humans we have evolved to use this grammar to recognise patterns in what we see, including visual images on the page, and interpret them as having meaning. We can use those natural, innate tendencies to recognise and interpret patterns to make understanding and interpretation of data almost effortless for the reader.
It is also possible to use these natural ways that we interpret data to mislead people - accidentally or otherwise. We will see some examples of this.
If you know that it is possible to be misled by visual representation, you will be better able to spot bad graphics in papers.
Here’s a story as it might appear in a presentation or a paper, told with a single figure:
We measured chlorosis in 30 plants after the application of either AvrX or AvrY. The figure demonstrates that the induction of chlorosis was stronger for AvrY than for AvrX.
As it stands we don’t know what the chlorosis measurements are, which means we can’t interpret them as “strong” or “weak”, and we can’t tell how different the values are for AvrX and AvrY. We can improve this plot by indicating what quantities are meant by the bars in the chart, adding values on the y-axis:
Now we can read off the chart that AvrX has a value of 9.5 and AvrY has a value of about 9.75. If we knew more about chlorosis, we could now tell if this was a strong or a weak effect.
But there is something not right about this graphic. The blue bar looks to be twice the size of the green bar, yet the difference between the values is only \(\frac{0.25}{9.5} \approx 2.6\%\)! It would be fairer to start the y-axis at zero, so that the area of each bar is proportional to the value it represents.
People do, in fact, interpret bar charts by area, not height. Humans have a limited repertoire of elementary perceptual tasks that we use to extract data and information from graphical representation. Graphs may use these well for communication, or misuse them to fool the reader. A number of studies empirically test people’s ability to correctly “read” graph types, and a couple of these are listed below.
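The size of the distortion can be worked through numerically. The sketch below uses the two values read from the chart (9.5 and 9.75) and a hypothetical y-axis baseline of 9.25. The baseline is an assumption, chosen only because it reproduces the 2:1 appearance described above; the actual baseline of the figure is not stated.

```r
# Values read off the chart in the text
avrx <- 9.5
avry <- 9.75

# Relative difference between the two observed values
rel_diff <- (avry - avrx) / avrx

# Apparent ratio of bar heights when the y-axis starts above zero.
# ASSUMPTION: a baseline of 9.25 is hypothetical; the source does not give one.
baseline <- 9.25
ratio_truncated <- (avry - baseline) / (avrx - baseline)

# Ratio of bar heights when the y-axis starts at zero
ratio_zero <- avry / avrx

cat(sprintf("Relative difference: %.1f%%\n", 100 * rel_diff))
cat(sprintf("Bar height ratio, truncated axis: %.2f\n", ratio_truncated))
cat(sprintf("Bar height ratio, axis from zero: %.2f\n", ratio_zero))
```

With the axis starting at zero, the ratio of bar heights matches the ratio of the values themselves, which is the fairer comparison.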
Cleveland, W.S. & McGill, R. (1984) “Graphical Perception: Theory, Experimentation, and Application to the Development of Graphical Methods” J. Am. Stat. Ass. 79 531-554 https://doi.org/10.2307/2288400
Heer, J. & Bostock, M. (2010) “Crowdsourcing Graphical Perception: Using Mechanical Turk to Assess Visualization Design” CHI ’10 pp.203-212 https://doi.org/10.1145/1753326.1753357
When we read the literature, it is tempting to skim over papers and “read the figures,” trusting that they tell an unbiased and fair story about the data. That is not always true. Figures should be read at least as carefully as the text.
There is one more thing we need to make proper sense of the figure. At the moment, the values for AvrX and AvrY are presented as single values: one location for each gene. But we saw at the start of the story that 30 plants were measured:
We measured chlorosis in 30 plants after the application of either AvrX or AvrY. The figure demonstrates that the induction of chlorosis was stronger for AvrY than for AvrX.
It’s unlikely that all of the measurements for each gene were identical. We would assume that there has been some spread of data around that location. This spread is commonly represented with error bars.
As it happens, there is more that we might want to know, and you can explore this in the interactive session at the link below:
It is usually best to show original numerical data as completely as possible. This may mean showing each datapoint or observation individually. If that is not possible or desirable, then there are several options for showing the distribution of the data as a summary of the data (such as a histogram, a cumulative distribution function (CDF) or kernel density estimate (KDE)), or a representation of summary statistics of the data (e.g. a box plot/box-and-whisker plot). In general, your preference should be:
I strongly discourage the use of bar charts to represent numerical data distributions. The reasons for this are reported in the literature:
Zacks, J. & Tversky, B. (1999) “Bars and Lines: A Study of Graphic Communication” Mem. Cognit. https://doi.org/10.3758/bf03201236
Weissgerber, T. et al. (2015) “Beyond Bar and Line Graphs: Time for a New Data Presentation Paradigm” PLoS Biol. https://doi.org/10.1371/journal.pbio.1002128
In a univariate scatterplot, datapoints are plotted against a single numerical axis, with the other axis showing the identity of the dataset. This allows all values in a dataset to be shown, which is the most transparent way of presenting the data to the reader.
Where there is potential for overlap of datapoints, their locations can be “jittered” - moved slightly in a horizontal and/or vertical direction - in order to make the number of datapoints at each value clearer. It is however not always possible to interpret the density of datapoints very well, so this representation is best combined with a summary of the distribution that conveys this information also.
A univariate scatterplot for the iris sepal length data in Figure 2.1 shows that the distributions of this variable overlap between species, and that there is a possible outlier in the I. virginica data. The nature of the data is also evident: the datapoints are regularly-spaced, and we can see where there are relatively few or many datapoints.
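A jittered univariate scatterplot of this kind might be produced as in the sketch below. It assumes the ggplot2 package is installed and uses R's built-in iris dataset; the jitter width, transparency, and axis labels are illustrative choices, not necessarily those used for Figure 2.1.

```r
library(ggplot2)

# Univariate scatterplot of sepal length, one column of points per species.
# Jitter is applied horizontally only (height = 0), so the plotted
# sepal lengths on the y-axis remain the measured values.
p <- ggplot(iris, aes(x = Species, y = Sepal.Length, colour = Species)) +
  geom_jitter(width = 0.2, height = 0, alpha = 0.7) +
  labs(x = "Species", y = "Sepal length (cm)") +
  theme(legend.position = "none")

print(p)
```

Restricting the jitter to the categorical axis is a deliberate choice: moving points along the numerical axis would change the values the reader sees.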
Boxplots represent distributions of continuous variables. The box represents the first, second and third quartiles of the dataset, and each whisker extends to the most extreme datapoint that lies within \(1.5 \times \textrm{IQR}\) of the box, where \(\textrm{IQR}\) is the interquartile range of the data. The red dot indicates an outlier. Here, that means any value in the dataset that lies outwith the whiskers, i.e. more than \(1.5 \times \textrm{IQR}\) below the first quartile or above the third quartile.
Quartiles are obtained by taking each value in the data and sorting them from smallest to largest value, in an ordered list. The quartiles are then the values at one quarter (25%, first quartile), halfway (50%, second quartile) and three-quarters (75%, third quartile) along the list. The second quartile is the same thing as the median.
The interquartile range (IQR) is the difference between the first and third quartiles, i.e. \(\textrm{third quartile} - \textrm{first quartile}\).
Outliers are datapoints that appear to be quite different from the rest of the observations for that variable. This is often taken to be a warning sign, and outliers need to be handled with care. They can arise by a number of different means:
Outliers should not be deleted from the dataset unless they are caused by experimental or recording error.
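The quantities that define a boxplot can be computed directly. The sketch below does this for the I. virginica sepal lengths in R's built-in iris dataset. Note that `quantile()` interpolates between neighbouring values by default, so its results can differ slightly from simply counting along the sorted list, and that the variable names are our own.

```r
# Sepal lengths for I. virginica from R's built-in iris dataset
x <- iris$Sepal.Length[iris$Species == "virginica"]

# First, second (median) and third quartiles
q <- quantile(x, probs = c(0.25, 0.5, 0.75))

# Interquartile range: third quartile minus first quartile
iqr <- unname(q[3] - q[1])

# Fences at 1.5 x IQR beyond the box; values outside them are flagged
lower_fence <- unname(q[1] - 1.5 * iqr)
upper_fence <- unname(q[3] + 1.5 * iqr)
outliers <- x[x < lower_fence | x > upper_fence]

print(q)
print(iqr)
print(outliers)
```

Flagging a value in this way only marks it for closer inspection; it is not in itself a reason to remove it.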
Histograms divide a dataset into bins - ranges of a set size. The count of datapoints in each bin is shown as a bar. This can be thought of as a bar chart, where the bin is a categorical variable.
A histogram is not the same thing as a bar chart: its x-axis is a binned numerical variable, so the order and width of the bars carry meaning.
Histograms can be very useful, but there is more than one way to visually present histogram data.
In Figure 2.3 the histograms for the three species are presented, and overlaid. This has the advantage that the histogram values (datapoints in each bin) can be read easily from the y-axis, and the shape of the distribution is not distorted for each species. However, there is a disadvantage that each bar may obscure others behind it. An attempt has been made in this figure to reduce this problem with the use of transparency (also known as alpha channel), but the image is still not very clear.
The “neatness” of a histogram is essentially controlled by a single parameter: the bin width. This defines the width of the bars in the histogram. Too small, and each bar has a count of one (also known as a rug plot). Too large, and the shape of the data is masked.
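The effect of bin width can be seen by counting the same data into bins of different sizes. This is a minimal sketch using the sepal lengths in R's built-in iris dataset; the helper `bin_counts()` and the three bin widths are our own illustrative choices.

```r
# Sepal lengths for all 150 flowers in R's built-in iris dataset
x <- iris$Sepal.Length

# Helper (hypothetical name): count datapoints per bin for a given bin width
bin_counts <- function(values, width, from = 4, to = 8) {
  breaks <- seq(from, to, by = width)
  hist(values, breaks = breaks, plot = FALSE)$counts
}

narrow <- bin_counts(x, width = 0.05)  # many bins, many of them empty
medium <- bin_counts(x, width = 0.25)
wide   <- bin_counts(x, width = 2)     # two bins, shape is lost

# Every datapoint falls in exactly one bin, whatever the width
print(c(sum(narrow), sum(medium), sum(wide)))

# The number of bins changes with the bin width
print(c(length(narrow), length(medium), length(wide)))
```

The data are identical in all three cases; only the summary changes, which is why the bin width should be chosen and reported with care.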
Stacked histograms such as the one in 2.4 are frequently used, and resolve an issue in 2.3 where bars may be obscured. However, the “shape” of the distribution for each of the species is now distorted when bars are offset, and it can be difficult to compare how distributions differ, or see the relationships between them.
Figure 2.5 shows a dodged histogram. This attempts to avoid the problem of obscuring data in 2.3 by shifting distributions to the side. This doesn’t distort the shape of the distribution, but it can make it difficult to register comparisons between datasets (here, for species), due to the spacing. That is particularly pronounced when there are multiple distributions to compare.
The best solution for representing histograms of multiple distributions is often a small multiple plot. These do not distort the data, and avoid problems with overlaying or dodging data by representing each dataset separately. To be successful, small multiple plots should share the same axes (to facilitate comparisons) and each subplot should be clearly labelled.
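A small multiple plot of histograms might be sketched as follows, assuming the ggplot2 package is installed. The bin width of 0.2 and the single-column layout are illustrative choices.

```r
library(ggplot2)

# One histogram panel per species, stacked vertically.
# facet_wrap() uses shared (fixed) axis scales by default, which keeps
# the panels directly comparable; each panel is labelled with its species.
p <- ggplot(iris, aes(x = Sepal.Length, fill = Species)) +
  geom_histogram(binwidth = 0.2) +
  facet_wrap(~ Species, ncol = 1) +
  labs(x = "Sepal length (cm)", y = "Count") +
  theme(legend.position = "none")

print(p)
```

Stacking the panels in one column aligns the x-axes, so shifts in location between species can be read by eye.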
Kernel Density Estimate plots, like that in 2.7, smooth histograms so that they can be represented as areas with continuous smooth boundaries. They look neat, and are not as immediately sensitive to choice of bin size as histograms. However, the extent of smoothing is under the control of both the choice of kernel (the smoothing function) and any parameters for the kernel.
KDE plots may imply more or less “shape” (i.e. undulations up and down, implying minor peaks) to a dataset than the data actually contains. However, they are widely used, and form the basis for several other visualisations. As with histograms, when plotting multiple distributions on the same axes it may be necessary to use transparency to avoid obscuring data. Graphs with many datasets rapidly become confusing and hard to follow. Small multiple plots (Figure 2.8) can also be useful here.
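The dependence of a KDE on its parameters can be demonstrated by estimating the same density with different settings. The sketch below uses base R's `density()` on the iris sepal lengths. The bandwidths are arbitrary values chosen for illustration, and `count_peaks()` is our own helper that counts local maxima in the estimated curve.

```r
x <- iris$Sepal.Length

# Count local maxima ("peaks") in a density estimate
count_peaks <- function(d) {
  sum(diff(sign(diff(d$y))) == -2)
}

# Same data, same Gaussian kernel, two different bandwidths (bw).
# The bandwidth values are arbitrary choices for illustration.
kde_narrow <- density(x, bw = 0.02, kernel = "gaussian")
kde_wide   <- density(x, bw = 0.5,  kernel = "gaussian")

cat("Peaks with bw = 0.02:", count_peaks(kde_narrow), "\n")
cat("Peaks with bw = 0.5: ", count_peaks(kde_wide), "\n")

# The choice of kernel is a second source of variation
kde_rect <- density(x, bw = 0.5, kernel = "rectangular")
cat("Peaks with rectangular kernel:", count_peaks(kde_rect), "\n")
```

A very small bandwidth produces a peak at almost every distinct measured value, while a large one smooths everything into a single broad hump; neither is "the" shape of the data.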
The ridgeplot (also known as a joyplot, or stacked density plot) tries to resolve problems due to the presentation of overlapping areas by offsetting the distributions on the y-axis. This improves visualisation for multiple distributions, but data can still be obscured.
Violin plots are a variation of KDE plots. Variable values are presented on the y-axis, and datasets are separated along the x-axis (like a rotated ridgeplot), but the KDE is mirrored right-to-left, producing shapes apparently reminiscent of violin bodies (though people claim that lamps, faces, and other things are more easily seen).
Violin plots avoid the problem of overlapping datasets, and look especially attractive as small multiple plots, but retain the other problems of KDEs (dependence on kernel and parameter choice).
No single visual representation is perfect, and it is often useful to combine representations to pool their advantages, or offset their disadvantages, as with figures 2.11, 2.12, and 2.13, below.
Tools like R and ggplot make these complex figures straightforward to generate. Tools like Excel do not.
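As an illustration of combining representations, the sketch below layers a violin plot, a boxplot, and the individual datapoints in a single figure. It assumes the ggplot2 package is installed; the particular combination and styling are our own choices and are not claimed to match the figures in this notebook.

```r
library(ggplot2)

# Layer three representations of the same data:
#  - violin: KDE outline of each distribution
#  - boxplot: quartile summary (outlier points hidden, since the raw
#    datapoints are drawn in the next layer)
#  - jittered points: every individual observation
p <- ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  geom_violin(aes(fill = Species), alpha = 0.3) +
  geom_boxplot(width = 0.15, outlier.shape = NA) +
  geom_jitter(width = 0.1, height = 0, size = 1, alpha = 0.6) +
  labs(x = "Species", y = "Sepal length (cm)") +
  theme(legend.position = "none")

print(p)
```

Each layer is one extra line of code, so it is cheap to try several combinations and keep the one that communicates best.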
For comparison, we present the usual literature representation of this kind of data: a bar chart with error bars showing standard deviation of each dataset.
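It is worth seeing how little information such a chart carries. The sketch below computes, for the iris sepal lengths, the only values a bar chart with standard deviation error bars displays: one mean and one standard deviation per group. The column names in `summary_table` are our own.

```r
# Mean and standard deviation of sepal length for each iris species:
# the only two numbers per group that a bar chart with SD error bars shows
means <- tapply(iris$Sepal.Length, iris$Species, mean)
sds   <- tapply(iris$Sepal.Length, iris$Species, sd)

summary_table <- data.frame(
  species = names(means),
  mean    = as.numeric(means),
  sd      = as.numeric(sds),
  lower   = as.numeric(means - sds),  # bottom of the error bar
  upper   = as.numeric(means + sds)   # top of the error bar
)

print(summary_table)
```

The 150 original measurements are reduced to six numbers, and nothing about outliers, skew, or the number of observations survives.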
Relationships between numerical data are usually summarised with scatterplots. You have already seen a number of scatterplots, and that the convention is to plot the explanatory variable on the x-axis, and the response variable on the y-axis. But when exploring datasets we may not always know which variables control the response, and we may be looking for correlations. In those cases the convention is often relaxed.
In figure 2.15 we plot Sepal.Length against Sepal.Width for the iris dataset, to see if there is any obvious relationship.
At first sight there appears to be no obvious relationship between the two variables. We can overlay a linear regression to see if there is a linear relationship between them, as in Figure 2.16.
The coefficient of correlation is -0.1175698, suggesting at first glance that there is no linear relationship. However, careful use of colour, as in 2.17, can aid greatly with interpretation of scatterplots:
Now in Figure 2.18 it looks as though, rather than there being no linear relationship, there are two different linear relationships, with reasonable correlation coefficients.
By exploring the scatterplot using colour, we can propose that I. versicolor and I. virginica could be considered separately from I. setosa, in terms of the relationship between sepal width and length.
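The numbers behind this comparison can be reproduced with base R. The sketch below computes the correlation for the pooled data and then separately for I. setosa and for the other two species combined. Treating sepal length as the response in the regression is an assumption made for illustration.

```r
# Correlation across all 150 flowers, ignoring species
r_all <- cor(iris$Sepal.Length, iris$Sepal.Width)

# Linear regression for the pooled data.
# ASSUMPTION: sepal length is treated as the response variable here.
fit_all <- lm(Sepal.Length ~ Sepal.Width, data = iris)

# Split into the two groups suggested by the coloured scatterplot
is_setosa <- iris$Species == "setosa"
r_setosa <- cor(iris$Sepal.Length[is_setosa], iris$Sepal.Width[is_setosa])
r_others <- cor(iris$Sepal.Length[!is_setosa], iris$Sepal.Width[!is_setosa])

print(round(c(all = r_all, setosa = r_setosa, others = r_others), 3))
print(coef(fit_all))
```

The pooled correlation is weakly negative while both within-group correlations are positive, which is why colouring the points by species changes the interpretation so much.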
When we visualise categorical data, we usually mean - implicitly - visualising counts of categories in the data, such as the numbers in: Control and Treatment; None, Weak, Moderate, and Strong; or Europe, North America, South America, Asia, Africa, and Australasia. Tables may well be clearer than visualisations, for smaller datasets. Visual options include bar charts, stacked bar charts and pie charts, though pie charts are rarely a good option to choose.
The message we want to get across for a categorical variable is often declarative (“There were this many examples of each category”) or comparative (“There were this many examples of category A relative to category B”). This kind of data is well-represented by a dot chart (essentially, a univariate scatterplot with a single value) or bar chart. But if the data is proportion or percentage data, and we want to emphasise the proportion of the total, a more natural representation might seem to be the stacked/divided bar chart or pie chart. However, there is no data that can be represented by a bar chart or pie chart that cannot be represented by a dot chart.
Bar charts are a reasonable way to represent the total counts in a category, as in Figure 3.1 which shows the number of passengers on the Titanic, by class.
Dot charts represent all the same information as bar charts, with less ink.
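The two representations can be compared side by side with base R graphics. This sketch assumes that R's built-in `Titanic` table is an acceptable stand-in for the data behind the figures; it counts crew as a fourth "class", so its totals may differ from the passenger list used here.

```r
# ASSUMPTION: R's built-in Titanic table (which includes crew as a "class")
# stands in for the passenger data behind the figures.
class_counts <- margin.table(Titanic, margin = 1)  # sum over Sex, Age, Survived

print(class_counts)

# The same counts as a bar chart...
barplot(class_counts, xlab = "Class", ylab = "Count")

# ...and as a dot chart, with the count axis starting at zero
dotchart(as.numeric(class_counts), labels = names(class_counts),
         xlim = c(0, max(class_counts)), xlab = "Count", pch = 19)
```

Both plots encode exactly the same four numbers; the dot chart simply uses a single mark per category.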
To see the counts of passengers who were in each class, conditioned on sex, we can use a stacked bar chart. This places each category on the same bar.
In Figure 3.3 the y-axis is the total count of passengers, but the representation of proportions in 3.4 is almost identical.
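The step from counts to proportions is a single normalisation, as the sketch below shows. It again assumes R's built-in `Titanic` table (crew included), which may differ from the data behind the figures.

```r
# ASSUMPTION: uses R's built-in Titanic table (crew included), which may
# differ from the passenger list behind the figures.
sex_by_class <- margin.table(Titanic, margin = c(2, 1))  # rows: Sex, cols: Class

# Convert counts to proportions within each class (each column sums to 1)
prop_by_class <- prop.table(sex_by_class, margin = 2)

print(sex_by_class)
print(round(prop_by_class, 2))

# barplot() stacks the rows of a matrix within each bar by default
barplot(sex_by_class, xlab = "Class", ylab = "Count", legend.text = TRUE)
barplot(prop_by_class, xlab = "Class", ylab = "Proportion", legend.text = TRUE)
```

In the proportional version every bar has the same height, so the reader compares the split within each class rather than the class sizes.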
Pie charts are frequently criticised. In my view this is with good reason. Humans find it easier to judge differences between lengths than between areas, and differences between areas than differences between angles. Pie charts use angle and area to represent data; bar charts use area and length; and dot charts use length alone. It is difficult - unless the data is specifically ordered - to rank categories in bar charts and pie charts; it is much easier to do so with dot charts.
Also, pie charts can only represent proportional data, not absolute counts. They are more limited than bar charts or dot charts (which can each represent exactly the same data as the pie chart, and much more), and less easy to understand than the equivalent stacked bar chart. But, for completeness, the class data for the Titanic passenger list is presented in Figure 3.5.
A benefit to pie charts is that people recognise “anchor points” on the circle. We can recognise 25% (90 degrees), 50% (180 degrees) and 75% marks quite well, but it is unusual to be able to distinguish between other angles. This is particularly problematic for small angles.
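To make the angle encoding concrete, the sketch below converts category counts to the slice angles a pie chart would draw, and measures how far each falls from the familiar anchor points. It assumes R's built-in `Titanic` table (crew included) as example data; the variable names are our own.

```r
# ASSUMPTION: class counts come from R's built-in Titanic table (crew included)
class_counts <- margin.table(Titanic, margin = 1)

# Each slice's angle is its share of the total, scaled to 360 degrees
proportions <- prop.table(class_counts)
angles <- 360 * proportions

print(round(angles, 1))

# Distance (in degrees) from each slice's angle to the nearest familiar
# anchor at 90, 180 or 270 degrees
anchors <- c(90, 180, 270)
nearest_gap <- sapply(as.numeric(angles), function(a) min(abs(a - anchors)))
print(round(nearest_gap, 1))
```

Slices that fall far from an anchor are the ones a reader is least likely to judge accurately.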
Graphical relationships representing comparisons of categorical variables aren’t very common. They are conceptually not very difficult, but few tools have provided a ready way to generate suitable figures (scatter plots, bar charts, and pie charts are more readily found). However, tools like GGally in R make complex graphs accessible, and that package provides a figure style that is a grid of rectangles with proportional areas.
In Figure 3.6, the proportion of Titanic survivors by sex is shown, where larger rectangles indicate a larger absolute count (area is proportional to count).
You will see later that this representation is an intuitive graphical version of the data used in a chi-square test.
One advantage of this representation is that we can use the stacked bar representation (see above) to subdivide each of the rectangular areas by a further categorical variable, as in Figure 3.7 where we divide the blocks representing passengers conditioned on sex and survival, according to class.
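A similar area-proportional figure can be sketched with base R's `mosaicplot()`, which we use here in place of the GGally figure style described above. The example assumes R's built-in `Titanic` table (crew included), and also runs the chi-square test on the same table of counts.

```r
# ASSUMPTION: uses R's built-in Titanic table (crew included)
sex_by_survival <- margin.table(Titanic, margin = c(2, 4))  # Sex x Survived

print(sex_by_survival)

# Base R's mosaicplot() draws rectangles whose areas are proportional to
# the counts; it is used here in place of the GGally figure in the text
mosaicplot(sex_by_survival, main = "Survival by sex", color = TRUE)

# The chi-square test works on exactly this table of counts
test <- chisq.test(sex_by_survival)
print(test)

# Subdivide further by class: a three-way table of Sex x Survived x Class
mosaicplot(margin.table(Titanic, margin = c(2, 4, 1)),
           main = "Survival by sex and class", color = TRUE)
```

The plot and the test start from the same contingency table, so the figure gives a visual check on what the test is comparing.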
Tools like GGally and ggplot2 in R provide pairs plot graphics that combine the best-practice versions of the above representations to give a quick overview of a dataset. You met examples of this in Notebook 02 (“What is a Dataset?”). The iris and titanic datasets are summarised below in figures 4.1 and 4.2.
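A basic pairs plot can be sketched without extra packages. The example below uses base R's `pairs()` as a simpler stand-in for the GGally version (the `ggpairs()` function in that package produces a richer figure), and prints the matching table of correlation coefficients.

```r
# Base R's pairs() is used here as a stand-in for the GGally pairs plot
# described in the text: one scatterplot for every pair of numerical
# variables, coloured by species.
numeric_cols <- iris[, 1:4]
pairs(numeric_cols, col = as.integer(iris$Species), pch = 19,
      main = "iris: pairwise scatterplots")

# Numerical counterpart: the matrix of pairwise correlation coefficients
cor_matrix <- cor(numeric_cols)
print(round(cor_matrix, 2))
```

Reading the plot and the correlation matrix together helps to catch cases where a single coefficient hides group structure.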
Bang Wong’s series of “Points of View” articles in Nature Methods is an excellent and clearly-written entry point covering many areas of data visualisation.