Density Plots – BM432 Data Visualisation Workshop

Kernel Density Estimate (KDE) plots, like that in Figure 1, effectively smooth histograms so that each distribution is drawn as an area with a continuous, smooth boundary. They look neat and, having no bins, are not as immediately sensitive to choice of bin size as histograms. However, the extent of smoothing depends on both the choice of kernel (the mathematical smoothing function) and the kernel's parameters, most importantly its bandwidth, which plays a role similar to that of bin width in a histogram.

Figure 1: Kernel Density Estimate (KDE) plot of sepal length for each species from the `iris` dataset.

Depending on how much smoothing is applied, a KDE plot may imply more or less “shape” (i.e. undulations up and down, suggesting minor peaks) than is actually present in the data. Nevertheless, KDE plots are widely used, and form the basis for several other visualisations.
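To make the effect of smoothing concrete, the following is a minimal sketch of a Gaussian KDE written with only the Python standard library. It is not the code used to produce the figures in this workshop. The sample values are hypothetical (they are not taken from `iris`), and the names `gaussian_kde`, `count_peaks` and `bandwidth` are chosen here for illustration. The bandwidth is the width of the Gaussian bump placed on each data point.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the density at x with a Gaussian kernel."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Each sample contributes one Gaussian bump centred on itself.
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

    return density


def count_peaks(values):
    """Count strict local maxima in a sequence of density values."""
    return sum(
        1
        for i in range(1, len(values) - 1)
        if values[i - 1] < values[i] > values[i + 1]
    )


# Hypothetical measurements (not the iris data): two loose clusters.
samples = [4.6, 4.9, 5.0, 5.1, 5.4, 6.3, 6.5, 6.7, 6.9]
grid = [4.0 + 0.01 * i for i in range(401)]  # evaluation points from 4.0 to 8.0

narrow_kde = gaussian_kde(samples, bandwidth=0.05)
wide_kde = gaussian_kde(samples, bandwidth=1.5)
narrow = [narrow_kde(x) for x in grid]
wide = [wide_kde(x) for x in grid]

print("peaks with narrow bandwidth:", count_peaks(narrow))
print("peaks with wide bandwidth:", count_peaks(wide))
```

With the same nine values, the narrow bandwidth draws a separate bump around almost every data point, while the wide bandwidth merges both clusters into a single broad hump. Neither curve is "the" shape of the data, which is why the bandwidth used should be reported alongside a KDE plot.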

1 Small Multiples

As with histograms, when plotting multiple distributions on the same axes it may be necessary to use transparency to avoid obscuring data. Even so, graphs with many datasets rapidly become confusing and hard to follow. Small multiple plots (Figure 2), which draw each distribution in its own panel on matching axes, can also be useful here.
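The data preparation behind a small multiples layout might be sketched as follows, assuming a Gaussian kernel and hypothetical group values (not `iris`). The function names, the `bandwidth` default and the panel dictionary keys are illustrative choices, not part of any plotting library. The point of the sketch is that every panel reuses the same x grid and the same y limit, so that the panels can be compared by eye.

```python
from statistics import NormalDist


def kde_on_grid(samples, bandwidth, grid):
    """Evaluate a Gaussian KDE of `samples` at each point of `grid`."""
    kernels = [NormalDist(mu=s, sigma=bandwidth) for s in samples]
    return [sum(k.pdf(x) for k in kernels) / len(kernels) for x in grid]


def small_multiple_panels(groups, bandwidth=0.3, points=50):
    """Build one panel description per group, all sharing x grid and y limit."""
    all_values = [v for values in groups.values() for v in values]
    low = min(all_values) - 3 * bandwidth
    high = max(all_values) + 3 * bandwidth
    step = (high - low) / (points - 1)
    grid = [low + i * step for i in range(points)]

    curves = {name: kde_on_grid(values, bandwidth, grid) for name, values in groups.items()}
    # A common upper limit keeps peak heights comparable between panels.
    y_max = max(max(curve) for curve in curves.values())

    return [
        {"title": name, "x": grid, "y": curve, "ylim": (0.0, y_max)}
        for name, curve in curves.items()
    ]


# Hypothetical measurements for three groups.
groups = {
    "group A": [4.9, 5.0, 5.1, 5.4],
    "group B": [5.7, 5.9, 6.1, 6.4],
    "group C": [6.3, 6.5, 6.9, 7.1],
}
panels = small_multiple_panels(groups)
for panel in panels:
    print(panel["title"], "peak density:", round(max(panel["y"]), 3))
```

Each dictionary in `panels` can then be passed to whichever plotting tool is in use, one per subplot. Letting each panel choose its own axis limits would make the individual curves fill their panels, but would hide differences in location and spread between groups.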

2 Stacked Density Plot/Joyplot/Ridgeplot

The ridgeplot (also known as a joyplot, or stacked density plot) tries to resolve the problem of overlapping areas by offsetting each distribution along the y-axis (Figure 3). This improves visualisation for multiple distributions, but data can still be obscured where one ridge rises in front of another, and there may be a false implication of perspective in the image.

Figure 3: Ridgeplot/joyplot/stacked density plot of sepal length for each species from the `iris` dataset.
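The offsetting used in a ridgeplot can be sketched in a few lines. This is an illustration under stated assumptions rather than the method used for Figure 3: the density values are hypothetical and assumed to be already evaluated on a shared x grid, and `ridge_layout`, `intrusions` and the `overlap` parameter are names invented here.

```python
def ridge_layout(curves, overlap=0.5):
    """Offset each density curve on the y-axis.

    curves:  dict mapping a group name to density values on a shared x grid.
    overlap: 0 leaves room for the tallest curve between baselines;
             values closer to 1 pack the ridges more tightly.
    """
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must be in the range [0, 1)")
    tallest = max(max(values) for values in curves.values())
    spacing = tallest * (1.0 - overlap)
    layout = []
    for index, (name, values) in enumerate(curves.items()):
        baseline = index * spacing
        layout.append(
            {"name": name, "baseline": baseline, "top": [baseline + v for v in values]}
        )
    return layout


def intrusions(layout):
    """Count grid points where a ridge rises above the baseline of the next ridge."""
    count = 0
    for lower, upper in zip(layout, layout[1:]):
        count += sum(1 for y in lower["top"] if y > upper["baseline"])
    return count


# Hypothetical density values on a shared five-point grid.
curves = {
    "group A": [0.0, 0.4, 1.0, 0.4, 0.0],
    "group B": [0.0, 0.2, 0.6, 0.8, 0.1],
    "group C": [0.1, 0.3, 0.5, 0.9, 0.3],
}
layout = ridge_layout(curves, overlap=0.5)
print([ridge["baseline"] for ridge in layout])
print("points intruding on the next ridge:", intrusions(layout))
```

Tighter packing saves vertical space but increases the number of places where one ridge intrudes on its neighbour, which is where data can be hidden. Setting `overlap=0.0` removes the intrusions at the cost of a taller figure.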

3 Violin plot

Violin plots (Figure 4) are a variation of KDE plots. Variable values are presented on the y-axis, and datasets are separated along the x-axis (like a rotated ridgeplot), but each KDE is mirrored about a vertical centre line, producing shapes reminiscent of violin bodies (though people claim that lamps, faces, and other things are more easily seen).

Figure 4: Violin plot of sepal length for each species from the `iris` dataset.

Violin plots avoid the problem of overlapping datasets, and look especially attractive side-by-side or as small multiple plots, but retain the other problems of KDEs (dependence on kernel and parameter choice).
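The mirroring that turns a density curve into a violin outline can be sketched as below. The density values are hypothetical and assumed to have been estimated already, and `violin_outline`, `centre` and `half_width` are illustrative names rather than part of any plotting library.

```python
def violin_outline(values, densities, centre, half_width=0.4):
    """Mirror a density curve about the vertical line x = centre.

    values:     variable values, plotted on the y-axis.
    densities:  estimated density at each of those values.
    half_width: horizontal extent given to the highest density.
    Returns (left_x, right_x, y) as three lists of equal length.
    """
    peak = max(densities)
    scaled = [half_width * d / peak for d in densities]
    left = [centre - s for s in scaled]
    right = [centre + s for s in scaled]
    return left, right, list(values)


# Hypothetical density estimate for one group.
values = [4.5, 5.0, 5.5, 6.0, 6.5]
densities = [0.1, 0.6, 0.8, 0.3, 0.05]

# Place two violins at x = 1 and x = 2, as for two groups side by side.
left_1, right_1, y_1 = violin_outline(values, densities, centre=1.0)
left_2, right_2, y_2 = violin_outline(values, densities, centre=2.0)

print("right edge of first violin:", max(right_1))
print("left edge of second violin:", min(left_2))
```

Because each violin is confined to `half_width` either side of its own centre line, violins at unit spacing cannot overlap as long as `half_width` is below 0.5. The outline is still only a rescaled KDE, so its shape changes with kernel and bandwidth in the same way.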