Histograms

Summarising continuous data distributions with histograms

The values in a dataset span a range from the smallest to the largest value. A histogram divides that range into bins: contiguous intervals, usually of equal width. The count of datapoints falling in each bin is shown as a bar. (A histogram can be thought of as resembling a bar chart in which each bin is treated as a category representing a range of values.)
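The binning step described above can be sketched in a few lines of plain Python. This is only an illustration: the function name `histogram_counts`, the choice to start the first bin at the smallest value, and the made-up input values are all assumptions, not part of the original figures.

```python
import math

def histogram_counts(values, bin_width):
    """Count how many values fall into each fixed-width bin.

    Hypothetical helper: bins start at the smallest value, and the
    largest value is placed in the last bin rather than a new one.
    """
    lo = min(values)
    n_bins = max(1, math.ceil((max(values) - lo) / bin_width))
    counts = [0] * n_bins
    for v in values:
        index = min(int((v - lo) / bin_width), n_bins - 1)
        counts[index] += 1
    return counts

# Made-up measurements (not the iris data).
values = [1.0, 1.5, 2.0, 2.5, 2.5, 3.0, 4.0]
counts = histogram_counts(values, bin_width=1.0)
print(counts)  # one count per bin; these are the bar heights
```

Each entry in `counts` is the height of one bar, and the counts always sum to the number of datapoints.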

Caution: A histogram is not the same thing as a bar chart.
  • Histogram: a representation of the distribution of numerical data
  • Bar Chart: heights or lengths associated with categorical variables

Histograms can be very useful, but there is more than one way to visually present histogram data.

1 Histogram

Figure 1: Histograms of sepal length for each species from the iris dataset.

In Figure 1, histograms for each of the three species are presented and overlaid. This has the advantage that the histogram values (datapoints in each bin) can be read easily from the y-axis, and the shape of the distribution is not distorted for any species. However, there is a disadvantage: each bar may obscure others behind it. An attempt has been made in this figure to reduce this problem with the use of transparency (also known as the alpha channel), but the image is still not very clear.

The “neatness” of a histogram is essentially controlled by a single parameter: the bin width. This defines the width of the bars in the histogram. Too small, and most bars have a count of one or zero (so that the plot resembles a rug plot). Too large, and the shape of the data is masked.
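The effect of bin width can be seen by binning the same values three times. The sketch below uses a hypothetical helper, `bin_counts`, and made-up integer values; the three widths are chosen only to illustrate the "too small", "reasonable", and "too large" cases.

```python
import math

def bin_counts(values, bin_width):
    """Hypothetical helper: counts per fixed-width bin, starting at min(values)."""
    lo = min(values)
    n_bins = max(1, math.ceil((max(values) - lo) / bin_width))
    counts = [0] * n_bins
    for v in values:
        counts[min(int((v - lo) / bin_width), n_bins - 1)] += 1
    return counts

# Made-up data spanning a range of 28 units.
values = [10, 12, 13, 17, 21, 22, 23, 24, 30, 38]

narrow = bin_counts(values, 1)    # 28 bins, each holding 0 or 1 values
medium = bin_counts(values, 7)    # [3, 4, 2, 1]
wide = bin_counts(values, 28)     # [10]: a single bar, no shape left

for label, counts in (("narrow", narrow), ("medium", medium), ("wide", wide)):
    print(label, counts)
```

With the narrowest width no bin holds more than one value, while the widest width collapses everything into one bar; only the intermediate width shows where the values cluster.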

2 Stacked Histogram

Figure 2: Stacked histograms of sepal length for each species from the iris dataset.

Stacked histograms such as the one in Figure 2 are frequently used, and resolve the issue in Figure 1 where bars may be obscured. However, the “shape” of the distribution for each species is now distorted, because its bars are offset by the bars stacked beneath them. This can make it difficult to compare how distributions differ, or to see the relationships between them.
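To see where the distortion comes from, it may help to compute the baseline of each bar in a stacked histogram. The following is a minimal sketch, assuming the per-bin counts are already known; the function name `stack_offsets`, the group labels, and the counts are made up for illustration.

```python
def stack_offsets(counts_by_group):
    """Return, for each group, the height at which each of its bars starts.

    Hypothetical helper: groups are stacked in dictionary order, so each
    group's baseline is the running total of the groups drawn before it.
    """
    n_bins = len(next(iter(counts_by_group.values())))
    running = [0] * n_bins
    bottoms = {}
    for name, counts in counts_by_group.items():
        bottoms[name] = list(running)
        running = [r + c for r, c in zip(running, counts)]
    return bottoms

# Made-up per-bin counts for three groups sharing the same three bins.
counts = {"A": [5, 3, 0], "B": [1, 4, 2], "C": [0, 2, 6]}
bottoms = stack_offsets(counts)
for name in counts:
    print(name, "bars start at", bottoms[name])
```

Only the first group sits on a flat baseline. The baselines of the later groups vary from bin to bin, which is why their shapes are harder to read.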

3 Dodged Histogram

Figure 3: Dodged histogram of sepal length for each species from the iris dataset.

Figure 3 shows a dodged histogram. This attempts to avoid the problem of obscured data in Figure 1 by shifting the bars for each distribution to the side. This doesn’t distort the shape of the distribution, but it can make comparisons between datasets (here, species) difficult, because the bars belonging to any one distribution are spaced apart. The problem is particularly pronounced when there are many distributions to compare.
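The spacing issue follows from how dodged bars share each bin. As a rough sketch, assuming each bin is split evenly between the groups, the bar positions could be computed as below; `dodge_positions` and its parameters are hypothetical names, and plotting libraries may add padding that is not modelled here.

```python
def dodge_positions(bin_left, bin_width, n_groups):
    """Return (left_edge, bar_width) for each group's bar within one bin.

    Hypothetical helper: the bin is divided into n_groups equal slots.
    """
    bar_width = bin_width / n_groups
    return [(bin_left + i * bar_width, bar_width) for i in range(n_groups)]

# One bin starting at 4.0 with width 1.5, shared by three groups.
positions = dodge_positions(bin_left=4.0, bin_width=1.5, n_groups=3)
for left, width in positions:
    print(f"bar from {left} to {left + width}")
```

Each additional group makes every bar narrower, and consecutive bars of the same group are always separated by the bars of all the other groups.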

4 Small Multiples

Figure 4: Small multiple histograms of sepal length for each species from the iris dataset.

The best solution for representing histograms of multiple distributions is often a small multiple plot, as in Figure 4. These do not distort the data, and avoid the problems of overlaying or dodging data by representing each dataset separately. To be successful, small multiple plots should share the same axes (to facilitate comparisons) and each subplot should be clearly labelled.
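Sharing axes across panels means computing the bins and the y-axis limit from all groups together, rather than per group. The sketch below shows one way this might be done; the function name `shared_panels`, the group labels, and the data are assumptions made for illustration.

```python
import math

def shared_panels(groups, bin_width):
    """Bin every group on the same bins and find a common y-axis limit.

    Hypothetical helper: bins start at the overall minimum, and the
    overall maximum is placed in the last bin.
    """
    lo = min(min(values) for values in groups.values())
    hi = max(max(values) for values in groups.values())
    n_bins = max(1, math.ceil((hi - lo) / bin_width))
    panels = {}
    for name, values in groups.items():
        counts = [0] * n_bins
        for v in values:
            counts[min(int((v - lo) / bin_width), n_bins - 1)] += 1
        panels[name] = counts
    y_max = max(max(counts) for counts in panels.values())
    return panels, y_max

# Made-up data for three groups (not the iris measurements).
groups = {"A": [1, 2, 2, 3], "B": [4, 5, 5, 5, 6], "C": [7, 8, 9]}
panels, y_max = shared_panels(groups, bin_width=2)
for name, counts in panels.items():
    print(name, counts, "plotted with y-axis up to", y_max)
```

Because every panel uses the same bins and the same `y_max`, a bar of a given height and position means the same thing in each subplot.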