Numerical Data

Good practice for presenting numerical data

It is usually best to show the original numerical data as completely as possible, which may mean showing each datapoint or observation individually. If that is not possible or desirable, there are several options for summarising the distribution of the data instead.

In general, your preference should be:

  1. Show all datapoints
  2. Show a summary or model of the data (e.g. a CDF, KDE, or violin plot)
  3. Show a representation of summary statistics (e.g. a box plot)
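
To make the difference between the options above concrete, here is a minimal sketch using only the Python standard library. The `observations` values are made up for illustration, and the helper names `ecdf` and `five_number_summary` are hypothetical, not part of any plotting package. The empirical CDF keeps every datapoint (each observation appears as one step), while the five-number summary that a box plot draws reduces the same data to five values.

```python
import statistics


def ecdf(values):
    """Return the sorted values and their cumulative proportions (empirical CDF)."""
    xs = sorted(values)
    n = len(xs)
    # The i-th smallest value (1-based) has cumulative proportion i / n.
    return xs, [(i + 1) / n for i in range(n)]


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), the values a box plot is built from."""
    # method="inclusive" treats the data as the whole population of interest;
    # other quartile conventions give slightly different Q1 and Q3.
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)


# Made-up observations, for illustration only.
observations = [2.1, 3.4, 3.5, 3.9, 4.2, 4.4, 5.0, 7.8]

xs, ps = ecdf(observations)
summary = five_number_summary(observations)

for x, p in zip(xs, ps):
    print(f"{x:4.1f}  {p:.3f}")   # one ECDF step per observation
print(summary)                    # five numbers stand in for all eight observations
```

Plotting `xs` against `ps` as a step function gives the CDF view; the tuple in `summary` is all that a box plot would retain.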

Caution: Avoid bar charts

I strongly discourage the use of bar charts to represent numerical data distributions. The reasons for this are reported in the relevant literature:

  1. People interpret bar graphs as comparisons of discrete datapoints (Zacks and Tversky 1999)
  2. Identical bar graphs can be generated from multiple very different data sets (Weissgerber et al. 2015), potentially disguising significant differences in the data
  3. Bar graphs are intended to represent categorical variables, not numerical, paired or nonindependent data (by definition)
  4. Presentations of error bars imply that data are normally distributed, which may not be true (Weissgerber et al. 2015) and can lead to incorrect statistical inferences
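
The second point can be illustrated with a small sketch. The three datasets below are invented for this example (they are not taken from Weissgerber et al.), and `bar_chart_summary` is a hypothetical helper that returns the two numbers a typical bar chart with error bars displays: the mean and the standard error of the mean. Only the Python standard library is assumed.

```python
import math
import statistics


def bar_chart_summary(values):
    """Return (mean, standard error of the mean): what a bar with error bars shows."""
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return mean, sem


# Three invented datasets with very different shapes.
datasets = {
    "bimodal": [3, 3, 3, 3, 7, 7, 7, 7],   # two separate clusters
    "outlier": [4, 4, 4, 4, 4, 4, 6, 10],  # tight group plus one extreme value
    "spread":  [2, 3, 4, 5, 5, 5, 8, 8],   # values scattered across the range
}

summaries = {name: bar_chart_summary(vals) for name, vals in datasets.items()}

for name, (mean, sem) in summaries.items():
    median = statistics.median(datasets[name])
    print(f"{name:8s} mean={mean:.2f} sem={sem:.3f} median={median}")
```

All three datasets have the same mean and the same standard error, so their bars and error bars would be drawn identically, even though one is bimodal and another is dominated by a single outlier. Showing the individual datapoints would reveal the difference immediately.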

References

Weissgerber, Tracey L, Natasa M Milic, Stacey J Winham, and Vesna D Garovic. 2015. “Beyond Bar and Line Graphs: Time for a New Data Presentation Paradigm.” PLoS Biol. 13 (4): e1002128. https://doi.org/10.1371/journal.pbio.1002128.
Zacks, J, and B Tversky. 1999. “Bars and Lines: A Study of Graphic Communication.” Mem. Cognit. 27 (6): 1073–79. https://doi.org/10.3758/bf03201236.