2D Scatterplots – BM432 Data Visualisation Workshop

Relationships between numerical variables are often summarised with scatterplots. You have already seen a number of scatterplots, and you will be aware that the convention is to plot the explanatory (also known as independent) variable on the x-axis, and the response (or dependent) variable on the y-axis. But when exploring datasets we may not always know which parameters or variables control the response, and we may be looking at, or mining, the data for correlations. In those cases the convention is often relaxed.

In Figure 1 we plot `Sepal.Length` against `Sepal.Width` for the `iris` dataset, to see if there is any obvious relationship.

Figure 1: Scatterplot of sepal length against sepal width for the `iris` dataset.

1 Regression lines

At first sight there appears to be no obvious relationship between the two variables. We can overlay a linear regression to see if there is a linear relationship between them, as in Figure 2.

Figure 2: Scatterplot of sepal length against sepal width for the `iris` dataset, showing linear regression between the variables.

The correlation coefficient is -0.12, suggesting at first glance that there is no strong linear relationship. The ribbon representing a confidence interval for the fit can accommodate a horizontal line, which is consistent with there being no linear relationship between sepal width and length. But if there is any kind of relationship, it looks like it would be negative (sepal length falling with increased sepal width).
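The two quantities discussed above, the correlation coefficient and the slope of the fitted line, can be computed directly from their definitions. The following is a minimal sketch in plain Python, not the code used to produce the figures in this workshop. The function names `pearson_r` and `least_squares` are illustrative, and the `width` and `length` values are made-up toy numbers rather than measurements from `iris`.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def least_squares(xs, ys):
    """Slope and intercept of the ordinary least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical toy measurements (NOT values from the iris dataset).
width = [2.0, 2.5, 3.0, 3.5, 4.0]
length = [6.1, 5.4, 6.3, 5.2, 5.9]

r = pearson_r(width, length)
slope, intercept = least_squares(width, length)
print(f"r = {r:.2f}, slope = {slope:.2f}, intercept = {intercept:.2f}")
```

The sign of the slope always matches the sign of the correlation coefficient, so a weak negative correlation goes hand in hand with a gently downward-sloping fitted line.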

2 Visualising multiple categories

However, there are several categories in the dataset, and careful use of colour, as in Figure 3, can aid greatly with interpretation:

Figure 3: Scatterplot of sepal length against sepal width for the `iris` dataset, coloured by species.

Now, in Figure 4, it looks as though there are at least two different positive linear relationships, each with a reasonable correlation coefficient, rather than no linear relationship or a negative one: one relationship for I. setosa, and one for the other two species.

Caution

The category or class structure of a dataset can disguise true relationships, or even mislead the reader as to the magnitude or direction of a relationship. Careful analysis and visualisation are required to present relationships accurately.

Figure 4: Scatterplot of sepal length against sepal width for the `iris` dataset, coloured by species, with linear regression on two separate groups.

By exploring the scatterplot using colour, we can propose that I. versicolor and I. virginica could be considered separately from I. setosa, in terms of the relationship between sepal width and length.
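The effect described in the caution above, where pooling groups hides or reverses the relationship within each group, can be checked numerically by computing the correlation once for all points and once per category. This is a minimal sketch in plain Python under stated assumptions: the group labels `A` and `B` and all the numbers are hypothetical toy data chosen to make the effect visible, not values from `iris`, and `pearson_r` is an illustrative helper.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical toy data as (group label, x, y); NOT taken from iris.
points = [
    ("A", 1.0, 5.0), ("A", 2.0, 6.2), ("A", 3.0, 6.9),
    ("B", 4.0, 1.1), ("B", 5.0, 1.9), ("B", 6.0, 3.2),
]

# Correlation ignoring the group structure.
pooled_r = pearson_r([x for _, x, _ in points], [y for _, _, y in points])

# Split the points by group label, then correlate within each group.
by_group = {}
for label, x, y in points:
    xs, ys = by_group.setdefault(label, ([], []))
    xs.append(x)
    ys.append(y)

group_r = {label: pearson_r(xs, ys) for label, (xs, ys) in by_group.items()}

print(f"pooled r = {pooled_r:.2f}")
for label, value in sorted(group_r.items()):
    print(f"group {label}: r = {value:.2f}")
```

In this toy example the pooled correlation is negative while the correlation within each group is positive, which is the same kind of reversal that colouring by species reveals in the scatterplot.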