• 1D scatterplots and boxplots are good visualisations of the distribution of a dataset.
    • Excel cannot produce a 1D scatterplot easily
    • To produce a boxplot, Excel needs data in wide format
    • Excel cannot easily convert long format data to wide format data
  • Linear regression is a method for fitting a linear model to a relationship between variables
    • Excel can fit a basic trendline that is a linear regression model
    • To obtain useful diagnostic statistics, and confidence intervals, the analysis Toolpak must be used
  • Excel plots usually need manual attention to conform to good visualisation guidelines

1 Setup

To work through these examples, you will need the following on your own computer:

  1. The datasets (see this link for a description)
  2. Microsoft Excel (you have a licence for this via the university)

1.1 Boxplot

Excel does not have an option to produce a 1D scatterplot, natively. We therefore produce only a boxplot (also known as a box-and-whisker plot)

You cannot easily convert the data in toothgrowth.tab to the format you need, in order to create a boxplot, using only Excel. The original file presents data in long format (one row per observation). To construct a boxplot in Excel, the data must be in wide format. This requires either a lot of function programming in Excel, or the use of the Power Query plugin, both of which are beyond the scope of this workshop.

We are instead providing a pre-prepared dataset: toothgrowth_wide.tab so that you can construct a boxplot in Excel:

Click here to toggle details of how we modified the data using R

Being a statistical programming package, R has several tools for reshaping datasets. We used the tidyr package function pivot_wider and some dplyr convenience functions to modify the data, then save it, using the commands below:

widedata = ToothGrowth %>%
  group_by(supp, dose) %>%
  mutate(row=row_number()) %>%
  pivot_wider(names_from=c(supp, dose), values_from=len)
write.table(widedata, "data/toothgrowth_wide.tab",
  sep="\t", row.names=FALSE)

Our goal is to show how the measured guinea pig tooth growth varies by combination of supplement and supplement dosage. We could approach this in any of several ways, but here we want to treat each supplement as a category or factor, and each dosage as a category or factor. We’d like to see the distribution of measured tooth lengths conditioned on these explanatory variables.

What we’re looking for is a visual representation of the variation in the dataset, for each combination of supplement and dosage. Normally, a 1D scatterplot is a good way to visualise the raw data, with a boxplot/box-and-whisker plot to represent summary statistics. Sadly, Excel cannot represent a 1D scatterplot easily, so we only show the box-and-whisker plot.

  • Create a new BlankWorkbook` in Excel
Create a blank workbook in `Excel`

Figure 1.1: Create a blank workbook in Excel

Create a blank workbook in `Excel`

Figure 1.2: Create a blank workbook in Excel

  • Use File \(\rightarrow\) Import to begin the file import wizard.
  • Navigate to and select the toothgrowth_wide.tab file
Select data file for import

Figure 1.3: Select data file for import

Select data file for import

Figure 1.4: Select data file for import

  • Using the wizard, identify the file delimiter, set the column datatypes, and import the data into the current worksheet.
Select import options

Figure 1.5: Select import options

Select import options

Figure 1.6: Select import options

Select import options

Figure 1.7: Select import options

Select import options

Figure 1.8: Select import options

Select import options

Figure 1.9: Select import options

At this point the data are in six columns, identified as a combination of supplement (VC or OJ) and supplement dose (one of 0.5, 1, or 2 mg/day). Each measurement is tooth length from a single guinea pig.

To create the box-and-whisker plot, select the complete dataset (including headers), then click on Insert and select the statistical plots button from the ribbon, then the Box-and-Whisker option in the dropdown menu.

Select box-and-whisker plot

Figure 1.10: Select box-and-whisker plot

Select box-and-whisker plot

Figure 1.11: Select box-and-whisker plot

This will generate a new graph in your worksheet.

The graph is, by default, not informatively-labelled.

  • Use the Add Chart Element \(\rightarrow\) Legend \(\rightarrow\) Right menu to add a legend to the graph.
  • Double-click on the chart title to edit it
  • Select and delete the uninformative horizontal axis label
Tidy plot details

Figure 1.12: Tidy plot details

Tidy plot details

Figure 1.13: Tidy plot details

Tidy plot details

Figure 1.14: Tidy plot details

This looks OK, and the interpretation is straightforward enough:

  • Increasing supplement dose looks to be correlated with tooth growth
  • At 2mg/day both supplements are associated with similar tooth growth
  • There appears to be a straightforward relationship between tooth length and dosage where increaseing supplement dosage is associated with increasing tooth length, for both supplements
  • At lower dosages, OJ appears to be associated with greater tooth length
  • The mean (cross) and median (line) values do not coincide, so the data may be skewed/non-normal
  • There is a single outlier in the VC_1 data

but we could do better with the colours. As the supplement choice is categorical data, we can colour OJ and VC consistently, but differently: orange for OJ and green for VC. We can then use different intensities (or saturation) to imply variation in dosage, so that more intense colours imply stronger dosage.

To change the colour of a box/whisker combination, click on the box and choose Format Data Series... then change the colours as appropriate.

Recolour data series

Figure 1.15: Recolour data series

Recolour data series

Figure 1.16: Recolour data series

Recolour data series

Figure 1.17: Recolour data series

With the Format Data Series side panel open, you can colour each of the other data series, to obtain a final graph.

Final graph

Figure 1.18: Final graph

1.2 Linear regression

The Prestige dataset is described in the introduction notebook. It describes a set of occupations - one per row (observations) - with variables describing properties of each occupation, such as percentage of women, the “prestige” of the occupation, and the average number of years in education of a person in that occupation.

Here, we will use Excel to load in the dataset and model the relationship between prestige and years in education, using a linear relationship. We’d like to overlay a line describing the relationship, with some statistical information about goodness of fit and the inferred parameters of the model (gradient and intercept).

  • Create a new BlankWorkbook` in Excel
Create a blank workbook in `Excel`

Figure 1.19: Create a blank workbook in Excel

Create a blank workbook in `Excel`

Figure 1.20: Create a blank workbook in Excel

  • Use File \(\rightarrow\) Import to begin the file import wizard.
  • Navigate to and select the prestige.tab file
Select data file for import

Figure 1.21: Select data file for import

Select data file for import

Figure 1.22: Select data file for import

  • Using the wizard, identify the file delimiter, set the column datatypes, and import the data into the current worksheet.

The first column has no header

To import the data correctly, you must select both the Tab and Space delimiter options.

Select import options

Figure 1.23: Select import options

Select import options

Figure 1.24: Select import options

Select import options

Figure 1.25: Select import options

Select import options

Figure 1.26: Select import options

Select import options

Figure 1.27: Select import options

Select the education and prestige columns, then click on Insert \(\rightarrow\) ’Scatterplot`. This will create a scatterplot in the worksheet.

Create scatterplot

Figure 1.28: Create scatterplot

To add a regression model, left-click on any of the datapoints in the graph, then right-click to bring up the context menu, and select Add trendline... The Format Trendline side panel will appear. Ensure that the Linear model is selected, and that Display Equation on chart and Display R-squared value on chart are checked (this adds the regression equation to the plot).

Create trendline

Figure 1.29: Create trendline

Create trendline

Figure 1.30: Create trendline

Create trendline

Figure 1.31: Create trendline

The model fit looks plausible, and has \(r^2\) value 0.7228, which indicates a moderately strong positive correlation (around 30% of the variation in the value of job prestige cannot be explained solely by the average time spent in education). But the graph needs some attention before it is ready.

We need to add meaningful \(x\)- and \(y\)-axis labels, and to improve the title. To rewrite the title, click on the graph element and write in a new title. To obtain axis title elements that you can edit in the same way, click on Add Chart Element \(\rightarrow\) Axis Titles \(\rightarrow\) More Axis Title Options.... Then edit the axis labels accordingly.

Cleaning up the graph

Figure 1.32: Cleaning up the graph

Cleaning up the graph

Figure 1.33: Cleaning up the graph

The fitted equation implies that each additional year of education is associated with an increased apparent occupational prestige of around 5.3 units. The model also implies that zero years of education results in an apparent occupational prestige of -10.7 units (which you may or may not think is appropriate in the context of the model).

Excel gives us an equation for a linear regression, but it does not provide us with any information about uncertainty or confidence. What range of lines are plausible fits to the data? What are the reasonable ranges of estimates of intercept and gradient?

We can’t know from this analysis, but Excel 365 does provide an Analysis ToolPak which can give you that information.

Click to toggle information about the Analysis Toolpak output

Select the data (not the headers) in columns B (education) and E (prestige). Then click on Data \(\rightarrow\) Data Analysis \(\rightarrow\) Regression to get the dialogue box.

Regression dialogue box

Figure 1.34: Regression dialogue box

Clicking OK gives a new worksheet with more information about the linear regression than can be obtained through Excel’s usual plots.

Regression output

Figure 1.35: Regression output

This worksheet shows us that the 95% confidence interval for the intercept runs between -18 and -3.4, and the confidence interval for the gradient is in the range 4.7 to 6.

The worksheet also presents useful diagnostic plots of residuals and Normal probability that help us assess whether the regression was appropriate. Such graphs are always worth generating to see if they indicate problems with the data or the model (though a discussion is beyond the scope of this workshop).