Excel
cannot produce a 1D scatterplot easilyExcel
needs data in wide formatExcel
cannot easily convert long format data to wide format dataExcel
can fit a basic trendline that is a linear regression modelanalysis Toolpak
must be usedExcel
plots usually need manual attention to conform to good visualisation guidelinesTo work through these examples, you will need the following on your own computer:
Excel
(you have a licence for this via the university)Excel
does not have an option to produce a 1D scatterplot, natively. We therefore produce only a boxplot (also known as a box-and-whisker plot)
You cannot easily convert the data in toothgrowth.tab
to the format you need, in order to create a boxplot, using only Excel
. The original file presents data in long format (one row per observation). To construct a boxplot in Excel
, the data must be in wide format. This requires either a lot of function programming in Excel
, or the use of the Power Query
plugin, both of which are beyond the scope of this workshop.
We are instead providing a pre-prepared dataset: toothgrowth_wide.tab
so that you can construct a boxplot in Excel
:
R
Being a statistical programming package, R
has several tools for reshaping datasets. We used the tidyr
package function pivot_wider
and some dplyr
convenience functions to modify the data, then save it, using the commands below:
widedata = ToothGrowth %>%
group_by(supp, dose) %>%
mutate(row=row_number()) %>%
pivot_wider(names_from=c(supp, dose), values_from=len)
write.table(widedata, "data/toothgrowth_wide.tab",
sep="\t", row.names=FALSE)
Our goal is to show how the measured guinea pig tooth growth varies by combination of supplement and supplement dosage. We could approach this in any of several ways, but here we want to treat each supplement as a category or factor, and each dosage as a category or factor. We’d like to see the distribution of measured tooth lengths conditioned on these explanatory variables.
What we’re looking for is a visual representation of the variation in the dataset, for each combination of supplement and dosage. Normally, a 1D scatterplot is a good way to visualise the raw data, with a boxplot/box-and-whisker plot to represent summary statistics. Sadly, Excel
cannot represent a 1D scatterplot easily, so we only show the box-and-whisker plot.
Blank
Workbook` in ExcelFile
\(\rightarrow\) Import
to begin the file import wizard.toothgrowth_wide.tab
fileAt this point the data are in six columns, identified as a combination of supplement (VC
or OJ
) and supplement dose (one of 0.5
, 1
, or 2
mg/day). Each measurement is tooth length from a single guinea pig.
To create the box-and-whisker plot, select the complete dataset (including headers), then click on Insert
and select the statistical plots button from the ribbon, then the Box-and-Whisker
option in the dropdown menu.
This will generate a new graph in your worksheet.
The graph is, by default, not informatively-labelled.
Add Chart Element
\(\rightarrow\) Legend
\(\rightarrow\) Right
menu to add a legend to the graph.This looks OK, and the interpretation is straightforward enough:
OJ
appears to be associated with greater tooth lengthVC_1
databut we could do better with the colours. As the supplement choice is categorical data, we can colour OJ
and VC
consistently, but differently: orange for OJ
and green for VC
. We can then use different intensities (or saturation) to imply variation in dosage, so that more intense colours imply stronger dosage.
To change the colour of a box/whisker combination, click on the box and choose Format Data Series...
then change the colours as appropriate.
With the Format Data Series
side panel open, you can colour each of the other data series, to obtain a final graph.
The Prestige
dataset is described in the introduction notebook. It describes a set of occupations - one per row (observations) - with variables describing properties of each occupation, such as percentage of women, the “prestige” of the occupation, and the average number of years in education of a person in that occupation.
Here, we will use Excel
to load in the dataset and model the relationship between prestige and years in education, using a linear relationship. We’d like to overlay a line describing the relationship, with some statistical information about goodness of fit and the inferred parameters of the model (gradient and intercept).
Blank
Workbook` in ExcelFile
\(\rightarrow\) Import
to begin the file import wizard.prestige.tab
fileThe first column has no header
To import the data correctly, you must select both the Tab
and Space
delimiter options.
Select the education
and prestige
columns, then click on Insert
\(\rightarrow\) ’Scatterplot`. This will create a scatterplot in the worksheet.
To add a regression model, left-click on any of the datapoints in the graph, then right-click to bring up the context menu, and select Add trendline...
The Format Trendline
side panel will appear. Ensure that the Linear
model is selected, and that Display Equation on chart
and Display R-squared value on chart
are checked (this adds the regression equation to the plot).
The model fit looks plausible, and has \(r^2\) value 0.7228, which indicates a moderately strong positive correlation (around 30% of the variation in the value of job prestige cannot be explained solely by the average time spent in education). But the graph needs some attention before it is ready.
We need to add meaningful \(x\)- and \(y\)-axis labels, and to improve the title. To rewrite the title, click on the graph element and write in a new title. To obtain axis title elements that you can edit in the same way, click on Add Chart Element
\(\rightarrow\) Axis Titles
\(\rightarrow\) More Axis Title Options...
. Then edit the axis labels accordingly.
The fitted equation implies that each additional year of education is associated with an increased apparent occupational prestige of around 5.3 units. The model also implies that zero years of education results in an apparent occupational prestige of -10.7 units (which you may or may not think is appropriate in the context of the model).
Excel
gives us an equation for a linear regression, but it does not provide us with any information about uncertainty or confidence. What range of lines are plausible fits to the data? What are the reasonable ranges of estimates of intercept and gradient?
We can’t know from this analysis, but Excel
365 does provide an Analysis ToolPak
which can give you that information.
Analysis Toolpak
output
Select the data (not the headers) in columns B
(education
) and E
(prestige
). Then click on Data
\(\rightarrow\) Data Analysis
\(\rightarrow\) Regression
to get the dialogue box.
Clicking OK
gives a new worksheet with more information about the linear regression than can be obtained through Excel
’s usual plots.
This worksheet shows us that the 95% confidence interval for the intercept runs between -18 and -3.4, and the confidence interval for the gradient is in the range 4.7 to 6.
The worksheet also presents useful diagnostic plots of residuals and Normal probability that help us assess whether the regression was appropriate. Such graphs are always worth generating to see if they indicate problems with the data or the model (though a discussion is beyond the scope of this workshop).