R
has many options for data visualisation, but for this workshop we will use only base graphics and ggplot2
ggplot2
graphics are extremely powerful and produce attractive professional graphics, but can take some getting used toTo work through these examples, you will need the following on your own computer:
R
and RStudio
Our goal is to show how the measured guinea pig tooth growth varies by combination of supplement and supplement dosage. We could approach this in any of several ways, but here we want to treat each supplement as a category or factor, and each dosage as a category or factor. We’d like to see the distribution of measured tooth lengths conditioned on these explanatory variables.
What we’re looking for is a visual representation of the variation in the dataset, for each combination of supplement and dosage. A 1D scatterplot is a good way to visualise the raw data, and a boxplot/box-and-whisker plot is a good way to represent summary statistics.
To work in a reproducible and replicable way, we will create a script to manage our data visualisation.
Start a new RStudio
session, and use the Files
panel to navigate to the location of the data for this workshop.
Click on More
and select Set As Working Directory
. A command will be executed in the console window, to set the working directory.
Create a new R
script by clicking on File
\(\rightarrow\) New File
\(\rightarrow\) R Script
. This will open a new, empty R
script in the upper left panel.
Enter the following comment and code into the R
script. Save the script with a suitable name (e.g. boxplot.R
), and select the code and click Run
to load the ToothGrowth
dataset into a variable called toothgrowth
.
# Load the ToothGrowth dataset
# There are three columns:
# - len: length of tooth (cm)
# - supp: supplement (VC or OJ)
# - dose: supplement dosage (mg/day)
toothgrowth = read.table("toothgrowth.tab",
header=TRUE,
sep="\t",
stringsAsFactors=TRUE)
R
command
read.table()
is a function that reads data in from a file, in tabular format. The first argument to the function is the name of the input fileheader=TRUE
tells the function to read the first line of the file as a headersep="\t"
tells the function to interpret the tab character as a column separatorstringsAsFactors=TRUE
tells the function to read data that is a “string” as a qualitative data categoryR
has built-in graphics capability called base graphics. These are available as functions in R
without the need to load any other libraries. Although they are powerful and quick, they don’t always give the most attractive results.
To use R
’s base graphics for generation of a boxplot, enter the following comment and code into the R
script and run it.
# Use R base graphics to produce a boxplot
boxplot(len ~ supp + dose,
data=toothgrowth,
xlab="Supplement and dosage",
ylab="Tooth length",
main="Tooth length by supplement and dosage")
R
command
boxplot()
is a function that creates a boxplotlen ~ supp + dose
tells the function which data to plot on the \(y\)- and \(x\)-axes. Read this as <YDATA> ~ <XDATA1> + <XDATA2>
; using this expression tells R
to group the length data by both supplement and dosedata=toothgrowth
tells the function that len
, supp
and dose
come from the toothgrowth
dataframexlab=
and ylab=
tell the function what to write as \(x\)- and \(y\)-axis labelsmain=
tells the function what the title of the graph should beThe combinations of supplement and dosage are out of order in the boxplot. This can be fixed, but takes a little effort.
R
’s base graphics have a stripchart()
function, which works in a similar way to boxplot()
for producing 1D scatterplots. To see it in action, add the code below to your script and run it.
# Use R base graphics to produce a stripchart/1D scatterplot
stripchart(len ~ supp + dose,
data=toothgrowth,
method="jitter",
pch=19,
frame=FALSE,
vertical=TRUE,
xlab="Supplement and dosage",
ylab="Tooth length",
main="Tooth length by supplement and dosage")
We only needed to swap the stripchart()
function call for the boxplot()
function to get a workable graph. The other options rotate the graph and improve its appearance.
R
command
The features here that differ from the boxplot are:
stripchart()
is a function that creates a stripchart/1D scatterplotmethod="jitter"
moves datapoints left and right a bit, so they don’t overlappch=19
changes the symbol for datapoints to be filled circlesframe=FALSE
turns off the axis framevertical=TRUE
plots the tooth lengths vertically, instead of horizontallyThe combinations of supplement and dosage are out of order in the stripchart, too. This can again be fixed, but takes a little effort.
The ggplot2
library provides functions for generating high-quality, publication-ready figures using the Grammar of Graphics. The syntax for creating graphs takes a bit of time to learn, but the results are worth it.
To use a ggplot2
approach here, combining both a boxplot and 1D scatterplot, add the code below to your script, and run it.
# Use ggplot2 to overlay a stripchart on a boxplot
# First, import the ggplot2 library
library(ggplot2)
# Ensure that dosage is recorded as a category/factor
toothgrowth$dose = as.factor(toothgrowth$dose)
# Plot the graph
ggplot(toothgrowth,
aes(x=dose, y=len, fill=supp)) +
geom_boxplot() +
geom_jitter(position=position_jitterdodge())
R
commands
library(ggplot2)
loads the ggplot2
library, so we can use it
as.factor(toothgrowth$dose)
converts the numeric values in the dose
column of the dataset into categories, so we can use them to group the datasets in ggplot2
(otherwise, ggplot2
thinks they’re continuous values)
ggplot()
is a function that sets up the data for our plot; the first argument is the dataset we’re using
aes(x=dose, y=len, fill=supp)
tells ggplot()
that we want to use the dose
data on the \(x\)-axis, the len
data on the \(y\) axis, and to colour (fill) the boxes by the category data in the supp
column
geom_boxplot()
draws a boxplot
geom_jitter()
draws jittered datapoints
position=position_jitterdodge()
tells geom_jitter()
to align the points with the boxesThis way of producing a graph is fast, economical, and powerful. We have accomplished in four lines of code something that we couldn’t even achieve in Excel
or Minitab
. It also allows us to experiment by making minor changes to the code to modify the visualisation. For instance, we could add a line to the code to change the category colours:
ggplot(toothgrowth,
aes(x=dose, y=len, fill=supp)) +
geom_boxplot() +
geom_jitter(position=position_jitterdodge()) +
scale_fill_manual(values=c("#999999", "#E69F00"))
R
command
The feature here that differs from the plot above is:
scale_fill_manual()
is a function that defines a colour scale
OJ
and VC
: one is grey, and the other yellow.The Prestige
dataset described in the introduction notebook represents a set of occupations - one per row (observations) - with variables describing properties of each occupation, such as percentage of women, the “prestige” of the occupation, and the average number of years in education of a person in that occupation.
In R
(like Minitab
) we can model the relationship between prestige and years in education, using a linear relationship, without plotting it - and we can use graphical tools to overlay a regression curve within a plot, that describes the relationship, with some statistical information about goodness of fit and the inferred parameters of the model (gradient and intercept).
Create a new R
script by clicking on File
\(\rightarrow\) New File
\(\rightarrow\) R Script
. This will open a new, empty R
script in the upper left panel.
Enter the following comment and code into the R
script. Save the script with a suitable name (e.g. linreg.R
), and select the code and click Run
to load the Prestige
dataset into a variable called prestige
.
# Load the Prestige dataset
prestige = read.table("prestige.tab",
header=TRUE,
sep="\t",
stringsAsFactors=TRUE)
R
command
read.table()
is a function that reads data in from a file, in tabular format. The first argument to the function is the name of the input fileheader=TRUE
tells the function to read the first line of the file as a headersep="\t"
tells the function to interpret the tab character as a column separatorstringsAsFactors=TRUE
tells the function to read data that is a “string” as a qualitative data categoryThe first step in the analysis to conduct the linear regression. To do this in R
, use the function lm()
to regress the prestige
variable onto the education
variable with the code below:
# Carry out linear regression on the dataset, with prestige as
# the dependent variable, and education as the explanatory
# variable
model = lm(prestige ~ education, data=prestige)
R
command
lm()
function carries out a linear regressionprestige ~ education
tells the function to use education
as the explanatory variable, and prestige
as the dependent variabledata=prestige
tells the function to use the dataframe prestige
as the source for the education
and prestige
columns in the regressionThis appears to run silently, but in the upper right panel you will see a new variable called model
has been populated. This contains the output of the linear regression, including the intercept and gradient (the coefficients
), and diagnostic information.
The diagnostic information about the regression can be plotted using R
’s base plot()
function. To use it enter the following code in your script and run it.
# Show diagnostic graphics for the regression model
plot(model)
This will ask you to hit the <Return>
key to list through diagnostic plots in the lower right panel.
This is inconvenient, so we can modify our code in-place to put all four graphs together, as follows:
# Show diagnostic graphics for the regression model
par(mfrow=c(2,2))
plot(model)
par(mfrow=c(1,1))
Running this code now gives all four diagnostic plots together.
It may seem odd to you that, when we plot()
the linear regression output we don’t see the overlaid regression curve. This reflect’s R
’s usage as a statistical programming tool. In reality, we tend to fit much more complex models than simple linear regression, and overlaying the fitted curve onto data isn’t generally as informative or interpretable about the quality of the model fit as these diagnostic plots.
To plot the fitted regression curve over the data using base graphics, we can use the code below:
# Plot fitted regression curve on the data
plot(prestige ~ education, data=prestige)
lines(sort(prestige$education),
fitted(model)[order(prestige$education)],
col="red")
So far, we have not seen an explicit numerical representation of the fitted model parameters: intercept and gradient. They are present, in the summary of the model
variable in the top right panel (the values are -10.73
and 5.36
as we found with Excel
and Minitab
). We can access them directly using the console (lower left panel): type the command below, and hit the <Return>
key:
model
This returns the information:
Call:
lm(formula = prestige ~ education, data = prestige)
Coefficients:
(Intercept) education
-10.732 5.361
Reminding us of the nature of the model (prestige ~ education
) and giving us the fitted parameter values.
Here R
calls the intercept (Intercept)
but does not refer to a gradient. Instead, it associates the gradient value with the education
variable. This reinforces the meaning that each unit (year) increase in the education
variable is associated with a predicted 5.361 increase in prestige
score.
Alternatively, enter the following R
code in the console and hit <Return>
:
summary(model)
This gives us more information about the model fit:
Call:
lm(formula = prestige ~ education, data = prestige)
Residuals:
Min 1Q Median 3Q Max
-26.0397 -6.5228 0.6611 6.7430 18.1636
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -10.732 3.677 -2.919 0.00434 **
education 5.361 0.332 16.148 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 9.103 on 100 degrees of freedom
Multiple R-squared: 0.7228, Adjusted R-squared: 0.72
F-statistic: 260.8 on 1 and 100 DF, p-value: < 2.2e-16
Again, it reminds us of the model formula (prestige ~ education
), but gives us some more information, including the parameter estimates (again -10.732
, and 5.361
) but also the \(r^2\) value (0.7228
, as before), and some other measures of significance.
The probability estimates in the ANOVA table (here the values 0.00434 **
and < 2e-16 ***
) are estimates of the probability that the true value of the parameter is zero.
It is possible that you may have a different null hypothesis in mind. For instance, that the intercept is equal to -10, or that the gradient has the value 10. You would need to perform a different statistical test for those.
We can add arbitrary text to our regression plot, including the regression parameters and \(r^2\) value. Because these are results we’ve obtained in R
, we can refer to them with variable names, rather than having to type them directly. That way, if the data (or fit) changes, the values are still correct.
R
> model$coefficients[1]
(Intercept)
-10.73198
> model$coefficients[2]
education
5.360878
> summary(model)$r.squared
[1] 0.7228007
To add the equation to the plot in base graphics, we can add the code below to our script, and run it:
# Add text annotation to plot
text(x=6.2, y=70, pos=4,
labels=paste("y =",
round(model$coefficients[2], 2),
"x +",
round(model$coefficients[1], 2),
"; r^2 = ",
round(summary(model)$r.squared, 2)))
R
command
text()
function places text on the current plot
x=
sets the location of the text on the \(x\)-axisy=
sets the location of the text on the \(y\)-axispos=4
makes the text start to the right of the (\(x\), \(y\)) point specified in x=
, y=
labels=
sets the text to be shownpaste()
“sticks together” the text that follows into a single string
round()
rounds a value to the specified number of significant figuresggplot2
The procedure for generating an annotated linear regression plot using ggplot2
is simpler.
We can start by producing a scatterplot of prestige
against education
with the following code:
# Import ggplot for graphics
library(ggplot2)
# Generate an annotated linear regression plot in ggplot2
ggplot(prestige, aes(x=education, y=prestige)) +
geom_point()
R
command
library(ggplot2)
imports the ggplot2
graphics libraryggplot()
function sets up the data that will be used in the plot. We use the prestige
data, with education
as the explanatory variable, and prestige
as the dependent variable.geom_point()
function adds a scatterplot layer to the plot, and displays the dataWe can add a regression plot to this, using the geom_smooth()
function from ggplot
:
# Generate an annotated linear regression plot in ggplot2
ggplot(prestige, aes(x=education, y=prestige)) +
geom_point() +
geom_smooth(method="lm")
R
command
geom_smooth()
function adds a smoothing function to summarise the dataset. This can take any of many forms, including a linear regression curve.
method="lm"
specifies that the smoothing curve we want is a linear regressionBy default, the smoothed curve includes a ribbon that represents the 95% confidence interval of the fitted curve.
We can annotate the graph with the fitted equation, and its \(r^2\) value using the ggpubr
function stat_regline_equation()
:
# Import ggplot/ggpubr for graphics
library(ggplot)
library(ggpubr)
# Generate an annotated linear regression plot in ggplot2
ggplot(prestige, aes(x=education, y=prestige)) +
geom_point() +
geom_smooth(method="lm") +
stat_regline_equation(label.x = 3, label.y = 75,
aes(label = ..rr.label..)) +
stat_regline_equation(label.x = 3, label.y = 7,
aes(label = ..eq.label..))
R
command
library(ggpubr)
imports the ggpubr
library - a set of extensions to ggplot2
that makes some tasks, like adding the equation of a regression line, simplerstat_regline_equation()
is a function that adds a regression label to the graph
label.x
and label.y
set the co-ordinates of the regression labelaes(label = ..rr.label..)
represents the equation showing the \(r^2\) (rr
) valueaes(label = ..eq.label..)
represents the equation showing the linear equation (eq
)Using R
and ggplot
/ggpubr
, we have been able - in only four lines of code - to produce a linear regression plot with overlaid equation and \(r^2\) value.
The code is reusable and reproducible, and able to be shared in a report, thesis, or as supplementary information in a publication.