2 The Grammar of Graphics
It may take a couple of minutes to install the necessary packages in WebR
- please be patient.
In this section you will meet ggplot2
, a very popular and powerful data visualisation package in R
. We will also learn about the “grammar of graphics,” a way of thinking about constructing data visualisations that also is the source of the gg
in ggplot2
.
The grammar of graphics is a set of concepts relevant to visualising data. It separates the data from the way the data is represented, which may be a new approach to you - it’s certainly different to the way Excel and Graphpad Prism choose to control visualisation. It is, though, highly effective for generating powerful visualisations.
2.1 Load in the data
In order to visualise data using ggplot2
in R
, we have to load it. We have provided the data for this walkthrough in the file gapminder.csv
, which can be loaded with the command:
<- read_csv("gapminder.csv", col_types="fnnfnn") gapminder
The column types are, in order: factor, number, number, factor, number, number and can be expressed as "fnnfnn"
for the col_types
option in read_csv()
.
Using the col_types
option means that R
“knows” what the data in each column should be, and report problems if they arise.
2.2 Inspect the data
It is always good practice to visually examine your dataset, to become familiar with your data and check for obvious problems. With the data loaded in, let’s take a look at it and see what it contains, using the commands:
str(gapminder)
summary(gapminder)
Here, str
is an abbreviation of “structure” and the command will show us the structure of the dataset
2.3 A basic scatterplot
You can use ggplot2
in a similar way to other tools (like Excel and Prism) to produce “canned” visualisations like scatterplots. Here, we need to specify which data go on the x and y axes, and the dataframe we’re using, and some options like which variable to use to colour datapoints. For example, the code:
qplot(lifeExp, gdpPercap, data=gapminder, colour=continent)
will generate a (quick, hence qplot()
) scatterplot of GDP per capita (gdpPercap
, y-axis) against life expectancy (lifeExp
, x-axis) from the corresponding columns in the gapminder
dataframe. Each different continent
in the dataset will be plotted with its own colour.
Although the plot looks to be nice enough, it’s still constraining and doesn’t really express the power of ggplot2
.
2.4 The grammar of graphics
To really get to grips with ggplot2
we need to talk about what makes up a plot. We can use the example of the scatterplot you’ve just generated (reproduced in Figure 2.1)
2.4.1 Aesthetics
In this plot, every row in the table (we call each row an observation) is represented by a single point. How that point is rendered in the plot is determined by its aesthetics:
- The x position aesthetic determines where the point is rendered in relation to the side of the graph
- The y position aesthetic determines where the point is rendered in relation to the bottom of the graph
- The size of the point is an aesthetic
- The shape of the point is an aesthetic
- The colour of the point is an aesthetic
- The transparency of the point is an aesthetic
In ggplot2
these aesthetics can be constant (like shape in Figure 2.1), or mapped to - under the control of - variables in the dataset (like colour in Figure 2.1).
We can generate many different kinds of plot from the same data just by changing the aesthetics
2.4.2 geom
s
The other key thing to know about in ggplot2
is the idea of a geometry, or geom
. The geom
determines the kind of data representation for the plot. There are many kinds of geom
in ggplot2
, and combinations of aesthetic and geom can reproduce several kinds of plot (as in Figure 2.2).
All ggplot2
graphs are combinations of geom
s and aesthetics.
We can demonstrate this in the WebR
cell below. Use the following code to create a scatterplot:
<- ggplot(data=gapminder, aes(x=lifeExp, y=gdpPercap, color=continent))
p + geom_point() p
The first line uses the ggplot()
command to create the plot from the dataset (data=gapminder
) with a set of aesthetics (aes(x=lifeExp, y=gdpPercap, color=continent)
). This plot is put into a variable (p
) for convenience so that we can experiment in the workshop.
Just defining the graph isn’t enough to show it. To do this we also need a geom
Because the plot is in the variable p
, we can “add” a geom
to it. We’ll add geom_point()
so that it gives us a scatterplot, like that in Figure 2.1.
Replace the geom_point()
in the WebR
code with geom_line()
. Does this give a better representation of the dataset?
2.4.3 Layers
Without even thinking about it, we’ve been using the concept of layers.
All ggplot2
plots are built from layers.
All layers have two components:
- data that will be shown, and aesthetics for showing the data
- a
geom
the defines the type of plot on that layer
The layers in the plot you created above are shown in Figure 2.3.
The ggplot()
layer is the “base layer” and contains data
and aesthetics (aes()
). These are inherited by the other layers in the plot, such as the geom_point()
layer (unless overridden).
ggplot()
layer is the “base layer” and contains data
and aesthetics (aes()
). These are inherited by the other layers in the plot, such as the orange geom_point()
layer (unless overridden).
Using the WebR
cell below, create a plot showing how life expectancy (lifeExp
) changes as a function of time (year
), as a scatterplot.
- Can you generate the graph by just changing variables in the code you’ve already written?
Use the same code as above, but put year
on the x
aesthetic/axis, and lifeExp
on the y
aesthetic/axis.
<- ggplot(data=gapminder, aes(x=year, y=lifeExp, color=continent))
p + geom_point() p
We can build up additional layers of geom
s to create more complex plots, and to apply aesthetics specifically to different layers. For example, we can plot GDP per capita (gdpPercap
) against life expectancy (lifeExp
) as a line plot (geom_line()
), grouping points by country
, but coloured by continent. The code for this is:
<- ggplot(data=gapminder, aes(x=lifeExp, y=gdpPercap, color=continent))
p + geom_line(aes(group=country)) p
The layer structure of this plot is shown in Figure 2.4 (and the result in Figure 2.5)
ggplot()
layer is the “base layer” and contains data
and aesthetics (aes()
). These are inherited by the other layers in the plot, such as the orange geom_line()
layer, but an additional aesthetic group
ing points by country is added.
We can go on to add a top-level geom_point()
scatterplot with each datapoint shown as a semitransparent (alpha=0.4
) point by changing this code - in the same WebR
cell to:
<- ggplot(data=gapminder, aes(x=lifeExp, y=gdpPercap, color=continent))
p + geom_line(aes(group=country)) + geom_point(alpha=0.4) p
This gives the plot in Figure 2.6, which has the layer structure in Figure 2.7.
ggplot()
layer is the “base layer” and contains data
and aesthetics (aes()
). These are inherited by the other layers in the plot, such as the orange geom_line()
and green geom_point()
layers, but additional aesthetics group
ing points by country and changing the transparency of points is added.
Using the WebR
cell below, create a plot showing how life expectancy (lifeExp
) changes as a function of time (year
), coloured by continent, with two layers:
- a line plot, grouping points by country
- a scatterplot showing each data point, with 35% opacity
- Can you generate the graph by just changing variables in the code you’ve already written?
Use the same code as above, but put year
on the x
aesthetic/axis, and lifeExp
on the y
aesthetic/axis, and changing the alpha
parameter to 0.35
.
<- ggplot(data=gapminder, aes(x=year, y=lifeExp, color=continent))
p + geom_line(aes(group=country)) + geom_point(alpha=0.35) p
2.5 Multi-panel figures
All the plots we have made so far have been single-panel figures. With large datasets, this can get a bit messy and obscure the story you want to tell with data.
ggplot2
has a layer called facet_wrap()
which allows us to split data into panels on the basis of a variable. For example, the code below allows us to split the plot in Figure 2.6 into separate panels for each continent:
<- ggplot(data=gapminder, aes(x=lifeExp, y=gdpPercap, color=continent))
p + geom_line(aes(group=country)) + geom_point(alpha=0.4) + facet_wrap(~continent) p
Using the WebR
cell below, create a plot showing how life expectancy (lifeExp
) changes as a function of time (year
), coloured by continent, with two layers:
- a line plot, grouping points by country
- a scatterplot showing each data point, with 35% opacity
- Can you generate the graph by just changing variables in the code you’ve already written?
Use the same code as above, but put year
on the x
aesthetic/axis, and lifeExp
on the y
aesthetic/axis, and changing the alpha
parameter to 0.35
, and adding a facet_wrap()
layer:
<- ggplot(data=gapminder, aes(x=year, y=lifeExp, color=continent))
p + geom_line(aes(group=country)) + geom_point(alpha=0.35) + facet_wrap(~continent) p
2.6 What’s next?
Now that you have learned some of the key features of how to make a ggplot2
figure, we’re ready to think about the experiment that’s providing us with data for this workshop, which you’ll meet in the next section.