Data Management and Visualisation in R

11-12/5/2021

1. Data Analysis

Data Analysis in the Scientific Cycle

Data-Intensive Research

Science and humanities are increasingly data-driven
- Early-career training has not prepared all researchers for this

Research Workflows

Enable systematic, replicable and reproducible work
- Design principles
  - Best practices for data
- Software development methods
  - Automation of repetitive calculations

Ten great papers for biologists starting out in computational biology https://widdowquinn.github.io/ten_great_papers/

Pipelines and Workflows

Pipeline

What a computer does
- A series of instructions
- Data is piped through programs, and a result emerges

Workflow

What a researcher does
- Exploring data, developing hypotheses, writing code, interpreting results
Outputs include:
- datasets, methods, teaching materials, software, papers, etc.

Explore, Refine, Produce (ERP)

Reproduced from Stoudt et al. (2021)

2. Welcome to R

Learning Objectives

Fundamentals of R and RStudio
Fundamentals of programming (in R)
Data management with the tidyverse
Publication-quality data visualisation with ggplot2
Reporting with RMarkdown

What is `R`?

R is:
- a programming language
- the software that interprets/runs programs written in the R language

Why use R?

free (though commercial support can be bought)
widely used
- sciences, humanities, engineering, statistics, etc.
has many excellent specialised packages for data analysis and visualisation
international, friendly user community

RStudio community support: https://community.rstudio.com/
Stack Overflow: https://stackoverflow.com/

What is `RStudio`?

Please start RStudio

RStudio is an integrated development environment (IDE)

Script/code editor; Project management
Interaction with R (console/‘scratchpad’); Graphics/visualisation/Help

“Why not use `Excel`?”

Excel is good for some things
R is excellent for analysis and reproducibility…

Separates data from analysis
Not point-and-click: every step is explicit and transparent
Easy to share, adapt, reuse, publish analyses with new/modified data (GitHub)
R can be run on supercomputers, with extremely large datasets…

Mike Croucher’s MLPM talk: https://mikecroucher.github.io/MLPM_talk/

`RStudio` overview - INTERACTIVE DEMO

Variables

Variables are like named boxes

An item (object) of data goes in the box (which is called Name)
When we refer to the box (variable) by its name, we really mean what’s in the box

Variables - Interactive Demo

x <- 1 / 40
x

## [1] 0.025

x ^ 2

## [1] 0.000625

log(x)

## [1] -3.688879

name <- "Samia"
name

## [1] "Samia"

Naming Variables

Variable names are documentation

current_temperature = 28.6
subjectID = "GCF_00001236452.1"
GPS_Location = "54N, 36E"

descriptive, but not too long
letters, numbers, underscores, and periods ([a-zA-Z0-9_.])
cannot contain whitespace or start with a number (x2 is allowed, 2x is not)
case sensitive (Weight is not the same as weight)
do not reuse names of built-in functions
Consistent style:
- lower_snake, UPPER_SNAKE, lowerCamelCase, UpperCamelCase

Functions

Functions (log(), sin() etc.) ≈ “canned script”

automate complicated tasks
make code more readable and reusable

Functions usually take arguments (input)
Functions often return values (output)

Some functions are built-in (in base packages, e.g. sqrt(), lm(), plot())
Groups of related functions can be imported as libraries

Getting Help in `R`

INTERACTIVE DEMO

args(fname)            # arguments for fname
?fname                 # help page for fname
help(fname)            # help page for fname
??fname                # any mention of fname
help.search("text")    # any mention of "text"
vignette(fname)        # worked examples for fname
vignette()             # show all available vignettes

Challenge 01 (1min)

What will be the value of each variable after each statement in the following program?

mass <- 47.5
age <- 122
mass <- mass * 2.3
age <- age - 20

mass = 47.5, age = 102
mass = 109.25, age = 102
mass = 47.5, age = 122
mass = 109.25, age = 122

USE CHALLENGE LINK ON ETHERPAD

3. Project Management in `R`

How Projects Tend To Grow

Good Practice

THERE IS NO ONE TRUE WAY (only principles)

Use a single working directory per project/analysis
- easier to move, share, and find files
- use relative paths to locate files
Treat raw data as read-only
- keep in a separate subfolder (data?)
Clean data ready for work programmatically
- keep cleaned/modified data in separate folder (clean_data?)
Consider output generated by analysis to be disposable
- can be regenerated by running analysis/code
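
The principles above can be sketched in R. This is a minimal sketch, assuming the hypothetical folder names data, clean_data and results; it builds the structure inside a temporary directory so that no real project is touched.

```r
# Build an example project skeleton in a temporary location.
# The folder names are illustrative assumptions, not a required layout.
project_dir <- file.path(tempdir(), "my_project")
subfolders <- c("data", "clean_data", "results")

for (folder in subfolders) {
  dir.create(file.path(project_dir, folder), recursive = TRUE,
             showWarnings = FALSE)
}

# Relative paths are built with file.path() rather than typed by hand
raw_file <- file.path("data", "inflammation-01.csv")

list.files(project_dir)
```

Because raw_file is relative to the project root, the same script should keep working when the whole project folder is moved or shared.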

Good Enough Practices in Scientific Computing (2017) Wilson et al. http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005510
A Beginner’s Guide https://martinctc.github.io/blog/rstudio-projects-and-working-directories-a-beginner's-guide/
Structuring R Projects (more advanced) https://chrisvoncsefalvay.com/2018/08/09/structuring-r-projects/

Example Directory Structure

Project Management in `RStudio`

RStudio tries to help you manage your projects
- R Project concept - files and subdirectory structure
- integration with version control
- switching between multiple projects within RStudio
- stores project history

Let’s create a project in RStudio

INTERACTIVE DEMO

RStudio projects: https://support.rstudio.com/hc/en-us/articles/200526207-Using-Projects

Working in `RStudio`

We can write code in several ways in RStudio

At the console (you’ve done this)
In a script
As an interactive notebook
As a markdown file
As a Shiny app

We’re going to create a new dataset and R script.

Putting code in a script makes it easier to modify, share and run

INTERACTIVE DEMO

4. A First Analysis in `RStudio`

Our Task

Patients have been given a new treatment for arthritis
We have measurements of inflammation over a period of days for each patient
We want to produce a preliminary analysis and graphs for this data

Download the file from the following link to your data/ directory, and extract it

https://github.com/swcarpentry/r-novice-inflammation/raw/main/data/r-novice-inflammation-data.zip

(the link is also available on the course Etherpad page)

EtherPad: https://pad.carpentries.org/2021-05-11-strathclyde-online

Loading Data - Interactive Demo

You created data manually earlier, but this is rare
Data are most commonly read in from plain text files

Data files can be inspected in RStudio

data <- read.csv(file = "data/inflammation-01.csv", header = FALSE)
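
If the course data is not to hand, the same call can be tried on a tiny stand-in file. This is a sketch only: example_file and example_data are hypothetical names, and the two rows of numbers are made up.

```r
# Write a tiny headerless CSV to a temporary file (stand-in for the real data)
example_file <- tempfile(fileext = ".csv")
writeLines(c("0,1,3", "0,2,1"), example_file)

# header = FALSE: the first line is data, not column names
example_data <- read.csv(file = example_file, header = FALSE)

dim(example_data)    # number of rows, then number of columns
head(example_data)   # columns are named V1, V2, V3 automatically
```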

Challenge 02 (2min)

Someone gives you a data file that has:

a comma (,) as the decimal point character
semi-colon (;) as the field separator

How would you open it, using read.csv()?

Use the help function and documentation

USE CHALLENGE LINK ON ETHERPAD

Indexing Data

INTERACTIVE DEMO

We use indexing to refer to elements of a matrix
- square brackets: []
- row, then column: [row, column]

data[1, 1]     # First value in dataset
data[30, 20]   # Middle value of dataset

To get a range of values, use the : separator (meaning ‘to’)

data[1:4, 1:4]   # rows 1 to 4; columns 1 to 4

To select a complete row or column, leave it blank

data[5, ]     # row 5
data[, 16]    # column 16
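
The same indexing rules can be checked on something small enough to read by eye. The matrix m below is a made-up example, not the inflammation data.

```r
# A small 3 x 4 matrix, filled column by column
m <- matrix(1:12, nrow = 3, ncol = 4)
#      [,1] [,2] [,3] [,4]
# [1,]    1    4    7   10
# [2,]    2    5    8   11
# [3,]    3    6    9   12

m[2, 3]        # single value: row 2, column 3
m[1:2, 3:4]    # rows 1 to 2; columns 3 to 4
m[3, ]         # all of row 3
m[, 1]         # all of column 1
```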

Summary Functions

INTERACTIVE DEMO

R provides useful functions to summarise data
We can use indexing to get summary information on individual patients and days

max(data)           # largest value in dataset
max(data[2, ])      # largest value for row (patient) 2
min(data[, 7])      # smallest value on column (day) 7
mean(data[, 7])     # mean value on day 7
sd(data[, 7])       # standard deviation of values on day 7

Repetitive Calculations

INTERACTIVE DEMO

Calculating for every patient (or day) this way is tedious

Computers exist to do tedious things for us

R has several ways to automate this process
For example, we can apply a function (mean) to each row in the data:

apply(X = data, MARGIN = 1, FUN = mean)

MARGIN = 1: rows
MARGIN = 2: columns

rowMeans(data)
colMeans(data)
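
A small made-up matrix shows what MARGIN does and that rowMeans() gives the same answer as the apply() call. The names m, row_means_apply and col_max_apply are illustrative only.

```r
m <- matrix(c(1, 2, 3,
              4, 5, 6), nrow = 2, byrow = TRUE)

row_means_apply <- apply(X = m, MARGIN = 1, FUN = mean)   # one value per row
col_max_apply   <- apply(X = m, MARGIN = 2, FUN = max)    # one value per column

# rowMeans() is a shortcut for the common row-mean case
all.equal(row_means_apply, rowMeans(m))
```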

Base Graphics

“The purpose of computing is insight, not numbers.” - Richard Hamming

R has many available graphics packages

graphically beautiful
specific problem domains
‘built-in’ graphics are known as base graphics

Base graphics are powerful tools for visualisation and understanding

Plotting

INTERACTIVE DEMO

avg_inflammation_patient <- apply(data, 1, mean)
plot(avg_inflammation_patient)

max_day_inflammation <- apply(data, 2, max)
plot(max_day_inflammation)

plot(apply(data, 2, min))    # 3 functions in one!

Challenge 03 (5min)

Can you add plots to your script showing:

scatterplot of standard deviation of inflammation across all patients, by day
a histogram of average inflammation across all patients, by day

5. Data Types and Structures in `R`

Learning Objectives

Basic data types in R
Common data structures in R
How to find out the type/structure of R data
Understand how R’s data types and structures relate to your own data

Data Types and Structures in `R`

R is mostly used for data analysis
R has special types and structures to help you work with data
Much of the focus is on tabular data (data frames)

INTERACTIVE DEMO

Understanding data types, their uses, and how they relate to your own data is key to successful analysis with R

(it’s not just about programming)

What Data Types Do You Expect?

What data types would you expect to see?

What examples of data types can you think of from your own experience?

Please write them into the chat

Data Types in `R`

Data types in R are atomic
- All data structures are built from these

logical: TRUE, FALSE
numeric:
- integer: 3L, 2L, 123456L
- double (decimal): 3.0, -23.45, pi
complex: 3+0i, 1+4i
character (text): "a", 'SWC', "This is not a string"
raw: binary data (we won’t cover this)
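
The type of a value can be checked with typeof(). A short sketch; note in particular that a number typed without the L suffix is stored as a double, not an integer.

```r
typeof(TRUE)      # "logical"
typeof(3L)        # "integer" (the L suffix requests an integer)
typeof(3)         # "double"  (numbers without L are stored as doubles)
typeof(1+4i)      # "complex"
typeof("SWC")     # "character"

# Both integer and double count as numeric
is.numeric(3L)
is.numeric(3)
```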

INTERACTIVE DEMO

Challenge 04 (2min)

Create examples of data with the following characteristics:

name: answer, type: logical
name: height, type: numeric
name: dog_name, type: character

For each variable, test that it has the data type you intended

Four Common `R` Data Structures

vector
factor
list
data.frame

INTERACTIVE DEMO

Challenge 05 (5min)

Vectors are atomic: they can contain only a single data type

What data type are the following vectors (xx, yy, zz)?

xx <- c(1.7, "a")
yy <- c(TRUE, 2)
zz <- c("a", TRUE)

Options: logical, integer, numeric, character

USE CHALLENGE LINK ON ETHERPAD

Coercion

Coercion means changing data from one type to another
R will perform implicit coercion on vectors to make them atomic

logical \(\rightarrow\) integer \(\rightarrow\) double \(\rightarrow\) complex \(\rightarrow\) character

If there are formatting problems with your data, you might not have the type you expect when you import into R

Manual coercion with as.<type_name>()
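
Both kinds of coercion can be seen in a few lines. This is a sketch with made-up values; bad is a hypothetical variable name.

```r
# Implicit coercion: mixed inputs are promoted to the most general type
typeof(c(TRUE, 1L))      # logical + integer  -> "integer"
typeof(c(1L, 2.5))       # integer + double   -> "double"
typeof(c(2.5, "a"))      # double + character -> "character"

# Manual coercion with as.<type_name>()
as.numeric("3.14")       # text to number
as.integer(TRUE)         # TRUE becomes 1
as.character(42)         # number to text

# Values that cannot be converted become NA (normally with a warning)
bad <- suppressWarnings(as.numeric("twelve"))
is.na(bad)
```

An unexpected NA or an unexpected character column after import is often the first sign of a formatting problem in the input file.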

INTERACTIVE DEMO

Factors

Data comes as one of two types:

quantitative: e.g. integers or real numbers
(weight <- 17.2; rooms <- 7)
categorical: e.g. ordered or unordered classes
(grade <- "8"; coat <- "brindled")

This kind of distinction is critical in many applications (e.g. statistical modelling)

Factors are special vectors that represent categorical data
- Stored as vectors of labelled integers
- Cannot be treated as strings/text
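
The "labelled integers" idea can be inspected directly. A minimal sketch using a made-up coat vector:

```r
coat <- factor(c("tabby", "black", "tabby", "calico"))

levels(coat)        # the distinct categories, sorted alphabetically by default
nlevels(coat)       # number of categories
as.integer(coat)    # the underlying integer codes
as.character(coat)  # convert back to text when string operations are needed
table(coat)         # counts per category
```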

INTERACTIVE DEMO

Challenge 06 (5min)

Create a new factor, defining control and case experiments, and inspect the result:

f <- factor(c("case", "control", "case", "control", "case"))
str(f)

##  Factor w/ 2 levels "case","control": 1 2 1 2 1

In some statistical analyses in R it is important that the control level is numbered 1

Using the help available to you in RStudio, can you create a factor with the same values, but where the control level is numbered 1?

Lists

lists are like vectors, but can hold any combination of data types
- elements in a list are denoted by [[]] and can be named

INTERACTIVE DEMO

# create a list
l <- list(1, 'a', TRUE, matrix(0, nrow = 2, ncol = 2), f)
l_named <- list(a = "SWC", b = 1:4)
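
A short sketch of the different ways to get at list elements, reusing the l_named list defined above:

```r
l_named <- list(a = "SWC", b = 1:4)

l_named[["a"]]     # extract the element itself, by name
l_named[[2]]       # extract the element itself, by position
l_named$b[3]       # $ is shorthand for [[ ]] with a name; then index the vector
l_named["b"]       # single brackets return a (shorter) list, not the element

names(l_named)
```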

Logical Indexing

We have used indexing, slicing and names to get data by ‘location’

> animal[c(2,4,6)]
[1] "o" "k" "y"
> l_named$b
[1] 1 2 3 4

Logical indexes select data that meets certain criteria

INTERACTIVE DEMO

x <- c(5.4, 6.2, 7.1, 4.8, 7.5)
mask <- c(TRUE, FALSE, TRUE, FALSE, TRUE)
x[mask]
x[x > 7]
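
Conditions can be combined, counted and located. A sketch using the same x vector:

```r
x <- c(5.4, 6.2, 7.1, 4.8, 7.5)

x > 7                # a logical vector, one TRUE/FALSE per element
x[x > 5 & x < 7]     # AND: both conditions must hold
x[x < 5 | x > 7]     # OR: either condition may hold
which(x > 7)         # positions of the matching elements
sum(x > 7)           # TRUE counts as 1, so this counts the matches
```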

6. Dataframes

Let’s look at a `data.frame`

The cats data is a data.frame

INTERACTIVE DEMO

> class(cats)
[1] "data.frame"
> cats
    coat weight likes_string
1 calico    2.1            1
2  black    5.0            0
3  tabby    3.2            1

What is a `data.frame`?

The standard R data structure for storing tabular, rectangular data

A named list of vectors having identical lengths.
- Each column is a vector
- Each vector can be a different data type

This is very much LIKE a spreadsheet, but…
- Columns are constrained to a type
- Columns are all the same length
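
The "list of vectors" description can be checked in R. This sketch rebuilds the cats table from the values printed earlier; how it was created in the demo is an assumption.

```r
cats <- data.frame(coat = c("calico", "black", "tabby"),
                   weight = c(2.1, 5.0, 3.2),
                   likes_string = c(1, 0, 1))

is.list(cats)          # a data.frame is a list underneath
sapply(cats, class)    # each column has its own type
cats$weight            # one column, returned as a vector
mean(cats$weight)      # so vector functions work on columns
```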

Creating a `data.frame`

INTERACTIVE DEMO

# Create a data frame
df <- data.frame(a=c(1,2,3), b=c('eeny', 'meeny', 'miney'),
                 c=c(TRUE, FALSE, TRUE))
summary(df)

##        a            b                 c          
##  Min.   :1.0   Length:3           Mode :logical  
##  1st Qu.:1.5   Class :character   FALSE:1        
##  Median :2.0   Mode  :character   TRUE :2        
##  Mean   :2.0                                     
##  3rd Qu.:2.5                                     
##  Max.   :3.0

Saving a `data.frame` to file

INTERACTIVE DEMO

write.table(df, "data/df_example.tab", sep="\t")

We need to provide

the data.frame
the path to the file being written
a column separator
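
A round trip (write, then read back) is a quick way to check that a file was saved as intended. This is a sketch that writes to a temporary file; out_file and df_back are hypothetical names, and row.names = FALSE is an optional choice that was not used in the demo call.

```r
df <- data.frame(a = c(1, 2, 3), b = c("eeny", "meeny", "miney"))

out_file <- tempfile(fileext = ".tab")   # temporary path for illustration

# row.names = FALSE avoids writing an extra column of row numbers
write.table(df, out_file, sep = "\t", row.names = FALSE)

df_back <- read.table(out_file, sep = "\t", header = TRUE)
df_back
```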

Loading a `data.frame`

INTERACTIVE DEMO

Download data from https://raw.githubusercontent.com/resbaz/r-novice-gapminder-files/master/data/gapminder-FiveYearData.csv
Put the file in the data/ directory

The link is available on the course Etherpad

gapminder <- read.table("data/gapminder-FiveYearData.csv", sep=",", header=TRUE)

R can also read data direct from the internet

url <- paste("https://raw.githubusercontent.com/resbaz/",
             "r-novice-gapminder-files/master/data/",
             "gapminder-FiveYearData.csv", sep = '')
gapminder <- read.table(url, sep=",", header=TRUE)

Investigating `gapminder`

INTERACTIVE DEMO

str(gapminder)              # structure of the data.frame
typeof(gapminder$year)      # data type of a column
length(gapminder)           # length of the data.frame (number of columns)
nrow(gapminder)             # number of rows in data.frame
ncol(gapminder)             # number of columns in data.frame
dim(gapminder)              # number of rows and columns in data.frame
colnames(gapminder)         # column names from data.frame
head(gapminder)             # first few rows of dataframe
summary(gapminder)          # summary of data in data.frame columns

7. Packages

Packages

In R:

a package is a collection (or library) of reusable code
many useful and specialist tools are distributed as packages
over 10,000 packages are available at CRAN
you can distribute your own code as a package

INTERACTIVE DEMO

installed.packages()               # see installed packages
install.packages("packagename")    # install a new package
update.packages()                  # update installed packages
library(packagename)               # import a package for use in your code

CRAN - the Comprehensive R Archive Network: https://cran.r-project.org/

Challenge 07 (5min)

Can you check if the following packages are installed on your system, and install them if necessary?

dplyr
ggplot2
knitr

8. Creating Publication-Quality Graphics

Visualisation is Critical!

The Grammar of Graphics

ggplot2 is part of the Tidyverse, a collection of packages for data science
- ggplot2 is the graphics package

Implements the “Grammar of Graphics”

Separates data from its representation
Helps iteratively update/refine plots
Helps build complex, effective visualisations from simple elements

data
aesthetics
geoms
layers

Tidyverse: https://www.tidyverse.org/

A Basic Scatterplot

You can use ggplot2 like base graphics
- qplot() ≈ plot()

INTERACTIVE DEMO

library(ggplot2)
plot(gapminder$lifeExp, gapminder$gdpPercap, col=as.factor(gapminder$continent))
qplot(lifeExp, gdpPercap, data=gapminder, colour=continent)

What is a Plot? aesthetics

Each observation in the data is a point
A point’s aesthetics determine how it is rendered
- co-ordinates on the image; size; shape; colour
aesthetics can be constant or mapped to variables
Many different plots can be generated from the same data by changing aesthetics

What is a Plot? `geom`s

geom (short for geometry) defines the “type” of representation

If data are drawn as points: scatterplot
If data are drawn as lines: line plot
If data are drawn as bars: bar chart

ggplot2 provides several geom types

What is a Plot? `geom`s

The same data and aesthetics can be shown with different geoms

INTERACTIVE DEMO

# Generate plot of GDP per capita against life expectancy
p <- ggplot(data=gapminder, aes(x=lifeExp, y=gdpPercap, color=continent))
p + geom_point()
p + geom_line()

Challenge 08 (2min)

Can you create another figure in your script showing how life expectancy changes as a function of time, as a scatterplot?

What is a Plot? layers

We’ve just used another “Grammar of Graphics” concept: layers
- ggplot2 plots are built as layers

All layers have two components

data and aesthetics
a geom

Data and aesthetics can be defined in a base ggplot object
- values from the base are inherited by the other layers
- the base can be overridden in other layers

p <- ggplot(data=gapminder, aes(x=lifeExp, y=gdpPercap, colour=continent))
p + geom_point()

p <- ggplot(data=gapminder, aes(x=lifeExp, y=gdpPercap, colour=continent))
p + geom_line(aes(group=country))

INTERACTIVE DEMO

What is a Plot? layers

We can use several layers of geoms to build a plot
- alpha controls opacity for a layer

p <- ggplot(data=gapminder, aes(x=lifeExp, y=gdpPercap, color=continent))
p + geom_line(aes(group=country)) + geom_point(alpha=0.4)

INTERACTIVE DEMO

Challenge 09 (5min)

Can you create another figure in your script showing how life expectancy changes as a function of time, coloured by continent, with two layers:

a line plot, grouping points by country
a scatterplot showing each data point, with 35% opacity

Transformations and `scale`s

Data transformations are handled with scale layers

axis scaling (log scales)
colour scaling (changing palettes)
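
Both kinds of scale can be seen in one small plot. This is a sketch under stated assumptions: the toy data frame and its values are made up for illustration, the colour names are arbitrary, and ggplot2 must already be installed.

```r
library(ggplot2)

# Small made-up data set, for illustration only
toy <- data.frame(
  income = c(500, 2000, 8000, 30000),
  life   = c(45, 58, 70, 80),
  region = c("A", "A", "B", "B")
)

p <- ggplot(data = toy, aes(x = income, y = life, colour = region)) +
  geom_point() +
  scale_x_log10() +                                  # axis scaling
  scale_colour_manual(values = c(A = "darkorange",   # colour scaling
                                 B = "steelblue"))
p
```

The scale layers change how values are drawn, not the values stored in toy.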

INTERACTIVE DEMO

Statistics layers

Some geom layers transform the dataset
- Usually this is a data summary (e.g. smoothing or binning)

INTERACTIVE DEMO

Multi-panel figures

So far all our plots have all data in a single figure
Comparisons can be clearer with multiple panels:
- facets
- “small multiples plots”

Use the facet_wrap() layer to generate grids of plots

INTERACTIVE DEMO

# Compare life expectancy over time by continent
p <- ggplot(data=gapminder, aes(x=year, y=lifeExp, colour=continent,
                                group=country))
p <- p + geom_line() + scale_y_log10()
p + facet_wrap(~continent)

Challenge 10 (5min)

Can you create a scatterplot and contour densities of GDP per capita against population size, with colour filled by continent?

ADVANCED: Transform the x axis to better visualise data spread, and use facets to panel density plots by year.

9. Data Cleaning/Tidy Data

Why Tidy Data?

“Tidy datasets are all alike, but every messy dataset is messy in its own way”

Data Cleaning is not just a first step
- repeated when new data turns up, new ideas arrive, etc.
About 80% of the effort of data analysis is cleaning and preparing data

Principles of Tidy Data provide a standard way to organise data values within a dataset

R for Data Science: http://r4ds.had.co.nz/
Data Wrangling cheat sheet: https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf
Introduction to dplyr: https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html
Data Wrangling with R and RStudio: https://www.rstudio.com/resources/webinars/data-wrangling-with-r-and-rstudio/

A Messy Dataset (1)

df1 <- data.frame(treatment=c("treatmenta", "treatmentb"),
                  John.Smith=c(NA, 2),
                  Jane.Doe=c(16, 11),
                  Mary.Johnson=c(3, 1))
colnames(df1) = c("", "John Smith", "Jane Doe", "Mary Johnson")

	John Smith	Jane Doe	Mary Johnson
treatmenta	NA	16	3
treatmentb	2	11	1

A Messy Dataset (2)

df2 <- data.frame(name=c("John Smith", "Jane Doe", "Mary Johnson"),
                 treatmenta=c(NA, 16, 3),
                 treatmentb=c(2, 11, 1))
colnames(df2) = c("", "treatmenta", "treatmentb")

	treatmenta	treatmentb
John Smith	NA	2
Jane Doe	16	11
Mary Johnson	3	1

Data Semantics

A dataset is a collection of VALUES

	John Smith	Jane Doe	Mary Johnson
treatmenta	NA	16	3
treatmentb	2	11	1

Each value belongs to a variable and an observation

VARIABLES: can change or vary
- values that are measured or decided by a researcher: height, temperature, duration, treatment, etc.
OBSERVATIONS: values measured across all variables for the same individual/unit/group: person, reactor, religion, company, etc.

Challenge 11 (2min)

In the table below, what is the correct assignment for rows and columns?

	John Smith	Jane Doe	Mary Johnson
treatmenta	NA	16	3
treatmentb	2	11	1

Rows=OBSERVATIONS, Columns=VARIABLES
Rows=OBSERVATIONS, Columns=NEITHER
Rows=NEITHER, Columns=VARIABLES
Rows=NEITHER, Columns=NEITHER

USE CHALLENGE LINK ON ETHERPAD

A Tidy Dataset (1)

The dataset contains 18 values:

three variables
six observations

VARIABLES
- person (John, Mary, Jane)
- treatment (a and b)
- result (NA, 16, 3, 2, 11, 1)

Each OBSERVATION includes all three variables

A Tidy Dataset (2)

name	treatment	result
John Smith	a	NA
Jane Doe	a	16
Mary Johnson	a	3
John Smith	b	2
Jane Doe	b	11
Mary Johnson	b	1

Tidy Data

Tidy data is a standard way of structuring a dataset
- not the only way
- makes it easy to extract needed variables
- well suited to R (supports vectorisation)

Each VARIABLE forms a column
Each OBSERVATION forms a row

The real data you receive may often need to be cleaned to be in Tidy form
Messy forms are sometimes useful
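
As a rough illustration, the second messy table can be rearranged into the tidy form using only base R. This is a sketch, not the method used in the demo: the names messy and tidy are hypothetical, and tidyr's pivot_longer() is a common alternative for larger tables.

```r
# Messy form: one column per treatment
messy <- data.frame(
  name = c("John Smith", "Jane Doe", "Mary Johnson"),
  treatmenta = c(NA, 16, 3),
  treatmentb = c(2, 11, 1)
)

# Tidy form: one row per (person, treatment) observation
tidy <- data.frame(
  name = rep(messy$name, times = 2),
  treatment = rep(c("a", "b"), each = nrow(messy)),
  result = c(messy$treatmenta, messy$treatmentb)
)
tidy
```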

INTERACTIVE DEMO

10. Working With Tidy Data

Learning Objectives

How to manipulate data.frames with the six verbs of dplyr
- a ‘grammar of data manipulation’

select()
filter()
group_by()
summarize()
mutate()
%>% (pipe)

What and Why is `dplyr`?

dplyr is another package in the Tidyverse
Facilitates analysis by groups in Tidy Data
- Helps avoid repetition

> mean(gapminder[gapminder$continent == "Africa", "gdpPercap"])
[1] 2193.755
> mean(gapminder[gapminder$continent == "Americas", "gdpPercap"])
[1] 7136.11
> mean(gapminder[gapminder$continent == "Asia", "gdpPercap"])
[1] 7902.15

Avoiding repetition (through automation) makes code

robust
reproducible

Split-Apply-Combine

`select()` - Interactive Demo

library(dplyr)

select(gapminder, year, country, gdpPercap)
gapminder %>% select(year, country, gdpPercap)

`filter()`

filter() selects rows on the basis of some condition

INTERACTIVE DEMO

filter(gapminder, continent=="Europe")

# Select gdpPercap by country and year, only for Europe
eurodata <- gapminder %>%
              filter(continent == "Europe") %>%
              select(year, country, gdpPercap)

Challenge 12 (5min)

Can you write a single line (which may span multiple lines in your script by including pipes) to produce a dataframe from gapminder containing:

life expectancy, country, and year data
only for African nations

How many rows does the dataframe have?

`group_by()`

INTERACTIVE DEMO

group_by(gapminder, continent)
gapminder %>% group_by(continent)

`summarize()`

INTERACTIVE DEMO

# Produce table of mean GDP by continent
gapminder %>%
    group_by(continent) %>%
    summarize(meangdpPercap=mean(gdpPercap))

Challenge 13 (5min)

Can you calculate the average life expectancy per country in the gapminder data?
Which nation has the longest life expectancy, and which the shortest?

`count()` and `n()`

Two useful functions related to summarize()
- count()/tally(): a function that reports a table of counts by group
- n(): a function used within summarize(), filter() or mutate() to represent count by group

INTERACTIVE DEMO

gapminder %>%
  filter(year == 2002) %>%
  count(continent, sort = TRUE)

gapminder %>%
  group_by(continent) %>%
  summarize(se_lifeExp = sd(lifeExp)/sqrt(n()))

`mutate()`

mutate() is a function allowing creation of new variables

INTERACTIVE DEMO

# Calculate GDP in $billion
gdp_bill <- gapminder %>%
  mutate(gdp_billion = gdpPercap * pop / 10^9)

# Calculate mean/sd of GDP per capita and GDP ($billion) by continent and year
gdp_bycontinents_byyear <- gapminder %>%
    mutate(gdp_billion=gdpPercap*pop/10^9) %>%
    group_by(continent,year) %>%
    summarize(mean_gdpPercap=mean(gdpPercap),
              sd_gdpPercap=sd(gdpPercap),
              mean_gdp_billion=mean(gdp_billion),
              sd_gdp_billion=sd(gdp_billion))

11. Dynamic Reports

Literate Programming

A programming paradigm introduced by Donald Knuth
The program (or analysis) is explained in natural language
- The source code is interspersed
The whole document is executable

We can produce these documents in RStudio

Literate Programming, by Donald Knuth: http://www.literateprogramming.com/knuthweb.pdf

Create an `R Markdown` file

R Markdown files embody Literate Programming in R
File \(\rightarrow\) New File \(\rightarrow\) R Markdown
Enter a title
Save the file (gets the extension .Rmd)

Components of an `R Markdown` file

Header information is fenced by ---

---
title: "Literate Programming"
author: "Leighton Pritchard"
date: "04/12/2017"
output: html_document
---

Natural language is written as plain text

This is an R Markdown document. Markdown is a simple formatting syntax

R code (which is executable) is fenced by backticks (```)

Click on Knit

Creating a Report

We’re going to create an R Markdown report on the gapminder data

INTERACTIVE DEMO

R Markdown documentation: http://rmarkdown.rstudio.com/
R Markdown cheat sheet: http://www.rstudio.com/wp-content/uploads/2015/02/rmarkdown-cheatsheet.pdf
Getting started with R Markdown: https://www.rstudio.com/resources/webinars/getting-started-with-r-markdown/
Reproducible reporting: https://www.rstudio.com/resources/webinars/reproducible-reporting/

12. Conclusion

You have learned:

About R, RStudio and how to set up a project
How to load data into R and produce summary statistics and plots with base tools
The basic data types in R and the most important data structures
How to install and use packages
How to use the Tidyverse to manipulate and plot data
How to create dynamic reports in R

WELL DONE!!

The End Is The Beginning

Where Next?

BONUS. Programming in `R`

Learning Objectives

How to make data-dependent choices in R
- Use if() and else
Repeat operations in R
- Use for() loops
- Use vectorisation to avoid repeating operations
Writing functions to avoid repetition and make code reusable

`if()` … `else`

We often want to perform operations (or not) conditional on whether something is TRUE
- The if() and else construct is useful for this

# if
if (condition is true) {
  PERFORM ACTION
}

# if ... else
if (condition is true) {
  PERFORM ACTION
} else {  # i.e. if the condition is false,
  PERFORM ALTERNATIVE ACTION
}
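
A runnable instance of the template above, using a made-up temperature value and a hypothetical variable message_text:

```r
temperature <- 28.6   # example value

# The condition must evaluate to a single TRUE or FALSE
if (temperature > 25) {
  message_text <- "warm"
} else {
  message_text <- "cool"
}
print(message_text)
```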

INTERACTIVE DEMO

Challenge 14 (2min)

Can you use an if() statement to report whether there are any records from 2002 in the gapminder dataset?

Can you do the same for 2012?

HINT: Look at the help for the any() function

`for()` loops

for() loops are a very common construct in programming
- for each <item> in a group, <do something (with the item)>
Not as useful in R as in some other languages

for(iterator in set of values){
  do a thing
}
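
A runnable instance of the template above. The vectors values and doubled are illustrative names; pre-allocating the result vector before the loop is a common habit rather than a requirement.

```r
values <- c(2, 4, 6)
doubled <- numeric(length(values))   # pre-allocate the result vector

for (i in seq_along(values)) {       # i takes the values 1, 2, 3 in turn
  doubled[i] <- values[i] * 2
  print(paste("item", i, "doubled is", doubled[i]))
}
```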

INTERACTIVE DEMO

`while()` loops

while() loops are useful when you need to do something while some condition is true

while(this condition is true){
  do a thing
}
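
A runnable instance of the template above, with made-up counters. The loop keeps adding the next whole number until the running total reaches 10.

```r
count <- 0
total <- 0

while (total < 10) {       # the condition is re-checked before every pass
  count <- count + 1
  total <- total + count   # running total: 1, 3, 6, 10
}

count   # number of passes through the loop
total   # final running total
```

Unlike a for() loop, the number of passes is not known in advance, so the body must eventually make the condition FALSE.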

INTERACTIVE DEMO

Challenge 15 (2min)

Can you use a for() loop and an if() statement to print whether each letter in the alphabet is a vowel?

HINT: Use R’s help for letters and %in%

Vectorisation

for() and while() loops can be useful, but are often slower than vectorised code
Most functions in R are vectorised
- When applied to a vector, apply to all elements in that vector
- No need to loop

You’ve already seen and used much of this behaviour

INTERACTIVE DEMO

x <- 1:4
x * 2
y <- 6:9
x + y
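
To make the comparison concrete, here is the same calculation written as a loop and as a vectorised expression. The variable names are illustrative only.

```r
values <- c(1.5, 2.0, 3.5)

# Loop version: square each element one at a time
squares_loop <- numeric(length(values))
for (i in seq_along(values)) {
  squares_loop[i] <- values[i]^2
}

# Vectorised version: the operator is applied to every element at once
squares_vec <- values^2

identical(squares_loop, squares_vec)
```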

Challenge 16 (2min)

We want to sum the following series of fractions

\(\frac{1}{1^2} + \frac{1}{2^2} + \frac{1}{3^2} + \ldots + \frac{1}{n^2}\)

for large values of \(n\)

Can you do this using vectorisation for \(n = 10,000\)?

Functions

Why Functions?

Functions let us run a complex series of commands in one go
- under a memorable/descriptive name
- invoked with that name
- with a defined set of inputs and outputs
- to perform a logically coherent task

Functions are the building blocks of programming

Small functions with one obvious, clearly-defined task are best

Defining a Function

You will often need to write your own functions
They take a standard form

<function_name> <- function(<arg1>, <arg2>) {
  <do something>
  return(<result>)
}

INTERACTIVE DEMO

my_sum <- function(a, b) {
  the_sum <- a + b
  return(the_sum)
}

Documentation

So far, you’ve been able to use R’s built-in help to see function documentation
- This isn’t available for your functions unless you write it

Your future self will thank you!

(and so will your colleagues)

Write programs for people, not for computers

State what the code does (and why)
Define inputs and outputs
Give an example
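
One possible way to follow these three points is with comments placed directly above the function. This is a sketch; fahrenheit_to_celsius() is a hypothetical example function, and the comment layout is one convention among several.

```r
# Convert a temperature from Fahrenheit to Celsius.
#
# Why:     keeps the conversion formula in one named, reusable place.
# Input:   temp_f, a numeric vector of temperatures in degrees Fahrenheit
# Output:  a numeric vector of the same length, in degrees Celsius
# Example: fahrenheit_to_celsius(212) returns 100
fahrenheit_to_celsius <- function(temp_f) {
  temp_c <- (temp_f - 32) * 5 / 9
  return(temp_c)
}

fahrenheit_to_celsius(c(32, 212))
```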

INTERACTIVE DEMO

Function Arguments

We can define functions that take multiple arguments
We can also define default values for arguments
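
A small sketch of a default argument, using a hypothetical helper called mean_rounded():

```r
# 'digits' has a default value, so callers may leave it out
mean_rounded <- function(values, digits = 1) {
  return(round(mean(values), digits))
}

mean_rounded(c(1.2, 2.3, 3.9))               # uses the default, digits = 1
mean_rounded(c(1.2, 2.3, 3.9), digits = 2)   # overrides the default
```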

INTERACTIVE DEMO

# Calculate total GDP in gapminder data
calcGDP <- function(data, year_in=NULL, country_in=NULL) {
  gdp <- data %>% mutate(gdp=(pop * gdpPercap))
  if (!is.null(year_in)) {
    gdp <- gdp %>% filter(year %in% year_in)
  }
  if (!is.null(country_in)) {
    gdp <- gdp %>% filter(country %in% country_in)
  }
  return(gdp)
}

Challenge 17 (10min)

Can you write a function that takes an optional argument called letter, which:

Plots the life expectancy per year for each country
Only for countries whose name starts with a letter in letter
Uses facet_wrap() to produce a grid of output graphs

ADVANCED: Make the facet wrapping optional

HINT: The following code may be useful

starts.with <- substr(gapminder$country, start = 1, stop = 1)
az.countries <- gapminder[starts.with %in% c("A", "Z"), ]
