4-5/5/2021

1. Data Analysis

Data Analysis in the Scientific Cycle

Data-Intensive Research

  • Science and humanities are increasingly data-driven
    • Early-career training has not prepared all researchers for this

Research Workflows

  • Enable systematic, replicable and reproducible work
    • Design principles
      • Best practices for data
    • Software development methods
      • Automation of repetitive calculations

Pipelines and Workflows

Pipeline

  • What a computer does
    • A series of instructions
    • Data is piped through programs, and a result emerges

Workflow

  • What a researcher does
    • Exploring data, developing hypotheses, writing code, interpreting results
  • Outputs include:
    • datasets, methods, teaching materials, software, papers, etc.

Explore, Refine, Produce (ERP)

2. Welcome to R

Learning Objectives

  • Fundamentals of R and RStudio
  • Fundamentals of programming (in R)
  • Data management with the tidyverse
  • Publication-quality data visualisation with ggplot2
  • Reporting with RMarkdown

What is R?

  • R is:
    • a programming language
    • the software that interprets/runs programs written in the R language

Why use R?

  • free (though commercial support can be bought)
  • widely used
    • sciences, humanities, engineering, statistics, etc.
  • has many excellent specialised packages for data analysis and visualisation
  • international, friendly user community

What is RStudio?

Please start RStudio

  • RStudio is an integrated development environment (IDE)
  • Script/code editor; Project management
  • Interaction with R (console/‘scratchpad’); Graphics/visualisation/Help

“Why not use Excel?”

  • Excel is good for some things
  • R is excellent for analysis and reproducibility…
  • Separates data from analysis
  • Not point-and-click: every step is explicit and transparent
  • Easy to share, adapt, reuse, publish analyses with new/modified data (GitHub)
  • R can be run on supercomputers, with extremely large datasets…

RStudio overview - INTERACTIVE DEMO

Variables

Variables are like named boxes

  • An item (object) of data goes in the box (which is called Name)
  • When we refer to the box (variable) by its name, we really mean what’s in the box

Variables - Interactive Demo

x <- 1 / 40
x
## [1] 0.025
x ^ 2
## [1] 0.000625
log(x)
## [1] -3.688879
name <- "Samia"
name
## [1] "Samia"

Naming Variables

Variable names are documentation

current_temperature = 28.6
subjectID = "GCF_00001236452.1"
GPS_Location = "54N, 36E"
  • descriptive, but not too long
  • letters, numbers, underscores, and periods ([a-zA-Z0-9_.])
  • cannot contain whitespace or start with a number (x2 is allowed, 2x is not)
  • case sensitive (Weight is not the same as weight)
  • do not reuse names of built-in functions
  • Consistent style:
    • lower_snake, UPPER_SNAKE, lowerCamelCase, UpperCamelCase

Functions

Functions (log(), sin() etc.) ≈ “canned script”

  • automate complicated tasks
  • make code more readable and reusable
  • Functions usually take arguments (input)
  • Functions often return values (output)
  • Some functions are built-in (in base packages, e.g. sqrt(), lm(), plot())
  • Groups of related functions can be imported as libraries

Getting Help in R

INTERACTIVE DEMO

args(fname)            # arguments for fname
?fname                 # help page for fname
help(fname)            # help page for fname
??fname                # any mention of fname
help.search("text")    # any mention of "text"
vignette(fname)        # worked examples for fname
vignette()             # show all available vignettes

Challenge 01 (1min)

What will be the value of each variable after each statement in the following program?

mass <- 47.5
age <- 122
mass <- mass * 2.3
age <- age - 20
  • mass = 47.5, age = 102
  • mass = 109.25, age = 102
  • mass = 47.5, age = 122
  • mass = 109.25, age = 122

USE CHALLENGE LINK ON ETHERPAD

3. Project Management in R

How Projects Tend To Grow

Good Practice

THERE IS NO ONE TRUE WAY (only principles)

  • Use a single working directory per project/analysis
    • easier to move, share, and find files
    • use relative paths to locate files
  • Treat raw data as read-only
    • keep in a separate subfolder (data?)
  • Clean data ready for work programmatically
    • keep cleaned/modified data in separate folder (clean_data?)
  • Consider output generated by analysis to be disposable
    • can be regenerated by running analysis/code
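
One possible way to set up such a layout from R is sketched below. The folder names output and scripts are illustrative assumptions (the slides only suggest data and clean_data), and a temporary directory stands in for a real project folder.

```r
# Sketch: create a project skeleton (folder names are illustrative assumptions)
project_dir <- file.path(tempdir(), "my_project")   # stand-in for a real project folder
subfolders <- c("data", "clean_data", "output", "scripts")

for (folder in subfolders) {
  # recursive = TRUE also creates project_dir itself if it is missing
  dir.create(file.path(project_dir, folder), recursive = TRUE, showWarnings = FALSE)
}

# Relative paths are built with file.path() rather than typed by hand
raw_file <- file.path("data", "inflammation-01.csv")

list.files(project_dir)
```

Building paths with file.path() keeps scripts portable when the project folder is moved or shared.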

Example Directory Structure

Project Management in RStudio

  • RStudio tries to help you manage your projects
    • R Project concept - files and subdirectory structure
    • integration with version control
    • switching between multiple projects within RStudio
    • stores project history

Let’s create a project in RStudio

INTERACTIVE DEMO

Working in RStudio

We can write code in several ways in RStudio

  • At the console (you’ve done this)
  • In a script
  • As an interactive notebook
  • As a markdown file
  • As a Shiny app

We’re going to create a new dataset and R script.

  • Putting code in a script makes it easier to modify, share and run

INTERACTIVE DEMO

4. A First Analysis in RStudio

Our Task

  • Patients have been given a new treatment for arthritis
  • We have measurements of inflammation over a period of days for each patient
  • We want to produce a preliminary analysis and graphs for this data

Download the file from the following link to your data/ directory, and extract it

(the link is also available on the course Etherpad page)

Loading Data - Interactive Demo

  • You created data manually earlier, but this is rare
  • Data are most commonly read in from plain text files

Data files can be inspected in RStudio

read.csv(file = "data/inflammation-01.csv", header = FALSE)

Challenge 02 (2min)

Someone gives you a data file that has:

  • a comma (,) as the decimal point character
  • semi-colon (;) as the field separator

How would you open it, using read.csv()?

Use the help function and documentation

USE CHALLENGE LINK ON ETHERPAD

Indexing Data

INTERACTIVE DEMO

  • We use indexing to refer to elements of a matrix
    • square brackets: []
    • row, then column: [row, column]
data[1, 1]     # First value in dataset
data[30, 20]   # Middle value of dataset
  • To get a range of values, use the : separator (meaning ‘to’)
data[1:4, 1:4]   # rows 1 to 4; columns 1 to 4
  • To select a complete row or column, leave it blank
data[5, ]     # row 5
data[, 16]    # column 16
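
The same indexing rules can be tried on a small object where every value is easy to check by eye. This is a sketch on a made-up 3 x 4 matrix, not the inflammation data.

```r
# Toy 3 x 4 matrix standing in for the inflammation data (values are made up)
m <- matrix(1:12, nrow = 3, ncol = 4)

m[2, 3]        # single value: row 2, column 3
m[1:2, 3:4]    # block: rows 1 to 2, columns 3 to 4
m[3, ]         # all of row 3
m[, 2]         # all of column 2
m[c(1, 3), ]   # rows 1 and 3 only, selected with a vector of indices
```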

Summary Functions

INTERACTIVE DEMO

  • R provides useful functions to summarise data
  • We can use indexing to get summary information on individual patients and days
max(data)           # largest value in dataset
max(data[2, ])      # largest value for row (patient) 2
min(data[, 7])      # smallest value on column (day) 7
mean(data[, 7])     # mean value on day 7
sd(data[, 7])       # standard deviation of values on day 7

Repetitive Calculations

INTERACTIVE DEMO

  • Calculating for every patient (or day) this way is tedious

Computers exist to do tedious things for us

  • So apply a function (mean) to each row in the data:

  • R has several ways to automate this process

apply(X = data, MARGIN = 1, FUN = mean)
  • MARGIN = 1: rows
  • MARGIN = 2: columns
rowMeans(data)
colMeans(data)
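
A minimal sketch of how MARGIN changes the result, using a made-up matrix in place of the inflammation data:

```r
# Toy matrix: 3 "patients" (rows) by 4 "days" (columns); values are made up
m <- matrix(c(0, 1, 2,
              1, 2, 3,
              2, 3, 4,
              3, 4, 5), nrow = 3, ncol = 4)

row_means <- apply(X = m, MARGIN = 1, FUN = mean)   # one mean per row
col_max   <- apply(X = m, MARGIN = 2, FUN = max)    # one maximum per column

# rowMeans() is a shortcut for the first call
all.equal(row_means, rowMeans(m))
```

apply() accepts any function through FUN, whereas rowMeans() and colMeans() cover only the mean.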

Base Graphics

“The purpose of computing is insight, not numbers.” - Richard Hamming

  • R has many available graphics packages
    • graphically beautiful
    • specific problem domains
  • ‘built-in’ graphics are known as base graphics
  • Base graphics are powerful tools for visualisation and understanding

Plotting

INTERACTIVE DEMO

avg_inflammation_patient <- apply(data, 1, mean)
plot(avg_inflammation_patient)

max_day_inflammation <- apply(data, 2, max)
plot(max_day_inflammation)

plot(apply(data, 2, min))       # 3 functions in one!

Challenge 03 (5min)

Can you add plots to your script showing:

  • scatterplot of standard deviation of inflammation across all patients, by day
  • a histogram of average inflammation across all patients, by day

5. Data Types and Structures in R

Learning Objectives

  • Basic data types in R
  • Common data structures in R
  • How to find out the type/structure of R data
  • Understand how R’s data types and structures relate to your own data

Data Types and Structures in R

  • R is mostly used for data analysis
  • R has special types and structures to help you work with data
  • Much of the focus is on tabular data (data frames)

INTERACTIVE DEMO

Understanding data types, their uses, and how they relate to your own data is key to successful analysis with R

(it’s not just about programming)

What Data Types Do You Expect?

What data types would you expect to see?

What examples of data types can you think of from your own experience?

Please write them into the chat

Data Types in R

  • Data types in R are atomic
    • All data structures are built from these
  1. logical: TRUE, FALSE
  2. numeric:
    • integer: 3L, 2L, 123456L (the L suffix is required; a bare 3 is stored as a double)
    • double (decimal): 3.0, -23.45, pi
  3. complex: 3+0i, 1+4i
  4. character (text): "a", 'SWC', "This is not a string"
  5. raw: binary data (we won’t cover this)
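
The type of any value can be checked with typeof(). A short sketch:

```r
typeof(TRUE)      # "logical"
typeof(2L)        # "integer"  (the L suffix requests an integer)
typeof(3)         # "double"   (numbers without L are stored as doubles)
typeof(1 + 4i)    # "complex"
typeof("SWC")     # "character"

# is.numeric() is TRUE for both integers and doubles
is.numeric(2L)
is.numeric(3.0)
```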

INTERACTIVE DEMO

Challenge 04 (2min)

Create examples of data with the following characteristics:

  • name: answer, type: logical
  • name: height, type: numeric
  • name: dog_name, type: character

For each variable, test that it has the data type you intended

Four Common R Data Structures

  • vector
  • factor
  • list
  • data.frame

INTERACTIVE DEMO

Challenge 05 (5min)

Vectors are atomic: they can contain only a single data type

What data type are the following vectors (xx, yy, zz)?

xx <- c(1.7, "a")
yy <- c(TRUE, 2)
zz <- c("a", TRUE)

Options: logical, integer, numeric, character

USE CHALLENGE LINK ON ETHERPAD

Coercion

  • Coercion means changing data from one type to another
  • R will perform implicit coercion on vectors to make them atomic

logical \(\rightarrow\) integer \(\rightarrow\) double \(\rightarrow\) complex \(\rightarrow\) character

If there are formatting problems with your data, you might not have the type you expect when you import into R

  • Manual coercion with as.<type_name>()
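
A brief sketch of implicit and manual coercion; the values are made up for illustration.

```r
# Implicit coercion: mixing types in c() promotes everything to the most general type
mixed <- c(FALSE, 3L, 2.5)
typeof(mixed)                     # "double"

# Explicit coercion with the as.<type_name>() family
as.integer("42")                  # text that looks like a number converts cleanly
as.logical(c("TRUE", "false"))    # recognised spellings become logical values
as.character(1:3)                 # numbers become text

# Text that cannot be interpreted becomes NA (R also emits a warning)
bad <- suppressWarnings(as.numeric(c("1.5", "n/a")))
is.na(bad)
```

An unexpected NA after import is often the first sign that a column contains stray text.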

INTERACTIVE DEMO

Factors

Data comes as one of two types:

  • quantitative: e.g. integers or real numbers
    (weight <- 17.2; rooms <- 7)
  • categorical: e.g. ordered or unordered classes
    (grade <- "8"; coat <- "brindled")

This kind of distinction is critical in many applications (e.g. statistical modelling)

  • Factors are special vectors that represent categorical data
    • Stored as vectors of labelled integers
    • Cannot be treated as strings/text
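
The "labelled integers" idea can be seen directly. This sketch uses hypothetical coat colours and also shows a common pitfall when factor labels look like numbers.

```r
# Hypothetical coat colours stored as a factor
coat <- factor(c("tabby", "black", "tabby", "calico"))

levels(coat)        # the distinct categories, sorted alphabetically by default
as.integer(coat)    # the integer codes stored underneath
table(coat)         # counts per category

# Pitfall: a factor of number-like labels converts to its codes, not its labels
sizes <- factor(c("10", "5", "10"))
as.numeric(sizes)                  # codes
as.numeric(as.character(sizes))    # the original numbers
```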

INTERACTIVE DEMO

Challenge 06 (5min)

Create a new factor, defining control and case experiments, and inspect the result:

f <- factor(c("case", "control", "case", "control", "case"))
str(f)
##  Factor w/ 2 levels "case","control": 1 2 1 2 1

In some statistical analyses in R it is important that the control level is numbered 1

  • Using the help available to you in RStudio, can you create a factor with the same values, but where the control level is numbered 1?

Lists

  • lists are like vectors, but can hold any combination of data types
    • elements in a list are denoted by [[]] and can be named

INTERACTIVE DEMO

# create a list
l <- list(1, 'a', TRUE, matrix(0, nrow = 2, ncol = 2), f)
l_named <- list(a = "SWC", b = 1:4)
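
A short sketch of the different ways to reach into a named list, and how single and double brackets differ:

```r
l_named <- list(a = "SWC", b = 1:4)

l_named[["b"]]      # double brackets extract the element itself (an integer vector)
l_named$a           # $ is shorthand for [[ ]] with a name
l_named["b"]        # single brackets return a smaller list, not the element
names(l_named)      # element names

class(l_named[["b"]])   # "integer"
class(l_named["b"])     # "list"
```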

Logical Indexing

  • We have used indexing, slicing and names to get data by ‘location’
> animal[c(2,4,6)]
[1] "o" "k" "y"
> l_named$b
[1] 1 2 3 4
  • Logical indexes select data that meets certain criteria

INTERACTIVE DEMO

x <- c(5.4, 6.2, 7.1, 4.8, 7.5)
mask <- c(TRUE, FALSE, TRUE, FALSE, TRUE)
x[mask]
x[x > 7]
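
Logical conditions can also be combined or counted. A small sketch using the same vector:

```r
x <- c(5.4, 6.2, 7.1, 4.8, 7.5)

x > 7                 # the comparison itself yields a logical vector
x[x > 5 & x < 7]      # combine conditions with & (and) or | (or)
which(x > 7)          # positions of the matching elements
sum(x > 7)            # TRUE counts as 1, so this counts the matches
```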

6. Dataframes

Let’s look at a data.frame

  • The cats data is a data.frame

INTERACTIVE DEMO

> class(cats)
[1] "data.frame"
> cats
    coat weight likes_string
1 calico    2.1            1
2  black    5.0            0
3  tabby    3.2            1

What is a data.frame?

  • The standard R data structure for storing tabular, rectangular data
  • A named list of vectors having identical lengths.
    • Each column is a vector
    • Each vector can be a different data type
  • This is very much LIKE a spreadsheet, but…
    • Columns are constrained to a type
    • Columns are all the same length
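
The "list of equal-length vectors" description can be checked directly. This sketch rebuilds a small data.frame modelled on the cats example.

```r
# Small data.frame modelled on the cats example
cats <- data.frame(coat = c("calico", "black", "tabby"),
                   weight = c(2.1, 5.0, 3.2),
                   likes_string = c(1, 0, 1))

is.list(cats)             # a data.frame is a list underneath
sapply(cats, typeof)      # each column has its own single type
sapply(cats, length)      # every column has the same length

cats$weight               # a column is an ordinary vector...
cats$weight * 1000        # ...so vectorised arithmetic works on it
```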

Creating a data.frame

INTERACTIVE DEMO

# Create a data frame
df <- data.frame(a=c(1,2,3), b=c('eeny', 'meeny', 'miney'),
                 c=c(TRUE, FALSE, TRUE))
summary(df)
##        a            b                 c          
##  Min.   :1.0   Length:3           Mode :logical  
##  1st Qu.:1.5   Class :character   FALSE:1        
##  Median :2.0   Mode  :character   TRUE :2        
##  Mean   :2.0                                     
##  3rd Qu.:2.5                                     
##  Max.   :3.0

Saving a data.frame to file

INTERACTIVE DEMO

write.table(df, "data/df_example.tab", sep="\t")

We need to provide

  • the data.frame
  • the path to the file being written
  • a column separator
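
A sketch of a write/read round trip. A temporary file stands in for a path such as data/df_example.tab, and the row.names and quote settings are optional choices made here, not part of the demo.

```r
df <- data.frame(a = c(1, 2, 3),
                 b = c("eeny", "meeny", "miney"),
                 c = c(TRUE, FALSE, TRUE))

# A temporary file stands in for a path such as "data/df_example.tab"
out_file <- tempfile(fileext = ".tab")

# row.names = FALSE and quote = FALSE keep the file free of extra decoration
write.table(df, out_file, sep = "\t", row.names = FALSE, quote = FALSE)

# Read it back with the same separator and compare with the original
df_back <- read.table(out_file, sep = "\t", header = TRUE)
all.equal(df, df_back)
```

Reading the file back is a cheap check that the separator and header settings match on both sides.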

Loading a data.frame

INTERACTIVE DEMO

The link is available on the course Etherpad

gapminder <- read.table("data/gapminder-FiveYearData.csv", sep=",", header=TRUE)
  • R can also read data direct from the internet
url <- paste("https://raw.githubusercontent.com/resbaz/",
             "r-novice-gapminder-files/master/data/",
             "gapminder-FiveYearData.csv", sep = '')
gapminder <- read.table(url, sep=",", header=TRUE)

Investigating gapminder

INTERACTIVE DEMO

str(gapminder)              # structure of the data.frame
typeof(gapminder$year)      # data type of a column
length(gapminder)           # length of the data.frame (its number of columns)
nrow(gapminder)             # number of rows in data.frame
ncol(gapminder)             # number of columns in data.frame
dim(gapminder)              # number of rows and columns in data.frame
colnames(gapminder)         # column names from data.frame
head(gapminder)             # first few rows of dataframe
summary(gapminder)          # summary of data in data.frame columns

7. Packages

Packages

In R:

  • a package is a collection (or library) of reusable code
  • many useful and specialist tools are distributed as packages
  • over 10,000 packages are available at CRAN
  • you can distribute your own code as a package

INTERACTIVE DEMO

installed.packages()               # see installed packages
install.packages("packagename")    # install a new package
update.packages()                  # update installed packages
library(packagename)               # import a package for use in your code

CRAN - the Comprehensive R Archive Network: https://cran.r-project.org/

Challenge 07 (5min)

Can you check if the following packages are installed on your system, and install them if necessary?

dplyr
ggplot2
knitr

8. Creating Publication-Quality Graphics

Visualisation is Critical!

The Grammar of Graphics

  • ggplot2 is part of the Tidyverse, a collection of packages for data science
    • ggplot2 is the graphics package
  • Implements the “Grammar of Graphics”
    • Separates data from its representation
    • Helps iteratively update/refine plots
    • Helps build complex, effective visualisations from simple elements
  • data
  • aesthetics
  • geoms
  • layers

A Basic Scatterplot

  • You can use ggplot2 like base graphics
    • qplot() ≈ plot()

INTERACTIVE DEMO

library(ggplot2)
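# as.factor() is needed because col= expects colour names or numbers, not arbitrary text
plot(gapminder$lifeExp, gapminder$gdpPercap, col=as.factor(gapminder$continent))
# Note: qplot() is deprecated as of ggplot2 3.4.0; ggplot() is preferred in new code
qplot(lifeExp, gdpPercap, data=gapminder, colour=continent)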

What is a Plot? aesthetics

  • Each observation in the data is a point
  • A point’s aesthetics determine how it is rendered
    • co-ordinates on the image; size; shape; colour
  • aesthetics can be constant or mapped to variables
  • Many different plots can be generated from the same data by changing aesthetics
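
The difference between a mapped and a constant aesthetic might look like this in ggplot2. The toy data frame borrows gapminder-style column names, but its values are made up for illustration.

```r
library(ggplot2)

# Toy data with gapminder-style column names (values are made up)
toy <- data.frame(lifeExp   = c(45, 60, 72, 78),
                  gdpPercap = c(800, 3000, 12000, 30000),
                  continent = c("Africa", "Asia", "Europe", "Europe"))

# Mapped aesthetic: colour varies with a variable, so it goes inside aes()
p_mapped <- ggplot(toy, aes(x = lifeExp, y = gdpPercap, colour = continent)) +
  geom_point()

# Constant aesthetics: fixed values are set outside aes()
p_constant <- ggplot(toy, aes(x = lifeExp, y = gdpPercap)) +
  geom_point(colour = "blue", size = 3)
```

Printing p_mapped or p_constant at the console draws the plot.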

What is a Plot? geoms

geom (short for geometry) defines the “type” of representation

  • If data are drawn as points: scatterplot
  • If data are drawn as lines: line plot
  • If data are drawn as bars: bar chart
  • ggplot2 provides several geom types

What is a Plot? geoms

The same data and aesthetics can be shown with different geoms

INTERACTIVE DEMO

# Generate plot of GDP per capita against life expectancy
p <- ggplot(data=gapminder, aes(x=lifeExp, y=gdpPercap, color=continent))
p + geom_point()
p + geom_line()

Challenge 08 (2min)

Can you create another figure in your script showing how life expectancy changes as a function of time, as a scatterplot?

What is a Plot? layers

  • We’ve just used another “Grammar of Graphics” concept: layers
    • ggplot2 plots are built as layers

All layers have two components

  1. data and aesthetics
  2. a geom
  • Data and aesthetics can be defined in a base ggplot object
    • values from the base are inherited by the other layers
    • the base can be overridden in other layers

p <- ggplot(data=gapminder, aes(x=lifeExp, y=gdpPercap, colour=continent))
p + geom_point()

p <- ggplot(data=gapminder, aes(x=lifeExp, y=gdpPercap, colour=continent))
p + geom_line(aes(group=country))

INTERACTIVE DEMO

What is a Plot? layers

  • We can use several layers of geoms to build a plot
    • alpha controls opacity for a layer
p <- ggplot(data=gapminder, aes(x=lifeExp, y=gdpPercap, color=continent))
p + geom_line(aes(group=country)) + geom_point(alpha=0.4)

INTERACTIVE DEMO

Challenge 09 (5min)

Can you create another figure in your script showing how life expectancy changes as a function of time, coloured by continent, with two layers:

  • a line plot, grouping points by country
  • a scatterplot showing each data point, with 35% opacity

Transformations and scales

  • Data transformations are handled with scale layers
  • axis scaling (log scales)
  • colour scaling (changing palettes)
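
A minimal sketch of both kinds of scale layer, on made-up data with gapminder-style column names. The choice of the "Dark2" palette is only an example.

```r
library(ggplot2)

# Toy data with gapminder-style column names (values are made up)
toy <- data.frame(lifeExp   = c(45, 60, 72, 78),
                  gdpPercap = c(800, 3000, 12000, 30000),
                  continent = c("Africa", "Asia", "Europe", "Europe"))

p <- ggplot(toy, aes(x = gdpPercap, y = lifeExp, colour = continent)) +
  geom_point() +
  scale_x_log10() +                       # axis scaling: log10 x axis
  scale_colour_brewer(palette = "Dark2")  # colour scaling: a ColorBrewer palette
```

The scale layers change how values are drawn; the data frame itself is left untouched.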

INTERACTIVE DEMO

Statistics layers

  • Some geom layers transform the dataset
    • Usually this is a data summary (e.g. smoothing or binning)
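
Two common examples are smoothing and binning. The sketch below uses simulated data; the linear model in geom_smooth() and the number of bins are illustrative choices.

```r
library(ggplot2)

# Made-up data with a roughly linear trend
set.seed(1)
toy <- data.frame(x = 1:30)
toy$y <- 2 * toy$x + rnorm(30, sd = 5)

# geom_smooth() fits a model and draws the fitted line, not the raw rows
p_smooth <- ggplot(toy, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

# geom_histogram() bins the values and draws the counts per bin
p_hist <- ggplot(toy, aes(x = y)) +
  geom_histogram(bins = 10)
```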

INTERACTIVE DEMO

Multi-panel figures

  • So far all our plots have all data in a single figure
  • Comparisons can be clearer with multiple panels:
    • facets
    • “small multiples plots”

Use the facet_wrap() layer to generate grids of plots

INTERACTIVE DEMO

# Compare life expectancy over time by continent
p <- ggplot(data=gapminder, aes(x=year, y=lifeExp, colour=continent,
                                group=country))
p <- p + geom_line() + scale_y_log10()
p + facet_wrap(~continent)

Challenge 10 (5min)

Can you create a scatterplot and contour densities of GDP per capita against population size, with colour filled by continent?

ADVANCED: Transform the x axis to better visualise data spread, and use facets to panel density plots by year.

9. Data Cleaning/Tidy Data

Why Tidy Data?

“Tidy datasets are all alike, but every messy dataset is messy in its own way”

  • Data Cleaning is not just a first step
    • repeated when new data turns up, new ideas arrive, etc.
  • About 80% of the effort of data analysis is cleaning and preparing data

Principles of Tidy Data provide a standard way to organise data values within a dataset

A Messy Dataset (1)

df1 <- data.frame(treatment=c("treatmenta", "treatmentb"),
                  John.Smith=c(NA, 2),
                  Jane.Doe=c(16, 11),
                  Mary.Johnson=c(3, 1))
colnames(df1) = c("", "John Smith", "Jane Doe", "Mary Johnson")
John Smith Jane Doe Mary Johnson
treatmenta NA 16 3
treatmentb 2 11 1

A Messy Dataset (2)

df2 <- data.frame(name=c("John Smith", "Jane Doe", "Mary Johnson"),
                 treatmenta=c(NA, 16, 3),
                 treatmentb=c(2, 11, 1))
colnames(df2) = c("", "treatmenta", "treatmentb")
treatmenta treatmentb
John Smith NA 2
Jane Doe 16 11
Mary Johnson 3 1

Data Semantics

  • A dataset is a collection of VALUES
John Smith Jane Doe Mary Johnson
treatmenta NA 16 3
treatmentb 2 11 1
  • Each value belongs to a variable and an observation
  • VARIABLES: can change or vary
    • values that are measured or decided by a researcher: height, temperature, duration, treatment, etc.
  • OBSERVATIONS: values measured across all variables for the same individual/unit/group: person, reactor, religion, company, etc.

Challenge 11 (2min)

In the table below, what is the correct assignment for rows and columns?

John Smith Jane Doe Mary Johnson
treatmenta NA 16 3
treatmentb 2 11 1

  • Rows=OBSERVATIONS, Columns=OBSERVATIONS
  • Rows=OBSERVATIONS, Columns=NEITHER
  • Rows=NEITHER, Columns=OBSERVATIONS
  • Rows=NEITHER, Columns=NEITHER

USE CHALLENGE LINK ON ETHERPAD

A Tidy Dataset (1)

  • The dataset contains 18 values:
    • three variables
    • six observations
  • VARIABLES
    • person (John, Mary, Jane)
    • treatment (a and b)
    • result (NA, 16, 3, 2, 11, 1)

Each OBSERVATION includes all three variables

A Tidy Dataset (2)

name treatment result
John Smith a NA
Jane Doe a 16
Mary Johnson a 3
John Smith b 2
Jane Doe b 11
Mary Johnson b 1

Tidy Data

  • Tidy data is a standard way of structuring a dataset
    • not the only way
    • makes it easy to extract needed variables
    • well suited to R (supports vectorisation)
  1. Each VARIABLE forms a column
  2. Each OBSERVATION forms a row
  • The real data you receive may often need to be cleaned to be in Tidy form
  • Messy forms are sometimes useful
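
One way to move the messy treatment table into tidy form is sketched below. It assumes the tidyr package (version 1.0.0 or later, for pivot_longer()) is installed. The slides do not name a tool for this step, so treat the choice as an illustration.

```r
library(tidyr)

# The second messy dataset, with a proper name for the first column
messy <- data.frame(name = c("John Smith", "Jane Doe", "Mary Johnson"),
                    treatmenta = c(NA, 16, 3),
                    treatmentb = c(2, 11, 1))

# Gather the two treatment columns into treatment/result pairs
tidy <- pivot_longer(messy,
                     cols = c(treatmenta, treatmentb),
                     names_to = "treatment",
                     names_prefix = "treatment",   # strip the shared prefix, leaving "a"/"b"
                     values_to = "result")
tidy
```

Each row of the result is one observation (a person under one treatment), and each column is one variable.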

INTERACTIVE DEMO

10. Working With Tidy Data

Learning Objectives

  • How to manipulate data.frames with the six verbs of dplyr
    • a ‘grammar of data manipulation’
  • select()
  • filter()
  • group_by()
  • summarize()
  • mutate()
  • %>% (pipe)

What and Why is dplyr?

  • dplyr is another package in the Tidyverse
  • Facilitates analysis by groups in Tidy Data
    • Helps avoid repetition
> mean(gapminder[gapminder$continent == "Africa", "gdpPercap"])
[1] 2193.755
> mean(gapminder[gapminder$continent == "Americas", "gdpPercap"])
[1] 7136.11
> mean(gapminder[gapminder$continent == "Asia", "gdpPercap"])
[1] 7902.15

Avoiding repetition (through automation) makes code

  • robust
  • reproducible

Split-Apply-Combine

select() - Interactive Demo

library(dplyr)
select(gapminder, year, country, gdpPercap)
gapminder %>% select(year, country, gdpPercap)

filter()

  • filter() selects rows on the basis of some condition

INTERACTIVE DEMO

filter(gapminder, continent=="Europe")

# Select gdpPercap by country and year, only for Europe
eurodata <- gapminder %>%
              filter(continent == "Europe") %>%
              select(year, country, gdpPercap)

Challenge 12 (5min)

Can you write a single line (which may span multiple lines in your script by including pipes) to produce a dataframe from gapminder containing:

  • life expectancy, country, and year data
  • only for African nations

How many rows does the dataframe have?

group_by()

INTERACTIVE DEMO

group_by(gapminder, continent)
gapminder %>% group_by(continent)

summarize()

INTERACTIVE DEMO

# Produce table of mean GDP by continent
gapminder %>%
    group_by(continent) %>%
    summarize(meangdpPercap=mean(gdpPercap))

Challenge 13 (5min)

  • Can you calculate the average life expectancy per country in the gapminder data?
  • Which nation has the longest life expectancy, and which the shortest?

count() and n()

  • Two useful functions related to summarize()
    • count()/tally(): a function that reports a table of counts by group
    • n(): a function used within summarize(), filter() or mutate() to represent count by group

INTERACTIVE DEMO

gapminder %>%
  filter(year == 2002) %>%
  count(continent, sort = TRUE)

gapminder %>%
  group_by(continent) %>%
  summarize(se_lifeExp = sd(lifeExp)/sqrt(n()))

mutate()

  • mutate() is a function allowing creation of new variables

INTERACTIVE DEMO

# Calculate GDP in $billion
gdp_bill <- gapminder %>%
  mutate(gdp_billion = gdpPercap * pop / 10^9)

# Calculate mean/sd of GDP per capita and of GDP ($billion) by continent and year
gdp_bycontinents_byyear <- gapminder %>%
    mutate(gdp_billion=gdpPercap*pop/10^9) %>%
    group_by(continent,year) %>%
    summarize(mean_gdpPercap=mean(gdpPercap),
              sd_gdpPercap=sd(gdpPercap),
              mean_gdp_billion=mean(gdp_billion),
              sd_gdp_billion=sd(gdp_billion))

11. Programming in R

Learning Objectives

  • How to make data-dependent choices in R
  • Use if() and else
  • Repeat operations in R
  • Use for() loops
  • vectorisation to avoid repeating operations
  • Writing functions to avoid repetition and make code reusable

if() … else

  • We often want to perform operations (or not) conditional on whether something is TRUE
    • The if() and else construct is useful for this
# if
if (condition is true) {
  PERFORM ACTION
}

# if ... else
if (condition is true) {
  PERFORM ACTION
} else {  # i.e. if the condition is false,
  PERFORM ALTERNATIVE ACTION
}
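
A runnable instance of the template above, using a hypothetical temperature reading. It also shows ifelse(), the vectorised counterpart for testing many values at once.

```r
temperature <- 28.6   # hypothetical reading

if (temperature > 30) {
  message("hot")
} else if (temperature > 15) {
  message("mild")
} else {
  message("cold")
}

# if() needs a single TRUE or FALSE; ifelse() tests every element of a vector
readings <- c(12.0, 28.6, 33.1)
labels <- ifelse(readings > 30, "hot", "not hot")
labels
```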

INTERACTIVE DEMO

Challenge 14 (2min)

Can you use an if() statement to report whether there are any records from 2002 in the gapminder dataset?

Can you do the same for 2012?

HINT: Look at the help for the any() function

for() loops

  • for() loops are a very common construct in programming
    • for each <item> in a group, <do something (with the item)>
  • Not as useful in R as in some other languages
for(iterator in set of values){
  do a thing
}
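
A runnable instance of the template above, with made-up input values. Pre-allocating the result vector avoids growing it inside the loop.

```r
values <- c(2, 4, 6)                  # made-up input
squares <- numeric(length(values))    # pre-allocate the result vector

for (i in seq_along(values)) {
  squares[i] <- values[i]^2
  print(paste("The square of", values[i], "is", squares[i]))
}

squares
```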

INTERACTIVE DEMO

while() loops

  • while() loops are useful when you need to do something while some condition is true
while(this condition is true){
  do a thing
}
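
A runnable instance of the template above. The stopping threshold of 20 is arbitrary.

```r
total <- 0
n <- 0

# Keep adding successive integers until the running total passes 20
while (total <= 20) {
  n <- n + 1
  total <- total + n
}

n       # number of integers that were added
total   # the running total when the loop stopped
```

The loop body must change something the condition depends on, otherwise it never ends.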

INTERACTIVE DEMO

Challenge 15 (2min)

Can you use a for() loop and an if() statement to print whether each letter in the alphabet is a vowel?

HINT: Use R’s help for letters and %in%

Vectorisation

  • for() and while() loops can be useful, but are often slower and more verbose than vectorised code
  • Most functions in R are vectorised
    • When applied to a vector, apply to all elements in that vector
    • No need to loop

You’ve already seen and used much of this behaviour

INTERACTIVE DEMO

x <- 1:4
x * 2
y <- 6:9
x + y
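
For comparison, the sketch below does the same doubling with an explicit loop and with a single vectorised expression.

```r
x <- 1:4

# Loop version: build the result one element at a time
doubled_loop <- numeric(length(x))
for (i in seq_along(x)) {
  doubled_loop[i] <- x[i] * 2
}

# Vectorised version: one expression operates on the whole vector
doubled_vec <- x * 2

all(doubled_loop == doubled_vec)

# Comparisons and many functions are vectorised too
x > 2
sqrt(x)
```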

Challenge 16 (2min)

We want to sum the following series of fractions

\(\frac{1}{1^2} + \frac{1}{2^2} + \frac{1}{3^2} + \ldots + \frac{1}{n^2}\)

for large values of \(n\)

Can you do this using vectorisation for \(n = 10,000\)?

12. Functions

Why Functions?

  • Functions let us run a complex series of commands in one go
    • under a memorable/descriptive name
    • invoked with that name
    • with a defined set of inputs and outputs
    • to perform a logically coherent task

Functions are the building blocks of programming

  • Small functions with one obvious, clearly-defined task are best

Defining a Function

  • You will often need to write your own functions
  • They take a standard form
<function_name> <- function(<arg1>, <arg2>) {
  <do something>
  return(<result>)
}

INTERACTIVE DEMO

my_sum <- function(a, b) {
  the_sum <- a + b
  return(the_sum)
}

Documentation

  • So far, you’ve been able to use R’s built-in help to see function documentation
    • This isn’t available for your functions unless you write it

Your future self will thank you!

(and so will your colleagues)

Write programs for people, not for computers

  • State what the code does (and why)
  • Define inputs and outputs
  • Give an example
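
One lightweight way to follow these three points is a comment block above the function. The function name and the scenario below are hypothetical.

```r
# fahr_to_celsius: convert temperatures from Fahrenheit to Celsius.
#
# Why: field readings arrive in Fahrenheit; the analysis uses Celsius.
# Input:  temp_f  numeric vector of temperatures in degrees Fahrenheit
# Output: numeric vector of the same length, in degrees Celsius
# Example: fahr_to_celsius(c(32, 212)) gives 0 and 100
fahr_to_celsius <- function(temp_f) {
  temp_c <- (temp_f - 32) * 5 / 9
  return(temp_c)
}

fahr_to_celsius(c(32, 212))
```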

INTERACTIVE DEMO

Function Arguments

  • We can define functions that take multiple arguments
  • We can also define default values for arguments

INTERACTIVE DEMO

# Calculate total GDP in gapminder data
calcGDP <- function(data, year_in=NULL, country_in=NULL) {
  gdp <- data %>% mutate(gdp=(pop * gdpPercap))
  if (!is.null(year_in)) {
    gdp <- gdp %>% filter(year %in% year_in)
  }
  if (!is.null(country_in)) {
    gdp <- gdp %>% filter(country %in% country_in)
  }
  return(gdp)
}
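
Default values work the same way in any function. The sketch below uses a hypothetical helper, rescale(), to show defaults being used, overridden by name, and overridden by position.

```r
# Hypothetical helper: rescale a numeric vector onto the range [lower, upper]
rescale <- function(x, lower = 0, upper = 1) {
  unit <- (x - min(x)) / (max(x) - min(x))   # map onto [0, 1] first
  return(lower + unit * (upper - lower))
}

v <- c(2, 4, 6)
rescale(v)                  # defaults used: result spans 0 to 1
rescale(v, upper = 100)     # override one default by name
rescale(v, 10, 20)          # or supply arguments by position
```

Naming the argument being overridden is usually clearer than relying on position.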

Challenge 17 (10min)

Can you write a function that takes an optional argument called letter, which:

  • Plots the life expectancy per year for each country
  • Only for countries whose name starts with a letter in letter
  • Uses facet_wrap() to produce a grid of output graphs
  • ADVANCED: Make the facet wrapping optional

HINT: The following code may be useful

starts.with <- substr(gapminder$country, start = 1, stop = 1)
az.countries <- gapminder[starts.with %in% c("A", "Z"), ]

13. Dynamic Reports

Literate Programming

  • A programming paradigm introduced by Donald Knuth
  • The program (or analysis) is explained in natural language
    • The source code is interspersed
  • The whole document is executable

We can produce these documents in RStudio

Create an R Markdown file

  • R Markdown files embody Literate Programming in R
  • File \(\rightarrow\) New File \(\rightarrow\) R Markdown
  • Enter a title
  • Save the file (gets the extension .Rmd)

Components of an R Markdown file

  • Header information is fenced by ---
---
title: "Literate Programming"
author: "Leighton Pritchard"
date: "04/12/2017"
output: html_document
---
  • Natural language is written as plain text
This is an R Markdown document. Markdown is a simple formatting syntax
  • R code (which is executable) is fenced by backticks (```)

Click on Knit

Creating a Report

14. Conclusion

You have learned:

  • About R, RStudio and how to set up a project
  • How to load data into R and produce summary statistics and plots with base tools
  • The data types in R and the most important data structures
  • How to install and use packages
  • How to use the Tidyverse to manipulate and plot data
  • How to use program flow control and functions
  • How to create dynamic reports in R

WELL DONE!!

The End Is The Beginning

Where Next?