11-12/5/2021
Research Workflows
Pipeline
Workflow
R
and RStudio
R
)tidyverse
ggplot2
RMarkdown
R
?R
is:
R
language
Why use R?
RStudio?
Please start RStudio
RStudio is an integrated development environment (IDE)
R (console/‘scratchpad’); Graphics/visualisation/Help
Excel?”
Excel is good for some things
R is excellent for analysis and reproducibility…
R can be run on supercomputers, with extremely large datasets…
RStudio overview - INTERACTIVE DEMO
Variables are like named boxes
Name)
x <- 1 / 40
x
## [1] 0.025
x ^ 2
## [1] 0.000625
log(x)
## [1] -3.688879
name <- "Samia"
name
## [1] "Samia"
Variable names are documentation
current_temperature = 28.6
subjectID = "GCF_00001236452.1"
GPS_Location = "54N, 36E"
[a-zA-Z0-9_.]
(x2 is allowed, 2x is not)
Weight is not the same as weight
lower_snake, UPPER_SNAKE, lowerCamelCase, UpperCamelCase
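The naming rules above can be checked at the console. This is a minimal sketch: the variable names and values are illustrative only, and `make.names()` (base R) is used here as one convenient way to test whether a string is already a syntactically valid name.

```r
# R is case-sensitive: these are two separate variables
weight <- 70.5
Weight <- 82.1
same_value <- identical(weight, Weight)

# make.names() converts a string into a valid name, so a string that
# comes back unchanged was already valid
valid_x2 <- make.names("x2") == "x2"   # a name may contain digits
valid_2x <- make.names("2x") == "2x"   # a name may not start with a digit

print(c(same_value = same_value, valid_x2 = valid_x2, valid_2x = valid_2x))
```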
Functions (log(), sin() etc.) ≈ “canned script”
sqrt(), lm(), plot())
R
INTERACTIVE DEMO
args(fname)         # arguments for fname
?fname              # help page for fname
help(fname)         # help page for fname
??fname             # any mention of fname
help.search("text") # any mention of "text"
vignette(fname)     # worked examples for fname
vignette()          # show all available vignettes
What will be the value of each variable after each statement in the following program?
mass <- 47.5
age <- 122
mass <- mass * 2.3
age <- age - 20
mass = 47.5, age = 102
mass = 109.25, age = 102
mass = 47.5, age = 122
mass = 109.25, age = 122
USE CHALLENGE LINK ON ETHERPAD
R
THERE IS NO ONE TRUE WAY (only principles)
data
?)clean_data
?)
R Projects (more advanced): https://chrisvoncsefalvay.com/2018/08/09/structuring-r-projects/
RStudio
RStudio tries to help you manage your projects
R Project concept - files and subdirectory structure
RStudio
Let’s create a project in RStudio
INTERACTIVE DEMO
RStudio projects: https://support.rstudio.com/hc/en-us/articles/200526207-Using-Projects
RStudio
We can write code in several ways in RStudio
We’re going to create a new dataset and R script.
INTERACTIVE DEMO
RStudio
Download the file from the following link to your data/ directory, and extract it
(the link is also available on the course Etherpad page)
Data files can be inspected in RStudio
read.csv(file = "data/inflammation-01.csv", header = FALSE)
Someone gives you a data file that has:
(,) as the decimal point character
(;) as the field separator
How would you open it, using read.csv()?
Use the help function and documentation
USE CHALLENGE LINK ON ETHERPAD
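One possible approach to this challenge is sketched below. The file contents are hypothetical and are supplied through the `text` argument so that the example runs without a file on disk; with a real file you would pass `file = ...` instead.

```r
# Hypothetical file contents: ';' separates fields, ',' is the decimal point
raw_text <- "id;weight\n1;2,5\n2;3,75"

# Option 1: tell read.csv() about both characters explicitly
d1 <- read.csv(text = raw_text, sep = ";", dec = ",")

# Option 2: read.csv2() uses these two settings as its defaults
d2 <- read.csv2(text = raw_text)

str(d1)  # the weight column is read as numeric, not character
```

Both options are documented on the same help page (`?read.csv`), which is why the slide points to the help function.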
INTERACTIVE DEMO
[]
[row, column]
data[1, 1]     # First value in dataset
data[30, 20]   # Middle value of dataset
: separator (meaning ‘to’)
data[1:4, 1:4] # rows 1 to 4; columns 1 to 4
data[5, ]      # row 5
data[, 16]     # column 16
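The `[row, column]` indexing forms above can be tried on a small object where the answers are easy to check by eye. This sketch uses a hypothetical 3 x 4 matrix `m` as a stand-in for the inflammation data.

```r
# Hypothetical 3 x 4 table, filled column by column with 1..12
m <- matrix(1:12, nrow = 3, ncol = 4)

single <- m[2, 3]      # one value: row 2, column 3
block  <- m[1:2, 1:2]  # rows 1 to 2; columns 1 to 2
row2   <- m[2, ]       # leaving the column blank selects the whole row
col4   <- m[, 4]       # leaving the row blank selects the whole column

print(m)
print(row2)
```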
INTERACTIVE DEMO
R provides useful functions to summarise data
max(data)        # largest value in dataset
max(data[2, ])   # largest value for row (patient) 2
min(data[, 7])   # smallest value on column (day) 7
mean(data[, 7])  # mean value on day 7
sd(data[, 7])    # standard deviation of values on day 7
INTERACTIVE DEMO
Computers exist to do tedious things for us
So apply a function (mean) to each row in the data:
R has several ways to automate this process
apply(X = data, MARGIN = 1, FUN = mean)
MARGIN = 1: rows
MARGIN = 2: columns
rowMeans(data)
colMeans(data)
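To see what `MARGIN` does, it can help to run `apply()` on a table small enough to verify by hand. The matrix `scores` below is a hypothetical example, not part of the course data.

```r
# Hypothetical 2 x 3 table of values
scores <- matrix(c(1, 2, 3,
                   4, 5, 6), nrow = 2, byrow = TRUE)

row_avg <- apply(X = scores, MARGIN = 1, FUN = mean)  # one mean per row
col_avg <- apply(X = scores, MARGIN = 2, FUN = mean)  # one mean per column

# rowMeans() and colMeans() are shortcuts for these two common cases
print(row_avg)
print(rowMeans(scores))
print(col_avg)
print(colMeans(scores))
```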
“The purpose of computing is insight, not numbers.” - Richard Hamming
R has many available graphics packages
INTERACTIVE DEMO
plot(avg_inflammation_patient)
max_day_inflammation <- apply(dat, 2, max)
plot(max_day_inflammation)
plot(apply(dat,2,min)) # 3 functions in one!
Can you add plots to your script showing:
R
R
R
R
data
R’s data types and structures relate to your own data
R
R is mostly used for data analysis
R has special types and structures to help you work with data
INTERACTIVE DEMO
Understanding data types, their uses, and how they relate to your own data is key to successful analysis with R (it’s not just about programming)
What data types would you expect to see?
What examples of data types can you think of from your own experience?
Please write them into the chat
R
R
are atomic
TRUE, FALSE
3, 2L, 123456
3.0, -23.45, pi
3+0i, 1+4i
"a", 'SWC', "This is not a string"
INTERACTIVE DEMO
Create examples of data with the following characteristics:
answer, type: logical
height, type: numeric
dog_name, type: character
For each variable, test that it has the data type you intended
R Data Structures
vector
factor
list
data.frame
INTERACTIVE DEMO
Vectors are atomic: they can contain only a single data type
What data type are the following vectors (xx, yy, zz)?
xx <- c(1.7, "a")
yy <- c(TRUE, 2)
zz <- c("a", TRUE)
Options: logical, integer, numeric, character
USE CHALLENGE LINK ON ETHERPAD
R will perform implicit coercion on vectors to make them atomic
logical \(\rightarrow\) integer \(\rightarrow\) double \(\rightarrow\) complex \(\rightarrow\) character
If there are formatting problems with your data, you might not have the type you expect when you import into R
as.<type_name>()
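The coercion order and the `as.<type_name>()` family can be illustrated with a few short vectors. The variable names below are illustrative and deliberately differ from those in the challenge.

```r
# Implicit coercion: mixing types promotes everything to the more general type
mixed_int <- c(TRUE, 2L)    # logical is promoted to integer
mixed_dbl <- c(2L, 3.5)     # integer is promoted to double
mixed_chr <- c(3.5, "a")    # double is promoted to character

print(typeof(mixed_int))
print(typeof(mixed_dbl))
print(typeof(mixed_chr))

# Explicit coercion: convert on purpose with as.<type_name>()
digits     <- c("8", "17", "23")     # numbers that arrived as text
as_numbers <- as.numeric(digits)
as_flags   <- as.logical(c(0, 1, 2)) # zero is FALSE, any other number is TRUE
```

Checking `typeof()` or `class()` straight after import is a quick way to catch a column that was read as character because of a formatting problem.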
INTERACTIVE DEMO
Data comes as one of two types:
(weight <- 17.2; rooms <- 7)
(grade <- "8", coat <- "brindled")
This kind of distinction is critical in many applications (e.g. statistical modelling)
INTERACTIVE DEMO
Create a new factor, defining control and case experiments, and inspect the result:
f <- factor(c("case", "control", "case", "control", "case"))
str(f)
## Factor w/ 2 levels "case","control": 1 2 1 2 1
In some statistical analyses in R it is important that the control level is numbered 1
In RStudio, can you create a factor with the same values, but where the control level is numbered 1?
lists are like vectors, but can hold any combination of data types
list elements are denoted by [[]] and can be named
INTERACTIVE DEMO
# create a list
l <- list(1, 'a', TRUE, matrix(0, nrow = 2, ncol = 2), f)
l_named <- list(a = "SWC", b = 1:4)
> animal[c(2,4,6)]
[1] "o" "k" "y"
> l_named$b
[1] 1 2 3 4
INTERACTIVE DEMO
x <- c(5.4, 6.2, 7.1, 4.8, 7.5)
mask <- c(TRUE, FALSE, TRUE, FALSE, TRUE)
x[mask]
x[x > 7]
data.frame
cats data is a data.frame
INTERACTIVE DEMO
> class(cats)
[1] "data.frame"
> cats
    coat weight likes_string
1 calico    2.1            1
2  black    5.0            0
3  tabby    3.2            1
data.frame?
R data structure for storing tabular, rectangular data
list of vectors having identical lengths.
vector
vector can be a different data type
data.frame
INTERACTIVE DEMO
# Create a data frame
df <- data.frame(a=c(1,2,3),
                 b=c('eeny', 'meeny', 'miney'),
                 c=c(TRUE, FALSE, TRUE))
summary(df)
##        a             b                 c
##  Min.   :1.0   Length:3           Mode :logical
##  1st Qu.:1.5   Class :character   FALSE:1
##  Median :2.0   Mode  :character   TRUE :2
##  Mean   :2.0
##  3rd Qu.:2.5
##  Max.   :3.0
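The description of a `data.frame` as a list of equal-length vectors, each with its own type, can be confirmed directly. This sketch rebuilds the same small table under the illustrative name `df_demo`; `stringsAsFactors = FALSE` is set explicitly so that the text column stays character on older R versions too.

```r
df_demo <- data.frame(a = c(1, 2, 3),
                      b = c("eeny", "meeny", "miney"),
                      c = c(TRUE, FALSE, TRUE),
                      stringsAsFactors = FALSE)

is_a_list   <- is.list(df_demo)        # a data.frame is built on a list
col_lengths <- lengths(df_demo)        # every column has the same length
col_types   <- sapply(df_demo, class)  # but each column keeps its own type

print(is_a_list)
print(col_lengths)
print(col_types)
```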
data.frame to file
INTERACTIVE DEMO
write.table(df, "data/df_example.tab", sep="\t")
We need to provide
data.frame
data.frame
INTERACTIVE DEMO
data/ directory
The link is available on the course Etherpad
gapminder <- read.table("data/gapminder-FiveYearData.csv", sep=",", header=TRUE)
R can also read data direct from the internet
url <- paste("https://raw.githubusercontent.com/resbaz/",
             "r-novice-gapminder-files/master/data/",
             "gapminder-FiveYearData.csv", sep = '')
gapminder <- read.table(url, sep=",", header=TRUE)
gapminder
INTERACTIVE DEMO
str(gapminder)          # structure of the data.frame
typeof(gapminder$year)  # data type of a column
length(gapminder)       # length of the data.frame
nrow(gapminder)         # number of rows in data.frame
ncol(gapminder)         # number of columns in data.frame
dim(gapminder)          # number of rows and columns in data.frame
colnames(gapminder)     # column names from data.frame
head(gapminder)         # first few rows of dataframe
summary(gapminder)      # summary of data in data.frame columns
In R
:
CRAN
INTERACTIVE DEMO
installed.packages()             # see installed packages
install.packages("packagename")  # install a new package
update.packages()                # update installed packages
library(packagename)             # import a package for use in your code
CRAN - the Comprehensive R Archive Network: https://cran.r-project.org/
Can you check if the following packages are installed on your system, and install them if necessary?
dplyr ggplot2 knitr
ggplot2 is part of the Tidyverse, a collection of packages for data science
ggplot2 is the graphics package
geoms
ggplot2 like base graphics
qplot() ≈ plot()
INTERACTIVE DEMO
library(ggplot2)
plot(gapminder$lifeExp, gapminder$gdpPercap, col=gapminder$continent)
qplot(lifeExp, gdpPercap, data=gapminder, colour=continent)
geoms
geom (short for geometry) defines the “type” of representation
ggplot2 provides several geom types
geoms
The same data and aesthetics can be shown with different geoms
INTERACTIVE DEMO
# Generate plot of GDP per capita against life Expectancy
p <- ggplot(data=gapminder, aes(x=lifeExp, y=gdpPercap, color=continent))
p + geom_point()
p + geom_line()
Can you create another figure in your script showing how life expectancy changes as a function of time, as a scatterplot?
ggplot2 plots are built as layers
All layers have two components
geom
ggplot object
ggplot object
p <- ggplot(data=gapminder, aes(x=lifeExp, y=gdpPercap, colour=continent))
p + geom_point()
ggplot object
p <- ggplot(data=gapminder, aes(x=lifeExp, y=gdpPercap, colour=continent))
p + geom_line(aes(group=country))
INTERACTIVE DEMO
geoms to build a plot
alpha controls opacity for a layer
p <- ggplot(data=gapminder, aes(x=lifeExp, y=gdpPercap, color=continent))
p + geom_line(aes(group=country)) + geom_point(alpha=0.4)
INTERACTIVE DEMO
Can you create another figure in your script showing how life expectancy changes as a function of time, coloured by continent, with two layers:
scales
scale layers
INTERACTIVE DEMO
geom layers transform the dataset
INTERACTIVE DEMO
Use the facet_wrap() layer to generate grids of plots
INTERACTIVE DEMO
# Compare life expectancy over time by continent
p <- ggplot(data=gapminder, aes(x=year, y=lifeExp, colour=continent, group=country))
p <- p + geom_line() + scale_y_log10()
p + facet_wrap(~continent)
Can you create a scatterplot and contour densities of GDP per capita against population size, with colour filled by continent?
ADVANCED: Transform the x axis to better visualise data spread, and use facets to panel density plots by year.
“Tidy datasets are all alike, but every messy dataset is messy in its own way”
Principles of Tidy Data provide a standard way to organise data values within a dataset
R for Data Science: http://r4ds.had.co.nz/
dplyr: https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html
R and RStudio: https://www.rstudio.com/resources/webinars/data-wrangling-with-r-and-rstudio/
df1 <- data.frame(treatment=c("treatmenta", "treatmentb"),
                  John.Smith=c(NA, 2),
                  Jane.Doe=c(16, 11),
                  Mary.Johnson=c(3, 1))
colnames(df1) = c("", "John Smith", "Jane Doe", "Mary Johnson")
  | John Smith | Jane Doe | Mary Johnson |
---|---|---|---|
treatmenta | NA | 16 | 3 |
treatmentb | 2 | 11 | 1 |
df2 <- data.frame(name=c("John Smith", "Jane Doe", "Mary Johnson"),
                  treatmenta=c(NA, 16, 3),
                  treatmentb=c(2, 11, 1))
colnames(df2) = c("", "treatmenta", "treatmentb")
  | treatmenta | treatmentb |
---|---|---|
John Smith | NA | 2 |
Jane Doe | 16 | 11 |
Mary Johnson | 3 | 1 |
  | John Smith | Jane Doe | Mary Johnson |
---|---|---|---|
treatmenta | NA | 16 | 3 |
treatmentb | 2 | 11 | 1 |
In the table below, what is the correct assignment for rows and columns?
  | John Smith | Jane Doe | Mary Johnson |
---|---|---|---|
treatmenta | NA | 16 | 3 |
treatmentb | 2 | 11 | 1 |
USE CHALLENGE LINK ON ETHERPAD
NA, 16, 3, 2, 11, 1)
Each OBSERVATION includes all three variables
name | treatment | result |
---|---|---|
John Smith | a | NA |
Jane Doe | a | 16 |
Mary Johnson | a | 3 |
John Smith | b | 2 |
Jane Doe | b | 11 |
Mary Johnson | b | 1 |
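The tidy table above can be rebuilt from the wide layout by hand, which makes the "one row per observation" idea concrete. This is a sketch in base R only; the names `wide` and `tidy` are illustrative, and dedicated reshaping functions exist in the Tidyverse that do the same job more generally.

```r
# Wide layout: one row per person, one column per treatment
wide <- data.frame(name = c("John Smith", "Jane Doe", "Mary Johnson"),
                   treatmenta = c(NA, 16, 3),
                   treatmentb = c(2, 11, 1),
                   stringsAsFactors = FALSE)

# Tidy layout: one row per observation, one column per variable
tidy <- data.frame(name      = rep(wide$name, times = 2),
                   treatment = rep(c("a", "b"), each = nrow(wide)),
                   result    = c(wide$treatmenta, wide$treatmentb),
                   stringsAsFactors = FALSE)

print(tidy)
```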
R (supports vectorisation)
INTERACTIVE DEMO
data.frames with the six verbs of dplyr
select()
filter()
group_by()
summarize()
mutate()
%>% (pipe)
R for Data Science: http://r4ds.had.co.nz/
dplyr: https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html
R and RStudio: https://www.rstudio.com/resources/webinars/data-wrangling-with-r-and-rstudio/
dplyr?
dplyr is another package in the Tidyverse
> mean(gapminder[gapminder$continent == "Africa", "gdpPercap"])
[1] 2193.755
> mean(gapminder[gapminder$continent == "Americas", "gdpPercap"])
[1] 7136.11
> mean(gapminder[gapminder$continent == "Asia", "gdpPercap"])
[1] 7902.15
Avoiding repetition (through automation) makes code
select() - Interactive Demo
library(dplyr)
select(gapminder, year, country, gdpPercap)
gapminder %>% select(year, country, gdpPercap)
filter()
filter() selects rows on the basis of some condition
INTERACTIVE DEMO
filter(gapminder, continent=="Europe")
# Select gdpPercap by country and year, only for Europe
eurodata <- gapminder %>%
  filter(continent == "Europe") %>%
  select(year, country, gdpPercap)
Can you write a single line (which may span multiple lines in your script by including pipes) to produce a dataframe from gapminder containing:
How many rows does the dataframe have?
group_by()
INTERACTIVE DEMO
group_by(gapminder, continent)
gapminder %>% group_by(continent)
summarize()
INTERACTIVE DEMO
# Produce table of mean GDP by continent
gapminder %>%
  group_by(continent) %>%
  summarize(meangdpPercap=mean(gdpPercap))
gapminder data?
count() and n()
summarize()
count()/tally(): a function that reports a table of counts by group
n(): a function used within summarize(), filter() or mutate() to represent count by group
INTERACTIVE DEMO
gapminder %>%
  filter(year == 2002) %>%
  count(continent, sort = TRUE)
gapminder %>%
  group_by(continent) %>%
  summarize(se_lifeExp = sd(lifeExp)/sqrt(n()))
mutate()
mutate() is a function allowing creation of new variables
INTERACTIVE DEMO
# Calculate GDP in $billion
gdp_bill <- gapminder %>%
  mutate(gdp_billion = gdpPercap * pop / 10^9)
# Calculate mean/sd of GDP by continent and year
gdp_bycontinents_byyear <- gapminder %>%
  mutate(gdp_billion=gdpPercap*pop/10^9) %>%
  group_by(continent,year) %>%
  summarize(mean_gdpPercap=mean(gdpPercap),
            sd_gdpPercap=sd(gdpPercap),
            mean_gdp_billion=mean(gdp_billion),
            sd_gdp_billion=sd(gdp_billion))
We can produce these documents in RStudio
R Markdown file
R Markdown files embody Literate Programming in R
File \(\rightarrow\) New File \(\rightarrow\) R Markdown
.Rmd)
R Markdown file ---
---
title: "Literate Programming"
author: "Leighton Pritchard"
date: "04/12/2017"
output: html_document
---
This is an R Markdown document. Markdown is a simple formatting syntax
R code (which is executable) is fenced by backticks (```)
Click on Knit
R Markdown report on the gapminder data
INTERACTIVE DEMO
R Markdown documentation: http://rmarkdown.rstudio.com/
R Markdown cheat sheet: http://www.rstudio.com/wp-content/uploads/2015/02/rmarkdown-cheatsheet.pdf
R Markdown: https://www.rstudio.com/resources/webinars/getting-started-with-r-markdown/
R, RStudio and how to set up a project
R and produce summary statistics and plots with base tools
R, the most important data structures
WELL DONE!!
R
R
if() and else
R
for() loops
if() … else
TRUE
if() and else construct is useful for this
# if
if (condition is true) {
  PERFORM ACTION
}
# if ... else
if (condition is true) {
  PERFORM ACTION
} else { # i.e. if the condition is false,
  PERFORM ALTERNATIVE ACTION
}
INTERACTIVE DEMO
Can you use an if() statement to report whether there are any records from 2002 in the gapminder dataset?
Can you do the same for 2012?
HINT: Look at the help for the any() function
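One way to approach this challenge is sketched below. To keep the example runnable without the dataset, a short hypothetical vector `years` stands in for `gapminder$year`; the message text is illustrative.

```r
# Hypothetical stand-in for gapminder$year
years <- c(1952, 1977, 2002, 2007)

# any() collapses a vector of TRUE/FALSE into a single TRUE/FALSE,
# which is what if() needs as its condition
if (any(years == 2002)) {
  msg_2002 <- "Records found for 2002"
} else {
  msg_2002 <- "No records for 2002"
}

if (any(years == 2012)) {
  msg_2012 <- "Records found for 2012"
} else {
  msg_2012 <- "No records for 2012"
}

print(msg_2002)
print(msg_2012)
```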
for() loops
for() loops are a very common construct in programming
for each <item> in a group, <do something (with the item)>
R as in some other languages
for(iterator in set of values){
  do a thing
}
INTERACTIVE DEMO
while() loops
while() loops are useful when you need to do something while some condition is true
while(this condition is true){
  do a thing
}
INTERACTIVE DEMO
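The two loop templates above can be filled in as follows. This is a minimal sketch with made-up values: the `for()` loop runs a known number of times, while the `while()` loop keeps going until its condition stops being true.

```r
# for(): visit each value in a known set
squares <- numeric(5)            # pre-allocate space for the results
for (i in 1:5) {
  squares[i] <- i^2
}

# while(): repeat until a condition is no longer true
total <- 0
n <- 0
while (total < 20) {
  n <- n + 1
  total <- total + n             # running sum 1 + 2 + 3 + ...
}

print(squares)
print(c(n = n, total = total))
```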
Can you use a for() loop and an if() statement to print whether each letter in the alphabet is a vowel?
HINT: Use R’s help for letters and %in%
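A possible solution is sketched below. `letters` is the built-in vector of lower-case letters; the names `vowels` and `is_vowel` are illustrative, and `is_vowel` is only kept so the result can be inspected afterwards.

```r
vowels <- c("a", "e", "i", "o", "u")
is_vowel <- logical(length(letters))   # starts as all FALSE

for (i in seq_along(letters)) {
  if (letters[i] %in% vowels) {
    print(paste(letters[i], "is a vowel"))
    is_vowel[i] <- TRUE
  } else {
    print(paste(letters[i], "is not a vowel"))
  }
}
```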
for() and while() loops can be useful, but are not efficient
R are vectorised
You’ve already seen and used much of this behaviour
INTERACTIVE DEMO
x <- 1:4
x * 2
y <- 6:9
x + y
We want to sum the following series of fractions
\(\frac{1}{1^2} + \frac{1}{2^2} + \frac{1}{3^2} + \ldots + \frac{1}{n^2}\)
for large values of \(n\)
Can you do this using vectorisation for \(n = 10,000\)?
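A vectorised approach to the series is sketched below, with an equivalent loop for comparison. The variable names are illustrative. As \(n\) grows the sum approaches \(\pi^2/6\), which gives a handy sanity check.

```r
n <- 10000

# Vectorised: build all n terms at once, then add them up
series_sum <- sum(1 / (1:n)^2)

# Equivalent loop, one term at a time
loop_sum <- 0
for (k in 1:n) {
  loop_sum <- loop_sum + 1 / k^2
}

print(series_sum)
print(pi^2 / 6)   # the limit of the series as n grows
```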
Functions are the building blocks of programming
<function_name> <- function(<arg1>, <arg2>) {
  <do something>
  return(<result>)
}
INTERACTIVE DEMO
my_sum <- function(a, b) {
  the_sum <- a + b
  return(the_sum)
}
R’s built-in help to see function documentation
Your future self will thank you!
(and so will your colleagues)
Write programs for people, not for computers
INTERACTIVE DEMO
INTERACTIVE DEMO
# Calculate total GDP in gapminder data
calcGDP <- function(data, year_in=NULL, country_in=NULL) {
  gdp <- data %>% mutate(gdp=(pop * gdpPercap))
  if (!is.null(year_in)) {
    gdp <- gdp %>% filter(year %in% year_in)
  }
  if (!is.null(country_in)) {
    gdp <- gdp %>% filter(country %in% country_in)
  }
  return(gdp)
}
Can you write a function that takes an optional argument called letter, which:
letter
facet_wrap() to produce a grid of output graphs
HINT: The following code may be useful
starts.with <- substr(gapminder$country, start = 1, stop = 1)
az.countries <- gapminder[starts.with %in% c("A", "Z"), ]