25-26/10/2022
Research Workflows
Pipeline
Workflow
R and RStudio
tidyverse
ggplot2
RMarkdown
What is R?
R is:
R language
Why use R?
What is RStudio?
Please start RStudio
RStudio is an integrated development environment (IDE)
R (console/‘scratchpad’); Graphics/visualisation/Help
Excel?
Excel is good for some things
R is excellent for analysis and reproducibility…
R can be run on supercomputers, with extremely large datasets…
RStudio overview - INTERACTIVE DEMO
Variables are like named boxes
x <- 1 / 40
x
## [1] 0.025
x ^ 2
## [1] 0.000625
log(x)
## [1] -3.688879
name <- "Samia"
name
## [1] "Samia"
Variable names are documentation
current_temperature = 28.6
subjectID = "GCF_00001236452.1"
GPS_Location = "54N, 36E"
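Variable names may contain letters, numbers, underscores and periods ([a-zA-Z0-9_.])
They cannot start with a number (x2 is allowed, 2x is not)
They are case sensitive (Weight is not the same as weight)
Common naming styles: lower_snake, UPPER_SNAKE, lowerCamelCase, UpperCamelCase
Functions (log(), sin() etc.) ≈ “canned script”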
sqrt(), lm(), plot()
INTERACTIVE DEMO
args(fname) # arguments for fname
?fname # help page for fname
help(fname) # help page for fname
??fname # any mention of fname
help.search("text") # any mention of "text"
vignette(fname) # worked examples for fname
vignette() # show all available vignettes
What will be the value of each variable after each statement in the following program?
mass <- 47.5
age <- 122
mass <- mass * 2.3
age <- age - 20
mass = 47.5, age = 102
mass = 109.25, age = 102
mass = 47.5, age = 122
mass = 109.25, age = 122
THERE IS NO ONE TRUE WAY (only principles)
data?)
clean_data?)
R Projects (more advanced) https://chrisvoncsefalvay.com/2018/08/09/structuring-r-projects/
RStudio tries to help you manage your projects
R Project concept - files and subdirectory structure
Let’s create a project in RStudio
INTERACTIVE DEMO
RStudio projects: https://support.rstudio.com/hc/en-us/articles/200526207-Using-Projects
We can write code in several ways in RStudio
We’re going to create a new dataset and R script.
INTERACTIVE DEMO
Download the file from the following link to your data/ directory, and extract it
Data files can be inspected in RStudio
read.csv(file = "data/inflammation-01.csv", header = FALSE)
Someone gives you a data file that has:
a comma (,) as the decimal point character
a semicolon (;) as the field separator
How would you open it, using read.csv()?
Use the help function and documentation
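One possible approach, sketched on a small in-memory example rather than a real file: the help page for read.csv() documents the sep and dec arguments, which set the field separator and the decimal point character. The values in raw_text are made up for illustration.

```r
# Hypothetical file contents: ';' separates fields and ',' marks the decimal point
raw_text <- "1,5;2,25;3\n4;5,5;6,75"

# sep sets the field separator, dec sets the decimal point character;
# text = lets us read from a string instead of a file for this sketch
values <- read.csv(text = raw_text, header = FALSE, sep = ";", dec = ",")

print(values)
str(values)   # all three columns should be numeric
```

With a real file you would pass file = "path/to/file" in place of text =. The read.csv2() function uses these separator settings by default.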
INTERACTIVE DEMO
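Index data with square brackets: [row, column]
data[1, 1]      # First value in dataset
data[30, 20]    # Middle value of dataset
Select ranges with the : separator (meaning ‘to’)
data[1:4, 1:4]  # rows 1 to 4; columns 1 to 4
Leave an index blank to select a whole row or column
data[5, ]       # row 5
data[, 16]      # column 16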
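The same indexing rules can be tried on a small made-up matrix, where it is easy to check each result by eye. The matrix m below is a stand-in for the inflammation data, not part of the course material.

```r
# A small made-up matrix: 3 rows (patients) by 4 columns (days).
# matrix() fills column by column, so column 1 is 1, 2, 3.
m <- matrix(1:12, nrow = 3, ncol = 4)

first_value <- m[1, 1]   # row 1, column 1
block <- m[1:2, 1:3]     # rows 1 to 2; columns 1 to 3
row_two <- m[2, ]        # all of row 2
col_four <- m[, 4]       # all of column 4

print(m)
print(block)
print(row_two)
print(col_four)
```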
INTERACTIVE DEMO
R provides useful functions to summarise data
max(data)         # largest value in dataset
max(data[2, ])    # largest value for row (patient) 2
min(data[, 7])    # smallest value on column (day) 7
mean(data[, 7])   # mean value on day 7
sd(data[, 7])     # standard deviation of values on day 7
INTERACTIVE DEMO
Computers exist to do tedious things for us
So apply a function (mean) to each row in the data:
R has several ways to automate this process
apply(X = data, MARGIN = 1, FUN = mean)
MARGIN = 1: rows
MARGIN = 2: columns
rowMeans(data)
colMeans(data)
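A minimal sketch of how MARGIN changes the result, using a tiny made-up matrix instead of the inflammation data so the means can be checked by hand:

```r
# Made-up 2 x 3 matrix; filled column by column,
# so row 1 is 1, 3, 5 and row 2 is 2, 4, 6
m <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, ncol = 3)

row_means <- apply(X = m, MARGIN = 1, FUN = mean)  # one value per row
col_means <- apply(X = m, MARGIN = 2, FUN = mean)  # one value per column

print(row_means)
print(col_means)

# rowMeans() and colMeans() are shortcuts for the same calculations
print(all.equal(row_means, rowMeans(m)))
print(all.equal(col_means, colMeans(m)))
```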
“The purpose of computing is insight, not numbers.” - Richard Hamming
R has many available graphics packages
INTERACTIVE DEMO
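avg_inflammation_patient <- apply(data, 1, mean)  # mean inflammation per patient (row)
plot(avg_inflammation_patient)
max_day_inflammation <- apply(data, 2, max)       # maximum inflammation per day (column)
plot(max_day_inflammation)
plot(apply(data, 2, min))  # 3 functions in one!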
Can you add plots to your script showing:
R’s data types and structures relate to your own data
R is mostly used for data analysis
R has special types and structures to help you work with data
INTERACTIVE DEMO
Understanding data types, their uses, and how they relate to your own data is key to successful analysis with R
(it’s not just about programming)
What data types would you expect to see?
What examples of data types can you think of from your own experience?
Please write them into the chat
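Data types in R are atomic
logical: TRUE, FALSE
integer: 3, 2L, 123456 (strictly, only values written with the L suffix, such as 2L, are stored as integers; a bare 3 is stored as a double)
double: 3.0, -23.45, pi
complex: 3+0i, 1+4i
character: "a", 'SWC', "This is not a string"
INTERACTIVE DEMO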
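A quick way to check these types is the typeof() function. The sketch below applies it to one example value of each type; the particular values are arbitrary.

```r
# One example value for each atomic type
examples <- list(TRUE, 2L, 3, 1+4i, "SWC")

# typeof() reports how R stores each value internally
types <- sapply(examples, typeof)
print(types)
# "logical" "integer" "double" "complex" "character"
```

Note that 3 without the L suffix is reported as a double, not an integer.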
Create examples of data with the following characteristics:
answer, type: logical
height, type: numeric
dog_name, type: character
For each variable, test that it has the data type you intended
R Data Structures
vector
factor
list
data.frame
INTERACTIVE DEMO
Vectors are atomic: they can contain only a single data type
What data type are the following vectors (xx, yy, zz)?
xx <- c(1.7, "a")
yy <- c(TRUE, 2)
zz <- c("a", TRUE)
Options: logical, integer, numeric, character
R will perform implicit coercion on vectors to make them atomic
logical \(\rightarrow\) integer \(\rightarrow\) double \(\rightarrow\) complex \(\rightarrow\) character
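The coercion order can be seen directly by asking R for the type of each mixed vector. This is a sketch using the three vectors from the question above:

```r
xx <- c(1.7, "a")     # double + character
yy <- c(TRUE, 2)      # logical + double
zz <- c("a", TRUE)    # character + logical

# Each vector is coerced to the type furthest along the chain
print(typeof(xx))   # "character"
print(typeof(yy))   # "double"
print(typeof(zz))   # "character"

print(yy)           # TRUE has become 1
print(zz)           # TRUE has become the string "TRUE"
```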
If there are formatting problems with your data, you might not have the type you expect when you import into R
as.<type_name>()
INTERACTIVE DEMO
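As an illustration of explicit coercion with the as.<type_name>() family, the sketch below converts a made-up character vector, such as might result from a badly formatted import, into numbers. Values that cannot be converted become NA.

```r
# Made-up imported values: the third entry is not a valid number
readings <- c("1.5", "2", "three")
print(typeof(readings))   # "character"

# as.numeric() converts what it can; "three" becomes NA (R also issues a
# warning, which is suppressed here to keep the output tidy)
numbers <- suppressWarnings(as.numeric(readings))
print(numbers)

# Other members of the family work the same way
flags_as_int <- as.integer(c(TRUE, FALSE))   # 1 0
number_as_text <- as.character(42L)          # "42"
print(flags_as_int)
print(number_as_text)
```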
Data comes as one of two types:
Quantitative (e.g. weight <- 17.2; rooms <- 7)
Categorical (e.g. grade <- "8", coat <- "brindled")
This kind of distinction is critical in many applications (e.g. statistical modelling)
INTERACTIVE DEMO
Create a new factor, defining control and case experiments, and inspect the result:
f <- factor(c("case", "control", "case", "control", "case"))
str(f)
## Factor w/ 2 levels "case","control": 1 2 1 2 1
In some statistical analyses in R it is important that the control level is numbered 1
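One way to control the numbering, sketched here, is the levels argument of factor(): levels are numbered in the order they are listed, so putting "control" first gives it the number 1.

```r
# Same values as before, but with the level order stated explicitly
f <- factor(c("case", "control", "case", "control", "case"),
            levels = c("control", "case"))

str(f)
print(levels(f))       # "control" "case"
print(as.integer(f))   # the underlying codes: control = 1, case = 2
```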
In RStudio, can you create a factor with the same values, but where the control level is numbered 1?
lists are like vectors, but can hold any combination of data types
Elements of a list are accessed with [[]] and can be named
INTERACTIVE DEMO
# create a list
l <- list(1, 'a', TRUE, matrix(0, nrow = 2, ncol = 2), f)
l_named <- list(a = "SWC", b = 1:4)
> animal[c(2,4,6)]
[1] "o" "k" "y"
> l_named$b
[1] 1 2 3 4
INTERACTIVE DEMO
x <- c(5.4, 6.2, 7.1, 4.8, 7.5)
mask <- c(TRUE, FALSE, TRUE, FALSE, TRUE)
x[mask]
x[x > 7]
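A few more subsetting patterns, as a sketch building on the same vector x and on a named list like the one created earlier. The variable names used to hold the results are illustrative only.

```r
x <- c(5.4, 6.2, 7.1, 4.8, 7.5)

by_position <- x[c(1, 3)]        # elements 1 and 3
without_second <- x[-2]          # everything except element 2
in_range <- x[x > 5 & x < 7]     # combine conditions with &
positions <- which(x > 7)        # positions (not values) where the test is TRUE

print(by_position)
print(without_second)
print(in_range)
print(positions)

# Lists: [[ ]] and $ return the element itself, [ ] returns a shorter list
l_named <- list(a = "SWC", b = 1:4)
element <- l_named[["b"]]
sublist <- l_named["b"]
print(element)
print(class(sublist))
```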
data.frame
The cats data is a data.frame
INTERACTIVE DEMO
> class(cats)
[1] "data.frame"
> cats
coat weight likes_string
1 calico 2.1 1
2 black 5.0 0
3 tabby 3.2 1
What is a data.frame?
The R data structure for storing tabular, rectangular data
A list of vectors having identical lengths.
Each vector can be a different data type
INTERACTIVE DEMO
# Create a data frame
df <- data.frame(a=c(1,2,3), b=c('eeny', 'meeny', 'miney'),
c=c(TRUE, FALSE, TRUE))
summary(df)
##        a          b         c
##  Min.   :1.0   eeny :1   Mode :logical
##  1st Qu.:1.5   meeny:1   FALSE:1
##  Median :2.0   miney:1   TRUE :2
##  Mean   :2.0
##  3rd Qu.:2.5
##  Max.   :3.0
data.frame to file
INTERACTIVE DEMO
write.table(df, "data/df_example.tab", sep="\t")
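A self-contained sketch of a write/read round trip. It writes to a temporary file (an assumption made so the example runs anywhere, instead of the data/ directory used in the demo) and sets row.names = FALSE and quote = FALSE, which are optional choices rather than requirements.

```r
df <- data.frame(a = c(1, 2, 3),
                 b = c("eeny", "meeny", "miney"),
                 c = c(TRUE, FALSE, TRUE))

# Temporary path used only for this sketch
out_file <- tempfile(fileext = ".tab")

# Tab-separated output, without row names or quotes around strings
write.table(df, out_file, sep = "\t", row.names = FALSE, quote = FALSE)

# Read it back, telling R about the separator and the header row
df_back <- read.table(out_file, sep = "\t", header = TRUE)
print(df_back)
str(df_back)
```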
We need to provide
INTERACTIVE DEMO
data/ directory
gapminder <- read.table("data/gapminder-FiveYearData.csv", sep=",", header=TRUE)
R can also read data directly from the internet
url <- paste("https://raw.githubusercontent.com/resbaz/",
"r-novice-gapminder-files/master/data/",
"gapminder-FiveYearData.csv", sep = '')
gapminder <- read.table(url, sep=",", header=TRUE)
gapminder
INTERACTIVE DEMO
str(gapminder)            # structure of the data.frame
typeof(gapminder$year)    # data type of a column
length(gapminder)         # length of the data.frame (its number of columns)
nrow(gapminder)           # number of rows in data.frame
ncol(gapminder)           # number of columns in data.frame
dim(gapminder)            # number of rows and columns in data.frame
colnames(gapminder)       # column names from data.frame
head(gapminder)           # first few rows of data.frame
summary(gapminder)        # summary of data in data.frame columns
In R:
CRAN
INTERACTIVE DEMO
installed.packages() # see installed packages
install.packages("packagename") # install a new package
update.packages() # update installed packages
library(packagename) # import a package for use in your code
CRAN - the Comprehensive R Archive Network: https://cran.r-project.org/
Can you check if the following packages are installed on your system, and install them if necessary?
dplyr ggplot2 knitr
ggplot2 is part of the Tidyverse, a collection of packages for data science
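ggplot2 is the graphics package
geoms
ggplot2 like base graphics
qplot() ≈ plot()
INTERACTIVE DEMO
library(ggplot2)
# col needs a factor (or colour values), so convert in case continent was read as character
plot(gapminder$lifeExp, gapminder$gdpPercap, col=as.factor(gapminder$continent))
# note: qplot() is deprecated in recent ggplot2 releases; ggplot() is preferred
qplot(lifeExp, gdpPercap, data=gapminder, colour=continent)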
geoms
A geom (short for geometry) defines the “type” of representation
ggplot2 provides several geom types
The same data and aesthetics can be shown with different geoms
INTERACTIVE DEMO
# Generate plot of GDP per capita against life expectancy
p <- ggplot(data=gapminder, aes(x=lifeExp, y=gdpPercap, color=continent))
p + geom_point()
p + geom_line()
Can you create another figure in your script showing how life expectancy changes as a function of time, as a scatterplot?
ggplot2 plots are built as layers
All layers have two components
geom
ggplot object
ggplot object
p <- ggplot(data=gapminder, aes(x=lifeExp, y=gdpPercap, colour=continent))
p + geom_point()
ggplot object
p <- ggplot(data=gapminder, aes(x=lifeExp, y=gdpPercap, colour=continent))
p + geom_line(aes(group=country))
INTERACTIVE DEMO
geoms to build a plot
alpha controls opacity for a layer
p <- ggplot(data=gapminder, aes(x=lifeExp, y=gdpPercap, color=continent))
p + geom_line(aes(group=country)) + geom_point(alpha=0.4)
INTERACTIVE DEMO
Can you create another figure in your script showing how life expectancy changes as a function of time, coloured by continent, with two layers:
scales
scale layers
INTERACTIVE DEMO
geom layers transform the dataset
INTERACTIVE DEMO
Use the facet_wrap() layer to generate grids of plots
INTERACTIVE DEMO
# Compare life expectancy over time by continent
p <- ggplot(data=gapminder, aes(x=year, y=lifeExp, colour=continent,
group=country))
p <- p + geom_line() + scale_y_log10()
p + facet_wrap(~continent)
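The layering and faceting steps can also be tried without the gapminder file. The sketch below uses a small made-up data frame (toy, with invented countries and values) and assumes the ggplot2 package is installed.

```r
library(ggplot2)

# Made-up data: four countries, two per continent, three time points each
toy <- data.frame(
  country   = rep(c("A", "B", "C", "D"), each = 3),
  continent = rep(c("North", "South"), each = 6),
  year      = rep(c(2000, 2005, 2010), times = 4),
  lifeExp   = c(60, 62, 64, 55, 58, 61, 70, 71, 72, 65, 67, 70)
)

# Base object: data plus aesthetics, no geoms yet
p <- ggplot(data = toy, aes(x = year, y = lifeExp,
                            colour = continent, group = country))

# Add two geom layers, then split into one panel per continent
p <- p + geom_line() + geom_point(alpha = 0.4)
p <- p + facet_wrap(~continent)

print(p)
```

Each `+` adds a layer to the same object, so intermediate versions of p can be printed at any step to see what a layer contributes.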
Can you create a scatterplot and contour densities of GDP per capita against population size, with colour filled by continent?
ADVANCED: Transform the x axis to better visualise data spread, and use facets to panel density plots by year.
“Tidy datasets are all alike, but every messy dataset is messy in its own way”
Principles of Tidy Data provide a standard way to organise data values within a dataset
R for Data Science: http://r4ds.had.co.nz/
dplyr: https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html
Data wrangling with R and RStudio: https://www.rstudio.com/resources/webinars/data-wrangling-with-r-and-rstudio/
df1 <- data.frame(treatment=c("treatmenta", "treatmentb"),
John.Smith=c(NA, 2),
Jane.Doe=c(16, 11),
Mary.Johnson=c(3, 1))
colnames(df1) = c("", "John Smith", "Jane Doe", "Mary Johnson")
| | John Smith | Jane Doe | Mary Johnson |
|---|---|---|---|
| treatmenta | NA | 16 | 3 |
| treatmentb | 2 | 11 | 1 |
df2 <- data.frame(name=c("John Smith", "Jane Doe", "Mary Johnson"),
treatmenta=c(NA, 16, 3),
treatmentb=c(2, 11, 1))
colnames(df2) = c("", "treatmenta", "treatmentb")
| | treatmenta | treatmentb |
|---|---|---|
| John Smith | NA | 2 |
| Jane Doe | 16 | 11 |
| Mary Johnson | 3 | 1 |
In the table below, what is the correct assignment for rows and columns?
| | John Smith | Jane Doe | Mary Johnson |
|---|---|---|---|
| treatmenta | NA | 16 | 3 |
| treatmentb | 2 | 11 | 1 |
NA, 16, 3, 2, 11, 1)
Each OBSERVATION includes all three variables
| name | treatment | result |
|---|---|---|
| John Smith | a | NA |
| Jane Doe | a | 16 |
| Mary Johnson | a | 3 |
| John Smith | b | 2 |
| Jane Doe | b | 11 |
| Mary Johnson | b | 1 |
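To make the reshaping concrete, here is a minimal base R sketch that rebuilds the tidy table above from a wide layout. The column names a and b in the wide data frame are chosen for illustration; packages in the Tidyverse offer dedicated functions for this, which are not used here.

```r
# Wide layout: one row per person, one column per treatment
wide <- data.frame(name = c("John Smith", "Jane Doe", "Mary Johnson"),
                   a = c(NA, 16, 3),
                   b = c(2, 11, 1))

# Tidy layout: one row per observation (name, treatment, result)
tidy <- data.frame(
  name      = rep(wide$name, times = 2),               # each person appears once per treatment
  treatment = rep(c("a", "b"), each = nrow(wide)),     # which column the value came from
  result    = c(wide$a, wide$b)                        # the measured values, stacked
)

print(tidy)
```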
R (supports vectorisation)
INTERACTIVE DEMO
data.frames with the six verbs of dplyr
select()
filter()
group_by()
summarize()
mutate()
%>% (pipe)
What is dplyr?
dplyr is another package in the Tidyverse
> mean(gapminder[gapminder$continent == "Africa", "gdpPercap"])
[1] 2193.755
> mean(gapminder[gapminder$continent == "Americas", "gdpPercap"])
[1] 7136.11
> mean(gapminder[gapminder$continent == "Asia", "gdpPercap"])
[1] 7902.15
Avoiding repetition (through automation) makes code
select() - INTERACTIVE DEMO
library(dplyr)
select(gapminder, year, country, gdpPercap)
gapminder %>% select(year, country, gdpPercap)
filter()
filter() selects rows on the basis of some condition
INTERACTIVE DEMO
filter(gapminder, continent=="Europe")
# Select gdpPercap by country and year, only for Europe
eurodata <- gapminder %>%
filter(continent == "Europe") %>%
select(year, country, gdpPercap)
Can you write a single line (which may span multiple lines in your script by including pipes) to produce a dataframe from gapminder containing:
How many rows does the dataframe have?
group_by()
INTERACTIVE DEMO
group_by(gapminder, continent)
gapminder %>% group_by(continent)
summarize()
INTERACTIVE DEMO
# Produce table of mean GDP by continent
gapminder %>%
group_by(continent) %>%
summarize(meangdpPercap=mean(gdpPercap))
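The grouping logic is easier to verify on a tiny data frame where the answers can be worked out by hand. This sketch assumes the dplyr package is installed; the toy data and its values are invented.

```r
library(dplyr)

# Made-up data: two rows for one continent, three for another
toy <- data.frame(
  continent = c("North", "North", "South", "South", "South"),
  gdpPercap = c(100, 300, 50, 150, 250)
)

# group_by() marks the groups; summarize() collapses each group to one row
by_continent <- toy %>%
  group_by(continent) %>%
  summarize(meangdpPercap = mean(gdpPercap),
            n_rows = n())

print(by_continent)
```

The result has one row per continent: the mean for "North" is (100 + 300) / 2 and for "South" is (50 + 150 + 250) / 3.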
gapminder data?
count() and n()
count()/tally(): a function that reports a table of counts by group
n(): a function used within summarize(), filter() or mutate() to represent count by group
INTERACTIVE DEMO
gapminder %>% filter(year == 2002) %>% count(continent, sort = TRUE)
gapminder %>% group_by(continent) %>% summarize(se_lifeExp = sd(lifeExp)/sqrt(n()))
mutate()
mutate() is a function allowing creation of new variables
INTERACTIVE DEMO
# Calculate GDP in $billion
gdp_bill <- gapminder %>%
mutate(gdp_billion = gdpPercap * pop / 10^9)
# Calculate mean/sd of GDP per capita and total GDP by continent and year
gdp_bycontinents_byyear <- gapminder %>%
mutate(gdp_billion=gdpPercap*pop/10^9) %>%
group_by(continent,year) %>%
summarize(mean_gdpPercap=mean(gdpPercap),
sd_gdpPercap=sd(gdpPercap),
mean_gdp_billion=mean(gdp_billion),
sd_gdp_billion=sd(gdp_billion))
We can produce these documents in RStudio
R Markdown file
R Markdown files embody Literate Programming in R
File \(\rightarrow\) New File \(\rightarrow\) R Markdown
.Rmd)
R Markdown file
---
title: "Literate Programming"
author: "Leighton Pritchard"
date: "04/12/2017"
output: html_document
---
This is an R Markdown document. Markdown is a simple formatting syntax
R code (which is executable) is fenced by backticks (```)
Click on Knit
R Markdown report on the gapminder data
INTERACTIVE DEMO
R Markdown documentation: http://rmarkdown.rstudio.com/
R Markdown cheat sheet: http://www.rstudio.com/wp-content/uploads/2015/02/rmarkdown-cheatsheet.pdf
Getting started with R Markdown: https://www.rstudio.com/resources/webinars/getting-started-with-r-markdown/
R, RStudio and how to set up a project
R and produce summary statistics and plots with base tools
R, the most important data structures
WELL DONE!!
if() and else
for() loops
if() … else
TRUE
The if() and else construct is useful for this
# if
if (condition is true) {
PERFORM ACTION
}
# if ... else
if (condition is true) {
PERFORM ACTION
} else { # i.e. if the condition is false,
PERFORM ALTERNATIVE ACTION
}
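A runnable version of the template above, as a sketch. The years vector is a made-up stand-in for a column such as gapminder$year, and report_year() is a hypothetical helper written only for this example.

```r
# Made-up stand-in for a column of years
years <- c(1992, 1997, 2002, 2007)

# Hypothetical helper: reports whether a target year occurs in a vector
report_year <- function(years, target) {
  if (any(years == target)) {
    # any() collapses the vector of comparisons to a single TRUE/FALSE
    result <- paste("Records found for", target)
  } else {
    result <- paste("No records for", target)
  }
  return(result)
}

print(report_year(years, 2002))
print(report_year(years, 2012))
```

The condition inside if() must be a single TRUE or FALSE, which is why any() is used to summarise the element-wise comparison.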
INTERACTIVE DEMO
Can you use an if() statement to report whether there are any records from 2002 in the gapminder dataset?
Can you do the same for 2012?
HINT: Look at the help for the any() function
for() loops
for() loops are a very common construct in programming
for each <item> in a group, <do something (with the item)>
R as in some other languages
for(iterator in set of values){
do a thing
}
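A concrete instance of the template, sketched with arbitrary values: the loop visits each number from 1 to 5 and stores its square.

```r
# Pre-allocate the result vector rather than growing it inside the loop
squares <- numeric(5)

for (i in seq_len(5)) {
  # i takes the values 1, 2, 3, 4, 5 in turn
  squares[i] <- i^2
}

print(squares)   # 1 4 9 16 25
```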
INTERACTIVE DEMO
while() loops
while() loops are useful when you need to do something while some condition is true
while(this condition is true){
do a thing
}
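A small sketch of a while() loop, using made-up numbers: it keeps halving a value until the value is no longer greater than 1, counting how many halvings that takes.

```r
value <- 100
halvings <- 0

while (value > 1) {
  # the condition is re-checked before every pass through the loop
  value <- value / 2
  halvings <- halvings + 1
}

print(halvings)   # 7
print(value)      # 0.78125
```

Unlike for(), the number of iterations is not known in advance; the loop body must eventually make the condition false, or the loop never ends.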
INTERACTIVE DEMO
Can you use a for() loop and an if() statement to print whether each letter in the alphabet is a vowel?
HINT: Use R’s help for letters and %in%
for() and while() loops can be useful, but are not efficient
Many operations in R are vectorised
You’ve already seen and used much of this behaviour
INTERACTIVE DEMO
x <- 1:4
x * 2
y <- 6:9
x + y
We want to sum the following series of fractions
\(\frac{1}{1^2} + \frac{1}{2^2} + \frac{1}{3^2} + \ldots + \frac{1}{n^2}\)
for large values of \(n\)
Can you do this using vectorisation for \(n = 10,000\)?
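One possible vectorised solution, shown as a sketch: build the whole vector of terms \(1/k^2\) in one expression and pass it to sum(). A loop version is included only to confirm that both approaches agree.

```r
n <- 10000

# Vectorised: 1 / k^2 is computed for every k at once, then summed
k <- 1:n
series_sum <- sum(1 / k^2)

# Equivalent loop, for comparison only
loop_sum <- 0
for (i in 1:n) {
  loop_sum <- loop_sum + 1 / i^2
}

print(series_sum)
print(all.equal(series_sum, loop_sum))
print(pi^2 / 6)   # the limit of the series as n grows
```

As \(n\) grows, the sum approaches \(\pi^2/6\) from below, which gives a quick sanity check on the result.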
Functions are the building blocks of programming
<function_name> <- function(<arg1>, <arg2>) {
<do something>
return(<result>)
}
INTERACTIVE DEMO
my_sum <- function(a, b) {
the_sum <- a + b
return(the_sum)
}
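A short usage sketch for the function above. The second function, my_sum_default(), is a hypothetical variant added here to illustrate default argument values; it is not part of the course material.

```r
my_sum <- function(a, b) {
  the_sum <- a + b
  return(the_sum)
}

# Hypothetical variant: b is optional and defaults to 1
my_sum_default <- function(a, b = 1) {
  return(a + b)
}

print(my_sum(2, 3))            # 5
print(my_sum(1:3, 10))         # works element-wise on vectors: 11 12 13
print(my_sum_default(2))       # b falls back to its default: 3
print(my_sum_default(2, b = 5))  # arguments can be named in the call: 7
```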
R’s built-in help to see function documentation
Your future self will thank you!
(and so will your colleagues)
Write programs for people, not for computers
INTERACTIVE DEMO
INTERACTIVE DEMO
# Calculate total GDP (population x GDP per capita) in gapminder data,
# optionally keeping only the requested years and/or countries
calcGDP <- function(data, year_in=NULL, country_in=NULL) {
  gdp <- data %>% mutate(gdp=(pop * gdpPercap))
  if (!is.null(year_in)) {
    gdp <- gdp %>% filter(year %in% year_in)
  }
  if (!is.null(country_in)) {
    gdp <- gdp %>% filter(country %in% country_in)
  }
  return(gdp)
}
Can you write a function that takes an optional argument called letter, which:
letter
facet_wrap() to produce a grid of output graphs
HINT: The following code may be useful
starts.with <- substr(gapminder$country, start = 1, stop = 1)
az.countries <- gapminder[starts.with %in% c("A", "Z"), ]