R?
R?
R
R data types)
R about the type and structure of an object
dplyr.
tidyr.
R
R is mostly used for data analysis
R has special types and structures to help you work with data
The focus in R is on tabular data (data frames)
What data types would you expect to see?
What examples of data types can you think of from your own experience?
Please write in the chat
Add your suggestions of data types at:
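Data types in R are atomic
logical: TRUE, FALSE
integer: 3, 2L, 123456 (the L suffix makes a literal an integer; without it R stores the number as a double)
double: 3.0, -23.45, pi
character: "a", 'SWC', "This is not a string"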
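As a quick check of the example values above, typeof() reports the type R actually assigns to a value. This is a minimal sketch; the names example_values and example_types are introduced here for illustration only.

```r
# Collect the example values in a list so that each keeps its own type
example_values <- list(TRUE, 2L, 3, -23.45, pi, "SWC")

# typeof() reports the atomic type of each value
example_types <- sapply(example_values, typeof)
print(example_types)
# "logical" "integer" "double" "double" "double" "character"
```

Note that a bare 3 is stored as a double; the L suffix (as in 2L) is what requests an integer.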
A column in a dataframe can contain only one data type
Adding a string ("2.3 or 2.4") to the column of doubles changed its data type from double to character.
Implicit coercion hierarchy
logical \(\rightarrow\) integer \(\rightarrow\) double \(\rightarrow\) complex \(\rightarrow\) character
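The hierarchy can be observed by combining two types in one vector with c(). The sketch below uses made-up values; the variable names are illustrative.

```r
# Each vector mixes two types; c() coerces to the type further along the hierarchy
lgl_int <- c(TRUE, 2L)     # logical + integer
int_dbl <- c(2L, 3.5)      # integer + double
dbl_cpl <- c(3.5, 1+2i)    # double + complex
dbl_chr <- c(3.5, "a")     # double + character

typeof(lgl_int)  # "integer"
typeof(int_dbl)  # "double"
typeof(dbl_cpl)  # "complex"
typeof(dbl_chr)  # "character"

print(lgl_int)   # TRUE has been converted to 1
```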
Check your inputs
If there are formatting problems with your data, you might not have the type you expect when you import into R
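A minimal sketch of how one badly formatted entry changes the type of an imported column. The CSV contents are small made-up examples passed through the text argument of read.csv(), and the names messy_cats and clean_cats are hypothetical.

```r
# One weight was entered as free text instead of a number
messy_csv <- "coat,weight\ncalico,2.1\nblack,5.0\ntabby,2.3 or 2.4\n"
messy_cats <- read.csv(text = messy_csv, stringsAsFactors = FALSE)

# The same data with the entry corrected
clean_csv <- "coat,weight\ncalico,2.1\nblack,5.0\ntabby,3.2\n"
clean_cats <- read.csv(text = clean_csv, stringsAsFactors = FALSE)

typeof(messy_cats$weight)  # "character": the whole column was read as text
typeof(clean_cats$weight)  # "double"
```

Running str() on a freshly imported data frame is a quick way to spot this kind of problem.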
lists are like vectors, but can hold any combination of data types
Working with list data
Elements in a list are denoted by [[]], and individual elements can also be named
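A short sketch of creating a list and retrieving its elements. The list contents and names are made up for illustration.

```r
# A list can mix types without coercion
mixed_list <- list(1, "a", TRUE)
sapply(mixed_list, typeof)  # "double" "character" "logical"

# Elements can be named when the list is created
named_list <- list(title = "Numbers", numbers = 1:10, data = TRUE)

second_element <- named_list[[2]]          # access by position
same_element   <- named_list[["numbers"]]  # access by name
list_title     <- named_list$title         # $ is shorthand for access by name
```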
The cats data is a data.frame
A data frame is a special named list of vectors with the same length
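The statement above can be checked directly. The sketch below builds a small stand-in data frame (cats_sketch, with made-up values) rather than reading feline-data.csv.

```r
# A small stand-in for the cats data; values are illustrative
cats_sketch <- data.frame(
  coat   = c("calico", "black", "tabby"),
  weight = c(2.1, 5.0, 3.2)
)

typeof(cats_sketch)     # "list": the underlying storage is a list
class(cats_sketch)      # "data.frame"
names(cats_sketch)      # the list is named: "coat" "weight"
lengths(cats_sketch)    # every column (vector) has the same length, 3

# List-style access returns a single column as a vector
weights <- cats_sketch[["weight"]]
```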
Where to find the data
Save the data to the data subfolder (where you put feline-data.csv)
gapminder
INTERACTIVE DEMO
str(gapminder) # structure of the data.frame
typeof(gapminder$year) # data type of a column
length(gapminder) # length of the data.frame (its number of columns)
nrow(gapminder) # number of rows in data.frame
ncol(gapminder) # number of columns in data.frame
dim(gapminder) # number of rows and columns in data.frame
colnames(gapminder) # column names from data.frame
head(gapminder) # first few rows of dataframe
summary(gapminder) # summary of data in data.frame columns
dplyr
data.frames with the six verbs of dplyr
dplyr verbs
select()
filter()
group_by()
summarize()
mutate()
%>% (pipe)
What is dplyr?
dplyr is another package in the Tidyverse
Tip
Avoiding repetition (through automation) makes code
select()
filter()
filter() selects rows on the basis of some condition
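A minimal sketch of select() and filter() chained with the pipe. It assumes the dplyr package is installed and uses a small made-up data frame (toy) with placeholder country names instead of the gapminder data.

```r
library(dplyr)

# Made-up data with gapminder-like column names
toy <- data.frame(
  country   = c("Country1", "Country2", "Country3"),
  continent = c("Europe", "Africa", "Europe"),
  year      = c(2002, 2002, 2007),
  gdpPercap = c(30000, 1500, 35000)
)

# select() keeps only the named columns
year_country_gdp <- toy %>%
  select(year, country, gdpPercap)

# filter() keeps only the rows that meet the condition(s)
europe_2007 <- toy %>%
  filter(continent == "Europe", year == 2007) %>%
  select(country, gdpPercap)
```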
Challenge
Can you write a single line (which may span multiple lines in your script by including pipes) to produce a dataframe from gapminder containing:
life expectancy, country, and year data
only for African nations
How many rows does the dataframe have?
group_by()
summarize()
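As a sketch of how group_by() and summarize() work together, along with the counting helpers n() and count() covered in this part of the lesson: the example assumes dplyr is installed and uses a made-up data frame (toy) rather than gapminder.

```r
library(dplyr)

# Made-up data: two rows for one continent, three for another
toy <- data.frame(
  continent = c("Africa", "Africa", "Europe", "Europe", "Europe"),
  lifeExp   = c(50, 60, 70, 80, 75)
)

# group_by() splits the rows by continent; summarize() returns one row per group
by_continent <- toy %>%
  group_by(continent) %>%
  summarize(mean_lifeExp = mean(lifeExp),
            n_rows = n())   # n() counts the rows in each group

# count() is a shortcut that reports the number of rows per group in a column n
counts <- toy %>%
  count(continent)
```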
Challenge
gapminder data?
count() and n()
Two useful functions related to summarize()
count()/tally(): a function that reports a table of counts by group
n(): a function used within summarize(), filter() or mutate() to represent count by group
mutate()
Tip
mutate() is a function allowing creation of new variables
INTERACTIVE DEMO
# Calculate GDP in $billion
gdp_bill <- gapminder %>%
  mutate(gdp_billion = gdpPercap * pop / 10^9)
# Calculate mean/sd of GDP per capita and total GDP by continent and year
gdp_bycontinents_byyear <- gdp_bill %>%
  group_by(continent, year) %>%
  summarize(mean_gdpPercap = mean(gdpPercap),
            sd_gdpPercap = sd(gdpPercap),
            mean_gdp_billion = mean(gdp_billion),
            sd_gdp_billion = sd(gdp_billion))
ifelse()
Tip
ifelse() allows us to apply filtering when we use mutate(), restricting a calculation to the rows that meet a condition
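A minimal sketch of ifelse() inside mutate(), assuming dplyr is installed. The data frame toy, its values, and the threshold of 25 are made up for illustration.

```r
library(dplyr)

# Made-up data with gapminder-like column names
toy <- data.frame(
  country   = c("Country1", "Country2"),
  lifeExp   = c(20, 40),
  pop       = c(1e6, 2e6),
  gdpPercap = c(500, 1000)
)

# Compute GDP in $billion only where life expectancy exceeds 25; otherwise NA
toy_gdp <- toy %>%
  mutate(gdp_billion = ifelse(lifeExp > 25, gdpPercap * pop / 10^9, NA))
```

Unlike filter(), this keeps every row; rows that fail the condition simply receive the "false" value.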
ifelse([CONDITION], [VALUE IF TRUE], [VALUE IF FALSE])
“Tidy datasets are all alike, but every messy dataset is messy in its own way”
Data Cleaning is not just a first step
Principles of Tidy Data provide a standard way to organise data values within a dataset
What’s wrong with this data?
| | John Smith | Jane Doe | Mary Johnson |
|---|---|---|---|
| treatment.A | NA | 16 | 3 |
| treatment.B | 2 | 11 | 1 |
Is this better?
| | treatment.A | treatment.B |
|---|---|---|
| John Smith | NA | 2 |
| Jane Doe | 16 | 11 |
| Mary Johnson | 3 | 1 |
We need to talk about data semantics
A dataset is a collection of VALUES
| | John Smith | Jane Doe | Mary Johnson |
|---|---|---|---|
| treatment.A | NA | 16 | 3 |
| treatment.B | 2 | 11 | 1 |
Each value belongs to a variable and an observation
In the table below, what are the rows and columns?
| | John Smith | Jane Doe | Mary Johnson |
|---|---|---|---|
| treatment.A | NA | 16 | 3 |
| treatment.B | 2 | 11 | 1 |
USE CHALLENGE SECTION ON ETHERPAD
The dataset contains 18 values
There are three variables and six observations
VARIABLES
name (John Smith, Jane Doe, Mary Johnson)
treatment (a, b)
result (NA, 16, 3, 2, 11, 1)
Important
Each OBSERVATION includes all three variables
This is the dataset in tidy form
(aka long form)
name | treatment | result |
---|---|---|
John Smith | a | NA |
Jane Doe | a | 16 |
Mary Johnson | a | 3 |
John Smith | b | 2 |
Jane Doe | b | 11 |
Mary Johnson | b | 1 |
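One way to get from a wide layout to the tidy table above is sketched below. It assumes the tidyr package (version 1.0.0 or later) and uses pivot_longer(); the slides do not say which function was used, so this choice is an assumption.

```r
library(tidyr)

# The wide layout: one row per person, one column per treatment
wide <- data.frame(
  name = c("John Smith", "Jane Doe", "Mary Johnson"),
  a    = c(NA, 16, 3),
  b    = c(2, 11, 1)
)

# Gather the treatment columns into name/value pairs: one row per observation
long <- pivot_longer(wide, cols = c(a, b),
                     names_to = "treatment", values_to = "result")

# Sort by treatment so the rows appear in the same order as the table above
long <- long[order(long$treatment), ]
print(long)
```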
Tidy data is a standard way of structuring a dataset
It makes it easy to extract needed variables, and is well suited to R (supports vectorisation)
Many R functions are designed to expect long data
gapminder Wide Dataset
Where to find the data
separate()
Tip
separate() allows us to split the contents of one column into several columns
separate([COLUMN], into=[NEW_COLUMN_NAMES], sep=[SEPARATOR])
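A minimal sketch of the pattern above, assuming tidyr is installed. The data frame toy_long and its column names (obstype_year, obs_values) are hypothetical stand-ins for a long-format table whose key column combines a variable name and a year.

```r
library(tidyr)

# Made-up long-format data: the key column holds "variable_year"
toy_long <- data.frame(
  country      = c("Country1", "Country1"),
  obstype_year = c("pop_1952", "lifeExp_1952"),
  obs_values   = c(1000000, 45)
)

# Split obstype_year at the underscore into two new columns
separated <- separate(toy_long, obstype_year,
                      into = c("obs_type", "year"), sep = "_")
print(separated)
```

The new year column is still text after the split; as.integer(separated$year) converts it to a number if needed.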
Challenge
Using the gap_long dataset, calculate the mean life expectancy, population, and GDP per capita for each continent.
Hint: use the group_by() and summarize() functions we learned in the dplyr episode.