RStudio/R is open with the appropriate packages installed (tidyverse)
This lesson might sound like it’s going to be a bit dry: “Why should I care about data types and structures?”
IF YOU UNDERSTAND YOUR DATA, AND YOU UNDERSTAND HOW R SEES YOUR DATA, YOUR ANALYSIS WILL BE MUCH EASIER AND MORE EFFECTIVE
Some questions you might have had before this session may include those on the slide.
This session aims to help you answer those.
R’s data types and structures relate to the types of data that you work with, yourself.
When you use R, you will see a lot of them.
R is MOSTLY USED FOR DATA ANALYSIS
R is set up with key, core data types designed to help you work with your own data
A lot of the time, R focuses on tabular data.
R is very powerful when dealing with tabular data
INTERACTIVE DEMO
SWITCH TO THE CONSOLE
Let’s start by making a toy dataset.
We’ll eventually save this in your data/ directory, with the name feline-data.csv
Use the console in RStudio
We’ll create a new variable called cats, and assign a data.frame to it
cats <- data.frame(coat = c("calico", "black", "tabby"),
                   weight = c(2.1, 5.0, 3.2),
                   likes_catnip = c(1, 0, 1))
Save cats as a CSV (comma-separated values) file, using the write.csv() function
write.csv(cats, file = "data/feline-data.csv", row.names = FALSE)
We’re writing the cats data out to file
file specifies the filename we’re writing to
row.names = FALSE tells R not to write row names for the table into the file
USE THE FILES TAB/VIEW FILES TO VIEW THE CONTENTS OF THE NEW FILE
This is a plain text file, and you could have created it using a text editor like Nano or Notepad.
We can load the data from a CSV file in a similar way to how we write it, using the read.csv() function
cats <- read.csv(file = "data/feline-data.csv")
cats
    coat weight likes_catnip
1 calico    2.1            1
2  black    5.0            0
3  tabby    3.2            1
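The write/read round trip above can be tried without touching the data/ directory. This is only a sketch: the name cats_demo and the use of a temporary file are assumptions made for illustration, not part of the lesson.

```r
# Hypothetical toy data and a temporary file, so nothing in data/ is touched
cats_demo <- data.frame(coat = c("calico", "black", "tabby"),
                        weight = c(2.1, 5.0, 3.2),
                        likes_catnip = c(1, 0, 1))
tmp <- tempfile(fileext = ".csv")

# Write without row names, then look at the raw text that was produced
write.csv(cats_demo, file = tmp, row.names = FALSE)
readLines(tmp)

# Read the file back; R guesses each column's type from the text
cats_back <- read.csv(file = tmp)
str(cats_back)
```

Because a CSV file stores only text, the column types in cats_back are whatever read.csv() infers, which is why likes_catnip comes back as integer rather than double.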
coat is text; weight is some real value (in kg or pounds, maybe), and likes_catnip looks like it should be TRUE/FALSE but is represented as 1s and 0s
We can extract a column using $ notation in the console
> cats$weight
[1] 2.1 5.0 3.2
> cats$coat
[1] "calico" "black"  "tabby"
What did R RETURN?
R is largely built so that operations on vectors are central to data analysis.
> cats$weight + 2
[1] 4.1 7.0 5.2
> paste("My cat is", cats$coat)
[1] "My cat is calico" "My cat is black"  "My cat is tabby"
> cats$weight + cats$coat
Error in cats$weight + cats$coat : non-numeric argument to binary operator
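The same three operations can be run on standalone vectors. This is a sketch with made-up variable names (weight, coat, shifted, msg); tryCatch() is used here only so that the failing addition does not stop the script.

```r
weight <- c(2.1, 5.0, 3.2)
coat <- c("calico", "black", "tabby")

# Arithmetic and paste() both work element by element on a whole vector
shifted <- weight + 2
shifted
paste("My cat is", coat)

# Adding a number to text is an error; capture the message instead of stopping
msg <- tryCatch(weight + coat, error = function(e) conditionMessage(e))
msg
```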
R’s data types reflect the ways in which data is expected to interact
UNDERSTANDING R’s DATA TYPES IS KEY
Data types determine how R sees your data (you want R to see your data the same way you do)
Many problems in R come down to incompatibilities between data and data types.
R’s data types are atomic: they are FUNDAMENTAL AND EVERYTHING ELSE IS BUILT UP FROM THEM, the same way matter is built up from atoms
There are only FIVE DATA TYPES in R (though one is split into two…) - three that matter here and two we won’t deal with
logical: TRUE/FALSE (sometimes recorded as 1/0)
numeric: integer and double (real)
character: text
The other two (complex and raw) are less relevant to you and we’ll not deal with them much today
LET’S LEARN A BIT MORE ABOUT THEM IN THE DEMO
We can ask what type of data something is with the typeof() function
> typeof(cats$weight)
[1] "double"
> typeof(TRUE)
[1] "logical"
> typeof(3.14)
[1] "double"
> typeof(1)
[1] "double"
You might expect that 1 is an integer, but R treats all numbers as doubles (i.e. “Real” numbers) by default
To get R to see a value as an integer, we need to add an L suffix
> typeof(1L)
[1] "integer"
> typeof(cats$coat)
[1] "character"
> typeof("Irn Bru")
[1] "character"
All data in R is interpreted as one of these basic data types.
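A few more checks make the integer/double distinction concrete. This is an illustrative sketch; the variable names are arbitrary.

```r
whole <- 1        # no suffix: stored as double
counted <- 1L     # L suffix: stored as integer
typeof(whole)
typeof(counted)

# The colon operator produces integers; mixing integer and double gives double
typeof(1:3)
typeof(counted + 1.5)

# Same value, different type: identical() is strict, == compares values
identical(counted, whole)
counted == whole
```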
R - like most computer languages - is strict about data types, and this has important consequences.
> additional_cat <- data.frame(coat = "tabby", weight = "2.3 or 2.4", likes_catnip = 1)
> additional_cat
   coat     weight likes_catnip
1 tabby 2.3 or 2.4            1
We can add this row to the cats data with rbind
> cats2 <- rbind(cats, additional_cat)
> cats2
    coat     weight likes_catnip
1 calico        2.1            1
2  black          5            0
3  tabby        3.2            1
4  tabby 2.3 or 2.4            1
Compare the type of the weight column in the two dataframes
> typeof(cats$weight)
[1] "double"
> typeof(cats2$weight)
[1] "character"
Adding the new cat changed the type of the weight column, and this means we can no longer do the same weight adjustment (adding on 2kg) that we did before:
> cats2$weight + 2
Error in cats2$weight + 2 : non-numeric argument to binary operator
Any column in a dataframe can contain only one datatype.
Initially, the datatype of cats$weight was double
When we added the new cat, the data in that column was character - i.e. a string
Because we can always represent a number as a string, but we can’t always represent a string as a number, R chose to coerce the datatype from double to character for that column.
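The coercion described above can be reproduced, and partly undone, in a few lines. This is a sketch on copies of the toy data (cats_demo, extra, combined and recovered are names chosen for illustration); converting back with as.numeric() is one possible repair, not something the lesson prescribes.

```r
cats_demo <- data.frame(coat = c("calico", "black", "tabby"),
                        weight = c(2.1, 5.0, 3.2),
                        likes_catnip = c(1, 0, 1))
extra <- data.frame(coat = "tabby", weight = "2.3 or 2.4", likes_catnip = 1)

# One text value is enough to turn the whole weight column into character
combined <- rbind(cats_demo, extra)
typeof(combined$weight)

# Converting back gives NA wherever the text is not a number
# (R also raises a warning, silenced here)
recovered <- suppressWarnings(as.numeric(combined$weight))
recovered
is.na(recovered)
```

The NA marks exactly the entry that needs a human decision about what "2.3 or 2.4" should mean.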
INTERACTIVE DEMO
We can look at the structure of a dataframe using the str() function
> str(cats)
'data.frame': 3 obs. of 3 variables:
$ coat : chr "calico" "black" "tabby"
$ weight : num 2.1 5 3.2
$ likes_catnip: int 1 0 1
> str(cats2)
'data.frame': 4 obs. of 3 variables:
$ coat : chr "calico" "black" "tabby" "tabby"
$ weight : chr "2.1" "5" "3.2" "2.3 or 2.4"
$ likes_catnip: num 1 0 1 1
The str() function tells us that cats and cats2 are both data.frames - a very common kind of data structure in R
To understand data.frames a bit better, let’s meet a different data structure called a vector
These are the MOST COMMON DATA STRUCTURE
INTERACTIVE DEMO
Let’s define an ATOMIC VECTOR OF NUMBERS
Use the c() FUNCTION (c() is combine; use ?c to see its help page)
> x <- c(10, 12, 45, 33)
> x
[1] 10 12 45 33
> typeof(x)
[1] "double"
> length(x)
[1] 4
> str(x)
 num [1:4] 10 12 45 33
We can see from typeof() that the vector contains the double type
str() tells us:
that this is a numeric (num) vector - which includes double and integer datatypes
that there are four elements, indexed [1:4], in the vector
str() tells us that the cats$coat column is a vector, too:
> str(cats$coat)
 chr [1:3] "calico" "black" "tabby"
INTERACTIVE DEMO
Given what we’ve done so far, what do you think the following will produce when we use typeof()?
> quiz_vector <- c(2,6,'3')
> typeof(quiz_vector)
[1] "character"
Here, R has enforced that the type of the vector is character (string), because we can always represent numbers as strings, but we can’t always represent strings as numbers
PAUSE: NEXT CALLOUTS
This is called type coercion and can cause surprises in your code
Be aware of the types in your data, and of how R interprets them.
INTERACTIVE DEMO
Consider these two vectors
> coercion_vector <- c('a', TRUE)
> coercion_vector
[1] "a"    "TRUE"
> another_coercion_vector <- c(0, TRUE)
> another_coercion_vector
[1] 0 1
> typeof(coercion_vector)
[1] "character"
> typeof(another_coercion_vector)
[1] "double"
If R thinks it needs to, it will COERCE DATA IMPLICITLY without telling you
When values are combined, logical can be coerced to integer, but integer will not be implicitly coerced to logical
This is because integer can describe all logical values, but not vice versa
Almost everything can be represented as character, so that’s the fallback position for R
If your data are not all of the type you expect, R MIGHT CONVERT THE TYPE TO COPE
R will choose the simplest data type that can represent all items in the vector
We can also coerce data explicitly, with the as.<type>() functions
> x
[1] 10 12 45 33
> as.character(x)
[1] "10" "12" "45" "33"
> as.complex(x)
[1] 10+0i 12+0i 45+0i 33+0i
> as.logical(x)
[1] TRUE TRUE TRUE TRUE
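The order in which R falls back from one type to the next can be seen by combining values step by step. This is a sketch; the vectors are invented for illustration.

```r
# Implicit coercion: each extra value pushes the vector to a more general type
typeof(c(TRUE, 2L))             # logical and integer
typeof(c(TRUE, 2L, 3.5))        # add a double
typeof(c(TRUE, 2L, 3.5, "a"))   # add a character

# Explicit coercion: zero becomes FALSE, any other number becomes TRUE
flags <- as.logical(c(0, 1, 10))
flags

# Text that is not a number becomes NA (the warning is silenced here)
parsed <- suppressWarnings(as.integer(c("3", "7", "x")))
parsed

# Logical values count as 1/0 in arithmetic, so sum() counts the TRUEs
n_true <- sum(c(TRUE, FALSE, TRUE))
n_true
```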
Coercion is when R forces one data type into another.
Let’s look at the cats dataframe again
> cats
    coat weight likes_catnip
1 calico    2.1            1
2  black    5.0            0
3  tabby    3.2            1
> typeof(cats$likes_catnip)
[1] "integer"
The likes_catnip column is recorded as an integer, but we want logical TRUE/FALSE values.
We can use the as.logical() function to change this
> cats$likes_catnip <- as.logical(cats$likes_catnip)
> cats
    coat weight likes_catnip
1 calico    2.1         TRUE
2  black    5.0        FALSE
3  tabby    3.2         TRUE
> str(cats)
'data.frame': 3 obs. of 3 variables:
$ coat : chr "calico" "black" "tabby"
$ weight : num 2.1 5 3.2
$ likes_catnip: logi TRUE FALSE TRUE
lists are data structures like vectors, EXCEPT THEY CAN HOLD ANY DATA TYPE
We create a list with the list() function
> l <- list(1, 'a', TRUE, seq(2, 5))
> length(l)
[1] 4
> l
[[1]]
[1] 1
[[2]]
[1] "a"
[[3]]
[1] TRUE
[[4]]
[1] 2 3 4 5
[[1]] is the number 1
[[2]] is the character "a"
[[3]] is the logical value TRUE
[[4]] is the vector of values from 2 to 5
> l[[1]]
[1] 1
> l[[4]]
[1] 2 3 4 5
Using the str() function we can see the datatypes of all the elements in the list
> str(l)
List of 4
 $ : num 1
 $ : chr "a"
 $ : logi TRUE
 $ : int [1:4] 2 3 4 5
> l_named <- list(a = "SWC", b = 1:4)
> l_named
$a
[1] "SWC"
$b
[1] 1 2 3 4
We can access named elements using $ notation:
> l_named$a
[1] "SWC"
> l_named$b
[1] 1 2 3 4
> l_named[[1]]
[1] "SWC"
> l_named[[2]]
[1] 1 2 3 4
> str(l_named)
List of 2
 $ a: chr "SWC"
 $ b: int [1:4] 1 2 3 4
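One detail that the examples above rely on is the difference between single and double brackets. A short sketch, using a copy of the named list under the hypothetical name l_demo:

```r
l_demo <- list(a = "SWC", b = 1:4)

# Single brackets keep the list wrapper; double brackets (and $) unwrap it
wrapped <- l_demo["b"]
unwrapped <- l_demo[["b"]]
class(wrapped)
class(unwrapped)

# Once unwrapped, the element is an ordinary vector and can be indexed
second <- l_demo$b[2]
second

# sapply() visits each element; here it reports the type of each one
element_types <- sapply(l_demo, typeof)
element_types
```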
Let’s return to the cats dataframe we created at the start of this episode.
> cats
    coat weight likes_catnip
1 calico    2.1         TRUE
2  black    5.0        FALSE
3  tabby    3.2         TRUE
> typeof(cats)
[1] "list"
> cats[[2]]
[1] 2.1 5.0 3.2
> typeof(cats$weight)
[1] "double"
So cats is a list, and each element in the list is a vector
PAUSE: NEXT CALLOUT
But cats is a special kind of list - a data.frame - where all the vectors have the same length.
We can check this with the class() function
> class(cats)
[1] "data.frame"
> class(l)
[1] "list"
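The "list of equal-length vectors" idea can be checked directly. This is a sketch with invented names (df, ragged, attempt); the message returned on failure is made up for the example.

```r
df <- data.frame(coat = c("calico", "black", "tabby"),
                 weight = c(2.1, 5.0, 3.2))

# Underneath, a data.frame is a list with one element per column
typeof(df)
class(df)
column_lengths <- sapply(df, length)
column_lengths

# Columns of different lengths (3 and 2) cannot form a data.frame
ragged <- list(coat = c("calico", "black", "tabby"), weight = c(2.1, 5.0))
attempt <- tryCatch(as.data.frame(ragged),
                    error = function(e) "could not build a data.frame")
attempt
```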
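A data.frame represents a standard way of organising data.
The gapminder data is in the data/ directory; load it into a variable called gapminder using read.table
# Load gapminder data from a local file
# (stringsAsFactors = TRUE gives the factor columns shown below on R 4.0 and later)
gapminder <- read.table("data/gapminder_data.csv", sep=",", header=TRUE,
                        stringsAsFactors=TRUE)
Run the script (Source), then look at the Environment TAB
You should see gapminder in the Environment tab.
Examine gapminder with the str() function.
> str(gapminder)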
'data.frame': 1704 obs. of 6 variables:
$ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
$ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
$ pop : num 8425333 9240934 10267083 11537966 13079460 ...
$ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
$ lifeExp : num 28.8 30.3 32 34 36.1 ...
$ gdpPercap: num 779 821 853 836 740 ...
The summary() function can also be useful
> summary(gapminder)
        country          year           pop              continent      lifeExp
 Afghanistan:  12   Min.   :1952   Min.   :6.001e+04   Africa  :624   Min.   :23.60
 Albania    :  12   1st Qu.:1966   1st Qu.:2.794e+06   Americas:300   1st Qu.:48.20
 Algeria    :  12   Median :1980   Median :7.024e+06   Asia    :396   Median :60.71
 Angola     :  12   Mean   :1980   Mean   :2.960e+07   Europe  :360   Mean   :59.47
 Argentina  :  12   3rd Qu.:1993   3rd Qu.:1.959e+07   Oceania : 24   3rd Qu.:70.85
 Australia  :  12   Max.   :2007   Max.   :1.319e+09                  Max.   :82.60
 (Other)    :1632
   gdpPercap
 Min.   :   241.2
 1st Qu.:  1202.1
 Median :  3531.8
 Mean   :  7215.3
 3rd Qu.:  9325.5
 Max.   :113523.1
> typeof(gapminder$year)
[1] "integer"
> typeof(gapminder$country)
[1] "integer"
That is surprising for a column of country names, so let’s look at it with str()
> str(gapminder$country)
 Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
The country column has been read in as a factor
A factor is a data structure that represents categorical data. Each category is stored as an integer code, which is why typeof() reports integer.
We could see this in the earlier str() output we got, as well: both country and continent were read in as factors.
We can find the number of rows and columns of the dataframe with dim()
> dim(gapminder)
[1] 1704    6
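The factor behaviour seen for the country column can be explored on a tiny example. This is a sketch; the vector continent_demo is made up and is not read from the gapminder file.

```r
continent_demo <- factor(c("Asia", "Europe", "Asia", "Africa"))

# A factor is stored as integer codes plus a lookup table of levels
typeof(continent_demo)
class(continent_demo)
lvls <- levels(continent_demo)   # levels are sorted alphabetically by default
lvls
codes <- as.integer(continent_demo)
codes

# as.character() recovers the original labels
labels_back <- as.character(continent_demo)
labels_back
```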
What do you think length() will produce?
> length(gapminder)
[1] 6
A data.frame is a list of vector columns, so its length is the number of elements in the list.
We can use functions such as nrow(), ncol(), colnames() and head() to summarise the dataframe.
> nrow(gapminder)
[1] 1704
> ncol(gapminder)
[1] 6
> colnames(gapminder)
[1] "country"   "year"      "pop"       "continent" "lifeExp"   "gdpPercap"
> head(gapminder)
      country year      pop continent lifeExp gdpPercap
1 Afghanistan 1952  8425333      Asia  28.801  779.4453
2 Afghanistan 1957  9240934      Asia  30.332  820.8530
3 Afghanistan 1962 10267083      Asia  31.997  853.1007
4 Afghanistan 1967 11537966      Asia  34.020  836.1971
5 Afghanistan 1972 13079460      Asia  36.088  739.9811
6 Afghanistan 1977 14880372      Asia  38.438  786.1134
dplyr
You’re going to learn to manipulate data.frames with the six verbs of dplyr
select()
filter()
group_by()
summarize()
mutate()
%>%
(pipe)
What is dplyr?
dplyr is a package in the TIDYVERSE; it exists to enable rapid analysis of data by groups
For example, if we wanted to summarise the gapminder data by continent, we’d use dplyr
The general principle dplyr supports is called SPLIT-APPLY-COMBINE
We have a dataset with several groups in a variable (column x)
We want to perform the same operation on each group, independently - take a mean of y for each group, for example
So we split the data into groups on x, apply the operation to each group, and combine the results
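Split-apply-combine can be written out by hand in base R, which shows what dplyr automates. This sketch uses base functions rather than dplyr, and the data frame toy with its columns x and y is invented for illustration.

```r
toy <- data.frame(x = c("a", "b", "a", "c", "b", "c"),
                  y = c(1, 2, 3, 4, 6, 8))

# SPLIT: one vector of y values per group in x
groups <- split(toy$y, toy$x)
groups

# APPLY the same operation to every group, and COMBINE into one named vector
group_means <- sapply(groups, mean)
group_means
```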
select() - Interactive Demo
Load dplyr
> library(dplyr)
The select() verb SELECTS COLUMNS
For example, to select three columns from gapminder:
> head(select(gapminder, year, country, gdpPercap))
  year     country gdpPercap
1 1952 Afghanistan  779.4453
2 1957 Afghanistan  820.8530
3 1962 Afghanistan  853.1007
4 1967 Afghanistan  836.1971
5 1972 Afghanistan  739.9811
6 1977 Afghanistan  786.1134
We can do the same thing using the pipe, %>%
> gapminder %>% select(year, country, gdpPercap) %>% head()
  year     country gdpPercap
1 1952 Afghanistan  779.4453
2 1957 Afghanistan  820.8530
3 1962 Afghanistan  853.1007
4 1967 Afghanistan  836.1971
5 1972 Afghanistan  739.9811
6 1977 Afghanistan  786.1134
filter()
filter() selects rows on the basis of some condition, or combination of conditions
> head(filter(gapminder, continent=="Europe"))
country year pop continent lifeExp gdpPercap
1 Albania 1952 1282697 Europe 55.23 1601.056
2 Albania 1957 1476505 Europe 59.28 1942.284
3 Albania 1962 1728137 Europe 64.82 2312.889
4 Albania 1967 1984060 Europe 66.22 2760.197
5 Albania 1972 2263554 Europe 67.69 3313.422
6 Albania 1977 2509048 Europe 68.93 3533.004
Write the next commands in the script (gapminder.R)
End each line with %>% so that R knows that there’s a continuation
Run the lines and check the output in Environment
# Select gdpPercap by country and year, only for Europe
eurodata <- gapminder %>%
  filter(continent == "Europe") %>%
  select(year, country, gdpPercap)
# Select life expectancy by country and year, only for Africa
> afrodata <- gapminder %>%
filter(continent == "Africa") %>%
select(year, country, lifeExp)
> nrow(afrodata)
[1] 624
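For comparison, the filter-then-select pattern can be written with base R indexing. This is a sketch, not the dplyr approach used in the lesson; the three-row data frame toy holds illustrative values only.

```r
# Hypothetical miniature table with the same kind of columns as gapminder
toy <- data.frame(country = c("Albania", "Albania", "Algeria"),
                  continent = c("Europe", "Europe", "Africa"),
                  year = c(1952L, 1957L, 1952L),
                  lifeExp = c(55.23, 59.28, 43.08))

# Rows are chosen by a logical condition, columns by name
europe <- toy[toy$continent == "Europe", c("year", "country", "lifeExp")]
europe
nrow(europe)
```

The dplyr verbs say the same thing in two named steps, which tends to be easier to read once several conditions are chained together.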
group_by()
The group_by() verb SPLITS data.frames INTO GROUPS ON A VARIABLE/COLUMN PROPERTY
It returns a tibble - a table with extra metadata describing the groups in the table
> group_by(gapminder, continent)
# A tibble: 1,704 x 6
# Groups: continent [5]
country year pop continent lifeExp gdpPercap
<fctr> <int> <dbl> <fctr> <dbl> <dbl>
1 Afghanistan 1952 8425333 Asia 28.801 779.4453
2 Afghanistan 1957 9240934 Asia 30.332 820.8530
3 Afghanistan 1962 10267083 Asia 31.997 853.1007
4 Afghanistan 1967 11537966 Asia 34.020 836.1971
5 Afghanistan 1972 13079460 Asia 36.088 739.9811
6 Afghanistan 1977 14880372 Asia 38.438 786.1134
7 Afghanistan 1982 12881816 Asia 39.854 978.0114
8 Afghanistan 1987 13867957 Asia 40.822 852.3959
9 Afghanistan 1992 16317921 Asia 41.674 649.3414
10 Afghanistan 1997 22227415 Asia 41.763 635.3414
# ... with 1,694 more rows
summarize()
The combination of group_by() and summarize() is very powerful
Here, we’ve split the original table into three groups, and now CREATE A NEW VARIABLE mean_b THAT IS FILLED BY CALCULATING THE MEAN OF b
DEMO IN SCRIPT
> # Produce table of mean GDP by continent
> gapminder %>%
+ group_by(continent) %>%
+ summarize(meangdpPercap=mean(gdpPercap))
# A tibble: 5 x 2
continent meangdpPercap
<fctr> <dbl>
1 Africa 2193.755
2 Americas 7136.110
3 Asia 7902.150
4 Europe 14469.476
5 Oceania 18621.609
# Find average life expectancy by nation
avg_lifexp_country <- gapminder %>%
  group_by(country) %>%
  summarize(meanlifeExp=mean(lifeExp))
> avg_lifexp_country %>% filter(meanlifeExp == min(meanlifeExp))
# A tibble: 1 × 2
  country      meanlifeExp
  <chr>              <dbl>
1 Sierra Leone        36.8
> avg_lifexp_country %>% filter(meanlifeExp == max(meanlifeExp))
# A tibble: 1 × 2
  country meanlifeExp
  <chr>         <dbl>
1 Iceland        76.5
count() and n()
Two other useful functions are related to summarize()
count() reports a new table of counts by group
n() is used to represent the count of rows, when calculating new values in summarize()
DEMO IN CONSOLE
NOTE: standard error is (std dev)/sqrt(n)
> gapminder %>% filter(year == 2002) %>% count(continent, sort = TRUE)
# A tibble: 5 x 2
continent n
<fctr> <int>
1 Africa 52
2 Asia 33
3 Europe 30
4 Americas 25
5 Oceania 2
> gapminder %>% group_by(continent) %>% summarize(se_lifeExp = sd(lifeExp)/sqrt(n()))
# A tibble: 5 x 2
continent se_lifeExp
<fctr> <dbl>
1 Africa 0.3663016
2 Americas 0.5395389
3 Asia 0.5962151
4 Europe 0.2863536
5 Oceania 0.7747759
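The standard error formula in the note, (std dev)/sqrt(n), can be checked on a small example in base R. This is a sketch: the helper se() and the data frame toy are hypothetical, and tapply() stands in for the group_by()/summarize() pair.

```r
toy <- data.frame(group = c("a", "a", "a", "b", "b", "b", "b"),
                  value = c(1, 2, 3, 2, 4, 6, 8))

# Hypothetical helper: standard error = standard deviation / sqrt(count)
se <- function(v) sd(v) / sqrt(length(v))

# Counts per group, like count(); then the standard error per group
group_sizes <- table(toy$group)
group_sizes
se_by_group <- tapply(toy$value, toy$group, se)
se_by_group
```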
mutate()
mutate() CALCULATES NEW VARIABLES (COLUMNS) ON THE BASIS OF EXISTING COLUMNS
The result is the gapminder data, plus an extra column
# Calculate GDP in $billion
gdp_bill <- gapminder %>%
  mutate(gdp_billion = gdpPercap * pop / 10^9)
We can chain mutate() with group_by() and the summarize() command
We can use the new column created by mutate() in the summarize() command
# Calculate mean/sd of GDP by continent and year
gdp_bycontinents_byyear <- gapminder %>%
  mutate(gdp_billion=gdpPercap*pop/10^9) %>%
  group_by(continent,year) %>%
  summarize(mean_gdpPercap=mean(gdpPercap),
            sd_gdpPercap=sd(gdpPercap),
            mean_gdp_billion=mean(gdp_billion),
            sd_gdp_billion=sd(gdp_billion))
ifelse()
ifelse() IS A FILTER THAT CAN BE USED WITH MUTATE TO CALCULATE NEW VARIABLES (COLUMNS) ON THE BASIS OF EXISTING COLUMNS ONLY IF SOME CONDITION IS MET
gdp_billion_large_countries <- gapminder %>%
  mutate(gdp_billion_large = ifelse(pop > 10e6,
                                    gdpPercap * pop / 10^9,
                                    NA))
The result is the gapminder data, plus an extra column
gdp_future_bycontinents_byyear_high_lifeExp <- gapminder %>%
  mutate(gdp_futureExpectation = ifelse(lifeExp > 40, gdpPercap * 1.5, gdpPercap)) %>%
  group_by(continent, year) %>%
  summarize(mean_gdpPercap = mean(gdpPercap),
            mean_gdpPercap_expected = mean(gdp_futureExpectation))
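ifelse() works element by element, which is what makes it usable inside mutate(). A sketch on plain vectors (the four pop and gdpPercap values are invented):

```r
pop <- c(5e6, 2e7, 8e6, 1.5e8)
gdpPercap <- c(1000, 2000, 3000, 4000)

# For each element: if pop > 10 million compute GDP in $billion, otherwise NA
gdp_billion_large <- ifelse(pop > 10e6, gdpPercap * pop / 10^9, NA)
gdp_billion_large

# NA values need to be skipped explicitly in later summaries
mean_large <- mean(gdp_billion_large, na.rm = TRUE)
mean_large
```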
Data cleaning/processing is not just a first step - it must be repeated many times over the course of an analysis
It is often said that about 80% of the effort of data analysis is cleaning and preparing data for analysis
The principles of tidy data provide a standard way to organise data values within a dataset
Here’s a dataset like you might receive it from a colleague
It’s a RECTANGULAR TABLE
Each row describes a treatment
Each column gives the results for a different individual, for each treatment
Thinking about how our dataframes are structured, there should be one row per observation.
So we’ve transposed the rows and columns of the table
The data is the same
The layout is different
Now we have one row per observation, and we have one column per variable (treatments A and B), and that’s what we want, isn’t it?
PAUSE: NEW CALLOUT
But the data doesn’t have to be structured this way.
To understand what this means, AND WHY IT IS UNTIDY DATA, we need to consider some data semantics.
We need to define three terms
A dataset, like the one shown, is a collection of VALUES
Each value belongs to a variable, and to an observation
A VARIABLE is something that can change or vary
An OBSERVATION is a collection of values measured across all variables for the same individual or unit
The values in this dataset include NA, 16, 3, 2, 11, 1
Here is the dataset represented in tidy form
You can see each variable has its own column
Each observation has one value per column
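The move from the rectangular table to the tidy form can be sketched in base R. The six result values are the ones listed above; the person labels p1 to p3 and the column names are placeholders invented for the example.

```r
# Untidy: one column per treatment, so "treatment" is hidden in the column names
untidy <- data.frame(person = c("p1", "p2", "p3"),
                     treatment_a = c(NA, 16, 3),
                     treatment_b = c(2, 11, 1))

# Tidy: one row per observation, one column per variable
tidy <- data.frame(person = rep(untidy$person, times = 2),
                   treatment = rep(c("a", "b"), each = 3),
                   result = c(untidy$treatment_a, untidy$treatment_b))
tidy

# With the variable in its own column, grouped summaries become one call
mean_by_treatment <- tapply(tidy$result, tidy$treatment, mean, na.rm = TRUE)
mean_by_treatment
```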
Tidy data is particularly well suited to R because it supports vectorisation (which you saw earlier)
The gapminder data is in an intermediate format
Each column in the gapminder dataset is a variable
> head(gapminder)
country year pop continent lifeExp gdpPercap
1 Afghanistan 1952 8425333 Asia 28.801 779.4453
2 Afghanistan 1957 9240934 Asia 30.332 820.8530
3 Afghanistan 1962 10267083 Asia 31.997 853.1007
4 Afghanistan 1967 11537966 Asia 34.020 836.1971
5 Afghanistan 1972 13079460 Asia 36.088 739.9811
6 Afghanistan 1977 14880372 Asia 38.438 786.1134
We often refer to datasets as being “long” or “wide”
LONG datasets have one row per observation, and one column per variable
WIDE datasets might have multiple arrangements
Wide datasets tend to be easier for humans to read
BUT MANY R FUNCTIONS ARE DESIGNED TO WORK WITH LONG DATA
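Some operations work on groups within the data (e.g. with group_by()), and it can be more convenient to have an intermediate form like the gapminder data.
gapminder Wide Dataset
We’ll practise converting between wide and long forms of the gapminder data to make our lives easier.
Let’s load a wide version of the gapminder data and take a look at it.
> gap_wide <- read.csv("data/gapminder_wide.csv", stringsAsFactors = FALSE)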
> str(gap_wide)
'data.frame': 142 obs. of 38 variables:
$ continent : chr "Africa" "Africa" "Africa" "Africa" ...
$ country : chr "Algeria" "Angola" "Benin" "Botswana" ...
$ gdpPercap_1952: num 2449 3521 1063 851 543 ...
$ gdpPercap_1957: num 3014 3828 960 918 617 ...
$ gdpPercap_1962: num 2551 4269 949 984 723 ...
$ gdpPercap_1967: num 3247 5523 1036 1215 795 ...
$ gdpPercap_1972: num 4183 5473 1086 2264 855 ...
$ gdpPercap_1977: num 4910 3009 1029 3215 743 ...
$ gdpPercap_1982: num 5745 2757 1278 4551 807 ...
$ gdpPercap_1987: num 5681 2430 1226 6206 912 ...
$ gdpPercap_1992: num 5023 2628 1191 7954 932 ...
$ gdpPercap_1997: num 4797 2277 1233 8647 946 ...
$ gdpPercap_2002: num 5288 2773 1373 11004 1038 ...
$ gdpPercap_2007: num 6223 4797 1441 12570 1217 ...
$ lifeExp_1952 : num 43.1 30 38.2 47.6 32 ...
$ lifeExp_1957 : num 45.7 32 40.4 49.6 34.9 ...
$ lifeExp_1962 : num 48.3 34 42.6 51.5 37.8 ...
$ lifeExp_1967 : num 51.4 36 44.9 53.3 40.7 ...
$ lifeExp_1972 : num 54.5 37.9 47 56 43.6 ...
$ lifeExp_1977 : num 58 39.5 49.2 59.3 46.1 ...
$ lifeExp_1982 : num 61.4 39.9 50.9 61.5 48.1 ...
$ lifeExp_1987 : num 65.8 39.9 52.3 63.6 49.6 ...
$ lifeExp_1992 : num 67.7 40.6 53.9 62.7 50.3 ...
$ lifeExp_1997 : num 69.2 41 54.8 52.6 50.3 ...
$ lifeExp_2002 : num 71 41 54.4 46.6 50.6 ...
$ lifeExp_2007 : num 72.3 42.7 56.7 50.7 52.3 ...
$ pop_1952 : num 9279525 4232095 1738315 442308 4469979 ...
$ pop_1957 : num 10270856 4561361 1925173 474639 4713416 ...
$ pop_1962 : num 11000948 4826015 2151895 512764 4919632 ...
$ pop_1967 : num 12760499 5247469 2427334 553541 5127935 ...
$ pop_1972 : num 14760787 5894858 2761407 619351 5433886 ...
$ pop_1977 : num 17152804 6162675 3168267 781472 5889574 ...
$ pop_1982 : num 20033753 7016384 3641603 970347 6634596 ...
$ pop_1987 : num 23254956 7874230 4243788 1151184 7586551 ...
$ pop_1992 : num 26298373 8735988 4981671 1342614 8878303 ...
$ pop_1997 : num 29072015 9875024 6066080 1536536 10352843 ...
$ pop_2002 : int 31287142 10866106 7026113 1630347 12251209 7021078 15929988 4048013 8835739 614382 ...
$ pop_2007 : int 33333216 12420476 8078314 1639131 14326203 8390505 17696293 4369038 10238807 710960 ...
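To convert from wide to long format, we use the pivot_longer() FUNCTION FROM tidyr/tidyverse
The cols argument names the columns to pivot; each of them holds values of an observed variable
The pivot_longer() function effectively splits the table by these columns
We name the two new columns with names_to and values_to.
For the gapminder data, we want to pivot all columns that start with pop, lifeExp or gdpPercap
The old column names go into a new column called obstype_year
The values go into a new column called obs_values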
> gap_long <- gap_wide %>%
+ pivot_longer(
+ cols = c(starts_with('pop'), starts_with('lifeExp'), starts_with('gdpPercap')),
+ names_to = "obstype_year", values_to = "obs_values"
+ )
> str(gap_long)
tibble [5,112 × 4] (S3: tbl_df/tbl/data.frame)
$ continent : chr [1:5112] "Africa" "Africa" "Africa" "Africa" ...
$ country : chr [1:5112] "Algeria" "Algeria" "Algeria" "Algeria" ...
$ obstype_year: chr [1:5112] "pop_1952" "pop_1957" "pop_1962" "pop_1967" ...
$ obs_values : num [1:5112] 9279525 10270856 11000948 12760499 14760787 ...
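The obstype_year COLUMN holds two pieces of information: the observation type and the year
We want to split the year into its own column.
separate()
The separate() FUNCTION SPLITS THE CONTENTS OF A SINGLE COLUMN INTO MULTIPLE NEW COLUMNS
We use it to put each part of the obstype_year column into its own new column.
The separate() function needs to know: which column to split, the names of the new columns (into), and the separator to split on (sep)
Here we split the obstype_year column into two columns called obs_type and year, on the underscore separator "_".
> gap_long <- gap_long %>% separate(obstype_year, into = c('obs_type', 'year'), sep = "_")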
The new year column is character, so we convert it with as.integer and take a look at the new dataframe.
> gap_long$year <- as.integer(gap_long$year)
> str(gap_long)
tibble [5,112 × 5] (S3: tbl_df/tbl/data.frame)
$ continent : chr [1:5112] "Africa" "Africa" "Africa" "Africa" ...
$ country : chr [1:5112] "Algeria" "Algeria" "Algeria" "Algeria" ...
$ obs_type : chr [1:5112] "pop" "pop" "pop" "pop" ...
$ year : int [1:5112] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
$ obs_values: num [1:5112] 9279525 10270856 11000948 12760499 14760787 ...
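What separate() and as.integer() do to the obstype_year values can be imitated on a plain character vector. This sketch uses base R's strsplit() as a stand-in for separate(); the three input strings follow the naming pattern of the wide columns.

```r
obstype_year <- c("pop_1952", "lifeExp_1957", "gdpPercap_2007")

# Split each string on the underscore; the result is a list of two-part vectors
parts <- strsplit(obstype_year, "_", fixed = TRUE)

# Take the first part as the observation type and the second as the year
obs_type <- sapply(parts, `[`, 1)
year <- as.integer(sapply(parts, `[`, 2))
obs_type
year
typeof(year)
```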
The observations are now recorded in obs_type and obs_values and it would be convenient to split them into new columns
There is an inverse of pivot_longer() called pivot_wider() that we use for this.
pivot_wider()
The pivot_wider() FUNCTION SPLITS THE TABLE UP BY VALUES IN THE COLUMN OF “NAMES”
We recover the original layout of the gapminder data by using the obs_type column for the names, and the obs_values column for the values.
> gap_normal <- gap_long %>% pivot_wider(names_from = obs_type, values_from = obs_values)
> str(gap_normal)
tibble [1,704 × 6] (S3: tbl_df/tbl/data.frame)
$ continent: chr [1:1704] "Africa" "Africa" "Africa" "Africa" ...
$ country : chr [1:1704] "Algeria" "Algeria" "Algeria" "Algeria" ...
$ year : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
$ pop : num [1:1704] 9279525 10270856 11000948 12760499 14760787 ...
$ lifeExp : num [1:1704] 43.1 45.7 48.3 51.4 54.5 ...
$ gdpPercap: num [1:1704] 2449 3014 2551 3247 4183 ...
gap_long %>% group_by(continent, obs_type) %>%
  summarize(means=mean(obs_values))
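The longer/wider round trip is easier to follow on a table small enough to read in full. This is a sketch under assumptions: toy_wide and its values are invented, and the code only runs the tidyr calls if the tidyr package is installed.

```r
# Hypothetical two-country table in wide form
toy_wide <- data.frame(country = c("A", "B"),
                       pop_1952 = c(10, 20),
                       pop_1957 = c(11, 21))

if (requireNamespace("tidyr", quietly = TRUE)) {
  # Wide to long: column names go to obstype_year, cell values to obs_values
  toy_long <- tidyr::pivot_longer(toy_wide,
                                  cols = tidyr::starts_with("pop"),
                                  names_to = "obstype_year",
                                  values_to = "obs_values")
  print(toy_long)

  # Long to wide again: the names column supplies the new column headings
  toy_back <- tidyr::pivot_wider(toy_long,
                                 names_from = obstype_year,
                                 values_from = obs_values)
  print(toy_back)
}
```

Each wide row with two pop columns becomes two long rows, so the long table has four rows; pivoting back restores one row per country.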