Data Structures, Data Frames, and Tidy Data

0. What are we doing?

Learning Questions

How can I read data in R?
What are the basic data types in R?
How can I manipulate a data frame?
How can I manipulate data frames without repeating myself?

Learning Objectives

To be able to identify the main data types in R
To explore data frames (and see how they are built from R data types)
To be able to ask R about the type and structure of an object
To be able to use the six main data frame manipulation ‘verbs’ with pipes in dplyr.
To understand the concepts of ‘longer’ and ‘wider’ data frame formats and be able to convert between them with tidyr.

1. Data Types

Data Types and Structures in `R`

R is mostly used for data analysis
R has special types and structures to help you work with data
Much of the focus in R is on tabular data (data frames)

Demo
Code

Interactive Demo

toy dataset: feline-data.csv

cats <- data.frame(coat = c("calico", "black", "tabby"),
                   weight = c(2.1, 5.0, 3.2),
                   likes_catnip = c(1, 0, 1))
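
The slide title mentions feline-data.csv, while the code above builds cats in memory. A minimal sketch of how the two relate, assuming a temporary file stands in for the lesson's CSV file (the temporary path is an illustration only):

```r
# Build the toy data frame from the slide
cats <- data.frame(coat = c("calico", "black", "tabby"),
                   weight = c(2.1, 5.0, 3.2),
                   likes_catnip = c(1, 0, 1))

# Assumption: a temporary file stands in for feline-data.csv
csv_path <- tempfile(fileext = ".csv")
write.csv(cats, file = csv_path, row.names = FALSE)

# Read the file back in, as you would with any CSV file
cats_from_file <- read.csv(csv_path)
str(cats_from_file)
```

Reading the file back should give three rows and the same three column names.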

What Data Types Do You Expect?

What data types would you expect to see?

What examples of data types can you think of from your own experience?

Please write in the chat

Add your suggestions of data types at:

https://pad.carpentries.org/2025-06-16-strathclyde

Data Types in `R`

Data types in R are atomic

logical: TRUE, FALSE
numeric:
- integer: 3L, 2L, 123456L (the L suffix marks an integer)
- double (decimal): 3.0, -23.45, pi
character (text): "a", 'SWC', "This is not a string"

Demo
Code

INTERACTIVE DEMO

Inspecting data types

> typeof(1L)
[1] "integer"
> typeof(cats$coat)
[1] "character"
> typeof("Irn Bru")
[1] "character"
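
A few more calls to typeof() can help separate the numeric types. This is a small sketch using only base R; the values are arbitrary:

```r
typeof(3)        # "double": a bare number is stored as a double
typeof(3L)       # "integer": the L suffix asks for an integer
typeof(TRUE)     # "logical"
typeof("3")      # "character": quotes make it text
typeof(1 + 2i)   # "complex"
is.numeric(3L)   # TRUE: integers and doubles both count as numeric
```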

A Quick Note About Dataframes

A column in a dataframe can contain only one data type

Adding a character string ("2.3 or 2.4") to the column of doubles changed its data type from double to character.

Demo
Code

INTERACTIVE DEMO

Coercion in dataframes

> str(cats2)
'data.frame':   4 obs. of  3 variables:
 $ coat        : chr  "calico" "black" "tabby" "tabby"
 $ weight      : chr  "2.1" "5" "3.2" "2.3 or 2.4"
 $ likes_catnip: num  1 0 1 1
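
The demo output does not show how cats2 was made. One way to get the same structure, as a sketch (the lesson itself may load cats2 from a file instead), is to mix a string into the weight values:

```r
# Assumption: cats2 is rebuilt by hand here for illustration
cats2 <- data.frame(coat = c("calico", "black", "tabby", "tabby"),
                    weight = c(2.1, 5.0, 3.2, "2.3 or 2.4"),
                    likes_catnip = c(1, 0, 1, 1),
                    stringsAsFactors = FALSE)
typeof(cats2$weight)   # "character": one string coerced the whole column

# Converting back to numbers turns the unparseable entry into NA
# (R would normally warn about this; the warning is silenced here)
weight_num <- suppressWarnings(as.numeric(cats2$weight))
is.na(weight_num)      # FALSE FALSE FALSE TRUE
```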

2. Data Structures

Vectors

Vectors are the most common data structure in R
An ordered collection of data
Can contain only a single data type

Demo
Code

INTERACTIVE DEMO

Understanding vectors

> x <- c(10, 12, 45, 33)
> x
[1] 10 12 45 33

INTERACTIVE DEMO

A quiz

> quiz_vector <- c(2,6,'3')
> typeof(quiz_vector)

We need to be aware of data types and how R interprets them; otherwise we may be surprised

Demo
Code

INTERACTIVE DEMO

Vector coercion

> coercion_vector <- c('a', TRUE)
> coercion_vector
[1] "a"    "TRUE"

Coercion

Implicit coercion hierarchy

logical \(\rightarrow\) integer \(\rightarrow\) double \(\rightarrow\) complex \(\rightarrow\) character
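
Each step of the hierarchy can be checked by combining two neighbouring types with c(). A short sketch in base R:

```r
typeof(c(TRUE, 2L))        # "integer"
typeof(c(2L, 3.5))         # "double"
typeof(c(3.5, 1 + 0i))     # "complex"
typeof(c(1 + 0i, "a"))     # "character"

# Logical values become 1 and 0, so sum() counts the TRUEs
sum(c(TRUE, FALSE, TRUE))  # 2
```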

Check your inputs

If there are formatting problems with your data, you might not have the type you expect when you import into R

Demo
Code

INTERACTIVE DEMO

Manual coercion

> x
[1] 10 12 45 33
> as.character(x)
[1] "10" "12" "45" "33"
> as.complex(x)
[1] 10+0i 12+0i 45+0i 33+0i
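
Manual coercion can also go down the hierarchy, but it does not always succeed. A small sketch with made-up values, showing one conversion that works and one that leaves an NA:

```r
# Numbers to logical: 0 becomes FALSE, anything else TRUE
likes_catnip <- c(1, 0, 1)
as.logical(likes_catnip)             # TRUE FALSE TRUE

# Text to integer: entries that are not numbers become NA
# (R would normally warn; the warning is silenced here)
text_numbers <- c("10", "12", "forty-five")
converted <- suppressWarnings(as.integer(text_numbers))
converted                            # 10 12 NA
```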

Lists

Lists are like vectors, but can hold any combination of data types

Demo
Code

INTERACTIVE DEMO

Working with lists

l <- list(1, 'a', TRUE, seq(2, 5))

Working with list data

Elements in a list are accessed with [[ ]], and individual elements can also be named
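
A short sketch of both ideas. The named list pet is hypothetical and used only for illustration:

```r
l <- list(1, 'a', TRUE, seq(2, 5))
l[[2]]          # "a": double brackets return the element itself
class(l[2])     # "list": single brackets return a shorter list

# Hypothetical named list, for illustration only
pet <- list(name = "Tom", weight = 4.2, likes_catnip = TRUE)
pet$weight      # 4.2
pet[["name"]]   # "Tom"
names(pet)      # "name" "weight" "likes_catnip"
```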

3. Data Frames

Let’s Look At A Data Frame

The cats data is a data.frame

Demo
Code

INTERACTIVE DEMO

Data frame data structures

> typeof(cats)
[1] "list"
> cats[[2]]
[1] 2.1 5.0 3.2

A data frame is a special named list of vectors with the same length

It’s like a spreadsheet, but each column must have a single type, and all columns must be the same length.
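
These properties can be checked directly on the cats data. A minimal sketch in base R (stringsAsFactors = FALSE is added here so the result is the same on older versions of R):

```r
cats <- data.frame(coat = c("calico", "black", "tabby"),
                   weight = c(2.1, 5.0, 3.2),
                   likes_catnip = c(1, 0, 1),
                   stringsAsFactors = FALSE)

is.list(cats)                       # TRUE: a data frame is a list
sapply(cats, typeof)                # one type per column
identical(cats$weight, cats[[2]])   # TRUE: each column is a vector
length(cats) == ncol(cats)          # TRUE: length() counts columns

# Columns of different lengths are rejected
bad <- try(data.frame(a = 1:3, b = 1:2), silent = TRUE)
inherits(bad, "try-error")          # TRUE
```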

Load Episode Data

Where to find the data

https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/main/episodes/data/gapminder_data.csv
Link available on the Etherpad (https://pad.carpentries.org/2025-06-16-strathclyde)

Save the data to the data subfolder (where you put feline-data.csv)

Demo
Code

INTERACTIVE DEMO

Load gapminder data

gapminder <- read.table("data/gapminder_data.csv", sep=",", header=TRUE)

Investigating `gapminder`

Demo
Code

INTERACTIVE DEMO

What is the structure of the data?
How many rows and columns?
Summarising the data?

str(gapminder)              # structure of the data.frame
typeof(gapminder$year)      # data type of a column
length(gapminder)           # length of the data.frame (its number of columns)
nrow(gapminder)             # number of rows in data.frame
ncol(gapminder)             # number of columns in data.frame
dim(gapminder)              # number of rows and columns in data.frame
colnames(gapminder)         # column names from data.frame
head(gapminder)             # first few rows of dataframe
summary(gapminder)          # summary of data in data.frame columns

4. Data Frame Manipulation With `dplyr`

Learning Objectives

How to manipulate data.frames with the six verbs of dplyr
- a ‘grammar of data manipulation’

dplyr verbs

select()
filter()
group_by()
summarize()
mutate()
%>% (pipe)

What and Why is `dplyr`?

dplyr is another package in the Tidyverse
Facilitates analysis by groups in Tidy Data
- Helps avoid repetition

Tip

Avoiding repetition (through automation) makes code

robust
reproducible

Split-Apply-Combine
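
The idea can be sketched in base R before moving to dplyr. The toy data frame below is made up for illustration; the three steps are split the data by group, apply a function to each piece, and combine the results:

```r
# Made-up toy data
toy <- data.frame(group = c("a", "a", "b", "b", "b"),
                  value = c(1, 3, 2, 4, 6))

# Split: one vector of values per group
pieces <- split(toy$value, toy$group)

# Apply: compute the mean of each piece
means <- sapply(pieces, mean)

# Combine: gather the results into one data frame
result <- data.frame(group = names(means),
                     mean_value = unname(means))
result
```

The dplyr verbs group_by() and summarize() express the same pattern in a single pipeline.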

`select()`

Demo
Code

INTERACTIVE DEMO

Select specific columns for the whole dataset

library(dplyr)  # provides select() and the %>% pipe
head(select(gapminder, year, country, gdpPercap))
gapminder %>% select(year, country, gdpPercap) %>% head()

`filter()`

filter() selects rows on the basis of some condition

Demo
Code

INTERACTIVE DEMO

Select specific columns for a subset of rows

filter(gapminder, continent=="Europe")

# Select gdpPercap by country and year, only for Europe
eurodata <- gapminder %>%
              filter(continent == "Europe") %>%
              select(year, country, gdpPercap)
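
To see what the pipeline does without loading gapminder, here is a sketch on a small made-up data frame (the numbers are invented, and dplyr is assumed to be installed). The base R line underneath gives the same rows and columns:

```r
library(dplyr)

# Made-up values, for illustration only
toy <- data.frame(country = c("Albania", "Albania", "Kenya", "Kenya"),
                  continent = c("Europe", "Europe", "Africa", "Africa"),
                  year = c(2002L, 2007L, 2002L, 2007L),
                  gdpPercap = c(4600, 5900, 1300, 1500))

# filter() keeps rows, select() keeps columns
europe_dplyr <- toy %>%
  filter(continent == "Europe") %>%
  select(year, country, gdpPercap)

# Base R equivalent: logical row index, then column names
europe_base <- toy[toy$continent == "Europe",
                   c("year", "country", "gdpPercap")]
```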

Challenge (5min)

Challenge

Can you write a single command (which may span multiple lines in your script by including pipes) to produce a dataframe from gapminder containing:

life expectancy, country, and year data
only for African nations
How many rows does the dataframe have?

`group_by()`

Demo
Code

INTERACTIVE DEMO

Split the dataset into groups on the basis of an ID value

group_by(gapminder, continent)
gapminder %>% group_by(continent)

`summarize()`

Demo
Code

INTERACTIVE DEMO

Produce a summary for each group

# Produce table of mean GDP per capita by continent
gapminder %>% group_by(continent) %>%
              summarize(meangdpPercap=mean(gdpPercap))

Challenge (5min)

Challenge

Can you calculate the average life expectancy per country in the gapminder data?
Which nation has the longest life expectancy, and which the shortest?

`count()` and `n()`

Two useful functions related to summarize()

count()/tally(): a function that reports a table of counts by group
n(): a function used within summarize(), filter() or mutate() to represent count by group

Demo
Code

INTERACTIVE DEMO

Count number of rows meeting set criteria
Calculate new summary statistic for a group

gapminder %>%
  filter(year == 2002) %>%
  count(continent, sort = TRUE)

gapminder %>%
  group_by(continent) %>%
  summarize(se_lifeExp = sd(lifeExp)/sqrt(n()))
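
The same two calls can be checked by hand on a small made-up data frame, where the counts and standard errors are easy to work out (dplyr is assumed to be installed; the values are invented):

```r
library(dplyr)

# Made-up values: three rows in group "a", two in group "b"
toy <- data.frame(continent = c("a", "a", "a", "b", "b"),
                  lifeExp = c(70, 72, 74, 50, 60))

# count() gives one row per group, with the count in column n
counts <- toy %>% count(continent)

# n() gives the group size inside summarize()
se <- toy %>%
  group_by(continent) %>%
  summarize(n_obs = n(),
            se_lifeExp = sd(lifeExp) / sqrt(n()))
```

For group "a" the standard deviation is 2, so the standard error is 2 / sqrt(3); for group "b" it works out to 5.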

`mutate()`

Tip

mutate() is a function allowing creation of new variables

Demo
Code

INTERACTIVE DEMO

Generate new columns with processed data

# Calculate GDP in $billion
gdp_bill <- gapminder %>%
  mutate(gdp_billion = gdpPercap * pop / 10^9)

# Calculate mean/sd of GDP per capita and of total GDP by continent and year
gdp_bycontinents_byyear <- gdp_bill %>%
    group_by(continent,year) %>%
    summarize(mean_gdpPercap=mean(gdpPercap),
              sd_gdpPercap=sd(gdpPercap),
              mean_gdp_billion=mean(gdp_billion),
              sd_gdp_billion=sd(gdp_billion))
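
A sketch of the same two steps on made-up numbers, so that the new column can be checked by hand (dplyr is assumed to be installed). The .groups = "drop" argument is an optional addition here: it returns an ungrouped result and avoids the message summarize() prints when grouping by more than one column:

```r
library(dplyr)

# Made-up values, for illustration only
toy <- data.frame(continent = c("x", "x", "y"),
                  year = c(2002L, 2002L, 2002L),
                  gdpPercap = c(1000, 3000, 20000),
                  pop = c(2e6, 4e6, 5e7))

# mutate() adds a column and keeps every row
toy_bill <- toy %>%
  mutate(gdp_billion = gdpPercap * pop / 10^9)
toy_bill$gdp_billion   # 2 12 1000

# summarize() then collapses to one row per continent and year
by_group <- toy_bill %>%
  group_by(continent, year) %>%
  summarize(mean_gdp_billion = mean(gdp_billion),
            .groups = "drop")
```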

`ifelse()`

Tip

ifelse() lets us apply a calculation only to the rows that meet a condition when we use mutate()

ifelse([CONDITION], [VALUE IF TRUE], [VALUE IF FALSE])

Demo
Code

INTERACTIVE DEMO

Apply calculations differently to rows that meet specified criteria

gdp_billion_large_countries <- gapminder %>%
  mutate(gdp_billion_large = ifelse(pop > 10e6, 
                                    gdpPercap * pop / 10^9,
                                    NA))
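
ifelse() works element by element on plain vectors too, which makes it easy to try out on its own. A base R sketch with made-up values; note that 10e6 is ten million:

```r
# Made-up populations and GDP per capita
pop <- c(5e6, 2e7, 8e7)
gdpPercap <- c(1000, 2000, 3000)

# One result per element of the condition
ifelse(pop > 10e6, "large", "small")   # "small" "large" "large"

# The same pattern as in the mutate() call above
gdp_billion_large <- ifelse(pop > 10e6,
                            gdpPercap * pop / 10^9,
                            NA)
gdp_billion_large                      # NA 40 240
```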

5. Tidy Data

Why Tidy Data?

“Tidy datasets are all alike, but every messy dataset is messy in its own way”

Data Cleaning is not just a first step

Cleaning is needed whenever new data turns up, new ideas arrive, etc.
About 80% of the effort of data analysis is cleaning and preparing data

Principles of Tidy Data provide a standard way to organise data values within a dataset

An Untidy Dataset (1)

What’s wrong with this data?

	John Smith	Jane Doe	Mary Johnson
treatment.A	NA	16	3
treatment.B	2	11	1

An Untidy Dataset (2)

Is this better?

	treatment.A	treatment.B
John Smith	NA	2
Jane Doe	16	11
Mary Johnson	3	1

We need to talk about data semantics

Data Semantics

A dataset is a collection of VALUES

	John Smith	Jane Doe	Mary Johnson
treatment.A	NA	16	3
treatment.B	2	11	1

Each value belongs to a variable and an observation

VARIABLES (can change or vary): values that are measured or decided by a researcher: height, temperature, duration, treatment, etc.
OBSERVATIONS: values measured across all variables for the same individual/unit/group: person, reactor, religion, company, etc.

Challenge (2min)

In the table below, what are the rows and columns?

	John Smith	Jane Doe	Mary Johnson
treatment.A	NA	16	3
treatment.B	2	11	1

USE CHALLENGE SECTION ON ETHERPAD

Rows=OBSERVATIONS, Columns=VARIABLES
Rows=OBSERVATIONS, Columns=NEITHER
Rows=NEITHER, Columns=VARIABLES
Rows=NEITHER, Columns=NEITHER

A Tidy Dataset (1)

The dataset contains 18 values

There are three variables and six observations

VARIABLES

person (three people: John, Mary, Jane)
treatment (two treatments: a and b)
result (six measurements: NA, 16, 3, 2, 11, 1)

Important

Each OBSERVATION includes all three variables

A Tidy Dataset (2)

This is the dataset in tidy form

(aka long form)

name	treatment	result
John Smith	a	NA
Jane Doe	a	16
Mary Johnson	a	3
John Smith	b	2
Jane Doe	b	11
Mary Johnson	b	1

Tidy Data

Tidy data is a standard way of structuring a dataset

Each VARIABLE forms a column
Each OBSERVATION forms a row

This makes it easy to extract the variables you need, and is well suited to R (it supports vectorisation)
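
As a sketch of why this helps, the tidy treatment table from the earlier slide can be typed in as a data frame and summarised with base R, because each variable is a single column:

```r
# The tidy table from the slides
tidy <- data.frame(
  name = rep(c("John Smith", "Jane Doe", "Mary Johnson"), times = 2),
  treatment = rep(c("a", "b"), each = 3),
  result = c(NA, 16, 3, 2, 11, 1)
)

# One variable is one column, so vectorised functions apply directly
mean(tidy$result, na.rm = TRUE)        # 6.6

# Rows are observations, so subsetting by a variable is a row filter
tidy[tidy$treatment == "b", ]

# Mean result per treatment
by_treatment <- tapply(tidy$result, tidy$treatment, mean, na.rm = TRUE)
by_treatment
```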

Demo
Code

INTERACTIVE DEMO

Inspect gapminder data

head(gapminder)

Long v Wide

Many R functions are designed to expect long data

Long data: each column is a variable, each row is an observation
Wide data: each row is often a subject, and several columns hold the same kind of observation (for example, one column per year)

`gapminder` Wide Dataset

Where to find the data

https://swcarpentry.github.io/r-novice-gapminder/data/gapminder_wide.csv
Link available on the Etherpad (https://pad.carpentries.org/2025-06-16-strathclyde)

Demo
Code

INTERACTIVE DEMO

Load and inspect the data

gap_wide <- read.csv("data/gapminder_wide.csv", stringsAsFactors = FALSE)
str(gap_wide)

Pivot: Wide to Long

Demo
Code

INTERACTIVE DEMO

Pivot wide to long

library(tidyr)  # provides pivot_longer(), pivot_wider() and separate()
gap_long <- gap_wide %>%
  pivot_longer(
    cols = c(starts_with('pop'), starts_with('lifeExp'),
             starts_with('gdpPercap')),
    names_to = "obstype_year", values_to = "obs_values"
  )

`separate()`

Tip

separate() allows us to split the contents of one column into several columns

separate([COLUMN], into=[NEW_COLUMN_NAMES], sep=[SEPARATOR])

Demo
Code

INTERACTIVE DEMO

Split one column into two

# Assign the result so that gap_long keeps the new obs_type and year columns
gap_long <- gap_long %>%
  separate(obstype_year,
           into = c('obs_type', 'year'),
           sep = "_")
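
A sketch of separate() on a small made-up data frame (tidyr is assumed to be installed). It also shows that the new year column comes out as text, so converting it to an integer is a sensible follow-up step:

```r
library(tidyr)

# Made-up values, for illustration only
toy_long <- data.frame(
  country = c("A", "A", "B", "B"),
  obstype_year = c("pop_1952", "lifeExp_1952", "pop_1952", "lifeExp_1952"),
  obs_values = c(100, 40.5, 200, 60.2)
)

# Split "pop_1952" into "pop" and "1952"
toy_sep <- separate(toy_long, obstype_year,
                    into = c("obs_type", "year"),
                    sep = "_")
typeof(toy_sep$year)   # "character": the pieces are still text

# Convert the year to a number before using it in calculations
toy_sep$year <- as.integer(toy_sep$year)
```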

Pivot: Long to Wide

Demo
Code

INTERACTIVE DEMO

Pivot long to wide

gap_normal <- gap_long %>%
  pivot_wider(names_from = obs_type,
              values_from = obs_values)
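
The whole round trip can be followed on a small made-up data frame (tidyr is assumed to be installed; the values are invented). The sketch goes wide to long, splits the combined column, then pivots back so each observation type has its own column:

```r
library(tidyr)

# Made-up wide data: one column per observation type and year
toy_wide <- data.frame(country = c("A", "B"),
                       pop_1952 = c(100, 200),
                       lifeExp_1952 = c(40.5, 60.2))

# Wide to long: one row per country and measurement
toy_long <- pivot_longer(toy_wide,
                         cols = c(starts_with("pop"), starts_with("lifeExp")),
                         names_to = "obstype_year",
                         values_to = "obs_values")

# Split "pop_1952" into obs_type and year
toy_long <- separate(toy_long, obstype_year,
                     into = c("obs_type", "year"), sep = "_")

# Long to wide: one column per observation type
toy_normal <- pivot_wider(toy_long,
                          names_from = obs_type,
                          values_from = obs_values)
toy_normal
```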

Challenge (5min)

Challenge

Using the gap_long dataset, calculate the mean life expectancy, population, and GDP per capita for each continent.

Hint: use the group_by() and summarize() functions we learned in the dplyr episode.
