Data Structures, Data Frames, and Tidy Data

0. What are we doing?

Learning Questions

  • How can I read data in R?
  • What are the basic data types in R?
  • How can I manipulate a data frame?
  • How can I manipulate data frames without repeating myself?

Learning Objectives

  • To be able to identify the main data types in R
  • To explore data frames (and see how they are built from R data types)
  • To be able to ask R about the type and structure of an object
  • To be able to use the six main data frame manipulation ‘verbs’ with pipes in dplyr.
  • To understand the concepts of ‘longer’ and ‘wider’ data frame formats and be able to convert between them with tidyr.

1. Data Types

Data Types and Structures in R

  • R is mostly used for data analysis
  • R has special types and structures to help you work with data
  • Much of the focus in R is on tabular data (data frames)

Interactive Demo

  • toy dataset: feline-data.csv
cats <- data.frame(coat = c("calico", "black", "tabby"),
                   weight = c(2.1, 5.0, 3.2),
                   likes_catnip = c(1, 0, 1))

What Data Types Do You Expect?

What data types would you expect to see?

What examples of data types can you think of from your own experience?

Please write in the chat

Add your suggestions of data types at:

https://pad.carpentries.org/2025-06-16-strathclyde

Data Types in R

Data types in R are atomic

  1. logical: TRUE, FALSE
  2. numeric:
    • integer: 3L, 2L, 123456L (the L suffix marks an integer; a bare 3 is stored as a double)
    • double (decimal): 3.0, -23.45, pi
  3. character (text): "a", 'SWC', "This is not a string"

INTERACTIVE DEMO

  • Inspecting data types
> typeof(1L)
[1] "integer"
> typeof(cats$coat)
[1] "character"
> typeof("Irn Bru")
[1] "character"
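
A minimal sketch of how the atomic types above can be checked with typeof(). The literals are illustrative and not part of the cats data.

```r
# Numeric literals are doubles unless they carry the L suffix
typeof(3)        # "double"
typeof(3L)       # "integer"
typeof(3.0)      # "double"
typeof(TRUE)     # "logical"
typeof("3")      # "character"

# is.numeric() is TRUE for both integers and doubles
is.numeric(3L)   # TRUE
is.numeric(3)    # TRUE
```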

A Quick Note About Dataframes

A column in a dataframe can contain only one data type

  • Adding a character string ("2.3 or 2.4") to the column of doubles changed its data type from double to character.

INTERACTIVE DEMO

  • Coercion in dataframes
> str(cats2)
'data.frame':   4 obs. of  3 variables:
 $ coat        : chr  "calico" "black" "tabby" "tabby"
 $ weight      : chr  "2.1" "5" "3.2" "2.3 or 2.4"
 $ likes_catnip: num  1 0 1 1
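
One way to reproduce this behaviour without a file is sketched below. It assumes cats2 is built directly in memory; in the demo the data may instead come from an edited CSV file.

```r
# Assumption: cats2 is constructed by hand here rather than read from disk
cats2 <- data.frame(
  coat = c("calico", "black", "tabby", "tabby"),
  # a single string forces every value in this column to become character
  weight = c(2.1, 5.0, 3.2, "2.3 or 2.4"),
  likes_catnip = c(1, 0, 1, 1),
  stringsAsFactors = FALSE   # explicit, so older R versions behave the same
)

typeof(cats2$weight)         # "character"
typeof(cats2$likes_catnip)   # "double"
```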

2. Data Structures

Vectors

  • Vectors are the most common data structure in R
  • An ordered collection of data
  • Can contain only a single datatype

INTERACTIVE DEMO

  • Understanding vectors
> x <- c(10, 12, 45, 33)
> x
[1] 10 12 45 33

Coercion

INTERACTIVE DEMO

  • A quiz
> quiz_vector <- c(2,6,'3')
> typeof(quiz_vector)

We need to be aware of datatypes and how R interprets them, otherwise we can meet some surprises

INTERACTIVE DEMO

  • Vector coercion
> coercion_vector <- c('a', TRUE)
> coercion_vector
[1] "a"    "TRUE"

Coercion

Implicit coercion hierarchy

logical \(\rightarrow\) integer \(\rightarrow\) double \(\rightarrow\) complex \(\rightarrow\) character
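
Each step of the hierarchy can be seen by combining two neighbouring types with c(). This is a small sketch with made-up values.

```r
typeof(c(TRUE, 2L))      # "integer"
typeof(c(2L, 3.5))       # "double"
typeof(c(3.5, 1+2i))     # "complex"
typeof(c(1+2i, "a"))     # "character"

# Logical values become 0/1 when they are coerced to numbers
sum(c(TRUE, FALSE, TRUE))   # 2
```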

Check your inputs

If there are formatting problems with your data, you might not have the type you expect when you import into R

INTERACTIVE DEMO

  • Manual coercion
> x
[1] 10 12 45 33
> as.character(x)
[1] "10" "12" "45" "33"
> as.complex(x)
[1] 10+0i 12+0i 45+0i 33+0i
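
Manual coercion can also fail quietly. The sketch below uses hypothetical values (raw_weights is not part of the lesson data) to show that an entry which cannot be parsed becomes NA.

```r
# Hypothetical character input, as it might arrive from a badly formatted file
raw_weights <- c("2.1", "5.0", "3.2", "2.3 or 2.4")

# as.numeric() warns and returns NA for the entry it cannot parse
weights <- suppressWarnings(as.numeric(raw_weights))
weights               # 2.1 5.0 3.2  NA
sum(is.na(weights))   # 1

# 0/1 codes can be turned into logical values
likes_catnip <- as.logical(c(1, 0, 1))
likes_catnip          # TRUE FALSE TRUE
```

Counting the NA values after a conversion is a quick way to check your inputs.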

Lists

  • Lists are like vectors, but can hold any combination of datatypes

INTERACTIVE DEMO

  • Working with lists
l <- list(1, 'a', TRUE, seq(2, 5))

Working with list data

Elements in a list are accessed with [[ ]], and individual elements can also be named
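
A short sketch of list access, using a hypothetical named list called pet (the names and values are invented for illustration).

```r
# Hypothetical named list mixing several data types
pet <- list(name = "Tabitha", weight = 3.2, likes_catnip = TRUE, ages = seq(2, 5))

pet[["weight"]]          # 3.2  (the element itself)
pet$name                 # "Tabitha"  ($ is shorthand for [[ ]] with a name)
pet[[4]]                 # 2 3 4 5  (access by position)

# Single brackets return a smaller list, not the element
class(pet["weight"])     # "list"
class(pet[["weight"]])   # "numeric"
```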

3. Data Frames

Let’s Look At A Data Frame

  • The cats data is a data.frame

INTERACTIVE DEMO

  • Data frame data structures
> typeof(cats)
[1] "list"
> cats[[2]]
[1] 2.1 5.0 3.2

A data frame is a special named list of vectors with the same length

  • It’s like a spreadsheet, but each column must have a single type, and all columns must be the same length.
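
The list-of-vectors idea can be checked directly. This sketch rebuilds the cats data; stringsAsFactors = FALSE is added so the result is the same on R versions before 4.0, and the object bad is a hypothetical name.

```r
cats <- data.frame(coat = c("calico", "black", "tabby"),
                   weight = c(2.1, 5.0, 3.2),
                   likes_catnip = c(1, 0, 1),
                   stringsAsFactors = FALSE)

is.list(cats)                              # TRUE
names(cats)                                # "coat" "weight" "likes_catnip"
identical(cats$weight, cats[["weight"]])   # TRUE
sapply(cats, typeof)                       # one type per column
sapply(cats, length)                       # 3 for every column

# Columns of different lengths are rejected
bad <- try(data.frame(a = 1:3, b = 1:2), silent = TRUE)
inherits(bad, "try-error")                 # TRUE
```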

Load Episode Data

Save the data to the data subfolder (where you put feline-data.csv)

INTERACTIVE DEMO

  • Load gapminder data
gapminder <- read.table("data/gapminder_data.csv", sep=",", header=TRUE)

Investigating gapminder

INTERACTIVE DEMO

  • What is the structure of the data?
  • How many rows and columns?
  • Summarising the data?
str(gapminder)              # structure of the data.frame
typeof(gapminder$year)      # data type of a column
length(gapminder)           # length of the data.frame (its number of columns)
nrow(gapminder)             # number of rows in data.frame
ncol(gapminder)             # number of columns in data.frame
dim(gapminder)              # number of rows and columns in data.frame
colnames(gapminder)         # column names from data.frame
head(gapminder)             # first few rows of dataframe
summary(gapminder)          # summary of data in data.frame columns

4. Data Frame Manipulation With dplyr

Learning Objectives

  • How to manipulate data.frames with the six verbs of dplyr
    • a ‘grammar of data manipulation’

dplyr verbs

  • select()
  • filter()
  • group_by()
  • summarize()
  • mutate()
  • %>% (pipe)

What and Why is dplyr?

  • dplyr is another package in the Tidyverse
  • Facilitates analysis by groups in Tidy Data
    • Helps avoid repetition

Tip

Avoiding repetition (through automation) makes code

  • robust
  • reproducible

Split-Apply-Combine

select()

INTERACTIVE DEMO

  • Select specific columns for the whole dataset
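library(dplyr)   # provides select(), filter() and the %>% pipe
head(select(gapminder, year, country, gdpPercap))            # nested calls
gapminder %>% select(year, country, gdpPercap) %>% head()    # same result with pipes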

filter()

filter() selects rows on the basis of some condition

INTERACTIVE DEMO

  • Select specific columns for a subset of rows
filter(gapminder, continent=="Europe")

# Select gdpPercap by country and year, only for Europe
eurodata <- gapminder %>%
              filter(continent == "Europe") %>%
              select(year, country, gdpPercap)
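
The same pipeline can be tried on a tiny in-memory table. The data frame toy below is a hypothetical stand-in for gapminder, with invented values.

```r
library(dplyr)

# Hypothetical miniature stand-in for gapminder (values invented for illustration)
toy <- data.frame(
  country   = c("Albania", "Albania", "Chad", "Chad"),
  continent = c("Europe", "Europe", "Africa", "Africa"),
  year      = c(2002, 2007, 2002, 2007),
  gdpPercap = c(4600, 5900, 1150, 1700)
)

euro_toy <- toy %>%
  filter(continent == "Europe") %>%   # keep rows first, while continent still exists
  select(year, country, gdpPercap)    # then keep only the columns we need

dim(euro_toy)   # 2 3
```

The order matters here: selecting first would drop continent, and the filter would then have nothing to test.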

Challenge (5min)

Challenge

Can you write a single command (which may span multiple lines in your script by including pipes) to produce a dataframe from gapminder containing:

  • life expectancy, country, and year data

  • only for African nations

  • How many rows does the dataframe have?

group_by()

INTERACTIVE DEMO

  • Split the dataset into groups on the basis of an ID value
group_by(gapminder, continent)
gapminder %>% group_by(continent)

summarize()

INTERACTIVE DEMO

  • Produce a summary for each group
# Produce table of mean GDP by continent
gapminder %>% group_by(continent) %>%
              summarize(meangdpPercap=mean(gdpPercap))
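
The group_by() and summarize() pair follows the split-apply-combine pattern. As a rough sketch, the same steps can be written in base R and compared with dplyr; toy is a hypothetical data frame with invented values.

```r
library(dplyr)

# Hypothetical data, invented for illustration
toy <- data.frame(
  continent = c("Africa", "Africa", "Europe", "Europe"),
  gdpPercap = c(1000, 3000, 20000, 30000)
)

# Base R: split the values by group, apply mean() to each piece, combine the results
groups <- split(toy$gdpPercap, toy$continent)
base_means <- sapply(groups, mean)
base_means    # Africa 2000, Europe 25000

# dplyr: the same three steps expressed with group_by() and summarize()
dplyr_means <- toy %>%
  group_by(continent) %>%
  summarize(meangdpPercap = mean(gdpPercap))
```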

Challenge (5min)

Challenge

  • Can you calculate the average life expectancy per country in the gapminder data?
  • Which nation has the longest life expectancy, and which the shortest?

count() and n()

Two useful functions related to summarize()

  • count()/tally(): a function that reports a table of counts by group
  • n(): a function used within summarize(), filter() or mutate() to represent count by group

INTERACTIVE DEMO

  • Count number of rows meeting set criteria
  • Calculate new summary statistic for a group
gapminder %>%
  filter(year == 2002) %>%
  count(continent, sort = TRUE)

gapminder %>%
  group_by(continent) %>%
  summarize(se_lifeExp = sd(lifeExp)/sqrt(n()))
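
To see how count() and n() relate, the sketch below checks that count() gives the same table as group_by() followed by summarize(n = n()). The data frame toy is hypothetical.

```r
library(dplyr)

# Hypothetical data, invented for illustration
toy <- data.frame(
  continent = c("Africa", "Africa", "Africa", "Europe", "Europe"),
  lifeExp   = c(50, 54, 58, 76, 80)
)

counts <- toy %>% count(continent)
manual <- toy %>% group_by(continent) %>% summarize(n = n())

# Same counts either way: 3 rows for Africa, 2 for Europe
all.equal(as.data.frame(counts), as.data.frame(manual))   # TRUE

# n() inside summarize() gives the group size needed for a standard error
se <- toy %>%
  group_by(continent) %>%
  summarize(se_lifeExp = sd(lifeExp) / sqrt(n()))
```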

mutate()

Tip

mutate() is a function allowing creation of new variables

INTERACTIVE DEMO

  • Generate new columns with processed data
# Calculate GDP in $billion
gdp_bill <- gapminder %>%
  mutate(gdp_billion = gdpPercap * pop / 10^9)

# Calculate mean/sd of GDP per capita and of GDP ($billion) by continent and year
gdp_bycontinents_byyear <- gdp_bill %>%
    group_by(continent, year) %>%
    summarize(mean_gdpPercap = mean(gdpPercap),
              sd_gdpPercap = sd(gdpPercap),
              mean_gdp_billion = mean(gdp_billion),
              sd_gdp_billion = sd(gdp_billion))

ifelse()

Tip

ifelse() allows us to apply a calculation only to rows that meet a condition when we use mutate()

  • ifelse([CONDITION], [VALUE IF TRUE], [VALUE IF FALSE])

INTERACTIVE DEMO

  • Apply calculations differently to rows that meet specified criteria
gdp_billion_large_countries <- gapminder %>%
  mutate(gdp_billion_large = ifelse(pop > 10e6, 
                                    gdpPercap * pop / 10^9,
                                    NA))
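
ifelse() works element by element, which is why it fits inside mutate(). A base R sketch with hypothetical vectors (the values are invented):

```r
# Hypothetical populations and GDP per capita for three countries
pop       <- c(5e6, 2e7, 8e7)
gdpPercap <- c(30000, 2000, 40000)

# 10e6 is ten million; countries at or below that threshold get NA
gdp_billion_large <- ifelse(pop > 10e6, gdpPercap * pop / 10^9, NA)
gdp_billion_large   # NA 40 3200
```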

5. Tidy Data

Why Tidy Data?

“Tidy datasets are all alike, but every messy dataset is messy in its own way”

Data Cleaning is not just a first step

  • Cleaning is needed whenever new data turns up, new ideas arrive, etc.
  • About 80% of the effort of data analysis is cleaning and preparing data

Principles of Tidy Data provide a standard way to organise data values within a dataset

An Untidy Dataset (1)

What’s wrong with this data?

             John Smith  Jane Doe  Mary Johnson
treatment.A          NA        16             3
treatment.B           2        11             1

An Untidy Dataset (2)

Is this better?

              treatment.A  treatment.B
John Smith             NA            2
Jane Doe               16           11
Mary Johnson            3            1

We need to talk about data semantics

Data Semantics

A dataset is a collection of VALUES

             John Smith  Jane Doe  Mary Johnson
treatment.A          NA        16             3
treatment.B           2        11             1

Each value belongs to a variable and an observation

  • VARIABLES (can change or vary): values that are measured or decided by a researcher: height, temperature, duration, treatment, etc.
  • OBSERVATIONS: values measured across all variables for the same individual/unit/group: person, reactor, religion, company, etc.

Challenge (2min)

In the table below, what are the rows and columns?

             John Smith  Jane Doe  Mary Johnson
treatment.A          NA        16             3
treatment.B           2        11             1

USE CHALLENGE SECTION ON ETHERPAD

  • Rows=OBSERVATIONS, Columns=VARIABLES
  • Rows=OBSERVATIONS, Columns=NEITHER
  • Rows=NEITHER, Columns=VARIABLES
  • Rows=NEITHER, Columns=NEITHER

A Tidy Dataset (1)

The dataset contains 18 values

There are three variables and six observations

VARIABLES

  • person (three people: John, Mary, Jane)
  • treatment (two treatments: a and b)
  • result (six measurements: NA, 16, 3, 2, 11, 1)

Important

Each OBSERVATION includes all three variables

A Tidy Dataset (2)

This is the dataset in tidy form

(aka long form)

name          treatment  result
John Smith    a              NA
Jane Doe      a              16
Mary Johnson  a               3
John Smith    b               2
Jane Doe      b              11
Mary Johnson  b               1

Tidy Data

Tidy data is a standard way of structuring a dataset

  1. Each VARIABLE forms a column
  2. Each OBSERVATION forms a row

This makes it easy to extract needed variables, and is well suited to R (supports vectorisation)
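
As a small illustration, the treatment data can be stored in tidy form as a data frame. The object name tidy is an assumption; the values come from the table shown earlier.

```r
# Tidy form: one column per variable, one row per observation
tidy <- data.frame(
  name      = rep(c("John Smith", "Jane Doe", "Mary Johnson"), times = 2),
  treatment = rep(c("a", "b"), each = 3),
  result    = c(NA, 16, 3, 2, 11, 1)
)

dim(tidy)                   # 6 3  (six observations, three variables)
nrow(tidy) * ncol(tidy)     # 18 values

# Each variable is a single vector, so vectorised functions apply directly
mean(tidy$result, na.rm = TRUE)                           # 6.6
tapply(tidy$result, tidy$treatment, mean, na.rm = TRUE)   # mean result per treatment
```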

INTERACTIVE DEMO

  • Inspect gapminder data
head(gapminder)

Long v Wide

Many R functions are designed to expect long data

  • Long data: each column is a variable, each row is an observation
  • Wide data: each row is often a subject, with its repeated observations spread across several columns

gapminder Wide Dataset

INTERACTIVE DEMO

  • Load and inspect the data
gap_wide <- read.csv("data/gapminder_wide.csv", stringsAsFactors = FALSE)
str(gap_wide)

Pivot: Wide to Long

INTERACTIVE DEMO

  • Pivot wide to long
library(tidyr)   # provides pivot_longer(), pivot_wider() and separate()
gap_long <- gap_wide %>%
  pivot_longer(
    cols = c(starts_with('pop'), starts_with('lifeExp'),
             starts_with('gdpPercap')),
    names_to = "obstype_year", values_to = "obs_values"
  )
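
The effect of pivot_longer() is easier to follow on a very small table. The data frame toy_wide below is hypothetical, with invented values and only two kinds of observation.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide table: one row per country, one column per type/year pair
toy_wide <- data.frame(
  country      = c("Chad", "Peru"),
  pop_2002     = c(9, 27),
  pop_2007     = c(10, 28),
  lifeExp_2002 = c(50, 70),
  lifeExp_2007 = c(51, 71)
)

toy_long <- toy_wide %>%
  pivot_longer(
    cols = c(starts_with("pop"), starts_with("lifeExp")),
    names_to = "obstype_year", values_to = "obs_values"
  )

dim(toy_long)   # 8 3  (2 countries x 4 pivoted columns; country plus two new columns)
```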

separate()

Tip

separate() allows us to split the contents of one column into several columns

  • separate([COLUMN], into=[NEW_COLUMN_NAMES], sep=[SEPARATOR])

INTERACTIVE DEMO

  • Split one column into two
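# Assign the result back so that later steps can use obs_type and year
gap_long <- gap_long %>% separate(obstype_year,
                                  into = c('obs_type', 'year'),
                                  sep = "_")
gap_long$year <- as.integer(gap_long$year)   # separate() returns character columns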
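
A minimal sketch of separate() on a hypothetical two-row table, showing that the new columns start out as character.

```r
library(tidyr)

# Hypothetical long table with a combined type/year column
toy_long <- data.frame(
  obstype_year = c("pop_2002", "lifeExp_2007"),
  obs_values   = c(9, 51)
)

toy_sep <- separate(toy_long, obstype_year,
                    into = c("obs_type", "year"), sep = "_")

typeof(toy_sep$year)                       # "character"
toy_sep$year <- as.integer(toy_sep$year)   # convert the year to a number
names(toy_sep)                             # "obs_type" "year" "obs_values"
```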

Pivot: Long to Wide

INTERACTIVE DEMO

  • Pivot long to wide
gap_normal <- gap_long %>%
  pivot_wider(names_from = obs_type,
              values_from = obs_values)
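
The reverse step can be sketched on a small hypothetical long table (toy_long and its values are invented). Each distinct obs_type becomes its own column.

```r
library(tidyr)

# Hypothetical long table: one row per country, observation type and year
toy_long <- data.frame(
  country    = rep(c("Chad", "Peru"), each = 4),
  obs_type   = rep(rep(c("pop", "lifeExp"), each = 2), times = 2),
  year       = rep(c(2002L, 2007L), times = 4),
  obs_values = c(9, 10, 50, 51, 27, 28, 70, 71)
)

toy_normal <- pivot_wider(toy_long,
                          names_from = obs_type,
                          values_from = obs_values)

dim(toy_normal)     # 4 4  (one row per country and year)
names(toy_normal)   # "country" "year" "pop" "lifeExp"
```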

Challenge (5min)

Challenge

Using the gap_long dataset, calculate the mean life expectancy, population, and GDP per capita for each continent.

Hint: use the group_by() and summarize() functions we learned in the dplyr episode.