6  Sampling From A Population

Important

The distinction between a sample and a population is key to understanding much of statistics.

As is often the case, these words have particular meanings in statistics that differ from everyday use. In statistics we usually extrapolate from a sample to draw inferences about a population.

In an experiment, we often take a single measurement from each member of a group, where all members of the group share some characteristic, such as having a common phenotype or being a negative control. This group is our sample. Values we might measure could include:

Tip: Single measurements of individual group members
  • body weight
  • survival time
  • organ weight
  • litter size

We assume that our group (i.e. our sample) of individuals represents all individuals with those characteristics, and so we use the data we collect to draw conclusions about that much larger set of all similar individuals: the population.

Populations are typically much larger than samples. In some settings, a population might be of finite size. For instance, when taking a political poll of a country, the population size is the number of voters in that country.

In biomedical research, however, we usually assume that our population is infinite in size (or at least effectively infinite).
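The difference between a finite and an effectively infinite population can be sketched in R. This is a minimal illustration, not part of the original analysis: the variable names, the normal distribution, and its mean and standard deviation are all assumptions chosen for the example.

```r
set.seed(7)  # arbitrary seed, for reproducibility only

# Finite population: a fixed set of 1000 hypothetical values.
# A sample is a subset of these values, drawn without replacement.
finite_population <- rnorm(1000, mean = 50, sd = 10)
finite_sample <- sample(finite_population, 10)

# Effectively infinite population: there is no fixed list of individuals,
# so we draw new values directly from the assumed distribution.
infinite_sample <- rnorm(10, mean = 50, sd = 10)

print(mean(finite_sample))
print(mean(infinite_sample))
```

In the finite case every sampled value is one of the listed individuals; in the infinite case each draw is a new individual from the assumed distribution.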

6.1 Random sampling

Important

Much of statistics is based on the assumption that samples are randomly selected from a population.

There are many ways in which samples might not be randomly selected (e.g. take the ten largest individuals; use a brood from a single female).

Question

Should we expect a sample to be exactly representative of the whole population?

For instance, will a subgroup of the population always have exactly the same average body weight as the average body weight of the whole population?

What factors might influence how representative a subsample is of the larger population?

  • Sampling error: by chance the sample you choose may have a higher or lower mean value than that of the population
  • Selection bias: if the choice of sample members is not random, individuals with particular values might be preferentially selected, making the sample unrepresentative of the population
  • Experimental bias: imperfections in experimental design may introduce systematic differences between the sample and the population
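The first two factors can be illustrated with a short R sketch. The population here is hypothetical: 500 body weights drawn from a normal distribution whose mean and standard deviation are arbitrary values chosen for the example.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

# Hypothetical population: 500 body weights (g); mean and sd are assumptions
population <- rnorm(500, mean = 25, sd = 3)
pop_mean <- mean(population)

# Sampling error: two random samples of ten individuals will usually
# give means that differ from each other and from the population mean
random_mean_a <- mean(sample(population, 10))
random_mean_b <- mean(sample(population, 10))

# Selection bias: taking the ten largest individuals overestimates the mean
biased_sample <- sort(population, decreasing = TRUE)[1:10]
biased_mean <- mean(biased_sample)

print(c(population = pop_mean, random_a = random_mean_a,
        random_b = random_mean_b, ten_largest = biased_mean))
```

The random samples scatter around the population mean by chance, whereas the non-random sample is shifted in a consistent direction no matter how often the selection is repeated.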

6.2 Simulation

#| '!! shinylive warning !!': |
#|   shinylive does not work in self-contained HTML documents.
#|   Please set `embed-resources: false` in your metadata.
#| standalone: true
#| viewerHeight: 1400

library(shiny)
library(bslib)
library(dplyr)
library(tidyr)
library(DT)
library(ggplot2)

# Workaround for ggplot2 graphics
if (FALSE) {
  library(munsell)
}

# Formatting settings
figbg = "whitesmoke"

# Shiny UI
ui <- page_navbar(
  title = "Sampling Simulator",
  bg = "#5d9732",
  inverse = TRUE,
  nav_panel(
    title = "Population", 
    
    p("The live plot below shows a frequency histogram (blue) of values in a random, uniformly-distributed population.",
      "The thick dashed line represents the true mean of the population, and the dotted lines represent the mean +/- one and two standard deviations"),
    p("The sliders below the plot control the number and range of values.",
      "The `Generate new population` button will generate a new random population."),
    strong("Click on the menu icon and select `Samples` to see samples from this population."),
    
    # Population-level controls
    plotOutput("plot1"),
    sliderInput("n_points",
                "Population size:",
                min = 200,
                max = 500,
                value = 350,
                width = "100%" ),
    sliderInput("min_max",
                "Population value range:",
                min = -1000,
                max = 1000,
                value = c(-200, 200),
                width = "100%" ),
    actionButton("resample", "Generate new population"),
  ),
  
  nav_panel(
    title = "Samples", 
    
    p("The live plot below shows the mean values (orange squares) of samples taken from the simulated population.",
      "The thick dashed line represents the grand mean of the samples, and the dotted line represents the population mean that is being estimated."),
    p("Sliders below the plot control the number of datapoints in each sample, and the number of samples taken.",
      "The `Resample from population` button will generate a new set of random samples from the population."),
    p("Select the `Show sample datapoints` checkbox to see individual values for each sample."),
    strong("Select `Sample tables` from the menu to see the raw data."),
    
    # Sample-level controls
    plotOutput("plot2"),
    checkboxInput("chkDatapoints",
                  label="Show sample datapoints",
                  value=FALSE),
    sliderInput("n_subsamples",
                "Number of samples:",
                min = 1,
                max = 20,
                value = 1,
                step = 1,
                width = "100%" ),
    sliderInput("subsample_size",
                "Datapoints in each sample:",
                min = 3,
                max = 20,
                value = 3,
                step = 1,
                width = "100%" ),
    actionButton("re_subsample", "Resample from population"),
  ),
  
  nav_panel(
    title = "Sample tables",
    
    p("The table below shows sampled values from the population.",
      "Each column represents a different sample."),
    
    dataTableOutput("table")
  )
)

server <- function(input, output, session) {
  
  dist_data <- eventReactive(c(input$resample, input$n_points, input$min_max),
                            data.frame(vals=runif(input$n_points, input$min_max[1], input$min_max[2])),
                            ignoreNULL=FALSE)
  
  subsamples <- eventReactive(c(input$n_subsamples, input$subsample_size, input$re_subsample,
                               input$resample, input$n_points, input$min_max),
                             data.frame(sample=replicate(input$n_subsamples, sample(dist_data()$vals,
                                                                                  input$subsample_size))),
                             ignoreNULL=FALSE)  
    
  output$plot1 <- renderPlot({
    n_points = input$n_points
    data = dist_data()
    mu = mean(data$vals)
    std = sd(data$vals)
    
    # set up the breakpoints between bars in the histogram
    breaks = seq(mu - 3 * std, mu + 3 * std, by=max(1, std/8))      
    
    ytop = 1.3 * max(density(data$vals)$y)
    
    p = ggplot(data, aes(x=vals))                             # set up the ggplot with data
    p = p + geom_histogram(aes(y=after_stat(density)),        # add a histogram layer (density scale)
                           breaks=breaks,
                           fill="cornflowerblue")
    p = p + annotate("segment",                               # show the mean as a dashed line
                     x=mu, xend=mu, y=0, yend=0.8 * ytop,
                     colour="darkorange1", linewidth=1, linetype="dashed")
    p = p + annotate("segment",                               # show standard deviations as dotted lines
                     x=c(mu-2*std, mu-std, mu+std, mu+2*std),
                     xend=c(mu-2*std, mu-std, mu+std, mu+2*std),
                     y=c(0), 
                     yend=c(0.70 * ytop, 0.75 * ytop, 0.85 * ytop, 0.90 * ytop),
                     colour="goldenrod", linewidth=1, linetype="dotted")
    p = p + annotate("text",                                  # annotate the lines
                     x=c(mu-2*std, mu-std, mu, mu+std, mu+2*std),
                     y=c(0.75 * ytop, 0.80 * ytop, 0.85 * ytop, 0.90 * ytop, 0.95 * ytop),
                     colour="darkorange3",
                     label=c(paste("µ - 2σ=", format(round(mu - 2*std, 2), nsmall=2)),
                             "µ - σ", 
                             paste("µ=", format(round(mu, 2), nsmall=2)), 
                             "µ + σ", 
                             paste("µ + 2σ=", format(round(mu + 2*std, 2), nsmall=2))
                     ))
    p = p + annotate("text",
                     x=0, y=0, colour="darkorange1",
                     label=paste("Population mean, µ=", format(round(mu, 2), nsmall=2)))
    p = p + xlim(-1000, 1000)             # set x-axis limits
    p = p + xlab("measured variable") + ylab("frequency")                # add axis labels
    p = p + theme(plot.background = element_rect(fill = figbg,           # colour background
                                                 color = figbg))         # match the border to the background
    p
  })
  
  output$plot2 <- renderPlot({
    popdata <- dist_data()
    popmu <- mean(popdata$vals)
    
    sampledata <- pivot_longer(subsamples(),
                         starts_with("sample"),
                         names_to = "samples",
                         values_to = "values"
    )
    samplemeans <- sampledata %>% group_by(samples) %>% summarise(means=mean(values))
    
    grandmean <- mean(sampledata$values)
    
    p <- ggplot(sampledata, aes(x=values, y=samples, color=samples))
    if (input$chkDatapoints) {
      p <- p + geom_point(alpha=0.5, size=4)
    }
    # plot each sample's mean along the x-axis (one row per sample on the y-axis)
    p <- p + stat_summary(geom = "point", fun = "mean", orientation = "y", shape=15,
                          size=6, color="darkorange3", alpha=0.8)
    p <- p + geom_vline(xintercept = popmu,
                        colour="goldenrod", linewidth=1, linetype="dotted")
    p <- p + annotate("text",
                      x = popmu, y = -Inf,
                      vjust = -1,
                      label = paste("Population µ=", 
                                    format(round(popmu, 2), nsmall=2)),
                      colour="goldenrod")
    p <- p + geom_vline(xintercept = grandmean,
                        colour="darkorange3", linewidth=1, linetype="dashed")
    p <- p + annotate("text",
                      x = grandmean, y = -Inf,
                      vjust = -3,
                      label = paste("Sample grand µ=", 
                                    format(round(grandmean, 2), nsmall=2)),
                      colour="darkorange3")
    p <- p + theme(legend.position="none")
    p <- p + xlim(input$min_max[1], input$min_max[2])
    p
  })
  
  output$table <- renderDataTable({
    # Uncomment to check means
    # sampledata <- pivot_longer(subsamples(),
    #                      starts_with("sample"),
    #                      names_to = "samples",
    #                      values_to = "values"
    # )
    # samplemeans <- sampledata %>% group_by(samples) %>% summarise(means=mean(values))
    # datatable(samplemeans)
    
    datatable(subsamples())
  })
}

# Create Shiny app ----
shinyApp(ui = ui, server = server)

Figure 6.1: Interactive simulation of a population of random values drawn from a uniform distribution, and random sampling from that population.
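
The core of the simulation can also be sketched without Shiny, in a plain R session. This is a simplified illustration rather than the app's exact code: the seed and the choice of five samples are assumptions, while the population size and value range match the app's default slider settings.

```r
set.seed(42)  # arbitrary seed, for reproducibility only

# Population: 350 values drawn uniformly between -200 and 200
population <- runif(350, min = -200, max = 200)

# Five samples of three datapoints each, drawn without replacement;
# replicate() returns a matrix with one column per sample
samples <- replicate(5, sample(population, 3))

sample_means <- colMeans(samples)  # one mean per sample
grand_mean <- mean(samples)        # mean of all sampled values

print(mean(population))
print(sample_means)
print(grand_mean)
```

Because every sample has the same size, the grand mean of all sampled values equals the mean of the individual sample means.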

6.3 Exercise

The first graph in Figure 6.1 (the Population panel) simulates measured values for a population of individuals, and calculates the exact population mean.

The second graph (the Samples panel) simulates taking a number of samples, each containing multiple individuals, from that population, and calculates a sample mean for each of them.

In this exercise, you will use the sliders in Figure 6.1 to explore how sample mean values relate to the population mean, as you vary the number of individuals in each sample.

6.3.1 Generate new random populations

  1. Read the description in Figure 6.1.
  2. In Figure 6.1, click the Generate new population button to see that the histogram and population mean both update, representing a new random population.
  3. Move the Population size slider to see how this affects the histogram and reported mean value.
  4. Move the Population value range sliders to see how they affect the histogram and reported mean value.

6.3.2 Generate samples from a population

  1. In the first graph of Figure 6.1, set Population size to 500 and Population value range to {-200, +200}.
  2. Click on the menu button in Figure 6.1 (three horizontal lines at the top right of the figure).
  3. Click on Samples and read the description.
  4. Click on Resample from population a few times to see that the sample mean value updates.

Questions

Before you start: ensure that the number of samples is set to 1, and the number of datapoints in each sample is set to 3.

  1. Increase Number of samples to 20 to represent multiple resamplings from the population.
  2. How widely are the sample means spread around the population mean?
  3. What is the largest difference between the population mean and any single sample mean?
  4. Click on Resample from population to see how this changes with multiple resamplings.
  5. Increase Datapoints in each sample to 20.
  6. How widely are the sample means spread around the population mean now?
  7. What is the largest difference between the population mean and any single sample mean?
  8. Click on Resample from population to see how this changes with multiple resamplings.

6.4 Summary

Tip: Sample size affects estimation accuracy

When making a measurement in an experiment we are often working with individuals as a sample from a larger population. We usually assume that the sample is broadly representative of the population, and calculate a sample mean value as an estimate of the population mean.

The precision of our estimate depends on the number of individuals in our sample. The larger the sample size, the closer we expect the sample mean to lie to the population mean.
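
This relationship can be checked numerically. The sketch below is an illustration under stated assumptions: it uses the population settings from the exercise, an arbitrary seed, a hypothetical helper function `sd_of_means()`, and 2000 repeated samples per sample size. It compares the spread of sample means for samples of 3 and 20 datapoints with the approximate theoretical value, the population standard deviation divided by the square root of the sample size.

```r
set.seed(123)  # arbitrary seed, for reproducibility only

# Population matching the exercise: 500 uniform values between -200 and 200
population <- runif(500, min = -200, max = 200)

# Standard deviation of the means of many samples of a given size
# (hypothetical helper; n_reps = 2000 is an arbitrary choice)
sd_of_means <- function(sample_size, n_reps = 2000) {
  sd(replicate(n_reps, mean(sample(population, sample_size))))
}

spread_n3 <- sd_of_means(3)
spread_n20 <- sd_of_means(20)

# Approximate theoretical values, sigma / sqrt(n), ignoring the small
# correction for sampling without replacement from a finite population
expected <- sd(population) / sqrt(c(3, 20))

print(c(n3 = spread_n3, n20 = spread_n20))
print(expected)
```

The simulated spreads should lie close to the theoretical values, with the means of the larger samples clustering more tightly around the population mean.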