A major point, on which I cannot yet hope for universal agreement, is that our focus must be on questions, not models […] Models can - and will - get us into deep troubles if we expect them to tell us what the unique proper questions are. - J.W. Tukey (1977)
The goal of this part of the workshop is to introduce, or reintroduce, common statistical tools and ideas, and to help you develop insight into how they work and when they should - and perhaps also when they ought not to - be applied. This is not a formal mathematical presentation of statistical methods or statistical theory. Rather, this workshop is a primer that describes and discusses how and why statistical methods are applied to answer research questions, and prompts you to critically assess how they are applied and interpreted, both in the literature and in your own work.
Although we will not dive into theory or mathematics in detail, you should understand that statistical methods are mathematical in nature, and are not constructed in an ad hoc way. There is sound theory behind them, even when we might criticise a particular application of a method.
These days, the implementation of statistical methods is handled almost exclusively by computers - so much so that a purely mathematical approach, or an intuition-led one (like this workshop's), is insufficient on its own. The same statistical model might be implemented computationally in more than one way, and the differences between those approaches can matter a great deal.
The best way to explore and understand statistics is via computing tools. R is the most common environment for statistical analysis in scientific research. Gaining some familiarity with R will help your research, and maybe even your future employability. In any case, it encourages the kind of good practice we would like to see the scientists of tomorrow adopt:

- In R, each action is implemented as a command in a script, and this documents itself to the extent that an analysis should be replicable just by re-running the same script on the same data (see the short sketch after this list).
- R is, itself, open-source - as are nearly all the statistical packages you can use with R. This has two main benefits. Firstly, many eyes can inspect the code for errors or problems and improve it, so the tools become more robust over time - unlike proprietary tools, where errors and mistakes may not be easily spotted, or ever fixed. Secondly, it encourages a wider user-base and community which, as it grows, includes more people who can help improve tools like R and its packages, but also help out with your own use of R.
- A popular, free environment for working with R is RStudio (though we'll not talk about it in detail, here).
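As a minimal sketch of that self-documenting, replicable style - using R's built-in `iris` dataset so the script runs as-is; the analysis itself is just an illustrative example, not part of this workshop's material:

```r
# A minimal sketch of a scripted, replicable analysis. Re-running this
# script always reproduces the same output, and the commands themselves
# document each step of the analysis.

data(iris)                                        # load a built-in example dataset
summary(iris$Sepal.Length)                        # describe one variable
model <- lm(Sepal.Length ~ Species, data = iris)  # fit a simple linear model
summary(model)                                    # report the fitted model
```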
In the spirit of openness, each notebook will have a section where you can read and inspect the R code used in it. In addition, all this material is developed openly and can be downloaded from GitHub, and run on your own computer. If you like, you can suggest modifications and extra material through the website linked below:

GitHub repository

Statistics is a science that aims to solve problems and answer questions mathematically, by representing and analysing data.
Statistics is applicable to any field that collects data, including medicine, biology, chemistry, engineering, and the social sciences. Data can be complex, and some practical questions can only be answered by applying statistical methods and models to that data.
What is a model, though?
In all these cases, a model is a simplified representation of something else. We do not expect every Swiss person to eat exactly 22.36lb of chocolate a year, but it could greatly simplify some calculations to imagine that they do.
We might want to simplify the “real thing” to make it easier for others to understand (like the tube map), but in statistics we often do this to make calculations easier, or possible at all.
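As a toy illustration of this kind of simplification - the 22.36lb figure comes from the text above, but the spread and sample size are made-up assumptions:

```r
# A minimal sketch: the "model" Swiss chocolate-eater consumes exactly the
# average amount, while simulated real consumption varies person to person.

set.seed(42)                                  # make the simulation repeatable
avg_lb <- 22.36                               # the simplified model: one number
people <- rnorm(1000, mean = avg_lb, sd = 5)  # assumed spread of 5lb (hypothetical)

mean(people)   # close to 22.36 - the model captures the aggregate...
range(people)  # ...but no individual is expected to match it exactly
```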
We are often concerned with the “truth” of some answer to our scientific questions. For instance, is it “true” that our drug, when administered, produces a measurable effect? What is the “truth” about how much active compound there is in each mass-produced medication capsule?
But statistical models are neither true nor false. They are tools, or constructs, that we have developed to enable us to answer our questions. They are powerful tools, for sure, able to carry out exact calculations (nearly) the same way every time. But they are not wise. They do not know when the context they are being used in is inappropriate. They do not know what your question was, or even what question you really meant to ask. They are procedures that operate on data, and they will give you answers, but they may not give you good answers, or even the answer to the question you thought you were asking. A large part of the skill of statistics is framing an appropriate question, and understanding which tool is appropriate for that question.

Many courses will teach a “Hypothesis Testing Decision Tree,” or something similar (see below). This may look familiar to you from courses you have studied before.
These flowcharts do not encourage much in the way of thought about the underlying structures, models and processes of these methods. Some people might present this flowchart as if the small collection of tests shown in it can give good answers in all circumstances, but this is not the case!
The classical statistical methods shown in these flowcharts are often the most appropriate - that’s why we teach them. But they can be inflexible and fragile, failing in unpredictable ways when they meet contexts that don’t fit their underlying assumptions. Such novel contexts may well arise in your research, and flowcharts like these don’t tell you where the boundaries of applicability lie. This is why it is important to understand your question, your data, and the methods themselves.
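As a small illustration, with made-up data, of how a flowchart-chosen test gives an answer regardless of whether its assumptions hold:

```r
# A minimal sketch: Student's t-test assumes roughly normal data, but it
# will still return a p-value and confidence interval for small, heavily
# skewed samples - with no warning that it is being used outside the
# context it was designed for.

set.seed(1)
group_a <- rlnorm(10, meanlog = 0, sdlog = 2)  # small, very skewed sample
group_b <- rlnorm(10, meanlog = 0, sdlog = 2)  # drawn from the same distribution

t.test(group_a, group_b)  # an answer is produced either way; the tool is not wise
```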
Statistics is a toolbox containing many tools that will make your life easier if you know when they should be used.
In this workshop, we aim to help you develop some intuition, insight and tools to help you understand common statistical methods you might encounter, why they are used and - occasionally - why they perhaps should not have been used. It’s perhaps true that you can’t truly understand a statistical approach unless you understand the mathematics behind it - and sometimes the computational implementation, too - but you can get a long way by exploring examples.
One of our goals is to encourage you to look beyond a rote application of “The Null Ritual” (coined in Gigerenzer (2004) J. Socio-Econ. doi:10.1016/j.socec.2004.09.033), as though it’s a hammer and every statistical problem is a nail to be hammered down flat (to p<0.05).
The Null Ritual, as Gigerenzer describes it, runs like this:

1. Set up a statistical null hypothesis of “no mean difference” or “zero correlation”. Don’t specify the predictions of your own research hypothesis, or of any alternative hypotheses.
2. Use 5% as a convention for rejecting the null. If the result is significant, accept your research hypothesis.
3. Always perform this procedure.
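In code, the ritual might look like this - a deliberately mechanical sketch, where the data and the `treatment`/`control` names are invented for illustration:

```r
# A minimal sketch of the Null Ritual, performed mechanically: a null of
# "no mean difference", the 5% convention, and a binary verdict - with no
# thought given to effect sizes, assumptions, or the actual research question.

set.seed(123)
control   <- rnorm(20, mean = 10, sd = 2)  # invented example data
treatment <- rnorm(20, mean = 11, sd = 2)

result <- t.test(treatment, control)       # test the null of no mean difference
if (result$p.value < 0.05) {               # the 5% convention
  message("p < 0.05: accept the research hypothesis")  # the ritual's verdict
} else {
  message("p >= 0.05: report 'no effect'")
}
```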
You will see this ritual practised mechanically in many published papers. It is a bad habit in science.
The Null Ritual is frequently observed in publications where Null Hypothesis Significance Testing (NHST) has been performed. We explore NHST in more detail in some supplementary material.
This workshop does not attempt to teach you the “right” way to do statistics, in the sense of “use tool X for problem Y”.
There is no recipe for the “right” way to do statistics. And there is no shortcut for understanding the question you wish to ask with your experiment, or whether your chosen statistical method is likely to actually answer that question.
We are instead trying to help you explore some statistical thinking: understanding the bases of some common statistical tools, their applications, and their limitations. This will help you determine whether a particular approach can answer a stated question. Sometimes, you might find you need to learn how to use a new tool.
Whichever tool you use, understanding your own experiment and data (or, if you’re critically assessing someone else’s research, theirs) should be the foundational starting point for every analysis.
If I (LP) were to recommend three tools to learn, and a book to learn them from, I would point you to Richard McElreath’s Statistical Rethinking, which covers:

- Bayesian data analysis
- multilevel (hierarchical) models
- causal inference with directed acyclic graphs (DAGs)