Lesson 1-3 - Notes

Lesson 1. Introduction to R and R Studio

Slide 1

  • Do you have the latest version of R and R Studio installed?
  • Housekeeping notes

Slide 2

Why use R and R Studio?

  • You probably have some data that you want to analyse (could be any type of data).
  • The raw data by themselves may not tell you very much - you likely want to do some sort of analysis, perhaps some statistics, or some type of visualisation. Data on its own doesn’t mean much unless it is accurately interpreted and clearly communicated to the correct audience.
  • There are many different tools you can use to analyse your data - how do we choose the most appropriate tool? Why should we choose R?

Some of the advantages to using R/R Studio:

  • Powerful statistical tool (can easily do most of the analyses that you’ll want to do)
  • Documentation and reproducibility (can write scripts to document your work, and other scientists can then reproduce what you have done)
  • Free and open-source (extensive community support and availability of “packages”)

Slide 3

A note on terminology - R (the programming language) versus R Studio (the integrated development environment or IDE).

Slide 4: Live demo:

Introduction to the R Studio console

  1. Open R Studio

You will see 3 panes: Console/Terminal/Jobs, Environment/History/Connections/Tutorial, and Files/Plots/Packages/Help/Viewer/Presentation. You can change between panes and resize them to suit your needs. (Later on, when you are working with scripts, you will see 4 panes.)

  2. Familiarising ourselves with the console

You can use the ‘>’ prompt in the console to type commands and press “Enter” to execute them.

Note

If you see a “+” at the prompt in the console, this indicates that you have entered an incomplete command, and R Studio is waiting for you to complete it. You can hit “Esc” to cancel the command, and this should restore the “>” prompt.

Commands you can run include things like:

  • Performing basic mathematical operations
  • Using different functions
  • Managing your R Studio environment
  • And much more!
Note

We will be working in the console for now, but you can also use these commands in R scripts, and you will see later how this can be very useful for your work.

Basic mathematical operations

R can carry out basic mathematical operations (just like a calculator, Excel, or many other software tools). It will do so using the rules that normally apply to maths [parentheses; exponents; multiply; divide; add; subtract].

1 + 100
[1] 101
Note

You want to make your code as readable as possible.

Use parentheses as necessary (to change from the default order of evaluation, or if needed to make your code clearer for a reader).

Use comments (with the # notation) to explain your code. You may not remember what you were trying to do, when you come back to look at your own code - and you want other people to be able to read and use your code too!

Use scientific notation where necessary (very large or very small numbers can be hard for humans to read).

#Demonstrate the order of operations
3 + 5 * 2 # This will multiply 5 * 2 first, following maths rules
[1] 13
(3 + 5) * 2 # This will execute the expression inside the parentheses first
[1] 16
#Demonstrate how to use parentheses and comments
3 + 5 * 2 ^ 2       # This is clear, if you remember the rules
[1] 23
(3 + (5 * (2 ^ 2))) # This uses parentheses, but is hard to read
[1] 23
3 + 5 * (2 ^ 2)     # This uses parentheses in a way that might help you remember the rules
[1] 23
#Format numbers readably
2/10000 # Very small numbers are hard to read
[1] 2e-04
2e-04 # We can format numbers using scientific notation
[1] 2e-04
# Remember, e is shorthand for “multiplied by 10^XX”

5e3 # e can be used with positive exponents too
[1] 5000

Assigning and using variables

Sometimes we will want to store values or information for reuse later on in our code. We can do this by using variables. You can assign a value to a variable using the operator <- or =.

Note

You should be consistent - use the same assignment operator throughout your code. (This makes your code much easier to read.)

Also note, you do not want to confuse the assignment operator = with the comparison operator == (which is used to test whether two values or variables are equal to one another).
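As a minimal sketch of the difference between assigning and comparing (the variable names a and b are used only for illustration):

```r
a <- 10   # assignment using <-
b = 10    # assignment using = (this works, but pick one style and stick to it)
a == b    # comparison: asks whether a and b are equal, prints [1] TRUE
a == 5    # prints [1] FALSE; the value of a is not changed by a comparison
```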

x <- 1/40 # Assign the value 1/40 to a variable called x
Important

What do you expect to happen when you enter the command above into the console?

What actually happens, and why?

You will not see the result of the calculation printed in the console. Instead, you’ll see the value for x listed in your environment tab. R Studio is storing the result of the calculation for you (in this case, a floating point number) - but you haven’t actually asked it to print out x, so it won’t.

You can enter a variable into the console, and it will print out the value of that variable - or you can use a function, print() to print the value of the variable.
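For example, either of the last two lines in this short sketch should display the stored value:

```r
x <- 1/40   # assignment only: nothing is printed
x           # typing the variable name prints its value: [1] 0.025
print(x)    # print() does the same: [1] 0.025
```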

Note

Note, you can reassign a variable, and this will update the value.

The right hand side of the assignment can be any valid R expression, and it will be fully evaluated before the assignment occurs.
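A small sketch of reassignment, showing that the right hand side is evaluated with the old value before the new value is stored:

```r
x <- 1/40    # x starts as 0.025
x <- 100     # reassigning replaces the old value; x is now 100
x <- x + 1   # the right hand side (100 + 1) is evaluated first, then stored in x
print(x)     # [1] 101
```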

You can also assign vectors (see footnote 1), and functions, to variables. This is quite useful and will come in handy later.

x <- 1:5 # using the : operator to generate a sequence of numbers 1-5, which are then assigned to the variable x
x <- 5:1 # using the : operator in the reverse direction of above
x <- c(1, 8, 59) # You can also use a function called c() to create a vector and assign it to a variable - we will learn more about functions shortly.
print(x) #prints the current value of x
[1]  1  8 59
Variable names

You want to be mindful about choosing names for variables. Good variable names are short and readable (corresponding to their function in the code). Nothing stops me from giving a variable that tracks the growth rate of my bacteria a name like apples or rainbow, or any other valid variable name - but it will be easier for anyone who reads my code, or for me in the future, to understand the variable if it is named GrowthRate.

You can make long variable names more readable by using periods, underscores, or camelcase: periods.between.words, underscores_between_words, camelCaseToSeparateWords (again, be consistent).

Rules for naming variables:

  • They can contain letters, numbers, underscores, periods

  • They cannot contain spaces

  • They must start with a letter, or with a period that is not followed by a number. (Variables beginning with a period are hidden variables.)

  • They cannot start with a number or an underscore
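One way to check these rules from the console is sketched below. It relies on the base R function make.names(), which converts any string into a syntactically valid name; a candidate that comes back unchanged was already valid. The candidate names themselves are made up for illustration.

```r
# Hypothetical candidate names, some valid and some not
candidate_names <- c("growth_rate", "growth.rate", ".hidden",
                     "2fast", "_temp", "my var")

# A name is valid if make.names() leaves it unchanged
is_valid <- candidate_names == make.names(candidate_names)

# Show each candidate next to the result of the check
data.frame(name = candidate_names, valid = is_valid)
```

The first three candidates should be reported as valid, and the last three (starting with a number, starting with an underscore, and containing a space) as invalid.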

Using basic functions

Functions are groups of organised instructions that will carry out a specific task. They can be built-in (part of base R), provided by packages, or written by you.

Many functions take one or more arguments (which tell the function how to carry out its task).

Functions can be simple mathematical calculations (e.g., logarithms, trigonometry functions), or they can be used to help manage your environment in R.

#Using functions
log(1)  # Natural logarithm
getwd() # Returns the absolute filepath of the current working directory
ls() # Lists all the variables and functions stored in the global environment (your working R session)
ls(all.names=TRUE) # This argument alters the behaviour of the function so that it lists hidden variables and functions as well
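As a further sketch of how arguments change what a function does, here are two base R functions called with and without an optional named argument (the numbers are arbitrary):

```r
log(100)                     # one argument: natural logarithm of 100
log(100, base = 10)          # the base argument switches to base 10: [1] 2
round(3.14159)               # default is 0 decimal places: [1] 3
round(3.14159, digits = 2)   # the digits argument keeps 2 decimal places: [1] 3.14
```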
Note

You can use the autocomplete function in R to help you remember the names of functions, or you can look them up (Google or help pages). We will talk more about how to get help in R later.

As part of managing your environment, you might want to delete objects (either a specific object, or all the objects in your environment).

rm() # This is a function that allows you to delete objects from your environment - you need to pass an argument to it, to tell it what to delete
rm(x) # You can delete a specific variable (x)
rm(list = ls()) # You can pass the results of the ls() function to the rm() function - this will delete everything in your environment
Note

Note how we combined two functions above - this can be a very powerful tool when you are working in R. Anything in the innermost parentheses will be evaluated first when you combine functions like this.

Note that when you are assigning values to arguments by name, you must use the = operator (NOT <-). You will get an error message, or possibly unintended side effects, if you use <-.
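A minimal sketch of the kind of unintended side effect this can cause, using round() purely as an illustration (the variable names rounded and side_effect are made up):

```r
# Intended form: = names the argument, and no new variable is created
rounded <- round(3.14159, digits = 2)

# With <-, R first creates a variable called digits in your environment,
# then passes its value (2) to round() by position
side_effect <- round(3.14159, digits <- 2)

print(side_effect)   # [1] 3.14, the same result in this case
print(digits)        # [1] 2, an extra variable you did not mean to create
```

Here the result happens to be the same, but with other functions the value may end up in the wrong argument, or you may get an error.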

Using R packages

In many cases you will want to expand beyond the functions available in base R. One way of doing this is by adding a package - packages extend the R language and can be downloaded from a repository like CRAN.

You can manage packages in the console, or using the “Packages” pane.

installed.packages() # Shows what packages are installed
install.packages("packagename") # Installs a new package, where packagename is the name of the package (must be in quotes)
update.packages() # Updates installed packages
remove.packages("packagename") # Removes a package, where packagename is the name of the package
library(packagename) # Loads the package, packagename, for use - note, no quote marks

We will see in later lessons how we can use the functions available in different packages, like ggplot2.
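A common pattern, sketched here with ggplot2 as the example, is to check whether a package is already installed before trying to load it. The variable names pkg and is_installed are only for illustration.

```r
pkg <- "ggplot2"   # name of the package to check, as a string

# requireNamespace() returns TRUE if the package is installed, FALSE otherwise
is_installed <- requireNamespace(pkg, quietly = TRUE)

if (is_installed) {
  # character.only = TRUE tells library() that pkg holds the name as a string
  library(pkg, character.only = TRUE)
} else {
  message("Package ", pkg, " is not installed - run install.packages(\"", pkg, "\") first")
}
```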

Challenges

Slides 6-11

Slide 12 - Lesson 1 key points

Lesson 2. Project Management with R Studio

Slide 13

Housekeeping notes:

Slide 14

If we don’t manage our projects effectively and efficiently, we make our lives harder than they need to be. We don’t want to waste a lot of time searching for the right file - or worse, using the wrong file.

Ethically speaking, we have an obligation to work with data in particular ways, to maintain good scientific practices and ensure the integrity of our data.

Also, as scientists, we very often work collaboratively - so we need to use a project management system that works with our colleagues, and that they can understand and use effectively too. We also usually need to upload our data, and the scripts used to generate our analyses, as supplementary information when we submit manuscripts. Reviewers need to see that you have analysed your data correctly.

Slide 15

Unfortunately, even if we start out with good intentions, sometimes scientific projects look something like this comic. What is wrong with the system shown here?

  • Project structure is unclear: this is a “data” folder inside a “research” folder - but “research” is quite a vague folder name. Is this one project, or are there several projects commingled here?

  • The “data_date_description.dat” filenames have some issues:

    • “data” is a vague start to the file names - it would be better to be more precise about the type of data. A more informative file name would help the user find what they are looking for more quickly.

    • The file names contain value judgements (“huh??”, “WTF”, “crap”) - this is mixing the raw data with the analysis/interpretation of the data, and should be avoided

    • The file names contain nonstandard characters, which is probably best avoided

  • There is a mixture of different kinds of files - this folder is named “data”, but it also contains analysis, a thesis outline, and notes from a meeting. This makes it harder to find the files when you need them.

  • The subfolder named “JUNK…” does not have an informative name

  • The folder is missing any sort of README.txt file describing the data, how they were generated, and any associated metadata

Slide 16

What features or design elements make a project layout good or not so good?

Note

Every project is different, and there are many different sensible ways to organise most projects. It is important that you choose an organisational system that works for you, and for the project.

Slides 17-19

Some tips for best practices

  • Preserve the original data, and make sure that you can easily distinguish between the original data and modified forms of the data.

    • Original data (and sometimes, cleaned data) should be treated as read-only

    • Any data you generate (your output) should be treated as disposable (you should be able to regenerate it from the original/cleaned data using your scripts)

  • Organise your project in a way that makes it easy for you to find things, and to relate the output (e.g., figures or tables) to the original data and the code used to generate them

  • Organise your projects in such a way that they are readable and shareable with other people (including your colleagues, manuscript reviewers, or your future self)

  • Separate the definition of a function from its application - as you put reusable chunks of code into functions, you might store these in separate folders, so the reusable functions can be used across different analyses and projects, and the analysis scripts are stored in their own directory.

  • Follow the Good Enough Practices for Scientific Computing:

    • Put each project in its own directory, which is named after the project.

    • Put text documents associated with the project in the doc directory.

    • Put raw data and metadata in the data directory, and files generated during cleanup and analysis in a results directory.

    • Put source for the project’s scripts and programs in the src directory, and programs brought in from elsewhere or compiled locally in the bin directory.

    • Name all files to reflect their content or function.

  • Include a descriptive README.txt file in each folder, containing a clear description of the folder’s contents (this is useful for your future self, who doesn’t remember what you were doing, or for collaborators - who aren’t psychic and can’t read your mind)

  • Use version control to manage your projects

  • Consider the FAIR principles (Findable, Accessible, Interoperable, and Reusable) from the beginning when planning your projects
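The directory layout suggested in the tips above could be created from the console roughly as follows. This is only a sketch: the project is created inside a temporary directory so that nothing on your computer is overwritten, and the project name "my_project" and the README text are placeholders.

```r
# Hypothetical project location inside R's temporary directory
project_dir <- file.path(tempdir(), "my_project")
dir.create(project_dir, showWarnings = FALSE)

# Create the subdirectories named in the tips above
for (sub in c("data", "doc", "results", "src", "bin")) {
  dir.create(file.path(project_dir, sub), showWarnings = FALSE)
}

# Add a placeholder README describing the project
writeLines("Describe the project, the data, and how they were generated.",
           file.path(project_dir, "README.txt"))

# List what was created
list.files(project_dir)
```

In a real project you would replace tempdir() with the folder where you keep your work, or let RStudio create the project directory for you.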

    Slide 18: Live demo

    Create a new project

    Challenge #1: We’re going to create a new project in RStudio:

    1. Click the “File” menu button, then “New Project”.

    2. Click “New Directory”.

    3. Click “New Project”.

    4. Type in the name of the directory to store your project, e.g. “my_project”.

    5. If available, select the checkbox for “Create a git repository.”

    6. Click the “Create Project” button.

      Challenge 2: Opening an RStudio project through the file system

      1. Exit RStudio.

      2. Navigate to the directory where you created a project in Challenge 1.

      3. Double click on the .Rproj file in that directory.

    Save your data into the project

    Download the gapminder data from this link to a csv file.

    1. Download the file (right mouse click on the link above -> “Save link as” / “Save file as”, or click on the link and after the page loads, press Ctrl+S or choose File -> “Save page as”)

    2. Make sure it’s saved under the name gapminder_data.csv

    3. Save the file in the data/ folder within your project.

    We will load and inspect these data later.

    Commit your changes with a short informative commit message.

    Change your working directory

You can check the current working directory with the getwd() command, or by using the menus in RStudio.

  1. In the console, type getwd() (“wd” is short for “working directory”) and hit Enter.

  2. In the Files pane, double click on the data folder to open it (or navigate to any other folder you wish). To get the Files pane back to the current working directory, click “More” and then select “Go To Working Directory”.

You can change the working directory with setwd(), or by using RStudio menus.

  1. In the console, type setwd("data") and hit Enter. Type getwd() and hit Enter to see the new working directory.

  2. In the menus at the top of the RStudio window, click the “Session” menu button, and then select “Set Working Directory” and then “Choose Directory”. Next, in the windows navigator that opens, navigate back to the project directory, and click “Open”. Note that a setwd command will automatically appear in the console.
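The console steps above can be sketched as a short sequence that also puts the working directory back where it started. The temporary directory is used here only as a folder that is guaranteed to exist; in the lesson you would use "data" instead.

```r
original_wd <- getwd()   # remember where we started
setwd(tempdir())         # change to another folder (here: R's temporary directory)
getwd()                  # prints the new working directory
setwd(original_wd)       # go back to the original working directory
getwd()                  # prints the original working directory again
```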

More information about using version control with RStudio: https://swcarpentry.github.io/git-novice/14-supplemental-rstudio.html

Slide 21 - Lesson 2 Key Points

Lesson 3. Seeking help

Slide 22

Housekeeping and Intro

Slide 23 - sources of help

Getting help within R itself - help files and vignettes

#you can use ?function_name or help(function_name) to get help with a specific function
help(write.table)
?write.table
?write.csv
#you can do a fuzzy search with ??function_name if you remember part of the function name
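Two other base R helpers can be useful when you only half remember a function. This is a small sketch: the search pattern "^write" and the variable name matches are just examples.

```r
# apropos() lists objects whose names match a pattern
# (here: names starting with "write")
matches <- apropos("^write")
print(matches)

# args() shows the arguments a function accepts, without opening the help page
args(write.csv)
```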

Slide 24 - using LLMs (caveat emptor)

LLMs can be very helpful when writing code - or they can be completely wrong and unhelpful. The best way to make them more helpful is to know a fair bit about coding yourself so that you have better judgement about their output.

Slide 25 - Lesson 3 Key Points

Footnotes

  1. A set of values, all of the same data type, in a certain order↩︎