Primer: Environment Setup

Setting up your computational environment for the AY2023-24 project

This page walks through setting up a computing environment similar to the one you will need for your AY2023-24 project.

Prerequisites

This primer assumes the following:

  • you are working on your own computer (e.g. a laptop/desktop machine)
  • you are able to start a terminal window (e.g. Terminal on a Mac or Linux machine, or in Windows Subsystem for Linux)

You should either be familiar with the command line or have worked through the Software Carpentry shell lesson.

The primer walks through the following steps:

  1. Downloading the SIPBSCompBiol project template repository
  2. Creating and activating a conda environment for your project
  3. Installing the ncbi-genome-download package and using it to download a set of genomes

1 Set up a project folder from a template

Note

Please read the Noble (2009) and Sandve et al. (2013) papers at the Ten Great Papers page, and the README.md file at the template repository to understand the principles behind the organisation of this template.

1.1 Download and extract the project template

  1. Use your browser to visit the project template repository at https://github.com/sipbs-compbiol/template_bioinformatics_project.
  2. Click on the small triangle at the right of the green Code button to see the download options for the template.

Context menu for the Code download button
  3. Click on Download ZIP to download a .zip format compressed file containing the repository template to your own computer, in the usual way you would download a file using your browser. This will place a file called template_bioinformatics_project-master.zip in your browser's download location.

When you uncompress the repository template, it will expand into the full set of folders and subfolders. You can move this folder tree after you have uncompressed the file, or you can move the .zip file to the directory where you want to keep the project files (e.g. under Documents on a Mac), and uncompress it there. The choice is yours.

  4. Uncompress the file (e.g. by double-clicking on a Mac). This will produce a folder called template_bioinformatics_project-master in the same folder as the compressed file.
Make sure you actually uncompress the file

Some operating systems, such as Windows, allow you to work with compressed files as if they had been uncompressed without actually uncompressing them.

You do need to fully uncompress the repository template to work with it.

  5. Start a terminal and navigate to the top-level folder in the repository template.
Tip

I saved the .zip file in the Documents folder on my computer, and uncompressed that file in the same location. My project template is therefore in the ~/Documents/template_bioinformatics_project-master directory. I can navigate to this location using the cd command, and check the contents using the command ls.

[NOTE: The % symbol is the command prompt, not part of the command itself. This may look different on your computer (e.g. it might be a dollar sign: $).]

% cd ~/Documents/template_bioinformatics_project-master
% ls
LICENSE      README.md    _config.yml  assets/      data/        docs/        notebooks/   results/     scripts/
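
If you prefer to stay in the terminal, the uncompress-and-navigate steps above can be sketched as follows. This is only an illustration: it assumes the .zip file was saved to ~/Downloads and that you want the project under ~/Documents (both locations are assumptions, so adjust them to match your computer), and that the unzip program is installed. Commands are shown without the % prompt.

```shell
# Assumed locations: change these to wherever your browser saved the file
# and wherever you want to keep your project.
ZIP_FILE="$HOME/Downloads/template_bioinformatics_project-master.zip"
DEST_DIR="$HOME/Documents"

if [ -f "$ZIP_FILE" ]; then
    # Fully uncompress the template into the destination folder
    unzip -q "$ZIP_FILE" -d "$DEST_DIR"
    # List the top-level contents of the uncompressed template
    ls "$DEST_DIR/template_bioinformatics_project-master"
else
    echo "No ZIP file found at $ZIP_FILE"
fi
```

Uncompressing with unzip always produces real files on disk, which avoids the problem described above of working inside a compressed file without noticing.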

2 Create and activate a conda environment for your project

Warning

Before beginning this part of the guide, you should make yourself familiar with the conda documentation at the link below:

You will need to have already installed the conda package for your operating system.

On Mac and Linux, this is usually a straightforward software installation, and you can follow the instructions at the link below:

On Windows, you must distinguish between installing Anaconda under Windows and installing it under WSL (Windows Subsystem for Linux). You will be using WSL for your project, so you must install conda within WSL: WSL cannot see the version of conda installed under the Windows operating system and, in any case, software built for a Linux OS will not typically run on a Windows OS. There are several online guides to help with this.

  1. Create a new conda environment by issuing the command below.
% conda create --name bm954_project -y

As above, the % symbol is the command prompt - not part of the command itself. You do not type this when entering and running your command.

  • conda: you are running the conda program and instructing it to carry out an action
  • create: this is the action you are instructing conda to undertake, to create a new environment
  • --name bm954_project: this option tells conda what name the new environment is to be known by; you can choose any name you like, and do not have to use bm954_project if you would prefer to use something else. However, it is usually a good idea to name the environment such that you can tell immediately what it is used for
  • -y: this answers all the optional questions about installation which conda might ask with “yes”
  2. Activate the new conda environment you have just created by issuing the command below (if you called your environment something other than bm954_project, then use that name instead, here).
% conda activate bm954_project

You should notice that the left-most part of your command prompt changes from (base) to (bm954_project) (or whatever name you used for your environment).

You can use the command conda info --envs (demonstrated below) to list all the environments that your installation of conda is aware of:

% conda info --envs
# conda environments:
#
base                     /Users/lpritc/opt/anaconda3
2022_david               /Users/lpritc/opt/anaconda3/envs/2022_david
2022_nora                /Users/lpritc/opt/anaconda3/envs/2022_nora
algo_2022_py39           /Users/lpritc/opt/anaconda3/envs/algo_2022_py39
aoc2021                  /Users/lpritc/opt/anaconda3/envs/aoc2021
aoc2022                  /Users/lpritc/opt/anaconda3/envs/aoc2022
aoc2023                  /Users/lpritc/opt/anaconda3/envs/aoc2023
bm432_py310              /Users/lpritc/opt/anaconda3/envs/bm432_py310
bm954_project         *  /Users/lpritc/opt/anaconda3/envs/bm954_proj
[...]
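
As a quick check that activation worked, you can also inspect the CONDA_DEFAULT_ENV variable, which conda sets to the name of the currently active environment. The sketch below assumes your environment is called bm954_project; the variable names EXPECTED_ENV and env_status are just illustrative.

```shell
# Name of the environment you expect to be active (change if you used another name)
EXPECTED_ENV="bm954_project"

# conda sets CONDA_DEFAULT_ENV when an environment is activated;
# the :- form gives an empty string if the variable is unset
if [ "${CONDA_DEFAULT_ENV:-}" = "$EXPECTED_ENV" ]; then
    env_status="active"
else
    env_status="inactive"
fi

echo "Environment $EXPECTED_ENV is $env_status"
```

If this reports that the environment is inactive, run conda activate bm954_project again in the same terminal window before installing any software.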

3 Install ncbi-genome-download and recover some genomes

One of the main advantages of conda is that it handles technical issues of package management (the process of installing and configuring software tools) and makes using them much easier. The typical command we might use would be conda install <PROGRAM NAME> where <PROGRAM NAME> is replaced by the name of the software you want to install.

In particular, conda is a common software distribution route for bioinformatics software, through the conda channel called bioconda.

Important

In order to use bioconda on your computer, you will need to configure the channel.

3.1 Configuring the bioconda channel

Warning

Before configuring bioconda you should be aware of the information at the link below:

To set up the bioconda channel on your computer, you should follow the instructions at the bioconda website, and issue the four commands below, in sequence:

conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
conda config --set channel_priority strict

You only need to do this once on your computer, and bioconda will then be available in any environment you create.
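
The order of the four commands matters, because each --add places the new channel at the top of the list, so the channel added last (conda-forge) ends up with the highest priority. A minimal way to confirm the result is sketched below; the variable name channel_report is just illustrative, and the check is skipped if conda is not on your PATH.

```shell
if command -v conda >/dev/null 2>&1; then
    # Show the configured channels (highest priority first) and the priority mode
    channel_report="$(conda config --show channels; conda config --show channel_priority)"
else
    channel_report="conda not found on PATH"
fi

echo "$channel_report"
```

After the configuration above, you would expect the channels to be listed in the order conda-forge, bioconda, defaults, with channel_priority set to strict.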

3.2 Installing ncbi-genome-download

To install the ncbi-genome-download package, we use the conda install command, as noted above. This will take a short while to download and configure all the dependencies of the tool (packages that need to be installed for the software to run). The process you see in the terminal should resemble that shown below.

% conda install ncbi-genome-download -y
Channels:
 - conda-forge
 - bioconda
 - defaults
Platform: osx-64
Collecting package metadata (repodata.json):
[...]
The following NEW packages will be INSTALLED:

  appdirs            conda-forge/noarch::appdirs-1.4.4-pyh9f0ad1d_0
[...]
  xz                 conda-forge/osx-64::xz-5.2.6-h775f41a_0



Downloading and Extracting Packages:

Preparing transaction: done
Verifying transaction: done
Executing transaction: done

Once this is complete, the ncbi-genome-download program should be available to you. You can test this by asking the software for its help guide, using the command ncbi-genome-download -h as demonstrated below:

% ncbi-genome-download -h
usage: ncbi-genome-download [-h] [-s {refseq,genbank}] [-F FILE_FORMATS] [-l ASSEMBLY_LEVELS] [-g GENERA] [--genus GENERA]
                            [--fuzzy-genus] [-S STRAINS] [-T SPECIES_TAXIDS] [-t TAXIDS] [-A ASSEMBLY_ACCESSIONS]
                            [--fuzzy-accessions] [-R REFSEQ_CATEGORIES] [--refseq-category REFSEQ_CATEGORIES] [-o OUTPUT]
                            [--flat-output] [-H] [-P] [-u URI] [-p N] [-r N] [-m METADATA_TABLE] [-n] [-N] [-v] [-d] [-V]
                            [-M TYPE_MATERIALS]
                            groups
[...]

3.3 Downloading a set of genome sequences

Warning

Before downloading genomes, you should familiarise yourself with the information at the links below:

For your project, you will want to download all complete genome sequences for a single genus of bacteria. This section of the guide will illustrate that process for the genus Dickeya.

The command to download all genomes of a single named genus of bacteria is ncbi-genome-download --genera "<GENUS>" bacteria, where <GENUS> is replaced by the actual name of the genus you want to download.

To restrict downloaded genomes only to those that are complete, we use the option --assembly-levels complete.

To download all available sequence data, including GBFF (annotated) and FASTA (sequence data only), we need to use the option --formats all.

If we wanted the program to keep us informed about what it was doing, we would ask it to be verbose by using the option --verbose or -v.
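
Because a whole genus can amount to a large download, it can be worth previewing what would be fetched before committing to it. The sketch below assumes that the -n flag listed in the usage message above is the dry-run switch, written --dry-run in its long form; confirm this with ncbi-genome-download -h on your own installation. The GENUS variable is just an illustrative convenience.

```shell
# Genus to preview; Dickeya matches the example used in this guide
GENUS="Dickeya"

if command -v ncbi-genome-download >/dev/null 2>&1; then
    # Assumption: --dry-run (-n) lists matching assemblies without downloading files
    ncbi-genome-download --dry-run --assembly-levels complete \
        --genera "$GENUS" --formats all bacteria \
        || echo "Dry run failed: check your network connection and the genus name"
else
    echo "ncbi-genome-download not found: activate your conda environment first"
fi
```

If the list of assemblies looks sensible, you can then run the same command without --dry-run to perform the real download.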

Putting this together to download genomes for the genus Dickeya we would use the command as demonstrated below:

% ncbi-genome-download --assembly-levels complete --genera "Dickeya" --formats all --verbose bacteria

This will place the downloaded genomes into individual subdirectories, under the subdirectory refseq/bacteria. These genomes are compressed using the gzip program. We can tell this because the filenames end in .gz.

% tree refseq
refseq
└── bacteria
    ├── GCF_000147055.1
    │   ├── GCF_000147055.1_ASM14705v1_assembly_report.txt
    │   ├── GCF_000147055.1_ASM14705v1_assembly_stats.txt
    │   ├── GCF_000147055.1_ASM14705v1_cds_from_genomic.fna.gz
    │   ├── GCF_000147055.1_ASM14705v1_feature_table.txt.gz
    │   ├── GCF_000147055.1_ASM14705v1_genomic.fna.gz
    │   ├── GCF_000147055.1_ASM14705v1_genomic.gbff.gz
    │   ├── GCF_000147055.1_ASM14705v1_genomic.gff.gz
    │   ├── GCF_000147055.1_ASM14705v1_protein.faa.gz
    │   ├── GCF_000147055.1_ASM14705v1_protein.gpff.gz
    │   ├── GCF_000147055.1_ASM14705v1_rna_from_genomic.fna.gz
    │   ├── GCF_000147055.1_ASM14705v1_translated_cds.faa.gz
[...]    

The downloaded files are in a number of different formats, and you will not need to use all of them in your project.
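
To get a first look at what was downloaded, you can count the genome directories and peek inside the compressed sequence files without uncompressing them on disk. The sketch below is one way to do this, assuming you run it from the directory that contains refseq; the helper name first_header is hypothetical and used only for illustration.

```shell
# Location of the downloaded genomes, relative to the current directory
GENOME_DIR="refseq/bacteria"

# Print the first line (the FASTA header) of a gzip-compressed file
first_header() {
    gzip -dc "$1" | head -n 1
}

if [ -d "$GENOME_DIR" ]; then
    # Each genome sits in its own subdirectory, named by assembly accession
    echo "Genome directories: $(ls "$GENOME_DIR" | wc -l)"
    for fasta in "$GENOME_DIR"/*/*_genomic.fna.gz; do
        # Skip the unexpanded pattern if no files matched
        [ -f "$fasta" ] || continue
        # Skip the CDS and RNA files, whose names also end in _genomic.fna.gz
        case "$fasta" in
            *_from_genomic.fna.gz) continue ;;
        esac
        first_header "$fasta"
    done
else
    echo "No $GENOME_DIR directory here; run the download first"
fi
```

Using gzip -dc writes the uncompressed text to the terminal only, so the .gz files are left unchanged.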

4 Summary

This page provided a guide to setting up your project, in a way analogous to preparing your bench for laboratory work.

You first used a template to set up your project’s folder structure, which is a little like making sure you have a well-organised work area.

You then installed conda and set up a new environment for your project, which plays a similar role to making sure your bench is clean and practising a kind of “aseptic technique” so that installed bioinformatics tools don’t conflict with each other.

You then installed ncbi-genome-download and obtained a set of bacterial genomes directly from NCBI. This is similar to acquiring a set of strains from a collection, and having them ready to work with on your benchtop.

Your next steps using these genomes will be to prepare them so that they can be used in your experiments.