BM954 SOPs

SOPs for AY2023-24 BM954 BMS MSc Projects

This page outlines SOPs (Standard Operating Procedures) for the AY2023-24 BM954 BMS MSc Projects. Please use the menu in the sidebar to navigate to specific sections and activities.

In general, the SOPs will provide links to detailed instructions and documentation, and then provide a short walkthrough that can be used as an example from which to begin your own work. This is not a set of instructions for how to carry out your research project.

Important

All activities/procedures on this page assume that you are working on a Linux (including Windows Subsystem Linux) or macOS machine. These instructions have not been tested on a Windows OS.

1 Installing conda and bioconda

First, please see the instructions at the following locations:

1.1 Walkthrough (Linux)

1.1.1 Installing conda:

# Work in your home directory
$ cd ~
# Download the miniconda installer
$ wget https://repo.anaconda.com/miniconda/Miniconda3-py312_24.4.0-0-Linux-x86_64.sh
# Run the miniconda installer
$ bash Miniconda3-py312_24.4.0-0-Linux-x86_64.sh

After executing the last command in the example above, you will be prompted for information and responses. If you are uncertain what to respond with, use the defaults (i.e. press Return)

Tip

When the installation is finished, conda will be installed, but not immediately available.

To make conda available, close your terminal window, then open a new terminal window. You should see that your command prompt changes from something like:

$

to something like

(base) $

That is, you should see that the terminal now recognises you are working in a conda environment called base

1.1.2 Installing bioconda

# Add the conda channels needed for bioconda
(base) $ conda config --add channels defaults
(base) $ conda config --add channels bioconda
(base) $ conda config --add channels conda-forge
(base) $ conda config --set channel_priority strict

You may see messages from the software during this process. Unless there are obvious error messages, there should be nothing to be concerned about.

2 Creating and activating a conda environment for your project

First, please see the instructions at the following location

  • https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html

2.1 Walkthrough

# Create a new conda environment called "MSc_Project"
(base) $ conda create --name MSc_Project -y
(base) $ conda activate MSc_Project
Important

The name MSc_Project is chosen for the example, but you can call your project environment anything you like - but it is usually good practice to name it so that it’s obvious what you are working on.

Note

The -y argument used in the command above answers “yes” to all the yes/no questions conda may ask during the installaion.

Tip

After activating the MSc_Project environment, you should see that your command prompt changes from something like:

(base) $

to something like

(MSc_Project) $

That is, you should see that the terminal now recognises you are working in a conda environment called MSc_Project

Important

You should carry out all the work for your project while in your project environment.

3 Installing cazy_webscraper

cazy_webscraper is a software package that will help you download the complete CAZy online database to your computer, and

  1. extract data from the local database
  2. incorporate data from other online databases into your local database
Important

You should carry out all the work for your project while in your project environment.

3.1 Walkthrough

# cazy_webscraper is available through bioconda, so use conda to install it
(MSc_Project) $ conda install cazy_webscraper bioservices -y
Note

The -y argument used in the command above answers “yes” to all the yes/no questions conda may ask during the installaion.

4 Downloading a CAZy database

Please read the documentation at the following links:

4.1 Walkthrough

# Download the complete CAZy database
(MSc_Project) $ cazy_webscraper use.your.actual@email.address.here
# Download a CAZy database only for the genus Ochrobactrum
(MSc_Project) $ cazy_webscraper --species Ochrobactrum use.your.actual@email.address.here
Important

Be sure to replace the text in the example that reads use.your.actual@email.address.here with your actual email address.

If you want to download CAZy data for a different genus or species, then replace Ochrobactrum with your target taxon name.

This will create a new SQLite3 database in your current directory with a name like cazy_webscraper_2024-06-17_08-43-52.db (but substituting the date and time with the date and time that you ran the program). That’s quite a lot to type in each time you want to use it, so you can use a symbolic link to make an alias to the database with a shorter name, as below.

# Make a short name alias (`cazydb.db`) for the local CAZy database
(MSc_Project) $ ln -s cazy_webscraper_2024-06-17_08-43-52.db cazydb.db
Important

Be sure to replace the filename in the example (cazy_webscraper_2024-06-17_08-43-52.db) with your actual database’s filename.

Tip

You can check that the alias has been created using the ls command to list the contents of the current directory:

(MSc_Project) $ ls -l
total 368
-rw-r--r--  1 lpritc  staff   184K 17 Jun 08:45 cazy_webscraper_2024-06-17_08-43-52.db
lrwxr-xr-x  1 lpritc  staff    38B 17 Jun 08:46 cazydb.db@ -> cazy_webscraper_2024-06-17_08-43-52.db

5 Adding GenBank sequence data to the local CAZy database

cazy_webscraper provides commands that download sequence data from GenBank, for each sequence in the local CAZy database. Please read the documentation at:

5.1 Walkthrough

# Download the corresponding UniProt protein sequence for each protein in the local CAZy database
(MSc_Project) $ cw_get_genbank_seqs cazydb.db use.your.actual@email.address.here
Important

Be sure to replace the text in the example that reads use.your.actual@email.address.here with your actual email address.

Note

The walkthrough is using the name of the alias to the database file, rather than the actual database filename.

You may see several messages relating to Batch contains an accession no longer listed in NCBI or Querying NCBI returns the error - this may affect individual records but does not otherwise indicate a problem with the download.

6 Extracting GenBank sequence data from a local CAZy database

cazy_webscraper provides commands that allow you to convert data from the database to a format useful in downstream bioinformatics programs. Please read the documentation at:

To complete this part of the data extraction for multiple families in a reasonable amount of time you will need to get a list of the CAZy families present in the data you downloaded - this will need you to execute a small amount of SQL code in the databse. You will also need to use a shell script to automate extracting the sequence data for each family.

Note

The walkthrough below will guide you through the small amount of SQL necessary to get a list of families.

A shell script file will be provided to help you use the list to extract the sequence data, but there are no commands that you have not met already, if you worked through the Carpentries shell and SQL lessons.

6.1 Walkthrough

# Extract all GenBank sequences from the local CAZy database
(MSc_Project) $ cw_extract_db_seqs cazydb.db genbank --fasta_file seqs/all_sequences.fasta
# Extract only GH3 sequences from the local CAZy database
(MSc_Project) $ cw_extract_db_seqs cazydb.db genbank --fasta_file GH3/GH3_sequences.fasta --families GH3
Warning

Please be sure to include a directory (e.g. the seqs in seqs/all_sequences.fasta) when using the cw_extract_db_seqs command, or else it may attempt to delete the contents of the current directory.

To obtain the list of CAZy families present in the current database, you need to open your database file, e.g.

# Open the cazydb.db file in sqlite3
(MSc_Project) $ sqlite3 cazydb.db
SQLite version 3.43.2 2023-10-10 13:08:14
Enter ".help" for usage hints.
sqlite>

This will drop you into an SQL prompt for the database interpreter - that is what starts with sqlite> in the example above.

To extract the list of families you need to do two things:

  1. Tell sqlite3 the name of the file to write to
  2. Perform the database query (the result will be written to the file you named)
# Tell sqlite what file to write output to
sqlite> .output cazy_families.txt
# Select all families from the `CazyFamilies` table
sqlite> SELECT family FROM CazyFamilies;
# Quite sqlite3
sqlite> .exit

This will write the list of CAZy family names found in the database to the file cazy_families.txt, which you can confirm by looking at the first few lines.

# Peek at the first few lines of the cazy_families.txt file
(MSc_Project) $ head cazy_families.txt
GH26
GH94
GT1
GT121
GT32

We can use this file with a short script to extract all sequences from the database for each of the CAZy families. The content of this script file is:

FILE="./cazy_families.txt"

while read family
do
  echo "Writing sequences for ${family}"
  cmd="cw_extract_db_seqs cazydb.db genbank --fasta_file ${family}/${family}_sequences.fasta --families ${family}"
  ${cmd}
done < ${FILE}

You can download this script from the link below (right click on the link and choose Save Link As... or similar from the context menu)

Once the script is downloaded, you can run it with the command

# Run the downloaded Bash script to extract CAZy family sequences from the database
(MSc_Project) $ sh extract_families.sh

You should see the output from multiple extraction runs on your screen, as the sequences are extracted.