4 Making a Phylogenetic Tree – BM211 Workshop 5: Phylogenetic Trees

4.1 Introduction

In this section, you will start with a set of bacterial protein sequences and use these to construct and visualise a phylogenetic tree, using online bioinformatics software tools. To do this you will move through the following stages:

Acquire sequence data
Align sequences
Construct a tree from the alignment
Visualise the tree

Note

Although the principles and the order of actions taken here are the same as we would use at a research level, most research-level phylogenetic reconstruction you will see is likely to use a different set of tools than those employed here.

4.1.1 Scenario

You are a biomedical researcher interested in evolution of the pathogen Burkholderia pseudomallei, which causes melioidosis in humans (Chewapreecha et al. (2017)).

Figure 4.1: Large, septated hepatic abscess due to hepatic melioidosis (honeycomb structure in the liver). Case courtesy of Ian Bickle, Radiopaedia.org. From the case rID: 36526.

Depending on whether you live in a medically well-resourced, or poorly-resourced, area the risk of death from infection varies from 10%-40%. The pathogen itself has the capability to infect a range of human cells and evade the immune system, in part because it produces proteins called effectors: these are proteins adapted to interfere with host biology and make the environment more favourable for the bacterium.

The protein you are studying, BipC, is such an effector protein. Specifically, it is a translocator protein that makes holes in the host’s cells, allowing its own effector proteins to enter and manipulate host biochemistry, and also binds to actin microfilaments, disrupting host cell structure.

You would like to understand the evolution of this protein, to see whether the knowledge can help you design drug compounds that interfere with the action of BipC as a treatment for melioidosis. To do this, you will produce a phylogenetic tree for the protein.

4.2 Obtain sequences from UniProt

The UniProt database is the most comprehensive protein sequence database, combining protein sequence, structure, and annotation data. UniProt has an entry for a relative of the protein you are interested in, which has accession Q62B08.

Task

Go to the UniProt database link and find the record associated with accession number Q62B08.

I need a hint!

On the main UniProt landing page, enter the accession of the sequence you are searching for in the field under Find your protein and click Search

Figure 4.2: The main UniProt landing page

Check your result

You should find yourself at the page for Q62B08, BIPC_BURMA.

Figure 4.3: The UniProt `Q62B08` page header.

4.2.1 Exploring the UniProt record

The UniProt database can provide a great deal of information about the proteins it contains. This can be a great help in understanding more about your protein, and finding useful links to other sources of information. To help you get greater knowledge about BipC, please answer the formative questions below in the MyPlace quiz

Question 1

What organism does this sequence come from?

I need a hint!

Look at the header for the record page.

Question 2

How many amino acids does this protein contain (i.e. how long is the sequence)?

I need a hint!

Look at the header for the record page, or click on Sequence in the sidebar.

Question 3

According to the Gene Ontology (GO), what is the subcellular location of this protein?

I need a hint!

Click on Subcellular Location in the sidebar, and then look at the GO Annotation tab.

Question 4

How many identical (100% identity) sequences are there in the UniProt database?

I need a hint!

Click on Similar Proteins and look at the table.

Question 5

What are the accessions of the identical sequences in Burkholderia pseudomallei

I need a hint!

Can you find this organism in the Similar Proteins table?

4.2.2 Downloading sequence homologues

We only construct phylogenetic trees from protein sequences that are related to each other - it wouldn’t make much sense to try to infer relationships between unrelated sequences!. When we say that two sequences are related we mean that they share a recognisable common ancestor.

Note

Two sequences that share a common ancestor are said to be homologous.

On homology

Homology - the state of two sequences sharing a common ancestor - is an all-or-nothing state. Two sequences are either homologous to each other, or they are not. It is incorrect to talk about gradation or percentage of homology (though you will encounter people doing so).

UniProt provides a precomputed cluster of all the proteins it knows about that are potentially homologous to any other protein sequence in its database, and you can find this by clicking on the Similar Proteins link in the sidebar of the record page. This will take you to a table of similar protein sequences (Figure 4.4).

Figure 4.4: The `Similar Proteins` table for `Q62B08`

This table has three tabs: 100% identity, 90% identity, and 50% identity. Clicking on these will show you the clusters of sequences in UniProt sharing at least that level of sequence identity with the current record. In general, the number of sequences in the cluster increases as the percentage identity used to gather the cluster goes down.

Task

View all sequences in the 50% identity cluster for Q62B08.

I need a hint!

On the UniProt record page for Q62B08 click on the Similar Proteins link in the sidebar to get the similar proteins table.
Click on the 50% identity tab to see the first few proteins in the cluster.
Click on View All to see the full list of sequences in the cluster.

Check your result

You should find yourself at the UniRef cluster page with 50% identity to Q62B08.

Figure 4.5: The UniProt `Q62B08` 50% sequence identity cluster page header.

This set of sequences are all closely-enough related to each other for drawing a phylogenetic tree to be worthwhile. Several different species are represented, which should make it an informative tree in regards to the evolution of this protein effector. You will use this set of sequences to construct your tree.

Task

Download all sequences (not compressed) in the 50% identity cluster for Q62B08 to your computer.

I need a hint!

Click on the Download link at the top of the table. This will bring up a set of options (Figure 4.6).
Make sure Download all and Compressed - No are selected.
Click on Download and save the file somewhere you can find it again (e.g. on your Desktop).

Figure 4.6: The download options for UniProt sequence clusters.

Check your result

Your sequence file should be the same as that which you can download from this link.

Please answer the formative questions below in the MyPlace quiz

Question 6

How many sequences are there in this cluster?

I need a hint!

Look at the header for the cluster page.

Question 7

How long is the shortest sequence in this cluster?

I need a hint!

Look in the Length column of the cluster table.

Question 8

How many species are represented in the dataset?

I need a hint!

Look in the Organism column of the cluster table.

4.3 Make a multiple sequence alignment

The sequences in the cluster you have downloaded from UniProt are different lengths, and are not identical to each other.

If we were to compare animal body parts to understand their evolution, we would naturally try to compare only equivalent elements, such as forelimbs, with each other. It wouldn’t be reasonable to compare organ structure with skull shape, say.In the same way, for biological sequence data, we want to compare genetic changes at equivalent sites, which means we have to find which sites are equivalent before we build the tree.

We find equivalent sites by aligning sequences against each other in a Multiple Sequence Alignment (MSA). Detailed accounts of the methods involved in multiple sequence alignment are beyond the scope of the course, but they can be similar in some cases to the pairwise sequence alignment process that you have already seen in BM214.

You are going to employ a widely-used research tool called MUSCLE to make a multiple sequence alignment from the sequences in the cluster you downloaded. This program is availble to install on your own computer, but can also be used through an online service (Figure 4.7).

MUSCLE online service: https://www.ebi.ac.uk/jdispatcher/msa/muscle

Figure 4.7: `MUSCLE` online service landing page.

Task

Use the online MUSCLE service to generate a multiple sequence alignment from your cluster sequences.

I need a hint!

On the MUSCLE service landing page, click on Browse… and use the dialogue box that appears to navigate to your downloaded cluster sequences.
Scroll down the page and click Submit

Check your result

Your alignment should look similar to the one in Figure 4.8.

Figure 4.8: `MUSCLE` multiple sequence alignment.

Note

The output from the online tool can be very helpful. You can use your mouse to scroll around the alignment visualisation to see which regions are similar, and which differ. There are also options for colouring the residues in the alignment by a range of schemes, to highlight potentially conserved biochemical properties of the residues (Figure 4.9).

Figure 4.9: `MUSCLE` multiple sequence alignment highlighting polar residues.

Task

Download the CLUSTAL format MUSCLE sequence alignment file to your computer.

I need a hint!

On the sequence alignment page, click on the Result Files tab.
Find the Alignment in CLUSTAL format line in the table, click on the `Download button next to it, and use the dialogue box to save the file somewhere you can find it.

Check your result

Your alignment file should be the same as the one you can download from this link

4.4 Construct a phylogenetic tree

Now you have a sequence alignment, you are ready to build a phylogenetic tree.

Warning

For this workshop we are using a relatively simple tool that will give us a phylogenetic tree quickly. The approach we are using here is known to have systematic issues and would not be used in a modern research setting. However, it is a method that was very widely-used until the early 2000s and is still sometimes seen in current research.

You are going to use a tool called Simple Phylogeny to make a phylogenetic tree from your multiple sequence alignment (Figure 4.10). This tool implements tree generation methods from the ClustalW software tool, and is restricted to two relatively basic clustering approaches:

UPGMA (Unweighted Pair Group Method with Arithmetic Mean) link
Neighbour Joining link

We will use this program via its online service (Figure 4.10)

Simple Phylogeny online service: https://www.ebi.ac.uk/jdispatcher/phylogeny/simple_phylogeny)

Figure 4.10: `Simple Phylogeny` online service landing page.

Task

Use the online Simple Phylogeny service to generate a Neighbour-Joining phylogenetic tree from your alignment.

I need a hint!

Open the sequence alignment file you downloaded in a text editor (Use Notepad or similar - DO NOT USE MICROSOFT WORD).
Copy the data from the file and paste it into the Input Sequence field.
You can leave all the options as their default settings (Figure 4.11).
Scroll down the page and click Submit

Figure 4.11: Default settings for `Simple Phylogeny` tree generation.

Check your result

The output should look similar to the tree data in Figure 4.12. Note that this file is in .newick format, but puts each parenthesis (() on a new line.

Figure 4.12: `Simple Phylogeny` tree output.

Task

Download the .newick format tree file to your computer.

I need a hint!

On the Simple Phylogeny output page, click on the Result Files tab.
Find the Phylogenetic Tree line in the table, click on the `Download button next to it, and use the dialogue box to save the file somewhere you can find it.

Check your result

Your tree file should be the same as that which you can download from this link

4.5 Visualise the tree

You will now use iTol to visualise the tree you produced. Use what you have learned from Chapter 2 to view the tree.

Task

Use iToL to visualise the .newick tree file you generated with Simple Phylogeny.

I need a hint!

Click on the Upload button in iTol
Click on the Browse… button and use the dialogue box to navigate to the location of the tree file you downloaded.
Click the Upload button

Check your result

The tree file shown in iTol should resemble Figure 4.13.

Figure 4.13: The tree generated for BipC by `Simple Phylogeny`, visualised in `iTol`

4.6 Summary

Well Done!

After successfully working through this section you should be able to:

use UniProt to obtain annotation and functional information about a protein sequence
use UniProt to identify and download homologues of a protein sequence
generate a protein multiple sequence alignment using MUSCLE
produce a Neighbour-Joining phylogenetic tree using Simple Phylogeny
visualise a phylogenetic tree using iToL

Important

Please complete the workshop by answering the questions below in the formative quiz on MyPlace

MyPlace formative quiz