13  Building a Phylogenetic Tree

Your colleague has been very helpful again

The next step in the investigation would normally be to identify and align variant regions between each of the genomes in this study. That could take some time - more than we have today - and, luckily, your helpful colleague has already done this for you. They used the ParSNP tool (Kille et al. (2024)) to compare the draft genomes of 33 isolates from several rooms on the burns ward to a common reference (ERR531413_asm) and obtain all the identifiable sequence differences.

ParSNP is a tool in the Harvest suite of bioinformatics programs. It was written to very rapidly identify and align the core genomes1 of hundreds to thousands of bacterial sequences.

To work, ParSNP requires a reference genome to align to - here that reference was the assembly of the ERR531413 isolate from this study. It identified core regions shared between all the study isolates and the reference, and examined those core regions for sequence differences (variants) such as SNPs and indels, which it reported as a sequence alignment and a VCF file.

Using Galaxy, genome assembly would take too long for this workshop.

They have provided the results for you to download in two files, below - the VCF file of variants (.vcf file), and a sequence alignment containing all of these variants (.fasta file):

  1. Download the alignment and VCF file to the local folder containing your workshop files
  2. Upload both files to Galaxy
Video: Upload the sequence variant files

13.1 What isolates are being analysed?

The isolate genomes your colleague has assembled came from the beds and sources shown in Table 15.1.

Note
  • GenomeID is the accession number for the read data.
  • SampleID is the sample number from the study.
  • PatientID uniquely identifies a patient.
  • Bed indicates the corresponding bed for the patient.
  • Specimen indicates whether the isolate was sampled directly from the patient, from a water sample, or swabbed from an environmental surface.
  • Source names the kind of surface or substance that was sampled.
  • Days indicates the number of days after the study start at which the isolate was collected.
Table 13.1: Sequenced isolates

13.2 Inferring a phylogenetic tree with RaxML

Given a sequence alignment like the one in the study.snps.fasta file above, one can use many different methods to reconstruct a proposed evolutionary history leading to the observed alignment. Such histories are typically trees2, and methods to draw them include clustering approaches such as:

which are applications of general clustering methods to biology using pairwise similarity scores between sequences. These methods have almost completely been superseded by statistical inference methods including:

which adopt a statistical model (sometimes very complex models) of the likelihood of base or amino acid substitution, and attempt to find the tree that “best fits” the data, given that model3.

The tool you will use here is RaxML (Stamatakis (2014)), one of several packages that uses maximum likelihood to infer the “best tree” that fits the data.

RaxML is given an input sequence set and told a model of evolution to use (here it’s a simple GTRGAMMA substitution model). It initially creates a number of quickly-built approximate trees and runs through a process of optimising these by modifying and editing each tree, and checking to see if it is a “better fit” to the data than the tree it derived from. If it is a better fit, it is retained. This process continues until no improvement is seen, or some other stopping criterion is met.

  1. Navigate to RaxML in the Tools sidebar
  2. Set the source file to be the .fasta file you just uploaded
  3. Check that the Model type is Nucleotide with the GTRGAMMA model
  4. Click on Run Tool
Video: Inferring a tree using RaxML

13.3 Inspect the tree file and visualise the tree in Galaxy

  1. Click on the Eye icon next to the Best-scoring ML Tree output
QUESTIONS
  1. Does this look like a tree?
  2. Can you find your assembly and/or the reference listed in the tree?

The file produced by RaxML as the Best-scoring ML tree is in a format known as Newick which uses parentheses and commas to describe tree structure.

The numbers in the output represent the distance from each of the nodes in the tree (some are genomes, some are internal nodes interpreted as historical branching events) to its parent.

  1. Navigate to ETE tree viewer in the Tools sidebar
  2. In the Newick Tree to visualise field, select the Best-scoring ML Tree output
  3. Set Resolve Taxonomic IDs to No
  4. Click on Run Tool
  5. When the visualisation finishes, click on the Eye icon next to the output
Video: Visualising the tree in Galaxy
QUESTIONS
  1. Can you find your assembly and/or the reference listed in the tree?
  2. Your isolate ERR531380 was sampled from bed 11. Using Table 15.1 to identify the other bed 11 isolates, find where they are in the tree. What do you notice?

You will notice numbers (usually “1”) at internal nodes of the tree. These represent bootstrap values, a statistical method for estimating confidence in an outcome.

In phylogenetics, this often involves resampling the initial dataset (building a new alignment by randomly taking columns from the original input, and allowing repetition) many times and building a new tree each time. The bootstrap value is then the proportion of times a particular branching event is seen in the full set of all trees (always present = 1, never present except in the original tree = 0).

Here, we did not perform bootstrapping, so the value of “1” is meaningless for understanding the robustness or confidence we should have in our tree.

13.4 Next Steps

You should have seen that the bed 11 isolates all seem to cluster together. It’s difficult to keep track of this just on the basic tree, and it would be useful to be able to annotate the tree with a visual representation of the sample type and room for each isolate. You will do this in Chapter 14 using iTOL.


  1. The core genome is the set of sequences shared by all genomes in a group. It can be thought of as the regions of sequence that characterise what makes members of that group members of that group (like all birds have wings - wings are a core feature of birds).↩︎

  2. A tree is the name we give to a kind of graph (in the mathematical sense) where there are no loops (known as cycles). The biological interpretation of these is typically that bifurcations (i.e. branching points) represent sequence divergence at some point in history, and that edge length/branch length represents the degree of difference between two points on the graph (i.e. how different two sequences are).↩︎

  3. The concept is analogous to regression of a straight line through a set of points, but it’s regressing a tree through a set of sequences.↩︎