13 Building a Phylogenetic Tree
The next step in the investigation would normally be to identify and align variant regions between each of the genomes in this study. That could take some time - more than we have today - and, luckily, your helpful colleague has already done this for you. They used the ParSNP tool (Kille et al. (2024)) to compare the draft genomes of 33 isolates from several rooms on the burns ward to a common reference (ERR531413_asm) and obtain all the identifiable sequence differences.
ParSNP do?
ParSNP is a tool in the Harvest suite of bioinformatics programs. It was written to very rapidly identify and align the core genomes1 of hundreds to thousands of bacterial sequences.
To work, ParSNP requires a reference genome to align to - here that reference was the assembly of the ERR531413 isolate from this study. It identified core regions shared between all the study isolates and the reference, and examined those core regions for sequence differences (variants) such as SNPs and indels, which it reported as a sequence alignment and a VCF file.
Using Galaxy, genome assembly would take too long for this workshop.
They have provided the results for you to download in two files, below - the VCF file of variants (.vcf file), and a sequence alignment containing all of these variants (.fasta file):
- Download the alignment and VCF file to the local folder containing your workshop files
- Upload both files to
Galaxy
13.1 What isolates are being analysed?
The isolate genomes your colleague has assembled came from the beds and sources shown in Table 15.1.
GenomeIDis the accession number for the read data.SampleIDis the sample number from the study.PatientIDuniquely identifies a patient.Bedindicates the corresponding bed for the patient.Specimenindicates whether the isolate was sampled directly from the patient, from a water sample, or swabbed from an environmental surface.Sourcenames the kind of surface or substance that was sampled.Daysindicates the number of days after the study start at which the isolate was collected.
13.2 Inferring a phylogenetic tree with RaxML
Given a sequence alignment like the one in the study.snps.fasta file above, one can use many different methods to reconstruct a proposed evolutionary history leading to the observed alignment. Such histories are typically trees2, and methods to draw them include clustering approaches such as:
UPGMA(Unweighted Pair Group Method with Arithmetic mean)- Neighbour-Joining
which are applications of general clustering methods to biology using pairwise similarity scores between sequences. These methods have almost completely been superseded by statistical inference methods including:
which adopt a statistical model (sometimes very complex models) of the likelihood of base or amino acid substitution, and attempt to find the tree that “best fits” the data, given that model3.
The tool you will use here is RaxML (Stamatakis (2014)), one of several packages that uses maximum likelihood to infer the “best tree” that fits the data.
RaxML do?
RaxML is given an input sequence set and told a model of evolution to use (here it’s a simple GTRGAMMA substitution model). It initially creates a number of quickly-built approximate trees and runs through a process of optimising these by modifying and editing each tree, and checking to see if it is a “better fit” to the data than the tree it derived from. If it is a better fit, it is retained. This process continues until no improvement is seen, or some other stopping criterion is met.
- Navigate to
RaxMLin theToolssidebar - Set the source file to be the
.fastafile you just uploaded - Check that the
Model typeisNucleotidewith theGTRGAMMAmodel - Click on
Run Tool
RaxML
13.3 Inspect the tree file and visualise the tree in Galaxy
- Click on the
Eyeicon next to theBest-scoring ML Treeoutput
- Does this look like a tree?
- Can you find your assembly and/or the reference listed in the tree?
The file produced by RaxML as the Best-scoring ML tree is in a format known as Newick which uses parentheses and commas to describe tree structure.
The numbers in the output represent the distance from each of the nodes in the tree (some are genomes, some are internal nodes interpreted as historical branching events) to its parent.
- Navigate to
ETE tree viewerin theToolssidebar - In the
Newick Tree to visualisefield, select theBest-scoring ML Treeoutput - Set
Resolve Taxonomic IDstoNo - Click on
Run Tool - When the visualisation finishes, click on the
Eyeicon next to the output
Galaxy
- Can you find your assembly and/or the reference listed in the tree?
- Your isolate
ERR531380was sampled from bed 11. Using Table 15.1 to identify the other bed 11 isolates, find where they are in the tree. What do you notice?
You will notice numbers (usually “1”) at internal nodes of the tree. These represent bootstrap values, a statistical method for estimating confidence in an outcome.
In phylogenetics, this often involves resampling the initial dataset (building a new alignment by randomly taking columns from the original input, and allowing repetition) many times and building a new tree each time. The bootstrap value is then the proportion of times a particular branching event is seen in the full set of all trees (always present = 1, never present except in the original tree = 0).
Here, we did not perform bootstrapping, so the value of “1” is meaningless for understanding the robustness or confidence we should have in our tree.
13.4 Next Steps
You should have seen that the bed 11 isolates all seem to cluster together. It’s difficult to keep track of this just on the basic tree, and it would be useful to be able to annotate the tree with a visual representation of the sample type and room for each isolate. You will do this in Chapter 14 using iTOL.
The core genome is the set of sequences shared by all genomes in a group. It can be thought of as the regions of sequence that characterise what makes members of that group members of that group (like all birds have wings - wings are a core feature of birds).↩︎
A tree is the name we give to a kind of graph (in the mathematical sense) where there are no loops (known as cycles). The biological interpretation of these is typically that bifurcations (i.e. branching points) represent sequence divergence at some point in history, and that edge length/branch length represents the degree of difference between two points on the graph (i.e. how different two sequences are).↩︎
The concept is analogous to regression of a straight line through a set of points, but it’s regressing a tree through a set of sequences.↩︎