This page is intended as an example walkthrough of some of the practical steps in a project like the one you’ve been assigned. The details will be different from those of your project, but this page should give you an idea of how I might go about the work if I was doing something similar.
1 What’s my protein?
I was assigned PHI:3077 - Lsr2 from Mycobacterium tuberculosis
1.1 Checking PHI-Base
The reference describing this protein in PHI-Base is Bartek et al. (2014): “Mycobacterium tuberculosis Lsr2 Is a Global Transcriptional Regulator Required for Adaptation to Changing Oxygen Levels and Virulence.” The essential findings in an Lsr2 knockout are:
- Lsr2 is not required for DNA protection (this strain was equally susceptible as the wild type to DNA-damaging agents)
- The lsr2 mutant displayed severe growth defects under normoxic and hyperoxic conditions, but it was not required for growth under low-oxygen conditions.
- Lsr2 was required for adaptation to anaerobiosis. The defect in anaerobic adaptation led to a marked decrease in viability during anaerobiosis, as well as a lag in recovery from it.
- Gene expression profiling of the Δlsr2 mutant under aerobic and anaerobic conditions in conjunction with published DNA binding-site data indicates that Lsr2 is a global transcriptional regulator controlling adaptation to changing oxygen levels.
- The Δlsr2 strain was capable of establishing an early infection in the BALB/c mouse model; however, it was severely defective in persisting in the lungs and caused no discernible lung pathology.
This suggests that Lsr2 binds to DNA and acts within the pathogen to control its response to encountering the host as an environment (i.e. what we describe as pathogenicity). I would be on the lookout for suggestions in the literature, and from what I discover from databases and annotations, for elements of the sequence and structure associated with DNA-binding.
1.2 Checking UniProt
The PHI-Base entry links to this UniProt record: P9WIP7
1.2.1 Functional information
The UniProt record leads to further references supporting protein function (six publications under “Function”)
- DNA-bridging protein that has both architectural and regulatory roles (PubMed:18187505).
- Influences the organization of chromatin and gene expression by binding non-specifically to DNA, with a preference for AT-rich sequences, and bridging distant DNA segments (PubMed:20133735).
- Binds in the minor groove of AT-rich DNA (PubMed:21673140).
- Represses expression of multiple genes involved in a broad range of cellular processes, including major virulence factors or antibiotic-induced genes, such as iniBAC or efpA (PubMed:17590082), and genes important for adaptation to changing O2 levels (PubMed:24895305).
- May also activate expression of some genes (PubMed:24895305).
- May coordinate global gene regulation and virulence (PubMed:20133735).
- Also protects mycobacteria against reactive oxygen intermediates during macrophage infection by acting as a physical barrier to DNA degradation (PubMed:19237572); the physical protection has been questioned (PubMed:24895305).
- A strain overexpressing this protein consumes O2 more slowly than wild-type (PubMed:24895305).
The feature viewer gives me information about regions of the protein:
This indicates that there is mutagenesis evidence supporting structure-function interpretation:
- residues 97-99: loss of DNA-binding, in fragment 66-112; alternative sequence AGA (PubMed:21673140)
- residue 84: loss of activity; can form dimers but does not bind DNA; alternative sequence A (PubMed:18187505)
- residue 45: loss of activity; alternative sequence A (PubMed:18187505)
- residue 28: loss of activity; alternative sequence A (PubMed:18187505)
There are ten papers linked from UniProt to follow up about function, and UniProt also provides GO terms for functional annotation.
UniProt has so far provided leads about sequence-structure-function relationships, the key role of this protein in pathogenicity, and even highlighted individual residues with a potential functional role.
1.2.2 Structural information
There is an AlphaFold structure, indicated in the feature viewer, that includes a low-confidence region.
There are PDB records of solved structures: 2KNG, 4E1P, 41R, 6QKP, 6QKQ; 4E1P appears to be highest quality. Solved structures are generally more reliable than AlphaFold.
Evolutionary Trace analysis exists at http://mammoth.bcm.tmc.edu/cgi-bin/report_maker_ls/uniprotTraceServerResults.pl?identifier=P9WIP7 - this maps sequence variation onto structure.
It looks like there will be ample structural information for me to begin inferring structure-function relationships.
1.2.3 Sequence information
UniProt gives the protein sequence as:
>sp|P9WIP7|LSR2_MYCTU Nucleoid-associated protein Lsr2 OS=Mycobacterium tuberculosis (strain ATCC 25618 / H37Rv) OX=83332 GN=lsr2 PE=1 SV=1
MAKKVTVTLVDDFDGSGAADETVEFGLDGVTYEIDLSTKNATKLRGDLKQWVAAGRRVGG
RRRGRSGSGRGRGAIDREQSAAIREWARRNGHNVSTRGRIPADVIDAYHAAT
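As a quick sanity check, the positions with mutagenesis evidence listed in Section 1.2.1 can be looked up in this sequence. The sketch below is only illustrative: the variable and function names are my own, and it assumes the 1-based residue numbering used by UniProt.

```python
# Lsr2 sequence copied from the UniProt FASTA record above (P9WIP7).
LSR2 = (
    "MAKKVTVTLVDDFDGSGAADETVEFGLDGVTYEIDLSTKNATKLRGDLKQWVAAGRRVGG"
    "RRRGRSGSGRGRGAIDREQSAAIREWARRNGHNVSTRGRIPADVIDAYHAAT"
)


def residues(seq, start, end=None):
    """Return the residue(s) at 1-based positions start..end (inclusive)."""
    end = start if end is None else end
    return seq[start - 1:end]


# Positions with mutagenesis evidence in the UniProt feature viewer.
mutagenesis_sites = {"28": (28, 28), "45": (45, 45), "84": (84, 84), "97-99": (97, 99)}
site_residues = {
    name: residues(LSR2, start, end) for name, (start, end) in mutagenesis_sites.items()
}

for name, aa in site_residues.items():
    print(f"position {name}: {aa}")
```

Knowing which residues sit at these positions (for example, the RGR motif at 97-99 that was mutated to AGA) makes it easier to find them again later in the alignment and on the structure.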
1.2.4 Homologues
UniProt provides multiple links to external resources listing homologues; these have varying numbers of homologues, because the tools work in different ways.
UniProt also lists the sequences it knows about that share a minimum level of similarity, at 100%, 90% and 50%, in the Similar Proteins section of the page (Figure 1). At time of writing, there are 269 sequences sharing at least 50% identity at protein level.
By clicking on the View All button, or on the View all 269 entries link, we can obtain a more detailed view of the data (Figure 2). This page also presents a Download link that lets us download all 270 sequences (the 269 homologues and P9WIP7 itself).
Clicking on the Download link presents options. We want to download the FASTA (Canonical) records - these contain the protein sequence information and relevant headers - and we do not want a compressed file (Figure 3).
Clicking the Download button returns the 269 homologous sequences, and our original P9WIP7 sequence, in the file below.
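It is worth confirming that the downloaded file really contains the number of sequences we expect (270 here). The following is a minimal sketch of a FASTA reader in plain Python. The function name is my own, and the second record in the example is a made-up placeholder rather than a real homologue. For a real download you would read the text from your own file instead.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Tiny in-memory example; the second record is a hypothetical placeholder.
example = """>sp|P9WIP7|LSR2_MYCTU Nucleoid-associated protein Lsr2
MAKKVTVTLVDDFDGSGAADETVEFGLDGVTYEIDLSTKNATKLRGDLKQWVAAGRRVGG
RRRGRSGSGRGRGAIDREQSAAIREWARRNGHNVSTRGRIPADVIDAYHAAT
>hypothetical_homologue placeholder record for illustration
MAKKVTVTLV
"""
records = read_fasta(example)
print(len(records), "sequences")
```

With the real download, `len(records)` should equal the number of entries UniProt reported plus one for P9WIP7 itself.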
2 Aligning homologues
Taking the 50% identity sequences as our starting point, we could usually align them using tools in UniProt, but with ≈270 sequences there are too many. So we’ll have to use another approach.
2.1 Align using MAFFT on Galaxy
We could download a standalone tool, but we’re going to use Galaxy instead (don’t forget to register and log in). We’ll use MAFFT to align sequences.
- Load the 270 sequence dataset
- Find the MAFFT tool (you might as well use the FFT-NS-2 default method)
- Click on Run Tool
This will generate the sequence alignment for you in FASTA format. Download the alignment to your own machine (click on the floppy disk icon).
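If you did prefer the standalone route, the same alignment could be run locally. The sketch below assumes MAFFT is installed and on your PATH, and uses the options that, as I understand the MAFFT manual, correspond to the FFT-NS-2 strategy (`--retree 2 --maxiterate 0`). The file names are hypothetical and the helper functions are my own.

```python
import shutil
import subprocess


def mafft_fftns2_command(input_fasta):
    """Build the MAFFT command line for the FFT-NS-2 strategy."""
    return ["mafft", "--retree", "2", "--maxiterate", "0", input_fasta]


def align(input_fasta, output_fasta):
    """Run MAFFT if it is installed; MAFFT writes the alignment to stdout."""
    if shutil.which("mafft") is None:
        raise RuntimeError("mafft not found on PATH")
    with open(output_fasta, "w") as handle:
        subprocess.run(mafft_fftns2_command(input_fasta), stdout=handle, check=True)


# Hypothetical file name, used only to show the command that would be run.
command = mafft_fftns2_command("homologues.fasta")
print(" ".join(command))
```

Calling `align("homologues.fasta", "homologues_aligned.fasta")` would then produce an aligned FASTA file equivalent in format to the Galaxy download.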
2.2 Visualise alignment with JalView
To visualise the alignment, we will use a standalone software tool: JalView. Download and install this on your own machine.
To visualise the alignment:
- Start JalView
- Click on File -> Input Alignment -> From File, then select your downloaded alignment
- Click on the maximise button to see the larger alignment
We can see the full length of the sequence and that similar residues are aligned vertically, in Figure 4. We can see any insertions/deletions for specific sequences visually. The extent of conservation, and a consensus sequence, can be seen below the alignment. We can colour the alignment to highlight aspects of the biology/biochemistry:
- Click on the Colour menu item
- Choose a colour scheme (e.g. Clustal to recreate Figure 4)
Scrolling up and down the alignment shows where some sequences are quite different, or may be incomplete. We might want to delete some sequences from the collection.
Whenever we modify our dataset, we need to record which sequences are removed or otherwise altered, and our grounds for removing them (and this should be described in the Methods section of the thesis).
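One way to keep that record honest is to do the filtering in a script that logs every removal along with its reason. The sketch below is an assumption-laden example, not a recommendation: the 50% gap threshold, the function name and the toy sequences are all my own choices for illustration.

```python
def filter_alignment(records, max_gap_fraction=0.5):
    """Split aligned (name, sequence) records into kept records and a removal log."""
    kept, removed = [], []
    for name, seq in records:
        # Fraction of alignment columns in which this sequence has a gap.
        gap_fraction = seq.count("-") / len(seq) if seq else 1.0
        if gap_fraction > max_gap_fraction:
            reason = f"gap fraction {gap_fraction:.2f} exceeds {max_gap_fraction:.2f}"
            removed.append((name, reason))
        else:
            kept.append((name, seq))
    return kept, removed


# Toy aligned sequences (hypothetical names), all the same length.
aligned = [
    ("seq_A", "MAKKVTVTLV"),
    ("seq_B", "MAKK-TVTLV"),
    ("seq_C", "------VTLV"),
]
kept, removed = filter_alignment(aligned)
for name, reason in removed:
    print(f"removed {name}: {reason}")
```

The `removed` list can be written to a file and quoted directly in the Methods section.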
2.2.1 Arranging sequences by similarity
Initially the alignment puts quite different sequences next to each other. We can sort the sequences by similarity.
- Select all the sequences (Select -> Select All, or Ctrl/Cmd-A)
- Calculate -> Calculate Tree, PCA, or PaSiMap
- Select Neighbour Joining Tree and BLOSUM 62
- Click Calculate
When the tree is created, sort the alignment using View -> Sort Alignment By Tree.
Now, when you click within the tree, sequences will be coloured by where they are found on the tree.
You can save the tree as a Newick file, to visualise in other software (like Dendroscope or FigTree): File -> Save As -> Newick format
The combination of alignment and tree tells me several things immediately:
- That there are several distinct clusters of similar Lsr2 sequences (clades/groups in the tree, visible differences between sequences in the alignment)
- That there are two regions of high sequence conservation, with a region of low sequence conservation separating them (from the alignment - the gaps that run down the middle of the proteins)
- That I need to remove some sequences because they’re relatively low quality (e.g. the sequences with half the protein missing, in the alignment)
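The conserved and variable regions seen by eye can also be put on a numerical footing. A minimal sketch, assuming a simple score of my own choosing (the fraction of sequences that share the most common residue in each column, with gaps counted against conservation), applied to a made-up toy alignment:

```python
from collections import Counter


def column_conservation(sequences):
    """Return, per column, the fraction of sequences sharing the most common residue.

    Gaps ('-') count towards the denominator but never as the most common residue.
    """
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("aligned sequences must all have the same length")
    scores = []
    for column in zip(*sequences):
        counts = Counter(res for res in column if res != "-")
        top = counts.most_common(1)[0][1] if counts else 0
        scores.append(top / len(column))
    return scores


# Toy alignment: conserved ends with a gappy, variable middle.
toy_alignment = [
    "MAKKV--TLV",
    "MAKRVG-TLV",
    "MAKKV--SLV",
    "MAKKVGRTLV",
]
scores = column_conservation(toy_alignment)
print([round(score, 2) for score in scores])
```

Runs of high scores correspond to conserved regions, and a stretch of low scores between them corresponds to the gappy linker-like region visible in the alignment.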
3 Generating a phylogenetic tree
Using a sequence alignment to produce a phylogenetic tree can be a complex and detailed topic. There is an entire research field dedicated to the problem of generating an accurate representation of evolutionary history from protein and/or nucleotide sequences. Here we have space to quickly go through basic approaches to producing an initial tree and, if you would like to know more, please check out the links below.
- BM329 Workshop B – UPGMA: https://sipbs-compbiol.github.io/BM329_Block_B_Workshop/upgma.html
- EMBL’s Introduction to Phylogenetics: https://www.ebi.ac.uk/training/online/courses/introduction-to-phylogenetics/what-is-phylogenetics/
- Conor Meehan’s online course: https://conmeehan.github.io/PathogenDataCourse/IntroToPhylogenetics.html
3.1 Generating a basic phylogenetic tree
We have already produced a basic phylogenetic tree using JalView (Section 2.2.1), as shown in Figure 5. To make this tree, Jalview used the Neighbor-Joining algorithm and the BLOSUM62 amino acid substitution matrix - some advantages and disadvantages of the method are noted in this paper.
In general, we would prefer to use a more sophisticated method such as Maximum Likelihood or Bayesian phylogenetic reconstruction, using the underlying nucleotide sequences rather than the protein sequences. We would want to use the nucleotide sequences because they carry more evolutionary information (the same amino acid may be encoded by different codons in two sequences, and the differences between those codons are evolutionary information). We’d use Maximum Likelihood/Bayesian approaches because they give better estimates of the evolutionary history leading to the sequences in the alignment than Neighbour-Joining (which has systematic biases in its approach).
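To make the point about codons concrete, here is a toy example with two made-up nucleotide sequences that encode the same peptide. The codon table is deliberately cut down to the five codons used here, and the sequences are invented for illustration; they are not taken from lsr2.

```python
# Minimal codon table covering only the codons used in this toy example.
CODON_TABLE = {
    "ATG": "M",
    "GCT": "A", "GCC": "A",
    "AAA": "K", "AAG": "K",
}


def translate(dna):
    """Translate a DNA string codon by codon using CODON_TABLE."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


def count_differences(a, b):
    """Count positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


gene_1 = "ATGGCTAAA"
gene_2 = "ATGGCCAAG"
protein_1, protein_2 = translate(gene_1), translate(gene_2)

print(protein_1, protein_2)
print("nucleotide differences:", count_differences(gene_1, gene_2))
print("amino acid differences:", count_differences(protein_1, protein_2))
```

The two genes differ at two nucleotide positions but translate to identical peptides, so a protein alignment would see no change where a nucleotide alignment sees two substitutions.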
You can use some of these more advanced phylogenetic tools on Galaxy.
3.2 Generating a more sophisticated phylogenetic tree
For the example below I have removed some distant/incomplete sequences, and a short leader sequence that isn’t present in most proteins, to make the alignment truncated_sequences.fa.
To generate a Maximum Likelihood phylogenetic tree from the truncated_sequences.fa file, we can use the RAxML tool in Galaxy (Figure 6).
- Select the RAxML tool in the tools sidebar
- Choose the Amino Acid model type
- Choose the PROTCAT substitution model and the BLOSUM62 matrix
- Click on Run Tool
The tree will take a short while to be generated, and will produce five output files (Figure 7). The tree information can be found in the Best-scoring ML tree output.
4 Visualising a phylogenetic tree
A simple way to visualise phylogenetic tree output on Galaxy is to use the Newick Display tool (Figure 8):
- Find and click on the Newick Display tool in Galaxy
- Make sure the tree output file you want is selected in the Newick file field
- Click on Run Tool
- When the tool finishes, click on the eye icon, and the floppy/download icon to download the tree rendering.
This generates a bitmap of the input tree (Figure 9). It is a quick way to see the overall tree structure, but is very limited in terms of customisability and interaction. The image is usually too large to view all at once, and needs to be downloaded. Even then it may not be very aesthetically pleasing.
Many alternative tree visualisation tools are available, including:
- iToL: https://itol.embl.de/
- Dendroscope: https://uni-tuebingen.de/fakultaeten/mathematisch-naturwissenschaftliche-fakultaet/fachbereiche/informatik/lehrstuehle/algorithms-in-bioinformatics/software/dendroscope/
- FigTree: http://tree.bio.ed.ac.uk/software/Figtree/
4.1 Visualising a tree with FigTree
The FigTree tool is a cross-platform phylogenetic tree visualisation package that can generate publication-quality figures. It is fairly straightforward to use, and good for experimenting with tree layouts and options.
- Download FigTree from http://tree.bio.ed.ac.uk/software/Figtree/ and install it on your machine
- Start FigTree and open the RAxML best-scoring tree, downloaded from Galaxy (the file extension will be .nhx)
You should be presented with a screen that resembles Figure 10, showing a similar, but reshaped, view to that in Figure 9. A quick visual comparison should be enough to convince you that, as a scientist, you can make many choices about visualisation that can help the reader understand your tree, and that can help you make your point better in your manuscript.
4.1.1 Rooted or unrooted?
Real trees have roots, but phylogenetic trees may or may not have roots - it depends on the method used to construct them.
In a phylogenetic tree, the root is the most ancient common ancestor of all sequences/species/taxa in the tree.
UPGMA trees have roots because of the way the tree construction algorithm works. Neighbour-Joining itself produces an unrooted tree, although many tools draw it with a root for display. Either way, the displayed root may or may not be a “true” root, in the sense that it represents the true last common ancestor of all sequences in the tree.
Maximum Likelihood and Bayesian trees do not have roots enforced on them by their algorithms, but if you choose sequences so that you know something about the evolutionary history of your organisms, you can accurately identify the true root of a tree.
The tree I generated with RAxML is a Maximum Likelihood tree, so I’m going to visualise it as an unrooted tree, first, in Figure 11.
Looking at Figure 11, I can see that the top left part of the tree has very long branches (the lines between nodes of the tree that either represent divergence events or the sequences themselves). This implies that the sequences at the ends of these branches have undergone much more evolutionary change (i.e. underwent more substitutions) than the other sequences. This has two broad explanations:
- This reflects the true evolutionary history of these sequences. They may have retained more substitutions because they were under extreme selection pressure (not unusual for pathogen effectors).
- This is an artefact (i.e. not a real biological result). We may have included sequences in our analysis that do not belong - they appear to have accepted many substitutions because they are too different to the rest of the sequences to really belong in the same category.
We cannot decide between these two options by looking at the tree alone. We need to do more work analysing the sequences and the tree.
4.1.2 Midpoint Rooting
Midpoint rooting is a way of deciding upon a root point for an unrooted tree, without any reference to the true biology. It works by finding the longest path between any two taxa (i.e. end points on the tree) and setting the root at the midpoint of that path. Using the Midpoint Root option on a visually unrooted tree doesn’t change its appearance drastically, as you can see from Figure 12.
In FigTree the Midpoint Root option is in the Tree menu.
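To show what midpoint rooting does under the hood, here is a minimal sketch on a made-up five-branch tree. It is not how FigTree implements the option; the tree, the names and the data structure (a dictionary of neighbours with branch lengths) are my own assumptions. It finds the two tips that are furthest apart, then walks along the path between them until it has covered half the distance.

```python
def build_tree(edges):
    """Build an unrooted tree as {node: {neighbour: branch_length}}."""
    tree = {}
    for a, b, length in edges:
        tree.setdefault(a, {})[b] = length
        tree.setdefault(b, {})[a] = length
    return tree


def farthest_tip(tree, start):
    """Return (tip, distance, path) for the tip farthest from start."""
    best = (start, 0.0, [start])
    stack = [(start, None, 0.0, [start])]
    while stack:
        node, parent, dist, path = stack.pop()
        if len(tree[node]) == 1 and dist > best[1]:  # tips have one neighbour
            best = (node, dist, path)
        for neighbour, length in tree[node].items():
            if neighbour != parent:
                stack.append((neighbour, node, dist + length, path + [neighbour]))
    return best


def midpoint(tree):
    """Locate the midpoint of the longest tip-to-tip path.

    Returns (node_a, node_b, offset): the root lies on branch node_a-node_b,
    at distance offset from node_a.
    """
    any_tip = next(node for node in tree if len(tree[node]) == 1)
    tip_a, _, _ = farthest_tip(tree, any_tip)
    _, longest, path = farthest_tip(tree, tip_a)
    half, walked = longest / 2, 0.0
    for node_a, node_b in zip(path, path[1:]):
        length = tree[node_a][node_b]
        if walked + length >= half:
            return node_a, node_b, half - walked
        walked += length
    raise ValueError("tree has no branches")


# Toy tree: tips A-D, internal nodes X and Y; C sits on a long branch.
toy_tree = build_tree([
    ("A", "X", 1.0), ("B", "X", 2.0), ("X", "Y", 1.0),
    ("C", "Y", 6.0), ("D", "Y", 1.0),
])
root_position = midpoint(toy_tree)
print(root_position)
```

In this toy tree the longest path runs between B and C (length 9), so the root is placed 4.5 along the branch leading from C. Notice that the long branch alone determines where the root goes, which is exactly why a midpoint root says nothing about the true biology.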
But, if we switch the display style to a phylogram as in Figure 13, we can see that the tree appears “balanced”, with a clear outgroup at the bottom end of the tree. We can see three clear sets of sequences that appear to be distanced from each other in the tree.
4.1.3 Interpreting the tree
As noted above, the very long branches separating our sequences into three groups could represent true evolutionary history, or be an artefact from analysing sequences that don’t really belong together.
Generally, I would expect a set of related sequences in a phylogenetic tree to have about the same “distance” from the root to each tip. The presence of these three distinct groups could be a red flag that I’ve done something wrong - or it could be very interesting!
The first thing I want to know is what sequences are in each of these three groups. I first ladderise (Tree -> Increasing node order) the tree, and turn Tip Labels back on. Then I use the Expansion slider to stretch the tree vertically and make space between the tip labels, as in Figure 14.
By using the Zoom slider and increasing the font size in the Tip Labels menu I can see the sequence names more clearly in Figure 15.
By inspecting this zoomed-in view, I can see that the larger group of outlying sequences is annotated with codes that end in the letters NOCA, NOCBR or similar; most of the sequences in the dataset have codes like MYCO, MYCRU and so on. This tells me that:
- all of the main set of sequences derive from Mycobacterium
- the larger group of outlying sequences derive from Nocardia
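Reading codes off the tip labels by eye gets tedious for hundreds of sequences, so the same check can be scripted. The sketch below assumes tip labels in the UniProt style seen in this dataset, where the organism code follows the underscore in the entry name. The genus guess is a crude heuristic of my own based on the codes mentioned above, not a taxonomy lookup, and the Nocardia label uses a made-up accession.

```python
import re
from collections import Counter


def species_code(label):
    """Extract the organism code that follows the underscore in a UniProt entry name."""
    entry_name = label.split("|")[-1]  # e.g. 'A0A1Z4F4V8_9MYCO/27-140'
    match = re.search(r"_([A-Z0-9]+)", entry_name)
    return match.group(1) if match else None


def guess_genus(code):
    """Very rough genus guess from the code; a heuristic only."""
    if code is None:
        return "unknown"
    if "MYC" in code:
        return "Mycobacterium"
    if "NOC" in code:
        return "Nocardia"
    return "other"


tip_labels = [
    "sp|P9WIP7|LSR2_MYCTU",
    "tr|A0A1Z4F4V8|A0A1Z4F4V8_9MYCO/27-140",
    "tr|X00000|X00000_NOCBR/1-120",  # hypothetical accession for illustration
]
codes = [species_code(label) for label in tip_labels]
genus_counts = Counter(guess_genus(code) for code in codes)
print(codes)
print(dict(genus_counts))
```

Because codes such as MYCO do not identify a species, this only gives a first-pass grouping; anything surprising should be checked against the full UniProt record.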
The single other outlying sequence appears to come from Mycobacterium, so we should look at the sequence alignment to see whether there is an issue with the alignment (Figure 16).
From this alignment, we can see that the Nocardia sequences are broadly similar to each other, and do align with the Mycobacterium sequences, even though there appears to be an inserted region (we can look at the location of this on the structure, later), and quite a bit of sequence difference. This kind of pattern is consistent with there being a common ancestor of this sequence in Nocardia and Mycobacterium, and them being separated by speciation, then evolving separately for a long period.
The Mycobacterium sequence that looks out of place is tr|A0A1Z4F4V8|A0A1Z4F4V8_9MYCO/27-140 - the sequence looks similar enough to the others to appear to be validly aligned, but it is quite different to the other Mycobacterium proteins. It is not clear at this point why these differences should be present, so we continue our analysis keeping this sequence in the dataset.
4.2 General conclusions from the tree
At this point we can draw some tentative conclusions from the tree.
- There was probably a related sequence present in an ancestor of Mycobacterium and Nocardia, and the homologues we find in the database have diverged after speciation.
- It is not clear whether these proteins have the same function(s) in these two genera.
- There is a questionable Mycobacterium homologue, but we don’t have sufficient evidence to exclude it from the analysis as an artefact and can proceed assuming there is a biological relationship.
- There are, within the Mycobacteria, several groups of related sequences (we call these clades) which might represent any or all of:
  - speciation (i.e. the tree reflects organism history and evolution)
  - functional divergence (i.e. each clade represents a different biological interaction, activity, or function)
  - but we can’t tell from the tree alone what the (most) likely explanations are - we need to do more work
- The MYCO coding does not give us species information and it’s not possible to see whether the groupings of sequences correspond to speciation, or paralogy within species. We would need to do some bioinformatics work to label these sequences with their corresponding taxa.
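As a first step towards that labelling, the organism name and taxonomy identifier can be pulled out of the FASTA headers we downloaded from UniProt, which carry OS= (organism) and OX= (taxonomy ID) fields, as in the P9WIP7 header shown in Section 1.2.3. A minimal sketch, with a regular expression and function name of my own devising:

```python
import re

# Matches the organism name (OS=) up to the taxonomy identifier (OX=).
HEADER_PATTERN = re.compile(r"OS=(?P<organism>.+?) OX=(?P<taxid>\d+)")


def organism_from_header(header):
    """Return (organism name, taxonomy ID) parsed from a UniProt FASTA header."""
    match = HEADER_PATTERN.search(header)
    if match is None:
        return None
    return match.group("organism"), int(match.group("taxid"))


header = (
    ">sp|P9WIP7|LSR2_MYCTU Nucleoid-associated protein Lsr2 "
    "OS=Mycobacterium tuberculosis (strain ATCC 25618 / H37Rv) "
    "OX=83332 GN=lsr2 PE=1 SV=1"
)
organism, taxid = organism_from_header(header)
genus = organism.split()[0]
print(organism, taxid, genus)
```

Applying this to every header in the downloaded file would give a table of accession, organism and taxonomy ID that could be used to relabel the tips of the tree.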