Parallel Project – BM432 Project Pages

This page is an example walkthrough of some of the practical steps in a project like the one you’ve been assigned. The details will differ from those of your project, but this page should give you an idea of how I might go about the work if I were doing something similar.

1 What’s my protein?

I was assigned PHI:3077 - Lsr2 from Mycobacterium tuberculosis

1.1 Checking PHI-Base

The reference describing this protein in PHI-Base is Bartek et al. (2014): “Mycobacterium tuberculosis Lsr2 Is a Global Transcriptional Regulator Required for Adaptation to Changing Oxygen Levels and Virulence.” The essential findings in an Lsr2 knockout are:

Lsr2 is not required for DNA protection (this strain was as susceptible as the wild type to DNA-damaging agents)
The lsr2 mutant displayed severe growth defects under normoxic and hyperoxic conditions, but it was not required for growth under low-oxygen conditions.
Lsr2 was required for adaptation to anaerobiosis. The defect in anaerobic adaptation led to a marked decrease in viability during anaerobiosis, as well as a lag in recovery from it.
Gene expression profiling of the Δlsr2 mutant under aerobic and anaerobic conditions in conjunction with published DNA binding-site data indicates that Lsr2 is a global transcriptional regulator controlling adaptation to changing oxygen levels.
The Δlsr2 strain was capable of establishing an early infection in the BALB/c mouse model; however, it was severely defective in persisting in the lungs and caused no discernible lung pathology.

This suggests that Lsr2 binds to DNA and acts within the pathogen to control its response to encountering the host as an environment (i.e. what we describe as pathogenicity). I would be on the lookout for suggestions in the literature, and from what I discover from databases and annotations, for elements of the sequence and structure associated with DNA-binding.

1.2 Checking UniProt

The PHI-Base entry links to this UniProt record: P9WIP7

1.2.1 Functional information

The UniProt record leads to further references supporting protein function (six publications under “Function”)

DNA-bridging protein that has both architectural and regulatory roles (PubMed:18187505).
Influences the organization of chromatin and gene expression by binding non-specifically to DNA, with a preference for AT-rich sequences, and bridging distant DNA segments (PubMed:20133735).
Binds in the minor groove of AT-rich DNA (PubMed:21673140).
Represses expression of multiple genes involved in a broad range of cellular processes, including major virulence factors or antibiotic-induced genes, such as iniBAC or efpA (PubMed:17590082), and genes important for adaptation to changing O2 levels (PubMed:24895305).
May also activate expression of some genes (PubMed:24895305).
May coordinate global gene regulation and virulence (PubMed:20133735).
Also protects mycobacteria against reactive oxygen intermediates during macrophage infection by acting as a physical barrier to DNA degradation (PubMed:19237572); the physical protection has been questioned (PubMed:24895305).
A strain overexpressing this protein consumes O2 more slowly than wild-type (PubMed:24895305).

The feature viewer gives me information about regions of the protein:

This indicates that there is mutagenesis evidence supporting structure-function interpretation:

residues 97-99: loss of DNA-binding, in fragment 66-112; alternative sequence AGA (PubMed:21673140)
residue 84: loss of activity; can form dimers but does not bind DNA; alternative sequence A (PubMed:18187505)
residue 45: loss of activity; alternative sequence A (PubMed:18187505)
residue 28: loss of activity; alternative sequence A (PubMed:18187505)

There are ten papers linked from UniProt to follow up about function, and UniProt also provides GO terms for functional annotation.

UniProt has so far provided leads about sequence-structure-function relationships, the key role of this protein in pathogenicity, and even highlighted individual residues with a potential functional role.

1.2.2 Structural information

There is an AlphaFold structure, indicated in the feature viewer, that includes a low-confidence region.

There are PDB records of solved structures: 2KNG, 4E1P, 4E1R, 6QKP, 6QKQ; 4E1P appears to be the highest quality. Solved structures are generally more reliable than AlphaFold predictions.

Evolutionary Trace analysis exists at http://mammoth.bcm.tmc.edu/cgi-bin/report_maker_ls/uniprotTraceServerResults.pl?identifier=P9WIP7 - this maps sequence variation onto structure.

It looks like there will be ample structural information for me to begin inferring structure-function relationships.

1.2.3 Sequence information

UniProt gives the protein sequence as:

>sp|P9WIP7|LSR2_MYCTU Nucleoid-associated protein Lsr2 OS=Mycobacterium tuberculosis (strain ATCC 25618 / H37Rv) OX=83332 GN=lsr2 PE=1 SV=1
MAKKVTVTLVDDFDGSGAADETVEFGLDGVTYEIDLSTKNATKLRGDLKQWVAAGRRVGG
RRRGRSGSGRGRGAIDREQSAAIREWARRNGHNVSTRGRIPADVIDAYHAAT
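
As a quick cross-check between the sequence and the mutagenesis annotations in Section 1.2.1, the short Python sketch below looks up which residues sit at the annotated positions. The helper name `residues` is my own invention for illustration; the sequence and positions are the ones given on this page.

```python
# Lsr2 sequence copied from the UniProt FASTA record above (P9WIP7).
LSR2 = (
    "MAKKVTVTLVDDFDGSGAADETVEFGLDGVTYEIDLSTKNATKLRGDLKQWVAAGRRVGG"
    "RRRGRSGSGRGRGAIDREQSAAIREWARRNGHNVSTRGRIPADVIDAYHAAT"
)


def residues(sequence, start, end=None):
    """Return the residue(s) at 1-based positions start..end (inclusive)."""
    end = start if end is None else end
    return sequence[start - 1:end]


# Positions taken from the mutagenesis annotations in Section 1.2.1.
for label, start, end in [("28", 28, 28), ("45", 45, 45),
                          ("84", 84, 84), ("97-99", 97, 99)]:
    print(f"residue(s) {label}: {residues(LSR2, start, end)}")

print(f"sequence length: {len(LSR2)}")
```

The sequence is 112 residues long, which agrees with the 66-112 fragment mentioned in the annotation, and the annotated positions are D28, R45, R84 and the RGR motif at 97-99.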

1.2.4 Homologues

UniProt provides multiple links to external resources listing homologues; these have varying numbers of homologues, because the tools work in different ways.

UniProt also lists the sequences it knows about that share a minimum level of similarity, at 100%, 90% and 50%, in the Similar Proteins section of the page (Figure 1). At time of writing, there are 269 sequences sharing at least 50% identity at protein level.

Figure 1: The `Similar Proteins` section of the P9WIP7 record

By clicking on the View All button, or on the View all 269 entries link, we can obtain a more detailed view of the data (Figure 2). This page also presents a Download link that lets us download all 270 sequences (the 269 homologues and P9WIP7 itself).

Figure 2: A more detailed view of the 269 homologues of P9WIP7

Clicking on the Download link presents options. We want to download the FASTA (Canonical) records - these contain the protein sequence information and relevant headers - and we do not want a compressed file (Figure 3).

Figure 3: UniProt sequence download options: choose `FASTA (Canonical)` and do not request compressed format

Clicking the Download button returns the 269 homologous sequences, and our original P9WIP7 sequence, in the file below.

269 homologues of Lsr2 (FASTA)
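
It is worth confirming that the download contains the number of sequences we expect. The sketch below is a minimal FASTA reader in plain Python; the function name `read_fasta` and the two example records are made up for illustration and are not taken from the downloaded file.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Two invented records standing in for the downloaded file.
example = """>seq1 hypothetical record
MAKKVTVTLV
DDFDG
>seq2 hypothetical record
MAKKVTLV
""".splitlines()

records = list(read_fasta(example))
print(len(records), "records")
```

To use this on the real download, pass an open file handle to `read_fasta` instead of the example lines; for the dataset described above we would expect 270 records.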

2 Aligning homologues

Taking the 50% identity sequences as our starting point, we could usually align them using the tools in UniProt, but ≈270 sequences are too many for that. So we’ll have to use another approach.

2.1 Align using `MAFFT` on Galaxy

We could download a standalone tool, but we’re going to use Galaxy instead (don’t forget to register and log in). We’ll use MAFFT to align sequences.

Load the 270 sequence dataset
Find the MAFFT tool (you might as well use the FFT-NS-2 default method)
Click on Run Tool

This will generate the sequence alignment for you in FASTA format. Download the alignment to your own machine (click on the floppy disk icon).
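
If you did want to run the standalone tool instead of Galaxy, the call might look like the Python sketch below. It assumes MAFFT is installed locally and on your PATH; the file names are placeholders, and the helper functions are my own. The `--retree 2 --maxiterate 0` options are how the MAFFT manual describes the FFT-NS-2 strategy.

```python
import shutil
import subprocess


def mafft_command(fasta_in):
    """Build a MAFFT command line using the FFT-NS-2 strategy."""
    return ["mafft", "--retree", "2", "--maxiterate", "0", fasta_in]


def run_mafft(fasta_in, fasta_out):
    """Run MAFFT if it is installed; the alignment is written to fasta_out."""
    if shutil.which("mafft") is None:
        raise RuntimeError("mafft was not found on PATH")
    # MAFFT writes the alignment (FASTA format) to standard output.
    with open(fasta_out, "w") as handle:
        subprocess.run(mafft_command(fasta_in), stdout=handle, check=True)


# Placeholder file name, not a file provided with this page.
print(" ".join(mafft_command("lsr2_homologues.fasta")))
```

Calling `run_mafft("lsr2_homologues.fasta", "lsr2_aligned.fasta")` would then produce an alignment comparable to the Galaxy output, although results may differ slightly between MAFFT versions.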

2.2 Visualise alignment with `JalView`

To visualise the alignment, we will use a standalone software tool: JalView. Download and install this on your own machine.

JalView home page

To visualise the alignment:

Start JalView
Click on File -> Input Alignment -> From File, then select your downloaded alignment
Click on the maximise button to see the larger alignment

We can see the full length of the sequence and that similar residues are aligned vertically, in Figure 4 (a). We can see any insertions/deletions for specific sequences visually. The extent of conservation, and a consensus sequence, can be seen below the alignment. We can colour the alignment to highlight aspects of the biology/biochemistry:

Click on the Colour menu item
Choose a colour scheme (e.g. Clustal to recreate Figure 4 (a))
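
To make the idea of a consensus row concrete, the sketch below computes, for each alignment column, the most common residue and the fraction of sequences that carry it. This is only a rough approximation of what JalView shows in its Consensus row (and is not the same as its Conservation score); the function name and the toy alignment are invented for illustration.

```python
from collections import Counter


def column_summary(rows):
    """Return (consensus, fractions) for equal-length aligned sequences.

    fractions[i] is the share of all rows that carry the most common
    non-gap residue in column i.
    """
    if len({len(row) for row in rows}) != 1:
        raise ValueError("aligned rows must all have the same length")
    consensus, fractions = [], []
    for column in zip(*rows):
        counts = Counter(symbol for symbol in column if symbol != "-")
        if counts:
            residue, n = counts.most_common(1)[0]
        else:
            residue, n = "-", 0  # column made up entirely of gaps
        consensus.append(residue)
        fractions.append(n / len(rows))
    return "".join(consensus), fractions


# Invented toy alignment of four short sequences.
toy = ["MAKKV", "MAKRV", "MA-KV", "MSKKV"]
consensus, fractions = column_summary(toy)
print(consensus)
print(fractions)
```

Columns where the fraction is close to 1 are strongly conserved; runs of low values correspond to the variable regions you see when scrolling through the alignment.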

Scrolling up and down the alignment shows where some sequences are quite different, or may be incomplete. We might want to delete some sequences from the collection.

Important

Whenever we modify our dataset, we need to record which sequences are removed or otherwise altered, and our grounds for removing them (and this should be described in the Methods section of the thesis).
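
One way to keep such a record is to make the filtering step itself produce it. The sketch below flags aligned sequences that are mostly gaps and stores the reason for each removal. The 50% threshold, the function name and the toy records are assumptions chosen for illustration, not a recommended cut-off.

```python
def flag_gappy(aligned, max_gap_fraction=0.5):
    """Split aligned (name, row) records into kept and removed lists.

    Each removed entry is (name, reason) so the decision can be reported.
    """
    kept, removed = [], []
    for name, row in aligned:
        gap_fraction = row.count("-") / len(row)
        if gap_fraction > max_gap_fraction:
            reason = f"gap fraction {gap_fraction:.2f} > {max_gap_fraction}"
            removed.append((name, reason))
        else:
            kept.append((name, row))
    return kept, removed


# Invented aligned records; seqC is missing more than half the alignment.
toy = [
    ("seqA", "MAKKVTVTLV"),
    ("seqB", "MAKKV-VTLV"),
    ("seqC", "------VTLV"),
]
kept, removed = flag_gappy(toy)
for name, reason in removed:
    print(f"removed {name}: {reason}")
```

The `removed` list can be written to a file and quoted directly in the Methods section.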

2.2.1 Arranging sequences by similarity

Initially the alignment puts quite different sequences next to each other. We can sort the sequences by similarity.

Select all the sequences (Select -> Select All, or Ctrl/Cmd-A)
Calculate -> Calculate Tree, PCA, or PaSiMap
Select Neighbour Joining Tree and BLOSUM 62
Click Calculate

When the tree is created, sort the alignment

View -> Sort Alignment By Tree

Now, when you click within the tree, sequences will be coloured by where they are found on the tree

You can save the tree as a Newick file, to visualise in other software (like Dendroscope or FigTree):

File -> Save As -> Newick format

JalView Neighbour-Joining Tree file (NEWICK)
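
A Newick file is plain text, so simple checks can be scripted. The sketch below pulls the tip (leaf) labels out of a Newick string with a regular expression. It assumes unquoted labels and no bracketed comments, which will not hold for every file; the example tree is invented.

```python
import re


def tip_labels(newick):
    """Return tip labels from a Newick string (unquoted labels only).

    A tip label directly follows "(" or ","; internal node labels
    follow ")" and are therefore skipped.
    """
    return re.findall(r"[(,]\s*([^\s(),:;]+)", newick)


# Invented four-tip tree with branch lengths.
example = "((A:1.0,B:2.0):1.0,(C:6.0,D:1.0):0.5);"
print(tip_labels(example))
```

Comparing the number of tip labels with the number of sequences in the alignment is a quick way to confirm that nothing was lost when the tree was exported.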

The combination of alignment and tree tells me several things immediately:

That there are several distinct clusters of similar Lsr2 sequences (clades/groups in the tree, visible differences between sequences in the alignment)
That there are two regions of high sequence conservation, with a region of low sequence conservation separating them (from the alignment - the gaps that run down the middle of the proteins)
That I need to remove some sequences because they’re relatively low quality (e.g. the sequences with half the protein missing, in the alignment)

3 Generating a phylogenetic tree

Producing a phylogenetic tree from a sequence alignment is a complex and detailed topic, and an entire research field is dedicated to reconstructing evolutionary history accurately from protein and/or nucleotide sequences. Here we only have space to go quickly through basic approaches to producing an initial tree; if you would like to know more, please check out the links below.

BM329 Workshop B – UPGMA: https://sipbs-compbiol.github.io/BM329_Block_B_Workshop/upgma.html
EMBL’s Introduction to Phylogenetics: https://www.ebi.ac.uk/training/online/courses/introduction-to-phylogenetics/what-is-phylogenetics/
Conor Meehan’s online course: https://conmeehan.github.io/PathogenDataCourse/IntroToPhylogenetics.html

3.1 Generating a basic phylogenetic tree

We have already produced a basic phylogenetic tree using JalView (Section 2.2.1), as shown in Figure 4 (b). To make this tree, Jalview used the Neighbor-Joining algorithm and the BLOSUM62 amino acid substitution matrix - some advantages and disadvantages of the method are noted in this paper.

In general, we would prefer to use a more sophisticated method such as Maximum Likelihood or Bayesian phylogenetic reconstruction, using the underlying nucleotide sequences rather than the protein sequences. We would want to use the nucleotide sequences because they carry more evolutionary information (the same amino acid may be encoded by different codons in two sequences, and the differences between those codons are evolutionary information). We’d use Maximum Likelihood/Bayesian approaches because they give better estimates of the evolutionary history leading to the sequences in the alignment than Neighbour-Joining (which has systematic biases in its approach).
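
The point about codons can be illustrated with a small Python sketch. The two nucleotide sequences below are invented for the example (they are not real lsr2 gene sequences); both translate to the same four amino acids, yet they differ at two nucleotide positions. The codon table is the standard genetic code.

```python
from itertools import product

# Standard genetic code, with codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def translate(dna):
    """Translate a DNA string codon by codon (trailing bases are ignored)."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def differences(a, b):
    """Count positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


# Invented coding sequences that differ only by synonymous changes.
gene_a = "ATGGCTAAAAAA"
gene_b = "ATGGCCAAGAAA"

print(translate(gene_a), translate(gene_b))
print("nucleotide differences:", differences(gene_a, gene_b))
print("amino acid differences:",
      differences(translate(gene_a), translate(gene_b)))
```

A protein alignment would treat these two sequences as identical, whereas a nucleotide alignment records two substitutions that a phylogenetic method can use.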

Tip

You can use some of these more advanced phylogenetic tools on Galaxy

3.2 Generating a more sophisticated phylogenetic tree

Note

For the example below I have removed some distant/incomplete sequences, and a short leader sequence that isn’t present in most proteins, to make the alignment truncated_sequences.fa.

To generate a Maximum Likelihood phylogenetic tree from the truncated_sequences.fa file, we can use the RAxML tool in Galaxy (Figure 5).

Select the RAxML tool in the tools sidebar
Choose the Amino Acid model type
Choose the PROTCAT substitution model and the BLOSUM62 matrix
Click on Run Tool

The tree will take a short while to be generated, and will produce five output files (Figure 6). The tree information can be found in the Best-scoring ML tree output.

Figure 5: Settings for generating a protein phylogenetic tree with `RAxML` in Galaxy

4 Visualising a phylogenetic tree

A simple way to visualise phylogenetic tree output on Galaxy is to use the Newick Display tool (Figure 7):

Find and click on the Newick Display tool in Galaxy
Make sure the tree output file you want is selected in the Newick file field
Click on Run Tool
When the tool finishes, click on the eye icon, and the floppy/download icon to download the tree rendering.

This generates a bitmap of the input tree (Figure 8). It is a quick way to see the overall tree structure, but is very limited in terms of customisability and interaction. The image is usually too large to view all at once, and needs to be downloaded. Even then it may not be very aesthetically pleasing.

Figure 7: Options for the `Newick Display` tool

Figure 8: `Newick Display` rendered output

Many alternative tree visualisation tools are available, including FigTree, which is described below.

4.1 Visualising a tree with `FigTree`

FigTree is a cross-platform phylogenetic tree visualisation package that can generate publication-quality figures. It is fairly straightforward to use, and good for experimenting with tree layouts and options.

Download FigTree from http://tree.bio.ed.ac.uk/software/Figtree/ and install it on your machine
Start FigTree and open the RAxML best-scoring tree, downloaded from Galaxy (the file extension will be .nhx)

You should be presented with a screen that resembles Figure 9, showing a similar, but reshaped, view to that in Figure 8. A quick visual comparison should be enough to convince you that, as a scientist, you can make many choices about visualisation that can help the reader understand your tree, and that can help you make your point better in your manuscript.

Figure 9: Initial `FigTree` view of the `RAxML` tree

4.1.1 Rooted or unrooted?

Real trees have roots, but phylogenetic trees may or may not have roots - it depends on the method used to construct them.

Tip

In a phylogenetic tree, the root is the most ancient common ancestor of all sequences/species/taxa in the tree.

UPGMA trees have roots because of the way the tree construction algorithm works. Neighbour-Joining strictly produces an unrooted tree, although software usually draws it with a root. In either case this may or may not be a “true” root, in the sense that it represents the true last common ancestor of all sequences in the tree.

Maximum Likelihood and Bayesian trees do not have roots enforced on them by their algorithms, but if you choose sequences so that you know something about the evolutionary history of your organisms (for example, by including a known outgroup), you can identify the likely root of the tree with more confidence.

The tree I generated with RAxML is a Maximum Likelihood tree, so I’m going to visualise it as an unrooted tree, first, in Figure 10.

Figure 10: Unrooted maximum likelihood tree. Note that the `FigTree` `Tip Labels` setting is unchecked so that the tree is prominent.

Looking at Figure 10, I can see that the top left part of the tree has very long branches (the lines between nodes of the tree that either represent divergence events or the sequences themselves). This implies that the sequences at the ends of these branches have undergone much more evolutionary change (i.e. underwent more substitutions) than the other sequences. This has two broad explanations:

This reflects the true evolutionary history of these sequences. They may have retained more substitutions because they were under extreme selection pressure (not unusual for pathogen effectors).
This is an artefact (i.e. not a real biological result). We may have included sequences in our analysis that do not belong - they appear to have accepted many substitutions because they are too different to the rest of the sequences to really belong in the same category.

Warning

We cannot decide between these two options by looking at the tree alone. We need to do more work analysing the sequences and the tree.

4.1.2 Midpoint Rooting

Midpoint rooting is a way of choosing a root point for an unrooted tree without any reference to the true biology. It works by finding the longest path between any two taxa (i.e. end points on the tree) and setting the root at the midpoint of that path. Using the Midpoint Root option on a visually unrooted tree doesn’t change its appearance drastically, as you can see from Figure 11.
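
The procedure can be sketched in a few lines of Python. The tree below is an invented four-tip example stored as a list of edges; the function names are my own, and this is a minimal illustration of the idea, not how FigTree implements it.

```python
from itertools import combinations

# Invented unrooted tree: (node, node, branch length).
EDGES = [
    ("A", "n1", 1.0),
    ("B", "n1", 2.0),
    ("n1", "n2", 1.0),
    ("C", "n2", 6.0),
    ("D", "n2", 1.0),
]


def build_adjacency(edges):
    """Map each node to a list of (neighbour, branch length) pairs."""
    adjacency = {}
    for u, v, length in edges:
        adjacency.setdefault(u, []).append((v, length))
        adjacency.setdefault(v, []).append((u, length))
    return adjacency


def path_between(adjacency, start, end):
    """Return [(node, cumulative distance), ...] along the unique path."""
    stack = [(start, None, [(start, 0.0)])]
    while stack:
        node, previous, path = stack.pop()
        if node == end:
            return path
        for neighbour, length in adjacency[node]:
            if neighbour != previous:
                step = (neighbour, path[-1][1] + length)
                stack.append((neighbour, node, path + [step]))
    raise ValueError("nodes are not connected")


def midpoint_root(edges):
    """Return (edge start, edge end, offset from start, longest path length)."""
    adjacency = build_adjacency(edges)
    tips = [node for node, nbrs in adjacency.items() if len(nbrs) == 1]
    longest = max(
        (path_between(adjacency, a, b) for a, b in combinations(tips, 2)),
        key=lambda path: path[-1][1],
    )
    total = longest[-1][1]
    half = total / 2
    # Walk along the longest path until the edge containing the midpoint.
    for (u, du), (v, dv) in zip(longest, longest[1:]):
        if du <= half <= dv:
            return u, v, half - du, total


edge_start, edge_end, offset, total = midpoint_root(EDGES)
print(f"longest tip-to-tip path: {total}")
print(f"root lies on edge {edge_start}-{edge_end}, {offset} from {edge_start}")
```

In this toy tree the longest path runs from B to C (length 9.0), so the root is placed on the long branch leading to C, 1.5 units from node n2. Nothing about the biology was used, which is why a midpoint root should be treated as a convenience and not as evidence of the true ancestor.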

Tip

In FigTree the Midpoint Root option is in the Tree menu.

Figure 11: Unrooted maximum likelihood tree, with midpoint rooting. Due to the visual presentation of this tree, the midpoint rooting appears to have no effect.

But, if we switch the display style to a phylogram as in Figure 12, we can see that the tree appears “balanced”, with a clear outgroup at the bottom end of the tree. We can see three clear sets of sequences that appear to be distanced from each other in the tree.

Figure 12: Unrooted maximum likelihood tree, with midpoint rooting. Due to the visual presentation of this tree as a phylogram, the midpoint rooting “balances” the tree, and we can see three groups of sequences that are relatively dissimilar from each other: the small group at the top of the tree, the large group in the centre, and the sequences at the very bottom of the tree.

4.1.3 Interpreting the tree

As noted above, the very long branches separating our sequences into three groups could represent true evolutionary history, or be an artefact from analysing sequences that don’t really belong together.

Tip

Generally, I would expect a set of related sequences in a phylogenetic tree to have about the same “distance” from the root to each tip. The presence of these three distinct groups could be a red flag that I’ve done something wrong - or it could be very interesting!
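
A rough numerical version of this check is to compute the distance from the root to every tip and look at the spread. The sketch below does this for an invented rooted tree stored as child-to-parent links; the tree, node names and branch lengths are all hypothetical.

```python
# Invented rooted tree: child -> (parent, branch length).
PARENT = {
    "n1": ("root", 1.0),
    "n2": ("root", 1.5),
    "A": ("n1", 1.0),
    "B": ("n1", 1.5),
    "C": ("n2", 6.0),
}


def root_to_tip(parent):
    """Return {tip: summed branch length from the root to that tip}."""
    internal = {node for node, _ in parent.values()}
    distances = {}
    for tip in parent:
        if tip in internal:
            continue  # only tips are reported
        node, total = tip, 0.0
        while node in parent:
            node, length = parent[node]
            total += length
        distances[tip] = total
    return distances


distances = root_to_tip(PARENT)
for tip, distance in sorted(distances.items()):
    print(tip, distance)
print("longest / shortest:", max(distances.values()) / min(distances.values()))
```

Here tip C is three times as far from the root as tip A, the kind of imbalance that would prompt a closer look at the sequences involved.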

The first thing I want to know is what sequences are in each of these three groups. I first ladderise (Tree -> Increasing Node Order) the tree, and turn Tip Labels back on. Then I use the Expansion slider to stretch the tree vertically and make space between the tip labels, as in Figure 13.

Figure 13: Ladderised maximum likelihood tree with tip labels (some of the tree disappears out of the top of the `FigTree` window).

By using the Zoom slider and increasing the font size of the Tip Labels menu I can see the sequence names more clearly in Figure 14.

Figure 14: Ladderised maximum likelihood tree with tip labels, zoomed to see the two most distantly-related sequence sets.

By inspecting this zoomed-in view, I can see that the larger group of outlying sequences is annotated with codes that end in the letters NOCA, NOCBR or similar, whereas most of the sequences in the dataset have codes like MYCO, MYCRU and so on. This tells me that:

all of the main set of sequences derive from Mycobacterium
the larger group of outlying sequences derive from Nocardia
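
Reading codes off a zoomed tree by eye does not scale well, so the same tally can be scripted. The sketch below extracts the organism code (the part of the UniProt entry name after the final underscore) from each tip label and counts how often each occurs. The first label is the one discussed on this page; the residue range on the second label and the whole of the third label are invented for illustration.

```python
from collections import Counter


def organism_code(label):
    """Return the organism code from a label like 'tr|ACC|NAME_CODE/1-100'."""
    entry = label.split("|")[-1]   # e.g. A0A1Z4F4V8_9MYCO/27-140
    entry = entry.split("/")[0]    # drop the /start-end residue range
    return entry.rsplit("_", 1)[-1]


labels = [
    "tr|A0A1Z4F4V8|A0A1Z4F4V8_9MYCO/27-140",  # label seen in the tree
    "sp|P9WIP7|LSR2_MYCTU/1-112",             # residue range is assumed
    "tr|X0000001|X0000001_9NOCA/1-120",       # invented label
]

counts = Counter(organism_code(label) for label in labels)
for code, n in counts.most_common():
    print(code, n)
```

Applied to all tip labels in the tree, this gives a quick table of how many sequences carry each code.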

The single other outlying sequence appears to come from Mycobacterium, so we should look at the sequence alignment to see whether there is an issue with the alignment (Figure 15).

Figure 15: `Jalview` visualisation of the sequence alignment, centring on the *Nocardia* sequences, and the single outlying *Mycobacterium* sequence.

From this alignment, we can see that the Nocardia sequences are broadly similar to each other, and do align with the Mycobacterium sequences, even though there appears to be an inserted region (we can look at the location of this on the structure, later), and quite a bit of sequence difference. This kind of pattern is consistent with there being a common ancestor of this sequence in Nocardia and Mycobacterium, and them being separated by speciation, then evolving separately for a long period.

The Mycobacterium sequence that looks out of place is tr|A0A1Z4F4V8|A0A1Z4F4V8_9MYCO/27-140 - the sequence looks similar enough to the others to appear to be validly aligned, but it is quite different to the other Mycobacterium proteins. It is not clear at this point why these differences should be present, so we continue our analysis keeping this sequence in the dataset.

4.2 General conclusions from the tree

At this point we can draw some tentative conclusions from the tree.

There was probably a related sequence present in an ancestor of Mycobacterium and Nocardia, and the homologues we find in the database have diverged after speciation.
- It is not clear whether these proteins have the same function(s) in these two genera.
There is a questionable Mycobacterium homologue, but we don’t have sufficient evidence to exclude it from the analysis as an artefact and can proceed assuming there is a biological relationship.
There are, within the Mycobacteria, several groups of related sequences (we call these clades) which might represent any or all of:
- speciation (i.e. the tree reflects organism history and evolution)
- functional divergence (i.e. each clade represents a different biological interaction, activity, or function)
- but we can’t tell from the tree alone what the (most) likely explanations are - we need to do more work
The MYCO coding does not give us species information and it’s not possible to see whether the groupings of sequences correspond to speciation, or paralogy within species. We would need to do some bioinformatics work to label these sequences with their corresponding taxa.
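
One possible starting point for that labelling is the FASTA header itself: UniProt headers carry the organism name in the `OS=` field and the NCBI taxonomy identifier in the `OX=` field. The sketch below parses both from the P9WIP7 header shown in Section 1.2.3. The function name is my own, and taking the first word of the organism name as the genus is a simplifying assumption that will not suit every entry.

```python
import re

# Header copied from the UniProt FASTA record in Section 1.2.3.
HEADER = (
    ">sp|P9WIP7|LSR2_MYCTU Nucleoid-associated protein Lsr2 "
    "OS=Mycobacterium tuberculosis (strain ATCC 25618 / H37Rv) "
    "OX=83332 GN=lsr2 PE=1 SV=1"
)


def taxon_from_header(header):
    """Extract organism name, genus and taxonomy ID from a UniProt header."""
    match = re.search(r"OS=(.+?) OX=(\d+)", header)
    if match is None:
        return None
    organism = match.group(1)
    return {
        "organism": organism,
        "genus": organism.split()[0],  # assumption: first word is the genus
        "taxid": int(match.group(2)),
    }


print(taxon_from_header(HEADER))
```

Running this over every header in the downloaded FASTA file would give a table linking each sequence to a named organism, which can then be compared with the clades in the tree.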
