Parallel Project

What sort of thing I’d be doing, if I was doing your project

This page is intended as an example walkthrough of some of the practical steps in a project like the one you’ve been assigned. The details will be different from those of your project, but this page should give you an idea of how I might go about the work if I was doing something similar.

1 What’s my protein?

I was assigned PHI:3077 - Lsr2 from Mycobacterium tuberculosis

1.1 Checking PHI-Base

The reference describing this protein in PHI-Base is Bartek et al. (2014): “Mycobacterium tuberculosis Lsr2 Is a Global Transcriptional Regulator Required for Adaptation to Changing Oxygen Levels and Virulence.” The essential findings in an Lsr2 knockout are:

  • Lsr2 is not required for DNA protection (the knockout strain was as susceptible as the wild type to DNA-damaging agents)
  • The lsr2 mutant displayed severe growth defects under normoxic and hyperoxic conditions, but Lsr2 was not required for growth under low-oxygen conditions.
  • Lsr2 was required for adaptation to anaerobiosis. The defect in anaerobic adaptation led to a marked decrease in viability during anaerobiosis, as well as a lag in recovery from it.
  • Gene expression profiling of the Δlsr2 mutant under aerobic and anaerobic conditions in conjunction with published DNA binding-site data indicates that Lsr2 is a global transcriptional regulator controlling adaptation to changing oxygen levels.
  • The Δlsr2 strain was capable of establishing an early infection in the BALB/c mouse model; however, it was severely defective in persisting in the lungs and caused no discernible lung pathology.

This suggests that Lsr2 binds to DNA and acts within the pathogen to control its response to encountering the host as an environment (i.e. what we describe as pathogenicity). I would be on the lookout for suggestions in the literature, and from what I discover from databases and annotations, for elements of the sequence and structure associated with DNA-binding.

1.2 Checking UniProt

The PHI-Base entry links to this UniProt record: P9WIP7

1.2.1 Functional information

The UniProt record leads to further references supporting protein function (six publications under “Function”)

  • DNA-bridging protein that has both architectural and regulatory roles (PubMed:18187505).
  • Influences the organization of chromatin and gene expression by binding non-specifically to DNA, with a preference for AT-rich sequences, and bridging distant DNA segments (PubMed:20133735).
  • Binds in the minor groove of AT-rich DNA (PubMed:21673140).
  • Represses expression of multiple genes involved in a broad range of cellular processes, including major virulence factors or antibiotic-induced genes, such as iniBAC or efpA (PubMed:17590082), and genes important for adaptation to changing O2 levels (PubMed:24895305).
  • May also activate expression of some genes (PubMed:24895305).
  • May coordinate global gene regulation and virulence (PubMed:20133735).
  • Also protects mycobacteria against reactive oxygen intermediates during macrophage infection by acting as a physical barrier to DNA degradation (PubMed:19237572); the physical protection has been questioned (PubMed:24895305).
  • A strain overexpressing this protein consumes O2 more slowly than wild-type (PubMed:24895305).

The feature viewer gives me information about regions of the protein:

UniProt Feature viewer

This indicates that there is mutagenesis evidence supporting structure-function interpretation:

  • residues 97-99: loss of DNA-binding, in fragment 66-112; alternative sequence AGA (PubMed:21673140)
  • residue 84: loss of activity; can form dimers but does not bind DNA; alternative sequence A (PubMed:18187505)
  • residue 45: loss of activity; alternative sequence A (PubMed:18187505)
  • residue 28: loss of activity; alternative sequence A (PubMed:18187505)

There are ten papers linked from UniProt to follow up about function, and UniProt also provides GO terms for functional annotation.

UniProt has so far provided leads about sequence-structure-function relationships, the key role of this protein in pathogenicity, and even highlighted individual residues with a potential functional role.

1.2.2 Structural information

There is an AlphaFold structure, indicated in the feature viewer, that includes a low-confidence region.

There are PDB records of solved structures: 2KNG, 4E1P, 41R, 6QKP, 6QKQ; 4E1P appears to be highest quality. Solved structures are generally more reliable than AlphaFold.

Evolutionary Trace analysis exists at http://mammoth.bcm.tmc.edu/cgi-bin/report_maker_ls/uniprotTraceServerResults.pl?identifier=P9WIP7 - this maps sequence variation onto structure.

It looks like there will be ample structural information for me to begin inferring structure-function relationships.

1.2.3 Sequence information

UniProt gives the protein sequence as:

>sp|P9WIP7|LSR2_MYCTU Nucleoid-associated protein Lsr2 OS=Mycobacterium tuberculosis (strain ATCC 25618 / H37Rv) OX=83332 GN=lsr2 PE=1 SV=1
MAKKVTVTLVDDFDGSGAADETVEFGLDGVTYEIDLSTKNATKLRGDLKQWVAAGRRVGG
RRRGRSGSGRGRGAIDREQSAAIREWARRNGHNVSTRGRIPADVIDAYHAAT
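
With the sequence in hand, I might quickly check which residues sit at the mutagenesis positions reported in the feature viewer (28, 45, 84 and 97-99). The sketch below is only an illustration: the variable and function names are my own, and it assumes UniProt's usual 1-based residue numbering.

```python
# Lsr2 sequence as given in the UniProt record P9WIP7 (copied from above).
LSR2 = (
    "MAKKVTVTLVDDFDGSGAADETVEFGLDGVTYEIDLSTKNATKLRGDLKQWVAAGRRVGG"
    "RRRGRSGSGRGRGAIDREQSAAIREWARRNGHNVSTRGRIPADVIDAYHAAT"
)

# Mutagenesis sites from the feature viewer, as (start, end) pairs.
# Positions are 1-based and inclusive (an assumption about the numbering).
MUTAGENESIS_SITES = [(28, 28), (45, 45), (84, 84), (97, 99)]


def residues_at(sequence, start, end):
    """Return the residues from start to end (1-based, inclusive)."""
    return sequence[start - 1:end]


print(f"Sequence length: {len(LSR2)}")
for start, end in MUTAGENESIS_SITES:
    print(f"{start}-{end}: {residues_at(LSR2, start, end)}")
```

Knowing the wild-type residue at each site makes it easier to look for the same residues in the aligned homologues later on.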

1.2.4 Homologues

UniProt provides multiple links to external resources listing homologues; these have varying numbers of homologues, because the tools work in different ways.

UniProt also lists the sequences it knows about that share a minimum level of similarity, at 100%, 90% and 50%, in the Similar Proteins section of the page (Figure 1). At time of writing, there are 269 sequences sharing at least 50% identity at protein level.
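
To make "percentage identity" concrete, here is a minimal sketch of one common way to calculate it for two sequences that have already been aligned. This is an assumption for illustration, not a description of how UniProt builds its similarity clusters: the function name is mine, and other definitions (for example, dividing by the full alignment length) give different numbers.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over columns where both sequences have a residue.

    Both inputs must come from the same alignment, with '-' marking gaps.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # skip columns where either sequence has a gap
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Two short made-up aligned fragments (not real database records).
print(round(percent_identity("MAKKVTV-LV", "MAKRVTVTLV"), 1))
```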

Figure 1: The Similar Proteins section of the P9WIP7 record

By clicking on the View All button, or on the View all 269 entries link, we can obtain a more detailed view of the data (Figure 2). This page also presents a Download link that lets us download all 270 sequences (the 269 homologues and P9WIP7 itself).

Figure 2: A more detailed view of the 269 homologues of P9WIP7

Clicking on the Download link presents options. We want to download the FASTA (Canonical) records - these contain the protein sequence information and relevant headers - and we do not want a compressed file (Figure 3).

Figure 3: UniProt sequence download options: choose FASTA (Canonical) and do not request compressed format

Clicking the Download button returns the 269 homologous sequences, and our original P9WIP7 sequence, in the file below.
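
Before going further, I would check that the downloaded file really contains the number of sequences I expect. The sketch below is a small FASTA reader written for illustration; the example records are made up and stand in for the real download.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# A tiny made-up example standing in for the downloaded file.
EXAMPLE = """>seq1 hypothetical record
MAKKVTVTLV
DDFDGSG
>seq2 hypothetical record
MAKKVTV
"""

records = list(read_fasta(io.StringIO(EXAMPLE)))
print(len(records), "records")
for header, sequence in records:
    print(header.split()[0], len(sequence))
```

With the real download, I would open the file instead of using the example string (for instance `with open("homologues.fasta") as handle:`, where the filename is hypothetical) and expect 270 records.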

2 Aligning homologues

Taking the 50% identity sequences as our starting point, we could usually align them using tools in UniProt, but with ≈270 sequences there are too many. So we’ll have to use another approach.

2.1 Align using MAFFT on Galaxy

We could download a standalone tool, but we’re going to use Galaxy instead (don’t forget to register and log in). We’ll use MAFFT to align sequences.

  1. Load the 270 sequence dataset
  2. Find the MAFFT tool (you might as well use the FFT-NS-2 default method)
  3. Click on Run Tool

This will generate the sequence alignment for you in FASTA format. Download the alignment to your own machine (click on the floppy disk icon).
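
If you did want to run the standalone tool instead, the equivalent local run could be sketched as below. This assumes MAFFT is installed and available on your PATH; the function names and filenames are hypothetical, and the options are the ones the MAFFT manual gives for the FFT-NS-2 method.

```python
import shutil
import subprocess


def build_mafft_command(input_fasta):
    """Build the argument list for an FFT-NS-2 run of MAFFT."""
    # The MAFFT manual describes FFT-NS-2 as: --retree 2 --maxiterate 0
    return ["mafft", "--retree", "2", "--maxiterate", "0", input_fasta]


def run_mafft(input_fasta, output_fasta):
    """Run MAFFT locally and write the FASTA alignment to output_fasta."""
    if shutil.which("mafft") is None:
        raise RuntimeError("mafft was not found on PATH")
    # MAFFT writes the alignment to standard output, so redirect it to a file.
    with open(output_fasta, "w") as out:
        subprocess.run(build_mafft_command(input_fasta), stdout=out, check=True)


print(" ".join(build_mafft_command("homologues.fasta")))
```

Calling `run_mafft("homologues.fasta", "aligned.fasta")` would then produce an alignment comparable to the one from Galaxy.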

2.2 Visualise alignment with JalView

To visualise the alignment, we will use a standalone software tool: JalView. Download and install this on your own machine.

To visualise the alignment:

  1. Start JalView
  2. Click on File -> Input Alignment -> From File, then select your downloaded alignment
  3. Click on the maximise button to see the larger alignment

We can see the full length of the sequence and that similar residues are aligned vertically, in Figure 4. We can see any insertions/deletions for specific sequences visually. The extent of conservation, and a consensus sequence, can be seen below the alignment. We can colour the alignment to highlight aspects of the biology/biochemistry:

  1. Click on the Colour menu item
  2. Choose a colour scheme (e.g. Clustal to recreate Figure 4)
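
To get a feel for what a per-column conservation score means, here is a deliberately simple sketch: for each alignment column, the fraction of sequences that share the most common residue. This is my own simplified measure for illustration, not the calculation JalView uses for its conservation track, and the example alignment is made up.

```python
from collections import Counter


def column_conservation(aligned_sequences):
    """Return, per column, the fraction of sequences sharing the top residue.

    Gaps ('-') never count as the shared residue, so gappy columns score low.
    """
    total = len(aligned_sequences)
    length = len(aligned_sequences[0])
    if any(len(seq) != length for seq in aligned_sequences):
        raise ValueError("all aligned sequences must have the same length")
    scores = []
    for i in range(length):
        counts = Counter(seq[i] for seq in aligned_sequences if seq[i] != "-")
        top = counts.most_common(1)[0][1] if counts else 0
        scores.append(top / total)
    return scores


# A small made-up alignment of four sequences.
ALIGNMENT = ["MAKKV", "MAKRV", "MA-KV", "MSKKV"]
print(column_conservation(ALIGNMENT))
```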

Scrolling up and down the alignment shows where some sequences are quite different, or may be incomplete. We might want to delete some sequences from the collection.

Important

Whenever we modify our dataset, we need to record which sequences are removed or otherwise altered, and our grounds for removing them (and this should be described in the Methods section of the thesis).
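
One way to keep that record is to do the filtering in a short script that reports what it removed and why. The sketch below is an illustration under my own assumptions: it removes sequences much shorter than a reference length, and the 80% cut-off, the function name and the example identifiers are all arbitrary choices you would need to justify for your own data.

```python
def filter_by_length(records, reference_length, min_fraction=0.8):
    """Split (identifier, sequence) records into kept and removed lists.

    Each removed entry is paired with the reason for its removal, so the
    decision can be reported later.
    """
    kept, removed = [], []
    threshold = min_fraction * reference_length
    for identifier, sequence in records:
        if len(sequence) < threshold:
            reason = (
                f"length {len(sequence)} < {min_fraction:.0%} "
                f"of reference ({reference_length})"
            )
            removed.append((identifier, reason))
        else:
            kept.append((identifier, sequence))
    return kept, removed


# Made-up records: one full-length sequence and one fragment.
RECORDS = [("full", "A" * 112), ("fragment", "A" * 50)]
kept, removed = filter_by_length(RECORDS, reference_length=112)
for identifier, reason in removed:
    print(f"{identifier}\t{reason}")
```

Saving the printed lines to a file gives a removal log that can go straight into your notes.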

2.2.1 Arranging sequences by similarity

Initially the alignment puts quite different sequences next to each other. We can sort the sequences by similarity.

  1. Select all the sequences (Select -> Select All, or Ctrl/Cmd-A)
  2. Calculate -> Calculate Tree, PCA, or PaSiMap
  3. Select Neighbour Joining Tree and BLOSUM 62
  4. Click Calculate

When the tree is created, sort the alignment:

  1. View -> Sort Alignment By Tree

Now, when you click within the tree, sequences will be coloured by where they are found on the tree.

You can save the tree as a Newick file, to visualise in other software (like Dendroscope or FigTree):

  1. File -> Save As -> Newick format
Figure 4: JalView alignment
Figure 5: JalView Neighbour-Joining Tree

The combination of alignment and tree tells me several things immediately:

  1. That there are several distinct clusters of similar Lsr2 sequences (clades/groups in the tree, visible differences between sequences in the alignment)
  2. That there are two regions of high sequence conservation, with a region of low sequence conservation separating them (from the alignment - the gaps that run down the middle of the proteins)
  3. That I need to remove some sequences because they’re relatively low quality (e.g. the sequences with half the protein missing, in the alignment)

3 Generating a phylogenetic tree

Using a sequence alignment to produce a phylogenetic tree can be a complex and detailed topic. There is an entire research field dedicated to the problem of generating an accurate representation of evolutionary history from protein and/or nucleotide sequences. Here we have space to quickly go through basic approaches to producing an initial tree and, if you would like to know more, please check out the links below.

3.1 Generating a basic phylogenetic tree

We have already produced a basic phylogenetic tree using JalView (Section 2.2.1), as shown in Figure 5. To make this tree, JalView used the Neighbour-Joining algorithm and the BLOSUM62 amino acid substitution matrix. Some advantages and disadvantages of the method are noted in this paper.

In general, we would prefer to use a more sophisticated method such as Maximum Likelihood or Bayesian phylogenetic reconstruction, using the underlying nucleotide sequences rather than the protein sequences. We would want to use the nucleotide sequences because they carry more evolutionary information (the same amino acid may be encoded by different codons in two sequences, and the differences between those codons are evolutionary information). We’d use Maximum Likelihood/Bayesian approaches because they give better estimates of the evolutionary history leading to the sequences in the alignment than Neighbour-Joining (which has systematic biases in its approach).
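
The point about codons can be shown with a small sketch. The two nucleotide sequences below are made up for illustration (they are not real lsr2 sequences); they differ at three positions but translate to the same four amino acids, so the protein alignment would show no difference at all. The translation uses the standard genetic code.

```python
from itertools import product

# Standard genetic code, with codons ordered by base T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(nucleotides):
    """Translate a coding sequence (length a multiple of 3) to protein."""
    codons = (nucleotides[i:i + 3] for i in range(0, len(nucleotides), 3))
    return "".join(CODON_TABLE[codon] for codon in codons)


def count_differences(seq_a, seq_b):
    """Count positions at which two equal-length sequences differ."""
    return sum(a != b for a, b in zip(seq_a, seq_b))


# Hypothetical coding sequences, invented for this example.
GENE_A = "ATGGCCAAGAAA"
GENE_B = "ATGGCGAAAAAG"

print(translate(GENE_A), translate(GENE_B))
print("nucleotide differences:", count_differences(GENE_A, GENE_B))
print("amino acid differences:",
      count_differences(translate(GENE_A), translate(GENE_B)))
```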

Tip

You can use some of these more advanced phylogenetic tools at Galaxy

3.2 Generating a more sophisticated phylogenetic tree

Note

For the example below I have removed some distant/incomplete sequences, and a short leader sequence that isn’t present in most proteins, to make the alignment truncated_sequences.fa.

To generate a Maximum Likelihood phylogenetic tree from the truncated_sequences.fa file, we can use the RAxML tool in Galaxy (Figure 6).

  1. Select the RAxML tool in the tools sidebar
  2. Choose the Amino Acid model type
  3. Choose the PROTCAT substitution model and the BLOSUM62 matrix
  4. Click on Run Tool

The tree will take a short while to be generated, and will produce five output files (Figure 7). The tree information can be found in the Best-scoring ML tree output.
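
Once the tree file is downloaded, a quick check I might run is to list the leaf labels and confirm that every sequence from the alignment made it into the tree. The sketch below is a rough illustration using a regular expression rather than a full Newick parser: it assumes unquoted labels, and the example tree is made up.

```python
import re


def leaf_labels(newick):
    """Return leaf labels from a Newick string (unquoted labels only).

    A leaf label is taken to be any name that directly follows '(' or ','.
    Names after ')' are internal node labels or support values and are skipped.
    """
    found = re.findall(r"[(,]\s*([^(),:;]+)", newick)
    labels = [label.strip() for label in found]
    return [label for label in labels if label]


# A small made-up tree with branch lengths.
EXAMPLE_TREE = "((A:0.1,B:0.2):0.05,C:0.3);"
print(leaf_labels(EXAMPLE_TREE))
```

Comparing `len(leaf_labels(...))` with the number of sequences in the alignment is a cheap way to catch a truncated or mismatched file.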

Figure 6: Settings for generating a protein phylogenetic tree with RAxML in Galaxy
Figure 7: RAxML output on Galaxy

4 Visualising a phylogenetic tree

A simple way to visualise phylogenetic tree output on Galaxy is to use the Newick Display tool (Figure 8):

  1. Find and click on the Newick Display tool in Galaxy
  2. Make sure the tree output file you want is selected in the Newick file field
  3. Click on Run Tool
  4. When the tool finishes, click on the eye icon, and the floppy/download icon to download the tree rendering.

This generates a bitmap of the input tree (Figure 9). It is a quick way to see the overall tree structure, but is very limited in terms of customisability and interaction. The image is usually too large to view all at once, and needs to be downloaded. Even then it may not be very aesthetically pleasing.

Figure 8: Options for the Newick Display tool
Figure 9: Newick Display rendered output

Many alternative tree visualisation tools are available; this walkthrough uses FigTree.

4.1 Visualising a tree with FigTree

FigTree is a cross-platform phylogenetic tree visualisation package that can generate publication-quality figures. It is fairly straightforward to use, and good for experimenting with tree layouts and options.

  1. Download FigTree from http://tree.bio.ed.ac.uk/software/Figtree/ and install it on your machine
  2. Start FigTree and open the RAxML best-scoring tree, downloaded from Galaxy (the file extension will be .nhx)

You should be presented with a screen that resembles Figure 10, showing a similar, but reshaped, view to that in Figure 9. A quick visual comparison should be enough to convince you that, as a scientist, you can make many visualisation choices that help the reader understand your tree, and help you make your point more clearly in your manuscript.

Figure 10: Initial FigTree view of the RAxML tree

4.1.1 Rooted or unrooted?

Real trees have roots, but phylogenetic trees may or may not have roots - it depends on the method used to construct them.

Tip

In a phylogenetic tree, the root is the most ancient common ancestor of all sequences/species/taxa in the tree.

UPGMA trees have roots because of the way the tree construction algorithm works. Neighbour-Joining trees are, strictly, unrooted, although software often displays them with a root. In either case, the displayed root may or may not be a “true” root, in the sense that it represents the true last common ancestor of all sequences in the tree.

Maximum Likelihood and Bayesian trees do not have roots enforced on them by their algorithms, but if you include sequences whose evolutionary relationship to the others you already know (an outgroup), you can identify where the true root of the tree should be.

The tree I generated with RAxML is a Maximum Likelihood tree, so I’m going to visualise it as an unrooted tree, first, in Figure 11.

Figure 11: Unrooted maximum likelihood tree. Note that the FigTree Tip Labels setting is unchecked so that the tree is prominent.

Looking at Figure 11, I can see that the top left part of the tree has very long branches (the lines between nodes of the tree that either represent divergence events or the sequences themselves). This implies that the sequences at the ends of these branches have undergone much more evolutionary change (i.e. underwent more substitutions) than the other sequences. This has two broad explanations:

  1. This reflects the true evolutionary history of these sequences. They may have retained more substitutions because they were under extreme selection pressure (not unusual for pathogen effectors).
  2. This is an artefact (i.e. not a real biological result). We may have included sequences in our analysis that do not belong - they appear to have accepted many substitutions because they are too different to the rest of the sequences to really belong in the same category.

Warning

We cannot decide between these two options by looking at the tree alone. We need to do more work analysing the sequences and the tree.

4.1.2 Midpoint Rooting

Midpoint rooting is a way of deciding upon a root point for an unrooted tree, without any reference to the true biology. It works by finding the longest path between any two taxa (i.e. end points on the tree) and setting the root at the midpoint of that path. Using the Midpoint Root option on a visually unrooted tree doesn’t change its appearance drastically, as you can see from Figure 12.
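
To make the idea concrete, here is a minimal sketch of midpoint rooting in Python. It is not how FigTree implements it; the toy tree, its branch lengths, and the function names `path_between` and `midpoint_root` are all made up for illustration. The tree is stored as an undirected graph, the longest tip-to-tip path is found by brute force, and the root is placed halfway along that path.

```python
from itertools import combinations

# Hypothetical toy unrooted tree: node -> {neighbour: branch length}.
# A, B, C and D are tips; n1 and n2 are internal nodes.
TREE = {
    "A": {"n1": 1.5},
    "B": {"n1": 1.0},
    "n1": {"A": 1.5, "B": 1.0, "n2": 0.5},
    "n2": {"n1": 0.5, "C": 4.0, "D": 0.5},
    "C": {"n2": 4.0},
    "D": {"n2": 0.5},
}


def path_between(tree, start, end):
    """Return the nodes on the unique path from start to end."""
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        if node == end:
            return path
        for neighbour in tree[node]:
            if neighbour not in path:
                stack.append((neighbour, path + [neighbour]))
    raise ValueError(f"no path from {start} to {end}")


def midpoint_root(tree):
    """Return (branch, offset, path_length) for the midpoint root.

    branch is the pair of nodes whose connecting branch holds the root,
    offset is the distance of the root from the first node of that pair,
    and path_length is the length of the longest tip-to-tip path.
    """
    tips = [node for node, nbrs in tree.items() if len(nbrs) == 1]
    best_length, best_path = -1.0, None
    for a, b in combinations(tips, 2):
        path = path_between(tree, a, b)
        length = sum(tree[u][v] for u, v in zip(path, path[1:]))
        if length > best_length:
            best_length, best_path = length, path
    # Walk along the longest path until half of its length is covered.
    half = best_length / 2
    walked = 0.0
    for u, v in zip(best_path, best_path[1:]):
        if walked + tree[u][v] >= half:
            return (u, v), half - walked, best_length
        walked += tree[u][v]


branch, offset, longest = midpoint_root(TREE)
print(f"Longest path is {longest}; root sits {offset} along branch {branch}")
```

In this toy tree the long branch leading to C dominates the longest path, so the root lands on that branch. This is why a few very divergent sequences can determine where a midpoint root is placed.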

Tip

In FigTree the Midpoint Root option is in the Tree menu.

Figure 12: Unrooted maximum likelihood tree, with midpoint rooting. Due to the visual presentation of this tree, the midpoint rooting appears to have no effect.

But, if we switch the display style to a phylogram as in Figure 13, we can see that the tree appears “balanced”, with a clear outgroup at the bottom end of the tree. We can see three clear sets of sequences that appear to be distanced from each other in the tree.

Figure 13: Unrooted maximum likelihood tree, with midpoint rooting. Due to the visual presentation of this tree as a phylogram, the midpoint rooting “balances” the tree, and we can see three groups of sequences that are relatively dissimilar from each other: the small group at the top of the tree, the large group in the centre, and the sequences at the very bottom of the tree.

4.1.3 Interpreting the tree

As noted above, the very long branches separating our sequences into three groups could represent true evolutionary history, or be an artefact from analysing sequences that don’t really belong together.

Tip

Generally, I would expect a set of related sequences in a phylogenetic tree to have about the same “distance” from the root to each tip. The presence of these three distinct groups could be a red flag that I’ve done something wrong - or it could be very interesting!
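
One way to check this expectation numerically, rather than by eye, is to compute the distance from the root to every tip and flag the tips that are unusually far away. The sketch below uses a made-up toy tree, and the function names and the "more than twice the median" threshold are arbitrary assumptions for illustration, not an established rule.

```python
from statistics import median

# Hypothetical toy rooted tree: (name, branch_length, children).
TREE = ("root", 0.0, [
    ("n1", 0.5, [("A", 1.0, []), ("B", 1.0, [])]),
    ("n2", 0.5, [("C", 1.0, []), ("D", 4.0, [])]),
])


def root_to_tip(node, so_far=0.0):
    """Return {tip name: summed branch length from the root}."""
    name, length, children = node
    total = so_far + length
    if not children:
        return {name: total}
    distances = {}
    for child in children:
        distances.update(root_to_tip(child, total))
    return distances


def flag_long_tips(distances, factor=2.0):
    """Return tips further from the root than factor x the median distance."""
    typical = median(distances.values())
    return sorted(name for name, d in distances.items() if d > factor * typical)


distances = root_to_tip(TREE)
print(distances)
print("Unusually distant tips:", flag_long_tips(distances))
```

A flagged tip is only a prompt to look more closely at that sequence and its alignment; it does not tell us whether the long branch is real biology or an artefact.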

The first thing I want to know is what sequences are in each of these three groups. I first ladderise (Tree -> Increasing Node Order) the tree, and turn Tip Labels back on. Then I use the Expansion slider to stretch the tree vertically and make space between the tip labels, as in Figure 14.

Figure 14: Ladderised maximum likelihood tree with tip labels (some of the tree disappears out of the top of the FigTree window).

By using the Zoom slider and increasing the font size in the Tip Labels menu I can see the sequence names more clearly in Figure 15.

Figure 15: Ladderised maximum likelihood tree with tip labels, zoomed to see the two most distantly-related sequence sets.

By inspecting this zoomed-in view, I can see that the larger outlying group of sequences is annotated with codes that end in the letters NOCA, NOCBR or similar, while most of the sequences in the dataset have codes like MYCO, MYCRU and so on. This tells me that:

  • all of the main set of sequences derive from Mycobacterium
  • the larger group of outlying sequences derive from Nocardia
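
With many tips, reading the codes off the screen gets tedious, so the same check can be sketched in Python. This assumes tip labels shaped like `tr|A0A1Z4F4V8|A0A1Z4F4V8_9MYCO/27-140`, and it guesses the genus from the first three letters of the organism code, which is a heuristic rather than a guarantee. The function names are my own, and the Nocardia label is a made-up placeholder, not a real entry from the dataset.

```python
import re
from collections import Counter

# Organism code: the text between the last "_" of the entry name and the "/".
CODE = re.compile(r"_([A-Z0-9]+)/")

# Heuristic assumption: the first three letters of the code indicate the genus.
GENUS_BY_PREFIX = {"MYC": "Mycobacterium", "NOC": "Nocardia"}


def organism_code(label):
    """Extract the organism code (e.g. MYCTU, 9MYCO) from a tip label."""
    match = CODE.search(label)
    return match.group(1) if match else None


def guess_genus(code):
    """Guess a genus from an organism code, ignoring any leading 9."""
    if code is None:
        return "unknown"
    return GENUS_BY_PREFIX.get(code.lstrip("9")[:3], "unknown")


labels = [
    "sp|P9WIP7|LSR2_MYCTU/1-112",
    "sp|P24094|LSR2_MYCLE/1-112",
    "tr|A0A1Z4F4V8|A0A1Z4F4V8_9MYCO/27-140",
    "tr|EXAMPLE1|EXAMPLE1_NOCBR/1-120",  # made-up placeholder label
]

counts = Counter(guess_genus(organism_code(label)) for label in labels)
print(counts)
```

Anything reported as "unknown" would need checking by hand, and codes that start with 9 (such as 9MYCO) identify a broader group rather than a single species.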

The single other outlying sequence appears to come from Mycobacterium, so we should look at the sequence alignment to see whether there is an issue with the alignment (Figure 16).

Figure 16: Jalview visualisation of the sequence alignment, centring on the Nocardia sequences, and the single outlying Mycobacterium sequence.

From this alignment, we can see that the Nocardia sequences are broadly similar to each other, and do align with the Mycobacterium sequences, even though there appears to be an inserted region (we can look at the location of this on the structure, later), and quite a bit of sequence difference. This kind of pattern is consistent with there being a common ancestor of this sequence in Nocardia and Mycobacterium, and them being separated by speciation, then evolving separately for a long period.

The Mycobacterium sequence that looks out of place is tr|A0A1Z4F4V8|A0A1Z4F4V8_9MYCO/27-140 - the sequence looks similar enough to the others to appear to be validly aligned, but it is quite different to the other Mycobacterium proteins. It is not clear at this point why these differences should be present, so we continue our analysis keeping this sequence in the dataset.

4.2 General conclusions from the tree

At this point we can draw some tentative conclusions from the tree.

  • There was probably a related sequence present in an ancestor of Mycobacterium and Nocardia, and the homologues we find in the database have diverged after speciation.
    • It is not clear whether these proteins have the same function(s) in these two genera.
  • There is a questionable Mycobacterium homologue, but we don’t have sufficient evidence to exclude it from the analysis as an artefact and can proceed assuming there is a biological relationship.
  • There are, within the Mycobacteria, several groups of related sequences (we call these clades) which might represent any or all of:
    • speciation (i.e. the tree reflects organism history and evolution)
    • functional divergence (i.e. each clade represents a different biological interaction, activity, or function)
    • but we can’t tell from the tree alone what the (most) likely explanations are - we need to do more work
  • The MYCO coding does not give us species information and it’s not possible to see whether the groupings of sequences correspond to speciation, or paralogy within species. We would need to do some bioinformatics work to label these sequences with their corresponding taxa.
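
As a rough idea of what that labelling work might look like, the sketch below pulls the accession out of a tip label and asks the UniProt REST service for the organism name. It assumes that the second field of each label is a UniProt accession, that `https://rest.uniprot.org/uniprotkb/<accession>.json` returns a JSON record, and that the record carries the name under `organism` and `scientificName`. The function names are my own, and the lookup needs network access, so treat this as a starting point to check against the UniProt documentation rather than a finished tool.

```python
import json
import re
from urllib.request import urlopen

# Assumption: labels start with "sp|" or "tr|" followed by the accession.
ACCESSION = re.compile(r"^(?:sp|tr)\|([A-Z0-9]+)\|")


def accession_from_label(label):
    """Extract the accession from a label like sp|P9WIP7|LSR2_MYCTU/1-112."""
    match = ACCESSION.match(label)
    if match is None:
        raise ValueError(f"no accession found in {label!r}")
    return match.group(1)


def uniprot_url(accession):
    """Build the (assumed) UniProt REST URL for one entry, in JSON format."""
    return f"https://rest.uniprot.org/uniprotkb/{accession}.json"


def organism_for_label(label):
    """Look up the organism name for a tip label (requires network access)."""
    with urlopen(uniprot_url(accession_from_label(label))) as response:
        record = json.load(response)
    # Assumption: the name is stored under organism -> scientificName.
    return record.get("organism", {}).get("scientificName", "unknown")


print(uniprot_url(accession_from_label("sp|P9WIP7|LSR2_MYCTU/1-112")))
```

Calling `organism_for_label` for each tip label would give a table of labels and organism names, which could then be used to rename or colour the tips in FigTree.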