11  Assemble the ERR531380 Genome

Someone else has assembled the genome for you

The next step in the investigation would normally be to assemble the ERR531380 genome and assess the assembly quality.

Using Galaxy, genome assembly would take too long for this workshop.

In this scenario, a colleague has kindly assembled your genome for you using the SPAdes assembler (Prjibelski et al. (2020)) and provided the result for you to download in two files, below - the assembled genome (.fasta file), and the assembly graph (.gfa file):

  1. Download the assembled genome and assembly graph to the local folder containing your workshop files
  2. Upload both files to Galaxy
Video: Upload the ERR531380 assembly files
Inspect the uploaded assembly
  1. Click on the uploaded ERR531380.contigs.fasta file
QUESTIONS
  1. Has the genome been assembled into a single sequence?
  2. How many sequences (called contigs) was the genome assembled into?

In this section you will investigate the quality of this genome assembly visually using the Bandage tool. You will also use the CheckM tool to obtain a measure of genome completeness.

11.1 Visualising the Assembly

The Bandage software package (Wick et al. (2015)) can take assembly output from tools such as SPAdes and Shovill, and visualise them as a graph. This is a useful step in assessing the quality of an assembly and can help identify poorly-assembled regions and when there is potential unintentional sequencing of co-cultures of related strains, rather than an axenic isolate.

To use Bandage to visualise your genome assembly

  1. Navigate to the Bandage Image tool using the Tools sidebar in Galaxy
  2. Select the Bandage Image tool
  3. Make sure that you have selected the .gfa file as the Graphical Fragment Assembly input
  4. Click Run Tool
Video: Visualising the ERR531380 assembly using Bandage Image
Inspect the assembly graph by clicking on the eye icon for the Assembly Graph Image output.

Each separate subgraph in the Bandage graph image shows a section of the genome that the assembly software thinks might be linked together in some way. The branched and looped nature of these graphs shows where the assembler has been unable to make a decision between two or more ways the reads could be joined together.

Each differently-coloured section of the subgraph is a part of the genome sequence that the assembler was confident about stitching together into a contiguous sequence.

This kind of genome assembly, comprising many fragments, some of which can’t be confidently linked together, is called a draft genome assembly.

11.2 Estimating Assembly Quality

Bandage gives a visual account of the “quality” of an assembly in terms of its contiguity. By contrast, the CheckM software (Parks et al. (2015)) is a suite of tools used to assess the quality of a bacterial assembly in terms of whether it resembles known similar genome sequences.

CheckM estimates the completeness of the genome by looking for the presence of single copy marker genes (SCMGs) for a stated phylogenetic lineage. The more of these marker genes that can be determined to be present in the genome, the more complete the assembly is assumed to be.

The use CheckM to estimate the quality of your assembly:

  1. Navigate to the CheckM taxonomy_wf tool using the Tools sidebar in Galaxy
  2. Select the Taxonomic rank of Species and set the Taxon of interest to be Pseudomonas aeruginosa (the identity of the isolate is known, or at least strongly suspected, on the basis of polyphasic tests)
  3. Under Data structure for bins select In individual datasets and choose the ERR531380.contigs.fasta file
  4. Click on Run Tool
Video: Estimating assembly quality using CheckM

11.3 Next steps

With the draft genome assembly you have for ERR531380 you can use pubMLST to classify your isolate in more detail, using Galaxy and the pubMLST server.