9  Annotate the Genome

The Prokka tool (Seemann 2014) is an annotation pipeline designed specifically for prokaryotes (hence the name…), but it also works very well on virus genomes.

Note

Prokka produces a set of output files corresponding to accepted community standards, intended to be ready for submission of the annotated genome to a public repository.

In practice, submissions to the NCBI repository often use NCBI’s own Prokaryotic Genome Annotation Pipeline (PGAP) tool, which runs when the genome is sent to NCBI. This typically produces different annotation results to Prokka.

Good practice

When reporting how you annotated your genome in a manuscript or dissertation, you should always state:

  1. The software tool you used, with its version number and a citation of the paper describing it (if available; provide a URL to the software if there is no paper)
  2. The parameters used when running the tool (if default parameters were used, state this)
  3. The number of features that were annotated; if there is space, providing a table of feature types (e.g. genes, coding sequences (CDS), RNA sequences, etc.) and their corresponding counts can be helpful.
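
As a rough illustration of point 3, the sketch below counts feature types in GFF3-format text, using the third tab-separated column that the GFF3 format reserves for the feature type. The function name `count_feature_types` and the `EXAMPLE_GFF` records are invented for illustration and are not real Prokka output; in practice you would read the text from your own .gff file.

```python
from collections import Counter


def count_feature_types(gff_text):
    """Count GFF3 feature types (column 3).

    Comment/directive lines are skipped, and parsing stops at the
    '##FASTA' directive, which marks an embedded sequence section.
    """
    counts = Counter()
    for line in gff_text.splitlines():
        if line.startswith("##FASTA"):
            break
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) >= 9:  # a GFF3 feature line has nine columns
            counts[fields[2]] += 1
    return counts


# Invented example records (hypothetical, for demonstration only)
EXAMPLE_GFF = "\n".join([
    "##gff-version 3",
    "##sequence-region contig1 1 1000",
    "contig1\tdemo\tgene\t10\t300\t.\t+\t.\tID=DEMO_00001_gene",
    "contig1\tdemo\tCDS\t10\t300\t.\t+\t0\tID=DEMO_00001",
    "contig1\tdemo\tgene\t400\t900\t.\t+\t.\tID=DEMO_00002_gene",
    "contig1\tdemo\tCDS\t400\t900\t.\t+\t0\tID=DEMO_00002",
    "##FASTA",
    ">contig1",
    "ACGTACGT",
])

# Print a small two-column table of feature types and counts
for feature_type, n in sorted(count_feature_types(EXAMPLE_GFF).items()):
    print(f"{feature_type}\t{n}")
```

The printed two-column table can be pasted into a spreadsheet or reformatted as a table for a manuscript.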

9.1 Determine Genome Taxonomy

Prokka can use a core set of reference databases for the specified genus being annotated, to improve annotation accuracy. To make the best use of Prokka we should provide the sequenced organism’s genus. To find out what this is, we will use the NCBI Taxonomy service.

  1. Click on the NCBI Taxonomy link to reach the NCBI Taxonomy service
  2. Enter SARS-CoV-2 into the search field and click Search
  3. Click on the search result
  4. Mouseover the Lineage information to identify the genus of SARS-CoV-2
Questions
  1. What is the genus of SARS-CoV-2?
  2. How many SARS-CoV-2 genome assemblies can be found at NCBI?
  3. How many SARS-CoV-2 nucleotide sequences can be found at NCBI?

Answers
  1. SARS-CoV-2 is a Betacoronavirus
  2. There are 116 SARS-CoV-2 assemblies at NCBI (as of 2024-08-30)
  3. There are 8,850,155 SARS-CoV-2 nucleotide sequences at NCBI (as of 2024-08-30)
Video: Identifying the SARS-CoV-2 lineage at NCBI Taxonomy

9.2 Annotate the Genome

Now that we know the appropriate genus for our assembly, we can use Prokka to annotate it.

  1. Navigate to the prokka tool using the Tools sidebar in Galaxy
  2. Select the prokka tool
  3. Make sure that the Shovill output for Contigs is selected as Contigs to annotate
  4. Set the Locus tag prefix to SARSCoV2
     • this will be used to help give each annotated feature a recognisable name
  5. Select Yes for Force GenBank/ENA/DDJB compliance
  6. Enter Betacoronavirus under Genus name
  7. Enter SARS-CoV-2 under Species name
  8. Enter Wuhan 1 under Strain name
  9. Select Viruses from the drop-down box under Kingdom
  10. Select Yes for Use genus-specific BLAST database
  11. Click on Run Tool
Video: Annotating the SARS-CoV-2 genome using Prokka
Caution

Prokka can take a few minutes to run to completion.
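
For readers who later run Prokka outside Galaxy, the sketch below assembles a roughly equivalent command line. The mapping from the Galaxy form fields above to Prokka's command-line flags is an assumption on our part, and the input file name `contigs.fasta` and output directory `prokka_out` are placeholders. The code only builds and prints the command; it does not run Prokka.

```python
import shlex


def build_prokka_command(contigs_fasta, outdir="prokka_out"):
    """Return a Prokka command (as an argument list) mirroring the Galaxy settings.

    The flag-to-form-field mapping is assumed, not taken from the Galaxy wrapper.
    """
    return [
        "prokka",
        "--outdir", outdir,             # placeholder output directory
        "--locustag", "SARSCoV2",       # Locus tag prefix
        "--compliant",                  # Force GenBank/ENA compliance
        "--genus", "Betacoronavirus",   # Genus name
        "--species", "SARS-CoV-2",      # Species name
        "--strain", "Wuhan 1",          # Strain name
        "--kingdom", "Viruses",         # Kingdom
        "--usegenus",                   # Use genus-specific BLAST database
        contigs_fasta,                  # contigs to annotate (e.g. Shovill output)
    ]


command = build_prokka_command("contigs.fasta")

# shlex.join quotes arguments containing spaces, such as the strain name
print(shlex.join(command))
```

If Prokka is installed locally, the same argument list could be passed to `subprocess.run(command, check=True)`. Recording the printed command is also a convenient way to report the parameters used.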

9.3 Prokka output

Prokka produces several annotation output files in standard bioinformatics data formats. These formats are designed to be processed unambiguously by bioinformatics software tools.

Figure 9.1: Prokka output files in Galaxy.

The output includes:

  • .log: a log file of what Prokka did while annotating the genome
  • .gff: a table of annotated features
  • .tsv: a summary table of annotated features
  • .ffn: FASTA format nucleotide sequences of annotated features
  • .faa: FASTA format amino acid sequences of annotated features
  • .gbk: GenBank format file combining the genome sequence and its annotations

It can be worth spending a little time inspecting these output files to see what kinds of data they do and do not contain.
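
Because these are plain-text tabular formats, they can also be inspected with a short script. The sketch below reads text in the layout of the .tsv summary and reports the longest CDS and the shortest CDS whose product is not "hypothetical protein". It assumes the column names `locus_tag`, `ftype`, `length_bp`, `gene`, `EC_number`, `COG` and `product`; check the header line of your own file, as this may differ between Prokka versions. The rows in `EXAMPLE_TSV` are invented and are not real annotation results.

```python
import csv
import io


def summarise_features(tsv_text):
    """Return (longest CDS row, shortest non-hypothetical CDS row).

    Column names are assumed; see the lead-in paragraph.
    """
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    # Keep only CDS rows, so that matching 'gene' rows are not counted twice
    cds_rows = [row for row in reader if row["ftype"] == "CDS"]
    longest = max(cds_rows, key=lambda row: int(row["length_bp"]))
    named = [row for row in cds_rows if row["product"] != "hypothetical protein"]
    shortest_named = min(named, key=lambda row: int(row["length_bp"]))
    return longest, shortest_named


# Invented example rows (hypothetical, for demonstration only)
EXAMPLE_TSV = "\n".join("\t".join(row) for row in [
    ["locus_tag", "ftype", "length_bp", "gene", "EC_number", "COG", "product"],
    ["DEMO_00001", "gene", "900", "demA", "", "", ""],
    ["DEMO_00001", "CDS", "900", "demA", "", "", "demo polymerase"],
    ["DEMO_00002", "CDS", "120", "", "", "", "hypothetical protein"],
    ["DEMO_00003", "CDS", "300", "demB", "", "", "demo envelope protein"],
])

longest, shortest_named = summarise_features(EXAMPLE_TSV)
print("Longest CDS:", longest["locus_tag"], longest["product"], longest["length_bp"])
print("Shortest named CDS:", shortest_named["locus_tag"], shortest_named["length_bp"])
```

To use this on real output, read the downloaded .tsv file into a string (for example with `pathlib.Path(...).read_text()`) and pass it to `summarise_features`.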

Video: Inspecting Prokka output
Questions
  1. How many genes were annotated by Prokka?
  2. How many coding sequences (CDS) were annotated by Prokka?
  3. How many genes were identified on the reverse strand?
  4. What is the product of the longest annotated gene?
  5. How long is the shortest annotated gene that is not a “hypothetical protein”?
  6. What version of Prokka was used to annotate the genome?
  7. What are the start and end base positions of the spike protein?

Hints
  1. Look in the .gff file
  2. Look in the .gff file
  3. Look in the .gff file
  4. Look in the .tsv file
  5. Look in the .tsv file
  6. Look in the .gbk file
  7. Look in the .gff file

Answers
  1. Nine (9) genes were annotated
  2. Nine (9) CDS were annotated
  3. No genes were identified on the reverse strand (SARS-CoV-2 has a single-stranded, positive-sense RNA genome, so all of its genes lie on the forward strand)
  4. Replicase polyprotein 1a is the longest gene (13218 bases)
  5. The shortest non-hypothetical gene is protein 7a (366 bases)
  6. Prokka v1.14.6
  7. 21568..25389