9 Annotate the Genome

The Prokka tool (Seemann, 2014) is an annotation pipeline designed specifically for prokaryotes (hence the name…), but it also works well on virus genomes.

Prokka documentation

Note

Prokka produces a set of output files corresponding to accepted community standards, intended to be ready for submission of the annotated genome to a public repository.

In practice, submissions to the NCBI repository often use NCBI’s own Prokaryotic Genome Annotation Pipeline (PGAP) tool, which is run when the genome is submitted to NCBI. PGAP typically produces annotation results that differ from those produced by Prokka.

Good practice

When reporting how you annotated your genome in a manuscript or dissertation, you should always state:

The software tool you used, with its version number and a citation of the paper describing it (if available; provide a URL to the software if there is no paper)
The parameters used when running the tool (if default parameters were used, state this)
The number of features that were annotated; if there is space, providing a table of feature types (e.g. genes, coding sequences (CDS), RNA sequences, etc.) and their corresponding counts can be helpful.
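A table of feature types and counts can be produced with a few lines of code. The following is a minimal sketch, assuming the annotation summary is a tab-separated file with a header row that includes a column named `ftype` (as in Prokka's `.tsv` output). The rows in `example_tsv` are made up for illustration and stand in for a real file.

```python
import csv
import io
from collections import Counter


def count_feature_types(tsv_handle):
    """Count rows per feature type in a Prokka-style .tsv summary.

    Assumes a header row containing a column named 'ftype'.
    """
    reader = csv.DictReader(tsv_handle, delimiter="\t")
    return Counter(row["ftype"] for row in reader)


# Made-up example rows, standing in for a real annotation summary file
example_tsv = (
    "locus_tag\tftype\tlength_bp\tgene\tEC_number\tCOG\tproduct\n"
    "EX_00001\tCDS\t300\t\t\t\thypothetical protein\n"
    "EX_00002\tCDS\t450\t\t\t\thypothetical protein\n"
    "EX_00003\ttRNA\t75\t\t\t\ttRNA-Ala\n"
)

counts = count_feature_types(io.StringIO(example_tsv))

# Print one line per feature type, ready to paste into a table
for ftype, n in sorted(counts.items()):
    print(f"{ftype}\t{n}")
```

To use this on a real file, open it with `open("annotation.tsv", newline="")` (a hypothetical filename) and pass the handle to `count_feature_types`.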

9.1 Determine Genome Taxonomy

Prokka can use genus-specific reference databases to improve annotation accuracy, so to make the best use of it we should provide the genus of the sequenced organism. To find out what this is, we will use the NCBI Taxonomy service.

NCBI Taxonomy

Click on the NCBI Taxonomy link to reach the NCBI Taxonomy service
Enter SARS-CoV-2 into the search field and click Search
Click on the search result
Mouseover the Lineage information to identify the genus of SARS-CoV-2
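The same search can in principle be made programmatically through NCBI's E-utilities web service. The sketch below only builds the search URL and does not send a request; the helper name `taxonomy_search_url` is hypothetical. An `esearch` query of this kind returns matching taxonomy identifiers rather than the lineage itself, so a further request would be needed to read off the genus.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities esearch endpoint
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def taxonomy_search_url(term):
    """Build an esearch URL that queries the NCBI taxonomy database.

    Hypothetical helper for illustration; it does not contact NCBI.
    """
    params = {"db": "taxonomy", "term": term, "retmode": "json"}
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


url = taxonomy_search_url("SARS-CoV-2")
print(url)
```

For this workshop the web interface is the simpler route; a scripted query becomes more useful when many organisms need to be looked up.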

Questions

What is the genus of SARS-CoV-2?
How many SARS-CoV-2 genome assemblies can be found at NCBI?
How many SARS-CoV-2 nucleotide sequences can be found at NCBI?

Answers

SARS-CoV-2 is a Betacoronavirus
There are 116 SARS-CoV-2 assemblies at NCBI (as of 2024-08-30)
There are 8,850,155 SARS-CoV-2 nucleotide sequences at NCBI (as of 2024-08-30)

Video: Identifying the SARS-CoV-2 lineage at NCBI Taxonomy

9.2 Annotate the Genome

Now that we know the appropriate genus for our assembly, we can use Prokka to annotate it.

Navigate to the prokka tool using the Tools sidebar in Galaxy
Select the prokka tool
Make sure that the Shovill output for Contigs is selected as Contigs to annotate
Set the Locus tag prefix to SARSCoV2

This prefix will be used to help give each annotated feature a recognisable name.

Select Yes for Force GenBank/ENA/DDJB compliance
Enter Betacoronavirus under Genus name
Enter SARS-CoV-2 under Species name
Enter Wuhan 1 under Strain name
Select Viruses from the drop-down box under Kingdom
Select Yes for Use genus-specific BLAST database
Click on Run Tool
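For readers who later run Prokka outside Galaxy, the settings above correspond roughly to Prokka command-line options. The sketch below assembles such a command without running it. The input filename `contigs.fasta` and output directory `prokka_out` are placeholders, and the exact command that Galaxy constructs may differ from this one.

```python
import shlex


def build_prokka_command(contigs, outdir="prokka_out"):
    """Assemble a Prokka command mirroring the Galaxy settings above.

    The contigs path and output directory are placeholders.
    """
    return [
        "prokka",
        "--outdir", outdir,
        "--locustag", "SARSCoV2",      # Locus tag prefix
        "--compliant",                 # Force GenBank/ENA/DDJB compliance
        "--genus", "Betacoronavirus",  # Genus name
        "--species", "SARS-CoV-2",     # Species name
        "--strain", "Wuhan 1",         # Strain name
        "--kingdom", "Viruses",        # Kingdom
        "--usegenus",                  # Use genus-specific BLAST database
        contigs,
    ]


cmd = build_prokka_command("contigs.fasta")

# Show the command as it would be typed in a shell (arguments quoted as needed)
print(shlex.join(cmd))
```

Keeping the command as a list makes it straightforward to pass to `subprocess.run(cmd, check=True)` on a machine where Prokka is installed, and recording the printed command is one way to report the parameters used.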

Video: Annotating the SARS-CoV-2 genome using Prokka

Caution

Prokka can take a few minutes to run to completion.

9.3 `Prokka` output

Figure 9.1: `Prokka` output files in Galaxy.

Prokka produces several annotation output files in standard bioinformatics data formats, which are designed to be processed unambiguously by bioinformatics software tools. The output includes:

.log: a log file of what Prokka did while annotating the genome
.gff: a table of annotated features in GFF3 format
.tsv: a summary table of annotated features
.ffn: FASTA format nucleotide sequences of annotated features
.faa: FASTA format amino acid sequences of annotated features
.gbk: GenBank format file combining the genome sequence and its annotations

Data file formats

It can be worth spending a little time inspecting these output files to see what kinds of data they do and do not contain.
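As one way of inspecting the FASTA outputs (`.ffn` and `.faa`) outside Galaxy, the following minimal sketch reads FASTA records and reports the length of each sequence. The function `read_fasta` is written here for illustration and is not part of Prokka; the two records in `example_faa` are made up.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-format text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    # Emit the final record
    if header is not None:
        yield header, "".join(chunks)


# Made-up records, standing in for a real .faa file
example_faa = (
    ">EX_00001 hypothetical protein\n"
    "MKT\n"
    "AYI\n"
    ">EX_00002 hypothetical protein\n"
    "MSD\n"
)

records = list(read_fasta(io.StringIO(example_faa)))

# Report the identifier and sequence length of each record
for header, seq in records:
    print(header.split()[0], len(seq))
```

Comparing the number of records in the `.faa` file with the number of CDS features in the summary table is a quick consistency check on the annotation output.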

Video: Inspecting Prokka output