9 Annotate the Genome
The Prokka tool (Seemann, 2014) is an annotation pipeline designed specifically for prokaryotes (hence the name), but it also works well on virus genomes.
Prokka produces a set of output files corresponding to accepted community standards, intended to be ready for submission of the annotated genome to a public repository.
In practice, submissions to the NCBI repository often use NCBI’s own Prokaryotic Genome Annotation Pipeline (PGAP) tool, which runs when the genome is sent to NCBI. This typically produces different annotation results to Prokka.
When reporting how you annotated your genome in a manuscript or dissertation, you should always state:
- The software tool you used, with its version number and a citation of the paper describing it (if available; provide a URL to the software if there is no paper)
- The parameters used when running the tool (if default parameters were used, state this)
- The number of features that were annotated; if there is space, providing a table of feature types (e.g. genes, coding sequences (CDS), RNA sequences, etc.) and their corresponding counts can be helpful.
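As a minimal sketch of how the counts for such a table could be tallied, the Python below assumes the annotation is available as GFF3 text, with tab-separated columns and the feature type in the third column. The function name `count_feature_types` and the two-feature example are hypothetical illustrations, not real annotation output.

```python
from collections import Counter


def count_feature_types(gff_text):
    """Count GFF3 features by type (the third tab-separated column)."""
    counts = Counter()
    for line in gff_text.splitlines():
        if line.startswith("##FASTA"):
            break  # GFF3 files may append the sequence after this marker
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/directive lines
        columns = line.split("\t")
        if len(columns) >= 9:  # a GFF3 feature line has nine columns
            counts[columns[2]] += 1
    return counts


# Hypothetical example data, not real Prokka output
example_gff = (
    "##gff-version 3\n"
    "contig_1\texample\tgene\t10\t99\t.\t+\t.\tID=EX_00001_gene\n"
    "contig_1\texample\tCDS\t10\t99\t.\t+\t0\tID=EX_00001\n"
    "##FASTA\n"
    ">contig_1\n"
    "ACGT\n"
)

# Print one "type<TAB>count" row per feature type
for feature_type, n in sorted(count_feature_types(example_gff).items()):
    print(f"{feature_type}\t{n}")
```

For a real file, the text could be read with `open(path).read()` and passed to the same function.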
9.1 Determine Genome Taxonomy
Prokka can use a core set of reference databases specific to the genus being annotated, to improve annotation accuracy. To make the best use of Prokka we should provide the sequenced organism’s genus. To find out what this is, we will use the NCBI Taxonomy service.
- Click on the NCBI Taxonomy link to reach the NCBI Taxonomy service
- Enter `SARS-CoV-2` into the search field and click `Search`
- Click on the search result
- Mouseover the `Lineage` information to identify the genus of SARS-CoV-2
- What is the genus of SARS-CoV-2?
- How many SARS-CoV-2 nucleotide sequences can be found at NCBI?
- SARS-CoV-2 is a Betacoronavirus
- There are 9,158,699 SARS-CoV-2 nucleotide assemblies at NCBI (as of 2025-09-19)
9.2 Annotate the Genome
Now that we know the appropriate genus for our assembly, we can use Prokka to annotate it.
- Navigate to the `prokka` tool using the `Tools` sidebar in Galaxy
- Select the `prokka` tool
- Make sure that the `Shovill` output for `Contigs` is selected as `Contigs to annotate`
- Set the `Locus tag prefix` to `SARSCoV2`
  - this will be used to help give each annotated feature a recognisable name
- Select `Yes` for `Force GenBank/ENA/DDJB compliance`
- Enter `Betacoronavirus` under `Genus name`
- Enter `SARS-CoV-2` under `Species name`
- Enter `Wuhan 1` under `Strain name`
- Select `Viruses` from the drop-down box under `Kingdom`
- Select `Yes` for `Use genus-specific BLAST database`
- Click on `Run Tool`
Prokka can take a few minutes to run to completion.
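For readers who want to reproduce these settings outside Galaxy, the sketch below assembles what we assume to be a roughly equivalent Prokka command line. The mapping from the Galaxy form fields to the command-line flags is our assumption, as are the input file name `contigs.fasta` and the output directory `prokka_out`. The helper `build_prokka_command` is hypothetical. The code only builds and prints the command; it does not run Prokka.

```python
import shlex


def build_prokka_command(contigs, outdir="prokka_out"):
    """Return a Prokka argument list mirroring the Galaxy settings above."""
    return [
        "prokka",
        "--outdir", outdir,            # assumed output directory name
        "--locustag", "SARSCoV2",      # Locus tag prefix
        "--compliant",                 # Force GenBank/ENA/DDJB compliance
        "--genus", "Betacoronavirus",  # Genus name
        "--species", "SARS-CoV-2",     # Species name
        "--strain", "Wuhan 1",         # Strain name
        "--kingdom", "Viruses",        # Kingdom
        "--usegenus",                  # Use genus-specific BLAST database
        contigs,                       # assembled contigs in FASTA format
    ]


command = build_prokka_command("contigs.fasta")

# Show the command as it could be typed in a shell (values with spaces are quoted)
print(shlex.join(command))
```

On a machine where Prokka is installed, the same list could be passed to `subprocess.run(command, check=True)`.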
9.3 Prokka output
Prokka produces several annotation output files, in standard bioinformatics data formats. These formats are designed to be processed unambiguously by bioinformatics software tools. The output includes:
- `.log`: a log file of what `Prokka` did while annotating the genome
- `.gff`: a table of annotated features
- `.tsv`: a summary table of annotated features
- `.ffn`: FASTA format nucleotide sequences of annotated features
- `.faa`: FASTA format amino acid sequences of annotated features
- `.gbk`: GenBank format file combining the genome sequence and its annotations

Figure: Prokka output files in Galaxy.
It can be worth spending a little time inspecting these output files to see what kinds of data they do and do not contain.
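Because these formats are meant to be machine-readable, the files can also be inspected with a short script. The sketch below assumes GFF3 conventions: nine tab-separated columns, with start and end in columns 4 and 5, strand in column 7, and `key=value` attributes separated by semicolons in column 9. The helper `parse_gff_features` and the example records are hypothetical, not real Prokka output.

```python
def parse_gff_features(gff_text):
    """Yield (type, start, end, strand, attributes) for each GFF3 feature line."""
    for line in gff_text.splitlines():
        if line.startswith("##FASTA"):
            break  # any sequence data after this marker is not a feature
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/directive lines
        cols = line.split("\t")
        if len(cols) < 9:
            continue  # not a complete feature line
        # Column 9 holds semicolon-separated key=value pairs
        attributes = dict(
            item.split("=", 1) for item in cols[8].split(";") if "=" in item
        )
        yield cols[2], int(cols[3]), int(cols[4]), cols[6], attributes


# Hypothetical example data, not real Prokka output
example_gff = (
    "##gff-version 3\n"
    "contig_1\texample\tCDS\t10\t99\t.\t+\t0\tID=EX_00001;product=example protein A\n"
    "contig_1\texample\tCDS\t150\t260\t.\t-\t0\tID=EX_00002;product=hypothetical protein\n"
)

features = list(parse_gff_features(example_gff))

# Features annotated on the reverse strand carry "-" in the strand column
reverse_strand = [f for f in features if f[3] == "-"]
print(len(reverse_strand))

# Report the coordinates of features whose product matches a search term
for ftype, start, end, strand, attrs in features:
    if "protein A" in attrs.get("product", ""):
        print(f"{start}..{end}")
```

Attribute values in real GFF3 files may be URL-encoded, which this sketch does not decode.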
Prokka output
- How many genes were annotated by `Prokka`?
- How many coding sequences (CDS) were annotated by `Prokka`?
- How many genes were identified on the reverse strand?
- What is the product of the longest annotated gene?
- How long is the shortest annotated gene that is not a “hypothetical protein”?
- What version of `Prokka` was used to annotate the genome?
- What are the start and end base positions of the spike protein?
- Look in the `.gff` file
- Look in the `.gff` file
- Look in the `.gff` file
- Look in the `.tsv` file
- Look in the `.tsv` file
- Look in the `.gbk` file
- Look in the `.gff` file
- Nine (9) genes were annotated
- Nine (9) CDS were annotated
- No genes are identified on the reverse strand (SARS-CoV-2 has a single-stranded RNA genome, so there is no reverse strand)
- Replicase polyprotein 1a is the product of the longest gene (13218 bases)
- The product of the shortest non-hypothetical gene is protein 7a (366 bases)
- `Prokka` v1.14.6
- 21568..25389
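
The length-based questions can also be answered programmatically rather than by eye. The sketch below assumes the summary table is tab-separated with a header row containing `ftype`, `length_bp` and `product` columns, which is our understanding of Prokka's `.tsv` layout and should be checked against your own file. The rows shown are hypothetical placeholders, not the SARS-CoV-2 results.

```python
import csv
import io

# Hypothetical table in the column layout assumed for the .tsv output
EXAMPLE_TSV = (
    "locus_tag\tftype\tlength_bp\tgene\tEC_number\tCOG\tproduct\n"
    "EX_00001\tCDS\t900\t\t\t\texample protein A\n"
    "EX_00002\tCDS\t120\t\t\t\thypothetical protein\n"
    "EX_00003\tCDS\t300\t\t\t\texample protein B\n"
)


def read_cds_rows(tsv_text):
    """Return CDS rows as dicts, with length_bp converted to int."""
    rows = []
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        if row["ftype"] == "CDS":
            row["length_bp"] = int(row["length_bp"])
            rows.append(row)
    return rows


def longest(rows):
    """Return the row with the greatest length_bp."""
    return max(rows, key=lambda row: row["length_bp"])


def shortest_named(rows):
    """Return the shortest row whose product is not 'hypothetical protein'."""
    named = [row for row in rows if row["product"] != "hypothetical protein"]
    return min(named, key=lambda row: row["length_bp"])


cds_rows = read_cds_rows(EXAMPLE_TSV)
print(longest(cds_rows)["product"], longest(cds_rows)["length_bp"])
print(shortest_named(cds_rows)["product"], shortest_named(cds_rows)["length_bp"])
```

To apply this to a downloaded file, read its contents into a string and pass that to `read_cds_rows` in place of `EXAMPLE_TSV`.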