9 Annotate the Genome
The Prokka tool (Seemann, 2014) is an annotation pipeline designed specifically for prokaryotes (hence the name…), but it will also work very well on virus genomes. Prokka produces a set of output files corresponding to accepted community standards, intended to be ready for submission of the annotated genome to a public repository.
In practice, submissions to the NCBI repository often use NCBI’s own Prokaryotic Genome Annotation Pipeline (PGAP) tool, which runs when the genome is sent to NCBI. This typically produces different annotation results to Prokka.
When reporting how you annotated your genome in a manuscript or dissertation, you should always state:
- The software tool you used, with its version number and a citation of the paper describing it (if available; provide a URL to the software if there is no paper)
- The parameters used when running the tool (if default parameters were used, state this)
- The number of features that were annotated; if there is space, providing a table of feature types (e.g. genes, coding sequences (CDS), RNA sequences, etc.) and their corresponding counts can be helpful.
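A table of feature counts like the one suggested above can be produced from a GFF3 annotation file with a few lines of Python. The sketch below is a minimal illustration rather than part of the workshop: the function name `count_feature_types` and the sample records are invented, and it assumes the standard GFF3 layout, in which the feature type is the third tab-separated column and any embedded sequence follows a `##FASTA` line.

```python
from collections import Counter


def count_feature_types(gff_text):
    """Count feature types (GFF3 column 3) in GFF3-formatted text."""
    counts = Counter()
    for line in gff_text.splitlines():
        if line.startswith("##FASTA"):
            break  # everything after this line is sequence, not annotation
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        fields = line.split("\t")
        if len(fields) >= 9:  # a GFF3 feature line has nine columns
            counts[fields[2]] += 1
    return counts


# Invented example records for illustration (not real annotation output)
SAMPLE_GFF = "\n".join([
    "##gff-version 3",
    "contig1\texample\tgene\t100\t400\t.\t+\t.\tID=EX_00001_gene",
    "contig1\texample\tCDS\t100\t400\t.\t+\t0\tID=EX_00001",
    "contig1\texample\tgene\t500\t900\t.\t+\t.\tID=EX_00002_gene",
    "contig1\texample\tCDS\t500\t900\t.\t+\t0\tID=EX_00002",
    "##FASTA",
    ">contig1",
    "ACGT",
])

feature_counts = count_feature_types(SAMPLE_GFF)
for feature_type, n in sorted(feature_counts.items()):
    print(f"{feature_type}\t{n}")
```

To use this on a real annotation, read the downloaded `.gff` file into a string and pass it to the function in place of `SAMPLE_GFF`.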
9.1 Determine Genome Taxonomy
Prokka can use a core set of reference databases for the specified genus being annotated, to improve annotation accuracy. To make the best use of Prokka we should provide the sequenced organism’s genus. To find out what this is, we will use the NCBI Taxonomy service.
- Click on the NCBI Taxonomy link to reach the NCBI Taxonomy service
- Enter `SARS-CoV-2` into the search field and click Search
- Click on the search result
- Mouseover the Lineage information to identify the genus of SARS-CoV-2
- What is the genus of SARS-CoV-2?
- How many SARS-CoV-2 genome assemblies can be found at NCBI?
- How many SARS-CoV-2 nucleotide sequences can be found at NCBI?
- SARS-CoV-2 is a Betacoronavirus
- There are 116 SARS-CoV-2 assemblies at NCBI (as of 2024-08-30)
- There are 8,850,155 SARS-CoV-2 nucleotide sequences at NCBI (as of 2024-08-30)
9.2 Annotate the Genome
Now that we know the appropriate genus for our assembly, we can use Prokka to annotate it.
- Navigate to the prokka tool using the Tools sidebar in Galaxy
- Select the prokka tool
- Make sure that the Shovill output for Contigs is selected as Contigs to annotate
- Set the Locus tag prefix to `SARSCoV2` (this will be used to help give each annotated feature a recognisable name)
- Select Yes for Force GenBank/ENA/DDJB compliance
- Enter `Betacoronavirus` under Genus name
- Enter `SARS-CoV-2` under Species name
- Enter `Wuhan 1` under Strain name
- Select Viruses from the drop-down box under Kingdom
- Select Yes for Use genus-specific BLAST database
- Click on Run Tool
Prokka can take a few minutes to run to completion.
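For readers who later run Prokka outside Galaxy, the form settings above correspond roughly to command-line options. The sketch below only builds and prints a command line; it does not run Prokka. The input file name `contigs.fasta`, the output directory `prokka_out` and the helper name `build_prokka_command` are placeholders, and the mapping from Galaxy form fields to flags is an assumption that should be checked against `prokka --help` for your installed version.

```python
import shlex


def build_prokka_command(contigs, outdir):
    """Return a Prokka command line (as a list) mirroring the Galaxy settings."""
    return [
        "prokka",
        "--outdir", outdir,            # placeholder output directory
        "--locustag", "SARSCoV2",      # Locus tag prefix
        "--compliant",                 # Force GenBank/ENA/DDJB compliance
        "--genus", "Betacoronavirus",  # Genus name
        "--species", "SARS-CoV-2",     # Species name
        "--strain", "Wuhan 1",         # Strain name
        "--kingdom", "Viruses",        # Kingdom
        "--usegenus",                  # Use genus-specific BLAST database
        contigs,                       # placeholder input FASTA of contigs
    ]


command = build_prokka_command("contigs.fasta", "prokka_out")
print(shlex.join(command))  # quoted so it can be pasted into a shell
```

Building the command as a list keeps values that contain spaces, such as the strain name, as single arguments; `shlex.join` adds the quoting needed when the command is shown as one line.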
9.3 Prokka output
Prokka produces several annotation output files, in standard bioinformatics data formats. These formats are designed to be processed unambiguously by bioinformatics software tools. The output includes:
- `.log`: a log file of what Prokka did while annotating the genome
- `.gff`: a table of annotated features
- `.tsv`: a summary table of annotated features
- `.ffn`: FASTA format nucleotide sequences of annotated features
- `.faa`: FASTA format amino acid sequences of annotated features
- `.gbk`: GenBank format file combining the genome sequence and its annotations
It can be worth spending a little time inspecting these output files to see what kinds of data they do and do not contain.
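One way to inspect the summary table programmatically is sketched below. It assumes that the `.tsv` file has a header row containing `ftype`, `length_bp` and `product` columns; check the header of your own file, as the column names here are an assumption. The rows and the function name `summarise_cds` are invented for illustration and are not real annotation results.

```python
import csv
import io

# Invented rows for illustration; the column names are an assumption
SAMPLE_TSV = (
    "locus_tag\tftype\tlength_bp\tgene\tEC_number\tCOG\tproduct\n"
    "EX_00001\tCDS\t900\t\t\t\thypothetical protein\n"
    "EX_00002\tCDS\t1500\texA\t\t\texample protein A\n"
    "EX_00003\tCDS\t300\texB\t\t\texample protein B\n"
)


def summarise_cds(tsv_text):
    """Return (longest CDS row, shortest CDS row with a named product)."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    cds_rows = [row for row in reader if row["ftype"] == "CDS"]
    longest = max(cds_rows, key=lambda row: int(row["length_bp"]))
    named = [row for row in cds_rows if row["product"] != "hypothetical protein"]
    shortest_named = min(named, key=lambda row: int(row["length_bp"]))
    return longest, shortest_named


longest, shortest_named = summarise_cds(SAMPLE_TSV)
print("Longest CDS:", longest["product"], longest["length_bp"])
print("Shortest named CDS:", shortest_named["product"], shortest_named["length_bp"])
```

Note that `max` and `min` raise an error on an empty list, so a file with no CDS rows, or with only hypothetical proteins, would need extra handling.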
Prokka output
- How many genes were annotated by Prokka?
- How many coding sequences (CDS) were annotated by Prokka?
- How many genes were identified on the reverse strand?
- What is the product of the longest annotated gene?
- How long is the shortest annotated gene that is not a “hypothetical protein”?
- What version of Prokka was used to annotate the genome?
- What are the start and end base positions of the spike protein?
- Look in the `.gff` file
- Look in the `.gff` file
- Look in the `.gff` file
- Look in the `.tsv` file
- Look in the `.tsv` file
- Look in the `.gbk` file
- Look in the `.gff` file
- Nine (9) genes were annotated
- Nine (9) CDS were annotated
- No genes are identified on the reverse strand (SARS-CoV-2 has a single-stranded, positive-sense RNA genome, so there is no reverse strand)
- Replicase polyprotein 1a is the longest gene (13218 bases)
- The shortest non-hypothetical gene is protein 7a (366 bases)
- Prokka v1.14.6
- 21568..25389