Glossary

Accession Number

An accession number is the unique identifier for an entry, such as a biological sequence record, in a database.

Note

If the same sequence is present in more than one database (e.g. GenBank, RefSeq, UniProt, Ensembl, etc.) it will usually have a different accession number in each database.

Tip

The online interface to a BLAST search may allow you to simply paste the accession number into a tool to carry out your search.

Alignment

Sequence alignment is the process of optimally matching the symbols between two biological sequences to obtain a maximum level of similarity between those sequences.

The result of that process is also called a sequence alignment. This is often used synonymously with match, even though a single match may comprise more than one alignment.

Bit Score

A sequence alignment and/or match is associated with a bit score that statistically represents the extent of similarity between the query sequence and the match/alignment, using the specified scoring scheme. A larger bit score indicates greater similarity between the sequences.

Tip

Bit scores are normalised to the scoring scheme, so searches with the same search tool against different databases can be compared by bit score (but not by E-value)

BLAST (Basic Local Alignment Search Tool)

BLAST is a family of software tools that compares one or more query biological sequences against a database of sequences, or against a file containing one or more sequences.

BLAST is also the name given to the sequence search algorithm originally implemented in the BLAST software.

Note

The original BLAST software suite was replaced by the completely rewritten BLAST+ in 2008. The underlying algorithms were changed in some cases (e.g. for BLASTN), so results obtained using the two versions can differ, and in some cases the BLAST tools no longer use the BLAST algorithm.

blastn

blastn is the BLAST suite tool for searching with a query nucleic acid sequence against a database containing nucleic acid sequences - for example when querying a 16S rDNA sequence against a genome sequence database.

blastp

blastp is the BLAST suite tool for searching with a query amino acid sequence against a database containing amino acid sequences - for example when querying an enzyme’s protein sequence against a database of all proteins made by an organism.

Coverage (percent coverage)

Coverage is the proportion of symbols in a sequence that participate in an alignment.

Note

The best alignment of two sequences may not include the full length of either the query or subject sequence. The query coverage expresses the proportion of the query sequence involved in the alignment as a percentage, and the subject coverage expresses the proportion of the subject sequence (usually a database sequence) involved in the alignment as a percentage.

Warning

The query coverage and subject coverage are not always equal.

E-value (expectation value)

The E-value for a returned match/alignment is an estimate of the number of alignments with the same bitscore or higher (i.e. alignments of equal quality or better) that would be expected if you were to search a database of the same size that contained completely random sequences.

Important

The E-value is not a probability, and E-values should not be used to compare between searches of different databases. Use the alignment bitscore to compare results between databases.

FASTA Format

FASTA is a common file format used to describe sequence data. The FASTA record for a sequence has a header line that begins with a right-angled bracket (>), e.g: >NC_000913.3:223903-225030. The corresponding sequence begins on the following line.

>MCHU - Calmodulin - Human, rabbit, bovine, rat, and chicken
MADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGNGTID
FPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNLGEKLTDEEVDEMIREA
DIDGDGQVNYEEFVQMMTAK*

Homology

Homology is the state of being related by descent from a common ancestor. Two sequences are homologous if they share a common ancestor.

Important

There are not degrees of homology: sequences are either homologous or not homologous.

Identity (percentage identity)

Sequence identity is the proportion of symbols (nucleotides or amino acids) that are identical and aligned in both sequences of an alignment or match.

Warning

Note that an alignment or match may not extend for the full length of either the query or subject sequence, so it is not always useful to assume that higher sequence identity implies a better sequence match. See coverage.

Match

A match is a sequence returned from the database being searched that has appreciable sequence similarity with the query sequence. “Match” is often used synonymously with alignment, though a single match may comprise more than one alignment.

Query Sequence

The biological sequence used as a search term.

Scoring Scheme

The scoring scheme for a database search is a quantitative representation of how the search tool rates similarities between two sequences.

Note

The scoring scheme is often a substitution scoring matrix of values indicating how likely we expect one symbol (e.g. amino acid) to be substituted by a different symbol in a sequence alignment. These values are often derived from observations of substitutions in blocks of aligned sequences.

Tip

The choice of scoring scheme affects the alignment, bit score, and E-value. Some scoring schemes are suited to finding more distant, or more recent, relationships between sequences.

Subject Sequence

A subject sequence is a specific match returned by the database search.