Online Resources – BM432 Project Pages

Computational Biology is unusually accessible as an applied science in part because so much can be done by an individual on modest hardware without access to a laboratory or computing cluster. All you need to bring is your brain.

A large part of the reason for the accessibility of the topic is the sustained drive for Open Science practised by bioinformaticians, computational biologists, and other scientists. They have encouraged, and sometimes demanded, open, free, FAIR (findable, accessible, interoperable, reusable) data, which has benefited us all.

This page lists some of the incredibly valuable, open data resources that might be of use to you in your project. It is not an exhaustive list.

1 Sequence data repositories (including annotated genome data)

NCBI - the repository of record for many datasets, not just sequence data
- Assembly - assembled genomes and associated metadata
- GenBank - all publicly available DNA sequences
- Nucleotide - aggregated data from GenBank, RefSeq, and elsewhere
- RefSeq - curated, non-redundant genomic DNA, transcript, and protein sequences
- SRA - sequencing read data
UniProt - protein sequence and annotation data
Ensembl - vertebrate genome data
- Ensembl Bacteria - bacterial genome data
- Ensembl Fungi - fungal genome data
- Ensembl Plants - plant genome data
- Ensembl Protists - protist genome data
InterPro - protein families and sequence domains
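
Several of the repositories above can be queried programmatically as well as through a browser. The following is a minimal sketch, not an official client: it assumes NCBI's E-utilities `efetch` endpoint and UniProt's REST endpoint at `rest.uniprot.org`, and the function names (`efetch_fasta_url`, `uniprot_fasta_url`, `fetch_text`) and the example accessions are illustrative choices, not part of either service.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URLs are assumptions taken from the services' public documentation.
EUTILS_EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
UNIPROT_REST = "https://rest.uniprot.org/uniprotkb"


def efetch_fasta_url(accession: str, db: str = "nucleotide") -> str:
    """Build an efetch URL asking NCBI for one record as plain-text FASTA."""
    params = {"db": db, "id": accession, "rettype": "fasta", "retmode": "text"}
    return f"{EUTILS_EFETCH}?{urlencode(params)}"


def uniprot_fasta_url(accession: str) -> str:
    """Build a UniProt REST URL for one UniProtKB entry in FASTA format."""
    return f"{UNIPROT_REST}/{accession}.fasta"


def fetch_text(url: str, timeout: float = 30.0) -> str:
    """Download a URL and return its body as text (requires network access)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Example usage (not run here because it needs a network connection):
#   fasta = fetch_text(efetch_fasta_url("NC_000913.3"))  # E. coli K-12 MG1655
#   fasta = fetch_text(uniprot_fasta_url("P69905"))      # human haemoglobin alpha
```

Building the URL separately from downloading makes it easy to check what will be requested before sending anything. If you plan to make many requests, check each service's usage guidelines first, as both publish limits on request rates.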

2 Structural data repositories

RCSB-PDB - the repository of record for biomolecular structure data
EMBL AlphaFold - EMBL’s AlphaFold predictions for multiple organisms
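
Structure files can also be downloaded by identifier. The sketch below assumes the RCSB file-download service at `files.rcsb.org/download` and the classic four-character PDB ID format (a digit from 1 to 9 followed by three letters or digits); the function name `rcsb_download_url` is hypothetical, and longer extended IDs are not handled.

```python
import re

# Assumed base URL for RCSB's file download service.
RCSB_DOWNLOAD = "https://files.rcsb.org/download"

# Classic PDB IDs: one digit (1-9) followed by three alphanumeric characters.
_PDB_ID = re.compile(r"^[1-9][A-Za-z0-9]{3}$")


def rcsb_download_url(pdb_id: str, fmt: str = "cif") -> str:
    """Return a download URL for a PDB entry in mmCIF ("cif") or PDB ("pdb") format."""
    if not _PDB_ID.match(pdb_id):
        raise ValueError(f"not a classic four-character PDB ID: {pdb_id!r}")
    if fmt not in ("cif", "pdb"):
        raise ValueError(f"unsupported format: {fmt!r}")
    return f"{RCSB_DOWNLOAD}/{pdb_id.upper()}.{fmt}"


# Example usage: URL for crambin (entry 1CRN) in legacy PDB format.
#   rcsb_download_url("1crn", fmt="pdb")
```

Validating the identifier before building the URL catches typos early, before any request is made.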

3 Transcriptome data repositories

GEO - transcriptome experiment (microarray, RNAseq, etc.) data
HTCA - human transcriptome cell atlas

4 Molecular interaction databases

STRING - known and predicted interactions
BioGrid - curated interactions and post-translational modifications
IntAct - EMBL-EBI’s database of interactions

5 Biological models

BioModels - mathematical models of biological systems

6 Specialised functional databases

PHI-Base - curated database of pathogen-host interactions
CAZy - curated database of carbohydrate-active enzymes

7 Taxonomic and other classification resources

NCBI Taxonomy
- Widely used, but not as widely trusted, because it is often at odds with other classification databases - LP
GTDB
- Excellent genome-based microbial taxonomy and classification database and resource - LP
genomeRxiv
- Genome-based, taxonomy-independent classification. I work on this - LP
Enterobase
- The central resource for enteric bacteria genomic variation and classification - LP
PhytoBacExplorer
- Like Enterobase, but for plant pathogenic bacteria - LP