This page lists some data resources that you may find useful while working on your capstone project.
We have attempted to include the common databases and resources you are likely to use, but this is not an exhaustive list of all the public databases that exist in the biomedical sciences.
Please send any suggestions for additional databases that should be included here, or report any broken links, to Dr. Morgan Feeney¹.

1 Multi-organism databases

BioCyc
- BioCyc integrates genome data with a comprehensive body of additional data including metabolic reconstructions, regulatory networks, protein features, orthologs, gene essentiality, and atom mappings.
BioModels
- BioModels is the repository of record for mathematical models of biological and biomedical systems. Many of the models in this database are manually validated and tested for reproducibility.
BRENDA
- BRENDA integrates functional and sequence data for enzymes.
GenBank
- GenBank is the NIH genetic sequence database, the reference of record for nucleotide sequences. Deposition of new sequence data (including genes, genomes, and metagenomes) into this database is usually a requirement for publication.
Genomes OnLine Database (GOLD)
- Comprehensive access to information regarding genome and metagenome sequencing projects, and their associated metadata, around the world.
Ensembl Genome Browser
- A genome browser for vertebrate genomes that supports research in comparative genomics, evolution, sequence variation and transcriptional regulation.
Ensembl Genomes
- The equivalent of Ensembl Genome Browser for non-vertebrate genomes, including plants, bacteria, fungi and protists.
IMG/M
- Supports the annotation, analysis and distribution of microbial genome and microbiome datasets sequenced at DOE’s Joint Genome Institute (JGI).
Kyoto Encyclopedia of Genes and Genomes (KEGG)
- A database resource that integrates pathway, sequence and enzyme function data into high-level overviews of biological systems, such as metabolic pathways, the cell, the organism and the ecosystem.
MicrobesOnline
- A comparative functional genomic database and workbench with tools for phylogenetic analysis and annotation, functional data storage/display/analysis, metabolic analysis and metafunctional genomics.
NCBI Databases
- NCBI provides several databases, including SRA, GEO, and GenBank (SRA and GenBank are listed separately here). In addition to raw sequence and transcriptome data, NCBI provides metadata describing biological samples, research projects, and publications (PMC), all of which are integrated and cross-referenced to each other.
Pfam
- A large collection of protein domains and families, each represented by multiple sequence alignments and hidden Markov models (HMMs), that is a central reference for protein functional annotation.
RCSB Protein Data Bank (PDB)
- The repository of record for 3D structure data of large biological molecules (proteins, DNA, RNA, and complex assemblies). Deposition of structural data in this database is usually a requirement for publication.
Rfam
- A resource that is analogous to Pfam for RNA sequences. A large collection of RNA families, each represented by multiple sequence alignments, consensus secondary structures and covariance models.
STRING
- A database of known and predicted protein-protein interactions, and networks of those interactions.
Sequence Read Archive (SRA)
- The repository of record for high-throughput sequencing (sequence read) data. Deposition of read data in this database is usually a requirement for publication.
UniProt
- A comprehensive, high-quality and freely accessible resource of protein sequence and functional information. It comprises four databases; the two key resources are Swiss-Prot, which is manually annotated and reviewed, and TrEMBL, which is automatically annotated and not reviewed.
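
The Swiss-Prot/TrEMBL distinction matters in practice when you filter downloaded sequences. As a minimal sketch, assuming you are working with UniProtKB FASTA files (whose header lines begin with `>sp|` for Swiss-Prot entries and `>tr|` for TrEMBL entries), the review status can be read from the header prefix. The function name `uniprot_review_status` and the example headers are hypothetical placeholders made up for illustration, not real UniProt records.

```python
def uniprot_review_status(fasta_header: str) -> str:
    """Classify a UniProtKB FASTA header line by its database prefix.

    Assumes the UniProtKB header layout ">db|accession|entry_name ...",
    where db is "sp" (Swiss-Prot) or "tr" (TrEMBL).
    """
    # Drop the leading ">" and take the text before the first "|".
    prefix = fasta_header.lstrip(">").split("|", 1)[0]
    labels = {
        "sp": "Swiss-Prot (reviewed)",
        "tr": "TrEMBL (unreviewed)",
    }
    # Anything else (e.g. a FASTA file from another source) is reported as unknown.
    return labels.get(prefix, "unknown")


# Placeholder headers for illustration only; these are not real accessions.
example_headers = [
    ">sp|P00000|EXAMPLE_ECOLI Example reviewed protein",
    ">tr|A0A000000A0|A0A000000A0_ECOLI Example unreviewed protein",
    ">seq1 a header from some other source",
]

for header in example_headers:
    print(uniprot_review_status(header), "<-", header)
```

Filtering on the header prefix is only a convenience for files you already have; the UniProt website also lets you restrict a search to reviewed entries before downloading.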

2 Organism-specific databases

EcoCyc:
- Database for the bacterium Escherichia coli K-12 MG1655
FlyBase:
- Database of Drosophila Genes & Genomes
Mouse Genome Informatics (MGI):
- International database resource for the laboratory mouse, providing integrated genetic, genomic, and biological data to facilitate the study of human health and disease.
Mycobrowser:
- A comprehensive genomic and proteomic data repository for pathogenic mycobacteria
PomBase:
- Comprehensive database for the fission yeast Schizosaccharomyces pombe, providing structural and functional annotation, literature curation and access to large-scale data sets
pseudomonas.com:
- A repository for Pseudomonas genomes and functional annotations.
Saccharomyces Genome Database (SGD):
- Comprehensive integrated biological information for the budding yeast Saccharomyces cerevisiae along with search and analysis tools to explore these data
StrepDB:
- The Streptomyces Annotation Server
The Arabidopsis Information Resource (TAIR):
- Genetic and molecular biology data for the model higher plant Arabidopsis thaliana
WormBase:
- Accurate, current, accessible information concerning the genetics, genomics and biology of C. elegans and related nematodes
Xenbase:
- Resource that integrates all the diverse biological, genomic, genotype and phenotype data available from Xenopus research
ZFIN:
- Database of genetic and genomic data for the zebrafish (Danio rerio) as a model organism

3 Topic-specific databases

ChEMBL:
- ChEMBL is a manually curated database of bioactive molecules with drug-like properties that integrates chemical, bioactivity and genomic data.
Codon Usage Database:
- Database of codon usage tabulated from GenBank.
Comprehensive Antibiotic Resistance Database (CARD):
- Bioinformatic database of resistance genes, their products and associated phenotypes.
MetaCyc:
- A curated database of experimentally elucidated metabolic pathways from all domains of life.
PHI-base:
- A curated database describing molecular and biological information on genes proven to affect the outcome of pathogen-host interactions.
The Restriction Enzyme Database (REBASE):
- A dynamic, curated database of restriction enzymes and related proteins

4 Other data resources

Some authors deposit experimental or other supplementary data for their publications at generic data repositories that are not, strictly speaking, databases. This practice is increasingly common as funders insist on Open Science practices. Some examples of these resources are listed below.

Figshare
- Figshare provides a repository for papers, FAIR data and non-traditional research outputs. It can assign a DOI to each deposited item.
GitHub
- GitHub offers distributed version control and source code management, plus a range of features such as bug tracking, feature requests, task management, continuous integration and wikis. Most open source scientific software is hosted on GitHub. Many researchers use GitHub as a repository for supplementary data, with Zenodo as the DOI issuing authority.
osf.io
- OSF is a free, open source project management tool that supports researchers throughout their entire project lifecycle. Many researchers use OSF as a long-term data repository, as an electronic lab notebook, or as the collaboration tool for their team’s research; the repository may then become supplementary information in a publication.
Zenodo
- Zenodo, developed at CERN as a catch-all repository for research data, provides a range of services for open science, including issuing DOIs for research items. Many researchers use Zenodo to host preprints and to archive public datasets and software projects.


Looking for a database not listed here? Try looking in the NAR database list or the Online Bioinformatics Resources Collection (OBRC).

¹ Or, if you’re feeling ambitious, make a pull request against the course material repository.

Links to Key Public Data Resources

Morgan Feeney, Leighton Pritchard

2021 Presentation
