Biological databases are libraries of life sciences information, collected from scientific experiments, published literature, high-throughput experiment technology, and computational analysis. A biological database is a large, organized body of persistent data, usually associated with computerized software designed to update, query, and retrieve components of the data stored within the system.
For researchers to benefit from the data stored in a database, two additional requirements must be met:
Easy access to the information
A method for extracting only that information needed to answer a specific biological question.
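The two requirements above can be illustrated with a small sketch. It assumes a toy in-memory SQLite table; the table layout, gene names, organisms, and roles are hypothetical placeholders, not records from any real database. The query returns only the names that answer one specific question.

```python
import sqlite3

# Hypothetical example records: (gene name, organism, role). Placeholders only.
RECORDS = [
    ("geneA", "Homo sapiens", "kinase"),
    ("geneB", "Mus musculus", "kinase"),
    ("geneC", "Homo sapiens", "transporter"),
]

def build_database(records):
    """Create an in-memory database and load the example records."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE genes (name TEXT PRIMARY KEY, organism TEXT, role TEXT)"
    )
    conn.executemany("INSERT INTO genes VALUES (?, ?, ?)", records)
    conn.commit()
    return conn

def genes_with_role(conn, organism, role):
    """Return only the gene names matching one organism and one role."""
    rows = conn.execute(
        "SELECT name FROM genes WHERE organism = ? AND role = ? ORDER BY name",
        (organism, role),
    )
    return [name for (name,) in rows]

conn = build_database(RECORDS)
print(genes_with_role(conn, "Homo sapiens", "kinase"))  # prints ['geneA']
```

Real biological databases expose their own query interfaces; the point here is only that a structured query retrieves the relevant subset rather than the whole collection.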
According to the 2014 Molecular Biology Database Collection report in the journal Nucleic Acids Research, a total of 1,552 databases are publicly accessible online.
They contain information from research areas including genomics, proteomics, metabolomics, microarray gene expression, and phylogenetics. Information contained in biological databases includes gene function, structure, localization (both cellular and chromosomal), clinical effects of mutations as well as similarities of biological sequences and structures.
Primary databases are also called archival databases.
They are populated with experimentally derived data such as nucleotide sequences, protein sequences or macromolecular structures. Experimental results are submitted directly into the database by researchers, and the data are essentially archival in nature. Once given a database accession number, the data in primary databases are never changed; they form part of the scientific record.
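The archival rule described above can be sketched as an append-only store. This is a toy model that takes the rule as stated; the class name, the accession format (a prefix plus a five-digit counter), and the example sequence are assumptions made for illustration and do not reflect how any particular archive assigns accessions.

```python
class PrimaryArchive:
    """Toy append-only store: a record is fixed once it has an accession number."""

    def __init__(self, prefix="X"):
        self._prefix = prefix   # hypothetical accession prefix
        self._records = {}      # accession number -> submitted data

    def submit(self, data):
        """Store a new submission and return its newly assigned accession."""
        accession = f"{self._prefix}{len(self._records) + 1:05d}"
        self._records[accession] = data
        return accession

    def fetch(self, accession):
        """Retrieve the record exactly as it was submitted."""
        return self._records[accession]

    def update(self, accession, data):
        """Refuse in-place changes, mirroring the archival rule."""
        raise PermissionError(f"{accession} is archival and cannot be modified")

archive = PrimaryArchive()
acc = archive.submit("ATGGCCATTGTAATGGGCCGC")  # placeholder sequence
print(acc, archive.fetch(acc))  # prints X00001 ATGGCCATTGTAATGGGCCGC
```

In this sketch a correction can only be made by submitting a new record, which receives its own accession number.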
Examples
ENA, GenBank and DDBJ
ArrayExpress Archive and GEO
Protein Data Bank (PDB)
Swiss-Prot and PIR for protein sequences
Secondary databases have data derived from the results of analysing primary data. They often draw upon information from many sources, including other databases (primary and secondary), controlled vocabularies and the scientific literature. They are highly curated, often using a complex combination of computational algorithms and manual analysis and interpretation to derive new knowledge from the public record of science.
Examples
InterPro
UniProt Knowledgebase
Ensembl
SCOP
CATH
PROSITE
eMOTIF
TrEMBL
Species-specific databases are available for some species, mainly those often used in research (model organisms):
EcoCyc
Mouse Genome Informatics for the laboratory mouse
the Rat Genome Database for Rattus
ZFIN for Danio rerio (zebrafish)
PomBase
FlyBase
WormBase
Xenbase
There are also specialized databases that cater to a particular research interest. For example, FlyBase, the HIV sequence database, and the Ribosomal Database Project specialize in a particular organism or a particular type of data.
Hybrid Databases
Many data resources have both primary and secondary characteristics. For example, UniProt accepts primary sequences derived from peptide sequencing experiments. However, UniProt also infers peptide sequences from genomic information, and it provides a wealth of additional information, some derived from automated annotation (TrEMBL) and even more from careful manual analysis (Swiss-Prot).
Biological databases can be broadly classified into sequence, structure and functional databases. Sequence databases store nucleic acid and protein sequences, while structure databases store solved structures of RNA and proteins. Functional databases provide information on the physiological role of gene products, for example enzyme activities, mutant phenotypes, or biological pathways.
Main sequence databases:
NCBI
EMBL
Main protein databases:
UniProt
PDB
MMDB
Entrez Protein
Genome databases:
Ensembl (human, mouse and others)
SGD (yeast)
TAIR (Arabidopsis)
Bibliography:
PubMed
Web of Science
Human diseases:
OMIM
Metabolic pathways:
KEGG
REGG
Phenotype databases:
PHI-base
RGD
PomBase
RNA databases:
miRBase
Rfam
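
The lists above amount to a mapping from category to resources. A minimal sketch of that mapping as a lookup table follows; it includes only a subset of the entries, and the names `CATALOG` and `categories_for` are hypothetical helpers introduced for this illustration.

```python
# Subset of the resource lists above, keyed by category.
CATALOG = {
    "Main sequence databases": ["NCBI", "EMBL"],
    "Main protein databases": ["UniProt", "PDB", "MMDB", "Entrez Protein"],
    "Genome databases": ["Ensembl", "SGD", "TAIR"],
    "Human diseases": ["OMIM"],
    "Phenotype databases": ["PHI-base", "RGD", "PomBase"],
    "RNA databases": ["miRBase", "Rfam"],
}

def categories_for(resource, catalog=CATALOG):
    """Return every category whose list contains the resource (case-insensitive)."""
    wanted = resource.lower()
    return [
        category
        for category, names in catalog.items()
        if any(name.lower() == wanted for name in names)
    ]

print(categories_for("pdb"))      # prints ['Main protein databases']
print(categories_for("GenBank"))  # prints [] because GenBank is not in this subset
```

Returning a list rather than a single category allows for resources that belong to more than one group.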