In response to a need for a general catalog of genome variation to address the large-scale sampling designs required by association studies, gene mapping and evolutionary biology, the National Center for Biotechnology Information (NCBI) has established the dbSNP database. Submissions to dbSNP will be integrated with other sources of information at NCBI such as GenBank, PubMed, LocusLink and the Human Genome Project data.
The National Center for Biotechnology Information (NCBI) creates and maintains a set of databases that archive, process, display and report information related to human germline and somatic variants. One of these databases, the Database of Short Genetic Variations (dbSNP), represents almost 2 billion submitted human variants. Its primary role is to process submissions, archive the data, annotate variants on the genome and on NCBI Reference Sequences (RefSeqs), and distribute the data worldwide. The data are important for studying the basis of human diseases to improve diagnosis, treatment, and prevention, and for research in a variety of fields such as species diversity, evolution, and conservation. Submissions are accepted in certain formats, including VCF, for reporting many variations generated by high-throughput sequencing (HTS) projects over multiple populations, as well as a wide variety of associated data including genotype and allele frequency data. Each submitted variant is assigned a database identifier (ss# in dbSNP or nsv#/esv# in dbVar) that can be cited in publications, allows cross-referencing to other databases and linking to related data, facilitates annotation, and promotes data exchange. These submissions are then processed to aggregate information from multiple submitters (rs# in dbSNP), to calculate locations and functional consequences on RefSeqs, and to integrate with other NCBI resources including Gene, PubMed, Nucleotide, Protein, and Genome. dbSNP data are updated during regular build cycles with annotations on new assemblies and RefSeqs, and the data are distributed in diverse ways.
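The identifier prefixes described above (ss, rs, nsv, esv) can be told apart mechanically. The following is a minimal sketch, assuming identifiers are written as a prefix followed by digits; the function name `classify_variant_id` and the label strings are illustrative and are not part of any NCBI tool.

```python
import re

# Prefix -> (database, kind), following the description in the text.
# The label strings are illustrative, not official NCBI terminology.
_PREFIXES = {
    "ss": ("dbSNP", "submitted variant"),
    "rs": ("dbSNP", "reference (aggregated) variant"),
    "nsv": ("dbVar", "submitted variant"),
    "esv": ("dbVar", "submitted variant"),
}

# Assumed format: one of the known prefixes followed only by digits.
_ID_PATTERN = re.compile(r"^(ss|rs|nsv|esv)(\d+)$")


def classify_variant_id(identifier):
    """Return (database, kind, number), or None if the format is not recognized."""
    match = _ID_PATTERN.match(identifier.strip().lower())
    if match is None:
        return None
    prefix, number = match.groups()
    database, kind = _PREFIXES[prefix]
    return database, kind, int(number)
```

A caller might use this to route identifiers found in a publication to the appropriate database before looking them up.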
dbSNP contains population-specific frequency and genotype data, experimental conditions, molecular context, and mapping information for both neutral variations and clinical mutations.
dbSNP includes disease-causing clinical mutations as well as neutral polymorphisms. The database links variations (polymorphisms and clinical mutations) to NCBI sequence resources via BLAST and E-PCR analysis. It also facilitates searches along five major axes of information: (i) sequence location, (ii) function, (iii) cross-species homology, (iv) SNP quality or validation status and (v) degree of heterozygosity (degree of population variation).
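To make the five search axes concrete, the sketch below models a variant record with one field per axis and filters a list of records on any combination of them. This is only an illustration: the class `VariantRecord`, its field names, and the `search` function are hypothetical and do not reflect the actual dbSNP schema or query interface.

```python
from dataclasses import dataclass


@dataclass
class VariantRecord:
    # All field names are hypothetical; each mirrors one of the five axes.
    identifier: str
    chromosome: str                   # (i) sequence location
    position: int                     # (i) sequence location
    function_class: str               # (ii) function, e.g. "coding"
    has_cross_species_homolog: bool   # (iii) cross-species homology
    validated: bool                   # (iv) quality or validation status
    heterozygosity: float             # (v) degree of heterozygosity


def search(records, chromosome=None, start=None, end=None,
           function_class=None, has_cross_species_homolog=None,
           validated=None, min_heterozygosity=None):
    """Return the records that match every criterion that is not None."""
    hits = []
    for rec in records:
        if chromosome is not None and rec.chromosome != chromosome:
            continue
        if start is not None and rec.position < start:
            continue
        if end is not None and rec.position > end:
            continue
        if function_class is not None and rec.function_class != function_class:
            continue
        if (has_cross_species_homolog is not None
                and rec.has_cross_species_homolog != has_cross_species_homolog):
            continue
        if validated is not None and rec.validated != validated:
            continue
        if (min_heterozygosity is not None
                and rec.heterozygosity < min_heterozygosity):
            continue
        hits.append(rec)
    return hits
```

For example, `search(records, chromosome="1", validated=True)` would combine the location and validation axes, leaving the other three unconstrained.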
dbSNP currently classifies nucleotide sequence variations with the following types and percentage composition of the database:
(i) single nucleotide substitutions, 99.77%
(ii) small insertion/deletion polymorphisms, 0.21%
(iii) invariant regions of sequence, 0.02%
(iv) microsatellite repeats, 0.001%
(v) named variants, <0.001%
(vi) uncharacterized heterozygous assays, <0.001%
The current level of activity in the discovery of general sequence variation suggests that SNP markers with unknown selective effects will be the majority of submitted records. Although most submissions are currently for Homo sapiens, dbSNP already has submissions for Mus musculus, and in general the database can accept variation information from any species and from any part of a particular genome. dbSNP is currently integrated with other large public variation databases such as the NCI CGAP-GAI database of EST-derived SNPs, the TSC (The SNP Consortium, Ltd) variation initiative and HGBASE.
dbSNP links variations (polymorphisms and clinical mutations) to other NCBI sequence resources through BLAST and E-PCR analysis of the flanking sequence that immediately surrounds the variation. Links to the literature databases are made with the citation information provided at submission time. This integration process makes dbSNP part of the NCBI ‘discovery space’.
The dbSNP Reference SNP (rs or RefSNP) number is a locus accession for a variant type assigned by dbSNP. The RefSNP catalog is a non-redundant collection of submitted variants that have been clustered, integrated and annotated.
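The idea of collapsing redundant submissions into a non-redundant catalog can be sketched as a simple grouping step. The example below assumes that two submissions belong to the same cluster when they share a sequence, a position and a variant type; this grouping key is an assumption made for illustration, and the actual dbSNP clustering rules are more involved. The function name `cluster_submissions` and the tuple layout are hypothetical.

```python
from collections import defaultdict


def cluster_submissions(submissions):
    """Group submitted variants that share a location and variant type.

    `submissions` is an iterable of (ss_id, sequence, position, variant_type)
    tuples. Returns a dict mapping each (sequence, position, variant_type)
    key to the sorted list of ss_ids in that cluster.
    """
    clusters = defaultdict(list)
    for ss_id, sequence, position, variant_type in submissions:
        # Assumed clustering key; not the real dbSNP rule.
        clusters[(sequence, position, variant_type)].append(ss_id)
    return {key: sorted(ids) for key, ids in clusters.items()}
```

Each resulting key would correspond to one non-redundant entry, with the list of member submissions retained so that the contribution of every submitter remains traceable.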