NCBI's Gene database aggregates gene-specific information from multiple perspectives, including sequence, mapping, publications, function of products, expression, evolution, consequences of variation, and links to genome, phenotype, and locus-specific resources. Gene centralizes gene-related information into individual records and makes these data available for diverse scenarios, from occasional interactive access on the Web to computational access to selected or complete data sets.
Gene assigns an identifier, the GeneID, to each gene in each taxon that is either represented in the NCBI Reference Sequence (RefSeq) project or under consideration by RefSeq. Usually this taxon is defined at the species level, but sometimes it is defined per isolate, strain, or cultivar. Gene is closely coupled with RefSeq, in that genes annotated on RefSeq sequences are assigned GeneIDs for tracking. Not all records in Gene, however, are based on RefSeqs. Gene works closely with multiple groups that may identify a gene before it has been defined by sequence. In other words, some records in Gene are mapped traits or other phenotypes.
The database currently known as Gene was first made public in 1999 as LocusLink. Only one species, human, was represented, with little more than 9,000 records. The Web interface supported links only to dbSNP, OMIM, RefSeq, GenBank, and UniGene within NCBI, as well as to the now defunct Genome Database (GDB) and a few other databases externally. By late 2003, when Entrez Gene was released, there were 10 species, almost 195,000 records, and links computed to dbSNP, Ensembl, the HUGO Gene Nomenclature Committee (HGNC), GEO, Map Viewer, Mammalian Gene Collection (MGC), Nucleotide, Protein, PubMed, Taxonomy, UCSC, UniSTS, UniGene, and multiple species-specific model organism databases. Now Gene represents more than 11,000 taxa, more than 13,000,000 records, and more than 40 types of links to other NCBI databases.
Gene has a simple data model. Once the concept of a gene is defined by sequence or mapped location, it is assigned a unique integer identifier or GeneID. Then data of particular types are connected to that identifier. These types include sequence accessions, names, summary descriptions, genomic locations, terms from the Gene Ontology Consortium, interactions, related phenotypes, and summaries of orthology. For some of the commonly requested elements, and because of the simplicity of the data model, Gene provides tab-delimited files of content anchored on the GeneID.
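Because every data type hangs off the GeneID, a tab-delimited export can be loaded into a dictionary keyed on that integer and joined with other GeneID-anchored tables. The following is a minimal sketch of that idea, not a description of any specific Gene file: the column names, the sample rows, and the helper names `load_by_gene_id` and `attach_accessions` are assumptions made for illustration.

```python
import csv
import io

# Hypothetical sample data in the style of GeneID-anchored, tab-delimited files.
# Column names and rows are illustrative assumptions, not real Gene content.
GENE_TABLE = (
    "#tax_id\tGeneID\tSymbol\ttype_of_gene\n"
    "9606\t1\tGENE_A\tprotein-coding\n"
    "9606\t2\tGENE_B\tpseudo\n"
    "10090\t3\tGENE_C\tncRNA\n"
)
ACCESSION_TABLE = (
    "#GeneID\taccession\n"
    "1\tACC_0001\n"
    "1\tACC_0002\n"
    "3\tACC_0003\n"
)


def read_rows(handle):
    """Yield one dict per data row, using the '#'-prefixed first line as the header."""
    header = handle.readline().lstrip("#").rstrip("\n").split("\t")
    reader = csv.DictReader(
        handle, fieldnames=header, delimiter="\t", quoting=csv.QUOTE_NONE
    )
    yield from reader


def load_by_gene_id(handle):
    """Return {GeneID (int): row dict}; the GeneID is the anchor for all other data."""
    return {int(row["GeneID"]): row for row in read_rows(handle)}


def attach_accessions(genes, handle):
    """Attach a list of accessions to each gene record, matching on GeneID."""
    for record in genes.values():
        record["accessions"] = []
    for row in read_rows(handle):
        gene_id = int(row["GeneID"])
        if gene_id in genes:  # ignore accessions for GeneIDs we did not load
            genes[gene_id]["accessions"].append(row["accession"])
    return genes


genes = load_by_gene_id(io.StringIO(GENE_TABLE))
genes = attach_accessions(genes, io.StringIO(ACCESSION_TABLE))
```

Here `genes[1]` carries both the descriptive columns and its list of accessions, while `genes[2]` has an empty list because no accession row refers to it. Swapping the `io.StringIO` objects for open file handles would apply the same join to downloaded files, provided their actual column names are substituted for the assumed ones.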
The content of Gene is derived from both automated dataflows and curation by RefSeq staff. The starting point is typically the extraction of gene-specific information from a publicly available, annotated genome sequence. The gene is assigned a category (e.g., protein coding, non-coding RNA (ncRNA), pseudogene, ribosomal RNA (rRNA), unknown). Value is then added by connecting the information captured from each gene feature with information from collaborating databases, public users, and literature review, in particular Gene References into Function (GeneRIFs). When new information is available from any of these sources, the record is updated. Most updates to Gene are processed daily.