The Reference Sequence (RefSeq) database collection provides a comprehensive, integrated, non-redundant, well-annotated set of sequences, including genomic DNA, transcripts, and proteins. RefSeq sequences form a foundation for medical, functional, and diversity studies. They provide a stable reference for genome annotation, gene identification and characterization, mutation and polymorphism analysis (especially RefSeqGene records), expression studies, and comparative analyses.
RefSeq genomes are copies of selected assembled genomes available in GenBank. RefSeq transcript and protein records are generated by several processes including:
Eukaryotic Genome Annotation Pipeline
Prokaryotic Genome Annotation Pipeline
Propagation from annotated genomes that are submitted to members of the International Nucleotide Sequence Database Collaboration (INSDC)
NCBI provides RefSeqs for taxonomically diverse organisms including archaea, bacteria, eukaryotes, and viruses. Reference sequences are provided for genomes, transcripts, and proteins. Some targeted loci projects are also included in RefSeq.
New or updated records are added to the collection as data become publicly available.
RefSeq is accessible via BLAST, Entrez, and the NCBI FTP site (RefSeq releases and RefSeq Genomes). Information is also available in NCBI's Assembly, Genomes, and Gene resources, and for some organisms additional information is available in NCBI's genome browser, Map Viewer. Special properties have been defined to facilitate Entrez-based retrieval.
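As an illustration of Entrez-based retrieval, the following minimal sketch builds an E-utilities ESearch URL that limits results to RefSeq records. It assumes the `srcdb_refseq[PROP]` property filter and the `nuccore` nucleotide database; the function name `refseq_esearch_url` and its parameters are hypothetical and are not part of any NCBI library. The sketch only constructs the URL and does not contact NCBI.

```python
from urllib.parse import urlencode

# Base address of the E-utilities ESearch service.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def refseq_esearch_url(organism: str, db: str = "nuccore", retmax: int = 20) -> str:
    """Build an ESearch URL restricted to RefSeq records for one organism.

    Hypothetical helper for illustration only.
    """
    # Assumption: srcdb_refseq[PROP] is the Entrez property that limits
    # a search to records in the RefSeq collection.
    term = f"{organism}[Organism] AND srcdb_refseq[PROP]"
    query = urlencode({"db": db, "term": term, "retmax": retmax, "retmode": "json"})
    return f"{EUTILS_ESEARCH}?{query}"


if __name__ == "__main__":
    # Print the URL; fetching it is left to the caller.
    print(refseq_esearch_url("Saccharomyces cerevisiae"))
```

The same search term can be pasted into the Entrez search box, so the URL form is useful mainly for scripted retrieval.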
The main features of the RefSeq collection include:
explicitly linked nucleotide and protein sequences
updates to reflect current knowledge of sequence data and biology
data validation and format consistency
distinct accession series
ongoing curation by NCBI staff and collaborators, with reviewed records indicated
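The distinct accession series can be recognized programmatically. The sketch below is an illustration under stated assumptions: it assumes the common RefSeq accession layout of a two-letter prefix, an underscore, digits, and an optional version suffix (for example `NM_000546.6`). The prefix table covers only a few widely used prefixes and comes from general knowledge of RefSeq rather than from this page. The names `parse_refseq_accession`, `RefSeqAccession`, and `PREFIX_INFO` are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# Two uppercase letters, underscore, digits, optional ".version".
# Assumption: this simple pattern does not cover every RefSeq series
# (for example, accessions with letters after the underscore).
ACCESSION_RE = re.compile(r"^([A-Z]{2})_(\d+)(?:\.(\d+))?$")

# Partial prefix table: (molecule type, is a predicted model).
PREFIX_INFO = {
    "NC": ("genomic", False),
    "NG": ("genomic", False),
    "NM": ("mRNA", False),
    "NR": ("ncRNA", False),
    "NP": ("protein", False),
    "XM": ("mRNA", True),
    "XR": ("ncRNA", True),
    "XP": ("protein", True),
}


class RefSeqAccession(NamedTuple):
    prefix: str
    number: str
    version: Optional[int]
    molecule: Optional[str]
    is_model: Optional[bool]


def parse_refseq_accession(text: str) -> Optional[RefSeqAccession]:
    """Split an accession into its parts, or return None if it does not match."""
    match = ACCESSION_RE.match(text.strip())
    if match is None:
        return None
    prefix, number, version = match.groups()
    # Prefixes missing from the partial table yield (None, None).
    molecule, is_model = PREFIX_INFO.get(prefix, (None, None))
    return RefSeqAccession(
        prefix, number, int(version) if version else None, molecule, is_model
    )


if __name__ == "__main__":
    print(parse_refseq_accession("NM_000546.6"))
    print(parse_refseq_accession("XP_011520000.1"))
```

Because the underscore in the third position sets RefSeq accessions apart from GenBank-style accessions, a string such as `U12345` returns `None` here.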
RefSeq records are derived from publicly available sequence data; varying levels of validation, additional annotation, and manual curation are applied to the RefSeq record. NCBI Reference Sequences are provided through the separate processes described below.
This page provides a brief overview of the RefSeq production processes.
Complete genomic molecules
Incomplete genomic region
Predicted mRNA model
Predicted ncRNA model
Predicted protein model (eukaryotic sequences)
Predicted protein model (prokaryotic sequences)
For some organisms, the annotated RefSeq records are provided by collaborating groups. Depending on the organism, collaborations may be established at the whole-genome level, or smaller collaborations may be established for gene families. Whole-genome collaborations include records for Saccharomyces cerevisiae, Arabidopsis thaliana, Drosophila melanogaster, and Caenorhabditis elegans. When such a collaboration is established, the primary sequence-level review is carried out by the collaborating group. Processing of annotated genome data submitted by collaborations is semi-automated: data are provided by a collaborating group and validated at NCBI to detect obvious errors and to apply the annotation more uniformly. NCBI processing may integrate additional information such as nomenclature or other descriptive data. NCBI staff do not carry out additional manual curation of these records. NCBI may update the records to correct a general format problem, but otherwise these records are updated only when the collaborating group provides an update.