A BioProject is a collection of biological data related to a single initiative, originating from a single organization or from a consortium. A BioProject record provides users a single place to find links to the different data types generated for that project.

As the volume and complexity of data sets archived at NCBI grow rapidly, so does the need to gather and organize the associated metadata. Although some archival databases collected metadata, NCBI previously had no centralized approach for collecting this information and using it across databases. The BioProject database was recently established to facilitate organization and classification of project data submitted to NCBI, EBI and DDBJ databases. It captures descriptive information about research projects that result in high-volume submissions to archival databases, ties together related data across multiple archives, and serves as a central portal that informs users of data availability.

The BioProject resource is a redesigned, expanded replacement of the NCBI Genome Project resource. The redesign adds tracking of many data elements, including more precise information about a project’s scope, material, and objectives. Genome Project identifiers are retained in BioProject as the ID value for a record, and an accession number has been added. Other changes include a more flexible approach to grouping projects and the addition of data elements such as fields for funding source and general relevance categories.

The database is not limited by taxonomy and, as such, includes information for studies of eukaryotes, prokaryotes, and environmental samples. Registration for a BioProject accession is encouraged for projects that result in a very large volume of data submissions, submissions from multiple members of a collaboration, or submissions to multiple archival databases. Registration is discouraged for small datasets whose results are found under a single accession number, such as a single viral or organelle genome sequencing study. A BioProject ID is required for some database submissions, including dbVar, SRA, and GenBank microbial and eukaryotic genomes. A unique BioProject accession number is assigned to each submitted project. Submitters reference this accession when depositing corresponding BioSample records or experimental data into the archival databases.
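Since submitters quote the BioProject accession in later deposits, it can help to check an accession's shape before using it. The following is a minimal sketch, not an official validator. It assumes accessions look like "PRJ", a letter for the registering archive (N for NCBI, E for EBI, D for DDBJ), one further letter, and a run of digits. That pattern, the mapping table, and the function name `registering_archive` are assumptions introduced here for illustration and are not stated in the text above.

```python
import re
from typing import Optional

# Assumed prefix-to-archive mapping; illustrative, not taken from the text above.
ARCHIVE_BY_PREFIX = {
    "PRJN": "NCBI",
    "PRJE": "EBI",
    "PRJD": "DDBJ",
}

# Assumed accession shape: "PRJ", an archive letter, one more letter, then digits.
ACCESSION_RE = re.compile(r"^(PRJ[NED])[A-Z](\d+)$")


def registering_archive(accession: str) -> Optional[str]:
    """Return the archive implied by the accession prefix, or None if the shape does not match."""
    match = ACCESSION_RE.match(accession.strip().upper())
    if match is None:
        return None
    return ARCHIVE_BY_PREFIX[match.group(1)]
```

A call such as `registering_archive("PRJNA12345")` (a made-up accession number) would report the archive implied by the prefix. A string that does not fit the assumed pattern yields `None` instead of raising an error.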

Primary submission projects have attributes that describe the scope, material, capture, methodology, and objectives of the project.
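As a rough illustration, the five attribute groups named above could be held in a small record type. This sketch assumes each attribute is stored as free text. The class name `ProjectAttributes`, the helper `missing_attributes`, and the sample values are hypothetical and do not reflect the actual BioProject schema or its controlled vocabularies.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass(frozen=True)
class ProjectAttributes:
    """Hypothetical container for the five descriptive attributes of a primary submission project."""
    scope: str = ""
    material: str = ""
    capture: str = ""
    methodology: str = ""
    objectives: str = ""


def missing_attributes(attrs: ProjectAttributes) -> List[str]:
    """List the attribute names that are empty or whitespace-only, in declaration order."""
    return [f.name for f in fields(attrs) if not getattr(attrs, f.name).strip()]


# Illustrative values only; real submissions would use the archive's own vocabulary.
example = ProjectAttributes(scope="single organism", material="genome")
incomplete = missing_attributes(example)
```

Here `incomplete` names the attributes still to be filled in, which a submission tool could show to the user before registration.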

The BioProject resource organizes both the projects and the data from those projects, which are deposited into many archival databases maintained by members of the International Nucleotide Sequence Database Collaboration (INSDC). This allows users to search by project characteristics, using the project description and project content, across the INSDC-associated databases.

The BioProject database provides a dedicated environment in which to:

  • find distinct data types for a registered project

  • find projects that are related by a number of different metrics, including organism, submitter, data type, and collaboration

  • link project information to experimental data across multiple resources

  • access information about data availability for a project (sub-project or umbrella collaboration)

  • support cross-database queries by project identifier
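One way to picture the cross-database query in the last item is through NCBI's Entrez E-utilities, which the text above does not mention and which is assumed here as the access route. The sketch only builds request URLs for the ESearch and ELink endpoints and sends nothing over the network. The function names are hypothetical, and the accession and UID in the usage note are made up.

```python
from urllib.parse import urlencode

# Base address of the Entrez E-utilities (assumed access route, not named in the text).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def bioproject_search_url(accession: str) -> str:
    """Build an ESearch URL that looks up a BioProject record by its accession."""
    query = urlencode({"db": "bioproject", "term": accession})
    return f"{EUTILS_BASE}/esearch.fcgi?{query}"


def linked_records_url(bioproject_uid: str, target_db: str) -> str:
    """Build an ELink URL from a BioProject UID to records in another Entrez database."""
    query = urlencode({"dbfrom": "bioproject", "db": target_db, "id": bioproject_uid})
    return f"{EUTILS_BASE}/elink.fcgi?{query}"
```

For example, `bioproject_search_url("PRJNA12345")` gives a URL whose response would contain the record's numeric UID, and `linked_records_url("12345", "sra")` would then ask for SRA records linked to that project. Fetching and parsing the responses is left out to keep the sketch short.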


