Gene Expression Omnibus (GEO) DataSets

BioCodeKb - Bioinformatics Knowledgebase

The GEO DataSets database stores original submitter-supplied records (Series, Samples and Platforms) as well as curated DataSets. GEO Profiles are derived from GEO DataSets.

Curated DataSets form the basis of GEO's advanced data display and analysis features, including tools to identify differences in gene expression levels and cluster heatmaps. Not all original submitter-supplied records have been assembled into curated DataSets yet.

The GEO DataSets database can be searched using many different attributes including keywords, organism, DataSet type and authors.

Users should use this database to search for studies of interest. Retrievals include the title, summary, organism, and accession for each record, as well as links to related data.

Simple keyword searches work very well in these databases. For example, if a user wants to find studies that examine hepatocellular carcinoma, it is only necessary to type “hepatocellular carcinoma” into the GEO DataSets search box to retrieve all the DataSet, Series, and Sample records that mention that term. Generally, if users want to identify particular studies of interest, they should search the GEO DataSets database first, and then they have the option to use either GEO2R or GEO DataSets analysis tools.

While simple keyword searches work well, the ever-growing volume of data in GEO means it is increasingly necessary to use structured and filtered queries to find the most relevant data. The GEO DataSets and GEO Profiles databases enable both simple and sophisticated queries to identify data of interest. Basic keyword searches can be performed alone or in combination with Boolean operators (AND, OR, NOT) to refine the search. Keyword searches with multiple parameters are structured with the following general format:

                      term[field] OPERATOR term[field]

where term is the search term, field is the search field (omit it to search for the term across all fields), and OPERATOR is a Boolean operator (AND, OR, or NOT, which must be capitalized).
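As an illustration of the format above, the following Python sketch assembles a query string from term[field] clauses and Boolean operators. The helper names `clause` and `build_query` are hypothetical, and the `Organism` field tag is assumed here for illustration; the Advanced Search Builder lists the fields that are actually available. The last lines show how such a query could be URL-encoded for NCBI's E-utilities `esearch` endpoint with `db=gds`; no request is sent.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint; "gds" is the Entrez name for GEO DataSets.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def clause(term, field=None):
    """Format one 'term[field]' clause; omit the field to search all fields."""
    # Quote multi-word terms so they are searched as a phrase.
    text = f'"{term}"' if " " in term else term
    return f"{text}[{field}]" if field else text


def build_query(first, *rest):
    """Join clauses with Boolean operators.

    `first` is a clause string; each item in `rest` is an (operator, clause) pair.
    """
    parts = [first]
    for operator, next_clause in rest:
        op = operator.upper()  # Boolean operators must be capitalized
        if op not in {"AND", "OR", "NOT"}:
            raise ValueError(f"unsupported operator: {operator!r}")
        parts.extend([op, next_clause])
    return " ".join(parts)


query = build_query(
    clause("hepatocellular carcinoma"),
    ("and", clause("Homo sapiens", "Organism")),  # field tag is an assumption
)
print(query)
# "hepatocellular carcinoma" AND "Homo sapiens"[Organism]

# URL-encode the query for an esearch request (constructed only, not fetched).
url = ESEARCH_URL + "?" + urlencode({"db": "gds", "term": query})
print(url)
```

The same string can be pasted directly into the GEO DataSets search box; the URL form is only useful for scripted searches.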

The results presented in either GEO DataSets or GEO Profiles can be further filtered or refined in many ways. Clicking on the word “Advanced” under the search box displaying the original query takes the user to the “Advanced Search Builder” page where searches can be built from drop-down menus.

The first analysis features developed at GEO were based on curated DataSet records, which GEO staff create at periodic intervals from selected Series. DataSet records are designed to provide both visualization and data analysis tools for normalized, array-based gene expression studies stored in GEO.

The top section of a DataSet record provides information about the study including title, study summary, organism, citation, and Platform and Series accession numbers upon which the DataSet is based. The lower portion of the DataSet record has 4 tabs encompassing customizable data analysis tools to assist with identification of genes of interest within that DataSet:

  1. Find genes: Provides a search box for looking up specific gene names or symbols in this DataSet, as well as an option to identify genes that have been flagged as being differentially expressed according to the specific experimental variables in this study.

  2. Compare 2 sets of samples: Enables a user to perform a customized Student’s t-test of self-selected Samples in order to identify differentially expressed genes in this DataSet. To start the analysis, the user first selects the test to perform and the P-value significance level from the drop-down menus, and then selects the Samples to be included in the analysis.

  3. Cluster heatmaps: Presents precalculated and interactive cluster heatmap images that help detect natural groups of coordinately regulated genes. Genes with high levels of expression are represented in pink while genes with low levels of expression are represented in green. This tool allows for choice of hierarchical and partitional (K-means/medians) clustering or clustering genes by chromosome position.

  4. Experiment design and value distribution: Draws boxplots of the expression values for all Samples in a study, labeled with the corresponding Sample identifiers and Sample subset labels (e.g., drug-treated or control). The boxplots provide a visual overview of the data distribution and Sample categories in this DataSet.
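To make the "Compare 2 sets of samples" tab more concrete, here is a minimal sketch of a two-sample Student's t statistic (equal-variance, pooled form) for one gene measured in two groups of Samples. It is not GEO's implementation; the function name `student_t` and the expression values are invented for illustration. It returns only the statistic and degrees of freedom, and a P-value would still have to be looked up from the t distribution.

```python
from math import sqrt
from statistics import mean, variance


def student_t(group_a, group_b):
    """Return (t, degrees_of_freedom) for a pooled-variance two-sample t-test."""
    n_a, n_b = len(group_a), len(group_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each group needs at least two values")
    df = n_a + n_b - 2
    # Pooled sample variance, weighting each group by its degrees of freedom.
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / df
    std_err = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    if std_err == 0:
        raise ValueError("both groups have zero variance")
    return (mean(group_a) - mean(group_b)) / std_err, df


# Hypothetical expression values for one gene in two Sample subsets.
treated = [4.0, 5.0, 6.0]
control = [1.0, 2.0, 3.0]

t_stat, dof = student_t(treated, control)
print(f"t = {t_stat:.3f}, df = {dof}")
# t = 3.674, df = 4
```

A larger absolute t value at a given number of degrees of freedom corresponds to a smaller P-value, which is what the significance level chosen in the drop-down menu is compared against.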

