Although genome-wide RNA expression analysis has become a routine tool in biomedical research, extracting biological insight from such data remains a major challenge. Gene set enrichment analysis (GSEA), also called functional enrichment analysis, is a method to identify classes of genes or proteins that are over-represented in a large set of genes or proteins and may be associated with disease phenotypes. The method uses statistical approaches to identify significantly enriched or depleted groups of genes. It derives its power by focusing on gene sets, that is, groups of genes that share a common biological function, chromosomal location, or regulation. Transcriptomics and proteomics experiments often identify thousands of genes, which serve as the input for the analysis.
Researchers performing high-throughput experiments that generate sets of genes often want to retrieve a functional profile of that gene set, in order to better understand the underlying biological processes. This can be done by comparing the input gene set to each of the bins (terms) in the gene ontology.
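The text does not say which statistic is used for this comparison. One common choice is a one-sided hypergeometric test for each term, and the following is a minimal sketch under that assumption. The function names and the toy gene identifiers are hypothetical and chosen only for illustration.

```python
from math import comb


def hypergeom_upper_tail(k, N, K, n):
    """P(X >= k) when n genes are drawn from N, of which K belong to the term."""
    upper = min(K, n)
    total = comb(N, n)
    # comb() returns 0 for impossible draws, so no extra bounds checks are needed.
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / total


def overrepresentation_p(input_genes, term_genes, background):
    """P value for the overlap between an input gene set and one ontology term."""
    background = set(background)
    # Genes outside the background cannot be counted in either set.
    input_genes = set(input_genes) & background
    term_genes = set(term_genes) & background
    overlap = len(input_genes & term_genes)
    return hypergeom_upper_tail(
        overlap, len(background), len(term_genes), len(input_genes)
    )


# Toy data: 20 background genes, a 5-gene term, a 4-gene input list.
background = [f"g{i}" for i in range(20)]
term = ["g0", "g1", "g2", "g3", "g4"]
hits = ["g0", "g1", "g2", "g10"]
p_value = overrepresentation_p(hits, term, background)
```

In practice the test is repeated for every term, so the resulting P values need a multiple-testing correction before they are interpreted.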
Gene set enrichment analysis uses a priori gene sets that have been grouped together by their involvement in the same biological pathway, or by proximal location on a chromosome. A database of these predefined sets can be found at the Molecular Signatures Database (MSigDB). In GSEA, expression is still measured with DNA microarrays, or now RNA-Seq, and compared between two cell categories, but instead of focusing on individual genes in a long list, the focus is put on a gene set.
In the method that is typically referred to as standard GSEA, there are three steps involved in the analytical process. The general steps are summarized below:
Calculate the enrichment score (ES), which reflects the degree to which the genes in the set are over-represented at either the top or bottom of the ranked list. This score is a Kolmogorov–Smirnov-like statistic.
Estimate the statistical significance of the ES. This calculation is done by a phenotype-based permutation test that produces a null distribution for the ES. The P value is determined by comparison to the null distribution.
Adjust for multiple hypothesis testing when a large number of gene sets are analyzed at one time. The enrichment scores for each set are normalized and a false discovery rate is calculated.
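The first two steps can be sketched in a few lines of Python. This is an illustrative sketch, not the reference implementation: the ranking metric (difference of class means), the weighting exponent, the same-sign P value convention, and all function names are assumptions made for the example. Normalization and the false discovery rate from the third step are left out.

```python
import random


def enrichment_score(ranked_genes, scores, gene_set, weight=1.0):
    """Running-sum ES: step up at set members, step down at all other genes."""
    hits = [g in gene_set for g in ranked_genes]
    n_miss = len(ranked_genes) - sum(hits)
    hit_total = sum(abs(s) ** weight for s, h in zip(scores, hits) if h)
    if hit_total == 0 or n_miss == 0:
        return 0.0  # degenerate case: nothing to walk up or down on
    running, best = 0.0, 0.0
    for s, h in zip(scores, hits):
        running += abs(s) ** weight / hit_total if h else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running  # keep the maximum deviation from zero
    return best


def rank_genes(expr, labels):
    """Rank genes by mean(class 1) - mean(class 0); an assumed, simple metric."""
    metric = {}
    for gene, values in expr.items():
        ones = [v for v, lab in zip(values, labels) if lab == 1]
        zeros = [v for v, lab in zip(values, labels) if lab == 0]
        metric[gene] = sum(ones) / len(ones) - sum(zeros) / len(zeros)
    ranked = sorted(metric, key=metric.get, reverse=True)
    return ranked, [metric[g] for g in ranked]


def permutation_p(expr, labels, gene_set, n_perm=1000, seed=0):
    """Shuffle phenotype labels to build a null distribution for the ES."""
    rng = random.Random(seed)
    observed = enrichment_score(*rank_genes(expr, labels), gene_set)
    shuffled = list(labels)
    null = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        null.append(enrichment_score(*rank_genes(expr, shuffled), gene_set))
    # Compare only against null scores with the same sign as the observed ES.
    same_sign = [x for x in null if (x >= 0) == (observed >= 0)]
    extreme = sum(abs(x) >= abs(observed) for x in same_sign)
    return observed, (extreme + 1) / (len(same_sign) + 1)


# Toy expression table: four genes measured in four samples (two per class).
expr = {"a": [5, 5, 1, 1], "b": [4, 4, 1, 1], "c": [1, 1, 1, 1], "d": [1, 1, 3, 3]}
labels = [1, 1, 0, 0]
es, p = permutation_p(expr, labels, {"a", "b"}, n_perm=200)
```

Shuffling the sample labels rather than the gene order keeps the correlation between genes intact, which is the reason for building the null distribution from phenotype permutations.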
GSEA uses complicated statistics, so it requires a computer program to run the calculations. GSEA has become standard practice, and there are many websites and downloadable programs that will provide the data sets and run the analysis.
Tools
NASQAR (Nucleic Acid SeQuence Analysis Resource)
PlantRegMap
MSigDB
Broad Institute
WebGestalt
Enrichr
GeneSCF
DAVID
Metascape
AmiGO 2
GREAT
FunRich
FuncAssociate
InterMine
ToppGene Suite
QuSAGE
Blast2GO
g:Profiler
Notably, where single-gene analysis finds little similarity between two independent studies of patient survival in lung cancer, GSEA reveals many biological pathways in common. The GSEA method is embodied in a freely available software package, together with an initial database of 1,325 biologically defined gene sets.
BioinfoLytics Company
Our company, BioinfoLytics, is affiliated with BioCode. As a project, we cover topics in genomics and proteomics and their analysis with a range of tools, including sequence alignment and analysis, bioinformatics scripting and software development, phylogenetic and phylogenomic analysis, functional analysis, biological data analysis and visualization, custom analysis, biological database analysis, molecular docking, protein structure prediction, and molecular dynamics. These services are offered to BioCode learners who want to develop their interests further and obtain the results they need. We provide a platform where one can learn, have research projects analyzed, and get help and knowledge in molecular, computational, and analytical biology.
Our “Functional Enrichment Analysis” service provides information on methods to identify classes of genes or proteins that are over-represented in a large set of genes or proteins and may be associated with disease phenotypes.