Pan genomics

BioCodeKb - Bioinformatics Knowledgebase

A pan-genome is defined as the set of all unique gene families found in one or more strains of a prokaryotic species. Studies of pan-genomes have become popular due to the easy access to whole-genome sequence data for prokaryotes.

In the fields of molecular biology and genetics, a pan-genome (or supragenome) is the entire set of genes for all strains within a clade. The pan-genome includes: the core genome containing genes present in all strains within the clade, the accessory genome containing 'dispensable' genes present in a subset of the strains, and strain-specific genes. The study of the pan-genome is called pangenomics.
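The three categories above can be illustrated with plain set operations. The following is a minimal sketch, not a real pipeline: it assumes each strain has already been reduced to a set of gene-family identifiers. The function name `classify_gene_families`, the strain names, and the toy gene labels are hypothetical and chosen only for illustration.

```python
from collections import Counter


def classify_gene_families(strain_genes):
    """Split gene families into core, accessory, and strain-specific sets.

    strain_genes maps a strain name to the gene families it carries
    (assumed input format; real tools derive this from annotated genomes).
    """
    n_strains = len(strain_genes)
    # Count in how many strains each gene family occurs (once per strain).
    counts = Counter(g for genes in strain_genes.values() for g in set(genes))
    # Core: present in every strain.
    core = {g for g, c in counts.items() if c == n_strains}
    # Strain-specific: present in exactly one strain (and not core).
    strain_specific = {g for g, c in counts.items() if c == 1} - core
    # Accessory: present in a subset of strains, but in more than one.
    accessory = set(counts) - core - strain_specific
    return core, accessory, strain_specific


# Toy, made-up data: three hypothetical strains.
strains = {
    "strainA": {"dnaA", "gyrB", "recA", "toxX"},
    "strainB": {"dnaA", "gyrB", "recA", "capK"},
    "strainC": {"dnaA", "gyrB", "capK"},
}

core, accessory, strain_specific = classify_gene_families(strains)
print(sorted(core))             # ['dnaA', 'gyrB']
print(sorted(accessory))        # ['capK', 'recA']
print(sorted(strain_specific))  # ['toxX']
```

In practice the hard part is deciding which genes belong to the same family (homology clustering), which this sketch takes as given.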

Some species have open (or extensive) pan-genomes, while others have closed pan-genomes. For species with a closed pan-genome, very few genes are added per newly sequenced genome once many strains have been sequenced, and the size of the full pan-genome can be predicted theoretically. Species with an open pan-genome gain enough genes with each additional sequenced genome that the size of the full pan-genome cannot be predicted. Population size and niche versatility have been suggested as the most influential factors in determining pan-genome size. The pan-genome can also be broken down into a "core pan-genome" that contains genes present in all individuals, a "shell pan-genome" that contains genes present in two or more strains but not in all of them, and a "cloud pan-genome" that contains genes found in only a single strain.
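The difference between open and closed pan-genomes can be made concrete by tracking how many previously unseen genes each additional genome contributes. The sketch below is a simplified illustration under stated assumptions: genomes are given as sets of gene-family identifiers and are added in one fixed order, whereas real analyses typically average over many random orderings. The function name `accumulation_curve` and the toy data are hypothetical.

```python
def accumulation_curve(genomes):
    """Return (pan_sizes, new_gene_counts) as genomes are added in order.

    genomes is a sequence of gene-family sets (assumed input format).
    """
    seen = set()
    pan_sizes, new_counts = [], []
    for genes in genomes:
        new = set(genes) - seen       # gene families not observed so far
        seen |= new                   # grow the running pan-genome
        new_counts.append(len(new))
        pan_sizes.append(len(seen))
    return pan_sizes, new_counts


# Toy, made-up data: a closed-like and an open-like collection.
closed_like = [
    {"a", "b", "c", "d"},
    {"a", "b", "c", "e"},
    {"a", "b", "c", "d"},
    {"a", "b", "c", "e"},
]
open_like = [
    {"a", "b", "p"},
    {"a", "b", "q"},
    {"a", "b", "r"},
    {"a", "b", "s"},
]

print(accumulation_curve(closed_like))  # ([4, 5, 5, 5], [4, 1, 0, 0])
print(accumulation_curve(open_like))    # ([3, 4, 5, 6], [3, 1, 1, 1])
```

In the closed-like example the count of new genes drops to zero and the pan-genome size plateaus. In the open-like example every genome keeps adding a new gene, so the curve keeps rising.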

Pan-genomes were originally constructed for species of bacteria and archaea, but more recently eukaryotic pan-genomes have been developed, particularly for plant species. Plant studies have shown that pan-genome dynamics are linked to transposable elements. The pan-genome concept is most significant in an evolutionary context, particularly for metagenomics, but it is also used in genomics more broadly.

As sequencing costs have dropped, thousands of new genomes have been sequenced, and scientists have come to realize that a single reference genome is inadequate for many purposes. By sampling a diverse set of individuals, one can begin to assemble a pan-genome: a collection of all the DNA sequences that occur in a species. As scientists continue to survey and catalogue the genomic variation across human populations and begin to assemble a human pan-genome, these efforts will increase our power to connect variation to human diversity, disease and beyond.


Applications of pangenomics

  • Characterizing strains by their individual gene sets (e.g., detecting virulence factors present in only one strain of a species)

  • Developing vaccines against pathogenic strains

  • Detecting, identifying, and tracking new strains in metagenomic samples based on their individual subset of the species pangenome

  • Studying the evolutionary impact of horizontal gene transfer

  • Exploring strain diversity in environmental population genomics studies

Pangenome tools

  • Roary: Fast tool for extracting complete pangenomes, core gene sets, or differences between reference genomes

  • panX: pangenome analysis and web-based visualization

  • PanOCT: considers both gene homology and conserved gene neighborhoods

  • OrthoMCL: clustering genes into ortholog groups, e.g., for extracting core genomes

  • LS-BSR: rapid comparison of the genetic content of large numbers of genomes

  • PanPhlAn: pangenome-based detection of the gene composition of strains in environmental WGS samples


