
HomoloGene (Gene and Protein Families) Database

BioCodeKb - Bioinformatics Knowledgebase

HomoloGene is a database of curated and calculated gene orthologs and homologs that now covers 21 organisms. Curated orthologs include gene pairs from the Mouse Genome Database (MGD) at the Jackson Laboratory, the Zebrafish Information Network (ZFIN) database at the University of Oregon, the Saccharomyces Genome Database (SGD), Clusters of Orthologous Groups (COG), FlyBase, Online Mendelian Inheritance in Man (OMIM), and published reports. Computed orthologs and homologs, which are considered putative, are identified by sequence similarity from pre-computed BLAST nucleotide sequence comparisons between all UniGene clusters for each pair of organisms. The HomoloGene database can be queried using UniGene ClusterIDs, LocusLink LocusIDs, gene symbols, gene names, and nucleotide accession numbers, as well as terms found in UniGene cluster titles.

HomoloGene is a tool of the United States National Center for Biotechnology Information (NCBI).

HomoloGene processing starts from the protein sequences of the input organisms. The sequences are compared using blastp, then matched and placed into groups, guided by a taxonomic tree built from sequence similarity: closely related organisms are matched first, and more distant organisms are then added to the tree. The protein alignments are mapped back to their corresponding DNA sequences, from which distance metrics such as molecular distances can be calculated.
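The step of mapping a protein alignment back to DNA and then measuring a distance can be illustrated with a minimal sketch. It is not HomoloGene's actual code: the function names are hypothetical, the toy sequences are made up, and the choice of the p-distance with a Jukes-Cantor correction is an assumption, used here only as one common example of a molecular distance.

```python
import math


def protein_to_codon_alignment(aligned_protein, cds):
    """Thread a coding sequence onto one gapped row of a protein alignment.

    Each residue consumes the next codon; each gap '-' becomes '---'.
    """
    if len(cds) % 3 != 0:
        raise ValueError("CDS length must be a multiple of 3")
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    residues = aligned_protein.replace("-", "")
    if len(codons) < len(residues):
        raise ValueError("CDS is too short for the protein sequence")
    out = []
    next_codon = 0
    for symbol in aligned_protein:
        if symbol == "-":
            out.append("---")
        else:
            out.append(codons[next_codon])
            next_codon += 1
    return "".join(out)


def p_distance(dna_a, dna_b):
    """Fraction of differing sites, ignoring columns with a gap in either row."""
    if len(dna_a) != len(dna_b):
        raise ValueError("aligned sequences must have equal length")
    compared = differing = 0
    for a, b in zip(dna_a, dna_b):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return differing / compared


def jukes_cantor(p):
    """Jukes-Cantor corrected distance: d = -3/4 * ln(1 - 4p/3)."""
    if p >= 0.75:
        raise ValueError("distance is undefined for p >= 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)


# Toy example (made-up sequences): proteins MK-L and MKAL with their CDS.
row_a = protein_to_codon_alignment("MK-L", "ATGAAACTG")
row_b = protein_to_codon_alignment("MKAL", "ATGAAGGCTCTG")
p = p_distance(row_a, row_b)
print(row_a, row_b, p, jukes_cantor(p))
```

Working at the codon level keeps the DNA alignment in frame, which is what makes later codon-based measures possible.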

The sequences are matched using a heuristic algorithm that maximizes the score globally, rather than locally, in a bipartite matching. The statistical significance of each match is then calculated. Cutoffs are made per position, and Ks values are set, to prevent false "orthologs" from being grouped together. "Paralogs" are identified by finding sequences that are closer within a species than to sequences in other species.
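The difference between maximizing the score locally and globally in a bipartite matching can be shown on a toy score matrix. The sketch below is an illustration only: HomoloGene is described as using a heuristic, whereas this example substitutes an exhaustive search (practical only for tiny inputs), and the scores and function names are invented for the demonstration.

```python
from itertools import permutations


def greedy_matching(scores):
    """Locally optimal: repeatedly take the best remaining pair."""
    pairs = sorted(
        ((s, i, j) for i, row in enumerate(scores) for j, s in enumerate(row)),
        reverse=True,
    )
    used_rows, used_cols, chosen = set(), set(), []
    for _, i, j in pairs:
        if i not in used_rows and j not in used_cols:
            chosen.append((i, j))
            used_rows.add(i)
            used_cols.add(j)
    return sorted(chosen)


def global_matching(scores):
    """Globally optimal one-to-one matching by exhaustive search.

    Assumes a small square matrix: scores[i][j] is the similarity between
    gene i of one species and gene j of the other.
    """
    n = len(scores)
    best = max(
        permutations(range(n)),
        key=lambda cols: sum(scores[i][cols[i]] for i in range(n)),
    )
    return [(i, best[i]) for i in range(n)]


def total_score(scores, matching):
    """Sum of the scores of all matched pairs."""
    return sum(scores[i][j] for i, j in matching)


# Made-up scores: the single best pair (0, 0) forces a poor second pair.
scores = [
    [100, 95],
    [98, 10],
]
for name, matching in (("greedy", greedy_matching(scores)),
                       ("global", global_matching(scores))):
    print(name, matching, total_score(scores, matching))
```

In this matrix the greedy choice locks in the highest single score and leaves a weak pairing for the remaining genes, while the global solution accepts two slightly lower scores for a much higher total.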

Database fields include homologene_group_id, taxon_id, gene_id_key, gene_symbol, protein_gi and protein_acc. HomoloGene IDs should be relatively stable: a group is assigned an existing ID as long as it contains more than 50% of the genes from the currently existing group.
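As a rough sketch of how these fields and the ID-stability rule might be handled in code, the example below parses six-column rows and applies the "more than 50%" test. Several things here are assumptions rather than documented behavior: the tab-delimited layout, the function names, the placeholder values, and the handling of the case where more than one existing group qualifies (the first match is returned).

```python
from collections import namedtuple

# Field names are taken from the list above; the column order is assumed.
HomoloGeneRow = namedtuple(
    "HomoloGeneRow",
    ["homologene_group_id", "taxon_id", "gene_id_key",
     "gene_symbol", "protein_gi", "protein_acc"],
)


def parse_rows(lines):
    """Parse tab-delimited lines (assumed layout) into HomoloGeneRow records."""
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        fields = line.split("\t")
        if len(fields) != 6:
            raise ValueError(f"expected 6 fields, got {len(fields)}")
        rows.append(HomoloGeneRow(*fields))
    return rows


def inherit_group_id(new_genes, existing_groups):
    """Return the ID of an existing group if the new group contains more
    than 50% of that group's genes; otherwise return None.

    existing_groups maps group ID -> iterable of gene IDs.
    """
    new_genes = set(new_genes)
    for group_id, old_genes in existing_groups.items():
        old_genes = set(old_genes)
        if old_genes and len(new_genes & old_genes) / len(old_genes) > 0.5:
            return group_id
    return None


# Placeholder values only, not real HomoloGene data.
sample = [
    "G1\tTAX_A\tgene1\tSYM1\tgi1\tACC1\n",
    "G1\tTAX_B\tgene2\tSYM2\tgi2\tACC2\n",
    "G1\tTAX_C\tgene3\tSYM3\tgi3\tACC3\n",
]
rows = parse_rows(sample)
existing = {"G1": {row.gene_id_key for row in rows}}
print(inherit_group_id({"gene1", "gene2", "gene9"}, existing))
print(inherit_group_id({"gene1", "gene8", "gene9"}, existing))
```

Note that a group containing exactly half of the old genes does not inherit the ID under this reading, because the rule says "more than 50%".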

The HomoloGene database can be searched with a gene name. If the search finds multiple records, click on the desired record; the homologous genes are listed at the top of the report. If the search returns no records, search the Gene database with the gene name, click on the desired record, and then click on the HomoloGene link in the list on the right side of the page. If there is no link to HomoloGene, locate a protein Reference Sequence (e.g. NP_005537) in the NCBI Reference Sequences section of the Gene record and follow the instructions under "a protein accession number" below. If there are no Reference Sequences in the Gene record, search the Protein database with the gene name, select the desired record, and then follow the instructions under "a protein accession number" below.


From a HomoloGene report, you can:

  • View related genes, their proteins, related phenotypes and PubMed entries

  • View conserved domains identified in these proteins

  • View curated homology data

  • View related UniGene clusters

Because HomoloGene is now an Entrez database, it can be queried using an assortment of fielded terms combined with Boolean operators.
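A fielded Boolean query of this kind could be assembled as in the sketch below, which only builds a query string and a URL and makes no network request. The helper names are hypothetical, and the field tags "gene name" and "orgn", the example search terms, and the availability of a "homologene" database through the NCBI E-utilities ESearch endpoint are assumptions that should be checked against current NCBI documentation.

```python
from urllib.parse import urlencode

# NCBI E-utilities ESearch endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def fielded(term, field):
    """Attach an Entrez field tag to a term, e.g. tpo[gene name]."""
    return f"{term}[{field}]"


def combine(operator, *clauses):
    """Join clauses with a Boolean operator (AND, OR or NOT)."""
    operator = operator.upper()
    if operator not in {"AND", "OR", "NOT"}:
        raise ValueError(f"unsupported operator: {operator}")
    return f" {operator} ".join(clauses)


def esearch_url(query, db="homologene"):
    """Build an ESearch URL; the database name is an assumption."""
    return ESEARCH + "?" + urlencode({"db": db, "term": query})


# Assumed field tags and example terms, for illustration only.
query = combine("AND", fielded("tpo", "gene name"), fielded("human", "orgn"))
print(query)
print(esearch_url(query))
```

Keeping the query construction separate from the URL building makes it easy to reuse the same query string in the web search box.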


