Sequence Similarity

BioCodeKb - Bioinformatics Knowledgebase

Similarity is the degree of likeness between two sequences, usually expressed as the percentage of similar (or identical) residues over a given length of the alignment. Sequence similarity is meaningful only when possible substitutions are scored according to the probability with which they occur. In protein sequences, amino acids with similar chemical properties substitute for each other more often than dissimilar amino acids do. These propensities are captured in “scoring matrices”, which are used to score sequence alignments.
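To make the idea of a scoring matrix concrete, here is a minimal sketch that scores an ungapped alignment by summing per-position substitution scores. The matrix `TOY_SCORES` and the function names are hypothetical, and the numbers are illustrative only; they are not taken from PAM, BLOSUM or any published matrix.

```python
# Toy substitution scores for four residues (illustrative values only,
# NOT taken from any published scoring matrix).
TOY_SCORES = {
    ("L", "L"): 4, ("I", "I"): 4, ("D", "D"): 6, ("E", "E"): 5,
    ("L", "I"): 2,    # both hydrophobic: mild reward
    ("D", "E"): 2,    # both acidic: mild reward
    ("L", "D"): -4, ("L", "E"): -3, ("I", "D"): -3, ("I", "E"): -3,
}


def pair_score(a, b, scores=TOY_SCORES):
    """Look up the score for residues a and b, treating the matrix as symmetric."""
    if (a, b) in scores:
        return scores[(a, b)]
    return scores[(b, a)]


def alignment_score(seq1, seq2, scores=TOY_SCORES):
    """Sum the substitution scores over an ungapped alignment of equal length."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have the same length")
    return sum(pair_score(a, b, scores) for a, b in zip(seq1, seq2))


print(alignment_score("LIDE", "LIDE"))  # identical residues
print(alignment_score("LIDE", "ILED"))  # chemically similar substitutions
print(alignment_score("LIDE", "DELI"))  # chemically dissimilar substitutions
```

With these toy values, the identical pair scores highest, the conservative substitutions still score positively, and the dissimilar pairing scores negatively, which is the behaviour a real scoring matrix is designed to produce.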

Sequence similarity searching is a method of searching sequence databases by aligning their entries to a query sequence. By statistically assessing how well database and query sequences match, one can infer homology and transfer information to the query sequence. Sequence similarity searches can identify “homologous” proteins or genes by detecting excess similarity, that is, statistically significant similarity that reflects common ancestry.
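One common way to express this statistical assessment for local alignments is the Karlin-Altschul expectation value, E = K·m·n·e^(−λS), where S is the raw alignment score and m and n are the query and database lengths. The sketch below is only an illustration: the function names are hypothetical, the values of `lam` and `k` are placeholders (real values depend on the scoring matrix and gap penalties), and the length corrections that search tools apply are ignored.

```python
import math


def bit_score(raw_score, lam, k):
    """Convert a raw alignment score into a normalized bit score."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect_value(raw_score, query_len, db_len, lam, k):
    """Expected number of chance alignments scoring at least raw_score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


# Placeholder statistical parameters (assumptions for illustration only).
LAM, K = 0.267, 0.041

for score in (40, 80, 120):
    e = expect_value(score, query_len=300, db_len=1_000_000, lam=LAM, k=K)
    print(f"raw score {score}: bits={bit_score(score, LAM, K):.1f}, E={e:.3g}")
```

The smaller the E-value, the less likely it is that a match of that score arose by chance, which is what “excess similarity” refers to.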

Sequence similarity searching has become an important part of the daily routine of molecular biologists, bioinformaticians and biophysicists. With sequence databanks growing rapidly, this computational approach is commonly applied to determine the functions and structures of unannotated sequences, to investigate relationships between sequences, and to construct phylogenetic trees. We introduce the BLAST-based family, arguably the most popular of the sequence similarity search tools.

Sequence similarity is a concept from computational biology and computer science: a number that expresses how alike two sequences are. It is sometimes, but not always, defined through sequence distance: the smaller the distance, the more similar the sequences.
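As an illustration of a distance-based definition, the sketch below computes the edit (Levenshtein) distance between two sequences and turns it into a similarity between 0 and 1. The choice of edit distance and the normalization by the longer sequence are assumptions made for this example; other distances and normalizations are equally possible.

```python
def edit_distance(s, t):
    """Minimum number of substitutions, insertions and deletions turning s into t."""
    prev = list(range(len(t) + 1))          # distances from the empty prefix of s
    for i, a in enumerate(s, start=1):
        curr = [i]                          # distance from s[:i] to the empty string
        for j, b in enumerate(t, start=1):
            cost = 0 if a == b else 1
            curr.append(min(prev[j] + 1,          # delete a
                            curr[j - 1] + 1,      # insert b
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def distance_similarity(s, t):
    """Map edit distance to a similarity in [0, 1]; 1.0 means identical."""
    longest = max(len(s), len(t))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(s, t) / longest


print(edit_distance("GATTACA", "GACTATA"))
print(distance_similarity("GATTACA", "GACTATA"))
```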

Sometimes the similarity score is expressed as a percentage, namely “percent similarity” or “percent identity”. Percent identity usually refers to the ratio of the number of matching residues to the total length of the alignment. Percent similarity counts “similar” residues (usually amino acids) in addition to the identical ones. The similarity between amino acids can be defined either by their chemical properties or by a substitution matrix such as PAM.
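The two percentages can be sketched as follows for a pair of already aligned sequences. The grouping in `SIMILAR_GROUPS` is a simplified, hypothetical classification by chemical properties chosen for this example, and the denominator is the full alignment length (gap columns included), following the definition above; other conventions exist.

```python
# Simplified, hypothetical grouping of amino acids by chemical properties.
SIMILAR_GROUPS = [
    set("AVLIM"),  # aliphatic / hydrophobic
    set("FWY"),    # aromatic
    set("ST"),     # small hydroxyl
    set("NQ"),     # amide
    set("DE"),     # acidic
    set("KRH"),    # basic
]


def percent_identity_similarity(aln1, aln2, groups=SIMILAR_GROUPS, gap="-"):
    """Return (percent identity, percent similarity) over the alignment length."""
    if len(aln1) != len(aln2) or not aln1:
        raise ValueError("aligned sequences must be non-empty and of equal length")
    identical = similar = 0
    for a, b in zip(aln1, aln2):
        if a == gap or b == gap:
            continue                      # gap columns count only in the length
        if a == b:
            identical += 1
        elif any(a in g and b in g for g in groups):
            similar += 1
    length = len(aln1)
    return 100 * identical / length, 100 * (identical + similar) / length


identity, similarity = percent_identity_similarity("MKVLD-EFST", "MRVIDAEYAT")
print(identity, similarity)
```

In this ten-column example, five columns are identical and three more pair residues from the same group, so percent similarity is higher than percent identity.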

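Similarity is an observed property of two sequences, whereas homology means descent from a common ancestor, so the two should not be equated. That said, there are two valid connections between homology and similarity, and they are the source of confusion between these notions: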

  1. Similarity can be used to infer homology. Specifically, if two similar genomic sequences are long and complex enough that they are unlikely to have arisen independently, even under similar selective pressures, then this is strong evidence for their homology.

  2. Most of the homologous loci that we observe have high similarity. I emphasize “that we observe” because there is an obvious circularity and bias here: we only know that two regions are homologous when they are similar. If we could somehow sample homologous regions in different organisms in an unbiased way, perhaps we would find much less similarity than we are accustomed to.

NCBI BLAST is the most commonly used sequence similarity search tool. It uses heuristics to perform fast local alignment searches. PSI-BLAST allows users to construct a custom position-specific scoring matrix and run a BLAST search with it, which can help find distant evolutionary relationships.
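The kind of heuristic involved can be illustrated with word seeding: instead of aligning every pair of positions, short exact word matches are located first and only those are considered for extension. The sketch below covers the seeding step alone, with hypothetical function names. It is not BLAST's implementation, which also extends and scores the hits and, for proteins, accepts similar as well as identical words.

```python
from collections import defaultdict


def index_words(seq, k):
    """Map every length-k word in seq to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def find_seeds(query, subject, k=3):
    """Return (query_pos, subject_pos) pairs where the two share an exact k-word."""
    index = index_words(query, k)
    seeds = []
    for j in range(len(subject) - k + 1):
        for i in index.get(subject[j:j + k], []):
            seeds.append((i, j))
    return seeds


print(find_seeds("ACGTAC", "TTACGT", k=3))
```

Each seed marks a candidate region; a full search tool would then try to extend it in both directions into a longer local alignment.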


