top of page

Smith–Waterman algorithm

BioCodeKb - Bioinformatics Knowledgebase

Smith Waterman algorithm was first proposed by Temple F. Smith and Michael S. Waterman in 1981. The algorithm explains the local sequence alignment, it gives conserved regions between the two sequences, and one can align two partially overlapping sequences, also it’s possible to align the subsequence of the sequence to itself.


In regular sequence alignment, the divergence level between the two sequences to be aligned is not easily known. The sequence lengths of the two sequences may also be unequal. In such cases, identification of regional sequence similarity may be of greater significance than finding a match that includes all residues. The first application of dynamic programming in local alignment is the Smith–Waterman algorithm. In this algorithm, positive scores are assigned for matching residues and zeros form is matches. No negative scores are used. A similar tracing-back procedure is used in dynamic programming. However, the alignment path may begin and end internally along the main diagonal.


Most commonly used pairwise alignment web servers apply the local alignment strategy, which include SIM, SSEARCH and LALIGN.


Working

Smith–Waterman algorithm aligns two sequences by matches/mismatches (also known as substitutions), insertions, and deletions. Both insertions and deletions are the operations that introduce gaps, which are represented by dashes. The Smith–Waterman algorithm has following steps:

  1. Determine the substitution matrix and the gap penalty scheme. A substitution matrix assigns each pair of bases or amino acids a score for match or mismatch. Usually matches get positive scores, whereas mismatches get relatively lower scores. A gap penalty function determines the score cost for opening or extending gaps. It is suggested that users choose the appropriate scoring system based on the goals. In addition, it is also a good practice to try different combinations of substitution matrices and gap penalties.

  2. Initialize the scoring matrix. The dimensions of the scoring matrix are 1+length of each sequence respectively. All the elements of the first row and the first column are set to 0. The extra first row and first column make it possible to align one sequence to another at any position, and setting them to 0 makes the terminal gap free from penalty.

  3. Score each element from left to right, top to bottom in the matrix, considering the outcomes of substitutions (diagonal scores) or adding gaps (horizontal and vertical scores). If none of the scores are positive, this element gets a 0. Otherwise the highest score is used and the source of that score is recorded.

  4. Starting at the element with the highest score, trace back based on the source of each score recursively, until 0 is encountered. The segments that have the highest similarity score based on the given scoring system is generated in this process. To obtain the second best local alignment, apply the trace back process starting at the second highest score outside the trace of the best alignment.


The two alignments can be given with a score, for matching as +5 , mismatch as -3 and gap penalty as -4, sum up all the individual scores and the alignment which has maximum score after this can be taken as the best alignment.

ad-scaled.webp

Need to learn more about BioCodeKB - Bioinformatics Knowledge... | BioCode and much more?

To learn Bioinformatics, analysis, tools, biological databases, Computational Biology, Bioinformatics Programming in Python & R through interactive video courses and tutorials, Join BioCode.

bottom of page