Multiple sequence alignment is a tool used to study closely related genes or proteins, both to find the evolutionary relationships between them and to identify patterns shared among functionally or structurally related sequences.
Multiple sequence alignments provide more information than pairwise alignments because they reveal conserved regions within a protein family, which are often of structural and functional importance. Multiple sequence alignment (MSA) is an essential and widely used computational procedure for biological sequence analysis in molecular biology, computational biology, and bioinformatics.
A multiple sequence alignment (MSA) is a sequence alignment of three or more biological sequences, generally protein, DNA, or RNA. In many cases, the input set of query sequences is assumed to have an evolutionary relationship: the sequences share a lineage and are descended from a common ancestor. From the resulting MSA, sequence homology can be inferred and phylogenetic analysis can be conducted to assess the sequences' shared evolutionary origins. MSA is used to find conservation in protein domains, tertiary and secondary structures, and even individual amino acids or nucleotides.
Multiple sequence alignment also refers to the process of aligning such a sequence set. Because three or more sequences of biologically relevant length can be difficult and are almost always time-consuming to align by hand, computational algorithms are used to produce and analyze the alignments. MSAs require more sophisticated methodologies than pairwise alignment because they are more computationally complex. Most multiple sequence alignment programs use heuristic methods rather than global optimization because identifying the optimal alignment between more than a few sequences of moderate length is prohibitively computationally expensive.
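To make the cost of exact alignment concrete, the following minimal sketch counts the work done by a straightforward dynamic-programming generalization of pairwise alignment. It assumes the textbook formulation in which N sequences are aligned over an N-dimensional lattice with one axis per sequence; the function names are hypothetical and chosen only for illustration.

```python
def dp_lattice_cells(lengths):
    """Number of cells in the N-dimensional dynamic-programming lattice.

    Each sequence of length L contributes an axis with L + 1 positions
    (the extra position is the empty prefix).
    """
    cells = 1
    for length in lengths:
        cells *= length + 1
    return cells


def predecessors_per_cell(num_sequences):
    """Number of predecessor cells examined for each interior cell.

    In every column each sequence either contributes a residue or a gap,
    and the all-gap column is excluded, giving 2**N - 1 possible moves.
    """
    return 2 ** num_sequences - 1


if __name__ == "__main__":
    for n in (2, 3, 5, 10):
        cells = dp_lattice_cells([100] * n)
        moves = predecessors_per_cell(n)
        print(f"{n} sequences of length 100: {cells} cells, {moves} moves per cell")
```

Under these assumptions the number of cells grows exponentially with the number of sequences, which is why practical programs fall back on heuristics.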
The goal of multiple sequence alignment is to arrange sequences so that a maximum number of residues from each sequence are matched up according to a particular scoring function. The scoring function for multiple sequence alignment is commonly based on the concept of sum of pairs (SP). As the name suggests, it is the sum of the scores of all possible pairs of sequences in a multiple alignment, computed with a particular scoring matrix. In calculating the SP score, each column is scored by summing the scores of all possible pairwise matches, mismatches, and gap costs within it.
Many alignment methods are used to maximize the scores and correctness of multiple sequence alignments. Each is usually based on a heuristic that reflects some insight into the evolutionary process. Most try to model evolution in order to produce the most realistic alignment possible and so best predict the relationships between sequences.
Methods used in MSA
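The sum-of-pairs calculation can be sketched in a few lines. This is a minimal illustration, not the scoring used by any particular program: the match, mismatch, and gap values are arbitrary assumptions standing in for a real scoring matrix, a gap aligned with a gap is assumed to score zero, and the function names are hypothetical.

```python
from itertools import combinations


def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score one pair of symbols from the same alignment column."""
    if a == "-" and b == "-":
        return 0  # assumption: gap-gap pairs are ignored
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def sum_of_pairs(alignment, **scores):
    """Sum-of-pairs score of an alignment given as equal-length strings."""
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("all aligned sequences must have the same length")
    total = 0
    # zip(*alignment) yields the alignment column by column.
    for column in zip(*alignment):
        for a, b in combinations(column, 2):
            total += pair_score(a, b, **scores)
    return total


if __name__ == "__main__":
    msa = ["ACG-", "ACGT", "A-GT"]
    print(sum_of_pairs(msa))
```

For the three-sequence example, the first and third columns each contribute three matches, while the second and fourth each contain one match and two residue-gap pairs, so under the assumed scores the columns cancel out. A real scoring scheme would replace `pair_score` with a lookup in a substitution matrix and a more elaborate gap model.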
Progressive alignment construction
Hidden Markov models
Clustal Omega is a new multiple sequence alignment program that uses seeded guide trees and HMM profile-profile techniques to generate alignments between three or more sequences.
ClustalW2 is a general purpose DNA or protein multiple sequence alignment program for three or more sequences.
T-Coffee, which stands for tree-based consistency objective function for alignment evaluation, is a consistency-based progressive MSA algorithm. T-Coffee provides a simple and flexible means of producing multiple sequence alignments from heterogeneous data sources, which are supplied to it through a library of global and local pairwise alignments.
MAFFT is another high-quality, highly accurate multiple sequence alignment program.
MUSCLE stands for multiple sequence comparison by log expectation. MUSCLE uses two distance measures: the k-mer distance for unaligned pairs of sequences and the Kimura distance for aligned pairs of sequences.
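The two kinds of distance can be illustrated with a simplified sketch. It is not MUSCLE's implementation: the k-mer distance here is assumed to be one minus the fraction of shared k-mers, the Kimura distance uses the standard correction -ln(1 - D - D^2/5) for a fraction D of differing aligned residues, and the function names and the default k are hypothetical choices for illustration.

```python
import math
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_distance(x, y, k=3):
    """Distance between two UNALIGNED sequences from shared k-mers."""
    n = min(len(x), len(y)) - k + 1  # k-mers in the shorter sequence
    if n <= 0:
        raise ValueError("sequences must be at least k residues long")
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    shared = sum(min(cx[t], cy[t]) for t in cx)
    return 1.0 - shared / n


def kimura_distance(a, b):
    """Kimura-corrected distance between two ALIGNED sequences."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    compared = differing = 0
    for p, q in zip(a, b):
        if p == "-" or q == "-":
            continue  # columns containing a gap are skipped
        compared += 1
        differing += p != q
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    d = differing / compared
    arg = 1.0 - d - d * d / 5.0
    if arg <= 0:
        raise ValueError("sequences too divergent for this correction")
    return -math.log(arg)


if __name__ == "__main__":
    print(kmer_distance("ACGTACGT", "ACGTTCGT"))
    print(kimura_distance("ACGT-CGT", "ACGTTCGA"))
```

The k-mer distance needs no alignment, so it can be computed quickly for every pair of input sequences, whereas the Kimura distance can only be computed once an alignment exists.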