
Multiple genome alignments

BioCodeKb - Bioinformatics Knowledgebase

The first alignment step could rely on a simple clustering/segmentation approach, but such a procedure would produce disconnected MSA blocks, giving few insights into genomic evolution. For this reason, most new-generation genome aligners rely on the sorting by reversal algorithm for the segmentation step. Sorting by reversal is an NP-complete problem that amounts to reconstructing the minimum chain of events that would edit one genome into another using a series of translocations and inversions. It is not necessary to solve this problem to align genomes, but doing so helps quantify the evolutionary cost of alternative alignments. In practice, most algorithms start by seeking colinear segments, often relying on anchor points (usually proteins) gathered using an all-against-all BLAST procedure. The most popular procedures include Mercator, which uses protein anchors; MUMs/MEMs (as in Mugsy); and the systematic use of local alignments.
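To make the idea of a "minimum chain of events" concrete, here is a minimal sketch that treats a genome as an unsigned permutation of gene indices 1..n and counts the fewest inversions needed to sort it. It is an illustration under simplifying assumptions (single chromosome, inversions only, no translocations, tiny inputs), not the procedure used by any of the aligners named above. The function names `breakpoints` and `reversal_distance` are hypothetical.

```python
from collections import deque


def breakpoints(perm):
    """Count adjacencies that are not consecutive in the framed permutation.

    Assumes `perm` is a permutation of 1..n. Each reversal removes at most
    two breakpoints, so this count gives a quick lower bound on the distance.
    """
    framed = [0] + list(perm) + [len(perm) + 1]
    return sum(1 for x, y in zip(framed, framed[1:]) if abs(x - y) != 1)


def reversal_distance(perm):
    """Exact minimum number of reversals that sort `perm`, by brute-force BFS.

    The search space grows factorially, so this is only usable for very
    short permutations; it is meant to illustrate the problem, not solve it.
    """
    start = tuple(perm)
    target = tuple(sorted(perm))
    if start == target:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        state, dist = queue.popleft()
        n = len(state)
        for i in range(n - 1):
            for j in range(i + 2, n + 1):
                # Reverse the segment state[i:j] (at least two elements).
                nxt = state[:i] + state[i:j][::-1] + state[j:]
                if nxt == target:
                    return dist + 1
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, dist + 1))
    return None  # not reached for a valid permutation


if __name__ == "__main__":
    genome = [2, 3, 1]
    print(breakpoints(genome), reversal_distance(genome))
```

The breakpoint count is cheap to compute and bounds the distance from below, which is one way alternative segmentations can be compared without solving the full problem.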

TBA was one of the first algorithms to consider a multiple genome alignment (MGA) as a set of separate blocks rather than a continuous sequence, thus making data pre-processing a necessary prerequisite. In the newest generation of MGAs, the pre-processing has become tightly integrated with the alignment process, as in Mercator-Mavid or Enredo/Pecan. Enredo uses graph structures to identify the different genome rearrangements, splits the multiple genomes accordingly and feeds the resulting bins of multiple sequences to Pecan, a space-efficient consistency-based aligner using the Durbin forward-only linear space dynamic programming procedure. Other graph structures have been used for this purpose.
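The "set of separate blocks" view can be sketched as a small data structure. The layout below is an assumption made for illustration (the class names `Segment` and `Block` and the coordinate convention are hypothetical, not TBA's internal representation): each block holds one gapped row per genome, together with the coordinates and strand of the segment it came from, so that inverted or relocated regions can sit in their own block.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    genome: str   # name of the source genome
    start: int    # 0-based start on the forward strand, inclusive
    end: int      # end on the forward strand, exclusive
    strand: str   # "+" or "-" (a "-" row is shown reverse-complemented)
    aligned: str  # gapped row of this block, "-" marks a gap


@dataclass
class Block:
    segments: list

    def is_consistent(self):
        """Check that all rows have equal width and match their coordinates."""
        widths = {len(s.aligned) for s in self.segments}
        lengths_ok = all(
            len(s.aligned.replace("-", "")) == s.end - s.start
            for s in self.segments
        )
        return len(widths) <= 1 and lengths_ok


if __name__ == "__main__":
    block = Block([
        Segment("genomeA", 100, 104, "+", "ACGT-"),
        Segment("genomeB", 250, 255, "-", "ACGTT"),  # inverted in genomeB
    ])
    print(block.is_consistent())
```

An alignment is then simply a list of such blocks, with no requirement that consecutive blocks be adjacent or in the same order in every genome.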

MGA method development has, however, been hampered by the difficulty of objectively assessing the relative merits of each aligner. In contrast with protein or RNA sequences, no such thing as a structure or its equivalent is available for genomes. When the Alignathon contest proposed to compare the capacities of MGAs on eukaryotic data, the benchmarking was eventually carried out using the PSAR objective function, a sequence-based estimator relying on probabilistic sampling. The PSAR objective function was initially developed to evaluate genomic MSAs. Its principle is somewhat similar to the consistency-based approach of T-Coffee, though more complete and more computationally demanding. In PSAR, given a data set, each sequence is removed in turn, the remaining sequences are realigned, and the removed sequence is then realigned to the resulting sub-alignment. The stability of the realignment with respect to the input MSA is then used to estimate the reliability of each residue's position within the final alignment model. This procedure is generic, with no constraint limiting it to nucleotide alignments. It has, however, so far only been tested and benchmarked on simulated genomic data sets.
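The notion of "stability of the realignment with respect to the input MSA" can be illustrated with a simple pair-agreement measure. The sketch below is not PSAR itself: it omits the probabilistic sampling and realignment steps entirely and only shows how one might compare where a left-out sequence's residues end up before and after realignment. The function names and the dict-of-rows alignment format are assumptions made for the example.

```python
def aligned_pairs(msa, target):
    """Collect residue pairs that share a column with the target sequence.

    `msa` maps sequence names to equal-length gapped rows ("-" is a gap).
    Returns a set of (target_residue_index, other_name, other_residue_index).
    """
    names = list(msa)
    counters = {name: 0 for name in names}  # next ungapped index per row
    pairs = set()
    width = len(next(iter(msa.values())))
    for col in range(width):
        present = {}
        for name in names:
            if msa[name][col] != "-":
                present[name] = counters[name]
                counters[name] += 1
        if target in present:
            for name, idx in present.items():
                if name != target:
                    pairs.add((present[target], name, idx))
    return pairs


def pair_agreement(reference, realigned, target):
    """Fraction of the target's reference pairs recovered after realignment."""
    ref = aligned_pairs(reference, target)
    new = aligned_pairs(realigned, target)
    return len(ref & new) / len(ref) if ref else 1.0


if __name__ == "__main__":
    reference = {"s1": "ACG-T", "s2": "AC-GT", "s3": "ACGGT"}
    realigned = {"s1": "ACG-T", "s2": "ACG-T", "s3": "ACGGT"}
    print(pair_agreement(reference, realigned, "s2"))
```

Repeating such a comparison for every left-out sequence, and over many sampled realignments rather than one, is the kind of signal a stability-based reliability score builds on.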

To align longer sequences, most programs for genomic alignment rely on some sort of anchoring. In a first step, they use a fast local alignment method to identify high-scoring local homologies, so-called anchor points. Next, chains of such local alignments are calculated and, finally, sequence segments between the selected anchor points are aligned with a slower but more sensitive alignment method. For multiple sequence sets, either pairwise or multiple local alignments can be used as anchor points. A pioneering tool for finding anchor points for genomic alignment is MUMmer; the current version of the program is considered the state of the art in alignment anchoring. MUMmer uses maximal unique matches as pairwise anchor points. The genome aligner named MGA (a program, not to be confused with the abbreviation for multiple genome alignment used above), by contrast, uses maximal exact matches involving all input sequences. Both MUMmer and MGA use suffix trees and related data structures to rapidly identify the pairwise or multiple word matches. MUMmer and MGA can rapidly align entire bacterial genomes. However, since the number of exact word matches decreases with increasing evolutionary distance, these approaches are most useful when closely related genomes are to be compared, such as different strains of E. coli.
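The first two steps described above, finding anchors and chaining them, can be sketched in a few lines. This is a simplified illustration, not MUMmer's algorithm: it uses fixed-length unique k-mers found with a hash table instead of maximal unique matches found with suffix trees, and it chains anchors with a plain longest-increasing-subsequence pass that ignores anchor length and inversions. The function names are hypothetical.

```python
from bisect import bisect_left
from collections import defaultdict


def unique_kmer_anchors(a, b, k):
    """Return sorted (pos_a, pos_b) pairs for k-mers unique in both sequences."""
    def index(seq):
        table = defaultdict(list)
        for i in range(len(seq) - k + 1):
            table[seq[i:i + k]].append(i)
        return table

    index_a, index_b = index(a), index(b)
    return sorted(
        (positions[0], index_b[kmer][0])
        for kmer, positions in index_a.items()
        if len(positions) == 1 and len(index_b.get(kmer, ())) == 1
    )


def chain_anchors(anchors):
    """Longest colinear chain: anchors increasing in both coordinates.

    `anchors` must be sorted by pos_a with distinct positions in each
    sequence, as produced by `unique_kmer_anchors`.
    """
    tails, tails_idx = [], []      # best chain end (pos_b, anchor index) per length
    prev = [-1] * len(anchors)     # back-pointers used to recover the chain
    for idx, (_, pos_b) in enumerate(anchors):
        j = bisect_left(tails, pos_b)
        if j == len(tails):
            tails.append(pos_b)
            tails_idx.append(idx)
        else:
            tails[j] = pos_b
            tails_idx[j] = idx
        prev[idx] = tails_idx[j - 1] if j > 0 else -1
    chain = []
    idx = tails_idx[-1] if tails_idx else -1
    while idx != -1:
        chain.append(anchors[idx])
        idx = prev[idx]
    return chain[::-1]


if __name__ == "__main__":
    anchors = unique_kmer_anchors("AAACCCGGG", "CCCTTTAAA", 3)
    print(anchors, chain_anchors(anchors))
```

In a full aligner, the gaps between consecutive anchors of the chain would then be handed to a slower, more sensitive alignment method, as described in the paragraph above.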

Mugsy is a popular software pipeline for multiple genome alignment. In a first step, this program uses nucmer to construct all pairwise alignments of the input sequences. Nucmer, in turn, uses MUMmer to find exact unique word matches, which are used as alignment anchor points. An alignment graph is constructed from these pairwise alignments using the SeqAn software, and Locally Collinear Blocks are identified. Finally, a multiple alignment is calculated using SeqAn or T-Coffee.

BioinfoLytics Company

Our company, BioinfoLytics, is affiliated with BioCode. It is a project covering many topics in genomics and proteomics and their analysis with a range of tools, including Sequence Alignment & Analysis, Bioinformatics Scripting & Software Development, Phylogenetic and Phylogenomic Analysis, Functional Analysis, Biological Data Analysis & Visualization, Custom Analysis, Biological Database Analysis, Molecular Docking, Protein Structure Prediction and Molecular Dynamics. These services are intended to help BioCode learners develop their interests, meet their requirements and obtain the results they need. We provide a platform where one can learn, have research projects analyzed, and get help and knowledge in molecular, computational and analytical biology.

We provide a “Multiple Genome Alignments” service to our customers to calculate ancestral sequences, age of base, conservation scores and constrained elements between groups of genomes, with the aim of supporting high-quality research and advancing science in the domain of genome analysis.


Need to learn more about multiple genome alignments and much more?

To learn bioinformatics, analysis, tools, biological databases, computational biology, and bioinformatics programming in Python & R through interactive video courses and tutorials, join BioCode.
