EMBOSS is a free open source software analysis package developed for the needs of the molecular biology and bioinformatics user community. The software automatically copes with data in a variety of formats and even allows transparent retrieval of sequence data from the web.
When EMBOSS programs align two or more sequences, the resulting output is written to a simple text file.
EMBOSS alignment formats share the following features:
Gaps in sequences
In all EMBOSS alignment formats, gaps that have been introduced into the sequences to make them align are shown by the '-' character.
Head and tail of the format
The majority of the alignment formats (except those that are also standard sequence formats, like fasta or MSF) have a block of information at the start of the alignment describing the program, date, output filename, ID names of the sequences and some of the parameters and statistics of the alignment.
There is also a block at the end of the alignment for summary information. Only a few programs, e.g. merger, use it.
Length
The header block has a parameter “Length”. This is the length of the alignment, including any gaps that have been introduced to construct the alignment.
Identity
The header block has a parameter “Identity”. This is a count of the number of positions over the length of the alignment where all of the residues or bases at that position are identical.
It is followed by “/___” - the length of the alignment and “(___%)” - the percentage of positions in the alignment where there are identities.
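As a rough illustration of how the “Identity” figures relate to each other, the following Python sketch counts identical columns in a set of already-aligned sequences. It is not EMBOSS code: the function names identity_stats and format_identity are hypothetical, and the output line is a simplified rendering rather than the exact header layout.

```python
def identity_stats(aligned):
    """Return (identical_columns, alignment_length, percent) for aligned sequences.

    `aligned` is a list of equal-length strings that use '-' for gaps.
    A column counts as identical only when every sequence has the same
    non-gap character at that position.
    """
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must all have the same length")

    identical = 0
    for column in zip(*aligned):
        first = column[0].upper()
        if first != "-" and all(c.upper() == first for c in column):
            identical += 1

    percent = 100.0 * identical / length if length else 0.0
    return identical, length, percent


def format_identity(aligned):
    """Render the count as 'count/length (percent%)' (simplified layout)."""
    identical, length, percent = identity_stats(aligned)
    return f"Identity: {identical}/{length} ({percent:.1f}%)"


print(format_identity(["ACGT-A", "ACCTTA"]))
```

Note that the denominator is the alignment length including gap columns, as described under “Length”, not the length of either input sequence.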
Similarity
This parameter shows the count of the number of positions over the length of the alignment where >= ___% of the residues or bases at that position are similar.
Any two residues or bases are defined as similar when they have positive comparisons (as defined by the comparison matrix being used in the alignment algorithm).
It is followed by “/___” - the length of the alignment and “(___%)” - the percentage of positions in the alignment where there are similarities.
The identity and similarity percentages often add up to more than 100%. This is because the count of similar positions includes the count of identical positions; if residues are identical, they must also be similar.
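The overlap between the two counts can be seen in a small pairwise sketch. The scoring table below is a made-up toy, not a real comparison matrix, and the names TOY_SCORES, toy_score and pair_counts are hypothetical; a real alignment would use whichever matrix the alignment program was run with.

```python
# Hypothetical toy scores, for illustration only. Unlisted pairs score 0.
TOY_SCORES = {
    ("L", "L"): 4, ("I", "I"): 4, ("W", "W"): 11, ("K", "K"): 5,
    ("I", "L"): 2,                      # similar but not identical
    ("K", "L"): -2, ("K", "W"): -3,     # mismatches
}


def toy_score(a, b):
    """Look up a symmetric score for the residue pair (a, b)."""
    return TOY_SCORES.get((a, b), TOY_SCORES.get((b, a), 0))


def pair_counts(seq_a, seq_b):
    """Return (identical, similar, length) for two aligned sequences.

    A position is 'similar' when its score is positive, so every
    identical position with a positive self-score is counted as similar too.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")

    identical = similar = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            continue                    # gap columns are neither
        if a == b:
            identical += 1
        if toy_score(a, b) > 0:
            similar += 1
    return identical, similar, len(seq_a)


identical, similar, length = pair_counts("LIWK-L", "LLWKKI")
print(f"Identity:   {identical}/{length} ({100 * identical / length:.1f}%)")
print(f"Similarity: {similar}/{length} ({100 * similar / length:.1f}%)")
```

With these toy inputs there are 3 identical and 5 similar positions out of 6, so the two percentages sum to more than 100% because the 3 identities are counted in both.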
Gaps
The header block has a parameter “Gaps” that shows a count of the number of positions over the length of the alignment where one or more sequences have a gap.
It is followed by “/___” - the length of the alignment and “(___%)” - the percentage of positions in the alignment where there are gaps.
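The “Gaps” count is per column, not per gap character: a column with gaps in several sequences is still counted once. A minimal sketch, assuming the aligned sequences are available as equal-length strings (the function name gap_stats is hypothetical and not part of EMBOSS):

```python
def gap_stats(aligned):
    """Return (gapped_columns, alignment_length, percent).

    A column is counted once if at least one sequence has '-' there,
    no matter how many sequences are gapped at that position.
    """
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must all have the same length")

    gapped = sum(1 for column in zip(*aligned) if "-" in column)
    percent = 100.0 * gapped / length if length else 0.0
    return gapped, length, percent


gapped, length, percent = gap_stats(["AC-GT-", "ACCGTA", "A-CGTA"])
print(f"Gaps: {gapped}/{length} ({percent:.1f}%)")
```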
Score
The header block may also have a “Score” parameter that describes the score used by the program that calculated the alignment to determine which is the best possible alignment to report.
The algorithm that was used to derive the score is not part of the alignment formatting routines.
Markup Line
The markup line is the line commonly placed between the two sequences of a pairwise alignment, or at the bottom of alignments of 3 or more sequences, that shows where sequences are mismatched, gapped, identical or similar.
In general, the markup line uses a space for a mismatch or a gap, '.' for any small positive score, ':' for a similarity that scores more than 1.0, and '|' for an identity, where both sequences have the same residue, regardless of its score. (Identities do not all score the same: 'W' matching 'W' scores much more than 'L' matching 'L', because a conserved tryptophan is more significant than a conserved leucine.)
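The markup rules above can be sketched for a pairwise alignment as follows. This is an illustrative reading of the description, not the EMBOSS implementation: the scores are a hypothetical toy table, and the names TOY_SCORES, toy_score and markup_line are assumptions made for the example.

```python
# Hypothetical toy scores for non-identical pairs; unlisted pairs score 0.
TOY_SCORES = {
    ("I", "L"): 2.0,    # scores more than 1.0 -> ':'
    ("A", "S"): 1.0,    # small positive score -> '.'
    ("K", "W"): -3.0,   # mismatch -> ' '
}


def toy_score(a, b):
    """Look up a symmetric score for the residue pair (a, b)."""
    return TOY_SCORES.get((a, b), TOY_SCORES.get((b, a), 0.0))


def markup_line(seq_a, seq_b, score=toy_score):
    """Build the markup line for two aligned sequences of equal length."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")

    marks = []
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            marks.append(" ")           # gap
        elif a == b:
            marks.append("|")           # identity, regardless of its score
        elif score(a, b) > 1.0:
            marks.append(":")           # similarity scoring more than 1.0
        elif score(a, b) > 0:
            marks.append(".")           # small positive score
        else:
            marks.append(" ")           # mismatch
    return "".join(marks)


seq_a, seq_b = "LSWW-L", "IAKWKL"
print(seq_a)
print(markup_line(seq_a, seq_b))
print(seq_b)
```

The identity check comes before the score lookup, which mirrors the statement that '|' is used for identical residues regardless of how much that identity scores.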