
EMBOSS (Alignment Format)

BioCodeKb - Bioinformatics Knowledgebase

EMBOSS is a free, open-source software analysis package developed for the needs of the molecular biology and bioinformatics user community. The software automatically copes with data in a variety of formats and even allows transparent retrieval of sequence data from the web.


When an EMBOSS program aligns two or more sequences, it writes the resulting alignment to a simple text file.


The alignment format has the following specifications:


Gaps in sequences

In all EMBOSS alignment formats, gaps that have been introduced into the sequences to make them align are shown by the '-' character.


Head and tail of the format

Most of the alignment formats (except those that are also standard sequence formats, such as FASTA or MSF) have a block of information at the start of the alignment describing the program, date, output filename, ID names of the sequences, and some of the parameters and statistics of the alignment.


There is also a block at the end of the alignment for summary information. Only a few programs, such as merger, use it.


Length

The header block has a “Length” parameter. This is the length of the alignment, including any gaps that have been introduced to construct the alignment.


Identity

The header block has an “Identity” parameter. This is a count of the positions over the length of the alignment where all of the residues or bases at that position are identical.


It is followed by “/___” - the length of the alignment and “(___%)” - the percentage of positions in the alignment where there are identities.
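The Identity figures described above can be reproduced with a short sketch. It assumes the aligned sequences are available as equal-length strings with '-' for gaps; the function name identity_stats and the example sequences are illustrative and not part of EMBOSS.

```python
def identity_stats(aligned):
    """Return (count, length, percent) of columns where every residue is identical."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must all have the same length")
    count = 0
    for column in zip(*aligned):
        first = column[0]
        # A column containing a gap is not counted as an identity.
        if first != "-" and all(residue == first for residue in column):
            count += 1
    percent = 100.0 * count / length if length else 0.0
    return count, length, percent


count, length, percent = identity_stats(["ACG-TA", "ACGGTT"])
print(f"Identity: {count}/{length} ({percent:.1f}%)")  # Identity: 4/6 (66.7%)
```

The denominator is the full alignment length, gaps included, which matches the “Length” definition given earlier.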


Similarity

The “Similarity” parameter is a count of the positions over the length of the alignment where >= ___% of the residues or bases at that position are similar.


Any two residues or bases are defined as similar when they have positive comparisons (as defined by the comparison matrix being used in the alignment algorithm).


It is followed by “/___” - the length of the alignment and “(___%)” - the percentage of positions in the alignment where there are similarities.


The Identity and Similarity percentages can add up to more than 100%. This is because the count of similar positions includes the count of identical positions; if residues are identical, they must also be similar.
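A minimal pairwise sketch shows why the similarity count always includes the identity count. The scores below are a toy stand-in for a real comparison matrix, and the names toy_score and identity_and_similarity are hypothetical; EMBOSS itself uses whichever matrix the alignment program was given.

```python
# Toy scores for a few residue pairs; an assumption for illustration,
# NOT a real comparison matrix.
TOY_POSITIVE_PAIRS = {frozenset("KR"): 2, frozenset("IL"): 2, frozenset("DE"): 2}


def toy_score(a, b):
    """Identical residues score positively; listed pairs score 2; all else -1."""
    if a == b:
        return 4
    return TOY_POSITIVE_PAIRS.get(frozenset((a, b)), -1)


def identity_and_similarity(seq_a, seq_b, score=toy_score):
    """Return (identical, similar, length) for a pairwise alignment."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    identical = similar = 0
    for a, b in zip(seq_a, seq_b):
        if "-" in (a, b):
            continue  # gapped columns are neither identical nor similar
        if a == b:
            identical += 1
        if score(a, b) > 0:
            similar += 1  # identical residues score positively, so they count here too
    return identical, similar, len(seq_a)


identical, similar, length = identity_and_similarity("MKIL-DW", "MRLLADG")
print(f"Identity:   {identical}/{length} ({100.0 * identical / length:.1f}%)")
print(f"Similarity: {similar}/{length} ({100.0 * similar / length:.1f}%)")
```

In this example 3 of 7 positions are identical (42.9%) and 5 of 7 are similar (71.4%), so the two percentages sum to more than 100%.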


Gaps

The header block has a “Gaps” parameter that shows a count of the positions over the length of the alignment where one or more sequences have a gap.


It is followed by “/___” - the length of the alignment and “(___%)” - the percentage of positions in the alignment where there are gaps.
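As a sketch of the Gaps count, the function below counts alignment columns in which at least one sequence has a '-' character. The name gap_stats and the sequences are made up for illustration.

```python
def gap_stats(aligned):
    """Return (gapped_columns, length, percent) for equal-length aligned sequences."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must all have the same length")
    # A column counts once, no matter how many sequences have a gap there.
    gapped = sum(1 for column in zip(*aligned) if "-" in column)
    percent = 100.0 * gapped / length if length else 0.0
    return gapped, length, percent


gapped, length, percent = gap_stats(["ACG-TA", "AC-GTT", "ACGGT-"])
print(f"Gaps: {gapped}/{length} ({percent:.1f}%)")  # Gaps: 3/6 (50.0%)
```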


Score

The header block may also have a “Score” parameter that describes the score used by the program that calculated the alignment to determine which is the best possible alignment to report.


The algorithm that was used to derive the score is not part of the alignment formatting routines.
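The header statistics can be read back with a small parser. This sketch assumes each statistic sits on its own '#'-prefixed line in the form "# Identity: 3/7 (42.9%)", as in the pair output of programs such as needle; check that assumption against your own files. The sample text and its values are invented for illustration, and parse_header is a hypothetical helper.

```python
import re

# "# Identity:  3/7 (42.9%)" style lines: a count, a length, and a percentage.
STAT_RE = re.compile(
    r"^#\s*(Identity|Similarity|Gaps):\s*(\d+)/(\d+)\s*\(\s*([\d.]+)%\)"
)
# "# Length: 7" and "# Score: 20.0" style lines: a single number.
VALUE_RE = re.compile(r"^#\s*(Length|Score):\s*(-?[\d.]+)")


def parse_header(text):
    """Collect Length, Identity, Similarity, Gaps and Score from header lines."""
    stats = {}
    for line in text.splitlines():
        match = STAT_RE.match(line)
        if match:
            name, count, length, percent = match.groups()
            stats[name] = (int(count), int(length), float(percent))
            continue
        match = VALUE_RE.match(line)
        if match:
            stats[match.group(1)] = float(match.group(2))
    return stats


# Invented sample header; the numbers are not from a real alignment.
SAMPLE = """\
# Length: 7
# Identity:       3/7 (42.9%)
# Similarity:     5/7 (71.4%)
# Gaps:           1/7 (14.3%)
# Score: 20.0
"""

print(parse_header(SAMPLE))
```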


Markup Line

The markup line is the line commonly placed between the two sequences of a pairwise alignment, or at the bottom of an alignment of three or more sequences. It shows where the sequences are mismatched, gapped, identical or similar.


In general, the markup line uses a space for a mismatch or a gap, '.' for any small positive score, ':' for a similarity that scores more than 1.0, and '|' for an identity, where both sequences have the same residue. An identity is marked '|' regardless of its score, even though scores for identities differ: 'W' matching 'W' scores much more than 'L' matching 'L' because a conserved tryptophan is more significant than a conserved leucine.
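The markup rules above can be sketched for a pairwise alignment as follows. The scores are toy values standing in for a real comparison matrix, and toy_score and markup_line are hypothetical names; this is not the EMBOSS implementation.

```python
# Toy scores for illustration only; a real run would use the comparison
# matrix chosen for the alignment.
TOY_SCORES = {frozenset("KR"): 2.0, frozenset("ST"): 1.0}


def toy_score(a, b):
    """Identical residues score 4.0; listed pairs use TOY_SCORES; all else -1.0."""
    if a == b:
        return 4.0
    return TOY_SCORES.get(frozenset((a, b)), -1.0)


def markup_line(seq_a, seq_b, score=toy_score):
    """Build the markup string for two equal-length aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    marks = []
    for a, b in zip(seq_a, seq_b):
        if "-" in (a, b):
            marks.append(" ")      # gap
        elif a == b:
            marks.append("|")      # identity, regardless of its score
        elif score(a, b) > 1.0:
            marks.append(":")      # similarity scoring more than 1.0
        elif score(a, b) > 0:
            marks.append(".")      # small positive score
        else:
            marks.append(" ")      # mismatch
    return "".join(marks)


seq_a, seq_b = "MKSW-G", "MRTWAW"
print(seq_a)
print(markup_line(seq_a, seq_b))
print(seq_b)
```

For these two sequences the markup string is '|', ':', '.', '|' followed by two spaces: one for the gap and one for the G/W mismatch.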


