CLUSTAL (Alignment Format)

BioCodeKb - Bioinformatics Knowledgebase

Several sequence alignment formats are in common use, which makes it difficult to exchange files between different software packages. EMBOSS simplifies things by supporting most of the common alignment formats for both input and output, which makes interoperation with other sequence analysis packages easy. If an alignment is not in a recognized standard format, it must first be converted into a suitable one.

A Clustal-formatted file is a plain text file. It can begin with an optional header that states the Clustal version number. The header is followed by the multiple sequence alignment and, optionally, information about the degree of conservation at each position in the alignment.

Many programs in the MEME Suite accept as input a file containing a multiple alignment of protein or DNA sequences. These input files must be in CLUSTAL W format, usually identified by the suffix ".aln".

Clustal accepts six input formats: NBRF/PIR, EMBL/SwissProt, Pearson (FASTA), GDE, Clustal, and GCG/MSF. Partially formatted sequences are not accepted. Adding a trailing newline after the last sequence may help some applications parse the input.

In scikit-bio's clustal reader, the only supported format parameter is constructor, which specifies the type of in-memory sequence object each aligned sequence is read into. It must be a subclass of GrammaredSequence (e.g., DNA, RNA, Protein) and is a required format parameter. For example, if we know that the Clustal file we're reading contains DNA sequences, we would pass constructor=DNA to the reader call.

Each sequence in the alignment is divided into subsequences each at most 60 characters long. The sequence identifier for each sequence precedes each subsequence. Each subsequence can optionally be followed by the cumulative number of non-gap characters up to that point in the full sequence. A line containing conservation information about each position in the alignment can optionally follow all of the subsequences.
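The layout described above can be sketched as a small writer. This is only an illustration, not the output routine of any particular tool: the function name format_clustal, the header text, and the four-space padding after the longest name are assumptions made for the example.

```python
def format_clustal(records, width=60,
                   header="CLUSTAL W multiple sequence alignment"):
    """Render aligned (name, sequence) pairs as Clustal-style text.

    `records` is a list of (name, aligned_sequence) tuples. The sequences
    must already be aligned, i.e. all have the same length.
    """
    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError("aligned sequences must all have the same length")
    total = lengths.pop()

    # Assumption: pad names to the longest name plus four spaces.
    name_width = max(len(name) for name, _ in records) + 4

    lines = [header, ""]
    counts = [0] * len(records)  # cumulative non-gap characters per sequence
    for start in range(0, total, width):
        for i, (name, seq) in enumerate(records):
            chunk = seq[start:start + width]
            counts[i] += len(chunk) - chunk.count("-")
            lines.append(f"{name.ljust(name_width)}{chunk} {counts[i]}")
        lines.append("")  # an empty line separates blocks
    return "\n".join(lines)
```

Calling format_clustal([("seq1", "ACGT-ACGT"), ("seq2", "ACGTTAC-T")], width=5) splits each sequence into two subsequences and appends the running count of non-gap characters to every line. The optional conservation line is left out of this sketch.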

Format Specifications

The format is very simple:

The first line in the file must start with the words "CLUSTAL W" or "CLUSTALW". Other information in the first line is ignored.
One or more empty lines
One or more blocks of sequence data. Each block consists of:

One line for each sequence in the alignment. Each line consists of:

the sequence name
white space
up to 60 sequence symbols.
optionally, white space followed by a cumulative count of residues for the sequence

A line showing the degree of conservation for the columns of the alignment in the block.
One or more empty lines.
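The block structure above can be read back with a few lines of code. The following is a minimal sketch rather than a full validator: the function name parse_clustal is hypothetical, the cumulative count is not checked, and any line that begins with white space is assumed to be a conservation line and skipped.

```python
def parse_clustal(text):
    """Parse Clustal-formatted text into a {name: aligned_sequence} dict."""
    lines = text.splitlines()
    if not lines or not lines[0].startswith(("CLUSTAL W", "CLUSTALW")):
        raise ValueError("first line must start with 'CLUSTAL W' or 'CLUSTALW'")

    sequences = {}  # dicts preserve insertion order, so file order is kept
    for line in lines[1:]:
        # Empty lines separate blocks. Lines that start with white space
        # are assumed to be conservation lines and are ignored here.
        if not line.strip() or line[0].isspace():
            continue
        fields = line.split()
        if len(fields) < 2:
            raise ValueError(f"malformed sequence line: {line!r}")
        name, chunk = fields[0], fields[1]
        # An optional third field holds the cumulative residue count;
        # this sketch does not validate it.
        sequences[name] = sequences.get(name, "") + chunk
    return sequences
```

Each subsequence is appended to whatever has already been collected under the same name, so the full aligned sequences are rebuilt across blocks.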

Some rules about representing sequences:

Case doesn't matter
Sequence symbols should be from a valid alphabet
Gaps are represented using hyphens ("-").
The characters used to represent the degree of conservation are:

Asterisk * shows that all residues or nucleotides in that column are identical
Colon : shows conserved substitutions
Dot . shows semi-conserved substitutions
Empty space shows no match
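As an illustration of the conservation line, the sketch below marks only fully identical columns with an asterisk. The colon and dot markers depend on substitution groups that this article does not list, so they are left out. The function name identity_line is hypothetical, and treating gap-containing columns as non-matching is an assumption of the example.

```python
def identity_line(aligned_sequences):
    """Return one marker per column: '*' if all symbols match, else ' '.

    Assumes the sequences are aligned (equal length); zip() would
    silently stop at the shortest one otherwise.
    """
    markers = []
    for column in zip(*aligned_sequences):
        symbols = {symbol.upper() for symbol in column}  # case doesn't matter
        # Assumption: a column containing a gap is never marked identical.
        identical = len(symbols) == 1 and "-" not in symbols
        markers.append("*" if identical else " ")
    return "".join(markers)
```

For the two sequences "ACGT-ACGT" and "ACGTTAC-T", the columns that contain a gap are left blank and every other column receives an asterisk.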
