CLUSTAL (Alignment Format)

BioCodeKb - Bioinformatics Knowledgebase

Different types of sequence alignment formats are currently in use, leading to file-interconversion difficulties where diverse software packages are used. EMBOSS simplifies things by supporting most of the common alignment formats for input and output. This makes the interoperation with other sequence analysis packages easy. If our alignment is not in a recognized standard format then we will first need to convert it into a suitable one.


A clustal-formatted file is plain text. It can optionally begin with a header that gives the Clustal version number. This is followed by the multiple sequence alignment and, optionally, information about the degree of conservation at each position in the alignment.


Many programs in the MEME Suite accept as input a file containing a multiple alignment of protein or DNA sequences. These input files must be in CLUSTAL W format, usually identified by the suffix ".aln".


Clustal accepts six input formats: NBRF/PIR, EMBL/SwissProt, Pearson (FASTA), GDE, Clustal, and GCG/MSF. Partially formatted sequences are not accepted. Adding a return (newline) at the end of the sequence may help some applications understand the input.


When a clustal file is read with scikit-bio, the only supported format parameter is constructor, which specifies the type of in-memory sequence object used for each aligned sequence. It is required and must be a subclass of GrammaredSequence (e.g., DNA, RNA, Protein). For example, if we know that the clustal file we are reading contains DNA sequences, we would pass constructor=DNA to the reader call.
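The role of the constructor parameter can be illustrated with a minimal, library-free sketch. The `DNA` class and `build_alignment` function below are hypothetical stand-ins written for this example, not the actual reader API; they only show how a constructor passed by the caller decides the type (and validation) applied to each aligned sequence.

```python
from typing import Callable, Dict


class DNA:
    """Hypothetical stand-in for a grammared DNA sequence type (illustration only)."""

    # IUPAC nucleotide codes plus gap characters; chosen for this sketch.
    alphabet = set("ACGTRYSWKMBDHVN-.")

    def __init__(self, sequence: str) -> None:
        sequence = sequence.upper()
        invalid = set(sequence) - self.alphabet
        if invalid:
            raise ValueError(f"invalid DNA characters: {sorted(invalid)}")
        self.sequence = sequence

    def __str__(self) -> str:
        return self.sequence


def build_alignment(records: Dict[str, str],
                    constructor: Callable[[str], object]) -> Dict[str, object]:
    """Wrap each aligned sequence string in the sequence type chosen by the caller."""
    return {name: constructor(seq) for name, seq in records.items()}


records = {"seq1": "ACGT-ACGT", "seq2": "ACGTTACGT"}
alignment = build_alignment(records, constructor=DNA)
print(type(alignment["seq1"]).__name__, alignment["seq1"])
```

Passing a different constructor (for example a protein type) would change both the resulting object type and which characters are considered valid, which is why the reader needs to be told the sequence type up front.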


Each sequence in the alignment is divided into subsequences each at most 60 characters long. The sequence identifier for each sequence precedes each subsequence. Each subsequence can optionally be followed by the cumulative number of non-gap characters up to that point in the full sequence. A line containing conservation information about each position in the alignment can optionally follow all of the subsequences.
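The block layout described above can be sketched as follows. This is a minimal illustration, not the output routine of any particular tool: the function name `format_clustal_blocks`, the amount of padding after the identifier, and the single space before the count are assumptions made for the example. It omits the header and the conservation line.

```python
from typing import Dict, List


def format_clustal_blocks(alignment: Dict[str, str],
                          width: int = 60,
                          show_counts: bool = True) -> List[str]:
    """Split aligned sequences into blocks of at most `width` columns."""
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    total = lengths.pop()

    # Pad identifiers to a common width (the extra 6 spaces are an assumption).
    name_width = max(len(name) for name in alignment) + 6
    counts = {name: 0 for name in alignment}  # cumulative non-gap characters

    lines: List[str] = []
    for start in range(0, total, width):
        for name, seq in alignment.items():
            chunk = seq[start:start + width]
            counts[name] += len(chunk) - chunk.count("-")
            line = name.ljust(name_width) + chunk
            if show_counts:
                line += f" {counts[name]}"
            lines.append(line)
        lines.append("")  # empty line separates blocks
    return lines


# Two 80-column sequences produce one 60-column block and one 20-column block.
alignment = {"seqA": "ACGT" * 20, "seqB": "AC-T" * 20}
lines = format_clustal_blocks(alignment)
print("\n".join(lines))
```

Note that the count for seqB lags behind seqA because gap characters are not counted.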


Format Specifications

The format is very simple:

  1. The first line in the file must start with the words "CLUSTAL W" or "CLUSTALW". Other information in the first line is ignored.

  2. One or more empty lines

  3. One or more blocks of sequence data. Each block consists of:

     • One line for each sequence in the alignment. Each line consists of:

       1. the sequence name

       2. white space

       3. up to 60 sequence symbols

       4. optionally, white space followed by a cumulative count of residues for the sequence

     • A line showing the degree of conservation for the columns of the alignment in the block.

     • One or more empty lines.
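The specification above is simple enough to parse in a few lines. The following is a minimal sketch rather than a full parser: the function name `parse_clustal` is made up for this example, and it assumes that sequence names start in the first column while conservation lines start with white space, which is how they are told apart here. The optional cumulative count is ignored.

```python
from typing import Dict, List


def parse_clustal(text: str) -> Dict[str, str]:
    """Return a mapping of sequence name to full aligned sequence."""
    lines = text.splitlines()
    if not lines or not lines[0].startswith(("CLUSTAL W", "CLUSTALW")):
        raise ValueError('first line must start with "CLUSTAL W" or "CLUSTALW"')

    chunks: Dict[str, List[str]] = {}
    for line in lines[1:]:
        # Skip empty lines and conservation lines (assumed to be indented).
        if not line.strip() or line[0].isspace():
            continue
        fields = line.split()
        if len(fields) < 2:
            raise ValueError(f"malformed sequence line: {line!r}")
        name, symbols = fields[0], fields[1]  # fields[2], if present, is the count
        chunks.setdefault(name, []).append(symbols)

    # Join the per-block subsequences back into one sequence per name.
    return {name: "".join(parts) for name, parts in chunks.items()}


sample = """CLUSTAL W (1.82) multiple sequence alignment

seq1      ACGT-ACGTA 9
seq2      ACGTTACGTA 10
          **** *****

seq1      GGCC 13
seq2      GG-C 13
          ** *
"""

parsed = parse_clustal(sample)
print(parsed)
```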


Some rules about representing sequences:

  • Case doesn't matter

  • Sequence symbols should be from a valid alphabet

  • Gaps are represented using hyphens ("-").

  • The characters used to represent the degree of conservation are:

    1. Asterisk (*) shows that all residues or nucleotides in that column are identical

    2. Colon (:) shows conserved substitutions

    3. Dot (.) shows semi-conserved substitutions

    4. Empty space shows no match
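As a rough illustration of how the conservation line relates to the alignment columns, the sketch below marks fully identical columns with an asterisk. It is deliberately partial: deciding between ":" and "." requires substitution groups that are not defined in this article, so those symbols are not produced here. The function name `identity_line` and the choice to leave columns containing a gap unmarked are assumptions of this example.

```python
from typing import List


def identity_line(sequences: List[str]) -> str:
    """Return '*' for columns where all symbols are identical, ' ' otherwise."""
    if not sequences:
        raise ValueError("at least one sequence is required")
    if len({len(seq) for seq in sequences}) != 1:
        raise ValueError("all aligned sequences must have the same length")

    symbols = []
    for column in zip(*sequences):
        # Case doesn't matter, so compare upper-cased symbols.
        residues = {symbol.upper() for symbol in column}
        if len(residues) == 1 and "-" not in residues:
            symbols.append("*")
        else:
            symbols.append(" ")  # mismatch, or column containing a gap
    return "".join(symbols)


marks = identity_line(["ACGT-ACGTA", "ACGTTACGTA", "acgtTACCTA"])
print(repr(marks))  # '**** ** **'
```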
