AlignIO - Reading and Parsing a Multiple Sequence Alignment File
To follow this lesson, you should be familiar with the following concepts:
Multiple Sequence Alignment (MSA) file formats.
Stockholm (.sth) file format.
FASTA (.fasta) file format.
Bio.AlignIO is the multiple sequence alignment input/output interface included in the BioPython package. The AlignIO module handles files that contain one or more sequence alignments, each represented as an alignment object. Its interface is very similar to that of BioPython's SeqIO module, and the two are connected internally. With AlignIO you can read an alignment file in a given format, write an alignment to a file in a given format, and print the contents of an MSA file in a more readable form.
Import the AlignIO module from BioPython to use its functionalities.
from Bio import AlignIO
To read an MSA file, call the read() function of the AlignIO module with the file name and the file format as arguments, and store the returned alignment in a variable:
align = AlignIO.read("Filename", "Format")
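As a minimal sketch of the call above, the following example reads a small alignment from memory instead of from disk. The sequence names and contents are made up for illustration, and io.StringIO is used only as a stand-in for a real file; with a file on disk you would pass its name as the first argument.

```python
from io import StringIO

from Bio import AlignIO

# Hypothetical toy alignment in FASTA format: three gapped sequences of equal length.
fasta_text = """>seq1
ACGT-ACGT
>seq2
ACGTTACGT
>seq3
ACG--ACGT
"""

# read() accepts a file name or an open handle; StringIO plays the role of the file here.
align = AlignIO.read(StringIO(fasta_text), "fasta")

print(len(align))                    # number of sequences (rows) in the alignment
print(align.get_alignment_length())  # number of columns in the alignment
```

The format string ("fasta" here) must match the actual content of the file, since AlignIO does not guess the format.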
Note: Use the parse() function when the file contains more than one alignment, for example the resampled data sets produced by bootstrapping. Alignment files produced by MEGA, ClustalW, and similar tools usually contain a single alignment of all the sequences, so in most cases read() is all you need.
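The difference between the two functions can be sketched with a file that holds two alignments. The example below builds two small PHYLIP alignments in memory; the helper phylip_block() and the sequence data are hypothetical and exist only for this illustration.

```python
from io import StringIO

from Bio import AlignIO


def phylip_block(rows):
    """Format (name, sequence) pairs as one strict PHYLIP alignment (names padded to 10 characters)."""
    header = f" {len(rows)} {len(rows[0][1])}\n"
    return header + "".join(f"{name:<10}{seq}\n" for name, seq in rows)


# Two hypothetical bootstrap replicates stored one after the other in the same "file".
replicates = phylip_block([("Alpha", "AACGTT"), ("Beta", "AACGTA"), ("Gamma", "AAC-TT")])
replicates += phylip_block([("Alpha", "AAGGTT"), ("Beta", "AAGGTA"), ("Gamma", "AAG-TT")])

# parse() returns an iterator that yields one alignment at a time.
alignments = list(AlignIO.parse(StringIO(replicates), "phylip"))
for number, alignment in enumerate(alignments, start=1):
    print(f"Alignment {number}: {len(alignment)} sequences, "
          f"{alignment.get_alignment_length()} columns")

# read() expects exactly one alignment, so it raises ValueError on this input.
try:
    AlignIO.read(StringIO(replicates), "phylip")
    read_failed = False
except ValueError:
    read_failed = True
print(read_failed)
```

Because parse() is lazy, you can also loop over it directly instead of building a list, which helps with files that contain many alignments.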
To view the alignment, you can pass the variable (here, align) to the print() function. This prints only a summary, however: long sequences are shortened rather than shown in full.
To print the alignment in full, first print its length using the get_alignment_length() method:
print("Alignment length %i" % align.get_alignment_length())
Then loop over the sequence records in the alignment (align) with a for loop and print each record with a format string:
for record in align:
    print("%s - %s" % (record.id, record.seq))
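The two ways of printing described above can be compared side by side in one runnable sketch. The 60-column alignment below is made up for illustration and is read from memory through io.StringIO rather than from a real file.

```python
from io import StringIO

from Bio import AlignIO

# Hypothetical alignment with 60 columns, long enough for the summary view to shorten it.
rows = {"seqA": "ACGT" * 15, "seqB": "ACGA" * 15, "seqC": "AC-T" * 15}
fasta_text = "".join(f">{name}\n{seq}\n" for name, seq in rows.items())
align = AlignIO.read(StringIO(fasta_text), "fasta")

# Way 1: summary view of the whole alignment (long rows are abbreviated).
print(align)

# Way 2: the alignment length followed by every record in full.
print("Alignment length %i" % align.get_alignment_length())
for record in align:
    print("%s - %s" % (record.id, record.seq))
```

Each record in the loop is a sequence record, so other attributes such as record.description are available in the same way.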
This table lists the file formats that Bio.AlignIO can read and write, with the Biopython version where this was first supported.
The format name is a simple lowercase string, matching the names used in Bio.SeqIO. Where possible we use the same name as BioPerl’s SeqIO and EMBOSS.
Format name Reads Writes Notes
clustal 1.46 1.46 The alignment format of Clustal X and Clustal W.
emboss 1.46 No The EMBOSS simple/pairs alignment format.
fasta 1.46 1.48 This refers to the input file format introduced for Bill Pearson’s FASTA tool, where each record starts with a “>” line. Note that storing more than one alignment in this format is ambiguous. Writing FASTA files with AlignIO failed prior to release 1.48 (Bug 2557).
fasta-m10 1.46 No This refers to the pairwise alignment output from Bill Pearson’s FASTA tools, specifically the machine readable version when the -m 10 command line option is used. The default free format text output from the FASTA tools is not supported.
ig 1.47 No This refers to the IntelliGenetics file format often used for ordinary un-aligned sequences. The tool MASE also appears to use the same file format for alignments, hence its inclusion in this table. See MASE format.
maf 1.69 1.69 Multiple Alignment Format (MAF) produced by Multiz. Used to store whole-genome alignments, such as the 30-way alignments available from the UCSC genome browser.
mauve 1.70 1.70 Mauve’s eXtended Multi-FastA (XMFA) file format
msf 1.75 No GCG MSF file format.
nexus 1.46 1.48 Also known as PAUP format. Uses Bio.Nexus internally. Only one alignment per file is supported.
phylip 1.46 1.46 This is a strict interpretation of the interlaced PHYLIP format which truncates names at 10 characters.
phylip-sequential 1.59 1.59 This is a strict interpretation of the sequential PHYLIP format which truncates names at 10 characters.
phylip-relaxed 1.58 1.58 This is a relaxed interpretation of the PHYLIP format which allows long names.
stockholm 1.46 1.46 Also known as PFAM format, this file format supports rich annotation.
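Formats with an entry in the Writes column can also be used as output formats. The sketch below, which assumes a made-up two-sequence alignment held in memory, writes it in PHYLIP format with AlignIO.write() and reads the result back to check that nothing was lost.

```python
from io import StringIO

from Bio import AlignIO

# Hypothetical two-sequence alignment, read from FASTA text held in memory.
fasta_text = ">seq1\nACGT-ACGT\n>seq2\nACGTTACGT\n"
align = AlignIO.read(StringIO(fasta_text), "fasta")

# write() takes the alignments, a file name or handle, and the output format name.
# It returns the number of alignments written.
output = StringIO()
count = AlignIO.write([align], output, "phylip")
print(count)
print(output.getvalue())

# Reading the PHYLIP text back should give the same names and sequences.
again = AlignIO.read(StringIO(output.getvalue()), "phylip")
for record in again:
    print("%s - %s" % (record.id, record.seq))
```

When both input and output are files on disk, AlignIO.convert(in_file, in_format, out_file, out_format) performs the same read-then-write step in a single call.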
In this BioPython tutorial video, we learned how to read a multiple sequence alignment file in a given format using the AlignIO module provided by the BioPython package. We also saw two different ways to print the results.