
PHYLIP format

BioCodeKb - Bioinformatics Knowledgebase

PHYLogeny Inference Package (PHYLIP) is a free computational phylogenetics package of programs for inferring evolutionary trees. It consists of 35 portable programs whose source code is written in C. Releases are distributed as source code and as precompiled executables for many operating systems, including Windows, Linux, and FreeBSD (the latter from FreeBSD.org).


PHYLIP format is a plain text format containing exactly two sections: a header describing the dimensions of the alignment, followed by the multiple sequence alignment itself.


The PHYLIP file format stores a multiple sequence alignment. The format was originally defined and used in Joe Felsenstein’s PHYLIP package and has since been supported by many other bioinformatics tools.


The file begins with the number of sequences and the number of nucleotides or amino acids in the alignment. Files in this format are saved with the .phy extension.


Header Section

The header is a single line describing the dimensions of the alignment, and it must be the first line in the file. It consists of optional leading spaces followed by two positive integers (n and m) separated by one or more spaces. The first integer (n) specifies the number of sequences (the number of rows) in the alignment. The second integer (m) specifies the length of the sequences (the number of columns) in the alignment. The smallest supported alignment dimensions are 1x1.
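
The header rule above can be sketched in Python as follows. This is a minimal illustration, not part of PHYLIP itself: the function name parse_phylip_header is hypothetical, and the sketch assumes that trailing spaces are tolerated and that nothing else appears on the header line.

```python
import re


def parse_phylip_header(line):
    """Return (n, m) from a header line: n sequences, m columns.

    Hypothetical helper; assumes the line holds only the two integers.
    """
    # Drop the line terminator, then require: optional spaces, an integer,
    # one or more spaces, a second integer, optional trailing spaces.
    match = re.fullmatch(r" *(\d+) +(\d+) *", line.rstrip("\r\n"))
    if match is None:
        raise ValueError(f"not a valid PHYLIP header: {line!r}")
    n, m = int(match.group(1)), int(match.group(2))
    # Both dimensions must be positive; the smallest alignment is 1x1.
    if n < 1 or m < 1:
        raise ValueError("alignment dimensions must be at least 1x1")
    return n, m
```

Calling parse_phylip_header on the first line of a file gives the row and column counts that the rest of the file can then be checked against.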


Alignment Section

The alignment section immediately follows the header. It consists of n lines (rows), one for each sequence in the alignment. Each row consists of a sequence identifier (ID) followed by the characters of the sequence, in fixed-width format.


The sequence ID can be up to 10 characters long. Other bioinformatics tools may relax this rule to allow for longer sequence identifiers.


IDs shorter than 10 characters must be padded with trailing spaces to reach the fixed width of 10 characters. Within an ID, all characters except newlines are supported, including spaces, underscores, and numbers.
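
As an illustration of the fixed-width rule, the Python sketch below builds one alignment row by padding the ID to 10 characters. The names ID_WIDTH and format_phylip_row are hypothetical, and rejecting over-long IDs (rather than truncating them) is an assumption of this sketch.

```python
ID_WIDTH = 10  # strict PHYLIP reserves the first 10 characters for the ID


def format_phylip_row(seq_id, sequence):
    """Return one alignment row: the ID padded to 10 characters, then the sequence.

    Hypothetical helper for illustration only.
    """
    # Newlines are the only characters not allowed inside an ID.
    if "\n" in seq_id or "\r" in seq_id:
        raise ValueError("sequence ID must not contain a newline")
    # Assumption: refuse IDs that exceed the fixed width instead of truncating.
    if len(seq_id) > ID_WIDTH:
        raise ValueError(f"sequence ID longer than {ID_WIDTH} characters: {seq_id!r}")
    # Pad with trailing spaces so the sequence starts at the 11th character.
    return seq_id.ljust(ID_WIDTH) + sequence
```

For example, format_phylip_row("seq1", "ACGT") yields the ID, six padding spaces, and then the sequence characters.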


Sequence characters immediately follow the sequence ID. They must start at the 11th character in the line, as the first 10 characters are reserved for the sequence ID. While PHYLIP format does not explicitly restrict the set of characters that may be used to represent a sequence, the original format description specifies the IUPAC nucleic acid lexicon for DNA or RNA sequences, and the IUPAC protein lexicon for protein sequences.


The original PHYLIP specification uses “-” as the gap character, though older versions also supported the dot “.”. Missing data or missing information (no sequence) is indicated by “?”. This distinction matters most at the end of the data file, where gaps (-) may cause the programs to crash.


The sequence characters may contain optional spaces (e.g., to improve readability), and both upper and lower case characters are supported. Blanks are ignored, and so are numerical digits. This allows GenBank and EMBL sequence entries to be read with minimal editing.
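
Putting the header and row rules together, a reader for the strict layout described here might look like the Python sketch below. It is an illustration under stated assumptions, not the parser used by PHYLIP or any other tool: the function name parse_phylip is hypothetical, blank lines after the header are skipped, padding spaces are stripped from the end of each ID, and only the one-row-per-sequence layout is handled.

```python
def parse_phylip(text):
    """Parse strict PHYLIP text into a list of (sequence_id, sequence) pairs.

    Hypothetical helper; assumes one row per sequence and a 10-character ID field.
    """
    lines = text.splitlines()
    if not lines:
        raise ValueError("empty input")

    # Header: two positive integers, n (rows) and m (columns).
    fields = lines[0].split()
    try:
        n, m = (int(field) for field in fields)
    except ValueError:
        raise ValueError(f"not a valid PHYLIP header: {lines[0]!r}") from None
    if n < 1 or m < 1:
        raise ValueError("alignment dimensions must be at least 1x1")

    # Assumption: blank lines after the header carry no data and are skipped.
    rows = [line for line in lines[1:] if line.strip()]
    if len(rows) != n:
        raise ValueError(f"expected {n} sequences, found {len(rows)}")

    alignment = []
    for row in rows:
        # The first 10 characters hold the ID; trailing spaces are padding.
        seq_id = row[:10].rstrip(" ")
        # Blanks and numerical digits inside the sequence are ignored.
        sequence = "".join(
            ch for ch in row[10:] if not ch.isspace() and not ch.isdigit()
        )
        if len(sequence) != m:
            raise ValueError(
                f"sequence {seq_id!r} has {len(sequence)} characters, expected {m}"
            )
        alignment.append((seq_id, sequence))
    return alignment
```

The sketch leaves gap (-) and missing-data (?) characters and letter case untouched, so any further validation against the IUPAC lexicons would be a separate step.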


