Stockholm format is a flat-file multiple sequence alignment (MSA) format used by databases of annotated alignments, such as Pfam and Rfam, to disseminate protein and RNA sequence alignments. It is also supported by software such as HMMER and Belvu.
Stockholm format files are often saved with the .sto or .stk extension.
A well-formed Stockholm file always begins with a header that states the format and version identifier. The header is followed by multiple lines, a mix of mark-up (starting with #) and sequences. Finally, a "//" line indicates the end of the alignment.
Header
The first line in the file must contain a format and version identifier, currently: # STOCKHOLM 1.0
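A minimal sketch of a header check follows. It assumes the identifier line is "# STOCKHOLM 1.0"; the helper name has_stockholm_header is hypothetical and not part of any library.

```python
import io

# Assumed identifier line for version 1.0 of the format.
STOCKHOLM_HEADER = "# STOCKHOLM 1.0"


def has_stockholm_header(handle):
    """Return True if the first line read from handle is the Stockholm header."""
    first_line = handle.readline()
    # The newline and any trailing whitespace are ignored in the comparison.
    return first_line.rstrip() == STOCKHOLM_HEADER


example = io.StringIO("# STOCKHOLM 1.0\nseq1 ACGU\n//\n")
print(has_stockholm_header(example))
```

Reading only the first line keeps the check cheap, so it can be used to reject non-Stockholm input before any further parsing.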
The sequence alignment
<seqname> <aligned sequence>
<seqname> <aligned sequence>
<seqname> <aligned sequence>
.
.
.
//
Sequences are written one per line. The sequence name comes first, followed by any amount of whitespace and then the aligned sequence. Sequence names are typically in the form "name/start-end" or just "name". Sequence letters may include any characters except whitespace. Gaps may be indicated by "." or "-". The "//" line indicates the end of the alignment.
Wrap-around alignments are allowed in principle, mainly for historical reasons, but are not used in e.g. Pfam. Wrapped alignments are discouraged since they are much harder to parse.
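The rules above can be sketched as a small reader. This is an illustration rather than a reference parser: the function name parse_sequences and the example sequences are made up, and it assumes that a wrapped alignment repeats each sequence name once per block, so later blocks are appended to earlier ones.

```python
def parse_sequences(lines):
    """Collect aligned sequences by name, joining the blocks of a wrapped alignment."""
    sequences = {}  # dicts keep insertion order, so the sequence order is preserved
    for line in lines:
        line = line.strip()
        if line == "//":
            break  # end of the alignment
        if not line or line.startswith("#"):
            continue  # blank lines, the header and mark-up are skipped here
        # The name is separated from the sequence by any amount of whitespace.
        name, aligned = line.split(None, 1)
        # A repeated name indicates a wrapped alignment: append the new block.
        sequences[name] = sequences.get(name, "") + aligned
    return sequences


text = """# STOCKHOLM 1.0
seq1/1-8  ACG-
seq2/1-7  AC.U

seq1/1-8  UUGCA
seq2/1-7  UUG-A
//
"""
alignment = parse_sequences(text.splitlines())
for name, aligned in alignment.items():
    print(name, aligned)
```

The extra bookkeeping needed to stitch the blocks together is one reason wrapped alignments are harder to parse than files with one line per sequence.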
The alignment mark-up
Mark-up lines start with #. Mark-up data may include any characters except whitespace; an underscore ("_") is used in place of a space.
Metadata
Stockholm files support storing arbitrary metadata (features) about the MSA. All metadata explained here are optional and may appear in any order. Metadata “mark-up” lines begin with either #=GF, #=GS, #=GR, or #=GC, and each line describes a single feature of the alignment.
GF metadata
Data relating to the multiple sequence alignment as a whole, such as authors or number of sequences in the alignment. Each line starts with #=GF, followed by a feature name and data relating to the feature. GF metadata typically comes first in a Stockholm file.
GS metadata
Data relating to a specific sequence in the multiple sequence alignment. Starts with #=GS followed by the sequence name followed by a feature name and data relating to the feature. Typically comes after GF metadata in a Stockholm file.
GR metadata
Data relating to the columns of a specific sequence in a multiple sequence alignment. Each line starts with #=GR, followed by the sequence name, a feature name, and data relating to the feature, one character per column. It typically comes after the sequence line it relates to.
GC metadata
Data relating to the columns of the multiple sequence alignment as a whole. Starts with #=GC followed by a feature name and data relating to the feature, one character per column. Typically it comes at the end of the multiple sequence alignment.
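The four kinds of mark-up line can be told apart by their first field. The sketch below sorts them into dictionaries; the function name parse_markup is hypothetical, and the feature names (AU, SQ, AC, SS, SS_cons) and their values are illustrative only. Keeping GF and GS values in lists is a design choice of this sketch, made so that a feature spread over several lines is not overwritten.

```python
def parse_markup(lines):
    """Sort #=GF, #=GS, #=GR and #=GC lines into one dictionary per kind."""
    markup = {"GF": {}, "GS": {}, "GR": {}, "GC": {}}
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        tag = fields[0]
        if tag == "#=GF":
            # #=GF <feature> <data>
            _, feature, data = line.split(None, 2)
            markup["GF"].setdefault(feature, []).append(data.strip())
        elif tag == "#=GS":
            # #=GS <seqname> <feature> <data>
            _, seqname, feature, data = line.split(None, 3)
            per_seq = markup["GS"].setdefault(seqname, {})
            per_seq.setdefault(feature, []).append(data.strip())
        elif tag == "#=GR":
            # #=GR <seqname> <feature> <one character per column>
            _, seqname, feature, data = line.split(None, 3)
            markup["GR"].setdefault(seqname, {})[feature] = data.strip()
        elif tag == "#=GC":
            # #=GC <feature> <one character per column>
            _, feature, data = line.split(None, 2)
            markup["GC"][feature] = data.strip()
    return markup


text = """# STOCKHOLM 1.0
#=GF AU Example Author
#=GF SQ 2
#=GS seq1/1-8 AC ABC123
seq1/1-8 ACG-UUGCA
#=GR seq1/1-8 SS ((.....))
seq2/1-7 AC.UUUG-A
#=GC SS_cons ((.....))
//
"""
markup = parse_markup(text.splitlines())
print(markup["GF"])
print(markup["GC"])
```

Sequence lines, the header and the "//" terminator are ignored here, so the function can be run over the same lines as a separate sequence reader.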
Format Parameters
The only supported format parameter is constructor, which specifies the type of in-memory sequence object to read each aligned sequence into. This must be a subclass of GrammaredSequence (e.g., DNA, RNA, Protein) and is a required format parameter. For example, if we know that the Stockholm file we are reading contains DNA sequences, we would pass constructor=DNA to the reader call.
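As an illustration, the sketch below assumes that the reader in question is the scikit-bio Stockholm reader, reached through TabularMSA.read. The example alignment is made up, and the import is guarded because scikit-bio is a third-party package that may not be installed.

```python
import io

try:
    # scikit-bio is a third-party package and may not be installed.
    from skbio import DNA, TabularMSA
except ImportError:
    DNA = TabularMSA = None

stockholm_text = """# STOCKHOLM 1.0
seq1 ACG-TTGCA
seq2 AC.TTTG-A
//
"""

msa = None
if TabularMSA is not None:
    # constructor=DNA asks the reader to build each aligned row as a DNA object.
    msa = TabularMSA.read(
        io.StringIO(stockholm_text), format="stockholm", constructor=DNA
    )
    print(len(msa))  # number of sequences in the alignment
```

For an RNA or protein alignment, constructor=RNA or constructor=Protein would be passed instead.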
This format has the following restrictions:
Do not use multiple lines with the same #=GC label.
For a single sequence, do not use multiple lines with the same #=GR label. Each feature can be assigned only once per sequence.
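Both restrictions can be checked in a single pass over the file. The following is a minimal sketch with a hypothetical function name, find_duplicate_markup, and made-up example lines. It assumes a non-wrapped file, since a wrapped alignment would repeat the same labels in every block.

```python
def find_duplicate_markup(lines):
    """Return (kind, seqname, feature) keys for repeated #=GC and #=GR labels."""
    seen = set()
    duplicates = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "#=GC" and len(fields) >= 2:
            # A #=GC label must be unique within the whole alignment.
            key = ("GC", None, fields[1])
        elif fields[0] == "#=GR" and len(fields) >= 3:
            # A #=GR label must be unique for each sequence.
            key = ("GR", fields[1], fields[2])
        else:
            continue
        if key in seen:
            duplicates.append(key)
        else:
            seen.add(key)
    return duplicates


lines = [
    "# STOCKHOLM 1.0",
    "seq1 ACG-UUGCA",
    "#=GR seq1 SS ((.....))",
    "#=GR seq1 SS .........",
    "seq2 AC.UUUG-A",
    "#=GR seq2 SS ((.....))",
    "#=GC SS_cons ((.....))",
    "#=GC SS_cons .........",
    "//",
]
duplicates = find_duplicate_markup(lines)
print(duplicates)
```

The same #=GR label on two different sequences (seq1 and seq2 above) is not reported, because the restriction applies per sequence.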