EMBOSS Stretcher calculates an optimal global alignment of two sequences using a linear-space modification of the classic dynamic programming (Needleman-Wunsch) algorithm, which allows larger sequences to be globally aligned.
The output is a standard alignment file. The substitution matrix, gap opening penalty and gap extension penalty used to calculate the alignment may be specified.
Algorithm
The standard global alignment method, the Needleman & Wunsch algorithm as used in the program needle, requires O(MN) space. This is standard computer-science language for needing an amount of computer memory that is proportional to the product of the lengths of the two sequences being aligned. So if a 1 kb and a 10 kb sequence take 10 Mega-words of memory to align, we should expect that aligning a 10 kb sequence and a 1 Mb sequence will need approximately 10 Giga-words of memory. When using needle, computer memory will rapidly be exhausted as the size of the aligned sequences increases.
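The arithmetic above can be checked with a short sketch. It assumes one word of memory per matrix cell and takes 1 kb as 1,000 residues; the function name is illustrative only and is not part of EMBOSS.

```python
def full_matrix_words(len_a: int, len_b: int) -> int:
    """Words of memory for a full M x N matrix, assuming one word per cell."""
    return len_a * len_b


KB = 1_000        # residues in 1 kb (assumed decimal)
MB = 1_000_000    # residues in 1 Mb (assumed decimal)

small = full_matrix_words(1 * KB, 10 * KB)   # 1 kb against 10 kb
large = full_matrix_words(10 * KB, 1 * MB)   # 10 kb against 1 Mb

print(small)            # 10000000     (10 Mega-words)
print(large)            # 10000000000  (10 Giga-words)
print(large // small)   # 1000
```

Making one sequence 10 times longer and the other 100 times longer multiplies the memory requirement by 1000.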
This program implements the Myers and Miller algorithm for finding an optimal global alignment in an amount of computer memory that is only proportional to the size of the smaller sequence - O(N).
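The linear-space idea can be illustrated with a minimal sketch that computes only the optimal global alignment score while keeping a single row of the matrix in memory. This is not the stretcher source code: it assumes a simple match/mismatch score in place of a substitution matrix, and it assumes a gap of length k costs gap_open + (k - 1) * gap_extend. The Myers and Miller algorithm goes further by combining passes of this kind with a divide-and-conquer step to recover the alignment itself, which is not shown here.

```python
NEG_INF = float("-inf")


def global_score(a, b, match=5, mismatch=-4, gap_open=16, gap_extend=4):
    """Optimal global alignment score with affine gaps, using O(min(M, N)) memory.

    Scoring values are assumptions for illustration, not a real substitution matrix.
    """
    if len(b) > len(a):
        a, b = b, a                      # rows are as long as the shorter sequence
    n = len(b)
    # h[j]: best score for a[:i] against b[:j]
    # x[j]: best score for a[:i] against b[:j] ending with a residue of a opposite a gap
    h = [0] + [-(gap_open + (j - 1) * gap_extend) for j in range(1, n + 1)]
    x = [NEG_INF] * (n + 1)
    for i in range(1, len(a) + 1):
        diag = h[0]                      # value from the previous row, column j - 1
        h[0] = -(gap_open + (i - 1) * gap_extend)
        y = NEG_INF                      # best score ending with a residue of b opposite a gap
        for j in range(1, n + 1):
            x[j] = max(h[j] - gap_open, x[j] - gap_extend)   # h[j] still holds the previous row
            y = max(h[j - 1] - gap_open, y - gap_extend)     # h[j - 1] already holds this row
            sub = match if a[i - 1] == b[j - 1] else mismatch
            best = max(diag + sub, x[j], y)
            diag = h[j]
            h[j] = best
    return h[n]


print(global_score("ACGT", "ACGT"))   # 20: four matches
print(global_score("ACGT", "AGT"))    # -1: three matches and one gap of length 1
```

Only two arrays of length N + 1 are held at any time, so memory grows with the shorter sequence rather than with the product of both lengths.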
Working
Running a tool from the web form is a simple multi-step process.
Step 1: Input Sequences
First Input Sequence
A free text (raw) list of sequences is simply a block of characters representing several DNA/RNA or Protein sequences. A sequence can be in GCG, FASTA, EMBL (Nucleotide only), GenBank, PIR, NBRF, PHYLIP or UniProtKB/Swiss-Prot (Protein only) format. Partially formatted sequences are not accepted. Adding a return to the end of the sequence may help certain applications understand the input.
First Sequence File Upload
A file containing valid sequences in any format (GCG, FASTA, EMBL (Nucleotide only), GenBank, PIR, NBRF, PHYLIP or UniProtKB/Swiss-Prot (Protein only)) can be used as input for the alignment. (See example input formats). Word processor files may produce unpredictable results as hidden/control characters may be present in the files. It is best to save files with the Unix format option to avoid hidden Windows characters.
Second Input Sequence
A free text (raw) list of sequences is simply a block of characters representing several DNA/RNA or Protein sequences. A sequence can be in GCG, FASTA, EMBL (Nucleotide only), GenBank, PIR, NBRF, PHYLIP or UniProtKB/Swiss-Prot (Protein only) format. (See example input formats). Partially formatted sequences are not accepted. Adding a return to the end of the sequence may help certain applications understand the input.
Second Sequence File Upload
A file containing valid sequences in any format (GCG, FASTA, EMBL (Nucleotide only), GenBank, PIR, NBRF, PHYLIP or UniProtKB/Swiss-Prot (Protein only)) can be used as input for the alignment. (See example input formats). Word processor files may yield unpredictable results as hidden/control characters may be present in the files. It is best to save files with the Unix format option to avoid hidden Windows characters.
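The advice on hidden characters and trailing returns can be applied before pasting or uploading a sequence. The following is a minimal sketch, not part of the web form or of EMBOSS: the function name is hypothetical, and it assumes the input is already plain text rather than a word processor document.

```python
def clean_sequence_text(text: str) -> str:
    """Normalise line endings, drop hidden control characters, end with a return."""
    # Convert Windows (CR LF) and old Mac (CR) line endings to Unix (LF)
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Keep newlines, tabs and printable characters; drop everything else
    text = "".join(ch for ch in text if ch in "\n\t" or ch.isprintable())
    # Add a return to the end of the sequence if it is missing
    if not text.endswith("\n"):
        text += "\n"
    return text


print(repr(clean_sequence_text(">seq1\r\nACGT\r\nTTGA")))
# '>seq1\nACGT\nTTGA\n'
```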
Step 2: Set alignment options
Matrix
Default substitution scoring matrices.
For Protein
For Nucleotide
Gap Open Penalty
Pairwise alignment score for the first residue in a gap.
Values: 1-25
Default value (for Protein) is: 12
Default value (for Nucleotide) is: 16
Gap Extend Penalty
Pairwise alignment score for each additional residue in a gap.
Values: 1-8
Default value (for Protein) is: 2
Default value (for Nucleotide) is: 4
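To see how the two penalties combine, the sketch below follows the wording of the descriptions above: the first residue in a gap is scored with the gap open penalty and each additional residue with the gap extend penalty. This reading is an assumption for illustration; the exact formula used inside stretcher may differ, and the function name is hypothetical.

```python
def gap_penalty(length: int, gap_open: int, gap_extend: int) -> int:
    """Total penalty for one gap, assuming open + (length - 1) * extend."""
    if length < 1:
        raise ValueError("a gap contains at least one residue")
    return gap_open + (length - 1) * gap_extend


# A gap of five residues with the default values listed above
print(gap_penalty(5, gap_open=12, gap_extend=2))   # 20 (Protein defaults)
print(gap_penalty(5, gap_open=16, gap_extend=4))   # 32 (Nucleotide defaults)
```

Under this reading, one long gap is penalised less than several short gaps of the same total length, because the open penalty is paid only once.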
Output file format
The output is a standard EMBOSS alignment file.
The results can be output in one of many styles by using the command-line qualifier -aformat xxx, where 'xxx' is replaced by the name of the required format. Some of the alignment formats can cope with an unlimited number of sequences, while others are only for pairs of sequences.
The available multiple alignment format names are: multiple, simple, fasta, msf, clustal, mega, meganon, nexus, nexusnon, phylip, phylipnon, selex, treecon, tcoffee, debug, srs.
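As an illustration of the -aformat qualifier, the sketch below builds a stretcher command line and checks the format name against the list above. The file names are hypothetical. The qualifiers -asequence, -bsequence, -outfile, -gapopen and -gapextend are not described in this text; they are assumed from the EMBOSS command-line conventions and should be checked against the output of stretcher -help on the local installation. The command is only printed here, not run.

```python
import shlex

# Format names taken from the list above
MULTIPLE_FORMATS = (
    "multiple", "simple", "fasta", "msf", "clustal", "mega", "meganon", "nexus",
    "nexusnon", "phylip", "phylipnon", "selex", "treecon", "tcoffee", "debug", "srs",
)


def stretcher_command(seq_a, seq_b, outfile, aformat="simple",
                      gap_open=None, gap_extend=None):
    """Build an argument list for stretcher (qualifier names are assumptions)."""
    if aformat not in MULTIPLE_FORMATS:
        raise ValueError(f"unknown alignment format: {aformat}")
    cmd = ["stretcher", "-asequence", seq_a, "-bsequence", seq_b,
           "-outfile", outfile, "-aformat", aformat]
    if gap_open is not None:
        cmd += ["-gapopen", str(gap_open)]
    if gap_extend is not None:
        cmd += ["-gapextend", str(gap_extend)]
    return cmd


# Hypothetical input and output file names
cmd = stretcher_command("first.fasta", "second.fasta", "result.aln",
                        aformat="fasta", gap_open=16, gap_extend=4)
print(shlex.join(cmd))
```

Passing the list to subprocess.run(cmd, check=True) would run the program where EMBOSS is installed.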