Stretcher

BioCodeKb - Bioinformatics Knowledgebase

EMBOSS Stretcher calculates an optimal global alignment of two sequences using a modification of the classic dynamic programming algorithm which uses linear space.

EMBOSS Stretcher uses a modification of the Needleman-Wunsch algorithm that allows larger sequences to be globally aligned.


The output is a standard alignment file. The substitution matrix, gap open penalty and gap extension penalty used to calculate the alignment may be specified.


Algorithm

The standard global alignment program needle uses the Needleman-Wunsch algorithm, which requires O(MN) space. In computer-science terms, this means the memory needed is proportional to the product of the lengths of the two sequences being aligned. So if aligning a 1 kb and a 10 kb sequence takes 10 mega-words of memory, aligning a 10 kb and a 1 Mb sequence will take approximately 10 giga-words. With needle, computer memory is therefore rapidly exhausted as the size of the aligned sequences increases.
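The arithmetic above can be checked with a short sketch. It assumes one machine word per matrix cell and decimal units (1 kb = 1,000 residues); the function name is illustrative and not part of EMBOSS.

```python
def full_matrix_cells(len_a: int, len_b: int) -> int:
    """Cells in a full dynamic programming matrix: one per residue pair."""
    return len_a * len_b


KB = 1_000        # assumed: 1 kb = 1,000 residues
MB = 1_000_000    # assumed: 1 Mb = 1,000,000 residues

# 1 kb x 10 kb: ten million cells (about 10 mega-words at one word per cell).
small = full_matrix_cells(1 * KB, 10 * KB)

# 10 kb x 1 Mb: ten billion cells (about 10 giga-words).
large = full_matrix_cells(10 * KB, 1 * MB)

print(small, large, large // small)
```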


Stretcher implements the Myers and Miller algorithm, which finds an optimal global alignment using an amount of computer memory proportional only to the length of the shorter sequence, that is, O(N).
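To illustrate the linear-space idea, here is a minimal sketch of the divide-and-conquer approach (Hirschberg style) that Myers and Miller build on. It is not the EMBOSS implementation: it assumes a simple match/mismatch score and a single linear gap penalty rather than a substitution matrix with separate open and extend penalties, and all names and score values are illustrative. Working memory is proportional to the length of the second argument, so the shorter sequence would be passed as b.

```python
MATCH, MISMATCH, GAP = 1, -1, -2  # assumed toy scores, not stretcher defaults


def last_row_scores(a, b):
    """Global alignment scores of a against each prefix of b, keeping two rows."""
    prev = [j * GAP for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * GAP] + [0] * len(b)
        for j in range(1, len(b) + 1):
            sub = MATCH if a[i - 1] == b[j - 1] else MISMATCH
            curr[j] = max(prev[j - 1] + sub, prev[j] + GAP, curr[j - 1] + GAP)
        prev = curr
    return prev


def align(a, b):
    """Return (aligned_a, aligned_b) for an optimal global alignment."""
    if not a:
        return "-" * len(b), b
    if not b:
        return a, "-" * len(a)
    if len(a) == 1:
        # Base case: pair the single residue with its best partner, or gap it.
        scores = [MATCH if a == c else MISMATCH for c in b]
        k = max(range(len(b)), key=scores.__getitem__)
        if scores[k] >= 2 * GAP:
            return "-" * k + a + "-" * (len(b) - k - 1), b
        return a + "-" * len(b), "-" + b
    # Split a in half and find where the optimal path crosses the middle row.
    mid = len(a) // 2
    left = last_row_scores(a[:mid], b)
    right = last_row_scores(a[mid:][::-1], b[::-1])
    n = len(b)
    split = max(range(n + 1), key=lambda j: left[j] + right[n - j])
    left_a, left_b = align(a[:mid], b[:split])
    right_a, right_b = align(a[mid:], b[split:])
    return left_a + right_a, left_b + right_b


def score(x, y):
    """Score a finished alignment column by column."""
    return sum(
        GAP if "-" in (p, q) else MATCH if p == q else MISMATCH
        for p, q in zip(x, y)
    )


aligned_a, aligned_b = align("GATTACA", "GCATGCT")
print(aligned_a)
print(aligned_b)
```

The alignment is recovered without ever storing the full matrix: each recursion level only needs two score rows to decide where to split.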


Working

Running a tool from the web form is a simple multi-step process.


Step 1: Input Sequences

First Input Sequence

A free text (raw) list of sequences is simply a block of characters representing several DNA/RNA or Protein sequences. A sequence can be in GCG, FASTA, EMBL (Nucleotide only), GenBank, PIR, NBRF, PHYLIP or UniProtKB/Swiss-Prot (Protein only) format. Partially formatted sequences are not accepted. Adding a return to the end of the sequence may help certain applications understand the input.


First Sequence File Upload

A file containing valid sequences in any format (GCG, FASTA, EMBL (Nucleotide only), GenBank, PIR, NBRF, PHYLIP or UniProtKB/Swiss-Prot (Protein only)) can be used as input for the alignment. Word processor files may produce unpredictable results because hidden/control characters may be present in the files. It is best to save files with the Unix format option to avoid hidden Windows characters.


Second Input Sequence

A free text (raw) list of sequences is simply a block of characters representing several DNA/RNA or Protein sequences. A sequence can be in GCG, FASTA, EMBL (Nucleotide only), GenBank, PIR, NBRF, PHYLIP or UniProtKB/Swiss-Prot (Protein only) format. Partially formatted sequences are not accepted. Adding a return to the end of the sequence may help certain applications understand the input.


Second Sequence File Upload

A file containing valid sequences in any format (GCG, FASTA, EMBL (Nucleotide only), GenBank, PIR, NBRF, PHYLIP or UniProtKB/Swiss-Prot (Protein only)) can be used as input for the alignment. Word processor files may yield unpredictable results because hidden/control characters may be present in the files. It is best to save files with the Unix format option to avoid hidden Windows characters.
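As a hedged illustration of the advice about hidden characters, the sketch below normalises sequence text before it is pasted or uploaded. The function name is hypothetical and not part of any EMBOSS tool: it converts Windows line endings to Unix ones, drops non-printing characters such as a byte-order mark, and makes sure the text ends with a return.

```python
def clean_sequence_text(text: str) -> str:
    """Return sequence text with Unix line endings and no hidden characters."""
    # Convert Windows (CRLF) and old Mac (CR) line endings to Unix (LF).
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Keep newlines, tabs and printable characters; drop everything else.
    cleaned = "".join(ch for ch in text if ch in "\n\t" or ch.isprintable())
    # A trailing return may help some applications read the last line.
    if not cleaned.endswith("\n"):
        cleaned += "\n"
    return cleaned


raw = "\ufeff>seq1\r\nACGT\r\n\x00"
print(repr(clean_sequence_text(raw)))
```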


Step 2: Set alignment options

Matrix

Default substitution scoring matrices.

  • For Protein

  • For Nucleotide


Gap Open Penalty

Pairwise alignment score for the first residue in a gap.

Values: 1-25

Default value (for Protein) is: 12

Default value (for Nucleotide) is: 16


Gap Extend Penalty

Pairwise alignment score for each additional residue in a gap.

Values: 1-8

Default value (for Protein) is: 2

Default value (for nucleotide) is: 4
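Read literally, the two descriptions above mean the first residue of a gap costs the gap open penalty and every further residue costs the gap extend penalty. The sketch below follows that wording only; it is an assumption that stretcher accounts for gaps in exactly this way, and the function name is illustrative.

```python
def gap_cost(length: int, gap_open: int, gap_extend: int) -> int:
    """Total penalty for one gap, following the definitions on this page."""
    if length <= 0:
        return 0
    # First residue pays the open penalty; each additional one pays extend.
    return gap_open + (length - 1) * gap_extend


# Defaults listed above: protein 12 / 2, nucleotide 16 / 4.
print(gap_cost(5, gap_open=12, gap_extend=2))   # protein, gap of 5 residues
print(gap_cost(5, gap_open=16, gap_extend=4))   # nucleotide, gap of 5 residues
```

Under this reading, one long gap is cheaper than several short gaps of the same total length, because the open penalty is paid only once.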



Output file format

The output is a standard EMBOSS alignment file.

The results can be output in one of many styles by using the command-line qualifier -aformat xxx, where 'xxx' is replaced by the name of the required format. Some of the alignment formats can cope with an unlimited number of sequences, while others are only for pairs of sequences.

The available multiple alignment format names are: multiple, simple, fasta, msf, clustal, mega, meganon, nexus, nexusnon, phylip, phylipnon, selex, treecon, tcoffee, debug, srs.
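For command-line use, the options described above map onto stretcher qualifiers. The sketch below only builds the argument list and does not run anything; the file names are hypothetical, and the helper function is illustrative rather than part of EMBOSS. It assumes the standard qualifiers -asequence, -bsequence, -gapopen, -gapextend, -aformat and -outfile.

```python
def stretcher_command(aseq, bseq, outfile, gapopen=12, gapextend=2, aformat="fasta"):
    """Build an argument list for the EMBOSS stretcher command-line program."""
    return [
        "stretcher",
        "-asequence", aseq,          # first input sequence
        "-bsequence", bseq,          # second input sequence
        "-gapopen", str(gapopen),    # protein default listed above
        "-gapextend", str(gapextend),
        "-aformat", aformat,         # one of the format names listed above
        "-outfile", outfile,
    ]


# Hypothetical file names, for illustration only.
cmd = stretcher_command("first.fasta", "second.fasta", "result.aln")
print(" ".join(cmd))
```

On a machine with EMBOSS installed, the list could be passed to subprocess.run(cmd, check=True).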


© Copyright 2020 BioCode Ltd. - All rights reserved.