BLAST finds regions of similarity between biological sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of the matches. BLAST can be used to infer functional and evolutionary relationships between sequences and to help identify members of gene families.
Genome Workbench also lets you run BLAST searches against local databases: nucleotide queries against a nucleotide database (MEGAblast and blastn), protein queries against a translated nucleotide database (tblastn), translated nucleotide queries against a translated nucleotide database (tblastx), and many others.
Generally, BLAST does not directly search GenBank flatfiles. Rather, sequences are transformed into BLAST databases with a special format that makes searching more efficient. The BLAST indexing process splits and indexes the sequence records, producing many files. The “header” and “sequence” files are the most important ones. The header file holds information such as the sequence title and taxonomy and is used mostly when formatting the BLAST report. The sequence file contains the sequence data and is used most heavily during the BLAST search. DNA has a small alphabet (four letters, if there are no ambiguities), so the DNA sequence file uses a little more than one byte per four bases. A BLAST database is normally partitioned into multiple volumes, with each volume holding a contiguous subset of the database. The size of the volume can be specified when the database is created, but the NCBI has found that a volume size of about one gigabyte works well. For the Sequence Read Archive (SRA), BLAST searches the underlying SRA objects directly. This is more efficient because the SRA objects group the data in a manner similar to that of BLAST.
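The "one byte per four bases" figure follows from storing each unambiguous base in two bits. The sketch below illustrates that idea only; the code table, the function names, and the padding of the last byte are assumptions made for this example and do not describe the actual BLAST database file layout.

```python
# Illustrative 2-bit packing of unambiguous DNA.
# Assumption: this is NOT the real BLAST database format, only the same idea.
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack_dna(seq):
    """Pack an unambiguous DNA string into bytes, four bases per byte."""
    packed = bytearray()
    for start in range(0, len(seq), 4):
        chunk = seq[start:start + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODES[base]
        # Left-align a short final chunk so bases are always read from the high bits.
        byte <<= 2 * (4 - len(chunk))
        packed.append(byte)
    return bytes(packed)


def unpack_dna(packed, length):
    """Recover the first `length` bases from the packed bytes."""
    bases = []
    for byte in packed:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases[:length])
```

A ten-base sequence fits in three bytes here. The sequence length has to be stored separately, which is one reason a real file needs slightly more than one byte per four bases.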
There are many different algorithms for searching sequence databases, but BLAST algorithms are some of the most popular because of their speed. The key to BLAST’s speed is its use of local alignments that serve as seeds for more extensive alignments. In fact, BLAST is an acronym for Basic Local Alignment Search Tool. A set of BLAST tools for searching nucleotide and protein sequences is available for use at the NCBI site.
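The seed-and-extend idea can be sketched in a few lines. This is a deliberately simplified illustration, not BLAST's implementation: seeds here are exact word matches, and extension simply continues while characters are identical, whereas BLAST scores the extension and allows mismatches. The function names and the toy sequences are hypothetical.

```python
def find_seeds(query, subject, word_size):
    """Return (query_pos, subject_pos) pairs where a query word matches exactly."""
    # Index every word of the query by its starting position.
    index = {}
    for i in range(len(query) - word_size + 1):
        index.setdefault(query[i:i + word_size], []).append(i)
    # Scan the subject once and look each of its words up in the index.
    seeds = []
    for j in range(len(subject) - word_size + 1):
        for i in index.get(subject[j:j + word_size], []):
            seeds.append((i, j))
    return seeds


def extend_seed(query, subject, q_pos, s_pos, word_size):
    """Extend a seed in both directions while the characters stay identical.

    Returns (query_start, subject_start, length) of the extended match.
    """
    left = 0
    while (q_pos - left - 1 >= 0 and s_pos - left - 1 >= 0
           and query[q_pos - left - 1] == subject[s_pos - left - 1]):
        left += 1
    right = word_size
    while (q_pos + right < len(query) and s_pos + right < len(subject)
           and query[q_pos + right] == subject[s_pos + right]):
        right += 1
    return q_pos - left, s_pos - left, left + right
```

The speed benefit comes from the index lookup: only positions that share a word with the query are examined further, so most of the database is never aligned in detail.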
BLAST searches begin with a query sequence that will be matched against sequence databases specified by the user. As the algorithms work through the data, they compute the probability that each potential match may have arisen by chance alone, which would not be consistent with an evolutionary relationship. BLAST algorithms start by breaking down the query sequence into a series of short overlapping “words” and assigning numerical values to the words. Words above a threshold value for statistical significance are then used to search databases. The default word size for BLASTN is 28 nucleotides. Because there are only four possible nucleotides in DNA, a sequence of this length would be expected to occur randomly once in every 4^28, or approximately 10^17, nucleotides, which is far longer than any genome. The default word size for BLASTP is three amino acids. Because proteins contain 20 different amino acids, a tripeptide sequence would be expected to arise randomly once in every 8000 tripeptides, which is longer than any protein.
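The word-splitting step and the arithmetic behind the two frequency estimates can be checked with a short sketch. The function names are hypothetical, and the scoring and threshold step described above is left out; the sketch only shows the overlapping windows and the count of possible words, assuming all letters are equally likely.

```python
def overlapping_words(query, word_size):
    """Split a query into every overlapping window of length word_size."""
    return [query[i:i + word_size] for i in range(len(query) - word_size + 1)]


def possible_words(alphabet_size, word_size):
    """Number of distinct words of a given length over a given alphabet.

    With equally likely letters, a specific word is expected about once
    in this many positions.
    """
    return alphabet_size ** word_size


# A five-residue peptide yields three overlapping tripeptide words.
print(overlapping_words("MKVLA", 3))

# 4**28 is about 7.2e16, roughly 10^17, for the BLASTN default word size.
print(possible_words(4, 28))

# 20**3 is 8000 for the BLASTP default word size.
print(possible_words(20, 3))
```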