A consensus sequence is a sequence of DNA, RNA, or protein that represents aligned, related sequences. The consensus sequence of the related sequences can be defined in different ways, but is normally defined by the most common nucleotide(s) or amino acid residue(s) at each position.
In molecular biology and bioinformatics, the consensus sequence (or canonical sequence) is the calculated order of the most frequent residues, either nucleotide or amino acid, found at each position in a sequence alignment. It summarizes the result of a multiple sequence alignment, in which related sequences are compared with each other and similar sequence motifs are identified. Such information is important when considering sequence-dependent enzymes such as RNA polymerase.
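The definition above, the most frequent residue at each aligned position, can be sketched in a few lines of Python. This is a minimal illustration and not the method of any particular tool: the function name `consensus`, the toy alignment, and the choice to ignore gap characters are all assumptions made for the example, and ties are resolved arbitrarily in favour of the residue seen first in the column.

```python
from collections import Counter


def consensus(aligned, gap="-"):
    """Return the most frequent residue at each column of an alignment.

    `aligned` is a list of equal-length strings (rows of the alignment).
    Gap characters are ignored when counting (an assumption of this sketch).
    """
    if not aligned:
        return ""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")

    result = []
    for column in zip(*aligned):
        counts = Counter(residue for residue in column if residue != gap)
        # A column consisting only of gaps is reported as a gap.
        result.append(counts.most_common(1)[0][0] if counts else gap)
    return "".join(result)


# Hypothetical toy alignment, for illustration only.
sequences = ["TATAAT", "TATGAT", "TACAAT", "TATAAT", "GATAAT"]
print(consensus(sequences))  # prints TATAAT
```

Real tools usually add refinements, such as frequency thresholds or ambiguity codes for positions with no clear majority.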
A protein binding site, represented by a consensus sequence, may be a short sequence of nucleotides that is found many times in the genome and is thought to play the same role at each of its locations. For example, many transcription factors recognize particular patterns in the promoters of the genes they regulate. In the same way, restriction enzymes usually have palindromic consensus sequences, which generally correspond to the site where they cut the DNA. Transposons act in much the same manner in their identification of target sequences for transposition. Finally, splice sites (sequences immediately surrounding the exon-intron boundaries) can also be considered as consensus sequences.
A consensus sequence is thus a model for a putative DNA binding site: it is obtained by aligning all known examples of a given recognition site and is defined as the idealized sequence that represents the predominant base at each position.
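The palindromic property of restriction sites mentioned above means that the site reads the same on both strands, that is, it equals its own reverse complement. A minimal sketch of that check follows; the helper names `reverse_complement` and `is_palindromic` are illustrative, and the code assumes plain A, C, G, T input without ambiguity codes. The EcoRI recognition site GAATTC is used as a known palindromic example.

```python
# Translation table mapping each base to its Watson-Crick complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def is_palindromic(site):
    """True if the site equals its own reverse complement."""
    site = site.upper()
    return site == reverse_complement(site)


print(is_palindromic("GAATTC"))   # EcoRI recognition site: prints True
print(is_palindromic("GATTACA"))  # arbitrary non-palindromic string: prints False
```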
Conserved sequence motifs are called consensus sequences; they show which residues are conserved and which are variable. Bioinformatics tools such as Jalview and UGENE can calculate and visualize consensus sequences.
Consensus sequences are widely used in molecular biology, but they have significant limitations: a consensus records only the predominant residue at each position and discards information about how often other residues occur there. As a result, binding sites of proteins and other molecules can be missed during studies of genetic sequences, and important biological effects may go unnoticed. Consensus sequences are also used in protein engineering. Consensus sequence design offers a promising strategy for designing proteins of high stability while retaining biological activity, since it draws upon an evolutionary history in which residues important for both stability and function are likely to be conserved. Although there have been several reports of successful consensus design of individual targets, it is unclear from these anecdotal studies how often the approach succeeds and how often it fails.
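To illustrate how sites can be missed, the sketch below scans a sequence for a consensus while allowing a configurable number of mismatches. Everything here is an assumption made for illustration: the function name `find_sites`, the `max_mismatches` parameter, and the DNA string are hypothetical, and simple mismatch counting stands in for the more refined scoring schemes used in practice.

```python
def find_sites(sequence, consensus, max_mismatches=0):
    """Return (position, site) pairs that differ from the consensus
    at no more than `max_mismatches` positions."""
    hits = []
    width = len(consensus)
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        mismatches = sum(a != b for a, b in zip(window, consensus))
        if mismatches <= max_mismatches:
            hits.append((start, window))
    return hits


# Hypothetical sequence containing one exact and one variant site.
dna = "GGCTATAATGCCGTATGATCCA"

print(find_sites(dna, "TATAAT"))
# prints [(3, 'TATAAT')]
print(find_sites(dna, "TATAAT", max_mismatches=1))
# prints [(3, 'TATAAT'), (13, 'TATGAT')]
```

A search for the exact consensus reports only the first site; the variant at position 13, which differs by a single base, is found only when a mismatch is tolerated.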
A multiple sequence alignment or a motif is often depicted graphically as a sequence logo. In a logo, each position consists of stacked letters representing the residues that appear in the corresponding column of the multiple alignment. The overall height of a logo position reflects how conserved the position is, and the height of each letter within a position reflects the relative frequency of that residue in the alignment. Conserved positions have fewer residues and bigger symbols, whereas less conserved positions have a more heterogeneous mixture of smaller symbols stacked together. In general, a sequence logo describes a motif more fully than a consensus sequence does, because it also shows the variation at each position. WebLogo is an interactive program for generating sequence logos. The user submits the sequence alignment in FASTA format, and the program returns the logo as a graphics file.
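The relationship between conservation and letter height can be made concrete. In the usual formulation, the height of a position is its information content in bits, the maximum possible entropy minus the observed Shannon entropy of the column, and each letter receives a share of that height proportional to its frequency. The sketch below follows that formulation under stated assumptions: the function names are illustrative, the alignment is gapless DNA (alphabet size 4), and the small-sample correction that logo software may apply is omitted.

```python
import math
from collections import Counter


def column_information(column, alphabet_size=4):
    """Information content of one alignment column, in bits:
    log2(alphabet_size) minus the Shannon entropy of the column."""
    counts = Counter(column)
    total = sum(counts.values())
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return math.log2(alphabet_size) - entropy


def logo_heights(aligned, alphabet_size=4):
    """For each column, map each residue to its letter height in bits
    (relative frequency times the column's information content)."""
    heights = []
    for column in zip(*aligned):
        info = column_information(column, alphabet_size)
        total = len(column)
        heights.append(
            {res: (n / total) * info for res, n in Counter(column).items()}
        )
    return heights


# Hypothetical toy alignment, for illustration only.
sequences = ["TATAAT", "TATGAT", "TACAAT", "TATAAT", "GATAAT"]
for position, letters in enumerate(logo_heights(sequences), start=1):
    print(position, {res: round(h, 2) for res, h in letters.items()})
```

A fully conserved DNA column reaches the maximum of 2 bits, while a column in which all four bases are equally frequent has a height of 0.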