The motif recognition problem takes as input a set of known patterns or features that in some way define a class of proteins. The goal is then to search, in a supervised or unsupervised way, for other instances of the same patterns. Known motifs in biological sequences are generally compiled in databases that are publicly available over the Internet. For example, the PRINTS database contains “protein fingerprints,” where a fingerprint is composed of a group of motifs that characterize a given set of protein sequences with the same molecular function. In contrast, the PROSITE and ELM databases contain single motifs that correspond to known functionally or structurally important amino acids, such as those involved in an active site or a ligand binding site. The motifs contained in these resources are generally manually curated, and the database entries include extensive documentation of the specific biological function associated with the sites.
Consensus sequences are the simplest model for representing protein motifs. They can be constructed easily by selecting the amino acid found most frequently at each position in the signal. The number of matches between a consensus and an unknown candidate sequence can be used to evaluate the significance of a potential functional site.
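As a minimal sketch of the idea above, the following Python builds a consensus from a set of equal-length, ungapped motif instances and counts matches against candidate windows. The function names (consensus, count_matches, best_window), the alphabetical tie-breaking rule, and the example sites are illustrative assumptions, not part of any specific tool.

```python
from collections import Counter


def consensus(sites):
    """Return the most frequent residue at each position of equal-length sites."""
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length")
    result = []
    for column in zip(*sites):
        counts = Counter(column)
        # Assumption: ties are broken alphabetically so the result is deterministic.
        best = min(counts, key=lambda aa: (-counts[aa], aa))
        result.append(best)
    return "".join(result)


def count_matches(cons, candidate):
    """Count positions where the candidate agrees with the consensus."""
    return sum(1 for a, b in zip(cons, candidate) if a == b)


def best_window(cons, sequence):
    """Slide the consensus along a sequence; return (start, matches) of the best window."""
    width = len(cons)
    best = (-1, -1)
    for start in range(len(sequence) - width + 1):
        matches = count_matches(cons, sequence[start:start + width])
        if matches > best[1]:
            best = (start, matches)
    return best


# Hypothetical motif instances, for illustration only.
sites = ["GKST", "GKTT", "GRST", "AKST"]
cons = consensus(sites)
hit = best_window(cons, "MAAGKSTLL")
```

A match count on its own says little about significance. In practice it would be compared with the counts expected from unrelated sequences of similar composition.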
In the case of probability matrices or HMM‐based methods, a log‐odds score can be calculated that measures how much more probable it is that a sequence was generated by the motif model than by a random null model representing the universe of all sequences (also known as the “background”). The logarithm is usually base 2, so the score is given in bits. A log‐odds score greater than zero indicates that the sequence fits the motif model better than it fits the background model.
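The log‐odds calculation can be sketched for a simple position-specific probability matrix as follows. This is a hedged illustration: the uniform background (1/20 per amino acid), the pseudocount of 1, and the function names are assumptions chosen for brevity. Real tools typically use observed background frequencies and more careful priors.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_log_odds_matrix(sites, background=None, pseudocount=1.0):
    """Build a per-position log2(p_motif / p_background) table from aligned sites."""
    if background is None:
        # Assumption: a uniform background over the 20 standard amino acids.
        background = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
    total = len(sites) + pseudocount * len(AMINO_ACIDS)
    matrix = []
    for column in zip(*sites):
        counts = Counter(column)
        matrix.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background[aa])
            for aa in AMINO_ACIDS
        })
    return matrix


def log_odds_score(matrix, segment):
    """Sum the per-position log-odds values (in bits) for a segment of motif width."""
    if len(segment) != len(matrix):
        raise ValueError("segment length must equal motif width")
    return sum(column[aa] for column, aa in zip(matrix, segment))


# Hypothetical motif instances, for illustration only.
matrix = build_log_odds_matrix(["GKST", "GKTT", "GRST", "AKST"])
motif_like = log_odds_score(matrix, "GKST")   # positive: fits the motif model better
unrelated = log_odds_score(matrix, "WWWW")    # negative: fits the background better
```

Because the score is a sum of base-2 logarithms, a score of s bits means the segment is 2^s times more probable under the motif model than under the background.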
Alignment‐based methods for motif discovery first construct a multiple sequence alignment of the set of sequences, where each sequence of amino acids is typically represented as a row within a matrix. Gaps are inserted between the amino acids so that identical or similar characters are aligned in successive columns. Once the multiple alignment is constructed, the patterns are extracted from it by combining the substrings common to most of the sequences.
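The extraction step can be illustrated with a small sketch that scans the columns of an existing alignment and keeps a residue wherever it is shared by most sequences. The 80% threshold, the use of "x" for unconserved positions, and the toy alignment are assumptions made for this example. The alignment itself is assumed to have been produced by a separate program.

```python
from collections import Counter


def extract_pattern(alignment, min_fraction=0.8, gap="-"):
    """Return a pattern with a residue where a column is conserved, else 'x'."""
    n_sequences = len(alignment)
    pattern = []
    for column in zip(*alignment):
        # Gaps never count towards conservation.
        counts = Counter(c for c in column if c != gap)
        if counts:
            residue, freq = counts.most_common(1)[0]
            if freq / n_sequences >= min_fraction:
                pattern.append(residue)
                continue
        pattern.append("x")
    return "".join(pattern)


# Hypothetical alignment rows of equal length, for illustration only.
alignment = [
    "MGKST-L",
    "AGKSTAL",
    "MGRST-L",
    "-GKSTVL",
    "MGKTTAL",
]
pattern = extract_pattern(alignment)
```

Lowering min_fraction gives longer but less specific patterns. Raising it towards 1.0 keeps only strictly invariant columns.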
One of the first automatic methods for the identification of conserved positions in a multiple alignment was the AMAS program, using a set‐based description of amino acid properties.
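To give a feel for what a set‐based description of amino acid properties looks like, the sketch below reports, for each alignment column, the property sets that contain every residue in that column. It is a toy illustration only and not the AMAS algorithm: the property sets are a simplified assumption, and the function names are hypothetical.

```python
# Assumption: a small, simplified set of physico-chemical classes.
PROPERTY_SETS = {
    "hydrophobic": set("AVLIMFWC"),
    "aromatic": set("FWYH"),
    "positive": set("KRH"),
    "negative": set("DE"),
    "polar": set("STNQYHKRDE"),
    "small": set("GASTCPDNV"),
}


def shared_properties(column, gap="-"):
    """Return the names of all property sets that contain every residue in a column."""
    residues = set(column)
    if gap in residues:
        # Assumption: columns containing gaps are treated as unconserved.
        return []
    return sorted(
        name for name, members in PROPERTY_SETS.items() if residues <= members
    )


def conserved_columns(alignment):
    """List (column index, shared properties) for columns sharing at least one property."""
    result = []
    for index, column in enumerate(zip(*alignment)):
        properties = shared_properties(column)
        if properties:
            result.append((index, properties))
    return result


# Hypothetical three-column alignment, for illustration only.
columns = conserved_columns(["KDL", "RES", "KDI"])
```

The first column (K, R, K) is not identical across sequences but is conserved in charge, which a strict identity criterion would miss.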
Tools
FIRE‐pro (finding informative regulatory elements in proteins)
MotifHound
Dilimot
SLiMFinder
qPMS (quorum planted motif search)
Pratt
MEME
Breaking a protein down into its constituent domain components is evidently a reductionist approach, but one which, to judge from the level of citations of domain discovery information, is of great importance to an understanding of protein function. This is particularly true of metazoan genomes, and the human genome in particular, where multi-domain proteins abound. The identification of orthologues (genes related by speciation events) and paralogues (genes related by an intra-genome duplication event) represents an alternative and complementary approach to tracking the evolution of function, but the many-to-many evolutionary relationships of metazoan proteins, and their multi-domain nature, complicate the application of these concepts, making initial domain-based annotation more appealing. Thus, increasing the number of proteins for which domain-based annotation can be provided is an important goal of computational genome analysis.
It is computationally intensive, but relatively straightforward, to apply database-searching techniques to entire sequence databases (for instance, either a whole genome, or a complete non-redundant sequence database) and thus establish all significant sequence similarities detectable by any given method. These similarities between pairs of sequences can then be clustered into sets of putatively homologous proteins. However, the multi-domain nature of proteins complicates the clustering procedure. A protein consisting of two domains, A and B, will cause the cluster containing homologs of domain A to be merged with that containing homologs of domain B. A number of automatic techniques have been developed to identify multi-domain proteins and decompose them into their respective domain complements; the basic principle of all is that domain boundaries can be inferred by automatic inspection of sequence alignments. In practice, low levels of sequence conservation between members of a domain family can make it difficult to establish domain boundaries, particularly from sets of pairwise sequence comparisons. This, and other problems, such as the difficulty of setting universal thresholds to establish homology between sequences within domain families, and the problem of usefully annotating automatically defined families, reduce the efficacy of these otherwise attractive approaches.
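The cluster-merging problem described above can be reproduced with a small sketch of single-linkage clustering over pairwise similarities, implemented here with a union-find structure. The protein names and similarity pairs are hypothetical, and single linkage is an assumption made for illustration. The text does not specify which clustering method is used.

```python
def single_linkage_clusters(proteins, similar_pairs):
    """Group proteins so that any two linked by a chain of similarities share a cluster."""
    parent = {p: p for p in proteins}

    def find(p):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for p in proteins:
        clusters.setdefault(find(p), set()).add(p)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical single-domain proteins: A1, A2 carry domain A; B1, B2 carry domain B.
single_domain_pairs = [("A1", "A2"), ("B1", "B2")]
separate = single_linkage_clusters(["A1", "A2", "B1", "B2"], single_domain_pairs)

# Adding one two-domain protein "AB" that matches both families merges the clusters.
merged = single_linkage_clusters(
    ["A1", "A2", "B1", "B2", "AB"],
    single_domain_pairs + [("AB", "A1"), ("AB", "B1")],
)
```

Here separate contains two families, while merged contains a single cluster even though no domain A protein is similar to any domain B protein. This is why multi-domain proteins have to be decomposed into domains before, or during, clustering.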
BioinfoLytics Company
Our company, BioinfoLytics, is affiliated with BioCode. It is a project covering many topics in genomics and proteomics and their analysis with a range of tools, including Sequence Alignment & Analysis, Bioinformatics Scripting & Software Development, Phylogenetic and Phylogenomic Analysis, Functional Analysis, Biological Data Analysis & Visualization, Custom Analysis, Biological Database Analysis, Molecular Docking, Protein Structure Prediction, and Molecular Dynamics. These services are offered to BioCode learners who want to develop their interests further, meet their project requirements, and obtain the results they need. We provide a platform where one can learn, have research projects analyzed, and get help and knowledge in molecular, computational, and analytical biology.
We provide a “Domain and Motif Analysis” service to our customers to support high-quality research and to advance science in the domain of Sequence Alignment & Analysis.