Proteins generally consist of one or more functional regions, commonly known as domains. Different domains occur in varying combinations in different proteins, which gives rise to the diverse repertoire of proteins found in nature. Identifying the domains present in a protein can provide insights into the function of that protein.
The Pfam database is a large collection of protein domain families. Each family is represented by multiple sequence alignments and a hidden Markov model (HMM).
Each Pfam family, often known as a Pfam-A entry, consists of a curated seed alignment containing a small set of representative members of the family, profile hidden Markov models (profile HMMs) built from the seed alignment, and an automatically generated full alignment, which contains all detectable protein sequences belonging to the family.
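To illustrate how a model built from a seed alignment can reward matches at conserved positions, the following is a minimal sketch that turns a toy seed alignment into per-column log-odds scores. It is a deliberately simplified position-specific scoring matrix, not a profile HMM: it has no insert or delete states and no transition probabilities, and it is not how Pfam builds its models. The toy sequences, the uniform background frequency, the pseudocount value and the function names build_profile and score are all assumptions made for this example.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(seed_alignment, pseudocount=1.0):
    """Return one dict of log-odds scores per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    profile = []
    for column in zip(*seed_alignment):
        # Count residues only; gap characters such as '-' or '.' are ignored.
        counts = Counter(res for res in column if res in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def score(profile, segment):
    """Sum the per-column scores of an ungapped segment."""
    return sum(col.get(res, 0.0) for col, res in zip(profile, segment))


# Made-up seed alignment: columns 1 and 2 are fully conserved.
toy_seed = ["GKSTL", "GKSAL", "GKTTI", "GKSSV"]
toy_profile = build_profile(toy_seed)

member_score = score(toy_profile, "GKSTL")
unrelated_score = score(toy_profile, "AAAAA")
```

In this toy model the fully conserved first column gives its consensus residue a higher score than the variable fourth column gives to its most common residue, so a match at a conserved site contributes more to the total.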
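Pfam entries are classified into one of six types:
Family: a collection of related protein regions
Domain: a structural unit
Repeat: a short unit that is unstable in isolation but forms a stable structure when multiple copies are present
Motif: a short unit found outside globular domains
Coiled-Coil: regions that predominantly contain coiled-coil motifs
Disordered: regions that are conserved, yet are either shown or predicted to contain biased sequence composition or to be intrinsically disordered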
Related Pfam entries are grouped together into clans. The relationship may be defined by similarity of sequence, structure or profile HMM.
Uses
Pfam aims to provide a complete and accurate classification of protein families and domains. Originally, the rationale behind creating the database was to have a semi-automated method of curating information on known protein families, in order to improve the efficiency of annotating genomes. The Pfam classification of protein families has been widely adopted by biologists because of its wide coverage of proteins and sensible naming conventions.
Pfam is used by experimental biologists researching specific proteins, by structural biologists to identify new targets for structure determination, by computational biologists to organise sequences, and by evolutionary biologists tracing the origins of proteins.
The Pfam website allows users to submit protein or DNA sequences to search for matches to families in the database. If DNA is submitted, a six-frame translation is performed and each frame is searched. Rather than performing a typical BLAST search, Pfam uses profile hidden Markov models, which give greater weight to matches at conserved sites. This allows better detection of remote homology and makes the models more suitable for annotating genomes of organisms with no well-annotated close relatives. By contrast, the protein family databases Prints [45] and Blocks [46] describe each family with a set of short ungapped blocks of aligned residues.
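The six-frame translation mentioned above can be sketched in a few lines of Python. This is an illustration of the general technique, not the code the Pfam website runs. It assumes the standard genetic code, writes stop codons as "*", writes any codon containing a non-ACGT character as "X", and drops trailing bases that do not fill a codon. The function names are chosen for this example.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Return the three forward and three reverse-strand translations."""
    reverse = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


frames = six_frame_translation("ATGGCCTAA")
```

Each of the six resulting protein strings would then be searched against the family models separately, since the correct reading frame of the submitted DNA is not known in advance.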
Pfam has also been used in the creation of other resources such as iPfam, which catalogs domain-domain interactions within and between proteins, based on information in structure databases and mapping of Pfam domains onto these structures.
Features
View a description of the family
View protein domain architectures
Examine species distribution
Follow links to other databases
View known protein structures
Search a protein or DNA sequence against the Pfam models
Browse Pfam families and clans
Retrieve text annotation about any given family or entry
View multiple sequence alignments of a family or clan
View relationships between families in a clan
See protein structure information in the context of a family
View families according to their taxonomic spread
Search the database by keywords
Pfam data are available in a variety of formats, which include flatfiles and relational table dumps, both of which can be downloaded from the FTP site.
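As an illustration of working with a downloaded flatfile, the sketch below collects the per-family annotation lines from alignment text. It assumes the flatfile is in Stockholm format, where annotation lines begin with "#=GF" followed by a two-letter tag and each record ends with "//". The sample record, including its identifier and accession, is made up for this example and does not correspond to a real Pfam entry. The function name parse_gf_annotations is also an assumption.

```python
def parse_gf_annotations(stockholm_text):
    """Yield one dict of #=GF fields per record; records end with '//'."""
    record = {}
    for line in stockholm_text.splitlines():
        line = line.rstrip()
        if line == "//":
            if record:
                yield record
            record = {}
        elif line.startswith("#=GF "):
            parts = line.split(None, 2)  # marker, tag, free-text value
            if len(parts) == 3:
                _, tag, value = parts
                # A tag may repeat within a record, so values are kept in a list.
                record.setdefault(tag, []).append(value)
    if record:  # tolerate a final record without a terminator
        yield record


# Made-up record for illustration only; not a real Pfam family.
SAMPLE = """# STOCKHOLM 1.0
#=GF ID   Toy_family
#=GF AC   PF00000.1
#=GF DE   Made-up family used only for this example
#=GF TP   Domain
seq1/1-5  GKSTL
seq2/1-5  GKSAL
//
"""

records = list(parse_gf_annotations(SAMPLE))
```

Because the function is a generator, it can be fed a large file in pieces without holding every record in memory, although this sketch takes the whole text as a single string for simplicity.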