Next-generation sequencing continues to flood bioinformatics databases with new protein sequences, most of which are of unknown function. Being able to infer, from protein sequence alone, the roles that these proteins play and the likely mechanisms by which they play them would have important implications for many areas of biological research. Making this kind of prediction requires detailed knowledge of how protein function can change during evolution and how these changes can be recognized from patterns within the protein sequence.
The CATH Protein Structure Classification is a free, publicly available, curated online resource that provides information on the evolutionary relationships of protein domains. It was created in the mid-1990s by Professor Christine Orengo and colleagues, including Janet Thornton and David Jones, and continues to be developed by the Orengo group at University College London. CATH shares many broad features with the SCOP resource; however, there are also many areas in which the detailed classification differs greatly. CATH is a novel hierarchical classification of protein domain structures that clusters proteins at four major levels: Class (C), Architecture (A), Topology (T) and Homologous superfamily (H).
When predicting the function of a protein, it is important to recognize the precise sequence patterns that can be associated with a particular functional role.
Experimentally determined three-dimensional protein structures are obtained from the Protein Data Bank and split into their constituent polypeptide chains, where applicable. Protein domains are identified within these chains using a mixture of automatic methods and manual curation.
The domains are then classified within the CATH structural hierarchy. At the Class (C) level, domains are assigned according to their secondary structure content, i.e. all alpha, all beta, a mixture of alpha and beta, or little secondary structure; this assignment is made automatically for more than 90% of protein structures. At the Architecture (A) level, information on the arrangement of secondary structure in three-dimensional space is used for assignment. At the Topology/fold (T) level, information on how the secondary structure elements are connected and arranged is used. Assignments are made to the Homologous superfamily (H) level if there is good evidence that the domains are related by evolution, i.e. that they are homologous.
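The four-level hierarchy can be pictured as a nested identifier. The following is a minimal sketch, assuming that a domain's position is written as four dot-separated integers (one each for C, A, T and H) and that classes are numbered 1 to 4 in the order given above; the names `CathNode`, `lineage` and `deepest_shared_level` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

# Class labels follow the four categories named in the text. The numeric
# keys are an assumption made for this illustration.
CLASS_LABELS = {
    1: "mainly alpha",
    2: "mainly beta",
    3: "mixed alpha and beta",
    4: "little secondary structure",
}

LEVELS = ("C", "A", "T", "H")


@dataclass(frozen=True)
class CathNode:
    """One position in the hierarchy: Class, Architecture, Topology, Homologous superfamily."""
    c: int
    a: int
    t: int
    h: int

    @classmethod
    def parse(cls, code: str) -> "CathNode":
        # Assumed format: four dot-separated integers, e.g. "3.40.50.300".
        parts = code.split(".")
        if len(parts) != 4:
            raise ValueError(f"expected four levels, got {len(parts)}: {code!r}")
        return cls(*(int(p) for p in parts))

    def lineage(self) -> List[str]:
        """Return the identifiers of every ancestor level, from C down to H."""
        values = [str(v) for v in (self.c, self.a, self.t, self.h)]
        return [".".join(values[: i + 1]) for i in range(4)]


def deepest_shared_level(x: CathNode, y: CathNode) -> Optional[str]:
    """Return the deepest level (C, A, T or H) at which two domains still agree."""
    shared = None
    for level, vx, vy in zip(LEVELS, (x.c, x.a, x.t, x.h), (y.c, y.a, y.t, y.h)):
        if vx != vy:
            break
        shared = level
    return shared
```

With this representation, two domains that agree down to the T level but differ at H share a fold without being placed in the same homologous superfamily, which is the distinction the hierarchy is designed to capture.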
Additional sequence data for domains with no experimentally determined structures are provided by CATH's sister resource, Gene3D, and are used to populate the homologous superfamilies. Protein sequences from UniProtKB and Ensembl are scanned against CATH HMMs to predict domain sequence boundaries and make homologous superfamily assignments. All the predicted domain sequences assigned to CATH superfamilies have been subclassified into functional families (FunFams). Relatives within these functional families are likely to share highly similar structures and functions.
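To illustrate how HMM hits on a sequence might be turned into domain boundaries and superfamily assignments, the sketch below uses a simple greedy rule: keep the highest-scoring hit, then keep each further hit only if it does not overlap one already kept. This is not a description of the procedure Gene3D uses, which the text does not specify; the `Hit` type, the `resolve_hits` function, and the identifiers and scores in the usage example are all hypothetical.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    """One HMM match on a protein sequence (hypothetical record layout)."""
    superfamily: str  # identifier of the matched superfamily model
    start: int        # first residue of the match, 1-based inclusive
    end: int          # last residue of the match, inclusive
    score: float      # match score, higher is better


def overlaps(a: Hit, b: Hit) -> bool:
    """True if the two matches share at least one residue."""
    return a.start <= b.end and b.start <= a.end


def resolve_hits(hits: List[Hit]) -> List[Hit]:
    """Greedily keep the best-scoring, mutually non-overlapping hits.

    The result is ordered by position along the sequence, so each kept hit
    gives a predicted domain boundary and a superfamily assignment.
    """
    kept: List[Hit] = []
    for hit in sorted(hits, key=lambda h: h.score, reverse=True):
        if not any(overlaps(hit, other) for other in kept):
            kept.append(hit)
    return sorted(kept, key=lambda h: h.start)
```

For example, given invented hits `Hit("3.40.50.300", 5, 180, 120.0)`, `Hit("3.40.50.720", 10, 170, 80.0)` and `Hit("1.10.8.10", 190, 260, 45.0)`, the second hit is discarded because it overlaps the higher-scoring first one, leaving two predicted domains.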
The assignments of structures to topology families and homologous superfamilies are made by sequence and structure comparisons.
The CATH homepage provides easy access to the CATH classification. The first site element contains a brief description of CATH, with a link to a more thorough introduction. The language is non-technical, so the reader can quickly grasp the overall structure of CATH.