The many functional partnerships and interactions that occur between proteins are at the core of cellular processing, and their systematic characterization helps to provide context in molecular systems biology. However, known and predicted interactions are scattered over multiple resources, and the available data exhibit notable differences in terms of quality and completeness. The STRING database aims to provide a critical assessment and integration of protein–protein interactions, including direct (physical) as well as indirect (functional) associations. The new version 10.0 of STRING covers more than 2000 organisms, which has necessitated novel, scalable algorithms for transferring interaction information between organisms. For this purpose, hierarchical and self-consistent orthology annotations are generated for all interacting proteins, grouping the proteins into families at many levels of phylogenetic resolution. Further improvements in version 10.0 include a completely redesigned prediction pipeline for inferring protein–protein associations from co-expression data, an application programming interface (API) for the R computing environment and improved statistical analysis for enrichment tests in user-provided networks.
The basic interaction unit in STRING is the functional association, that is, a specific and productive functional relationship between two proteins, likely contributing to a common biological purpose. Interactions are derived from multiple sources:
known experimental interactions are imported from primary databases
pathway knowledge is parsed from manually curated databases
automated text-mining is applied to uncover statistical and/or semantic links between proteins, based on Medline abstracts and a large collection of full-text articles
interactions are predicted de novo by a number of algorithms using genomic information as well as by co-expression analysis
interactions that are observed in one organism are systematically transferred to other organisms, via pre-computed orthology relations
STRING centers on protein-coding gene loci: alternative splice isoforms or post-translationally modified forms are not resolved, but are instead collapsed at the level of the gene locus. All sources of interaction evidence are benchmarked and calibrated against previous knowledge, using the high-level functional groupings provided by the manually curated Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway maps.
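The benchmarking idea can be illustrated with a minimal sketch. It assumes a simplified setup that is not taken from STRING itself: raw scores between 0 and 1, a hypothetical `pathways` mapping from protein to its set of pathway identifiers, and fixed-width score bins. For each bin it reports the fraction of benchmarkable pairs whose two proteins share at least one pathway, which is the kind of quantity a calibration curve could be fitted to.

```python
from collections import defaultdict


def calibrate(scored_pairs, pathways, bin_width=0.5):
    """Return {bin_index: fraction of benchmarkable pairs sharing a pathway}.

    scored_pairs: iterable of (protein_a, protein_b, raw_score) tuples.
    pathways:     dict mapping protein -> set of pathway ids (hypothetical
                  stand-in for a KEGG-style reference set).
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for a, b, raw in scored_pairs:
        # A pair can only be judged if both proteins are annotated.
        if a not in pathways or b not in pathways:
            continue
        bin_index = int(raw / bin_width)
        totals[bin_index] += 1
        # Count the pair as a "true" association if any pathway is shared.
        if pathways[a] & pathways[b]:
            hits[bin_index] += 1
    return {i: hits[i] / totals[i] for i in sorted(totals)}


# Toy data, invented for illustration only.
toy_pathways = {"A": {"p1"}, "B": {"p1"}, "C": {"p2"}, "D": {"p1", "p2"}}
toy_pairs = [
    ("B", "C", 0.1),   # no shared pathway
    ("C", "D", 0.3),   # share p2
    ("A", "B", 0.9),   # share p1
    ("A", "D", 0.8),   # share p1
    ("A", "X", 0.95),  # X is unannotated, so the pair is skipped
]
calibration = calibrate(toy_pairs, toy_pathways, bin_width=0.5)
```

In this toy example the higher-scoring bin has the larger fraction of pathway-sharing pairs, which is the behavior one would hope to see from an informative evidence channel.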
The STRING database contains information from numerous sources, including experimental data, computational prediction methods and public text collections. It is freely accessible and regularly updated. The resource also serves to highlight functional enrichments in user-provided lists of proteins, using a number of functional classification systems.
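As an illustration of what a functional enrichment test on a user-provided protein list can look like, the sketch below uses a one-sided hypergeometric test. This is a common choice for such analyses, but the text does not state which test STRING applies, so the method, the function names and the toy data are all assumptions.

```python
from math import comb


def hypergeom_pvalue(k, n, K, N):
    """P(X >= k) when n proteins are drawn from N, of which K are annotated."""
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


def enrichment_pvalue(user_list, annotated, background):
    """Enrichment of one annotation term in a user list against a background."""
    background = set(background)
    user = set(user_list) & background   # ignore proteins outside the background
    term = set(annotated) & background
    overlap = len(user & term)
    return hypergeom_pvalue(overlap, len(user), len(term), len(background))


# Toy data: 10 background proteins, 5 annotated, and a user list
# consisting of exactly those 5 annotated proteins.
toy_background = [f"P{i}" for i in range(1, 11)]
toy_term = toy_background[:5]
p_value = enrichment_pvalue(toy_term, toy_term, toy_background)
```

A real analysis would test many terms at once and would therefore also need a multiple-testing correction, which is omitted here.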
Protein–protein interaction networks are an important ingredient for the system-level understanding of cellular processes. Such networks can be used for filtering and assessing functional genomics data and for providing an intuitive platform for annotating structural, functional and evolutionary properties of proteins. Exploring the predicted interaction networks can suggest new directions for future experimental research and provide cross-species predictions for efficient interaction mapping.
Like many other databases that store protein association knowledge, STRING imports experimentally derived protein–protein interactions obtained through literature curation. In addition, STRING stores computationally predicted interactions from:
text mining of scientific texts
interactions computed from genomic features
interactions transferred from model organisms based on orthology
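The orthology-based transfer listed above can be sketched in a few lines. This is a deliberately simplified illustration and not STRING's actual algorithm: it assumes a flat, hypothetical `orthologs` mapping from each source-organism protein to its target-organism orthologs, and it ignores both hierarchical orthologous groups and any confidence scoring.

```python
def transfer_interactions(source_interactions, orthologs):
    """Project source-organism protein pairs onto a target organism.

    source_interactions: iterable of (protein_a, protein_b) pairs.
    orthologs:           dict mapping a source protein to a list of
                         target-organism orthologs (hypothetical input).
    Returns a set of unordered target pairs, each stored as a frozenset.
    """
    transferred = set()
    for a, b in source_interactions:
        # Proteins without a known ortholog contribute nothing.
        for target_a in orthologs.get(a, ()):
            for target_b in orthologs.get(b, ()):
                if target_a != target_b:  # skip self-interactions
                    transferred.add(frozenset((target_a, target_b)))
    return transferred


# Toy data, invented for illustration only.
toy_source = [("yA", "yB"), ("yB", "yC")]
toy_orthologs = {"yA": ["hA"], "yB": ["hB1", "hB2"]}  # yC has no ortholog
predicted = transfer_interactions(toy_source, toy_orthologs)
```

Note that a one-to-many orthology relation, such as `yB` here, fans a single observed interaction out into several predicted pairs, which is one reason a real pipeline would need to weight transferred evidence rather than copy it unchanged.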