
Amino acid substitution matrices are 20×20 matrices devised to reflect the likelihood of residue substitutions. There are essentially two types. One type is based on the interchangeability of the genetic code or on amino acid properties; the other is derived from empirical studies of amino acid substitutions. Although the two approaches coincide to a certain extent, the first, which is based on the genetic code or the physicochemical features of amino acids, has been shown to be less accurate than the second, which is based on surveys of actual amino acid substitutions among related proteins. The empirical matrices, which include the PAM and BLOSUM matrices, are derived from actual alignments of highly similar sequences. By analyzing the probabilities of amino acid substitutions in these alignments, one can develop a scoring system that gives a high score to a more likely substitution and a low score to a rare one. For a given substitution matrix, a positive score means that the substitution is found in a data set of homologous sequences more frequently than would be expected by random chance. Positive scores therefore represent substitutions between very similar or identical residues.
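The link between substitution frequency and score sign can be illustrated with a minimal Python sketch. It assumes the score is a base-2 logarithm of the ratio of observed to chance-expected pair frequency; the function name `log_odds_score` and all frequencies below are hypothetical, invented purely for illustration and not taken from any published matrix.

```python
import math

def log_odds_score(observed_freq, expected_freq):
    """Return log2(observed / expected) for one residue pair.

    A positive value means the pair is seen more often than chance
    predicts; a negative value means it is seen less often.
    """
    return math.log2(observed_freq / expected_freq)

# Hypothetical (observed, expected) frequencies, invented for illustration.
examples = {
    "pair seen more often than chance": (0.020, 0.005),
    "pair seen as often as chance": (0.010, 0.010),
    "pair seen less often than chance": (0.001, 0.008),
}

for label, (observed, expected) in examples.items():
    print(f"{label}: {log_odds_score(observed, expected):+.2f}")
```

In this sketch a pair observed as often as chance predicts scores zero, so the sign of the score alone tells whether a substitution is favored or disfavored in the data set.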
In PAM matrix construction, the only direct observation of residue substitutions is in PAM1, which is based on a relatively small set of extremely closely related sequences. Sequence alignment statistics for more divergent sequences are not available. To fill this gap, a new set of substitution matrices has been developed. This is the series of blocks amino acid substitution matrices (BLOSUM), all of which are derived from direct observation of every possible amino acid substitution in multiple sequence alignments. They were constructed from more than 2,000 conserved amino acid patterns representing 500 groups of protein sequences. The sequence patterns, also called blocks, are ungapped alignments fewer than sixty amino acid residues in length. The frequencies of amino acid substitutions among the residues in these blocks are calculated to produce a numerical table, or block substitution matrix. Rather than being extrapolated from a single matrix, as the other PAM matrices are from PAM1, each BLOSUM matrix is named for the actual percentage identity of the sequences selected for its construction. For example, BLOSUM62 indicates that the matrix was built from sequences sharing no more than 62% identity, with more similar sequences clustered and counted as one. Other BLOSUM matrices based on sequence groups of various identity levels have also been constructed. In reverse of the PAM numbering system, the lower the BLOSUM number, the more divergent the sequences it represents. The BLOSUM score for a particular residue pair is derived from the log ratio of the observed substitution frequency of that pair to the frequency expected for the pair by chance. The log odds are taken to base 2, rather than base 10 as in the PAM matrices. The resulting value is rounded to the nearest integer and entered into the substitution matrix. As in the PAM matrices, positive and negative values correspond to substitutions that occur more or less frequently, respectively, than expected among evolutionarily conserved replacements.
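The path from a block to integer scores can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the published procedure in full: it counts residue pairs column by column in a single ungapped block, skips the clustering of highly similar sequences, and scores only pairs that actually occur. The function name `blosum_like_scores`, the two-residue toy block, and the default `scale` of 2 (half-bit units, the convention commonly used for BLOSUM62) are choices made for this example and do not come from the text.

```python
import math
from collections import Counter
from itertools import combinations

def blosum_like_scores(block, scale=2.0):
    """Compute rounded log-odds scores from one ungapped block.

    block: list of equal-length sequences (rows of the alignment).
    Returns {(a, b): score} with a <= b, for observed pairs only.
    """
    # Count every unordered residue pair within each alignment column.
    pair_counts = Counter()
    for column in zip(*block):
        for a, b in combinations(column, 2):
            pair_counts[tuple(sorted((a, b)))] += 1

    # Observed pair frequencies.
    total_pairs = sum(pair_counts.values())
    observed = {pair: n / total_pairs for pair, n in pair_counts.items()}

    # Residue frequencies implied by the pairs: a mismatched pair
    # contributes half of its frequency to each of its two residues.
    residue_freq = Counter()
    for (a, b), freq in observed.items():
        if a == b:
            residue_freq[a] += freq
        else:
            residue_freq[a] += freq / 2
            residue_freq[b] += freq / 2

    # Expected pair frequency by chance, then scaled base-2 log odds.
    scores = {}
    for (a, b), freq in observed.items():
        expected = residue_freq[a] * residue_freq[b]
        if a != b:
            expected *= 2  # unordered pair: (a, b) and (b, a)
        scores[(a, b)] = round(scale * math.log2(freq / expected))
    return scores

# Toy block invented for illustration: four sequences, five columns.
toy_block = ["ASASA", "ASASA", "ASASA", "ASASS"]
print(blosum_like_scores(toy_block))
```

On this toy block the identical pairs A/A and S/S receive positive scores, while the A/S substitution, which appears in only one column, receives a negative score. A real matrix would pool counts over thousands of blocks and cover all 20 residues.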