Substitution matrix: Difference between revisions

Corrected citation
Added information on commonly used maximum-likelihood model based WAG matrix
Line 61:
# Higher numbers in the PAM matrix naming scheme denote larger evolutionary distance, while larger numbers in the BLOSUM matrix naming scheme denote higher sequence similarity and therefore smaller evolutionary distance. Example: PAM150 is used for more distant sequences than PAM100; BLOSUM62 is used for closer sequences than BLOSUM50.
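
The PAM half of this naming rule can be illustrated with a small sketch. It assumes the standard Dayhoff construction, in which the PAM''n'' mutation probability matrix is the ''n''-th power of the PAM1 matrix. The three-letter alphabet, the <code>TOY_PAM1</code> values and the helper names are hypothetical and chosen only for illustration; they are not real amino-acid data.

```python
def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def matpow(m, n):
    """Raise a square matrix to a non-negative integer power by repeated squaring."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    base = [row[:] for row in m]
    while n > 0:
        if n & 1:
            result = matmul(result, base)
        base = matmul(base, base)
        n >>= 1
    return result


# Hypothetical "PAM1" for a toy 3-letter alphabet (assumption, not real data):
# each row sums to 1 and roughly 1% of positions change per step.
TOY_PAM1 = [
    [0.990, 0.006, 0.004],
    [0.006, 0.990, 0.004],
    [0.004, 0.004, 0.992],
]


def conserved_fraction(n, freqs=(1 / 3, 1 / 3, 1 / 3)):
    """Expected fraction of positions still showing the original letter after n PAM steps."""
    p = matpow(TOY_PAM1, n)
    return sum(f * p[i][i] for i, f in enumerate(freqs))


if __name__ == "__main__":
    for n in (1, 100, 150, 250):
        print(n, round(conserved_fraction(n), 4))
```

Under these assumptions the conserved fraction shrinks as ''n'' grows, which is why a higher PAM number corresponds to a larger evolutionary distance.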
 
== Maximum likelihood matrices ==
== Extensions and improvements ==
 
=== WAG matrix ===
Developed in 2001 by Simon Whelan and Nick Goldman, the WAG (Whelan And Goldman) matrix is calculated using a [[maximum likelihood]] estimation procedure. The use of maximum likelihood makes it less prone to systematic errors than matrices based simply on comparing closely related homologs, such as PAM. The substitution scores are calculated from the likelihood of a change, considering multiple tree topologies derived using [[neighbor-joining]]. The scores correspond to a [[substitution model]] that also includes amino-acid stationary frequencies and a scaling factor in the similarity scoring. There are two versions of the matrix: the WAG matrix, which assumes the same amino-acid stationary frequencies across all compared proteins, and the WAG* matrix, which uses different frequencies for each of the included [[protein family|protein families]].<ref name="WAG original paper">{{cite journal |last1=Whelan |first1=Simon |last2=Goldman |first2=Nick |title=A General Empirical Model of Protein Evolution Derived from Multiple Protein Families Using a Maximum-Likelihood Approach |journal=Molecular Biology and Evolution |date=1 May 2001 |volume=18 |issue=5 |pages=691–699 |doi=10.1093/oxfordjournals.molbev.a003851 |url=https://doi.org/10.1093/oxfordjournals.molbev.a003851 |issn=0737-4038}}</ref>
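
How stationary frequencies and a scaling factor can enter such a model is sketched below. This is the generic construction for a time-reversible substitution model, not the WAG estimation procedure itself: a rate matrix is built as exchangeability times target frequency, rescaled to one expected substitution per unit time, exponentiated, and converted to log-odds scores. The three-letter alphabet, <code>TOY_EXCH</code>, <code>TOY_FREQS</code> and all function names are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def build_rate_matrix(exch, freqs):
    """Q[i][j] = exch[i][j] * freqs[j] for i != j; rows sum to zero."""
    n = len(freqs)
    q = [[exch[i][j] * freqs[j] if i != j else 0.0 for j in range(n)]
         for i in range(n)]
    for i in range(n):
        q[i][i] = -sum(q[i])
    # Scaling factor: one expected substitution per unit time at stationarity.
    rate = -sum(freqs[i] * q[i][i] for i in range(n))
    return [[v / rate for v in row] for row in q]


def transition_probs(q, t, squarings=10, terms=12):
    """Approximate P(t) = exp(Q t) by a Taylor series with scaling and squaring."""
    n = len(q)
    scaled = [[v * t / 2 ** squarings for v in row] for row in q]
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    result = [row[:] for row in identity]
    term = [row[:] for row in identity]
    for k in range(1, terms + 1):
        term = [[v / k for v in row] for row in matmul(term, scaled)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = matmul(result, result)
    return result


def log_odds_scores(exch, freqs, t):
    """S[i][j] = log(P_ij(t) / freqs[j]); symmetric for a reversible model."""
    p = transition_probs(build_rate_matrix(exch, freqs), t)
    n = len(freqs)
    return [[math.log(p[i][j] / freqs[j]) for j in range(n)] for i in range(n)]


# Hypothetical symmetric exchangeabilities and stationary frequencies (assumptions).
TOY_EXCH = [[0.0, 1.0, 0.5], [1.0, 0.0, 2.0], [0.5, 2.0, 0.0]]
TOY_FREQS = [0.5, 0.3, 0.2]

if __name__ == "__main__":
    for row in log_odds_scores(TOY_EXCH, TOY_FREQS, 0.5):
        print([round(v, 3) for v in row])
```

In this sketch, swapping in a different frequency vector while keeping the exchangeabilities fixed changes the resulting scores, which loosely mirrors the distinction between one shared set of frequencies and family-specific frequencies described above.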
 
== Specialized substitution matrices and their extensions ==
Many specialized substitution matrices have been developed that describe the amino acid substitution rates in specific structural or sequence contexts, such as in transmembrane alpha helices,<ref>{{cite journal |pmid=11473008 |year=2001 |last1=Müller |first1=T |last2=Rahmann |last3=Rehmsmeier |title=Non-symmetric score matrices and the detection of homologous transmembrane proteins |volume=17 Suppl 1 |pages=S182–9 |journal=Bioinformatics |first2=S |first3=M |doi=10.1093/bioinformatics/17.suppl_1.s182|doi-access=free }}</ref> for combinations of secondary structure states and solvent accessibility states,<ref>{{cite journal |pmid=9135128 |year=1997 |last1=Rice |first1=DW |last2=Eisenberg |title=A 3D-1D substitution matrix for protein fold recognition that includes predicted secondary structure of the sequence |volume=267 |issue=4 |pages=1026–38 |doi=10.1006/jmbi.1997.0924 |journal=Journal of Molecular Biology |first2=D|citeseerx=10.1.1.44.1143 }}</ref><ref>{{cite journal |pmid=18833291 |year=2008 |last1=Gong |first1=Sungsam |last2=Blundell |first2=Tom L. 
|title=Discarding functional residues from the substitution table improves predictions of active sites within three-dimensional structures |volume=4 |issue=10 |pages=e1000179 |doi=10.1371/journal.pcbi.1000179 |journal=PLOS Computational Biology |pmc=2527532 |bibcode=2008PLSCB...4E0179G |editor1-last=Levitt |editor1-first=Michael}}</ref><ref>{{cite journal |pmid=18004781 |year=2008 |last1=Goonesekere |first1=NC |last2=Lee |title=Context-specific amino acid substitution matrices and their use in the detection of protein homologs |volume=71 |issue=2 |pages=910–9 |doi=10.1002/prot.21775 |journal=Proteins |first2=B|s2cid=27443393 }}</ref> or for local sequence-structure contexts.<ref>{{cite journal |pmid=16352653 |year=2006 |last1=Huang |first1=YM |last2=Bystroff |title=Improved pairwise alignments of proteins in the Twilight Zone using local structure predictions |volume=22 |issue=4 |pages=413–22 |doi=10.1093/bioinformatics/bti828 |journal=Bioinformatics |first2=C|doi-access=free }}</ref> These context-specific substitution matrices lead to generally improved alignment quality at some cost of speed but are not yet widely used. Recently, sequence context-specific amino acid similarities have been derived that do not need substitution matrices but that rely on a library of sequence contexts instead. Using this idea, a context-specific extension of the popular [[BLAST]] program has been demonstrated to achieve a twofold sensitivity improvement for remotely related sequences over BLAST at similar speeds ([[CS-BLAST]]).