Substitution matrix: Difference between revisions

Line 59:
 
It also turns out the BLOSUM computer code written by Henikoff and Henikoff does not exactly match the description in their paper. Surprisingly, this commonly-used "wrong" version has better search performance.<ref name=article>{{cite journal |last1=Styczynski |first1=Mark P |last2=Jensen |first2=Kyle L |last3=Rigoutsos |first3=Isidore |last4=Stephanopoulos |first4=Gregory |title=BLOSUM62 miscalculations improve search performance |journal=Nature Biotechnology |date=March 2008 |volume=26 |issue=3 |pages=274–275 |doi=10.1038/nbt0308-274 | pmid=18327232 |s2cid=205266180 }}</ref>
 
One issue with BLOSUM is that it describes observed substitutions, which can be misleading because it ignores the possibility of intermediate substitutions (a consequence of counting changes, equivalent to maximum parsimony).<ref name="WAG original paper"/> As a result, it does not describe a true substitution model; its distances do not have a theoretical meaning as evolutionary distances. The authors of PMB (Probability Matrix from Blocks) use the differences observed in many BLOSUM matrices to estimate actual substitution frequencies, which they present as the PMB matrix.<ref>{{cite journal |last1=Veerassamy |first1=Shalini |last2=Smith |first2=Andrew |last3=Tillier |first3=Elisabeth R. M. |title=A Transition Probability Model for Amino Acid Substitutions from Blocks |journal=Journal of Computational Biology |date=December 2003 |volume=10 |issue=6 |pages=997–1010 |doi=10.1089/106652703322756195}}</ref>
 
=== Differences between PAM and BLOSUM ===
Line 73 ⟶ 71:
* VTML (2001), a PAM-like matrix based on the alignments in the SYSTERS database, iteratively improved using a maximum likelihood estimator starting from the 1970s Dayhoff PAM model.<ref name="pmid32954566">{{cite journal |last1=Trivedi |first1=R |last2=Nagarajaram |first2=HA |title=Substitution scoring matrices for proteins - An overview. |journal=Protein science : a publication of the Protein Society |date=November 2020 |volume=29 |issue=11 |pages=2150-2163 |doi=10.1002/pro.3954 |pmid=32954566 |pmc=7586916}}</ref>
* WAG (Whelan and Goldman, 2001) uses a [[maximum likelihood]] estimation procedure, rather than any form of maximum parsimony (MP), on a "BRKALN" dataset. The substitution scores are calculated based on the likelihood of a change considering multiple tree topologies derived using [[neighbor-joining]]. The scores correspond to a [[substitution model]] which also includes amino-acid stationary frequencies and a scaling factor in the similarity scoring. There are two versions of the matrix: the WAG matrix, based on the assumption of the same amino-acid stationary frequencies across all the compared proteins, and the WAG* matrix, with different frequencies for each of the included [[protein family|protein families]].<ref name="WAG original paper">{{cite journal |last1=Whelan |first1=Simon |last2=Goldman |first2=Nick |title=A General Empirical Model of Protein Evolution Derived from Multiple Protein Families Using a Maximum-Likelihood Approach |journal=Molecular Biology and Evolution |date=1 May 2001 |volume=18 |issue=5 |pages=691–699 |doi=10.1093/oxfordjournals.molbev.a003851 |pmid=11319253 |issn=0737-4038|doi-access=free }}</ref>
* PMB (Probability Matrix from Blocks, 2003), a set of "true" substitution frequencies estimated from the observed frequencies of BLOSUM, taking into account the possibility of a later substitution masking a previous one. It thus creates a true evolutionary model where the distances have a theoretical meaning as evolutionary distances (BLOSUM does not have this feature, unlike PAM, WAG, and most other later matrices).<ref>{{cite journal |last1=Veerassamy |first1=Shalini |last2=Smith |first2=Andrew |last3=Tillier |first3=Elisabeth R. M. |title=A Transition Probability Model for Amino Acid Substitutions from Blocks |journal=Journal of Computational Biology |date=December 2003 |volume=10 |issue=6 |pages=997–1010 |doi=10.1089/106652703322756195}}</ref>
* LG (2008), which uses a larger dataset (Pfam-based) than WAG.<ref>{{cite journal |last1=Le |first1=S. Q. |last2=Gascuel |first2=O. |title=An Improved General Amino Acid Replacement Matrix |journal=Molecular Biology and Evolution |date=3 April 2008 |volume=25 |issue=7 |pages=1307–1320 |doi=10.1093/molbev/msn067}}</ref>
* Matrices using a selection of proteins based on structural relatedness, as proposed by Benner et al. (1994), Fan (2004), and Steven et al. (2004).<ref name="pmid32954566"/>
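
The distinction between observed and actual substitutions noted above can be illustrated with a simple multiple-hit correction. The sketch below uses the textbook Poisson correction, d = −ln(1 − p), which is not the estimation procedure used for PMB or any other matrix listed here. The function names and the toy sequences are hypothetical, and the sequences are assumed to be already aligned, ungapped, and of equal length.

```python
import math


def observed_difference(seq_a, seq_b):
    """Fraction of aligned positions at which two sequences differ (p)."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and of equal length")
    mismatches = sum(a != b for a, b in zip(seq_a, seq_b))
    return mismatches / len(seq_a)


def poisson_corrected_distance(p):
    """Expected substitutions per site under a Poisson model: d = -ln(1 - p).

    Illustrative assumption only: every site changes at the same rate and
    a later change at a site hides any earlier change at that site.
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must satisfy 0 <= p < 1")
    return -math.log(1.0 - p)


if __name__ == "__main__":
    # Hypothetical toy alignment, not taken from any real dataset.
    p = observed_difference("ACDEFGHIKL", "ACDQFGHVKM")
    d = poisson_corrected_distance(p)
    print(f"observed difference p = {p:.3f}, corrected distance d = {d:.3f}")
```

The corrected distance is never smaller than the observed difference, and the gap widens as sequences diverge; this is the sense in which counts of observed changes understate the actual amount of substitution.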