Substitution matrix: Difference between revisions

Revision as of 07:00, 24 September 2024 by OlliverWithDoubleL (edit summary: short description)
Latest revision as of 17:47, 29 July 2025 by Lynch44 (minor edit; edit summary: Reverted edit by 94.205.38.214 (talk) to last version by Chris the speller)
(16 intermediate revisions by 6 users not shown)
Line 44:
where <math>M_{i,j}</math> is the probability that amino acid <math>i</math> transforms into amino acid <math>j</math>, and <math>p_i</math>, <math>p_j</math> are the frequencies of amino acids ''i'' and ''j''. The base of the logarithm is not important: changing it only rescales the scores, and the same substitution matrix is often expressed in different bases.

== Example amino-acid matrices ==
{{anchor|Example matrices}}

=== PAM ===
One of the first amino acid substitution matrices, the PAM ''([[Point accepted mutation|Point Accepted Mutation]])'' matrix was developed by [[Margaret Oakley Dayhoff|Margaret Dayhoff]] in the 1970s. This matrix is calculated by observing the differences in closely related proteins. Because very closely related homologs are used, the observed mutations are not expected to significantly change the common functions of the proteins. Thus the observed substitutions (by point mutations) are considered to be accepted by natural selection.

Line 57:
The BLOSUM62 matrix performs well at detecting similarities in distant sequences, and it is the matrix used by default in most recent alignment applications such as [[BLAST (biotechnology)|BLAST]]. The BLOSUM computer code written by Henikoff and Henikoff does not exactly match the description in their paper. Surprisingly, this commonly used "wrong" version has better search performance.<ref name=article>{{cite journal |last1=Styczynski |first1=Mark P |last2=Jensen |first2=Kyle L |last3=Rigoutsos |first3=Isidore |last4=Stephanopoulos |first4=Gregory |title=BLOSUM62 miscalculations improve search performance |journal=Nature Biotechnology |date=March 2008 |volume=26 |issue=3 |pages=274–275 |doi=10.1038/nbt0308-274 |pmid=18327232 |s2cid=205266180 }}</ref>

=== Differences between PAM and BLOSUM ===

Line 66 ⟶ 68:
=== Newer matrices ===
A number of newer substitution matrices have been proposed to deal with inadequacies in earlier designs.
* JTT (1992). Published in the same year as BLOSUM, it also performs clustering and uses an implicit model. This may help reduce the systematic error from maximum parsimony (MP), but it also wastes sequence information.<ref name="WAG original paper"/>
* VTML (2001), a PAM-like matrix based on the alignments in the SYSTERS database, iteratively improved using a maximum likelihood estimator starting from the 1970s Dayhoff PAM model.<ref name="pmid32954566">{{cite journal |last1=Trivedi |first1=R |last2=Nagarajaram |first2=HA |title=Substitution scoring matrices for proteins - An overview |journal=Protein Science |date=November 2020 |volume=29 |issue=11 |pages=2150–2163 |doi=10.1002/pro.3954 |pmid=32954566 |pmc=7586916}}</ref>
* WAG (Whelan And Goldman, 2001) uses a [[maximum likelihood]] estimation procedure instead of any form of MP over a "BRKALN" dataset. The substitution scores are calculated based on the likelihood of a change considering multiple tree topologies derived using [[neighbor-joining]]. The scores correspond to a [[substitution model]] that also includes amino-acid stationary frequencies and a scaling factor in the similarity scoring. There are two versions of the matrix: the WAG matrix, based on the assumption of the same amino-acid stationary frequencies across all the compared proteins, and the WAG* matrix, with different frequencies for each of the included [[protein family|protein families]].<ref name="WAG original paper">{{cite journal |last1=Whelan |first1=Simon |last2=Goldman |first2=Nick |title=A General Empirical Model of Protein Evolution Derived from Multiple Protein Families Using a Maximum-Likelihood Approach |journal=Molecular Biology and Evolution |date=1 May 2001 |volume=18 |issue=5 |pages=691–699 |doi=10.1093/oxfordjournals.molbev.a003851 |pmid=11319253 |issn=0737-4038|doi-access=free }}</ref>
* PMB (Probability Matrix from Blocks, 2003), a set of "true" substitution frequencies estimated from the observed frequencies of BLOSUM, taking into account the possibility of a later substitution masking a previous one. It thus creates an evolutionary model where the distances have theoretical meaning (BLOSUM does not have this feature, unlike PAM, WAG, and most other later matrices, and hence is ''not'' recommended for phylogeny by IQ-TREE).<ref>{{cite journal |last1=Veerassamy |first1=Shalini |last2=Smith |first2=Andrew |last3=Tillier |first3=Elisabeth R. M. |title=A Transition Probability Model for Amino Acid Substitutions from Blocks |journal=Journal of Computational Biology |date=December 2003 |volume=10 |issue=6 |pages=997–1010 |doi=10.1089/106652703322756195|pmid=14980022 }}</ref>
* LG (2008), which uses a larger dataset (Pfam-based) than WAG. An extension of the WAG algorithm is used, with a new PhyML (WAG+Γ4) model taking into account sites with different evolutionary rates.<ref>{{cite journal |last1=Le |first1=S. Q. |last2=Gascuel |first2=O. |title=An Improved General Amino Acid Replacement Matrix |journal=Molecular Biology and Evolution |date=3 April 2008 |volume=25 |issue=7 |pages=1307–1320 |doi=10.1093/molbev/msn067|pmid=18367465 }}</ref>
* QMaker and nQMaker (2021, 2022), programs able to quickly estimate time-reversible and nonreversible matrices from very large datasets. Each provides a general matrix and 5 specialized matrices, for a total of 12 precalculated substitution matrices.<ref>{{cite journal |last1=Minh |first1=Bui Quang |last2=Dang |first2=Cuong Cao |last3=Vinh |first3=Le Sy |last4=Lanfear |first4=Robert |title=QMaker: Fast and Accurate Method to Estimate Empirical Models of Protein Evolution |journal=Systematic Biology |date=11 August 2021 |volume=70 |issue=5 |pages=1046–1060 |doi=10.1093/sysbio/syab010|pmid=33616668 |pmc=8357343 }}</ref><ref>{{cite journal |last1=Dang |first1=Cuong Cao |last2=Minh |first2=Bui Quang |last3=McShea |first3=Hanon |last4=Masel |first4=Joanna |last5=James |first5=Jennifer Eleanor |last6=Vinh |first6=Le Sy |last7=Lanfear |first7=Robert |title=nQMaker: Estimating Time Nonreversible Amino Acid Substitution Models |journal=Systematic Biology |date=10 August 2022 |volume=71 |issue=5 |pages=1110–1123 |doi=10.1093/sysbio/syac007|pmid=35139203 |pmc=9366462 }}</ref>
* Matrices using a selection of proteins based on structural relatedness, as proposed by Benner et al. (1994), Fan (2004), and Steven et al. (2004).<ref name="pmid32954566"/>
* Matrices using structural alignments of proteins instead of simple sequence alignment (6 separate publications).<ref name="pmid32954566"/>
* Matrices using known physicochemical parameters of amino acid residues (5 separate publications).<ref name="pmid32954566"/>

For a list of more models (including irreversible i.e.
asymmetric and/or specialized ones), see the documentation for recent bioinformatics software including IQ-TREE,<ref name=iqtree>{{cite web |title=Substitution Models |url=https://iqtree.github.io/doc/Substitution-Models |website=iqtree.github.io |language=en}}</ref> PhyML,<ref>{{cite web |title=phyml/doc/phyml-manual.pdf at master · stephaneguindon/phyml |url=https://github.com/stephaneguindon/phyml/blob/master/doc/phyml-manual.pdf |website=GitHub |language=en}}</ref> and RAxML.<ref>{{cite web |last1=Stamatakis |first1=Alexandros |title=The RAxML v8.2.X Manual |url=https://cme.h-its.org/exelixis/resource/download/NewManual.pdf#page=31 |date=July 20, 2016}}</ref>

==== Specialized substitution matrices and their extensions ====
The real substitution rates in a protein depend not only on the identity of the amino acid, but also on the specific structural or sequence context it is in. Many specialized matrices have been developed for these contexts, such as in transmembrane alpha helices,<ref>{{cite journal |pmid=11473008 |year=2001 |last1=Müller |first1=T |last2=Rahmann |last3=Rehmsmeier |title=Non-symmetric score matrices and the detection of homologous transmembrane proteins |volume=17 |pages=S182–9 |journal=Bioinformatics |first2=S |first3=M |issue=Suppl 1 |doi=10.1093/bioinformatics/17.suppl_1.s182|doi-access=free }}</ref> for combinations of secondary structure states and solvent accessibility states,<ref>{{cite journal |pmid=9135128 |year=1997 |last1=Rice |first1=DW |last2=Eisenberg |title=A 3D-1D substitution matrix for protein fold recognition that includes predicted secondary structure of the sequence |volume=267 |issue=4 |pages=1026–38 |doi=10.1006/jmbi.1997.0924 |journal=Journal of Molecular Biology |first2=D|citeseerx=10.1.1.44.1143 }}</ref><ref>{{cite journal |pmid=18833291 |year=2008 |last1=Gong |first1=Sungsam |last2=Blundell |first2=Tom L. |title=Discarding functional residues from the substitution table improves predictions of active sites within three-dimensional structures |volume=4 |issue=10 |pages=e1000179 |doi=10.1371/journal.pcbi.1000179 |journal=PLOS Computational Biology |pmc=2527532 |bibcode=2008PLSCB...4E0179G |editor1-last=Levitt |editor1-first=Michael |doi-access=free }}</ref><ref>{{cite journal |pmid=18004781 |year=2008 |last1=Goonesekere |first1=NC |last2=Lee |title=Context-specific amino acid substitution matrices and their use in the detection of protein homologs |volume=71 |issue=2 |pages=910–9 |doi=10.1002/prot.21775 |journal=Proteins |first2=B|s2cid=27443393 }}</ref> or for local sequence-structure contexts.<ref>{{cite journal |pmid=16352653 |year=2006 |last1=Huang |first1=YM |last2=Bystroff |title=Improved pairwise alignments of proteins in the Twilight Zone using local structure predictions |volume=22 |issue=4 |pages=413–22 |doi=10.1093/bioinformatics/bti828 |journal=Bioinformatics |first2=C|doi-access=free }}</ref> These context-specific substitution matrices generally improve alignment quality at some cost in speed, but are not yet widely used.<!-- todo: deal with 19087270 -->

Since the 2000s, an increasing number of matrices have been defined for subsets of proteins not optimally aligned by traditional "general-purpose" matrices. These include:<ref name="pmid32954566"/>
* PfSSM (2008), CBM and CCF (2008) for ''Plasmodium'' proteins, which have a different amino acid evolutionary bias due to the low [[GC content]] of the genome.
* Matrices for transmembrane proteins. JTT transmembrane (1994) is the first of the class. Later work includes:
** For alpha-helical transmembrane proteins, PHAT (2000) and SLIM (2001).
** For beta-barrel transmembrane proteins, bbTM (2008).
* Matrices for a specific protein family, including GPCRtm (2015) for the transmembrane (mostly helical) regions of [[GPCR]]s.
* Matrices for proteins with a specific role, including Hubsm (2017) for "hub proteins" in protein‐protein interaction networks.
* Matrices for [[intrinsically disordered protein]]s, including DUNMat (2002), MidicMat (2009), Disorder (2010), and EDSSMat (2019).

Recently, sequence context-specific amino acid similarities have been derived that do not need substitution matrices but rely on a library of sequence contexts instead. Using this idea, a context-specific extension of the popular [[BLAST (biotechnology)|BLAST]] program has been demonstrated to achieve a twofold sensitivity improvement for remotely related sequences over BLAST at similar speeds ([[CS-BLAST]]).

== Nucleotide matrices ==
With nucleotides only having four possible values (in most bioinformatic sequences), the emphasis lies not in setting fixed values in the matrix, but in designing parameterized models that fit the observed evolution of the input sequence as it is being aligned. See [[Models of DNA evolution]].

== Terminology ==

Line 96 ⟶ 121:
[[Category:Bioinformatics]]
[[Category:Matrices (mathematics)]]