Structural alignment: Difference between revisions

Rescuing 1 sources and tagging 0 as dead.) #IABot (v2.0.8.2) (BrownHairedGirl - 8948
m clean up, typo(s) fixed: Therefore → Therefore,
Line 21:
 
===Evaluating similarity===
Often the purpose of seeking a structural superposition is not so much the superposition itself as an evaluation of the similarity of two structures or of the confidence in a remote alignment.<ref name="casp11"/><ref name="Malmstrom" /><ref name="robetta"/> A subtle but important distinction from maximal structural superposition is the conversion of an alignment to a meaningful similarity score.<ref name="Mammoth" /><ref name="ZhangTMscore"/> Most methods output some sort of "score" indicating the quality of the superposition.<ref name="zemla" /><ref name="fischer"/><ref name="poleksic"/><ref name="Mammoth"/><ref name="ZhangTMscore"/> However, what one actually wants is ''not'' merely an ''estimated'' "Z-score" or an ''estimated'' E-value for seeing the observed superposition by chance; one wants the ''estimated'' E-value to be tightly correlated with the true E-value. Critically, even if a method's estimated E-value is precisely correct ''on average'', if the estimates have a high standard deviation, then the rank ordering of the relative similarities of a query protein to a comparison set will rarely agree with the "true" ordering.<ref name="Mammoth"/><ref name="ZhangTMscore"/>
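
The effect of estimator spread on rank ordering can be illustrated with a small simulation. This is only a sketch under stated assumptions, not a model of any published method: the "true" log E-values are drawn uniformly from an arbitrary range, the estimates add zero-mean Gaussian noise (so they are correct on average in log space), and the names <code>rank_agreement</code> and <code>noise_sd</code> are hypothetical.

```python
import random


def ranks(values):
    """Rank of each value (0 = smallest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order):
        result[index] = rank
    return result


def spearman(x, y):
    """Spearman rank correlation for tie-free data."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))


def rank_agreement(noise_sd, n=200, seed=0):
    """Rank correlation between true and noisy log E-values.

    The uniform range and the Gaussian noise model are assumptions
    made for illustration only.
    """
    rng = random.Random(seed)
    true_log_e = [rng.uniform(-20.0, 0.0) for _ in range(n)]
    # Zero-mean noise: each estimate is unbiased, but not precise.
    est_log_e = [t + rng.gauss(0.0, noise_sd) for t in true_log_e]
    return spearman(true_log_e, est_log_e)


tight = rank_agreement(noise_sd=0.1)  # low-variance estimator
loose = rank_agreement(noise_sd=6.0)  # same mean, high variance
```

Both estimators are unbiased in this toy setting, yet the high-variance one recovers the true ordering noticeably less well, which is the point made above.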
 
Different methods will superimpose different numbers of residues because they use different quality criteria and different definitions of "overlap"; some include only residues meeting multiple local and global superposition criteria, while others are more greedy, flexible, and promiscuous. A greater number of superposed atoms can mean more similarity, but it does not always produce the best E-value quantifying the unlikeliness of the superposition, and is thus not always as useful for assessing similarity, especially in remote homologs.<ref name="casp11"/><ref name="Malmstrom" /><ref name="robetta" /><ref name="skolnick" />
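
Why a larger superposed count need not be more significant can be seen with a deliberately crude chance model. The sketch below assumes each residue matches by chance independently with a fixed probability; this binomial model, the function name <code>chance_tail</code>, and the value 0.1 are all hypothetical and are not the calibration used by any of the cited methods.

```python
from math import comb


def chance_tail(n_residues, n_matched, p_match):
    """P(at least n_matched of n_residues match by chance).

    Toy binomial model: residues are treated as independent, which
    real structures are not.
    """
    return sum(
        comb(n_residues, k) * p_match ** k * (1 - p_match) ** (n_residues - k)
        for k in range(n_matched, n_residues + 1)
    )


# 30 matched residues in a 40-residue protein...
small_protein = chance_tail(40, 30, 0.1)
# ...versus 40 matched residues in a 300-residue protein.
large_protein = chance_tail(300, 40, 0.1)
```

Under these assumptions the second case superposes more residues but is far more likely to arise by chance, because the count must be judged against the size of the protein.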
Line 82:
| pmc= 2373724
|doi-access=free
}}</ref> approaches the alignment problem from a different objective than almost all other methods. Rather than trying to find an alignment that maximally superimposes the largest number of residues, it seeks the subset of the structural alignment least likely to occur by chance. To do this, it marks a local motif alignment with flags to indicate which residues simultaneously satisfy more stringent criteria: 1) local structure overlap, 2) regular secondary structure, 3) 3D superposition, and 4) same ordering in primary sequence. It converts the number of residues with high-confidence matches and the size of the protein into an expectation value (E-value) for the outcome occurring by chance. It excels at matching remote homologs, particularly structures generated by ab initio structure prediction to structure families such as SCOP, because it emphasizes extracting a statistically reliable sub-alignment rather than achieving the maximal sequence alignment or maximal 3D superposition.<ref name="Malmstrom" /><ref name="robetta">{{cite journal
|journal=Nucleic Acids Research
|year= 2004
Line 96:
}}</ref>
 
For every overlapping window of 7 consecutive residues, it computes the set of displacement-direction unit vectors between adjacent C-alpha atoms. All-against-all local motifs are compared based on the URMS score. These values become the pair alignment score entries for dynamic programming, which produces a seed pairwise residue alignment. The second phase uses a modified MaxSub algorithm: a single aligned pair of 7-residue windows, one from each protein, is used to orient the two full-length protein structures so as to maximally superimpose just these 7 C-alpha atoms; then, in this orientation, it scans for any additional aligned pairs that are close in 3D. It re-orients the structures to superimpose this expanded set and iterates until no more pairs coincide in 3D. This process is restarted for every 7-residue window in the seed alignment. The output is the maximal number of atoms found from any of these initial seeds. This statistic is converted to a calibrated E-value for the similarity of the proteins.
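
The first phase can be sketched as follows. This is a minimal illustration, not Mammoth's implementation: the window-to-window score <code>toy_score</code> is a rotation-invariant stand-in (it compares the bend between successive unit vectors) and is ''not'' the URMS score, the dynamic programming is plain global alignment with an assumed linear gap penalty, and all function names and the test chain are hypothetical. The second, MaxSub-style phase is not covered.

```python
import math
import random

WINDOW = 7  # residues per local motif, as in the text


def unit_vectors(ca):
    """Unit direction vectors between consecutive C-alpha coordinates."""
    vectors = []
    for (x0, y0, z0), (x1, y1, z1) in zip(ca, ca[1:]):
        dx, dy, dz = x1 - x0, y1 - y0, z1 - z0
        norm = math.sqrt(dx * dx + dy * dy + dz * dz)
        vectors.append((dx / norm, dy / norm, dz / norm))
    return vectors


def windows(ca, size=WINDOW):
    """One entry per overlapping window: its size - 1 unit vectors."""
    uv = unit_vectors(ca)
    return [uv[i:i + size - 1] for i in range(len(ca) - size + 1)]


def bend_profile(window):
    """Dot products of successive unit vectors (rotation invariant)."""
    return [sum(p * q for p, q in zip(u, v)) for u, v in zip(window, window[1:])]


def toy_score(wa, wb):
    """Stand-in similarity (at most 1.0); NOT the URMS score."""
    pa, pb = bend_profile(wa), bend_profile(wb)
    return 1.0 - sum(abs(p - q) for p, q in zip(pa, pb)) / len(pa)


def seed_alignment(score, gap=0.5):
    """Global alignment over a score matrix; gap penalty is an assumption."""
    n, m = len(score), len(score[0])
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    move = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        best[i][0], move[i][0] = -gap * i, "up"
    for j in range(1, m + 1):
        best[0][j], move[0][j] = -gap * j, "left"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            options = [
                (best[i - 1][j - 1] + score[i - 1][j - 1], "diag"),
                (best[i - 1][j] - gap, "up"),
                (best[i][j - 1] - gap, "left"),
            ]
            best[i][j], move[i][j] = max(options, key=lambda o: o[0])
    # Trace back from the corner to recover the aligned window pairs.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if move[i][j] == "diag":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif move[i][j] == "up":
            i -= 1
        else:
            j -= 1
    return pairs[::-1]


def toy_chain(n, seed=0):
    """Irregular made-up chain of n points (not real protein geometry)."""
    rng = random.Random(seed)
    point, chain = (0.0, 0.0, 0.0), []
    for _ in range(n):
        chain.append(point)
        point = tuple(c + rng.uniform(-1.0, 1.0) for c in point)
    return chain


chain_a = toy_chain(12)
# Same chain rotated 90 degrees about z and translated.
chain_b = [(-y + 5.0, x - 2.0, z + 1.0) for x, y, z in chain_a]
scores = [[toy_score(u, v) for v in windows(chain_b)] for u in windows(chain_a)]
pairs = seed_alignment(scores)
```

Because the stand-in score ignores rigid-body placement, the rotated copy aligns window for window with the original; a real URMS comparison would instead find the optimal rotation between the two sets of unit vectors before scoring.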
 
Mammoth makes no attempt to re-iterate the initial alignment or extend the high-quality subset. Therefore, the seed alignment it displays can't be fairly compared to DALI or TM-align alignments, as it was formed simply as a heuristic to prune the search space. (It can be used if one wants an alignment based solely on local structure-motif similarity, agnostic of long-range rigid-body atomic alignment.) Because of that same parsimony, it is well over ten times faster than DALI, CE and TM-align.<ref name="foldclass">{{cite journal
|title=Efficient SCOP-fold classification and retrieval using index-based protein substructure alignments
|authors=Pin-Hao Chi, Bin Pang, Dmitry Korkin, Chi-Ren Shyu
Line 109:
|pmid=19667079
|doi-access=free
}}</ref> It is often used in conjunction with these slower tools to pre-screen large databases, extracting just the structures with the best E-values for more exhaustive superposition or expensive calculations.
<ref name="grishin04">{{cite journal
|journal=BMC Bioinformatics
Line 134:
|pmid=15860561
|doi-access=free
}}</ref>
 
It has been particularly successful at analyzing "decoy" structures from ab initio structure prediction.<ref name="casp11">{{cite journal
Line 142:
|year= 2016
|volume=84
|pages=(Suppl 1):15–19
| doi=10.1002/prot.25005
|pmid=26857434
|pmc=5479680
|doi-access=free
}}</ref><ref name="Malmstrom" /><ref name="robetta" /> These decoys are notorious for getting local fragment motif structure correct and forming some kernels of correct 3D tertiary structure, but getting the full-length tertiary structure wrong. In this twilight remote-homology regime, Mammoth's E-values for the CASP<ref name="casp11" /> protein structure prediction evaluation have been shown to be significantly more correlated with human ranking than SSAP or DALI.<ref name="Mammoth" /> Mammoth's ability to extract the multi-criteria partial overlaps with proteins of known structure and rank these with proper E-values, combined with its speed, facilitates scanning vast numbers of decoy models against the PDB database to identify the most likely correct decoys based on their remote homology to known proteins.
<ref name="Malmstrom">{{cite journal
|title=Superfamily Assignments for the Yeast Proteome through Integration of Structure Prediction with the Gene Ontology