Multiple sequence alignment: Difference between revisions

Changing short description from "Alignment of more than two molecular sequence" to "Alignment of more than two molecular sequences"
Citation bot (talk | contribs)
Add: s2cid. | Use this bot. Report bugs. | Suggested by Abductive | #UCB_toolbar
Line 3:
'''Multiple sequence alignment''' ('''MSA''') may refer to the process or the result of [[sequence alignment]] of three or more [[biological sequence]]s, generally [[protein]], [[DNA]], or [[RNA]]. In many cases, the input set of query sequences is assumed to have an [[evolutionary]] relationship by which the sequences share a lineage and are descended from a common ancestor. From the resulting MSA, sequence [[homology (biology)|homology]] can be inferred and [[molecular phylogeny|phylogenetic analysis]] can be conducted to assess the sequences' shared evolutionary origins. Visual depictions of the alignment, as in the image at right, illustrate [[mutation]] events such as point mutations (single [[amino acid]] or [[nucleotide]] changes), which appear as differing characters in a single alignment column, and insertion or deletion mutations ([[indel]]s or gaps), which appear as hyphens in one or more of the sequences in the alignment. Multiple sequence alignment is often used to assess sequence [[conservation (genetics)|conservation]] of [[protein domain]]s, [[tertiary structure|tertiary]] and [[secondary structure|secondary]] structures, and even individual amino acids or nucleotides.
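
The column-wise reading described above can be sketched in a few lines of Python. This is only an illustration: the function name <code>column_summary</code>, the three labels, and the toy alignment are assumptions made for the example, and the conservation score is a simple majority fraction rather than any standard measure.

```python
from collections import Counter

def column_summary(alignment):
    """Classify each column of an alignment given as equal-length strings.

    Returns one (kind, majority_fraction) tuple per column, where kind is
    'conserved', 'substitution', or 'indel' (hypothetical labels).
    """
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all aligned rows must have the same length")
    summary = []
    for column in zip(*alignment):
        residues = [c for c in column if c != "-"]
        counts = Counter(residues)
        if len(residues) < len(column):
            kind = "indel"          # at least one hyphen (gap) in the column
        elif len(counts) == 1:
            kind = "conserved"      # every row carries the same character
        else:
            kind = "substitution"   # differing characters, no gaps
        # Fraction of rows that share the most common non-gap character.
        top = counts.most_common(1)[0][1] if counts else 0
        summary.append((kind, top / len(column)))
    return summary

# Toy alignment (invented data): one substitution column and one indel column.
toy = ["ACG-T",
       "ACGAT",
       "ATG-T"]
kinds = [kind for kind, _ in column_summary(toy)]
```

On the toy input, the second column is reported as a substitution and the fourth as an indel; the remaining columns are conserved.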
 
Computational [[algorithm]]s are used to produce and analyse MSAs because manually processing sequences of biologically relevant length is difficult and intractable. MSAs require more sophisticated methodologies than [[sequence alignment|pairwise alignment]] because they are more [[Computational complexity theory|computationally complex]]. Most multiple sequence alignment programs use [[heuristic]] methods rather than [[global optimization]] because identifying the optimal alignment between more than a few sequences of moderate length is prohibitively computationally expensive. On the other hand, heuristic methods generally give no guarantees on solution quality, and heuristic solutions have often been shown to fall far below the optimal solution on benchmark instances.<ref name="thompson2011">{{cite journal | doi = 10.1371/journal.pone.0018093|vauthors= Thompson JD, Linard B, Lecompte O, Poch O | year = 2011 | title = A comprehensive benchmark study of multiple sequence alignment methods: current challenges and future perspectives | journal = PLOS ONE | volume = 6 | issue = 3| pages = e18093| pmid = 21483869| pmc = 3069049|bibcode= 2011PLoSO...618093T |doi-access= free }}</ref><ref name="nuin2006" /><ref name="hosseininasab">{{cite journal | doi = 10.1287/ijoc.2019.0937 |vauthors=Hosseininasab A, van Hoeve WJ | year = 2019 | title = Exact Multiple Sequence Alignment by Synchronized Decision Diagrams | journal = INFORMS Journal on Computing |s2cid=109937203 }}</ref>
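
To give a rough sense of why exact optimization becomes prohibitive, the sketch below counts the work done by a naive dynamic program that generalizes the pairwise alignment matrix to one dimension per sequence. It assumes that formulation only (it is not how any particular program works), and the function name <code>exact_dp_cost</code> is hypothetical. Each lattice cell has at most <code>2**N - 1</code> predecessors, one for every non-empty choice of which sequences advance.

```python
from math import prod

def exact_dp_cost(lengths):
    """Return (cells, upper bound on cell updates) for a naive N-dimensional
    dynamic-programming lattice over sequences of the given lengths."""
    cells = prod(n + 1 for n in lengths)    # one axis of size L_i + 1 per sequence
    predecessors = 2 ** len(lengths) - 1    # non-empty subsets of sequences that advance
    return cells, cells * predecessors

# Sequence length 100 is an arbitrary illustrative choice.
for n_seqs in (2, 3, 5, 10):
    cells, updates = exact_dp_cost([100] * n_seqs)
    print(f"{n_seqs:>2} sequences: {cells:.3e} cells, <= {updates:.3e} updates")
```

The cell count is multiplied by roughly the sequence length for every sequence added, which is why heuristics dominate in practice.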
 
==Problem statement==
Line 61:
Progressive alignment methods are efficient enough to implement on a large scale for many (hundreds to thousands of) sequences. Progressive alignment services are commonly available on publicly accessible web servers, so users need not install the applications of interest locally. The most popular progressive alignment method has been the [[Clustal]] family,<ref name="higgins1988">{{cite journal |author=[[Desmond G. Higgins|Higgins DG]], Sharp PM |year=1988 |title=CLUSTAL: a package for performing multiple sequence alignment on a microcomputer |journal=Gene |volume=73 |issue=1 |pages=237–244 |doi=10.1016/0378-1119(88)90330-7 |pmid=3243435}}</ref> especially the weighted variant ClustalW,<ref name="thomson1994">{{cite journal | vauthors = Thompson JD, Higgins DG, Gibson TJ | title = CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice | journal = Nucleic Acids Res. | volume = 22 | issue = 22 | pages = 4673–80 | date = November 1994 | pmid = 7984417 | pmc = 308517 | doi = 10.1093/nar/22.22.4673 }}</ref> to which access is provided by a large number of web portals including [https://web.archive.org/web/20060705082556/http://align.genome.jp/ GenomeNet], [http://www.ebi.ac.uk/Tools/clustalw2/index.html EBI], and [http://www.ch.embnet.org/software/ClustalW.html EMBNet]. Different portals or implementations can vary in user interface and make different parameters accessible to the user. ClustalW is used extensively for phylogenetic tree construction, in spite of the author's explicit warnings that unedited alignments should not be used in such studies, and as input for [[protein structure prediction]] by homology modeling. The current version of the Clustal family is ClustalW2. EMBL-EBI announced that ClustalW2 would be retired in August 2015 and recommends Clustal Omega, which uses seeded guide trees and HMM profile-profile techniques for protein alignments.
EMBL-EBI also offers other MSA tools for progressive DNA alignment, one of which is [[MAFFT]] (Multiple Alignment using Fast Fourier Transform).<ref name=EMBL-EBI>{{cite web|title=EMBL-EBI-ClustalW2-Multiple Sequence Alignment|url=http://www.ebi.ac.uk/Tools/msa/clustalw2/|website=CLUSTALW2}}</ref>
 
Another common progressive alignment method, [[T-Coffee]],<ref name="notredame2000">{{cite journal | vauthors = Notredame C, Higgins DG, Heringa J | title = T-Coffee: A novel method for fast and accurate multiple sequence alignment | journal = J. Mol. Biol. | volume = 302 | issue = 1 | pages = 205–17 | date = September 2000 | pmid = 10964570 | doi = 10.1006/jmbi.2000.4042 | s2cid = 10189971 }}</ref> is slower than Clustal and its derivatives but generally produces more accurate alignments for distantly related sequence sets. T-Coffee calculates pairwise alignments by combining the direct alignment of the pair with indirect alignments that align each sequence of the pair to a third sequence. It uses the output from Clustal as well as from the local alignment program LALIGN, which finds multiple regions of local alignment between two sequences. The resulting alignment and phylogenetic tree are used as a guide to produce new and more accurate weighting factors.
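
The idea of combining a direct alignment with indirect alignments through a third sequence can be illustrated with a simplified sketch. It is not T-Coffee's actual code: the data layout, the function names <code>oriented</code> and <code>extend_library</code>, and the toy weights are assumptions made for the example. Each residue pair's direct weight is increased, for every third sequence, by the smaller of the two weights along the indirect path.

```python
from collections import defaultdict

def oriented(library, x, y):
    """Return {(pos_in_x, pos_in_y): weight}, flipping the stored pair if needed."""
    if (x, y) in library:
        return library[(x, y)]
    return {(j, i): w for (i, j), w in library.get((y, x), {}).items()}

def extend_library(library, names):
    """Add support from indirect alignments through each third sequence."""
    extended = {}
    for a_idx, a in enumerate(names):
        for b in names[a_idx + 1:]:
            scores = defaultdict(float)
            for pair, w in oriented(library, a, b).items():
                scores[pair] += w                      # direct alignment of the pair
            for c in names:
                if c in (a, b):
                    continue
                # Index the c-to-b matches by their position in c.
                by_k = defaultdict(list)
                for (k, j), w in oriented(library, c, b).items():
                    by_k[k].append((j, w))
                # a_i ~ c_k and c_k ~ b_j together support a_i ~ b_j.
                for (i, k), w1 in oriented(library, a, c).items():
                    for j, w2 in by_k.get(k, []):
                        scores[(i, j)] += min(w1, w2)
            extended[(a, b)] = dict(scores)
    return extended

# Toy library (invented weights): residue positions are 0-based indices.
library = {
    ("A", "B"): {(0, 0): 1.0},
    ("A", "C"): {(0, 0): 0.8, (1, 1): 0.8},
    ("B", "C"): {(0, 0): 0.6, (1, 1): 0.6},
}
extended = extend_library(library, ["A", "B", "C"])
```

In the toy data, positions 1 of A and B are never aligned directly, yet the extended library gives that pair a non-zero weight because both align to position 1 of C.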
 
Because progressive methods are heuristics that are not guaranteed to converge to a global optimum, alignment quality can be difficult to evaluate and the true biological significance of the resulting alignments can be obscure. A semi-progressive method that improves alignment quality and does not use a lossy heuristic while still running in [[polynomial time]] has been implemented in the program [http://faculty.cs.tamu.edu/shsze/psalign/ PSAlign].<ref name="sze2006">{{cite journal |vauthors=Sze SH, Lu Y, Yang Q |year=2006 |title=A polynomial time solvable formulation of multiple sequence alignment |journal=J Comput Biol |volume=13 |issue=2 |pages=309–319 |doi=10.1089/cmb.2006.13.309 |pmid=16597242}}</ref>