→See also: I added a link to a commonly used algorithm in gene sequence alignment (Smith-Waterman algorithm) |
m remove redundant URL |
Line 43:
Alignments are commonly represented both graphically and in text format. In almost all sequence alignment representations, sequences are written in rows arranged so that aligned residues appear in successive columns. In text formats, aligned columns containing identical or similar characters are indicated with a system of conservation symbols. As in the image above, an asterisk or pipe symbol marks a column in which the aligned residues are identical; other, less common symbols include a colon for conservative substitutions and a period for semiconservative substitutions. Many sequence visualization programs also use color to display information about the properties of the individual sequence elements; in DNA and RNA sequences, this equates to assigning each nucleotide its own color. In protein alignments, such as the one in the image above, color is often used to indicate amino acid properties to aid in judging the [[conservation (genetics)|conservation]] of a given amino acid substitution. For multiple sequences, the last row in each column is often the [[consensus sequence]] determined by the alignment; the consensus sequence is also often represented in graphical format with a [[sequence logo]] in which the size of each nucleotide or amino acid letter corresponds to its degree of conservation.<ref name=Schneider>{{cite journal| journal=Nucleic Acids Res | volume=18 | pages=6097–6100 | year=1990 |author1=Schneider TD |author2=Stephens RM | title=Sequence logos: a new way to display consensus sequences |pmid=2172928 |pmc=332411 |url=
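
The text representation described above can be illustrated with a minimal Python sketch. It assumes the alignment is given as equal-length strings with <code>-</code> as the gap character, marks only fully identical columns with an asterisk, and builds a simple majority-rule consensus that ignores gaps. The function names <code>conservation_line</code> and <code>consensus</code> are hypothetical, and the colon and period symbols are omitted because they require a substitution matrix.

```python
from collections import Counter

GAP = "-"  # assumed gap character


def _check_rows(rows):
    """Raise ValueError unless all aligned rows have the same length."""
    if len({len(row) for row in rows}) > 1:
        raise ValueError("aligned rows must all have the same length")


def conservation_line(rows):
    """Return '*' under columns whose residues are all identical, ' ' elsewhere."""
    _check_rows(rows)
    marks = []
    for column in zip(*rows):
        identical = len(set(column)) == 1 and column[0] != GAP
        marks.append("*" if identical else " ")
    return "".join(marks)


def consensus(rows):
    """Return a majority-rule consensus; gaps are ignored when counting."""
    _check_rows(rows)
    result = []
    for column in zip(*rows):
        counts = Counter(residue for residue in column if residue != GAP)
        # A column made only of gaps stays a gap in the consensus.
        result.append(counts.most_common(1)[0][0] if counts else GAP)
    return "".join(result)


if __name__ == "__main__":
    alignment = ["ACG-T", "ACGAT", "AGG-T"]
    for row in alignment:
        print(row)
    print(conservation_line(alignment))
    print(consensus(alignment))
```

Real alignment viewers typically apply more elaborate rules, for example reporting ambiguity codes or requiring a minimum frequency before a residue enters the consensus.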
Sequence alignments can be stored in a wide variety of text-based file formats, many of which were originally developed in conjunction with a specific alignment program or implementation. Most web-based tools support only a limited number of input and output formats, such as [[FASTA format]] and [[GenBank]] format, and their output is not easily editable. Several conversion programs that provide graphical and/or command-line interfaces are available{{Dead link|date=August 2009}}, such as [https://web.archive.org/web/20071024223546/http://bioweb.pasteur.fr/seqanal/interfaces/readseq.html READSEQ] and [[EMBOSS]]. Several programming packages also provide this conversion functionality, such as [[BioPython]], [[BioRuby]] and [[BioPerl]]. The [[SAM (file format)|SAM/BAM files]] use the CIGAR (Compact Idiosyncratic Gapped Alignment Report) string format to represent an alignment of a sequence to a reference by encoding a sequence of events (e.g. matches/mismatches, insertions, deletions).<ref>{{Cite web|url=https://samtools.github.io/hts-specs/SAMv1.pdf|title=Sequence Alignment/Map Format Specification}}</ref>
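
As an illustration of how a CIGAR string encodes alignment events, the following Python sketch splits a string into (length, operation) pairs and counts how many query and reference positions the alignment spans. The operation letters and which of them consume query or reference bases follow the SAM specification; the function names <code>parse_cigar</code> and <code>aligned_lengths</code> are hypothetical and not part of any library.

```python
import re

# Operation letters defined by the SAM specification.
CIGAR_PATTERN = re.compile(r"(\d+)([MIDNSHP=X])")
VALID_CIGAR = re.compile(r"(\d+[MIDNSHP=X])+")

# Operations that advance along the query (read) or along the reference.
CONSUMES_QUERY = set("MIS=X")
CONSUMES_REFERENCE = set("MDN=X")


def parse_cigar(cigar):
    """Split a CIGAR string into a list of (length, operation) tuples."""
    if not VALID_CIGAR.fullmatch(cigar):
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return [(int(length), op) for length, op in CIGAR_PATTERN.findall(cigar)]


def aligned_lengths(cigar):
    """Return (query_length, reference_length) covered by the alignment."""
    query_length = 0
    reference_length = 0
    for length, op in parse_cigar(cigar):
        if op in CONSUMES_QUERY:
            query_length += length
        if op in CONSUMES_REFERENCE:
            reference_length += length
    return query_length, reference_length


if __name__ == "__main__":
    # 8 aligned bases, a 2-base insertion, 4 aligned bases,
    # a 1-base deletion, then 3 aligned bases.
    print(parse_cigar("8M2I4M1D3M"))
    print(aligned_lengths("8M2I4M1D3M"))
```

Note that <code>M</code> covers both matches and mismatches, so a CIGAR string alone does not reveal sequence identity; the <code>=</code> and <code>X</code> operations distinguish the two when an aligner emits them.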
Line 97:
Progressive, hierarchical, or tree methods generate a multiple sequence alignment by first aligning the most similar sequences and then adding successively less related sequences or groups to the alignment until the entire query set has been incorporated into the solution. The initial tree describing the sequence relatedness is based on pairwise comparisons that may include heuristic pairwise alignment methods similar to [[FASTA]]. Progressive alignment results depend on the choice of "most related" sequences and can therefore be sensitive to inaccuracies in the initial pairwise alignments. Most progressive multiple sequence alignment methods additionally weight the sequences in the query set according to their relatedness, which reduces the likelihood of making a poor choice of initial sequences and thus improves alignment accuracy.
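
The ordering step of a progressive method can be sketched in Python as follows. This is a simplified illustration, not the procedure of any particular program: it uses the standard-library <code>difflib.SequenceMatcher</code> ratio as a stand-in for a pairwise alignment score, replaces the guide tree with a greedy linear order, and omits the alignment and sequence-weighting steps themselves. The function names <code>similarity</code> and <code>progressive_order</code> are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Crude pairwise similarity in [0, 1]; a stand-in for an alignment score."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def progressive_order(sequences):
    """Return sequence names in the order a progressive method might add them.

    `sequences` maps names to sequence strings. The most similar pair comes
    first; each later step adds the remaining sequence with the highest
    average similarity to the sequences already chosen.
    """
    names = list(sequences)
    if len(names) < 2:
        return names

    # Score every unordered pair once.
    scores = {
        frozenset(pair): similarity(sequences[pair[0]], sequences[pair[1]])
        for pair in combinations(names, 2)
    }

    # Seed the order with the most similar pair.
    first_pair = max(combinations(names, 2), key=lambda p: scores[frozenset(p)])
    order = list(first_pair)
    remaining = [name for name in names if name not in order]

    # Add successively less related sequences.
    while remaining:
        def mean_score(name):
            return sum(scores[frozenset((name, m))] for m in order) / len(order)

        best = max(remaining, key=mean_score)
        order.append(best)
        remaining.remove(best)
    return order


if __name__ == "__main__":
    example = {
        "d": "GGGGGGGG",
        "c": "ACGTTTTT",
        "b": "ACGTACGA",
        "a": "ACGTACGT",
    }
    print(progressive_order(example))
```

Because the seed pair is fixed before any other sequence is considered, an inaccurate pairwise score at this stage propagates into every later step, which is the sensitivity described above.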
Many variations of the [[Clustal]] progressive implementation<ref name=higgins>{{cite journal | journal=Gene | volume=73 | issue=1 | pages=237–44 | year=1988 | author=[[Desmond G. Higgins|Higgins DG]], Sharp PM | title=CLUSTAL: a package for performing multiple sequence alignment on a microcomputer | pmid=3243435 | doi = 10.1016/0378-1119(88)90330-7 }}</ref><ref name=thompson>{{cite journal | journal=Nucleic Acids Res | volume=22 | pages=4673–80 | year=1994 | author1=Thompson JD| author2-link=Desmond G. Higgins |author2= Higgins DG|author3= Gibson TJ. | title=CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice | pmid=7984417 |pmc=308517 |url=
===Iterative methods===
Line 159:
A more complete list of available software, categorized by algorithm and alignment type, is available at [[sequence alignment software]]. Common software tools used for general sequence alignment tasks include ClustalW2<ref>{{cite web|url=http://www.ebi.ac.uk/Tools/msa/clustalw2/|title=ClustalW2 < Multiple Sequence Alignment < EMBL-EBI|last=EMBL-EBI|website=www.EBI.ac.uk|access-date=12 June 2017}}</ref> and T-Coffee<ref>[https://web.archive.org/web/20080918022531/http://tcoffee.vital-it.ch/cgi-bin/Tcoffee/tcoffee_cgi/index.cgi T-coffee]</ref> for alignment, and BLAST<ref>{{cite web|url=http://blast.ncbi.nlm.nih.gov/Blast.cgi|title=BLAST: Basic Local Alignment Search Tool|website=blast.ncbi.nlm.NIH.gov|access-date=12 June 2017}}</ref> and FASTA3x<ref>{{cite web|url=http://fasta.bioch.virginia.edu/fasta_www2/fasta_list2.shtml|title=UVA FASTA Server|website=fasta.bioch.Virginia.edu|access-date=12 June 2017}}</ref> for database searching. Commercial tools such as [[DNASTAR|DNASTAR Lasergene]], [[Geneious]], and [[PatternHunter]] are also available. Tools annotated as performing [http://edamontology.org/operation_0292 sequence alignment] are listed in the [https://bio.tools/?page=1&function=%22Sequence%20alignment%22&sort=score bio.tools] registry.
Alignment algorithms and software can be directly compared to one another using a standardized set of [[Benchmark (computing)|benchmark]] reference multiple sequence alignments known as BAliBASE.<ref name=thompson2>{{cite journal | journal=Bioinformatics | volume=15 | pages=87–8 | year=1999 |author1=Thompson JD |author2=Plewniak F |author3=Poch O | title=BAliBASE: a benchmark alignment database for the evaluation of multiple alignment programs | url=http://bioinformatics.oxfordjournals.org/cgi/pmidlookup?view=long&pmid=10068696 | pmid=10068696 | doi = 10.1093/bioinformatics/15.1.87 | issue=1 | doi-access=free }}</ref> The data set consists of structural alignments, which can be considered a standard against which purely sequence-based methods are compared. The relative performance of many common alignment methods on frequently encountered alignment problems has been tabulated and selected results published online at BAliBASE.<ref>[https://web.archive.org/web/20121130084356/http://bips.u-strasbg.fr/fr/Products/Databases/BAliBASE/prog_scores.html BAliBASE]</ref><ref name=thompson3>{{cite journal | journal=Nucleic Acids Res | volume=27 | pages=2682–90 | year=1999 |author1=Thompson JD |author2=Plewniak F |author3=Poch O. | title=A comprehensive comparison of multiple sequence alignment programs | url=
==See also==