Dot plot (bioinformatics)

One way to visualize the similarity between two protein or nucleic acid sequences is to use a similarity matrix, known as a dot plot. Dot plots were introduced by Gibbs and McIntyre in 1970<ref name="gibbs-mcintyre"/> and are two-dimensional matrices that have the sequences being compared along the vertical and horizontal axes. For a simple visual representation of the similarity between two sequences, individual cells in the matrix are shaded black if the residues at that position are identical, so that matching sequence segments appear as diagonal lines across the matrix.
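
The shading rule described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the method of Gibbs and McIntyre: the function names <code>dot_matrix</code> and <code>render</code> and the two short sequences are made up for the example, and a text grid stands in for the shaded matrix.

```python
def dot_matrix(seq_x, seq_y):
    """Return a matrix whose cell [row][col] is True when seq_y[row] == seq_x[col]."""
    return [[a == b for b in seq_x] for a in seq_y]


def render(matrix, seq_x, seq_y):
    """Draw the matrix as text: '*' for a shaded (matching) cell, '.' otherwise."""
    lines = ["  " + seq_x]  # horizontal axis label
    for residue, row in zip(seq_y, matrix):
        lines.append(residue + " " + "".join("*" if cell else "." for cell in row))
    return "\n".join(lines)


# Two short, made-up sequences used only for illustration.
seq_x = "GATTACA"  # horizontal axis
seq_y = "GATCACA"  # vertical axis
matrix = dot_matrix(seq_x, seq_y)
print(render(matrix, seq_x, seq_y))
```

In this sketch the single substitution between the two sequences shows up as one unshaded cell on the main diagonal, while the remaining off-diagonal marks are chance matches.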
 
== Interpretation ==
Some idea of the similarity of the two sequences can be gleaned from the number and length of matching segments shown in the matrix. Identical sequences produce an unbroken line along the main diagonal of the matrix. Insertions and deletions between the sequences disrupt this diagonal. Regions of local similarity and repetitive sequences give rise to further diagonal matches in addition to the main diagonal. Because the number of distinct residues is small, many isolated cells also match by chance and appear as background noise. One way of reducing this noise is to shade only runs, or '[[tuple]]s', of matching residues; for example, a tuple of 3 corresponds to three residues in a row. This is effective because the probability of matching three residues in a row by chance is much lower than that of a single-residue match.

>CY003854.1 Influenza A virus (A/mallard/Alberta/77/1977(H2N3)) segment 1, complete sequence
AGCGAAAGCAGGTCAAATATATTCAATATGGAGAGAATAAAAGAACTAAGAGATCTAATGTCACAGTCCC
GCACCCGCGAGATACTCACCAAAACCACTGTGGACCACATGGCCATAATCAAAAAATACACATCAGGAAG
GCAAGAGAAGAACCCCGCACTCAGGATGAAGTGGATGATGGCAATGAAATATCCAATTACTGCAGATAAG
AGAATAATGGAAATGATTCCTGAAAGGAATGAACAAGGACAAACCCTCTGGAGCAAAACAAACGATGCCG
GCTCAGACCGAGTGATGGTATCACCTCTGGCCGTGACATGGTGGAATAGGAATGGACCAACAACAAGTAC
AGTTCACTACCCAAAGGTATATAAAACTTATTTCGAAAAAGTCGAAAGGTTGAAACACGGGACCTTTGGC
CCCGTCCACTTCAGAAATCAAGTTAAGATAAGACGGAGGGTTGACATAAACCCTGGCCACGCAGACCTCA
GTGCCAAAGAGGCACAGGATGTAATCATGGAAGTTGTTTTCCCAAATGAAGTGGGAGCTAGAATACTAAC
ATCGGAGTCACAACTGACAATAACAAAAGAGAAAAAGGAAGAACTCCAGGACTGTAAAATTGCCCCCTTG
ATGGTAGCATACATGCTAGAAAGAGAGTTGGTCCGCAAAACGAGGTTCCTCCCAGTGGCTGGTGGAACAA
GCAGTGTCTATATTGAGGTGTTGCATTTAACCCAGGGGACATGCTGGGAGCAGATGTACACTCCAGGAGG
GGAAGTGAGAAATGATGATGTTGACCAAAGCTTGATTATCGCTGCCAGGAACATAGTAAGAAGAGCAACG
GTATCAGCAGACCCACTAGCATCTCTATTGGAGATGTGCCACAGCACACAGATTGGGGGAATAAGGATGG
TAGACATCCTTCGGCAAAATCCAACAGAGGAACAAGCCGTGGACATATGCAAGGCAGCAATGGGCTTGAG
GATTAGCTCATCTTTCAGCTTTGGTGGATTCACTTTCAAAAGAACAAGCGGGTCGTCAGTTAAGAGAGAA
GAAGAAGTGCTTACGGGCAACCTTCAAACATTGAAAATAAGAGTACATGAGGGGTATGAAGAGTTCACAA
TGGTTGGGAGAAGAGCAACAGCTATTCTAAGAAAGGCAACCAGGAGATTGATCCAGCTAATAGTAAGTGG
GAGAGACGAGCAGTCAATTGCTGAAGCAATAATTGTGGCCATGGTATTTTCACAAGAGGATTGCATGATC
AAGGCAGTTCGGGGTGATCTGAACTTTGTCAATAGGGCAAATCAGCGACTGAACCCCATGCATCAACTCT
TGAGACACTTCCAAAAGGATGCAAAAGTGCTTTTCCAAAACTGGGGAATTGAACCCATTGACAATGTGAT
GGGAATGATCGGAATATTGCCCGACATGACCCCAAGTACTGAGATGTCGCTGAGGGGGATAAGAGTCAGC
AAAATGGGAGTAGATGAATACTCCAGCACAGAAAGGGTGGTGGTGAGCATTGACCGATTTTTAAGGGTTC
GGGATCAACGGGGAAACGTACTATTGTCACCCGAAGAAGTTAGCGAGACACAAGGAACGGAGAAACTGAC
AATAACTTATTCGTCATCAATGATGTGGGAGATCAATGGTCCTGAGTCGGTGTTGGTCAATACTTATCAA
TGGATCATCAGGAACTGGGAGACTGTGAAAATTCAATGGTCACAGGATCCCACAATGTTATATAATAAGA
TGGAATTCGAGCCATTTCAGTCTCTGGTCCCTAAGGCAGCCAGAGGTCAATACAGCGGATTCGTGAGGAC
ACTGTTCCAGCAGATGCGGGATGTGCTTGGAACATTTGACACTGTTCAGATAATAAAACTTCTTCCCTTT
GCTGCTGCTCCACCAGAACAGAGTAGGATGCAGTTCTCCTCCCTGACTGTGAATGTGAGAGGATCAGGAA
TGAGGATACTGGTAAGAGGCAATTCTCCAGTGTTCAATTACAACAAGGCCACCAAGAGGCTTACAGTCCT
TGGAAAAGATGCAGGTGCATTGACCGAAGATCCAGATGAAGGCACAGCTGGAGTGGAGTCTGCTGTTCTA
AGAGGATTCCTCATTTTGGGCAAAGAAGACAAGAGATATGGCCCAGCATTAAGCATCAATGAGCTGAGCA
ATCTTGCAAAAGGAGAGAAGGCTAATGTGCTAATTGGGCAAGGAGACGTGGTGTTGGTAATGAAACGGAA
ACGGGACTCTAGCATACTTACTGACAGCCAGACAGCGACCAAAAGAATTCGGATGGCCATCAATTAGTGT
CGAATTGTTTAAAAACGACCTTGTTTCTACT
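
The tuple filtering described in the Interpretation section can be sketched as follows. The helper names <code>single_matches</code> and <code>tuple_matches</code>, the parameter <code>k</code>, and the short sequences are assumptions made for this illustration. As a rough guide, if the four bases of a nucleic acid were independent and equally frequent (a simplifying assumption), a single cell would match by chance with probability 1/4 and a run of three with probability 1/64.

```python
def single_matches(seq_x, seq_y):
    """Return every (row, col) cell where the two residues are identical."""
    return {(r, c) for r, a in enumerate(seq_y) for c, b in enumerate(seq_x) if a == b}


def tuple_matches(seq_x, seq_y, k=3):
    """Return the (row, col) cells kept after k-tuple filtering.

    A cell is kept only if it lies on a diagonal run of at least k
    consecutive identical residues.
    """
    kept = set()
    for row in range(len(seq_y) - k + 1):
        for col in range(len(seq_x) - k + 1):
            if seq_y[row:row + k] == seq_x[col:col + k]:
                # Shade every cell covered by this matching k-tuple.
                for offset in range(k):
                    kept.add((row + offset, col + offset))
    return kept


# Short, made-up sequences used only for illustration.
seq_x = "GATTACA"
seq_y = "GATCACA"
unfiltered = single_matches(seq_x, seq_y)
filtered = tuple_matches(seq_x, seq_y, k=3)
print(len(unfiltered), "cells shaded without filtering")
print(len(filtered), "cells shaded with k=3")
```

Larger values of <code>k</code> remove more noise but can also hide short genuine matches, so the choice of tuple length is a trade-off.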
 
 
Dot plots compare two sequences by placing one sequence along the x-axis and the other along the y-axis of a plot. Wherever the residues of the two sequences match, a dot is drawn at the corresponding position. The sequences can be written forwards or backwards, but the sequences on both axes must be written in the same direction; the direction of the sequences on the axes determines the direction of the lines on the dot plot. Once the dots have been plotted, they combine to form lines. The more similar the sequences, the more closely the dots follow a single straight diagonal, like the graph of a [[direct relationship]]. This relationship is affected by sequence features such as frame shifts, direct repeats, and inverted repeats. Frame shifts include insertions, deletions, and mutations. The presence of one or more of these features causes multiple lines to be plotted, in a configuration that depends on the features present in the sequences. Low-complexity regions produce a very different result on the dot plot. [[Low-complexity regions]] are regions of a sequence made up of only a few different amino acids, which causes redundancy within that region. These regions are typically found around the diagonal, and may or may not appear as a square in the middle of the dot plot.
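
As an illustration of how an insertion breaks the diagonal into separate lines, the sketch below counts matching 3-tuples on each diagonal of the plot. The function name <code>diagonal_counts</code> and the sequences are hypothetical; the horizontal sequence is simply the vertical one with three extra residues inserted in the middle.

```python
from collections import Counter


def diagonal_counts(seq_x, seq_y, k=3):
    """Count k-tuple matches on each diagonal, keyed by offset = col - row."""
    counts = Counter()
    for row in range(len(seq_y) - k + 1):
        for col in range(len(seq_x) - k + 1):
            if seq_y[row:row + k] == seq_x[col:col + k]:
                counts[col - row] += 1
    return counts


# Made-up sequences: seq_x is seq_y with "TTT" inserted after the first six residues.
seq_y = "ACGTGCATCAGG"
seq_x = "ACGTGC" + "TTT" + "ATCAGG"
counts = diagonal_counts(seq_x, seq_y)
for offset in sorted(counts):
    print(f"offset {offset:+d}: {counts[offset]} matching 3-tuples")
```

Matches before the insertion lie on the main diagonal (offset 0), while matches after it lie on a parallel diagonal shifted by the length of the insertion, which is what appears on the plot as a broken, displaced line.
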
>CY003886.1 Influenza A virus (A/mallard duck/ALB/376/1985(H2N3)) segment 1, complete sequence
AGCGAAAGCAGGTCAAATATATTCAATATGGAGAGAATAAAAGAACTAAGAGATCTAATGTCACAGTCCC
GCACTCGCGAGATACTCACCAAAACCACTGTGGACCATATGGCCATAATCAAAAAATACACATCAGGAAG
GCAAGAGAAGAATCCCGCACTCAGGATGAAATGGATGATGGCAATGAAATATCCAATTACAGCGGATAAG
AGGATAATGGAGATGATTCCCGAGAGGAATGAACAAGGGCAAACCCTCTGGAGCAAAACAAATGATGCCG
GCTCAGACCGAGTGATGGTATCACCTCTGGCTGTGACATGGTGGAATAGGAATGGACCAACAACAAGTAC
AATTCACTACCCAAAGGTATATAAAACCTATTTCGAAAAGGTCGAAAGGTTAAAACATGGGACCTTTGGC
CCCGTTCACTTCAGGAATCAAGTTAAGATAAGACGGAGAGTTGACATAAACCCTGGACATGCAGACCTCA
GTGCCAAAGAGGCACAGGATGTAATCATGGAAGTTGTTTTCCCAAATGAAGTGGGGGCCAGGATATTAAC
ATCGGAGTCACAGCTGACAATAACAAAAGAGAAAAAGGAAGAACTCCAAGATTGTAAAATTGCCCCCTTG
ATGGTAGCATACATGCTAGAAAGAGAGTTAGTCCGCAAAACGAGGTTCCTCCCAGTGGCTGGTGGAACAA
GCAGTGTTTATATTGAGGTGTTGCATTTGACCCAGGGAACATGCTGGGAACAAATGTACACTCCAGGAGG
GGAAGTGAGAAATGATGATGTTGACCAAAGCTTAATTATCGCTGCCAGGAATATAGTAAGAAGAGCAACG
GTATCAGCAGACCCACTAGCGTCTCTATTGGAGATGTGCCACAGCACACAGATTGGTGGAATAAGGATGG
TAGACATCCTTAGGCAGAATCCAACAGAGGAACAAGCCGTGGATATATGCAAGGCGGCAATGGGCTTGAG
GATTAGCTCATCTTTCAGCTTCGGTGGATTCACTTTTAAAAGAACAAGTGGGTCGTCAGTCAAAAGAGAA
GAAGAAGTGCTTACGGGCAACCTTCAAACACTGAAAATAAGAGTGCATGAGGGGTATGAAGAATTCACAA
TGGTTGGGAGAAGAGCAACAGCTATTCTCAGGAAGGCAACCAGGAGATTGATTCAGCTAATAGTCAGTGG
GAGAGATGAACAGTCAATTGCTGAAGCAATAATTGTAGCTATGGTATTTTCACAAGAGGATTGCATGATC
AAGGCAGTTCGGGGTGATCTGAACTTTGTCAATAGAGCAAACCAGCGACTGAACCCCATGCATCAACTCT
TGAGACATTTCCAAAAGGATGCAAAAGTGCTTTTCCAAAATTGGGGAATTGAACCCATTGACAATGTGAT
GGGAATGATCGGAATACTACCCGACATGACCCCAAGTACTGAGACGTCATTGAGAGGGATAAGAGTCAGC
AAAATGGGAGTGGATGAATACTCCAGCACAGAGAGAGTGGTGGTGAGCATTGACCGTTTTTTAAGGGTTC
GGGATCAACGGGGAAACGTACTATTGTCACCTGAAGAAGTCAGCGAGACGCAAGGGACGGAAAAGTTGAC
AATAACTTACTCATCATCAATGATGTGGGAGATCAATGGTCCTGAATCAGTGTTGGTCAATACTTACCAG
TGGATCATCAGAAACTGGGAGACTGTGAAAATTCAATGGTCACAGGATCCCACAATGTTGTACAATAAGA
TGGAATTCGAGCCATTTCAGTCTCTGGTCCCTAAGGCAGCTAGAGGTCAATACAGCGGATTCGTGAGGAC
GCTGTTCCAACAAATGCGGGATGTGCTTGGAACATTTGACACTGTTCAGATAATAAAACTTCTCCCCTTT
GCTGCTGCCCCACCAGAACAGAGTAGGATGCAGTTCTCCTCCTTGACTGTGAATGTAAGAGGATCAGGAA
TGAGGATACTGGTAAGAGGCAACTCTCCAGTGTTCAATTACAACAAGGCCACCAAGAGGCTTACAGTCCT
CGGGAAGGATGCAGGTGCATTAACTGAAGACCCAGATGAAGGCACAGCTGGAGTGGAATCTGCTGTTCTG
AGAGGATTCCTCATTTTGGGCAAAGAAGACAAGAGATATGGCCCAGCATTGAGCATCAATGAGCTGAGCA
ATCTTGCAAAAGGAGAGAAGGCTAATGTGCTAATTGGGCAAGGAGACGTGGTGTTGGTAATGAAACGGAA
ACGGGACTCTAGCATACTTACTGACAGCCAGACAGCGACCAAAAGGATTCGGATGGCCATCAATTAGTGT
CGAATTGTTTAAAAACGACCTTGTTTCTACT
 
==See also==