CYK algorithm: Difference between revisions

m Disambiguating links to John Cocke (disambiguation) (link changed to John Cocke (computer scientist)) using DisamAssist.
Citation bot
Removed URL that duplicated identifier.
 
(22 intermediate revisions by 15 users not shown)
Line 1:
{{Short description|Parsing algorithm for context-free grammars}}
{{Redirect|CYK||Cyk (disambiguation)}}
{{Infobox algorithm
|name=Cocke–Younger–Kasami algorithm (CYK)
|class=[[Parsing]] with [[context-free grammar]]s
|data=[[String (computer science)|String]]
|time=<math>\mathcal{O}\left( n^3 \cdot \left| G \right| \right)</math>, where:
* <math>n</math> is length of the string
* <math>|G|</math> is the size of the CNF grammar
}}
 
In [[computer science]], the '''Cocke–Younger–Kasami algorithm''' (alternatively called '''CYK''', or '''CKY''') is a [[parsing]] [[algorithm]] for [[context-free grammar]]s published by Itiroo Sakai in 1961.<ref>{{cite book |last1=Grune |first1=Dick |title=Parsing techniques : a practical guide |date=2008 |publisher=Springer |location=New York |page=579 |isbn=978-0-387-20248-8 |edition=2nd}}</ref><ref>Itiroo Sakai, “Syntax in universal translation”. In Proceedings 1961 International Conference on Machine Translation of Languages and Applied Language Analysis, Her Majesty’s Stationery Office, London, pp. 593–608, 1962.</ref> The algorithm is named after some of its rediscoverers: [[John Cocke (computer scientist)|John Cocke]], Daniel Younger, [[Tadao Kasami]], and [[Jacob T. Schwartz]]. It employs [[bottom-up parsing]] and [[dynamic programming]].
The standard version of CYK operates only on context-free grammars given in [[Chomsky normal form]] (CNF). However, any context-free grammar may be algorithmically transformed (after convention) into a CNF grammar expressing the same language {{harv|Sipser|1997}}.
The importance of the CYK algorithm stems from its high efficiency in certain situations. Using [[Big O notation|big ''O'' notation]], the [[Analysis of algorithms|worst case running time]] of CYK is <math>\mathcal{O}\left( n^3 \cdot \left| G \right| \right)</math>, where <math>n</math> is the length of the parsed string and <math>\left| G \right|</math> is the size of the CNF grammar <math>G</math> {{harv|Hopcroft|Ullman|1979|p=140}}. This makes it one of the most efficient {{Citation needed|reason=cubic time does not seem efficient at all; other algorithms claim linear execution time|date=August 2023}} parsing algorithms in terms of worst-case [[asymptotic complexity]], although other algorithms exist with better average running time in many practical scenarios.
 
==Standard form==
 
The [[dynamic programming]] algorithm requires the context-free grammar to be rendered into [[Chomsky normal form]] (CNF), because it tests for possibilities to split the current sequence into two smaller sequences. Any context-free grammar that does not generate the empty string can be represented in CNF using only [[Formal grammar#The syntax of grammars|production rules]] of the forms <math>A\rightarrow \alpha</math> and <math>A\rightarrow B C</math>; to allow for the empty string, one can explicitly allow <math>S\to \varepsilon</math>, where <math>S</math> is the start symbol.<ref>{{Cite book |last=Sipser |first=Michael |title=Introduction to the theory of computation |date=2006 |publisher=Thomson Course Technology |isbn=0-534-95097-3 |edition=2nd |location=Boston |at=Definition 2.8 |oclc=58544333}}</ref>
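
The rule forms above can be checked mechanically. The following Python sketch is illustrative only: the function name <code>is_cnf</code>, the representation of a rule as a <code>(head, body)</code> pair, and the sample grammar are assumptions made for this example. It tests only the three forms named in the text and does not enforce any further restrictions that some definitions of CNF add.

```python
def is_cnf(rules, nonterminals, start):
    """Return True if every rule is A -> a, A -> B C, or start -> epsilon.

    rules:        iterable of (head, body) pairs, body being a tuple of symbols
    nonterminals: set of nonterminal symbols; anything else counts as a terminal
    start:        the start symbol
    """
    for head, body in rules:
        if head not in nonterminals:
            return False
        if len(body) == 1 and body[0] not in nonterminals:
            continue  # A -> a: a single terminal
        if len(body) == 2 and all(sym in nonterminals for sym in body):
            continue  # A -> B C: exactly two nonterminals
        if len(body) == 0 and head == start:
            continue  # S -> epsilon: only permitted for the start symbol
        return False
    return True


# Hypothetical sample grammar for { a^n b^n : n >= 1 }, already in CNF.
NONTERMINALS = {"S", "T", "A", "B"}
RULES = [
    ("S", ("A", "B")),
    ("S", ("A", "T")),
    ("T", ("S", "B")),
    ("A", ("a",)),
    ("B", ("b",)),
]

print(is_cnf(RULES, NONTERMINALS, "S"))
```

A rule such as <code>("S", ("A", "b"))</code>, which mixes a nonterminal and a terminal on the right-hand side, is rejected by this check.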
 
==Algorithm==
Line 17 ⟶ 28:
'''let''' the grammar contain ''r'' nonterminal symbols ''R''<sub>1</sub> ... ''R''<sub>''r''</sub>, with start symbol ''R''<sub>1</sub>.
'''let''' ''P''[''n'',''n'',''r''] be an array of booleans. Initialize all elements of ''P'' to false.
'''let''' ''back''[''n'',''n'',''r''] be an array of lists of backpointing triples. Initialize all elements of ''back'' to the empty list.
'''for each''' ''s'' = 1 to ''n''
Line 26 ⟶ 38:
        '''for each''' ''p'' = 1 to ''l''-1 ''-- Partition of span''
            '''for each''' production ''R''<sub>''a''</sub> &rarr; ''R''<sub>''b''</sub> ''R''<sub>''c''</sub>
                '''if''' ''P''[''p'',''s'',''b''] and ''P''[''l''-''p'',''s''+''p'',''c''] '''then'''
                    '''set''' ''P''[''l'',''s'',''a''] = true,
                    append <p,b,c> to ''back''[''l'',''s'',''a'']
'''if''' ''P''[''n'',1,1] is true '''then'''
    ''I'' is a member of the language
    '''return''' ''back'' ''-- by retracing the steps through back, one can easily construct all possible parse trees of the string.''
'''else'''
    '''return''' "not a member of language"
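
The pseudocode can be turned into a short runnable recognizer. The Python sketch below is one possible rendering, not a reference implementation: the function name <code>cyk</code>, the split of the grammar into <code>unit_rules</code> and <code>binary_rules</code>, the use of symbol names instead of indices, and the sample grammar are all assumptions made for illustration. It keeps the pseudocode's 1-based indexing by span length ''l'', span start ''s'' and partition ''p''.

```python
def cyk(word, unit_rules, binary_rules, start):
    """Recognize `word` with a grammar in Chomsky normal form.

    unit_rules:   iterable of (A, a) pairs for productions A -> a
    binary_rules: iterable of (A, B, C) triples for productions A -> B C
    Returns (accepted, back); back[(l, s, A)] lists the (p, B, C) splits.
    """
    n = len(word)
    table = set()  # (l, s, A) in table  <=>  P[l, s, A] is true
    back = {}      # (l, s, A) -> list of backpointing triples (p, B, C)

    # Spans of length 1: match each input symbol against A -> a.
    for s in range(1, n + 1):
        for head, terminal in unit_rules:
            if word[s - 1] == terminal:
                table.add((1, s, head))

    # Longer spans are built from two adjacent shorter spans.
    for l in range(2, n + 1):            # length of span
        for s in range(1, n - l + 2):    # start of span
            for p in range(1, l):        # partition of span
                for head, left, right in binary_rules:
                    if (p, s, left) in table and (l - p, s + p, right) in table:
                        table.add((l, s, head))
                        back.setdefault((l, s, head), []).append((p, left, right))

    return (n, 1, start) in table, back


# Hypothetical sample grammar for { a^n b^n : n >= 1 }.
UNIT_RULES = [("A", "a"), ("B", "b")]
BINARY_RULES = [("S", "A", "B"), ("S", "A", "T"), ("T", "S", "B")]

accepted, back = cyk("aabb", UNIT_RULES, BINARY_RULES, "S")
print(accepted, back[(4, 1, "S")])
```

Storing only the true entries in a set, rather than allocating the full ''n'' × ''n'' × ''r'' boolean array, is a convenience of this sketch; the loop structure and hence the cubic dependence on ''n'' are unchanged.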
 
<div class="toccolours mw-collapsible mw-collapsed">
Line 51 ⟶ 66:
            '''for each''' production ''R''<sub>''a''</sub> &rarr; ''R''<sub>''b''</sub> ''R''<sub>''c''</sub>
                prob_splitting = Pr(''R''<sub>''a''</sub> &rarr; ''R''<sub>''b''</sub> ''R''<sub>''c''</sub>) * ''P''[''p'',''s'',''b''] * ''P''[''l''-''p'',''s''+''p'',''c'']
                '''if''' prob_splitting > 0 and ''P''[''l'',''s'',''a''] < prob_splitting '''then'''
                    '''set''' ''P''[''l'',''s'',''a''] = prob_splitting
                    '''set''' ''back''[''l'',''s'',''a''] = <p,b,c>

'''if''' ''P''[''n'',1,1] > 0 '''then'''
    find the parse tree by retracing through ''back''
    '''return''' the parse tree
'''else'''
    '''return''' "not a member of language"
</div>
</div>
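
The probabilistic variant can be sketched in Python along the same lines. This is an illustrative sketch under stated assumptions: the name <code>pcyk</code>, the dictionaries mapping rules to probabilities, the nested-tuple parse tree and the toy grammar are all hypothetical. Missing table entries are treated as probability 0, so the single comparison <code>prob_splitting > P[l, s, a]</code> covers both conditions of the pseudocode.

```python
def pcyk(word, unit_rules, binary_rules, start):
    """Return (probability, tree) of the most probable parse, or (0.0, None).

    unit_rules:   dict mapping (A, a) to Pr(A -> a)
    binary_rules: dict mapping (A, B, C) to Pr(A -> B C)
    """
    n = len(word)
    P = {}     # (l, s, A) -> best probability found so far (absent means 0)
    back = {}  # (l, s, A) -> (p, B, C) split that achieved that probability

    for s in range(1, n + 1):
        for (head, terminal), prob in unit_rules.items():
            if word[s - 1] == terminal:
                P[(1, s, head)] = max(P.get((1, s, head), 0.0), prob)

    for l in range(2, n + 1):            # length of span
        for s in range(1, n - l + 2):    # start of span
            for p in range(1, l):        # partition of span
                for (head, left, right), prob in binary_rules.items():
                    prob_splitting = (prob
                                      * P.get((p, s, left), 0.0)
                                      * P.get((l - p, s + p, right), 0.0))
                    if prob_splitting > P.get((l, s, head), 0.0):
                        P[(l, s, head)] = prob_splitting
                        back[(l, s, head)] = (p, left, right)

    def build(l, s, head):
        # Retrace the backpointers into a nested-tuple parse tree.
        if l == 1:
            return (head, word[s - 1])
        p, left, right = back[(l, s, head)]
        return (head, build(p, s, left), build(l - p, s + p, right))

    best = P.get((n, 1, start), 0.0)
    return (best, build(n, 1, start)) if best > 0 else (0.0, None)


# Hypothetical toy grammar with two competing derivations of "ab".
UNIT_PROBS = {("X", "a"): 1.0, ("Y", "b"): 1.0, ("A", "a"): 1.0, ("B", "b"): 1.0}
BINARY_PROBS = {("S", "X", "Y"): 0.3, ("S", "A", "B"): 0.7}

print(pcyk("ab", UNIT_PROBS, BINARY_PROBS, "S"))
```

In the toy grammar the derivation through <code>S → X Y</code> is found first and then replaced by the more probable one through <code>S → A B</code>, which exercises the update step.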
Line 116 ⟶ 137:
===Parsing weighted context-free grammars===
It is also possible to extend the CYK algorithm to parse strings using [[weighted context-free grammar|weighted]] and [[stochastic context-free grammar]]s. Weights (probabilities) are then stored in the table P instead of booleans, so P[i,j,A] will contain the minimum weight (maximum probability) that the substring from i to j can be derived from A. Further extensions of the algorithm allow all parses of a string to be enumerated from lowest to highest weight (highest to lowest probability).
 
==== Numerical stability ====
When the probabilistic CYK algorithm is applied to a long string, the splitting probability can become very small due to multiplying many probabilities together. This can be dealt with by summing log-probabilities instead of multiplying probabilities.
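
The effect can be seen in a small Python experiment. The numbers are made up for illustration: a derivation is assumed to use 400 rules, each with probability 0.1, and the variable names are hypothetical.

```python
import math

# Hypothetical derivation: 400 rule applications, each with probability 0.1.
probs = [0.1] * 400

# Multiplying directly underflows, because 1e-400 is far below the
# smallest positive double-precision value.
product = 1.0
for p in probs:
    product *= p

# Summing logarithms keeps the same information in a representable range.
log_sum = sum(math.log(p) for p in probs)

print(product)   # underflows to zero
print(log_sum)   # roughly -921.03
```

Because the logarithm is monotonic, comparing log-probabilities selects the same best split as comparing the probabilities themselves; the product in <code>prob_splitting</code> becomes a sum of three logarithms, and a probability of zero can be represented as negative infinity.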
 
===Valiant's algorithm===
Line 134 ⟶ 158:
== Sources ==
*{{cite conference |title= Syntax in universal translation |last= Sakai |first= Itiroo |date= 1962 |location= London |publisher= Her Majesty’s Stationery Office |volume= II |pages= 593–608 |conference= 1961 International Conference on Machine Translation of Languages and Applied Language Analysis, Teddington, England}}
*{{cite tech report |last1=Cocke |first1=John |author-link1=John Cocke (computer scientist) |last2=Schwartz |first2=Jacob T. |date=April 1970 |title=Programming languages and their compilers: Preliminary notes |edition=2nd revised |publisher=[[Courant Institute of Mathematical Sciences|CIMS]], [[New York University|NYU]] |url=http://www.softwarepreservation.org/projects/FORTRAN/CockeSchwartz_ProgLangCompilers.pdf}}
*{{cite book | isbn=0-201-02988-X | first1=John E. | last1=Hopcroft | author1-link=John E. Hopcroft | first2=Jeffrey D. | last2=Ullman | author2-link=Jeffrey D. Ullman | title=Introduction to Automata Theory, Languages, and Computation | location=Reading/MA | publisher=Addison-Wesley | year=1979 | url=https://archive.org/details/introductiontoau00hopc }}
*{{cite tech report |last1=Kasami |first1=T. |author-link1=Tadao Kasami |year=1965 |title=An efficient recognition and syntax-analysis algorithm for context-free languages |number=65-758 |publisher=[[Air Force Cambridge Research Laboratories|AFCRL]]}}
*{{cite book |last1=Knuth |first1=Donald E. |author-link1=Donald Knuth |title=The Art of Computer Programming Volume 2: Seminumerical Algorithms |publisher=Addison-Wesley Professional |edition=3rd |date=November 14, 1997 |isbn=0-201-89684-2 |pages=501 }}
*{{cite journal |last1=Lang |first1=Bernard |title=Recognition can be harder than parsing |journal=[[Computational Intelligence (journal)|Comput. Intell.]] |year=1994 |volume=10 |issue=4 |pages=486–494 |citeseerx=10.1.1.50.6982 |doi=10.1111/j.1467-8640.1994.tb00011.x |s2cid=5873640 }}
Line 146 ⟶ 170:
 
==External links==
* [https://raw.org/tool/cyk-algorithm/ Interactive Visualization of the CYK algorithm]
* [https://martinlaz.github.io/demos/cky.html CYK parsing demo in JavaScript]
* [https://www.swisseduc.ch/informatik/exorciser/ Exorciser is a Java application to generate exercises in the CYK algorithm as well as Finite State Machines, Markov algorithms etc]
 
{{Parsers}}