{{Short description|Natural language processing algorithm}}
The '''Lesk algorithm''' is a classical algorithm for [[word sense disambiguation]] introduced by [[Mike Lesk|Michael E. Lesk]] in 1986.<ref>Lesk, M. (1986). [http://portal.acm.org/citation.cfm?id=318728 Automatic sense disambiguation using machine readable dictionaries: how to tell a pine cone from an ice cream cone]. In ''SIGDOC '86: Proceedings of the 5th annual international conference on Systems documentation'', pages 24–26, New York, NY, USA. ACM.</ref> It operates on the premise that words within a given context are likely to share a common meaning. This algorithm compares the dictionary definitions of an ambiguous word with the words in its surrounding context to determine the most appropriate sense. Variations, such as the Simplified Lesk algorithm, have demonstrated improved precision and efficiency. However, the Lesk algorithm has faced criticism for its sensitivity to definition wording and its reliance on brief glosses. Researchers have sought to enhance its accuracy by incorporating additional resources like thesauruses and syntactic models.
 
==Overview==
The Lesk algorithm is based on the assumption that words in a given "neighborhood" (section of text) will tend to share a common topic. A simplified version of the Lesk algorithm is to compare the dictionary definition of an ambiguous word with the terms contained in its neighborhood. Versions have been adapted to use [[WordNet]].<ref>Satanjeev Banerjee and Ted Pedersen. ''[https://www.cs.cmu.edu/~banerjee/Publications/cicling2002.ps.gz An Adapted Lesk Algorithm for Word Sense Disambiguation Using WordNet]'', Lecture Notes in Computer Science; Vol. 2276, pages 136–145, 2002. {{ISBN|3-540-43219-1}}</ref> An implementation might look like this:
# for every sense of the word being disambiguated, count the number of words that are in both the neighborhood of that word and in the dictionary definition of that sense;
# choose the sense with the largest such count.
 
A frequently used '''example''' illustrating this algorithm is for the context "pine cone". The following dictionary definitions are used:
 PINE
 1. kinds of evergreen tree with needle-shaped leaves
 2. waste away through sorrow or illness
 
 CONE
 1. solid body which narrows to a point
 2. something of this shape whether solid or hollow
 3. fruit of certain evergreen trees
As can be seen, the best intersection is Pine #1 ⋂ Cone #3 = 2.
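
This counting can be reproduced with a short script. The following is a minimal sketch, with the toy definitions above hard-coded for illustration; a real implementation would read them from a machine-readable dictionary and normalize word forms (so that "tree" also matches "trees"):
<syntaxhighlight lang="python">
# Toy dictionary for the "pine cone" example above.
definitions = {
    "pine": [
        "kinds of evergreen tree with needle-shaped leaves",
        "waste away through sorrow or illness",
    ],
    "cone": [
        "solid body which narrows to a point",
        "something of this shape whether solid or hollow",
        "fruit of certain evergreen trees",
    ],
}

def overlap(gloss_a, gloss_b):
    """Count the distinct words shared by two glosses (exact matching)."""
    return len(set(gloss_a.split()) & set(gloss_b.split()))

# Score every (pine sense, cone sense) pair and keep the best one.
best = max(
    ((i, j, overlap(p, c))
     for i, p in enumerate(definitions["pine"], start=1)
     for j, c in enumerate(definitions["cone"], start=1)),
    key=lambda triple: triple[2],
)
print("Pine #%d ∩ Cone #%d = %d shared words" % best)
# -> Pine #1 ∩ Cone #3 = 2 shared words ("of" and "evergreen" here;
#    with stemming, "tree"/"trees" would also match)
</syntaxhighlight>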
 
==Simplified Lesk algorithm==
 
In the Simplified Lesk algorithm,<ref>A. Kilgarriff and J. Rosenzweig. 2000. [http://www.lrec-conf.org/proceedings/lrec2000/pdf/8.pdf English SENSEVAL: Report and Results]. In Proceedings of the 2nd International Conference on Language Resources and Evaluation, LREC, Athens, Greece.</ref> the correct meaning of each word in a given context is determined individually by locating the sense that overlaps the most between its dictionary definition and the given context. Rather than simultaneously determining the meanings of all words in a given context, this approach tackles each word individually, independent of the meaning of the other words occurring in the same context.
 
"A comparative evaluation performed by Vasilescu et al. (2004)<ref>Florentina Vasilescu, Philippe Langlais, and Guy Lapalme.
2004. [http://www.lrec-conf.org/proceedings/lrec2004/pdf/219.pdf Evaluating Variants of the Lesk Approach for Disambiguating Words]. LREC, Portugal.</ref> has shown that the simplified Lesk algorithm can significantly outperform the original definition of the algorithm, both in terms of precision and efficiency. By evaluating the disambiguation algorithms on the Senseval-2 English all words data, they measure a 58% precision using the simplified Lesk algorithm compared to the only 42% under the original algorithm.
 
Note: Vasilescu et al.'s implementation considers a back-off strategy for words not covered by the algorithm, consisting of the most frequent sense defined in WordNet. This means that words for which all their possible meanings lead to zero overlap with the current context or with other word definitions are by default assigned sense number one in WordNet."<ref>Agirre, Eneko & Philip Edmonds (eds.). 2006. [https://books.google.com/books?id=GLck75U20pAC&q=Lesk Word Sense Disambiguation: Algorithms and Applications]. Dordrecht: Springer. www.wsdbook.org</ref>
 
'''Simplified Lesk algorithm with smart default word sense (Vasilescu et al., 2004)'''<ref name="vasilescu"/>
{| class="wikitable"
|-
| '''function''' SIMPLIFIED LESK(''word, sentence'') '''returns''' ''best sense of word''
:''best-sense'' <- most frequent sense for ''word''
:''max-overlap'' <- 0
:''context'' <- set of words in ''sentence''
:'''for each''' ''sense'' '''in''' senses of ''word'' '''do'''
::''signature'' <- set of words in the gloss and examples of ''sense''
::''overlap'' <- COMPUTEOVERLAP(''signature'', ''context'')
::'''if''' ''overlap'' > ''max-overlap'' '''then'''
:::''max-overlap'' <- ''overlap''
:::''best-sense'' <- ''sense''
'''end'''
'''return''' ''best-sense''
|}
The COMPUTEOVERLAP function returns the number of words in common between two sets, ignoring function words or other words on a stop list. The original Lesk algorithm defines the context in a more complex way.
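
The pseudocode translates directly into a short program. The following is a minimal sketch using [[Natural Language Toolkit|NLTK]]'s WordNet interface; it assumes the <code>nltk</code> package and its <code>wordnet</code> corpus are installed, and uses WordNet's sense ordering as the "most frequent sense" default:
<syntaxhighlight lang="python">
# A sketch of SIMPLIFIED LESK using NLTK's WordNet interface.
# Requires: pip install nltk; then nltk.download("wordnet").
from nltk.corpus import wordnet as wn

# Small illustrative stop list; COMPUTEOVERLAP ignores such function words.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "at", "to", "and", "or", "is"}

def compute_overlap(signature, context):
    """Number of words shared by the two sets, ignoring stop words."""
    return len((signature & context) - STOP_WORDS)

def simplified_lesk(word, sentence):
    senses = wn.synsets(word)
    if not senses:
        return None
    best_sense = senses[0]  # default: most frequent sense (WordNet order)
    max_overlap = 0
    context = set(sentence.lower().split())
    for sense in senses:
        # Signature: words in the gloss and example sentences of this sense.
        signature = set(sense.definition().lower().split())
        for example in sense.examples():
            signature |= set(example.lower().split())
        overlap = compute_overlap(signature, context)
        if overlap > max_overlap:
            max_overlap = overlap
            best_sense = sense
    return best_sense

sense = simplified_lesk("bank", "I went to the bank to deposit money")
print(sense, "-", sense.definition())
</syntaxhighlight>
NLTK also ships a ready-made variant of this algorithm as <code>nltk.wsd.lesk</code>.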
 
==Criticisms==
Unfortunately, Lesk’s approach is very sensitive to the exact wording of definitions, so the absence of a certain word can radically change the results. Further, the algorithm determines overlaps only among the glosses of the senses being considered. This is a significant limitation in that dictionary glosses tend to be fairly short and do not provide sufficient vocabulary to relate fine-grained sense distinctions.
 
A lot of work has appeared offering different modifications of this algorithm. These works use other resources for analysis (thesauruses, synonym dictionaries, or morphological and syntactic models): for instance, they may use such information as synonyms, different derivatives, or words from the definitions of words that themselves appear in definitions.<ref>Alexander Gelbukh, Grigori Sidorov. [https://www.gelbukh.com/CV/Publications/2004/NTI-2004-senses.htm Automatic resolution of ambiguity of word senses in dictionary definitions] (in Russian). J. Nauchno-Tehnicheskaya Informaciya (NTI), ISSN 0548-0027, ser. 2, N 3, 2004, pp. 10–15.</ref>
 
==Lesk variants==
* Original Lesk (Lesk, 1986)
* Adapted/Extended Lesk (Banerjee and Pedersen, 2002/2003): In the adapted Lesk algorithm, a word vector is created for every content word in the WordNet gloss. Concatenating glosses of related concepts in WordNet can be used to augment this vector. The vector contains the co-occurrence counts of words co-occurring with ''w'' in a large corpus. Adding the word vectors for all the content words in its gloss creates the gloss vector ''g'' for a concept. Relatedness is determined by comparing the gloss vectors using the [[cosine similarity]] measure, as in the sketch below.<ref>{{Cite book|last1=Banerjee|first1=Satanjeev|last2=Pedersen|first2=Ted|title=Computational Linguistics and Intelligent Text Processing |chapter=An Adapted Lesk Algorithm for Word Sense Disambiguation Using WordNet |date=2002-02-17|series=Lecture Notes in Computer Science|volume=2276 |language=en|publisher=Springer, Berlin, Heidelberg|pages=136–145|doi=10.1007/3-540-45715-1_11|isbn=978-3540457152|citeseerx=10.1.1.118.8359}}</ref>
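
The following is a minimal sketch of the gloss-vector idea, not Banerjee and Pedersen's exact implementation; the <code>cooccurrence</code> table here is hypothetical toy data standing in for counts gathered from a large corpus:
<syntaxhighlight lang="python">
import math
from collections import Counter

# Hypothetical co-occurrence counts (word -> words it co-occurs with),
# standing in for statistics collected from a large corpus.
cooccurrence = {
    "evergreen": Counter({"tree": 12, "forest": 7, "needle": 3}),
    "tree":      Counter({"forest": 9, "evergreen": 12, "leaf": 5}),
    "fruit":     Counter({"tree": 6, "sweet": 4, "evergreen": 1}),
}

def gloss_vector(content_words):
    """Sum the co-occurrence vectors of the content words in a gloss."""
    total = Counter()
    for word in content_words:
        total.update(cooccurrence.get(word, Counter()))
    return total

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Gloss vectors for two concepts, built from their (toy) gloss words.
g1 = gloss_vector(["evergreen", "tree"])
g2 = gloss_vector(["fruit", "evergreen"])
print(round(cosine(g1, g2), 3))  # relatedness of the two concepts
</syntaxhighlight>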
 
There are a lot of studies concerning Lesk and its extensions:<ref>Roberto Navigli. [http://www.dsi.uniroma1.it/~navigli/pubs/ACM_Survey_2009_Navigli.pdf ''Word Sense Disambiguation: A Survey''], ACM Computing Surveys, 41(2), 2009, pp. 1–69.</ref>
 
* Wilks and Stevenson, 1998, 1999;
* Mahesh et al., 1997;
* Yarowsky, 1992;
* Pook and Catlett, 1988;
* Kilgarriff and Rosenzweig, 2000;
* Kwong, 2001;
* Nastase and Szpakowicz, 2001;
* Gelbukh and Sidorov, 2004.
 
==Accuracy==
Accuracy on ''[[Pride and Prejudice]]'' and selected papers of the [[Associated Press]] was found to be in the 50% to 70% range.

==See also==
{{Commons}}
{{Portal|Linguistics}}
* [[Word-sense disambiguation]]

==References==
{{reflist|30em}}
 
[[Category:Natural language processing]]
[[Category:Semantics]]
[[Category:Computational linguistics]]
[[Category:Word-sense disambiguation]]
 
{{ling-stub}}