Automatic summarization: Difference between revisions

===Document summarization===
Like keyphrase extraction, document summarization aims to identify the essence of a text. The only real difference is that now we are dealing with larger text units—whole sentences instead of words and phrases.
 
====Supervised learning approaches====
* The [[Reddit]] [[Internet bot|bot]] "autotldr",<ref>{{cite web|title=overview for autotldr|url=https://www.reddit.com/user/autotldr|website=reddit|access-date=9 February 2017|language=en}}</ref> created in 2011, summarizes news articles in the comment section of reddit posts. It was found to be very useful by the reddit community, which has upvoted its summaries hundreds of thousands of times.<ref>{{cite book|last1=Squire|first1=Megan|author-link = Megan Squire|title=Mastering Data Mining with Python – Find patterns hidden in your data|publisher=Packt Publishing Ltd|isbn=9781785885914|url=https://books.google.com/books?id=_qXWDQAAQBAJ&pg=PA185|access-date=9 February 2017|language=en|date=2016-08-29}}</ref> The name is a reference to [[TL;DR]], [[Internet slang]] for "too long; didn't read".<ref>{{cite web|title=What Is 'TLDR'?|url=https://www.lifewire.com/what-is-tldr-2483633|website=Lifewire|access-date=9 February 2017}}</ref><ref>{{cite web|title=What Does TL;DR Mean? AMA? TIL? Glossary Of Reddit Terms And Abbreviations|url=http://www.ibtimes.com/what-does-tldr-mean-ama-til-glossary-reddit-terms-abbreviations-431704|work=International Business Times|access-date=9 February 2017|date=29 March 2012}}</ref>
 
==Evaluation techniques==
<!-- IMPORTANT: This section needs to be tied in to the above article so it fits in. Currently, it is not clear what the relation of evaluation is to any of the above topics. The following questions need to be answered: First, in the context of automatic summarization, what is evaluation? Second, what is the significance of evaluation? That is, what is evaluation used for?
-->
The most common way to evaluate the informativeness of automatic summaries is to compare them with human-made model summaries.
 
Evaluation techniques can be intrinsic or extrinsic,<ref>[http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings2/sum-mani.pdf Mani, I. Summarization evaluation: an overview]</ref> and inter-textual or intra-textual.<ref>{{Cite journal | doi=10.3103/S0005105507030041|title = A method for evaluating modern systems of automatic text summarization| journal=Automatic Documentation and Mathematical Linguistics| volume=41| issue=3| pages=93–103|year = 2007|last1 = Yatsko|first1 = V. A.| last2=Vishnyakov| first2=T. N.|s2cid = 7853204}}</ref>
 
=== Intrinsic and extrinsic evaluation ===
An intrinsic evaluation tests the summarization system in and of itself while an extrinsic evaluation tests the summarization based on how it affects the completion of some other task. Intrinsic evaluations have assessed mainly the coherence and informativeness of summaries. Extrinsic evaluations, on the other hand, have tested the impact of summarization on tasks like relevance assessment, reading comprehension, etc.
 
=== Inter-textual and intra-textual ===
Human judgement often has wide variance on what is considered a "good" summary, which makes automating the evaluation process particularly difficult. Manual evaluation can be used, but it is both time- and labor-intensive, as it requires humans to read not only the summaries but also the source documents. Other issues concern [[coherence (linguistics)|coherence]] and coverage.
 
The most common way to evaluate summaries is the [[ROUGE (metric)|ROUGE]] (Recall-Oriented Understudy for Gisting Evaluation) measure, widely used in [[NIST]]'s annual Document Understanding Conferences, where research groups submit their summarization systems. [https://web.archive.org/web/20060408135021/http://haydn.isi.edu/ROUGE/] ROUGE is a recall-based measure of how well a summary covers the content present in one or more human-generated model summaries known as references. It essentially calculates [[n-gram]] overlaps between automatically generated summaries and previously written human summaries. It is recall-based to encourage systems to include all the important topics in the text. Recall can be computed with respect to unigram, bigram, trigram, or 4-gram matching. For example, ROUGE-1 is computed as the number of unigrams in the reference that also appear in the system summary, divided by the total number of unigrams in the reference summary. A high level of overlap should indicate a high level of shared concepts between the two summaries.
 
If there are multiple references, the ROUGE-1 scores are averaged. Because ROUGE is based only on content overlap, it can determine whether the same general concepts are discussed in an automatic summary and a reference summary, but it cannot determine whether the result is coherent or the sentences flow together in a sensible manner. Higher-order n-gram ROUGE measures try to judge fluency to some degree. Note that ROUGE is similar to the BLEU measure for machine translation, but BLEU is precision-based, because translation systems favor accuracy.
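As a concrete illustration, ROUGE-N recall can be sketched in a few lines of Python. This is a minimal, self-contained sketch of the idea described above (function names are illustrative, not from the official ROUGE toolkit, which additionally supports stemming, stopword removal, and other options):

```python
from collections import Counter

def ngrams(tokens, n):
    """Counts of all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(system, reference, n=1):
    """ROUGE-N recall: matched reference n-grams / total reference n-grams.

    Counts are clipped, so a system n-gram is only credited as many
    times as it occurs in the reference.
    """
    sys_counts = ngrams(system.lower().split(), n)
    ref_counts = ngrams(reference.lower().split(), n)
    total = sum(ref_counts.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, sys_counts[gram])
                  for gram, count in ref_counts.items())
    return overlap / total

def rouge_n_multi(system, references, n=1):
    """Average ROUGE-N recall over multiple reference summaries."""
    return sum(rouge_n_recall(system, ref, n) for ref in references) / len(references)
```

For instance, comparing the system summary "the cat sat on the mat" against the reference "the cat was on the mat" gives a ROUGE-1 recall of 5/6: the reference contains six unigram tokens, of which all but "was" also occur in the system summary.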
 
Another problem yet to be fully solved is [[anaphora (linguistics)|anaphor resolution]]. Similarly, for image summarization, Tschiatschek et al. developed a Visual-ROUGE score that judges the performance of image summarization algorithms.<ref>Sebastian Tschiatschek, Rishabh Iyer, Hoachen Wei and Jeff Bilmes, [http://papers.nips.cc/paper/5415-learning-mixtures-of-submodular-functions-for-image-collection-summarization.pdf Learning Mixtures of Submodular Functions for Image Collection Summarization], In Advances of Neural Information Processing Systems (NIPS), Montreal, Canada, December - 2014. (PDF)</ref>
 
===Domain-specific versus ___domain-independent summarization techniques===
Domain independent summarization techniques generally apply sets of general features which can be used to identify information-rich text segments. Recent research focus has drifted to ___domain-specific summarization techniques that utilize the available knowledge specific to the ___domain of text. For example, automatic summarization research on medical text generally attempts to utilize the various sources of codified medical knowledge and ontologies.<ref>{{Cite book|last1=Sarker|first1=Abeed|last2=Molla|first2=Diego|last3=Paris|first3=Cecile|title=An Approach for Query-focused Text Summarization for Evidence-based medicine|date=2013|volume=7885|pages=295–304|doi=10.1007/978-3-642-38326-7_41|series=Lecture Notes in Computer Science|isbn=978-3-642-38325-0}}</ref>
 
===Qualitative===
The main drawback of existing evaluation systems is that at least one reference summary (and for some methods, more than one) is needed to compare automatic summaries against models. This is a hard and expensive requirement: considerable effort is needed to build corpora of texts and their corresponding summaries. Furthermore, some methods require not only human-made summaries for comparison but also manual annotation (e.g. marking Summary Content Units in the Pyramid Method). In any case, what the evaluation methods need as input is a set of summaries to serve as gold standards and a set of automatic summaries. Moreover, they all perform a quantitative evaluation with regard to different similarity metrics.