→Evaluation: Integrated Evaluation subsection into main Evaluation section
Line 82:
===Document summarization===
Like keyphrase extraction, document summarization aims to identify the essence of a text. The only real difference is that now we are dealing with larger text units—whole sentences instead of words and phrases.
If there are multiple references, the ROUGE-1 scores are averaged. Because ROUGE is based only on content overlap, it can determine whether the same general concepts are discussed in an automatic summary and a reference summary, but it cannot determine whether the result is coherent or whether the sentences flow together in a sensible manner. High-order n-gram ROUGE measures try to judge fluency to some degree. Note that ROUGE is similar to the BLEU measure for machine translation, but BLEU is precision-based, because translation systems favor accuracy.
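The overlap computation and the averaging over multiple references can be sketched as follows. This is only an illustrative sketch, not the official ROUGE toolkit: the function names (<code>rouge1_recall</code>, <code>rouge1_multi</code>) and the whitespace tokenizer are assumptions made for the example, and real implementations may tokenize, stem, or combine references differently.

```python
from collections import Counter


def unigram_counts(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return Counter(text.lower().split())


def rouge1_recall(candidate, reference):
    """Fraction of reference unigrams that also occur in the candidate.

    Counts are clipped, so a word repeated in the candidate cannot be
    credited more often than it appears in the reference.
    """
    cand = unigram_counts(candidate)
    ref = unigram_counts(reference)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, cand[word]) for word, count in ref.items())
    return overlap / total


def rouge1_multi(candidate, references):
    """Average the per-reference ROUGE-1 recall scores (hypothetical helper)."""
    if not references:
        raise ValueError("at least one reference summary is required")
    scores = [rouge1_recall(candidate, ref) for ref in references]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    summary = "the cat sat on the mat"
    refs = ["the cat was on the mat", "a cat sat"]
    print(rouge1_multi(summary, refs))
    # Shuffling the words leaves the score unchanged, which is why
    # unigram overlap alone says nothing about coherence.
    print(rouge1_multi("mat the on sat cat the", refs))
```

Because only word counts are compared, a scrambled summary receives exactly the same score as a fluent one, which illustrates the limitation described above.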
====Supervised learning approaches====
Line 141 ⟶ 136:
* The [[Reddit]] [[Internet bot|bot]] "autotldr",<ref>{{cite web|title=overview for autotldr|url=https://www.reddit.com/user/autotldr|website=reddit|access-date=9 February 2017|language=en}}</ref> created in 2011, summarizes news articles in the comment section of Reddit posts. It was found to be very useful by the Reddit community, which upvoted its summaries hundreds of thousands of times.<ref>{{cite book|last1=Squire|first1=Megan|author-link = Megan Squire|title=Mastering Data Mining with Python – Find patterns hidden in your data|publisher=Packt Publishing Ltd|isbn=9781785885914|url=https://books.google.com/books?id=_qXWDQAAQBAJ&pg=PA185|access-date=9 February 2017|language=en|date=2016-08-29}}</ref> The name is a reference to [[TL;DR]] − [[Internet slang]] for "too long; didn't read".<ref>{{cite web|title=What Is 'TLDR'?|url=https://www.lifewire.com/what-is-tldr-2483633|website=Lifewire|access-date=9 February 2017}}</ref><ref>{{cite web|title=What Does TL;DR Mean? AMA? TIL? Glossary Of Reddit Terms And Abbreviations|url=http://www.ibtimes.com/what-does-tldr-mean-ama-til-glossary-reddit-terms-abbreviations-431704|work=International Business Times|access-date=9 February 2017|date=29 March 2012}}</ref>
==Evaluation==
<!-- IMPORTANT: This section needs to be tied in to the above article so it fits in. Currently, it is not clear what the relation of evaluation is to any of the above topics. The following questions need to be answered: First, in the context of automatic summarization, what is evaluation? Second, what is the significance of evaluation? That is, what is evaluation used for?
-->
The most common way to evaluate the informativeness of automatic summaries is to compare them with human-made model summaries.
=== Intrinsic and extrinsic ===
An intrinsic evaluation tests the summarization system in and of itself, while an extrinsic evaluation tests the summarization based on how it affects the completion of some other task. Intrinsic evaluations have assessed mainly the coherence and informativeness of summaries. Extrinsic evaluations, on the other hand, have tested the impact of summarization on tasks such as relevance assessment and reading comprehension.
=== Inter-textual and intra-textual ===
Line 157 ⟶ 151:
Human judgement often varies widely on what is considered a "good" summary, which makes automating the evaluation process particularly difficult. Manual evaluation can be used, but it is both time- and labor-intensive, as it requires humans to read not only the summaries but also the source documents. Other issues are those concerning [[coherence (linguistics)|coherence]] and coverage.
Another problem yet to be fully solved is [[anaphora (linguistics)|anaphor resolution]]. For image summarization, Tschiatschek et al. developed a Visual-ROUGE score that judges the performance of summarization algorithms.<ref>Sebastian Tschiatschek, Rishabh Iyer, Hoachen Wei and Jeff Bilmes, [http://papers.nips.cc/paper/5415-learning-mixtures-of-submodular-functions-for-image-collection-summarization.pdf Learning Mixtures of Submodular Functions for Image Collection Summarization], In Advances of Neural Information Processing Systems (NIPS), Montreal, Canada, December - 2014. (PDF)</ref>
(ROUGE is similar to the BLEU measure for machine translation, but BLEU is precision-based, because translation systems favor accuracy.)
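The difference between a precision-based and a recall-based overlap measure can be illustrated with a small sketch. The functions below are hypothetical helpers written for this example. They compute only clipped unigram precision and recall, so they are neither full BLEU (which also uses higher-order n-grams and a brevity penalty) nor the official ROUGE implementation.

```python
from collections import Counter


def clipped_overlap(candidate_tokens, reference_tokens):
    """Number of candidate unigrams matched in the reference, with clipped counts."""
    cand = Counter(candidate_tokens)
    ref = Counter(reference_tokens)
    return sum(min(count, ref[word]) for word, count in cand.items())


def unigram_precision(candidate, reference):
    """BLEU-style view: what share of the candidate is supported by the reference?"""
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    if not cand_tokens:
        return 0.0
    return clipped_overlap(cand_tokens, ref_tokens) / len(cand_tokens)


def unigram_recall(candidate, reference):
    """ROUGE-style view: what share of the reference is covered by the candidate?"""
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    if not ref_tokens:
        return 0.0
    return clipped_overlap(cand_tokens, ref_tokens) / len(ref_tokens)


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    candidate = "the cat"
    print(unigram_precision(candidate, reference))
    print(unigram_recall(candidate, reference))
```

In this example, a very short candidate made up entirely of words from the reference scores perfectly on precision but poorly on recall, since it covers only a small part of the reference.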
===Domain===
Domain-independent summarization techniques generally apply sets of general features which can be used to identify information-rich text segments. Recent research focus has drifted to domain-specific summarization techniques that utilize the available knowledge specific to the domain of text. For example, automatic summarization research on medical text generally attempts to utilize the various sources of codified medical knowledge and ontologies.<ref>{{Cite book|last1=Sarker|first1=Abeed|last2=Molla|first2=Diego|last3=Paris|first3=Cecile|title=An Approach for Query-focused Text Summarization for Evidence-based medicine|date=2013|volume=7885|pages=295–304|doi=10.1007/978-3-642-38326-7_41|series=Lecture Notes in Computer Science|isbn=978-3-642-38325-0}}</ref>
===Qualitative===
The main drawback of existing evaluation systems is that they need at least one reference summary, and for some methods more than one, against which to compare automatic summaries. Producing these is hard and expensive: considerable effort is required to build a corpus of texts and their corresponding summaries. Furthermore, some methods require not only human-made summaries for comparison but also manual annotation (e.g. SCU in the Pyramid Method). In any case, what the evaluation methods need as input is a set of summaries to serve as gold standards and a set of automatic summaries. Moreover, they all perform a quantitative evaluation with regard to different similarity metrics.