==Evaluation==
Evaluation can be intrinsic or extrinsic,<ref>[http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings2/sum-mani.pdf Mani, I. Summarization evaluation: an overview]</ref> and inter-textual or intra-textual.<ref>{{Cite journal | doi=10.3103/S0005105507030041|title = A method for evaluating modern systems of automatic text summarization| journal=Automatic Documentation and Mathematical Linguistics| volume=41| issue=3| pages=93–103|year = 2007|last1 = Yatsko|first1 = V. A.| last2=Vishnyakov| first2=T. N.|s2cid = 7853204}}</ref>
 
=== Intrinsic versus extrinsic ===
Intrinsic evaluation assesses summaries directly, while extrinsic evaluation measures how the summarization system affects the completion of some other task. Intrinsic evaluations have mainly assessed the coherence and informativeness of summaries. Extrinsic evaluations, on the other hand, have tested the impact of summarization on tasks like relevance assessment and reading comprehension.
 
=== Inter-textual versus intra-textual ===
Intra-textual evaluation assesses the output of a specific summarization system, while inter-textual evaluation focuses on contrastive analysis of the outputs of several summarization systems.
 
Human judgement of what constitutes a "good" summary varies greatly, so creating an automatic evaluation process is particularly difficult. Manual evaluation can be used, but it is both time- and labor-intensive, as it requires humans to read not only the summaries but also the source documents. Other issues concern [[coherence (linguistics)|coherence]] and coverage.
 
The most common way to evaluate summaries is the [[ROUGE (metric)|ROUGE]] (Recall-Oriented Understudy for Gisting Evaluation) measure. It is very common for summarization and translation systems in [[NIST]]'s annual Document Understanding Conferences, where research groups submit their summarization and translation systems. [https://web.archive.org/web/20060408135021/http://haydn.isi.edu/ROUGE/] ROUGE is a recall-based measure of how well a summary covers the content of human-generated model summaries known as references. It calculates [[n-gram]] overlaps between automatically generated summaries and previously written human summaries. It is recall-based to encourage the inclusion of all the important topics in the summaries. Recall can be computed with respect to unigram, bigram, trigram, or 4-gram matching. For example, ROUGE-1 is the fraction of unigrams that appear in both the reference summary and the automatic summary, out of all unigrams in the reference summary. If there are multiple reference summaries, their scores are averaged. A high level of overlap should indicate a high degree of shared concepts between the two summaries.
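The ROUGE-N recall described above can be sketched as follows (a minimal illustration assuming simple whitespace tokenization and clipped n-gram counts; the function names are illustrative, and real ROUGE implementations add options such as stemming and stopword removal):

```python
from collections import Counter

def rouge_n_recall(candidate, reference, n=1):
    """Fraction of the reference's n-grams that also appear in the candidate."""
    def ngrams(text, n):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    if not ref:
        return 0.0
    # Clip each n-gram's match count at its frequency in the candidate.
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / sum(ref.values())

def rouge_n_multi(candidate, references, n=1):
    """Average the score over several reference summaries."""
    return sum(rouge_n_recall(candidate, r, n) for r in references) / len(references)
```

For example, with the reference "the cat sat on the mat" and the candidate "the cat lay on the mat", five of the six reference unigrams are matched, giving a ROUGE-1 score of about 0.83.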
 
Because ROUGE is based only on content overlap, it can determine if the same general concepts are discussed in an automatic summary and a reference summary, but it cannot determine if the result is coherent, that is, if the sentences flow together sensibly. High-order n-gram ROUGE measures help to judge fluency to some degree.
 
Another problem yet to be fully solved is [[anaphora (linguistics)|anaphor resolution]]. Similarly, for image summarization, Tschiatschek et al. developed a Visual-ROUGE score which judges the performance of image summarization algorithms.<ref>Sebastian Tschiatschek, Rishabh Iyer, Hoachen Wei and Jeff Bilmes, [http://papers.nips.cc/paper/5415-learning-mixtures-of-submodular-functions-for-image-collection-summarization.pdf Learning Mixtures of Submodular Functions for Image Collection Summarization], In Advances of Neural Information Processing Systems (NIPS), Montreal, Canada, December - 2014. (PDF)</ref>
 
(ROUGE is similar to the BLEU measure for machine translation, but BLEU is precision-based, because translation systems favor accuracy.)
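The precision/recall contrast can be illustrated with unigram scores (a hedged sketch with illustrative function names; BLEU proper also combines precisions over several n-gram orders and applies a brevity penalty):

```python
from collections import Counter

def _unigrams(text):
    return Counter(text.lower().split())

def _overlap(a, b):
    # Clipped count of shared unigrams.
    return sum(min(count, b[word]) for word, count in a.items())

def unigram_precision(candidate, reference):
    """BLEU-style: shared unigrams divided by the candidate's length."""
    cand, ref = _unigrams(candidate), _unigrams(reference)
    return _overlap(cand, ref) / sum(cand.values())

def unigram_recall(candidate, reference):
    """ROUGE-style: shared unigrams divided by the reference's length."""
    cand, ref = _unigrams(candidate), _unigrams(reference)
    return _overlap(cand, ref) / sum(ref.values())
```

A very short but accurate candidate such as "the cat" against the reference "the cat sat on the mat" scores a perfect unigram precision of 1.0 but a recall of only 2/6: precision rewards saying nothing wrong, while recall rewards covering everything, which is why a recall-oriented measure suits summarization.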
 
===Domain-specific versus ___domain-independent summarization===
Domain-independent summarization techniques generally apply sets of general features to identify information-rich text segments. Recent research focuses on ___domain-specific summarization techniques that use knowledge specific to the text's ___domain, such as medical knowledge and ontologies for summarizing medical texts.<ref>{{Cite book|last1=Sarker|first1=Abeed|last2=Molla|first2=Diego|last3=Paris|first3=Cecile|title=An Approach for Query-focused Text Summarization for Evidence-based medicine|date=2013|volume=7885|pages=295–304|doi=10.1007/978-3-642-38326-7_41|series=Lecture Notes in Computer Science|isbn=978-3-642-38325-0}}</ref>
 
===Qualitative===
The main drawback of existing evaluation systems is that they need at least one reference summary (for some methods, more than one) against which to compare automatic summaries. Producing references is a hard and expensive task: much effort must be made to create corpora of texts and their corresponding summaries. Furthermore, some methods require not only human-made summaries for comparison but also manual annotation of those summaries (e.g. SCUs in the Pyramid Method). In any case, evaluation methods take as input a set of summaries to serve as gold standards and a set of automatic summaries, and they all perform a quantitative evaluation with regard to different similarity metrics.
 
==History==