==Evaluation==
Evaluation can be intrinsic or extrinsic,<ref>[http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings2/sum-mani.pdf Mani, I. Summarization evaluation: an overview]</ref> and inter-textual or intra-textual.<ref>{{Cite journal | doi=10.3103/S0005105507030041|title = A method for evaluating modern systems of automatic text summarization| journal=Automatic Documentation and Mathematical Linguistics| volume=41| issue=3| pages=93–103|year = 2007|last1 = Yatsko|first1 = V. A.| last2=Vishnyakov| first2=T. N.|s2cid = 7853204}}</ref>
 
=== Intrinsic versus extrinsic ===
Intrinsic evaluation assesses summaries directly, while extrinsic evaluation measures how the summarization system affects the completion of some other task. Intrinsic evaluations have mainly assessed the coherence and informativeness of summaries. Extrinsic evaluations, on the other hand, have tested the impact of summarization on tasks like relevance assessment and reading comprehension.
 
=== Inter-textual versus intra-textual ===
Intra-textual evaluation assesses the output of a specific summarization system, while inter-textual evaluation focuses on contrastive analysis of the outputs of several summarization systems.
 
Human judgement of what constitutes a "good" summary varies greatly, so creating an automatic evaluation process is particularly difficult. Manual evaluation can be used, but it is both time- and labor-intensive, as it requires humans to read not only the summaries but also the source documents. Other issues concern [[coherence (linguistics)|coherence]] and coverage.
 
The most common way to evaluate summaries is the [[ROUGE (metric)|ROUGE]] (Recall-Oriented Understudy for Gisting Evaluation) measure. It is very common for summarization and translation systems in [[NIST]]'s annual Document Understanding Conferences, where research groups submit their summarization and translation systems. [https://web.archive.org/web/20060408135021/http://haydn.isi.edu/ROUGE/] ROUGE is a recall-based measure of how well a summary covers the content of human-generated model summaries known as references. It calculates [[n-gram]] overlaps between automatically generated summaries and previously written human summaries. It is recall-based to encourage the inclusion of all the important topics in the summaries. Recall can be computed with respect to unigram, bigram, trigram, or 4-gram matching. For example, ROUGE-1 is the fraction of unigrams that appear in both the reference summary and the automatic summary, out of all unigrams in the reference summary. If there are multiple reference summaries, their scores are averaged. A high level of overlap should indicate a high degree of shared concepts between the two summaries.
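The ROUGE-N recall described above can be sketched as follows (a minimal illustration assuming simple whitespace tokenization and clipped n-gram counts; the function names are illustrative, and real ROUGE implementations add options such as stemming and stopword removal):

```python
from collections import Counter

def rouge_n_recall(candidate, reference, n=1):
    """Fraction of the reference's n-grams that also appear in the candidate."""
    def ngrams(text, n):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    if not ref:
        return 0.0
    # Clip each n-gram's match count at its frequency in the candidate.
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / sum(ref.values())

def rouge_n_multi(candidate, references, n=1):
    """Average the score over several reference summaries."""
    return sum(rouge_n_recall(candidate, r, n) for r in references) / len(references)
```

For example, with the reference "the cat sat on the mat" and the candidate "the cat lay on the mat", five of the six reference unigrams are matched, giving a ROUGE-1 score of about 0.83.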
 
Because ROUGE is based only on content overlap, it can determine if the same general concepts are discussed in an automatic summary and a reference summary, but it cannot determine if the result is coherent, that is, if the sentences flow together sensibly. High-order n-gram ROUGE measures help to judge fluency to some degree.
 
Another problem yet to be fully solved is [[anaphora (linguistics)|anaphor resolution]]. Similarly, for image summarization, Tschiatschek et al. developed a Visual-ROUGE score which judges the performance of image summarization algorithms.<ref>Sebastian Tschiatschek, Rishabh Iyer, Hoachen Wei and Jeff Bilmes, [http://papers.nips.cc/paper/5415-learning-mixtures-of-submodular-functions-for-image-collection-summarization.pdf Learning Mixtures of Submodular Functions for Image Collection Summarization], In Advances of Neural Information Processing Systems (NIPS), Montreal, Canada, December - 2014. (PDF)</ref>
 
(ROUGE is similar to the BLEU measure for machine translation, but BLEU is precision-based, because translation systems favor accuracy.)
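The precision/recall contrast can be illustrated with unigram scores (a hedged sketch with illustrative function names; BLEU proper also combines precisions over several n-gram orders and applies a brevity penalty):

```python
from collections import Counter

def _unigrams(text):
    return Counter(text.lower().split())

def _overlap(a, b):
    # Clipped count of shared unigrams.
    return sum(min(count, b[word]) for word, count in a.items())

def unigram_precision(candidate, reference):
    """BLEU-style: shared unigrams divided by the candidate's length."""
    cand, ref = _unigrams(candidate), _unigrams(reference)
    return _overlap(cand, ref) / sum(cand.values())

def unigram_recall(candidate, reference):
    """ROUGE-style: shared unigrams divided by the reference's length."""
    cand, ref = _unigrams(candidate), _unigrams(reference)
    return _overlap(cand, ref) / sum(ref.values())
```

A very short but accurate candidate such as "the cat" against the reference "the cat sat on the mat" scores a perfect unigram precision of 1.0 but a recall of only 2/6: precision rewards saying nothing wrong, while recall rewards covering everything, which is why a recall-oriented measure suits summarization.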
 
===Domain-specific versus ___domain-independent summarization===
Domain-independent summarization techniques generally apply sets of general features to identify information-rich text segments. Recent research focuses on ___domain-specific summarization techniques that use knowledge specific to the text's ___domain, such as medical knowledge and ontologies for summarizing medical texts.<ref>{{Cite book|last1=Sarker|first1=Abeed|last2=Molla|first2=Diego|last3=Paris|first3=Cecile|title=An Approach for Query-focused Text Summarization for Evidence-based medicine|date=2013|volume=7885|pages=295–304|doi=10.1007/978-3-642-38326-7_41|series=Lecture Notes in Computer Science|isbn=978-3-642-38325-0}}</ref>
 
===Qualitative===
The main drawback of existing evaluation systems is that they need at least one reference summary (for some methods, more than one) against which to compare automatic summaries. Producing references is a hard and expensive task: much effort must be made to create corpora of texts and their corresponding summaries. Furthermore, some methods require not only human-made summaries for comparison but also manual annotation of those summaries (e.g. SCUs in the Pyramid Method). In any case, evaluation methods take as input a set of summaries to serve as gold standards and a set of automatic summaries, and they all perform a quantitative evaluation with regard to different similarity metrics.
 
==History==