Automatic summarization: Difference between revisions

Evaluation: Major copy-edit
Intra-textual evaluation assesses the output of a specific summarization system, while inter-textual evaluation focuses on a contrastive analysis of the outputs of several summarization systems.
 
Human judgement often varies greatly in what it considers a "good" summary, so creating an automatic evaluation process is particularly difficult. Manual evaluation can be used, but it is both time-consuming and labor-intensive, as it requires humans to read not only the summaries but also the source documents. Other issues concern [[coherence (linguistics)|coherence]] and coverage.
 
The most common way to evaluate summaries is [[ROUGE (metric)|ROUGE]] (Recall-Oriented Understudy for Gisting Evaluation). It is widely used for summarization and translation systems in [[NIST]]'s Document Understanding Conferences.[https://web.archive.org/web/20060408135021/http://haydn.isi.edu/ROUGE/] ROUGE is a recall-based measure of how well a summary covers the content of human-generated summaries, known as references. It computes [[n-gram]] overlaps between automatically generated summaries and previously written human summaries. It is recall-based to encourage the inclusion of all important topics in the summaries. Recall can be computed with respect to unigram, bigram, trigram, or 4-gram matching. For example, ROUGE-1 is the fraction of unigrams in the reference summary that also appear in the automatic summary, out of all unigrams in the reference summary. If there are multiple reference summaries, their scores are averaged. A high level of overlap should indicate a high degree of shared concepts between the two summaries.
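The ROUGE-1 recall described above can be sketched in a few lines of Python. This is a simplified illustration, not the official ROUGE implementation: it assumes lowercased, whitespace-separated tokens and a single reference summary, and it clips the count of each overlapping unigram to its frequency in the reference, as full ROUGE implementations do.

```python
from collections import Counter

def rouge_1_recall(reference: str, candidate: str) -> float:
    """Simplified ROUGE-1 recall: overlapping unigram count divided by
    the total number of unigrams in the reference summary.
    Assumes whitespace tokenization and lowercasing (an illustrative
    simplification; real implementations apply stemming and stopword options).
    """
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clip each word's overlap to its count in the reference,
    # so a word repeated in the candidate is not over-credited.
    overlap = sum(min(ref_counts[w], cand_counts[w]) for w in ref_counts)
    total = sum(ref_counts.values())
    return overlap / total if total else 0.0
```

For instance, with reference "the cat sat on the mat" and candidate "the cat lay on the mat", five of the six reference unigram occurrences are matched, giving a ROUGE-1 recall of 5/6. With multiple references, the scores would be averaged as the text describes.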