Human judgement varies greatly as to what constitutes a "good" summary, which makes creating an automatic evaluation process particularly difficult. Manual evaluation can be used, but it is both time- and labor-intensive, as it requires humans to read not only the summaries but also the source documents. Other issues concern [[coherence (linguistics)|coherence]] and coverage.
 
The most common way to evaluate summaries is [[ROUGE (metric)|ROUGE]] (Recall-Oriented Understudy for Gisting Evaluation). It is widely used for summarization and translation systems in [[NIST]]'s Document Understanding Conferences.[https://web.archive.org/web/20060408135021/http://haydn.isi.edu/ROUGE/] ROUGE is a recall-based measure of how well a summary covers the content of human-generated summaries known as references. It calculates [[n-gram]] overlaps between automatically generated summaries and previously written human summaries. It is recall-based to encourage inclusion of all important topics in summaries. Recall can be computed with respect to unigram, bigram, trigram, or 4-gram matching. For example, ROUGE-1 is the fraction of unigrams that appear in both the reference summary and the automatic summary out of all unigrams in the reference summary. If there are multiple reference summaries, their scores are averaged. A high level of overlap should indicate a high degree of shared concepts between the two summaries.
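
As an illustration, ROUGE-1 recall as described above can be computed directly from unigram counts. The following is a minimal sketch, assuming simple whitespace tokenization and a single reference summary (official ROUGE implementations apply additional preprocessing such as stemming and stopword handling).

<syntaxhighlight lang="python">
from collections import Counter

def rouge_1_recall(candidate: str, reference: str) -> float:
    """ROUGE-1 recall: overlapping unigrams divided by unigrams in the reference."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clip each word's overlap to its count in the candidate so that
    # repeated words are not over-credited.
    overlap = sum(min(count, cand_counts[word]) for word, count in ref_counts.items())
    total = sum(ref_counts.values())
    return overlap / total if total else 0.0

reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
print(rouge_1_recall(candidate, reference))  # 5 of 6 reference unigrams matched: 0.833...
</syntaxhighlight>

With multiple reference summaries, this score would be computed against each reference and the results averaged, as noted above.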
 
ROUGE cannot determine whether the result is coherent, that is, whether the sentences flow together sensibly. Higher-order n-gram ROUGE measures help to some degree.