Content deleted Content added
→Evaluation: Integrated Evaluation subsection into main Evaluation section |
Citation bot (talk | contribs) Removed URL that duplicated identifier. Removed access-date with no URL. | Use this bot. Report bugs. | #UCB_CommandLine |
||
(39 intermediate revisions by 26 users not shown) | |||
Line 1:
{{Short description|Computer-based method for summarizing a text}}
{{More citations needed|date=April 2022}}
'''Automatic summarization''' is the process of shortening a set of data computationally, to create a subset (a [[Abstract (summary)|summary]]) that represents the most important or relevant information within the original content. [[Artificial intelligence]] [[algorithm]]s are commonly developed and employed to achieve this, specialized for different types of data.
== Commercial products ==
In 2022 [[Google Docs]] released an automatic summarization feature.<ref>{{Cite web |title=Auto-generated Summaries in Google Docs |url=http://ai.googleblog.com/2022/03/auto-generated-summaries-in-google-docs.html |access-date=2022-04-03 |website=Google AI Blog |date=23 March 2022 |language=en}}</ref>
==Approaches==
Line 18:
===Abstractive-based summarization===
Abstractive summarization methods generate new text that did not exist in the original text.<ref>{{Cite book |last=Zhai |first=ChengXiang
===Aided summarization===
Line 48 ⟶ 47:
====Supervised learning approaches====
Beginning with the work of Turney,<ref>{{Cite journal |arxiv = cs/0212020|last1 = Turney|first1 = Peter D|title = Learning Algorithms for Keyphrase Extraction|journal = Information Retrieval
Given a document, we construct an example for each [[unigram]], [[bigram]], and trigram found in the text (though other text units are also possible, as discussed below). We then compute various features describing each example (e.g., does the phrase begin with an upper-case letter?). We assume there are known keyphrases available for a set of training documents. Using the known keyphrases, we can assign positive or negative labels to the examples. Then we learn a classifier that can discriminate between positive and negative examples as a function of the features. Some classifiers make a [[binary classification]] for a test example, while others assign a probability of being a keyphrase. For instance, in the above text, we might learn a rule that says phrases with initial capital letters are likely to be keyphrases.
After training a learner, we can select keyphrases for test documents in the following manner. We apply the same example-generation strategy to the test documents, then run each example through the learner. We can determine the keyphrases by looking at binary classification decisions or probabilities returned from our learned model. If probabilities are given, a threshold is used to select the keyphrases.
Keyphrase extractors are generally evaluated using [[precision and recall]]. Precision measures how
many of the proposed keyphrases are actually correct. Recall measures how many of the true
keyphrases your system proposed. The two measures can be combined in an F-score, which is the
Line 58 ⟶ 57:
Designing a supervised keyphrase extraction system involves deciding on several choices (some of these apply to unsupervised, too). The first choice is exactly how to generate examples. Turney and others have used all possible unigrams, bigrams, and trigrams without intervening punctuation and after removing stopwords. Hulth showed that you can get some improvement by selecting examples to be sequences of tokens that match certain patterns of part-of-speech tags. Ideally, the mechanism for generating examples produces all the known labeled keyphrases as candidates, though this is often not the case. For example, if we use only unigrams, bigrams, and trigrams, then we will never be able to extract a known keyphrase containing four words. Thus, recall may suffer. However, generating too many examples can also lead to low precision.
We also need to create features that describe the examples and are informative enough to allow a learning algorithm to discriminate keyphrases from non- keyphrases. Typically features involve various term frequencies (how many times a phrase appears in the current text or in a larger corpus), the length of the example, relative position of the first occurrence, various
In the end, the system will need to return a list of keyphrases for a test document, so we need to have a way to limit the number. Ensemble methods (i.e., using votes from several classifiers) have been used to produce numeric scores that can be thresholded to provide a user-provided number of keyphrases. This is the technique used by Turney with C4.5 decision trees. Hulth used a single binary classifier so the learning algorithm implicitly determines the appropriate number.
Line 87 ⟶ 86:
====Maximum entropy-based summarization====
During the DUC 2001 and 2002 evaluation workshops, [[Netherlands Organisation for Applied Scientific Research|TNO]] developed a sentence extraction system for multi-document summarization in the news ___domain. The system was based on a hybrid system using a [[
==== Adaptive summarization ====
Line 123 ⟶ 122:
===Submodular functions as generic tools for summarization===
The idea of a [[submodular set function]] has recently emerged as a powerful modeling tool for various summarization problems. Submodular functions naturally model notions of ''coverage'', ''information'', ''representation'' and ''diversity''. Moreover, several important [[combinatorial optimization]] problems occur as special instances of submodular optimization. For example, the [[set cover problem]] is a special case of submodular optimization, since the set cover function is submodular. The set cover function attempts to find a subset of objects which ''cover'' a given set of concepts. For example, in document summarization, one would like the summary to cover all important and relevant concepts in the document. This is an instance of set cover. Similarly, the [[Optimal facility ___location|facility ___location problem]] is a special case of submodular functions. The Facility Location function also naturally models coverage and diversity. Another example of a submodular optimization problem is using a [[determinantal point process]] to model diversity. Similarly, the Maximum-Marginal-Relevance procedure can also be seen as an instance of submodular optimization. All these important models encouraging coverage, diversity and information are all submodular. Moreover, submodular functions can be efficiently combined, and the resulting function is still submodular. Hence, one could combine one submodular function which models diversity, another one which models coverage and use human supervision to learn a right model of a submodular function for the problem.
While submodular functions are fitting problems for summarization, they also admit very efficient algorithms for optimization. For example, a simple [[greedy algorithm]] admits a constant factor guarantee.<ref>Nemhauser, George L., Laurence A. Wolsey, and Marshall L. Fisher. "An analysis of approximations for maximizing submodular set functions—I." Mathematical Programming 14.1 (1978): 265-294.</ref> Moreover, the greedy algorithm is extremely simple to implement and can scale to large datasets, which is very important for summarization problems.
Line 129 ⟶ 128:
Submodular functions have achieved state-of-the-art for almost all summarization problems. For example, work by Lin and Bilmes, 2012<ref>Hui Lin, Jeff Bilmes. "[https://arxiv.org/abs/1210.4871 Learning mixtures of submodular shells with application to document summarization]", UAI, 2012</ref> shows that submodular functions achieve the best results to date on DUC-04, DUC-05, DUC-06 and DUC-07 systems for document summarization. Similarly, work by Lin and Bilmes, 2011,<ref>Hui Lin, Jeff Bilmes. "[http://www.aclweb.org/anthology/P11-1052 A Class of Submodular Functions for Document Summarization]", The 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT), 2011</ref> shows that many existing systems for automatic summarization are instances of submodular functions. This was a breakthrough result establishing submodular functions as the right models for summarization problems.{{citation needed|date=June 2018}}
Submodular Functions have also been used for other summarization tasks. Tschiatschek et al., 2014 show<ref>Sebastian Tschiatschek, Rishabh Iyer, Hoachen Wei and Jeff Bilmes, [http://papers.nips.cc/paper/5415-learning-mixtures-of-submodular-functions-for-image-collection-summarization.pdf Learning Mixtures of Submodular Functions for Image Collection Summarization], In Advances of Neural Information Processing Systems (NIPS), Montreal, Canada, December - 2014.</ref> that mixtures of submodular functions achieve state-of-the-art results for image collection summarization. Similarly, Bairi et al., 2015<ref>Ramakrishna Bairi, Rishabh Iyer, Ganesh Ramakrishnan and Jeff Bilmes, [http://www.aclweb.org/anthology/P15-1054 Summarizing Multi-Document Topic Hierarchies using Submodular Mixtures], To Appear In the Annual Meeting of the Association for Computational Linguistics (ACL), Beijing, China, July - 2015</ref> show the utility of submodular functions for summarizing multi-document topic hierarchies. Submodular Functions have also successfully been used for summarizing machine learning datasets.<ref>Kai Wei, Rishabh Iyer, and Jeff Bilmes, [http://www.jmlr.org/proceedings/papers/v37/wei15.pdf Submodularity in Data Subset Selection and Active Learning] {{Webarchive|url=https://web.archive.org/web/20170313220928/http://jmlr.org/proceedings/papers/v37/wei15.pdf |date=2017-03-13 }}, To Appear In Proc. International Conference on Machine Learning (ICML), Lille, France, June - 2015</ref>
===Applications===
Line 135 ⟶ 134:
Specific applications of automatic summarization include:
* The [[Reddit]] [[Internet bot|bot]] "autotldr",<ref>{{cite web|title=overview for autotldr|url=https://www.reddit.com/user/autotldr|website=reddit|access-date=9 February 2017|language=en}}</ref> created in 2011 summarizes news articles in the comment-section of reddit posts. It was found to be very useful by the reddit community which upvoted its summaries hundreds of thousands of times.<ref>{{cite book|last1=Squire|first1=Megan|author-link = Megan Squire|title=Mastering Data Mining with Python – Find patterns hidden in your data|publisher=Packt Publishing Ltd|isbn=9781785885914|url=https://books.google.com/books?id=_qXWDQAAQBAJ&pg=PA185|access-date=9 February 2017|language=en|date=2016-08-29}}</ref> The name is reference to [[TL;DR]] − [[Internet slang]] for "too long; didn't read".<ref>{{cite web|title=What Is 'TLDR'?|url=https://www.lifewire.com/what-is-tldr-2483633|website=Lifewire|access-date=9 February 2017}}</ref><ref>{{cite web|title=What Does TL;DR Mean? AMA? TIL? Glossary Of Reddit Terms And Abbreviations|url=http://www.ibtimes.com/what-does-tldr-mean-ama-til-glossary-reddit-terms-abbreviations-431704|work=International Business Times|access-date=9 February 2017|date=29 March 2012}}</ref>
* [[Adversarial stylometry]] may make use of summaries, if the detail lost is not major and the summary is sufficiently stylistically different to the input.{{sfn|Potthast|Hagen|Stein|2016|p=11-12}}
==Evaluation==
Line 143:
Evaluation can be intrinsic or extrinsic,<ref>[http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings2/sum-mani.pdf Mani, I. Summarization evaluation: an overview]</ref> and inter-textual or intra-textual.<ref>{{Cite journal | doi=10.3103/S0005105507030041|title = A method for evaluating modern systems of automatic text summarization| journal=Automatic Documentation and Mathematical Linguistics| volume=41| issue=3| pages=93–103|year = 2007|last1 = Yatsko|first1 = V. A.| last2=Vishnyakov| first2=T. N.|s2cid = 7853204}}</ref>
=== Intrinsic
=== Inter-textual
Intra-textual
Human judgement often
The most common way to evaluate summaries is [[ROUGE (metric)|ROUGE]] (Recall-Oriented Understudy for Gisting Evaluation)
Another unsolved problem
===Domain-specific versus ___domain-independent summarization===
Domain
===Qualitative===
The main drawback of the evaluation systems
==History==
The first publication in the area dates back to 1957 <ref>
===Recent approaches===
Recently the rise of [[Transformer (machine learning model)|
==See also==
Line 178 ⟶ 176:
==References==
{{Reflist|2}}
== Works cited ==
* {{ cite conference | url = https://ceur-ws.org/Vol-1609/16090716.pdf | title = Author Obfuscation: Attacking the State of the Art in Authorship Verification | last1 = Potthast | first1 = Martin | last2 = Hagen | first2 = Matthias | last3 = Stein | first3 = Benno | conference = Conference and Labs of the Evaluation Forum | year = 2016 }}
== Further reading ==
*{{cite book |last=Hercules |first=Dalianis |year=2003 |title=Porting and evaluation of automatic summarization|url=https://www.researchgate.net/publication/277288103}}
*{{cite book |last=Roxana |first=Angheluta |year=2002 |title=The Use of Topic Segmentation for Automatic Summarization|url=https://www.researchgate.net/publication/2553088}}
*{{cite book |last=Anne |first=Buist |year=2004 |title=Automatic Summarization of Meeting Data: A Feasibility Study |url=https://www.cs.ru.nl/~kraaijw/pubs/Biblio/papers/meeting_sum_tno.pdf |access-date=2020-07-19 |archive-date=2021-01-23 |archive-url=https://web.archive.org/web/20210123014007/http://www.cs.ru.nl/~kraaijw/pubs/Biblio/papers/meeting_sum_tno.pdf |url-status=dead }}
*{{cite book |last=Annie |first=Louis |year=2009 |title=Performance Confidence Estimation for Automatic Summarization|url=https://repository.upenn.edu/cgi/viewcontent.cgi?article=1762&context=cis_papers}}
*{{cite book |last=Elena |first=Lloret and Manuel, Palomar
*{{cite book |last=Andrew |first=Goldberg |year=2007 |title=Automatic Summarization}}
*{{cite book |last=Alrehamy |first=Hassan
*{{cite book |last=Endres-Niggemeyer |first=Brigitte |year=1998 |title=Summarizing Information |publisher=Springer |url=https://archive.org/details/springer_10.1007-978-3-642-72025-3 |isbn=978-3-540-63735-6}}
*{{cite book |last=Marcu |first=Daniel |year=2000 |title=The Theory and Practice of Discourse Parsing and Summarization |publisher=MIT Press |isbn=978-0-262-13372-2}}
*{{cite book |last=Mani |first=Inderjeet |year=2001 |title=Automatic Summarization |isbn=978-1-58811-060-2}}
*{{cite book |last=Huff |first=Jason |year=2010 |title=AutoSummarize |url=http://www.jason-huff.com/projects/autosummarize/}}, Conceptual artwork using automatic summarization software in Microsoft Word 2008.
|