Automatic summarization: Difference between revisions

Line 1:
{{Short description|Computer-based method for summarizing a text}}
{{More citations needed|date=April 2022}}
'''Automatic summarization''' is the process of shortening a set of data computationally, to create a subset (a [[Abstract (summary)|summary]]) that represents the most important or relevant information within the original content. [[Artificial intelligence]] [[algorithm]]s are commonly developed and employed to achieve this, specialized for different types of data.
 
[[Plain text|Text]] summarization is usually implemented by [[natural language processing]] methods, designed to locate the most informative sentences in a given document.<ref name="Torres2014">{{cite book|author1=Torres-Moreno, Juan-Manuel|title=Automatic Text Summarization|url=https://www.wiley.com/en-gb/Automatic+Text+Summarization-p-9781848216686|date=1 October 2014|publisher=Wiley|isbn=978-1-848-21668-6|pages=320–}}</ref> On the other hand, visual content can be summarized using [[computer vision]] algorithms. [[Image]] summarization is the subject of ongoing research; existing approaches typically attempt to display the most representative images from a given image collection, or generate a video that only includes the most important content from the entire collection.<ref>{{Cite journal|last1=Pan|first1=Xingjia|last2=Tang|first2=Fan|last3=Dong|first3=Weiming|last4=Ma|first4=Chongyang|last5=Meng|first5=Yiping|last6=Huang|first6=Feiyue|last7=Lee|first7=Tong-Yee|last8=Xu|first8=Changsheng|date=2021-04-01|title=Content-Based Visual Summarization for Image Collection|journal=IEEE Transactions on Visualization and Computer Graphics|volume=27|issue=4|pages=2298–2312|doi=10.1109/tvcg.2019.2948611|pmid=31647438|s2cid=204865221|issn=1077-2626}}</ref><ref>{{Cite news|date=January 10, 2018|title=WIPO PUBLISHES PATENT OF KT FOR "IMAGE SUMMARIZATION SYSTEM AND METHOD" (SOUTH KOREAN INVENTORS)|work=US Fed News Service|url=https://www.proquest.com/docview/1986931333|access-date=January 22, 2021|id={{ProQuest|1986931333}}}}</ref><ref>{{Cite journal|last1=Li Tan|last2=Yangqiu Song|last3=Shixia Liu|author3-link=Shixia Liu|last4=Lexing Xie|date=February 2012|title=ImageHive: Interactive Content-Aware Image Summarization|journal=IEEE Computer Graphics and Applications|volume=32|issue=1|pages=46–55|doi=10.1109/mcg.2011.89|pmid=24808292|s2cid=7668289|issn=0272-1716}}</ref> Video summarization algorithms identify and extract from the original video content the most important frames 
(''key-frames''), and/or the most important video segments (''key-shots''), normally in a temporally ordered fashion.<ref name="PalPetrosino2012">{{cite book|author1=Sankar K. Pal|author2=Alfredo Petrosino|author3=Lucia Maddalena|title=Handbook on Soft Computing for Video Surveillance|url=https://books.google.com/books?id=O0fNBQAAQBAJ&q=video+surveillance+summarization&pg=PA81|date=25 January 2012|publisher=CRC Press|isbn=978-1-4398-5685-7|pages=81–}}</ref><ref name="Elhamifar2012">{{cite book |last1=Elhamifar |first1=Ehsan |last2=Sapiro |first2=Guillermo |last3=Vidal |first3=Rene |title=2012 IEEE Conference on Computer Vision and Pattern Recognition |chapter=See all by looking at a few: Sparse modeling for finding representative objects |url=https://ieeexplore.ieee.org/document/6247852 |website=ieeexplore.ieee.org |year=2012 |pages=1600–1607 |publisher=IEEE |doi=10.1109/CVPR.2012.6247852 |isbn=978-1-4673-1228-8 |s2cid=5909301 |access-date=4 December 2022}}</ref><ref name="Mademlis2016">{{cite journal |last1=Mademlis |first1=Ioannis |last2=Tefas |first2=Anastasios |last3=Nikolaidis |first3=Nikos |last4=Pitas |first4=Ioannis |title=Multimodal stereoscopic movie summarization conforming to narrative characteristics |url=https://research-information.bris.ac.uk/files/111433536/Ioannis_Pitas_Multimodal_Stereoscopic_Movie_Summarization_Conforming_to_Narrative_Characteristics.pdf |journal=IEEE Transactions on Image Processing |year=2016 |volume=25 |issue=12 |pages=5828–5840 |publisher=IEEE |doi=10.1109/TIP.2016.2615289 |pmid=28113502 |bibcode=2016ITIP...25.5828M |hdl=1983/2bcdd7a5-825f-4ac9-90ec-f2f538bfcb72 |s2cid=18566122 |access-date=4 December 2022}}</ref><ref name="Mademlis2018">{{cite journal |last1=Mademlis |first1=Ioannis |last2=Tefas |first2=Anastasios |last3=Pitas |first3=Ioannis |title=A salient dictionary learning framework for activity video summarization via key-frame extraction 
|url=https://www.sciencedirect.com/science/article/abs/pii/S0020025517311398 |journal=Information Sciences |year=2018 |volume=432 |pages=319–331 |publisher=Elsevier |doi=10.1016/j.ins.2017.12.020 |access-date=4 December 2022|url-access=subscription }}</ref> Video summaries simply retain a carefully selected subset of the original video frames and, therefore, are not identical to the output of [[video synopsis]] algorithms, where ''new'' video frames are being synthesized based on the original video content.
 
== Commercial products ==
Line 18:
===Abstractive-based summarization===
 
Abstractive summarization methods generate new text that did not exist in the original text.<ref>{{Cite book |last=Zhai |first=ChengXiang |url=https://www.worldcat.org/oclc/957355971 |title=Text data management and analysis : a practical introduction to information retrieval and text mining |date=2016 |others=Sean Massung |isbn=978-1-970001-19-8 |page=321 |location=[New York, NY] |oclc=957355971}}</ref> This has been applied mainly to text. Abstractive methods build an internal semantic representation of the original content (often called a language model), and then use this representation to create a summary that is closer to what a human might express. Abstraction may transform the extracted content by [[automated paraphrasing|paraphrasing]] sections of the source document, to condense a text more strongly than extraction. Such transformation, however, is computationally much more challenging than extraction, involving both [[natural language processing]] and often a deep understanding of the domain of the original text in cases where the original document relates to a special field of knowledge. "Paraphrasing" is even more difficult to apply to images and videos, which is why most summarization systems are extractive.
 
===Aided summarization===
Line 57:
Designing a supervised keyphrase extraction system involves several design choices (some of which also apply to unsupervised systems). The first choice is exactly how to generate examples. Turney and others have used all possible unigrams, bigrams, and trigrams without intervening punctuation and after removing stopwords. Hulth showed that some improvement can be obtained by selecting examples to be sequences of tokens that match certain patterns of part-of-speech tags. Ideally, the mechanism for generating examples produces all the known labeled keyphrases as candidates, though this is often not the case. For example, if only unigrams, bigrams, and trigrams are used, then a known keyphrase containing four words can never be extracted. Thus, recall may suffer. However, generating too many examples can also lead to low precision.
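
The candidate-generation step described above can be illustrated with a minimal Python sketch. It is not Turney's or Hulth's actual implementation: the function name <code>candidate_ngrams</code>, the tiny stopword list, and the choice to remove stopwords before forming n-grams are all assumptions made for illustration.

```python
import re

# Tiny illustrative stopword list (an assumption; real systems use larger lists).
STOPWORDS = {"a", "an", "the", "of", "and", "or", "in", "on", "to", "is", "are", "for", "with"}


def candidate_ngrams(text, max_n=3):
    """Return candidate phrases of 1..max_n words that do not cross punctuation."""
    candidates = set()
    # Split on punctuation so that no candidate spans a punctuation mark.
    for segment in re.split(r"[^\w\s-]+", text.lower()):
        # Drop stopwords, then form n-grams over the remaining tokens.
        tokens = [t for t in segment.split() if t not in STOPWORDS]
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                candidates.add(" ".join(tokens[i:i + n]))
    return candidates


cands = candidate_ngrams("Automatic summarization of text, using natural language processing.")
```

With <code>max_n=3</code>, the four-word phrase "using natural language processing" is never produced, which is the recall limitation noted above.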
 
We also need to create features that describe the examples and are informative enough to allow a learning algorithm to discriminate keyphrases from non-keyphrases. Typically features involve various term frequencies (how many times a phrase appears in the current text or in a larger corpus), the length of the example, relative position of the first occurrence, various Boolean syntactic features (e.g., contains all caps), etc. The Turney paper used about 12 such features. Hulth uses a reduced set of features, which were found most successful in the KEA (Keyphrase Extraction Algorithm) work derived from Turney's seminal paper.
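
As a rough illustration of such features, the following Python sketch computes four of the kinds listed above for a single candidate phrase. The function name <code>phrase_features</code> and the feature names are hypothetical, and the substring-based counting is a simplification rather than the feature set of Turney, Hulth, or KEA.

```python
def phrase_features(phrase, document):
    """Compute a few simple features for one candidate phrase.

    Matching is done on lowercased substrings, which is a simplification:
    a real system would match on token boundaries.
    """
    doc_lower = document.lower()
    phrase_lower = phrase.lower()
    first = doc_lower.find(phrase_lower)
    return {
        # How many times the phrase appears in the current text.
        "term_frequency": doc_lower.count(phrase_lower),
        # Length of the example, in words.
        "length_in_words": len(phrase.split()),
        # Position of the first occurrence as a fraction of document length
        # (1.0 if the phrase does not occur or the document is empty).
        "relative_first_position": first / len(document) if first >= 0 and document else 1.0,
        # Boolean syntactic feature: the phrase is written in all caps.
        "is_all_caps": phrase.isupper(),
    }


features = phrase_features("NLP", "NLP systems summarize text. NLP is widely used.")
```

Each candidate would be turned into one such feature vector and passed, together with its keyphrase or non-keyphrase label, to the learning algorithm.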
 
In the end, the system will need to return a list of keyphrases for a test document, so we need to have a way to limit the number. Ensemble methods (i.e., using votes from several classifiers) have been used to produce numeric scores that can be thresholded to provide a user-provided number of keyphrases. This is the technique used by Turney with C4.5 decision trees. Hulth used a single binary classifier so the learning algorithm implicitly determines the appropriate number.
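
The vote-based scoring idea can be sketched as follows. The classifiers here are hypothetical stand-ins (plain Python callables returning a Boolean), not C4.5 decision trees, and <code>top_keyphrases</code> is an illustrative name; the sketch only shows how vote counts give a numeric score that can be cut off at a user-provided number of keyphrases.

```python
def top_keyphrases(candidates, classifiers, k):
    """Rank candidates by the number of classifiers voting 'keyphrase'; keep the top k."""
    # Each classifier is assumed to be a callable returning True or False.
    votes = {c: sum(1 for clf in classifiers if clf(c)) for c in candidates}
    # Sort by descending vote count; break ties alphabetically for determinism.
    ranked = sorted(candidates, key=lambda c: (-votes[c], c))
    return ranked[:k]


# Toy stand-in classifiers, for illustration only.
toy_classifiers = [
    lambda p: len(p) > 4,
    lambda p: " " in p,
    lambda p: p.endswith("ization"),
]
result = top_keyphrases(["summarization", "the", "text summarization", "data"], toy_classifiers, k=2)
```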
Line 122:
 
===Submodular functions as generic tools for summarization===
The idea of a [[submodular set function]] has recently emerged as a powerful modeling tool for various summarization problems. Submodular functions naturally model notions of ''coverage'', ''information'', ''representation'' and ''diversity''. Moreover, several important [[combinatorial optimization]] problems occur as special instances of submodular optimization. For example, the [[set cover problem]] is a special case of submodular optimization, since the set cover function is submodular. The set cover function attempts to find a subset of objects which ''cover'' a given set of concepts. For example, in document summarization, one would like the summary to cover all important and relevant concepts in the document. This is an instance of set cover. Similarly, the [[Optimal facility location|facility location problem]] is a special case of submodular optimization. The facility location function also naturally models coverage and diversity. Another example of a submodular optimization problem is using a [[determinantal point process]] to model diversity. Similarly, the Maximum-Marginal-Relevance procedure can also be seen as an instance of submodular optimization. All of these models, which encourage coverage, diversity and information, are submodular. Moreover, submodular functions can be efficiently combined, and the resulting function is still submodular. Hence, one could combine one submodular function that models diversity with another that models coverage, and use human supervision to learn a suitable submodular model for the problem.
 
While submodular functions are well suited to modeling summarization problems, they also admit very efficient algorithms for optimization. For example, a simple [[greedy algorithm]] admits a constant factor guarantee.<ref>Nemhauser, George L., Laurence A. Wolsey, and Marshall L. Fisher. "An analysis of approximations for maximizing submodular set functions—I." Mathematical Programming 14.1 (1978): 265-294.</ref> Moreover, the greedy algorithm is extremely simple to implement and can scale to large datasets, which is very important for summarization problems.
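
A minimal sketch of such a greedy procedure is given below, for a set-cover-style coverage objective with a limit on the number of selected sentences. The representation of each sentence as a set of concept labels and the name <code>greedy_summary</code> are assumptions made for illustration. For monotone submodular objectives under a cardinality constraint, the constant factor established by Nemhauser, Wolsey and Fisher is 1 − 1/''e''.

```python
def greedy_summary(sentence_concepts, budget):
    """Greedily pick up to `budget` sentences that maximize concept coverage.

    `sentence_concepts` is a list of sets; each set holds the concepts
    covered by one sentence (a hypothetical input representation).
    """
    selected = []
    covered = set()
    for _ in range(budget):
        best, best_gain = None, 0
        for i, concepts in enumerate(sentence_concepts):
            if i in selected:
                continue
            # Marginal gain: concepts this sentence adds beyond those covered.
            gain = len(concepts - covered)
            if gain > best_gain:
                best, best_gain = i, gain
        if best is None:
            break  # No remaining sentence adds anything new.
        selected.append(best)
        covered |= sentence_concepts[best]
    return selected, covered


sentences = [{"a", "b", "c"}, {"c", "d"}, {"a"}, {"d", "e", "f", "g"}]
chosen, covered = greedy_summary(sentences, budget=2)
```

The marginal gain of a sentence can only shrink as more concepts are covered, which is the diminishing-returns property that makes the coverage function submodular.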
Line 158:
 
===Domain-specific versus domain-independent summarization===
Domain-independent summarization techniques apply sets of general features to identify information-rich text segments. Recent research focuses on domain-specific summarization using knowledge specific to the text's domain, such as medical knowledge and ontologies for summarizing medical texts.<ref>{{Cite book|last1=Sarker|first1=Abeed|last2=Molla|first2=Diego|last3=Paris|first3=Cecile|title=Artificial Intelligence in Medicine |chapter=An Approach for Query-Focused Text Summarisation for Evidence-Based Medicine |date=2013|volume=7885|pages=295–304|doi=10.1007/978-3-642-38326-7_41|series=Lecture Notes in Computer Science|isbn=978-3-642-38325-0}}</ref>
 
===Qualitative===
Line 164:
 
==History==
The first publication in the area, by [[Hans Peter Luhn]], dates back to 1957<ref> Luhn, Hans Peter (1957). "A Statistical Approach to Mechanized Encoding and Searching of Literary Information" (PDF). IBM Journal of Research and Development. 1 (4): 309–317. doi:10.1147/rd.14.0309.</ref> and started with a statistical technique. Research increased significantly in 2015. [[Term frequency–inverse document frequency]] had been used by 2016. Pattern-based summarization was the most powerful option for multi-document summarization found by 2016. In the following year it was surpassed by [[latent semantic analysis]] (LSA) combined with [[non-negative matrix factorization]] (NMF). Although they did not replace other approaches and are often combined with them, by 2019 machine learning methods dominated the extractive summarization of single documents, which was considered to be nearing maturity. By 2020, the field was still very active and research was shifting towards abstractive summarization and real-time summarization.<ref>{{Cite journal|date=2020-05-20|title=Review of automatic text summarization techniques & methods|journal=Journal of King Saud University - Computer and Information Sciences|language=en|doi=10.1016/j.jksuci.2020.05.006|issn=1319-1578|last1=Widyassari|first1=Adhika Pramita|last2=Rustad|first2=Supriadi|last3=Shidik|first3=Guruh Fajar|last4=Noersasongko|first4=Edi|last5=Syukur|first5=Abdul|last6=Affandy|first6=Affandy|last7=Setiadi|first7=De Rosal Ignatius Moses|volume=34 |issue=4 |pages=1029–1046 |doi-access=free}}</ref>
 
===Recent approaches===
Line 183:
*{{cite book |last=Hercules |first=Dalianis |year=2003 |title=Porting and evaluation of automatic summarization|url=https://www.researchgate.net/publication/277288103}}
*{{cite book |last=Roxana |first=Angheluta |year=2002 |title=The Use of Topic Segmentation for Automatic Summarization|url=https://www.researchgate.net/publication/2553088}}
*{{cite book |last=Anne |first=Buist |year=2004 |title=Automatic Summarization of Meeting Data: A Feasibility Study |url=https://www.cs.ru.nl/~kraaijw/pubs/Biblio/papers/meeting_sum_tno.pdf |access-date=2020-07-19 |archive-date=2021-01-23 |archive-url=https://web.archive.org/web/20210123014007/http://www.cs.ru.nl/~kraaijw/pubs/Biblio/papers/meeting_sum_tno.pdf |url-status=dead }}
*{{cite book |last=Annie |first=Louis |year=2009 |title=Performance Confidence Estimation for Automatic Summarization|url=https://repository.upenn.edu/cgi/viewcontent.cgi?article=1762&context=cis_papers}}
*{{cite book |last=Elena |first=Lloret and Manuel, Palomar |year=2009 |title=Challenging Issues of Automatic Summarization: Relevance Detection and Quality-based Evaluation |url=http://www.informatica.si/ojs-2.4.3/index.php/informatica/article/download/273/269 |access-date=2018-10-03 |archive-date=2018-10-03 |archive-url=https://web.archive.org/web/20181003061926/http://www.informatica.si/ojs-2.4.3/index.php/informatica/article/download/273/269 |url-status=dead }}
*{{cite book |last=Andrew |first=Goldberg |year=2007 |title=Automatic Summarization}}
*{{cite book |last=Alrehamy |first=Hassan |year=2017 |title=Advances in Computational Intelligence Systems |volume=650 |pages=222–235 |doi=10.1007/978-3-319-66939-7_19 |chapter=SemCluster: Unsupervised Automatic Keyphrase Extraction Using Affinity Propagation |series=Advances in Intelligent Systems and Computing |date=2018 |isbn=978-3-319-66938-0 }}
*{{cite book |last=Endres-Niggemeyer |first=Brigitte |year=1998 |title=Summarizing Information |publisher=Springer |url=https://archive.org/details/springer_10.1007-978-3-642-72025-3 |isbn=978-3-540-63735-6}}
*{{cite book |last=Marcu |first=Daniel |year=2000 |title=The Theory and Practice of Discourse Parsing and Summarization |publisher=MIT Press |isbn=978-0-262-13372-2}}
*{{cite book |last=Mani |first=Inderjeet |year=2001 |title=Automatic Summarization |isbn=978-1-58811-060-2}}
*{{cite book |last=Huff |first=Jason |year=2010 |title=AutoSummarize |url=http://www.jason-huff.com/projects/autosummarize/}}, Conceptual artwork using automatic summarization software in Microsoft Word 2008.