Automatic summarization is the creation of a shortened version of a text by a computer program. The resulting summary still contains the most important points of the original text.
Access to coherent and well-formed text summaries can be of great use, especially in an age of information overload in which the amount of electronically available information grows every day. Search engines such as Google are a good example of where summarization technology can be applied.
Technologies that can produce a coherent summary of any kind of text need to take into account variables such as length, writing style and syntax in order to make a useful summary.
Extraction and Abstraction
Broadly, one distinguishes two approaches: extraction and abstraction.
Extraction techniques merely copy the information deemed most important by the system to the summary, while abstraction involves paraphrasing sections of the source document. In general, abstraction can condense a text more strongly than extraction, but the programs that can do this are harder to develop.
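As a rough illustration of the extractive approach, many systems score each sentence and copy the highest-scoring ones into the summary. The word-frequency heuristic below is only an assumed, simplified scoring method, not the technique of any particular system:

```python
import re
from collections import Counter

def extractive_summary(text, num_sentences=2):
    """Score each sentence by the document-wide frequency of its words
    and return the top-scoring sentences in their original order
    (a simple frequency-based extraction heuristic)."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    freq = Counter(re.findall(r'\w+', text.lower()))
    scores = {
        s: sum(freq[w] for w in re.findall(r'\w+', s.lower()))
        for s in sentences
    }
    top = sorted(sentences, key=scores.get, reverse=True)[:num_sentences]
    return ' '.join(s for s in sentences if s in top)
```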
Types of Summaries
There are different types of summaries depending on what the summarization program focuses on, for example generic summaries or query-relevant summaries (sometimes called query-biased summaries).
Summarization systems are now able to create both query-relevant text summaries and generic machine-generated summaries, depending on what the user needs. Summarization of multimedia documents, e.g. pictures or movies, is also possible.
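To illustrate how a query-relevant summary differs from a generic one, a system might weight sentences by their overlap with the user's query terms. This is only an assumed toy scoring scheme for the sake of illustration:

```python
import re

def query_biased_summary(text, query, num_sentences=2):
    """Pick the sentences that share the most terms with the query,
    so the summary is biased toward the user's information need."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    query_terms = set(re.findall(r'\w+', query.lower()))
    scores = {
        s: len(query_terms & set(re.findall(r'\w+', s.lower())))
        for s in sentences
    }
    top = sorted(sentences, key=scores.get, reverse=True)[:num_sentences]
    return ' '.join(s for s in sentences if s in top)
```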
Aided summarization
Machine learning techniques from closely related fields such as information retrieval and text mining have been successfully adapted to help automatic summarization.
Apart from Fully Automated Summarizers (FAS), there are systems that aid users with the task of summarization (MAHS = Machine Aided Human Summarization), for example by highlighting candidate passages to be included in the summary, and there are systems that depend on post-processing by a human (HAMS = Human Aided Machine Summarization).
Issues
One of the major issues in current automatic summarization research is how to evaluate a summarization system, or, in other words, how to tell automatically and systematically that summary A is better than summary B. And what does "better" mean? Should we prefer a perfectly grammatical summary that contains very little information, or a not-so-well-written summary that contains a lot of it?
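One widely used family of automatic measures (e.g. ROUGE) compares the word or n-gram overlap between a machine summary and one or more human-written reference summaries. The unigram-recall sketch below is only a minimal illustration of that idea, not a full evaluation protocol:

```python
import re
from collections import Counter

def unigram_recall(candidate, reference):
    """Fraction of the reference summary's words that also appear in
    the candidate summary (a ROUGE-1-recall-style overlap measure)."""
    cand = Counter(re.findall(r'\w+', candidate.lower()))
    ref = Counter(re.findall(r'\w+', reference.lower()))
    overlap = sum(min(cand[w], ref[w]) for w in ref)
    return overlap / max(sum(ref.values()), 1)
```

Such overlap scores correlate only loosely with human judgments of readability, which is why content coverage and linguistic quality are often assessed separately.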
Further Reading
- Endres-Niggemeyer, Brigitte (1998): Summarizing Information (ISBN 3540637354)
- Marcu, Daniel (2000): The Theory and Practice of Discourse Parsing and Summarization (ISBN 0262133725)
- Mani, Inderjeet (2001): Automatic Summarization (ISBN 1588110605)
External links
- Presentation about statistical summarization methods as well as an on-line summarizer
- Text Summarization
- ACM Special Interest Group on Information Retrieval
- Pertinence Summarizer, a commercial webpage summarization demo (launch a query, click "summary", login: "google", password: "google")