Text segmentation: Difference between revisions

Script-assisted fixes: per MOS:NUM, MOS:CAPS, MOS:LINK
Metasyn (talk | contribs)
m Topic segmentation: adding the link to topic modeling
Line 31:
=== Topic segmentation ===
{{main|Topic analysis|Document classification}}
Topic analysis consists of two main tasks: topic identification and text segmentation. The first is a straightforward [[machine learning|classification]] of a given text. The second assumes that a document may contain multiple topics; the task of automatic text segmentation is then to discover these topics and divide the text accordingly. Topic boundaries may be apparent from section titles and paragraph breaks. Where they are not, one needs techniques similar to those used in [[document classification]].
 
Segmenting text into [[topic (linguistics)|topic]]s or [[discourse]] turns can be useful in some natural language processing tasks: it can improve information retrieval or speech recognition significantly (by indexing or recognizing documents more precisely, or by returning the specific part of a document that corresponds to the query). It is also needed in [[topic detection]] and tracking systems and in [[text summarization]] problems.
Line 51:
| format = PDF
| accessdate = 2007-11-08
}}</ref> e.g. [[Hidden Markov model|HMM]], [[lexical chains]], passage similarity using word [[co-occurrence]], [[cluster analysis|clustering]], [[topic modeling]], etc.
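
As an illustration of the passage-similarity approach, the following minimal sketch places a topic boundary wherever the word overlap between adjacent blocks of sentences drops below a threshold. It is a simplified example, not a published algorithm: the function names, the <code>window</code> size, and the <code>threshold</code> value are assumptions chosen for illustration, and no stop-word removal or stemming is applied.

```python
import math
import re
from collections import Counter


def bag_of_words(sentences):
    """Count lowercased alphabetic tokens across a block of sentences."""
    words = []
    for sentence in sentences:
        words.extend(re.findall(r"[a-z]+", sentence.lower()))
    return Counter(words)


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def segment(sentences, window=2, threshold=0.1):
    """Return indices i where a topic boundary is placed before sentence i.

    window and threshold are illustrative values (assumptions),
    not tuned or recommended settings.
    """
    boundaries = []
    for i in range(1, len(sentences)):
        left = bag_of_words(sentences[max(0, i - window):i])
        right = bag_of_words(sentences[i:i + window])
        # Low lexical overlap between neighbouring blocks suggests a topic shift.
        if cosine(left, right) < threshold:
            boundaries.append(i)
    return boundaries
```

With a fixed threshold the result depends strongly on vocabulary and block size; practical systems usually smooth the similarity curve and compare each dip to its neighbouring peaks rather than to a constant.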
 
The task is quite ambiguous: people evaluating text segmentation systems often disagree on where topic boundaries lie. Hence, evaluating text segmentation is also a challenging problem.
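
One way to soften this problem is to score near misses less harshly than exact-match accuracy does. The sketch below computes a P<sub>k</sub>-style error rate: a probe of width ''k'' slides over the text, and an error is counted whenever the reference and the hypothesis disagree on whether the two ends of the probe lie in the same segment. The function names, the representation of boundaries as sentence indices, and the default choice of ''k'' (half the mean reference segment length) are assumptions made for this example.

```python
def segment_ids(n_units, boundaries):
    """Label each unit with a segment number; a boundary at i starts a new segment before unit i."""
    boundary_set = set(boundaries)
    ids, current = [], 0
    for i in range(n_units):
        if i in boundary_set:
            current += 1
        ids.append(current)
    return ids


def pk(reference, hypothesis, n_units, k=None):
    """Fraction of probes on which reference and hypothesis disagree (lower is better)."""
    ref = segment_ids(n_units, reference)
    hyp = segment_ids(n_units, hypothesis)
    if k is None:
        # Assumed default: half the mean length of a reference segment.
        n_segments = len(set(ref))
        k = max(1, round(n_units / n_segments / 2))
    probes = n_units - k
    if probes <= 0:
        raise ValueError("text is too short for the chosen probe width k")
    errors = sum(
        (ref[i] == ref[i + k]) != (hyp[i] == hyp[i + k])
        for i in range(probes)
    )
    return errors / probes
```

Under this measure a hypothesis whose boundary is off by one sentence is penalized on only a few probes, so two annotators who roughly agree receive a low mutual error rather than a complete mismatch.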