→Word segmentation: minor fixes, mostly disambig links using AWB |
Line 28:
As with word segmentation, not all written languages contain punctuation characters that are useful for approximating sentence boundaries.
=== Topic segmentation ===
{{main|Topic analysis|Document classification}}
Topic analysis consists of two main tasks: topic identification and text segmentation. The first is a simple [[machine learning|classification]] of a given text. The second assumes that a document may contain multiple topics, and the task of computerized text segmentation may be to discover these topics automatically and segment the text accordingly. Topic boundaries may be apparent from section titles and paragraphs. In other cases, one needs to use techniques similar to those used in [[document classification]].
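
As a rough illustration of automatic topic segmentation, the following minimal sketch places a boundary wherever two adjacent sentences share little vocabulary. It is a simplified lexical-cohesion heuristic, not a method described in this article: the function names, the bag-of-words representation, and the <code>threshold</code> value are all assumptions chosen for the example, and a practical system would at least filter common function words and compare larger blocks of text.

```python
import math
import re
from collections import Counter


def bag_of_words(sentence):
    """Count lowercase alphabetic tokens in a sentence (hypothetical helper)."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def segment_by_topic(sentences, threshold=0.1):
    """Group sentences into segments, starting a new segment whenever
    the similarity to the previous sentence falls below `threshold`.

    The threshold is an illustrative assumption, not a tuned value.
    """
    if not sentences:
        return []
    segments = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(bag_of_words(prev), bag_of_words(cur)) < threshold:
            segments.append([cur])      # low overlap: assume a topic shift
        else:
            segments[-1].append(cur)    # enough overlap: same topic
    return segments


sentences = [
    "The cat chased the mouse.",
    "The mouse hid from the cat.",
    "Stock prices rose sharply.",
    "Investors bought more stock.",
]
segments = segment_by_topic(sentences)
```

In this toy input the second and third sentences share no words, so the sketch splits the text between them and keeps each pair together. Because shared function words such as "the" also count toward similarity, real texts would likely need stop-word removal before a heuristic of this kind becomes useful.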