Text segmentation

This is an old revision of this page, as edited by Jorge Stolfi (talk | contribs) at 15:45, 4 March 2006 (moved Word segmentation to Text segmentation: "Word segmentation" suggests the segmentation of words into morphemes, which is not what the article was meant to be about. "Text segmentation" may not be ideal either, but...). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

Written text segmentation is the process of dividing written text into words or other similar meaningful units. The term applies both to the mental processes used by humans when reading text and to artificial processes implemented in computers, which are the subject of natural language processing.

The problem is relatively trivial for written languages that have explicit word boundary markers, such as the word spaces of written English or the distinctive initial, medial and final letter shapes of Arabic. When such clues are not consistently available, the task often requires fairly non-trivial techniques, such as statistical decision-making, large dictionaries, and consideration of syntactic and semantic constraints.
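
As a rough illustration of the contrast, the following Python sketch first splits on word spaces and then segments unspaced text using a dictionary. It is only one possible approach, not a method described in this article: the function name segment, the toy vocabulary, and the rule of preferring the segmentation with the fewest words are all assumptions made for the example. That rule stands in for the statistical and syntactic criteria a real system would need.

```python
def segment(text, vocabulary):
    """Split text into vocabulary words, preferring the fewest words.

    Returns a list of words, or None if no segmentation exists.
    The "fewest words" preference is an illustrative heuristic only.
    """
    n = len(text)
    max_len = max(map(len, vocabulary), default=0)
    # best[i] holds the shortest word list found for text[:i], or None.
    best = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if best[start] is not None and word in vocabulary:
                candidate = best[start] + [word]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[n]


# With explicit word spaces, splitting on whitespace is enough.
spaced = "text segmentation is hard".split()

# Without them, a dictionary lookup is needed (toy vocabulary, assumed).
vocabulary = {"text", "segmentation", "is", "hard"}
unspaced = segment("textsegmentationishard", vocabulary)
```

A dictionary alone does not resolve every case: a string such as "wordspaces" can be read as "word spaces" or "words paces" if all four entries are in the vocabulary, which is why additional statistical, syntactic or semantic evidence is usually needed.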

See also