segmentation also applies to words; say so in lede |
→Word segmentation: explain languages with explicit delimiter, clean up & clarify a bit |
Line 6:
=== Word segmentation ===
Word segmentation is the problem of dividing a string of written language into its component [[word]]s.
In [[English language|English]] and many other languages, the space is a good approximation of a word delimiter. However, the equivalent of this character is not found in all written scripts, and without it word segmentation is a difficult problem. Languages that do not have a trivial word segmentation process include [[Chinese language|Chinese]] and [[Japanese language|Japanese]], where [[sentences]] but not words are delimited, and [[Thai language|Thai]] and [[Lao language|Lao]], where phrases and sentences but not words are delimited. In some writing systems, however, such as the [[Ge'ez script]] used for [[Amharic]] and [[Tigrinya]] among other languages, words are explicitly delimited (at least historically) with a non-[[whitespace]] character.
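The delimiter-based case can be illustrated with a minimal Python sketch. It assumes that the non-whitespace delimiter of interest is the Ethiopic wordspace (U+1361); the function name <code>split_on_delimiters</code> is hypothetical and chosen only for this illustration. Text in a script without word delimiters comes back as a single unsegmented piece, which is why such languages need a different approach.

```python
import re

# Assumption for this sketch: U+1361 ETHIOPIC WORDSPACE stands in for the
# explicit, non-whitespace word delimiter of the Ge'ez script.
ETHIOPIC_WORDSPACE = "\u1361"

def split_on_delimiters(text):
    """Split text on runs of whitespace or Ethiopic wordspace characters."""
    pattern = r"[\s" + ETHIOPIC_WORDSPACE + r"]+"
    # re.split can yield empty strings at the edges; drop them.
    return [piece for piece in re.split(pattern, text) if piece]
```

With this sketch, a space-delimited English phrase and a wordspace-delimited string are both split into words, while a Chinese string with no delimiters is returned whole.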
'''Word splitting''' is the process of [[parsing]] [[concatenated]] text (i.e. text that contains no spaces or other word separators) to infer where word breaks exist.
Word splitting may also refer to the process of [[hyphenation]].
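One simple way to infer word breaks in concatenated text is to look words up in a known vocabulary. The following Python sketch uses dynamic programming over prefixes of the text. It is only an illustration under stated assumptions: the vocabulary is supplied by the caller, the name <code>split_words</code> is hypothetical, and ambiguous inputs are resolved by preferring the longest possible final word at each position, with no statistical or linguistic model.

```python
def split_words(text, vocabulary):
    """Return one segmentation of text into vocabulary words, or None.

    vocabulary is any collection of known words (a set gives fast lookups).
    """
    n = len(text)
    max_len = max((len(word) for word in vocabulary), default=0)

    # best[i] is a list of words that exactly covers text[:i],
    # or None if no such covering has been found.
    best = [None] * (n + 1)
    best[0] = []

    for end in range(1, n + 1):
        # Only consider pieces no longer than the longest vocabulary word.
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if best[start] is not None and piece in vocabulary:
                best[end] = best[start] + [piece]
                break  # smallest start first, so the longest final piece wins

    return best[n]
```

For example, with the vocabulary <code>{"word", "splitting", "is", "fun"}</code>, the input <code>"wordsplittingisfun"</code> is segmented into four words, and an input that cannot be covered by vocabulary words yields <code>None</code>.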
=== Sentence segmentation ===