No edit summary
→Word segmentation: minor fixes, mostly disambig links using AWB
Line 14:
However, the equivalent of this character is not found in all written scripts, and without it word segmentation is a difficult problem. Languages that do not have a trivial word segmentation process include [[Chinese language|Chinese]] and [[Japanese language|Japanese]], where [[sentences]] but not words are delimited; [[Thai language|Thai]] and [[Lao language|Lao]], where phrases and sentences but not words are delimited; and [[Vietnamese language|Vietnamese]], where syllables but not words are delimited.
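As a rough illustration of why segmentation is hard without a delimiter, the following sketch applies greedy forward longest-match segmentation against a word list. The function name <code>max_match</code> and the tiny lexicon are hypothetical and chosen only for this example; practical segmenters rely on much larger dictionaries or on statistical models.

```python
def max_match(text, lexicon):
    """Greedy forward longest-match segmentation (illustrative sketch)."""
    # Length of the longest lexicon entry bounds how far ahead we look.
    longest = max(map(len, lexicon), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink it until it is in
        # the lexicon; fall back to a single character if nothing matches.
        for j in range(min(len(text), i + longest), i, -1):
            if text[i:j] in lexicon or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words


# Toy lexicon, assumed for illustration only.
LEXICON = {"北京", "大学", "北京大学", "生", "学生", "大学生"}

print(max_match("北京大学生", LEXICON))  # ['北京大学', '生']
```

With this lexicon the greedy strategy returns 北京大学 / 生, although 北京 / 大学生 is an equally valid split of the same characters. A dictionary lookup alone cannot decide between the two.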
In some writing systems, however, such as the [[Ge'ez script]] used for [[Amharic]] and [[Tigrinya language|Tigrinya]] among other languages, words are explicitly delimited (at least historically) with a non-[[Space (punctuation)|whitespace]] character.
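When such an explicit delimiter is present, segmentation reduces to splitting on it. The sketch below assumes the delimiter is the Unicode character U+1361 (ETHIOPIC WORDSPACE) and also accepts ordinary whitespace, since modern text may use either. The helper name <code>split_words</code> and the sample string are hypothetical.

```python
import unicodedata

# U+1361 is named ETHIOPIC WORDSPACE in the Unicode character database.
ETHIOPIC_WORDSPACE = "\u1361"


def split_words(text):
    """Split on the Ethiopic wordspace and on ordinary whitespace."""
    parts = []
    for chunk in text.split():
        # Drop empty pieces left by leading, trailing, or doubled delimiters.
        parts.extend(p for p in chunk.split(ETHIOPIC_WORDSPACE) if p)
    return parts


# Illustrative two-word sample joined by the wordspace character.
sample = "ሰላም" + ETHIOPIC_WORDSPACE + "ዓለም"

print(unicodedata.name(ETHIOPIC_WORDSPACE))
print(split_words(sample))
```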
The [[Unicode Consortium]] has published a [http://www.unicode.org/reports/tr29/ Standard Annex] on Text Segmentation, exploring the issues of segmentation in multiscript texts.