comma |
reference needed |
Line 12:
In [[English language|English]] and many other languages using some form of the [[Latin alphabet]], the [[Space (punctuation)|space]] is a good approximation of a [[word divider]] (word [[delimiter]]). (Cases where the space character alone may not be sufficient include contractions such as ''can't'' for ''cannot''.)
However, the equivalent of this character is not found in all written scripts, and without it word segmentation is a difficult problem. Languages without a trivial word segmentation process include [[Chinese language|Chinese]]<ref>{{cite conference |author1=Aaron L.-F. Han |author2=Derek F. Wong |author3=Lidia S. Chao |author4=Liangye He |author5=Ling Zhu |author6=Shuo Li |year=2013 |title=A Study of Chinese Word Segmentation Based on the Characteristics of Chinese |conference=Language Processing and Knowledge in the Web - LNCS Volume 8105 |pages=111-118 |url=http://link.springer.com/chapter/10.1007/978-3-642-40722-2_12}}</ref> and [[Japanese language|Japanese]], where [[sentences]] but not words are delimited; [[Thai language|Thai]] and [[Lao language|Lao]], where phrases and sentences but not words are delimited; and [[Vietnamese language|Vietnamese]], where syllables but not words are delimited.
In some writing systems, however, such as the [[Ge'ez script]] used for [[Amharic]] and [[Tigrinya]] among other languages, words are explicitly delimited (at least historically) with a non-[[Space (punctuation)|whitespace]] character.
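The contrast between these cases can be illustrated with a minimal Python sketch. It is not a real segmenter: the function names and sample strings are hypothetical and chosen only for illustration, and the Ge'ez example assumes the divider is the Unicode character U+1361 (ETHIOPIC WORDSPACE).

```python
# Illustrative sketch only: naive splitting on an explicit word divider.
# Function names and sample strings are assumptions made for this example.

ETHIOPIC_WORDSPACE = "\u1361"  # the Ethiopic word divider character


def split_on_whitespace(text):
    """Treat runs of whitespace as word dividers."""
    return text.split()


def split_on_divider(text, divider=ETHIOPIC_WORDSPACE):
    """Treat an explicit non-whitespace character as the word divider."""
    parts = (piece.strip() for piece in text.split(divider))
    return [piece for piece in parts if piece]


# English: the space approximates the word divider, but the
# contraction "can't" is kept as a single token.
english_tokens = split_on_whitespace("I can't find the divider")

# Chinese: there are no spaces between words, so the whole
# sentence comes back as one unsegmented token.
chinese_tokens = split_on_whitespace("我喜欢自然语言处理")

# Ge'ez script: words separated by the Ethiopic wordspace can be
# recovered by splitting on that character instead of whitespace.
geez_tokens = split_on_divider("ሰላም፡ለዓለም")

print(english_tokens)
print(chinese_tokens)
print(geez_tokens)
```

Splitting on a fixed divider is only adequate when the script marks word boundaries explicitly. For scripts without such a marker, segmentation generally needs a dictionary or a statistical model, which this sketch does not attempt.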