Revision as of 03:34, 8 September 2015 by Jarble (editing a link)
Revision as of 18:26, 2 October 2015 by Renamed user 23o2iqy4ewqoiudh (minor edit, →Word segmentation)
Line 7:
== Segmentation problems ==
=== Word segmentation ===
~~:''See also: [[Word#Word boundaries|Word > Word boundary]]~~
{{See also|Word#Word boundaries}}
Word segmentation is the problem of dividing a string of written language into its component [[word]]s.

In [[English language|English]] and many other languages using some form of the [[Latin alphabet]], the [[Space (punctuation)|space]] is a good approximation of a [[word divider]] (word [[delimiter]]). (Some examples where the space character alone may not be sufficient include contractions like ''can't'' for ''cannot''.)

However, the equivalent to this character is not found in all written scripts, and without it word segmentation is a difficult problem. Languages which do not have a trivial word segmentation process include [[Chinese language|Chinese]] and [[Japanese language|Japanese]], where [[sentences]] but not words are delimited; [[Thai language|Thai]] and [[Lao language|Lao]], where phrases and sentences but not words are delimited; and [[Vietnamese language|Vietnamese]], where syllables but not words are delimited.

Line 18:
The [[Unicode Consortium]] has published a [http://www.unicode.org/reports/tr29/ Standard Annex] on Text Segmentation, exploring the issues of segmentation in multiscript texts.

'''Word splitting''' is the process of [[parsing]] [[concatenated]] text (i.e. text that contains no spaces or other word separators) to infer where word breaks exist.

Word splitting may also refer to the process of [[Syllabification|hyphenation]].
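
As an illustration of word splitting on concatenated text, the following is a minimal sketch of one possible approach: dictionary lookup combined with dynamic programming, preferring the segmentation with the fewest words. It is not a method described in the article itself. The lexicon <code>LEXICON</code> and the function name <code>split_words</code> are hypothetical, chosen only for this example; practical systems typically rely on much larger dictionaries or statistical models.

```python
# Hypothetical toy lexicon, assumed only for this illustration.
LEXICON = {"the", "there", "rain", "in", "spain", "pain", "a", "is", "here"}


def split_words(text, lexicon=LEXICON):
    """Split unspaced text into lexicon words, preferring the fewest words.

    Returns a list of words, or None if the text cannot be covered
    entirely by words from the lexicon.
    """
    n = len(text)
    max_len = max(map(len, lexicon), default=0)

    # best[i] is the shortest word list covering text[:i], or None if
    # no segmentation of that prefix exists.
    best = [None] * (n + 1)
    best[0] = []

    for end in range(1, n + 1):
        # Only consider candidate words no longer than the longest entry.
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if best[start] is not None and word in lexicon:
                candidate = best[start] + [word]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate

    return best[n]


if __name__ == "__main__":
    print(split_words("theraininspain"))
    print(split_words("xyz"))
```

With this toy lexicon, <code>split_words("theraininspain")</code> recovers the words ''the'', ''rain'', ''in'', ''spain'', while an input that the lexicon cannot cover yields <code>None</code>. Preferring the fewest words is only one possible heuristic for choosing between competing segmentations.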

Text segmentation: Difference between revisions