Revision as of 00:38, 23 April 2009 by David Eppstein (talk | contribs): segmentation also applies to words; say so in lede
Revision as of 22:28, 16 May 2009 by Babbage (talk | contribs): →Word segmentation: explain languages with explicit delimiter, clean up & clarify a bit
Line 6:
=== Word segmentation ===
Word segmentation is the problem of dividing a string of written language into its component [[word]]s.

In [[English language|English]] and many other languages using some form of the [[Latin alphabet]], the [[Space (punctuation)|space]] is a good approximation of a word delimiter. (Some examples where the space character alone may not be sufficient include contractions like ''can't'' for ''cannot''.)

However, the equivalent of this character is not found in all written scripts, and without it word segmentation is a difficult problem. Languages which do not have a trivial word segmentation process include [[Chinese language|Chinese]] and [[Japanese language|Japanese]], where [[sentences]] but not words are delimited, and [[Thai language|Thai]] and [[Lao language|Lao]], where phrases and sentences but not words are delimited.

In some writing systems, however, such as the [[Ge'ez script]] used for [[Amharic]] and [[Tigrinya]] among other languages, words are explicitly delimited (at least historically) with a non-[[whitespace]] character.

'''Word splitting''' is the process of [[parsing]] [[concatenated]] text (i.e. text that contains no spaces or other word separators) to infer where word breaks exist.

Word splitting may also refer to the process of [[hyphenation]].

=== Sentence segmentation ===

Text segmentation: Difference between revisions