Revision as of 22:10, 17 July 2017 by 76.80.114.170 (→Intent segmentation)
Revision as of 22:51, 25 July 2017 by Quercus solaris (→Word segmentation: Bad example ("won't" is a single word synonymous with "will not", not an orthographic representation of "will not"); replaced with good example (tool box/toolbox, ice box/icebox).)
Line 11:
Word segmentation is the problem of dividing a string of written language into its component words. In English and many other languages using some form of the [[Latin alphabet]], the [[Space (punctuation)|space]] is a good approximation of a [[word divider]] (word [[delimiter]]), although this concept has limits because of the variability with which languages [[emic and etic|emically]] regard [[collocation]]s and [[compound (linguistics)|compounds]]. Many [[English compound#Compound nouns|English compound nouns]] are variably written (for example, ''[[icebox|ice box = ice-box = icebox]]''; ''[[sty|pig sty = pig-sty = pigsty]]''), with a corresponding variation in whether speakers think of them as [[noun phrase]]s or single nouns; there are trends in how norms are set, such as that open compounds often tend eventually to solidify by widespread convention, but variation remains systemic. In contrast, [[German nouns#Compounds|German compound nouns]] show less orthographic variation, with solidification being a stronger norm. However, the equivalent of the word space character is not found in all written scripts, and without it word segmentation is a difficult problem. Languages that do not have a trivial word segmentation process include Chinese and Japanese, where [[sentences]] but not words are delimited; [[Thai language|Thai]] and [[Lao language|Lao]], where phrases and sentences but not words are delimited; and [[Vietnamese language|Vietnamese]], where syllables but not words are delimited. In some writing systems, however, such as the [[Ge'ez script]] used for [[Amharic]] and [[Tigrinya language|Tigrinya]] among other languages, words are explicitly delimited (at least historically) with a non-whitespace character.
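
The two situations described above can be illustrated with a minimal Python sketch. The first function treats the space as the word divider, which shows how the open and solid spellings of one compound produce different word counts. The second handles text with no word divider using greedy longest-match ("maximum matching") against a word list; this is one simple baseline chosen here for illustration, not a method the article prescribes. The function names and the toy lexicon are hypothetical.

```python
def split_on_spaces(text):
    """Approximate word segmentation by treating whitespace as the divider."""
    return text.split()


def max_match(text, lexicon):
    """Greedy longest-match segmentation of text that has no word divider.

    At each position, take the longest lexicon entry that starts there;
    fall back to a single character when no entry matches.
    """
    # Longest entry bounds the window; never let the window drop below 1.
    longest = max([len(word) for word in lexicon] + [1])
    words = []
    i = 0
    while i < len(text):
        for size in range(min(longest, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


# The same compound yields a different word count depending on spelling.
print(split_on_spaces("an ice box"))  # ['an', 'ice', 'box']
print(split_on_spaces("an icebox"))   # ['an', 'icebox']

# Hypothetical toy lexicon, for illustration only.
TOY_LEXICON = {"the", "pig", "sty", "pigsty", "is", "clean"}
print(max_match("thepigstyisclean", TOY_LEXICON))
# ['the', 'pigsty', 'is', 'clean']
```

Because the matcher always prefers the longest entry, it returns "pigsty" rather than "pig" followed by "sty"; the result depends entirely on which spellings the lexicon treats as single words, which mirrors the variation in compound spelling noted above.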

Text segmentation: Difference between revisions