Revision as of 20:36, 16 October 2016 by Metasyn (m; →Topic segmentation: adding the link to topic modeling)
Revision as of 13:37, 22 May 2017 by Silas S. Brown (→Word segmentation: add reference for the Unicode Consortium annex)
Line 17: In some writing systems, however, such as the [[Ge'ez script]] used for [[Amharic]] and [[Tigrinya language|Tigrinya]] among other languages, words are explicitly delimited (at least historically) with a non-whitespace character. The [[Unicode Consortium]] has published a ''Standard Annex on Text Segmentation'',<ref>[http://unicode.org/reports/tr29/ UAX #29]</ref> exploring the issues of segmentation in multiscript texts. '''Word splitting''' is the process of [[parsing]] [[concatenated]] text (i.e. text that contains no spaces or other word separators) to infer where word breaks exist.
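
As an illustration only, word splitting can be sketched as a dictionary lookup combined with dynamic programming. The function name <code>split_words</code> and the toy vocabulary below are hypothetical and are not taken from UAX #29 or any other cited source; the sketch assumes a known word list and prefers the segmentation with the fewest words.

```python
from typing import List, Optional, Set


def split_words(text: str, vocabulary: Set[str]) -> Optional[List[str]]:
    """Split concatenated text into vocabulary words.

    Returns a segmentation with the fewest words, or None when the text
    cannot be fully covered by words from the vocabulary.
    (Hypothetical helper for illustration, not a standard algorithm name.)
    """
    n = len(text)
    # No candidate piece needs to be longer than the longest known word.
    max_len = max((len(word) for word in vocabulary), default=0)

    # best[i] is the shortest word list covering text[:i], or None if
    # no segmentation of that prefix exists.
    best: List[Optional[List[str]]] = [None] * (n + 1)
    best[0] = []

    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            prefix = best[start]
            piece = text[start:end]
            if prefix is None or piece not in vocabulary:
                continue
            candidate = prefix + [piece]
            if best[end] is None or len(candidate) < len(best[end]):
                best[end] = candidate

    return best[n]


# Toy vocabulary, assumed purely for the example.
VOCABULARY = {"word", "words", "split", "splitting", "is", "fun"}
print(split_words("wordsplittingisfun", VOCABULARY))
```

With this toy vocabulary the only complete segmentation is <code>word</code>, <code>splitting</code>, <code>is</code>, <code>fun</code>. A plain word list cannot resolve genuinely ambiguous strings, so this sketch only breaks ties by word count.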

Text segmentation: Difference between revisions