m →Topic segmentation: adding the link to topic modeling
→Word segmentation: add reference for the Unicode Consortium annex
Line 17:
In some writing systems, however, such as the [[Ge'ez script]] used for [[Amharic]] and [[Tigrinya language|Tigrinya]], among other languages, words are explicitly delimited (at least historically) with a non-whitespace character.
The [[Unicode Consortium]] has published a ''Standard Annex on Text Segmentation'',<ref>[http://unicode.org/reports/tr29/ UAX #29]</ref> exploring the issues of segmentation in multiscript texts.
'''Word splitting''' is the process of [[parsing]] [[concatenated]] text (i.e. text that contains no spaces or other word separators) to infer where word breaks exist.
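One simple way to infer word breaks is to search for a sequence of known words that exactly covers the concatenated text. The following is a minimal sketch of that idea, not a standard algorithm or library API: the function name <code>split_words</code> and the toy vocabulary are hypothetical, and the longest-match-first preference is an assumed heuristic. Practical systems usually also rank candidate splits with word frequencies or a language model.

```python
from functools import lru_cache


def split_words(text, vocabulary):
    """Return one split of `text` into words from `vocabulary`, or None.

    Hypothetical helper for illustration: tries longer words first and
    backtracks when a choice leaves a remainder that cannot be split.
    """
    max_len = max(map(len, vocabulary), default=0)

    @lru_cache(maxsize=None)
    def solve(start):
        # Reaching the end of the text means the split is complete.
        if start == len(text):
            return ()
        # Try candidate words starting at `start`, longest first.
        for end in range(min(len(text), start + max_len), start, -1):
            word = text[start:end]
            if word in vocabulary:
                rest = solve(end)
                if rest is not None:
                    return (word,) + rest
        # No vocabulary word leads to a complete split from here.
        return None

    result = solve(0)
    return None if result is None else list(result)


# Toy vocabulary (an assumption for this example only).
vocab = {"the", "cat", "cats", "sat", "at"}
print(split_words("thecatsat", vocab))  # ['the', 'cats', 'at']
```

The input <code>thecatsat</code> could equally be read as "the cat sat"; the sketch returns "the cats at" only because it prefers longer words, which illustrates why word splitting is ambiguous without additional context.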