Text segmentation: Difference between revisions

Jausel (talk | contribs)
'''Written text segmentation''' is the process of dividing written text into [[word]]s or other similar meaningful units. The term applies to [[human mind|mental]] processes used by humans when reading text, and to artificial processes implemented in [[computers]], which are the subject of [[natural language processing]].
 
The problem may appear relatively trivial for written languages that have explicit word boundary markers, such as the word spaces of written [[English language|English]] or the distinctive initial, medial and final letter shapes of [[Arabic language|Arabic]]. When such clues are not consistently available, the task often requires fairly sophisticated techniques, such as statistical decision-making and large dictionaries, together with consideration of syntactic and semantic constraints.
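
As a rough illustration of the dictionary-based approach mentioned above, the following Python sketch applies greedy longest-match segmentation to text written without word spaces. The function name <code>max_match</code> and the tiny vocabulary are hypothetical, chosen only for this example; practical systems typically use much larger dictionaries and statistical scoring.

```python
def max_match(text, vocabulary):
    """Greedy longest-match segmentation (illustrative sketch only).

    Characters that start no dictionary word are emitted as
    single-character tokens.
    """
    # Longest dictionary entry bounds how far ahead we need to look.
    max_len = max((len(word) for word in vocabulary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a dictionary hit.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in vocabulary:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No dictionary word starts here: fall back to one character.
            tokens.append(text[i])
            i += 1
    return tokens


# Hypothetical toy vocabulary, for demonstration only.
vocabulary = {"the", "cat", "cats", "sat", "at"}
print(max_match("thecatsat", vocabulary))  # ['the', 'cats', 'at']
```

The greedy strategy prefers "cats" followed by "at" over "cat" followed by "sat", even though both readings consist entirely of dictionary words. Ambiguities of this kind are one reason dictionary lookup alone is often insufficient.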
 
''[[Natural Language Processing]] (NLP) text segmentation techniques'' involve determining the boundaries between words and sentences. This process is not as simple as finding periods (a period may appear, for example, in a dollar amount) or semicolons (a semicolon may appear, for example, in an XML entity reference).
 
When processing plain text, tables of abbreviations that contain periods (''Mr.'', for example) can help prevent incorrect assignment of sentence boundaries. Some text segmentation systems take advantage of markup such as HTML and of known document formats such as PDF to provide additional evidence for sentence and paragraph boundaries.
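
A minimal Python sketch of the abbreviation-table idea follows. The abbreviation set and the function name <code>split_sentences</code> are assumptions made for this example, not part of any particular system. A period is treated as a candidate boundary only when it is followed by whitespace or the end of the text, so the period inside a dollar amount is never considered.

```python
import re

# Illustrative abbreviation table; a real one would be far larger.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "e.g.", "i.e."}


def split_sentences(text, abbreviations=ABBREVIATIONS):
    """Split text at ., ! or ? unless the mark ends a known abbreviation."""
    sentences = []
    start = 0
    # Candidate boundary: punctuation followed by whitespace or end of text.
    for match in re.finditer(r"[.!?](?=\s|$)", text):
        end = match.end()
        # The whitespace-delimited token that ends at this punctuation mark.
        last_token = text[start:end].split()[-1].lower()
        if last_token in abbreviations:
            continue  # e.g. "Mr." does not end a sentence
        sentences.append(text[start:end].strip())
        start = end
    # Keep any trailing text that lacks closing punctuation.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


print(split_sentences("Mr. Smith paid $3.50 for coffee. He left."))
# ['Mr. Smith paid $3.50 for coffee.', 'He left.']
```

This sketch ignores markup and document structure entirely; a system with access to HTML or PDF layout could use paragraph and block boundaries as further evidence.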
 
==See also==