Text segmentation: Difference between revisions

 
When processing plain text, tables of abbreviations that contain periods (for example, "Mr.") can help prevent sentence boundaries from being assigned incorrectly. Some text segmentation systems also take advantage of markup such as HTML and of known document formats such as PDF, which provide additional evidence for sentence and paragraph boundaries.
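
The abbreviation-table idea can be sketched in a few lines of Python. This is a minimal illustration, not a production splitter: the `ABBREVIATIONS` set and the `split_sentences` function are hypothetical names introduced here, and the table is deliberately tiny. It treats a period, exclamation mark, or question mark followed by whitespace as a candidate boundary, and rejects the candidate when the token ending there appears in the table.

```python
import re

# Hypothetical abbreviation table (an assumption for illustration);
# a real table would be compiled for the target domain.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "prof.", "e.g.", "i.e.", "etc."}

def split_sentences(text):
    """Split text at '.', '!' or '?' followed by whitespace,
    unless the token ending at that point is a known abbreviation."""
    sentences = []
    start = 0
    for match in re.finditer(r"[.!?]\s+", text):
        end = match.start() + 1  # include the punctuation mark itself
        tokens = text[start:end].split()
        last_token = tokens[-1].lower() if tokens else ""
        if last_token in ABBREVIATIONS:
            continue  # the period belongs to an abbreviation, not a boundary
        sentences.append(text[start:end].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

print(split_sentences("Mr. Smith arrived. He sat down."))
```

A table lookup of this kind cannot tell when an abbreviation also ends a sentence (as in "... and so on, etc. The next sentence"), which is one reason practical systems combine it with other evidence.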
 
Effective [[Natural Language Processing]] systems and text segmentation tools usually operate on text from specific domains and sources. For example, processing the text of medical records is a very different problem from processing news articles or real estate advertisements.
 
Writing a text segmentation tool starts with collecting a large corpus of text from the application domain. There are two general approaches:
* Analyzing the text manually and writing custom software
* Annotating the sample corpus with boundary information and using [[Machine Learning]]
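
The second approach can be illustrated with a deliberately simple sketch. It assumes the annotated corpus is available as a list of documents, each already divided into sentences, and it substitutes plain frequency counting for a real machine learning model: for every token ending in a period, it counts how often that token ends a sentence and how often it occurs inside one. The function names `train`, `is_boundary`, and `segment` and the sample corpus are hypothetical; a practical system would use richer features and a trained classifier.

```python
from collections import Counter

def train(annotated_docs):
    """annotated_docs: list of documents, each a list of sentence strings.
    Returns counts of period-final tokens seen at sentence ends and inside sentences."""
    final, internal = Counter(), Counter()
    for sentences in annotated_docs:
        for sentence in sentences:
            tokens = sentence.split()
            for i, tok in enumerate(tokens):
                if not tok.endswith("."):
                    continue
                key = tok.lower()
                if i == len(tokens) - 1:
                    final[key] += 1      # token closes an annotated sentence
                else:
                    internal[key] += 1   # period occurs mid-sentence
    return final, internal

def is_boundary(token, model):
    """Treat the token as a boundary unless it was seen mostly mid-sentence.
    Unseen tokens default to being boundaries."""
    final, internal = model
    key = token.lower()
    return final[key] >= internal[key]

def segment(text, model):
    """Split whitespace-tokenized text at tokens judged to be boundaries."""
    sentences, current = [], []
    for tok in text.split():
        current.append(tok)
        if tok.endswith(".") and is_boundary(tok, model):
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

# Tiny hypothetical annotated corpus from a medical domain.
corpus = [
    ["Dr. Lee reviewed the chart.", "The patient was stable."],
    ["Dr. Ng agreed."],
]
model = train(corpus)
print(segment("Dr. Kim left. The ward was quiet.", model))
```

Because the counts come from the annotated corpus, the abbreviations the model learns reflect the domain the corpus was drawn from; no abbreviation table has to be written by hand.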
 
==See also==
* [[Natural Language Processing]]
* [[Artificial Intelligence]]
* [[Speech segmentation]]
* [[Hyphenation]]