'''Text segmentation''' is the process of dividing written text into meaningful units, such as [[sentence]]s or [[topic]]s. The term applies to [[human mind|mental]] processes used by humans when reading text, and to artificial processes implemented in [[computers]], which are the subject of [[natural language processing]]. The problem may appear relatively trivial for written languages that have explicit word boundary markers, such as the word spaces of written [[English language|English]] or the distinctive initial, medial and final letter shapes of [[Arabic language|Arabic]], but these signals are not present in all written languages.
 
Compare [[speech segmentation]], the process of dividing speech into linguistically meaningful portions.
 
== Automatic segmentation ==
Automatic segmentation is the problem in [[natural language processing]] of implementing a computer process to segment text. It involves determining the boundaries between words and sentences, which is not as simple as finding periods and other punctuation: a period may appear, for example, in a dollar amount, and a semicolon may appear in an XML entity.
 
When processing plain text, tables of abbreviations that contain periods (''Mr.'', for example) can help prevent incorrect assignment of sentence boundaries. Some text segmentation systems take advantage of markup such as HTML, and of known document formats such as PDF, to provide additional evidence for sentence and paragraph boundaries.
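The use of markup as boundary evidence can be illustrated with a minimal sketch: in HTML, <code>&lt;p&gt;</code> tags mark paragraph boundaries directly, before any statistical method is needed. The class name and sample input below are invented for illustration.

```python
from html.parser import HTMLParser

# Sketch: use <p> tags in HTML markup as direct evidence of
# paragraph boundaries. Class name and input are illustrative.
class ParagraphExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_p = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p, self._buf = True, []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            self.paragraphs.append("".join(self._buf).strip())
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._buf.append(data)

parser = ParagraphExtractor()
parser.feed("<p>First paragraph.</p><p>Second one.</p>")
print(parser.paragraphs)  # → ['First paragraph.', 'Second one.']
```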
 
=== Word segmentation ===
 
Word segmentation is the problem of dividing a string of written language into its component [[word]]s. In English and many other languages that use some form of the [[Latin alphabet]], splitting the text at each [[Space (punctuation)|space character]] is a good approximation to word segmentation. (Some examples where the space character alone is not sufficient include contractions such as ''can't'' for ''cannot''.) However, an equivalent character is not found in all written scripts, and without it word segmentation is a difficult problem. Languages that do not have a trivial word segmentation process include [[Thai]].
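For scripts without spaces, one classic dictionary-based heuristic is greedy "maximum matching": at each position, take the longest word in the dictionary. The toy lexicon and input below are invented for illustration; real systems use large dictionaries and statistical models.

```python
# Sketch of dictionary-based word segmentation by greedy maximum
# matching. The lexicon and sample string are hypothetical.
LEXICON = {"theme", "the", "them", "me", "park", "a"}

def max_match(text, lexicon, max_len=10):
    """Greedily segment `text` into the longest dictionary words."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking toward one character.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon:
                words.append(text[i:j])
                i = j
                break
        else:
            # No dictionary match: emit a single character and move on.
            words.append(text[i])
            i += 1
    return words

print(max_match("themepark", LEXICON))  # → ['theme', 'park']
```

The greedy strategy can fail when the longest match is not the right one, which is one reason statistical methods are preferred in practice.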
 
=== Sentence segmentation ===
 
Sentence segmentation is the problem of dividing a string of written language into its component [[sentences]]. In English and some other languages, using punctuation, particularly the [[full stop]] character, is a reasonable approximation. However, even in English this problem is not trivial, because the full stop is also used for abbreviations, which may or may not terminate a sentence. For example, ''Mr.'' is not its own sentence in "''Mr. Smith went to the shops in Jones Street.''" When processing plain text, tables of abbreviations that contain periods can help prevent incorrect assignment of sentence boundaries.
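The abbreviation-table idea can be sketched as follows, assuming a small hand-made list; real systems use much larger tables and additional evidence such as capitalization of the following word.

```python
# Sketch: sentence segmentation with an abbreviation table.
# The abbreviation list is a tiny illustrative sample.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "e.g.", "i.e."}

def split_sentences(text):
    """Split on sentence-final punctuation, skipping known abbreviations."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token.endswith((".", "!", "?")) and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

print(split_sentences("Mr. Smith went to the shops in Jones Street. He bought bread."))
# → ['Mr. Smith went to the shops in Jones Street.', 'He bought bread.']
```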
 
As with word segmentation, not all written languages contain punctuation characters which are useful for approximating sentence boundaries.
 
=== Other segmentation problems ===
 
Processes may be required to segment text into units other than words, including [[morpheme]]s (a task usually called [[morphological analysis]]), [[paragraph]]s, [[topic]]s or [[discourse]] turns.
 
A document may contain multiple topics, and the task of computerized text segmentation may be to discover these topics automatically and segment the text accordingly.
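A toy sketch of automatic topic segmentation, in the spirit of lexical-cohesion methods such as TextTiling, is to place a boundary between adjacent text blocks whose vocabularies barely overlap. The threshold and sample blocks below are invented for illustration.

```python
# Sketch: topic segmentation by lexical cohesion. A boundary is placed
# where adjacent blocks share little vocabulary. Threshold is illustrative.
def overlap(a, b):
    """Jaccard similarity between the word sets of two text blocks."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

def topic_boundaries(blocks, threshold=0.1):
    """Indices i such that a topic boundary falls between blocks i and i+1."""
    return [i for i in range(len(blocks) - 1)
            if overlap(blocks[i], blocks[i + 1]) < threshold]

blocks = ["cats purr and cats sleep",
          "cats chase mice",
          "stocks fell sharply today"]
print(topic_boundaries(blocks))  # → [1]: topic shifts after the second block
```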
 
=== Approaches ===
When punctuation and similar clues are not consistently available, the segmentation task often requires fairly non-trivial techniques, such as statistical decision-making, large dictionaries, and consideration of syntactic and semantic constraints. Effective [[natural language processing]] systems and text segmentation tools usually operate on text in specific domains and sources. For example, processing text used in medical records is a very different problem from processing news articles or real estate advertisements.
 
The process of developing text segmentation tools starts with collecting a large corpus of text in an application ___domain. There are two general approaches:
* Manual analysis of the text and writing custom software
* Annotating a sample corpus with boundary information and applying [[machine learning]]
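In the annotated-corpus approach, each candidate boundary (every period, say) is described by features, and a classifier is trained on human-supplied boundary labels. A minimal sketch of the feature-extraction step, with invented feature names and sample text:

```python
# Sketch: feature extraction for machine-learned sentence boundary
# detection. A classifier (not shown) would be trained on these features
# against annotated labels. Feature choice is illustrative.
def boundary_features(text, i):
    """Features for the candidate boundary at position i (a '.') in text."""
    before = text[:i].split()
    prev_word = before[-1] if before else ""
    nxt = text[i + 1:].lstrip()
    return {
        "prev_is_short": len(prev_word) <= 2,  # abbreviations are often short
        "next_is_upper": nxt[:1].isupper(),    # sentences start capitalized
        "prev_is_digit": prev_word.isdigit(),  # e.g. the '.' in "3.14"
    }

text = "Mr. Smith arrived. He sat down."
candidates = [i for i, c in enumerate(text) if c == "."]
for i in candidates:
    print(i, boundary_features(text, i))
```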
 
 
== See also ==