{{Short description|Human writing practice}}
{{Use dmy dates|date=March 2016}}
{{Refimprove|date=October 2011}}
 
'''Text segmentation''' is the process of dividing written text into meaningful units, such as words, [[sentence (linguistics)|sentence]]s, or [[topic (linguistics)|topic]]s. The term applies both to [[mental process]]es used by humans when reading text, and to artificial processes implemented in computers, which are the subject of [[natural language processing]]. The problem is non-trivial, because while some written languages have explicit word boundary markers, such as the word spaces of written English and the distinctive initial, medial and final letter shapes of [[Arabic language|Arabic]], such signals are sometimes ambiguous and not present in all written languages.
 
Compare [[speech segmentation]], the process of dividing speech into linguistically meaningful portions.
 
== Segmentation problems ==
=== Word segmentation ===
{{See also|Word#Word boundaries}}
Word segmentation is the problem of dividing a string of written language into its component words.
 
In English and many other languages using some form of the [[Latin alphabet]], the [[Space (punctuation)|space]] is a good approximation of a [[word divider]] (word [[delimiter]]), although this concept has limits because of the variability with which languages [[emic and etic|emically]] regard [[collocation]]s and [[compound (linguistics)|compounds]]. Many [[English compound#Compound nouns|English compound nouns]] are variably written (for example, ''[[icebox|ice box = ice-box = icebox]]''; ''[[sty|pig sty = pig-sty = pigsty]]'') with a corresponding variation in whether speakers think of them as [[noun phrase]]s or single nouns; there are trends in how norms are set, such as that open compounds often tend eventually to solidify by widespread convention, but variation remains systemic. In contrast, [[German nouns#Compounds|German compound nouns]] show less orthographic variation, with solidification being a stronger norm.
 
However, the equivalent to the word space character is not found in all written scripts, and without it word segmentation is a difficult problem. Languages which do not have a trivial word segmentation process include Chinese and Japanese, where [[sentences]] but not words are delimited; [[Thai language|Thai]] and [[Lao language|Lao]], where phrases and sentences but not words are delimited; and [[Vietnamese language|Vietnamese]], where syllables but not words are delimited.
 
In some writing systems, however, such as the [[Ge'ez script]] used for [[Amharic]] and [[Tigrinya language|Tigrinya]] among other languages, words are explicitly delimited (at least historically) with a non-whitespace character.
 
The [[Unicode Consortium]] has published a ''Standard Annex on Text Segmentation'',<ref>[http://unicode.org/reports/tr29/ UAX #29]</ref> exploring the issues of segmentation in multiscript texts.
 
'''Word splitting''' is the process of [[parsing]] [[concatenated]] text (i.e. text that contains no spaces or other word separators) to infer where word breaks exist.
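Where no explicit word divider is available, a classic baseline for word splitting is greedy ''maximum matching'' against a dictionary: at each position, take the longest dictionary entry that matches, falling back to a single character when nothing matches. The sketch below is a minimal illustration in Python; the toy dictionary and example string are assumptions made for the example, not part of any standard resource.

<syntaxhighlight lang="python">
def max_match(text, dictionary, max_word_len=4):
    """Greedy maximum-matching word splitting.

    At each position, take the longest dictionary entry that matches;
    fall back to a single character if nothing in the dictionary does.
    """
    words = []
    i = 0
    while i < len(text):
        for j in range(min(len(text), i + max_word_len), i, -1):
            if text[i:j] in dictionary or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words

# Toy dictionary (illustrative only).
print(max_match("themendinehere", {"the", "them", "men", "dine", "here"}))
# ['them', 'e', 'n', 'dine', 'here'] -- greedy matching commits to
# "them" and mis-segments what should be "the men dine here".
</syntaxhighlight>

The failure in the example is the point: dictionary-based greedy matching is simple but brittle, which is why practical segmenters for undelimited scripts usually combine dictionaries with statistical or machine-learned models.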
 
Word splitting may also refer to the process of [[Syllabification|hyphenation]].
 
Some scholars have suggested that modern Chinese should be written with word segmentation, i.e. with spaces between words as in written English.<ref>{{cite journal |last=Zhang |first=Xiao-heng |date=1998 |script-title=zh:也谈汉语书面语的分词问题——分词连写十大好处 |trans-title=Written Chinese Word-Segmentation Revisited: Ten advantages of word-segmented writing |url=http://jcip.cipsc.org.cn/CN/Y1998/V12/I3/58 |language=zh-Hans |script-journal=zh:中文信息学报 |trans-journal=[[Journal of Chinese Information Processing]] |volume=12 |issue=3 |pages=58–64 |access-date=2025-03-31}}</ref> One motivation is that there are ambiguous texts whose intended segmentation only the author knows. For example, "美国会不同意。" may mean "美国 会 不同意。" (The US will not agree.) or "美 国会 不同意。" (The US Congress does not agree.). For more details, see [[Chinese word-segmented writing]].
 
=== Intent segmentation ===
Intent segmentation is the problem of dividing written text into keyphrases (groups of two or more words). The core intent or desire expressed in the text is identified and becomes the cornerstone of the segmentation; a core product or service, idea, action or thought anchors each keyphrase, as in the following example:
 
"[All things are made of '''atoms''']. [Little '''particles''' that move] [around in perpetual '''motion'''], [attracting each '''other'''] [when they are a little '''distance''' apart], [but '''repelling'''] [upon being '''squeezed'''] [into '''one another''']."
 
=== Sentence segmentation ===
{{See also|Sentence boundary disambiguation}}
Sentence segmentation is the problem of dividing a string of written language into its component [[Sentence (linguistics)|sentences]]. In English and some other languages, using punctuation, particularly the [[full stop]]/period character, is a reasonable approximation. However, even in English this problem is not trivial due to the use of the full stop character for abbreviations, which may or may not also terminate a sentence. For example, ''Mr.'' is not its own sentence in "''Mr. Smith went to the shops in Jones Street.''" When processing plain text, tables of abbreviations that contain periods can help prevent incorrect assignment of sentence boundaries.
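As a minimal illustration of the abbreviation-table approach just described, the following Python sketch splits on sentence-final punctuation unless the token carrying it appears in a deliberately tiny, illustrative abbreviation list:

<syntaxhighlight lang="python">
# Illustrative abbreviation table; real systems use much larger lists.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "prof.", "e.g.", "i.e."}

def split_sentences(text):
    """Split text after tokens ending in '.', '!' or '?' unless the
    token is a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if (token.endswith((".", "!", "?"))
                and token.lower() not in ABBREVIATIONS):
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing material with no terminator
        sentences.append(" ".join(current))
    return sentences

print(split_sentences("Mr. Smith went to the shops in Jones Street. He bought bread."))
# ['Mr. Smith went to the shops in Jones Street.', 'He bought bread.']
</syntaxhighlight>

Note that the table only suppresses boundaries: an abbreviation that genuinely ends a sentence (for example, a sentence ending in ''St.'') would still be mis-handled, which is why practical systems add further evidence such as the capitalization of the following token.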
 
As with word segmentation, not all written languages contain punctuation characters that are useful for approximating sentence boundaries.
 
=== Topic segmentation ===
{{See also|Document classification}}
Topic analysis consists of two main tasks: topic identification and text segmentation. While the first is a simple [[machine learning|classification]] of a specific text, the second implies that a document may contain multiple topics, and the task of computerized text segmentation is then to discover these topics automatically and segment the text accordingly. The topic boundaries may be apparent from section titles and paragraphs; in other cases, one needs to use techniques similar to those used in [[document classification]].
 
Segmenting the text into [[topic (linguistics)|topic]]s or [[discourse]] turns might be useful in several natural language processing tasks: it can improve [[information retrieval]] or [[speech recognition]] significantly (by indexing and recognizing documents more precisely, or by returning the specific part of a document corresponding to the query). It is also needed in [[topic detection]] and tracking systems and [[text summarization]] problems.
 
Many different approaches have been tried:<ref>{{cite conference
 | last = Choi | first = Freddy Y. Y.
 | year = 2000
 | title = Advances in ___domain independent linear text segmentation
 | book-title = Proceedings of the 1st Meeting of the North American Chapter of the Association for Computational Linguistics (ANLP-NAACL-00)
 | pages = 26–33
 | arxiv = cs/0003083
 | url = https://aclanthology.org/A00-2004/
 | access-date = 2025-03-31
}}</ref><ref>{{cite thesis
 | last = Reynar | first = Jeffrey C.
 | year = 1998
 | title = Topic Segmentation: Algorithms and Applications
 | degree = PhD
 | publisher = [[University of Pennsylvania]]
 | id = IRCS-98-21
 | url = https://repository.upenn.edu/handle/20.500.14332/37673
 | format = PDF
 | access-date = 2025-03-31
}}</ref> e.g. [[hidden Markov model|HMM]], [[lexical chain|lexical chains]], passage similarity using word [[co-occurrence]], [[cluster analysis|clustering]], [[topic modeling]], etc.
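As an illustration of the passage-similarity family of approaches (in the spirit of TextTiling), the sketch below compares adjacent blocks of sentences using cosine similarity over word counts and proposes a boundary wherever lexical overlap drops below a threshold. The block size and threshold are illustrative choices; full TextTiling instead smooths the similarity curve and places boundaries at sufficiently deep local minima.

<syntaxhighlight lang="python">
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def topic_boundaries(sentences, block_size=2, threshold=0.1):
    """Propose topic boundaries between sentences.

    For each gap, compare the block of sentences before it with the
    block after it; low lexical overlap suggests a topic shift.
    Returns the indices of gaps (a boundary after sentence i-1).
    """
    bags = [Counter(s.lower().split()) for s in sentences]
    boundaries = []
    for gap in range(block_size, len(bags) - block_size + 1):
        left = sum((bags[i] for i in range(gap - block_size, gap)), Counter())
        right = sum((bags[i] for i in range(gap, gap + block_size)), Counter())
        if cosine(left, right) < threshold:
            boundaries.append(gap)
    return boundaries
</syntaxhighlight>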
 
Topic segmentation is quite an ambiguous task: people evaluating text segmentation systems often differ on where the topic boundaries lie, so evaluating a proposed segmentation is itself a challenging problem.
A common evaluation metric is ''WindowDiff'', which slides a window of <math>k</math> units across the text and counts the windows in which the reference and hypothesized segmentations disagree on the number of boundaries:

<math>\text{WindowDiff}(\text{ref}, \text{hyp}) = \frac{1}{N-k} \sum_{i=1}^{N-k} \mathbf{1}\!\left[\, \left| b(\text{ref}_i, \text{ref}_{i+k}) - b(\text{hyp}_i, \text{hyp}_{i+k}) \right| > 0 \,\right]</math>

where <math>b(x_i, x_j)</math> is the number of boundaries between positions <math>i</math> and <math>j</math> in segmentation <math>x</math>, and <math>N</math> is the number of units in the text.
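A direct reading of the formula as code, assuming segmentations are encoded as 0/1 arrays in which 1 marks a boundary after the corresponding unit:

<syntaxhighlight lang="python">
def window_diff(ref, hyp, k):
    """WindowDiff: fraction of length-k windows in which the
    reference and hypothesis disagree on the number of boundaries.
    0.0 means the segmentations are identical."""
    n = len(ref)
    assert len(hyp) == n and 0 < k < n
    disagreements = sum(
        1 for i in range(n - k)
        if sum(ref[i:i + k]) != sum(hyp[i:i + k])
    )
    return disagreements / (n - k)

# The hypothesis places the first boundary one unit too late:
print(window_diff([0, 0, 1, 0, 0, 0, 1, 0],
                  [0, 0, 0, 1, 0, 0, 1, 0], k=2))  # 2/6 windows differ
</syntaxhighlight>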
 
=== Other segmentation problems ===
Processes may be required to segment text into segments besides those mentioned above, including [[morpheme]]s (a task usually called [[morphology (linguistics)|morphological analysis]]) or [[paragraph]]s.
 
== Automatic segmentation approaches ==
 
Automatic segmentation is the problem in [[natural language processing]] of implementing a computer process to segment text.
 
When punctuation and similar clues are not consistently available, the segmentation task often requires fairly non-trivial techniques, such as statistical decision-making, large dictionaries, and consideration of syntactic and semantic constraints. Effective natural language processing systems and text segmentation tools usually operate on text in specific domains and sources. For example, processing text used in medical records is a very different problem than processing news articles or real estate advertisements.

The process of developing text segmentation tools starts with collecting a large corpus of text in an application ___domain. There are two general approaches:
* Manual analysis of the text and writing custom software
* Annotating the sample corpus with boundary information and using [[machine learning]] (a minimal sketch of this approach is shown below)
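A minimal sketch of the machine-learning approach, here for sentence boundaries and assuming [[scikit-learn]] is available; the two features and the tiny annotated sample are illustrative assumptions, not a recommended feature set:

<syntaxhighlight lang="python">
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def boundary_features(prev_token, next_token):
    """Features describing a candidate boundary at a full stop."""
    return {
        "prev_token_short": len(prev_token) <= 4,   # e.g. 'Mr.', 'Dr.'
        "next_capitalized": next_token[:1].isupper(),
    }

# Tiny illustrative annotated corpus: (token before the full stop,
# token after it, whether a sentence boundary really occurs there).
annotated = [
    ("Street.", "He",    True),
    ("Mr.",     "Smith", False),
    ("Dr.",     "Jones", False),
    ("shops.",  "Then",  True),
    ("etc.",    "and",   False),
    ("home.",   "The",   True),
]

vec = DictVectorizer()
X = vec.fit_transform([boundary_features(p, n) for p, n, _ in annotated])
y = [label for _, _, label in annotated]

clf = LogisticRegression().fit(X, y)

# Classify a new candidate boundary.
test = vec.transform(boundary_features("St.", "He"))
print(clf.predict(test))  # likely [False]: 'St.' looks like an abbreviation
</syntaxhighlight>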
 
Some text segmentation systems take advantage of markup such as HTML, and of known document formats such as PDF, to provide additional evidence for sentence and paragraph boundaries.
 
== See also ==
* [[Syllabification|Hyphenation]]
* [[Natural language processing]]
* [[Speech segmentation]]
* [[Lexical analysis]]
* [[Word count]]
* [[Line wrap and word wrap|Line breaking]]
* [[Image segmentation]]
 
== References ==
{{Reflist}}
 
 
{{Natural Language Processing}}
 
[[Category:Tasks of natural language processing]]