Text segmentation: Difference between revisions

Line 3:
{{Refimprove|date=October 2011}}
 
'''Text segmentation''' is the process of dividing written text into meaningful units, such as words, [[sentence (linguistics)|sentence]]s, or [[topic (linguistics)|topic]]s. The term applies both to [[mental process]]es used by humans when reading text, and to artificial processes implemented in computers, which are the subject of [[natural language processing]]. The problem is non-trivial, because while some written languages have explicit word boundary markers, such as the word spaces of written English and the distinctive initial, medial and final letter shapes of [[Arabic language|Arabic]], such signals are sometimes ambiguous and not present in all written languages.
 
Compare [[speech segmentation]], the process of dividing speech into linguistically meaningful portions.
Line 25:
 
Some scholars have suggested that modern Chinese should be written with word segmentation, with spaces between words as in written English,<ref>{{cite journal |last=Zhang |first=Xiao-heng |date=1998 |script-title=zh:也谈汉语书面语的分词问题——分词连写十大好处 |trans-title=Written Chinese Word-Segmentation Revisited: Ten advantages of word-segmented writing |url=http://jcip.cipsc.org.cn/CN/Y1998/V12/I3/58 |language=zh-Hans |script-journal=zh:中文信息学报 |trans-journal=[[Journal of Chinese Information Processing]] |volume=12 |issue=3 |pages=58–64 |access-date=2025-03-31}}</ref> because some texts are ambiguous and only the author knows the intended meaning. For example, "美国会不同意。" may mean "美国 会 不同意。" (The US will not agree.) or "美 国会 不同意。" (The US Congress does not agree). For more details, see [[Chinese word-segmented writing]].
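
The ambiguity can be reproduced mechanically. The following is a minimal sketch, assuming a tiny hand-made lexicon (the name <code>LEXICON</code> and its five entries are hypothetical and chosen only for this example); it enumerates every way of splitting the unspaced string into lexicon words and finds both readings.

```python
def segmentations(text, lexicon):
    """Return every way to split `text` into consecutive words from `lexicon`."""
    if not text:
        return [[]]  # one way to segment the empty string: no words
    results = []
    for end in range(1, len(text) + 1):
        word = text[:end]
        if word in lexicon:
            # Keep this word and segment the remainder recursively.
            for rest in segmentations(text[end:], lexicon):
                results.append([word] + rest)
    return results


# Hypothetical toy lexicon, just large enough to cover the example sentence.
LEXICON = {"美", "美国", "国会", "会", "不同意"}

for words in segmentations("美国会不同意", LEXICON):
    print(" ".join(words))
```

Both readings are valid with respect to the lexicon, so a dictionary alone cannot choose between them; a practical segmenter would need context or statistics to do so.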
 
=== Intent segmentation ===
Line 48 ⟶ 47:
Segmenting text into [[topic (linguistics)|topic]]s or [[discourse]] turns can be useful in some natural language processing tasks: it can significantly improve [[information retrieval]] or [[speech recognition]] (by indexing or recognizing documents more precisely, or by returning the specific part of a document that corresponds to a query). It is also needed in [[topic detection]] and tracking systems and in [[text summarization]].
 
Many different approaches have been tried:<ref>{{cite conference
| last = Choi | first = Freddy Y. Y.
| year = 2000
| url = https://aclanthology.org/A00-2004/
| title = Advances in domain independent linear text segmentation
| format = PDF
| book-title = Proceedings of the 1st Meeting of the North American Chapter of the Association for Computational Linguistics (ANLP-NAACL-00)
| pages = 26–33
| doi = 10.48550/arXiv.cs/0003083
| access-date = 2025-03-31
}}</ref><ref>{{cite thesis
| last = Reynar | first = Jeffrey C.
| year = 1998
| author-link = Jeffrey C. Reynar
| url = https://repository.upenn.edu/handle/20.500.14332/37673
| title = Topic Segmentation: Algorithms and Applications
| format = PDF
| degree = PhD
| publisher = [[University of Pennsylvania]]
| id = IRCS-98-21
| access-date = 2025-03-31
}}</ref> e.g. [[hidden Markov model|HMM]], [[lexical chain|lexical chains]], passage similarity using word [[co-occurrence]], [[cluster analysis|clustering]], [[topic modeling]], etc.
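
As an illustration of the passage-similarity idea, the following is a simplified sketch and not the algorithm of either cited work. It assumes that each sentence is reduced to a bag of lowercase words, that adjacent sentences are compared by cosine similarity, and that a boundary is placed wherever the similarity falls below a threshold; the function names and the threshold value are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(sentence):
    """Count the lowercase alphabetic tokens in a sentence."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def boundaries(sentences, threshold=0.1):
    """Return the indices of sentences that start a new segment.

    Assumption for this sketch: a boundary is placed before sentence i + 1
    when its similarity to sentence i is below `threshold`.
    """
    bags = [bag_of_words(s) for s in sentences]
    return [
        i + 1
        for i in range(len(bags) - 1)
        if cosine(bags[i], bags[i + 1]) < threshold
    ]


sentences = [
    "the cat sat on the mat",
    "the cat chased the mouse",
    "stock prices fell sharply",
    "stock markets closed lower",
]
print(boundaries(sentences))
```

The sketch does no stop-word removal, stemming or smoothing, so function words such as "the" inflate the similarity; real systems typically compare larger blocks of text and normalize the vocabulary first.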
 
Topic segmentation is an ambiguous task: people evaluating text segmentation systems often disagree on where topic boundaries lie. Hence, evaluating text segmentation is itself a challenging problem.
<!-- <math>\operatorname{WindowDiff}(ref,hyp) = \frac{1}{N-k} \sum_{i} |b(ref_i,ref_{i+k})-b(hyp_i,hyp_{i+k})|</math> -->
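
One window-based way to score a hypothesized segmentation against a reference is to slide a window of width ''k'' over both, count the boundaries each places inside the window, and average the absolute difference of the two counts over all window positions. The following is a minimal sketch of that computation. It assumes each segmentation is given as a sequence of 0/1 flags, one per gap between adjacent units, with 1 marking a boundary; the function name and this input encoding are assumptions made for the example.

```python
def window_diff(ref, hyp, k):
    """Average absolute difference in boundary counts over sliding windows.

    `ref` and `hyp` hold one 0/1 flag per gap between adjacent text units
    (1 = boundary). For N units there are N - 1 gaps and N - k windows.
    """
    if len(ref) != len(hyp):
        raise ValueError("segmentations must cover the same number of gaps")
    n_windows = len(ref) - k + 1
    if k < 1 or n_windows < 1:
        raise ValueError("window width k must be between 1 and the number of gaps")
    total = 0
    for i in range(n_windows):
        # b(x_i, x_{i+k}): number of boundaries inside the window of k gaps
        total += abs(sum(ref[i:i + k]) - sum(hyp[i:i + k]))
    return total / n_windows


ref = [0, 0, 1, 0, 0, 0, 1, 0, 0]
hyp = [0, 1, 0, 0, 0, 0, 1, 0, 0]  # first boundary placed one gap too early
print(window_diff(ref, hyp, 2))
```

A score of 0 means the two segmentations agree in every window. A near miss, as in the example, is penalized only in the windows that contain one boundary but not the other, rather than being counted as both a miss and a false alarm across the whole text.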
 
Line 92 ⟶ 94:
* [[Word count]]
* [[Line wrap and word wrap|Line breaking]]
* [[Image segmentation]]
 
== References ==
{{Reflist}}
 
 
{{Natural Language Processing}}
 
[[Category:Tasks of natural language processing]]