m Changing short description from "Process of dividing written text into meaningful units, such as words, sentences, or topics" to "Human writing practice" |
(10 intermediate revisions by 6 users not shown)
Line 3:
{{Refimprove|date=October 2011}}
'''Text segmentation''' is the process of dividing written text into meaningful units, such as words, [[
Compare [[speech segmentation]], the process of dividing speech into linguistically meaningful portions.
Line 23:
Word splitting may also refer to the process of [[Syllabification|hyphenation]].
Some scholars have suggested that modern Chinese should be written with word segmentation, that is, with spaces between words as in written English,<ref>{{cite journal |last=Zhang |first=Xiao-heng |journal=中文信息学报 |date=1998 |script-title=zh:也谈汉语书面语的分词问题——分词连写十大好处 |trans-title=Written Chinese Word-Segmentation Revisited: Ten advantages of word-segmented writing |url=http://jcip.cipsc.org.cn/CN/Y1998/V12/I3/58 |language=zh-Hans |script-journal=zh:中文信息学报 |trans-journal=[[Journal of Chinese Information Processing]] |volume=12 |issue=3 |pages=58–64 |access-date=2025-03-31}}</ref> because some unsegmented texts are ambiguous and only the author knows the intended meaning. For example, "美国会不同意。" may mean "美国 会 不同意。" (The US will not agree.) or "美 国会 不同意。" (The US Congress does not agree). For more details, see [[Chinese word-segmented writing]].
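The ambiguity in the example above can be illustrated with a minimal sketch that enumerates every way of splitting an unsegmented string into dictionary words. The function name <code>segmentations</code> and the five-entry toy lexicon are assumptions made for illustration only; practical segmenters use large dictionaries and statistical models to rank the candidates.

```python
def segmentations(text, lexicon):
    """Return every way to split `text` into words found in `lexicon`."""
    if not text:
        # One way to segment the empty string: no words at all.
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        word = text[:end]
        if word in lexicon:
            # Keep this prefix as a word and segment the remainder recursively.
            for rest in segmentations(text[end:], lexicon):
                results.append([word] + rest)
    return results


# Toy lexicon: an assumption for illustration, not a real dictionary.
LEXICON = {"美国", "美", "国会", "会", "不同意"}

for words in segmentations("美国会不同意", LEXICON):
    print(" ".join(words))
```

With this lexicon the sketch finds exactly the two readings given above, and nothing in the string itself indicates which one the author intended.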
=== Intent segmentation ===
Line 44 ⟶ 47:
Segmenting the text into [[topic (linguistics)|topic]]s or [[discourse]] turns might be useful in some natural language processing tasks: it can improve [[information retrieval]] or [[speech recognition]] significantly (by indexing/recognizing documents more precisely or by returning the specific part of a document that corresponds to the query). It is also needed in [[topic detection]] and tracking systems and in [[text summarization|text summarizing]] problems.
Many different approaches have been tried:<ref>{{
|
| year = 2000
| title = Advances in domain independent linear text segmentation
| book-title = Proceedings of the 1st Meeting of the North American Chapter of the Association for Computational Linguistics (ANLP-NAACL-00)
|
| arxiv=cs/0003083
| access-date = 2025-03-31
| url = http://www.aclweb.org/anthology/A00-2004
| last = Reynar |
| year = 1998
| url = https://repository.upenn.edu/handle/20.500.14332/37673
| format = PDF
| publisher = [[University of Pennsylvania]]
}}</ref> e.g. [[
Topic segmentation is an ambiguous task: people evaluating text segmentation systems often disagree about where the topic boundaries lie. Hence, evaluating text segmentation is itself a challenging problem.
<!-- <math>\mathrm{WindowDiff}(ref, hyp) {{=}} \frac{1}{N-k} \sum_{i=1}^{N-k} \left| b(ref_i, ref_{i+k}) - b(hyp_i, hyp_{i+k}) \right|</math> -->
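One way to quantify disagreement between a reference segmentation and a hypothesized one is a window-based score in the style of WindowDiff. The following is a minimal sketch, not a reference implementation. It assumes that each segmentation is given as a list of 0/1 flags of equal length ''N'', where 1 marks a boundary after that unit, and that ''b'' counts the boundaries inside a window of ''k'' units. The name <code>window_diff</code> and this input encoding are illustrative assumptions. The sketch sums the absolute difference in boundary counts per window; some formulations instead count only whether the two counts differ.

```python
def window_diff(ref, hyp, k):
    """Window-based disagreement between two boundary sequences.

    ref, hyp: equal-length lists of 0/1, where 1 marks a boundary
              after that unit (an assumed encoding for this sketch).
    k:        window size in units, with 0 < k < len(ref).
    """
    if len(ref) != len(hyp):
        raise ValueError("ref and hyp must have the same length")
    n = len(ref)
    if not 0 < k < n:
        raise ValueError("k must satisfy 0 < k < len(ref)")

    def b(seg, i):
        # Number of boundaries inside the window starting at unit i.
        return sum(seg[i:i + k])

    # Slide the window over all N - k positions and average the disagreement.
    total = sum(abs(b(ref, i) - b(hyp, i)) for i in range(n - k))
    return total / (n - k)


reference = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
hypothesis = [0, 0, 0, 1, 0, 0, 1, 0, 0, 0]  # first boundary placed one unit late
print(window_diff(reference, hypothesis, k=3))
print(window_diff(reference, reference, k=3))  # identical segmentations score 0.0
```

A near miss, such as the boundary placed one unit late here, is penalized only in the windows where the two boundary counts differ, so it costs less than a boundary that is missing altogether.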
Line 88 ⟶ 93:
* [[Word count]]
* [[Line wrap and word wrap|Line breaking]]
* [[Image segmentation]]
== References ==
{{Reflist}}
{{Natural Language Processing}}
[[Category:Tasks of natural language processing]]