Revision as of 15:19, 9 May 2024 by 204.137.248.87 (→Word segmentation)
Revision as of 04:23, 1 April 2025 by Uzume (→See also: +Image segmentation)
Line 3:
{{Refimprove|date=October 2011}}
'''Text segmentation''' is the process of dividing written text into meaningful units, such as words, [[sentence (linguistics)|sentence]]s, or [[topic (linguistics)|topic]]s. The term applies both to [[mental process]]es used by humans when reading text, and to artificial processes implemented in computers, which are the subject of [[natural language processing]]. The problem is non-trivial: although some written languages have explicit word boundary markers, such as the word spaces of written English and the distinctive initial, medial and final letter shapes of [[Arabic language|Arabic]], such signals are sometimes ambiguous and are not present in all written languages.

Compare [[speech segmentation]], the process of dividing speech into linguistically meaningful portions.

Line 25:
Some scholars have suggested that modern Chinese should be written with word segmentation, that is, with spaces between words as in written English,<ref>{{cite journal |last=Zhang |first=Xiao-heng |date=1998 |script-title=zh:也谈汉语书面语的分词问题——分词连写十大好处 |trans-title=Written Chinese Word-Segmentation Revisited: Ten advantages of word-segmented writing |url=http://jcip.cipsc.org.cn/CN/Y1998/V12/I3/58 |language=zh-Hans |script-journal=zh:中文信息学报 |trans-journal=[[Journal of Chinese Information Processing]] |volume=12 |issue=3 |pages=58–64 |access-date=2025-03-31}}</ref> because some texts are ambiguous and only the author knows the intended meaning. For example, "美国会不同意。" may mean "美国 会 不同意。" (The US will not agree.) or "美 国会 不同意。" (The US Congress does not agree). For more details, see [[Chinese word-segmented writing]].
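The ambiguity in the example above can be made concrete with a small sketch that lists every way of splitting the unspaced string into dictionary words. The function name <code>segmentations</code> and the five-entry lexicon are assumptions chosen for illustration; they are not taken from the cited source, and practical segmenters rank candidates with statistical models rather than enumerating them.

```python
def segmentations(text, lexicon):
    """Return every way to split `text` into words that all appear in `lexicon`."""
    if not text:
        return [[]]  # one way to segment the empty string: no words
    results = []
    for end in range(1, len(text) + 1):
        word = text[:end]
        if word in lexicon:
            # Keep this prefix as a word and segment the remainder recursively.
            for rest in segmentations(text[end:], lexicon):
                results.append([word] + rest)
    return results


# Hypothetical toy lexicon, just large enough to cover both readings.
LEXICON = {"美", "美国", "国会", "会", "不同意"}

for words in segmentations("美国会不同意", LEXICON):
    print(" ".join(words))
```

With this toy lexicon the sketch finds exactly the two readings given in the example, which shows why a dictionary alone cannot settle the segmentation.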
=== Intent segmentation ===

Line 48 ⟶ 47:
Segmenting the text into [[topic (linguistics)|topic]]s or [[discourse]] turns might be useful in some natural language processing tasks: it can improve [[information retrieval]] or [[speech recognition]] significantly (by indexing or recognizing documents more precisely, or by returning the specific part of a document that corresponds to the query). It is also needed in [[topic detection]] and tracking systems and in [[text summarization|text summarizing]] problems.

Many different approaches have been tried:<ref>{{cite conference | last = Choi | first = Freddy Y. Y. | year = 2000 | url = https://aclanthology.org/A00-2004/ | title = Advances in domain independent linear text segmentation | format = PDF | book-title = Proceedings of the 1st Meeting of the North American Chapter of the Association for Computational Linguistics (ANLP-NAACL-00) | pages = 26–33 | doi = 10.48550/arXiv.cs/0003083 | access-date = 2025-03-31 }}</ref><ref>{{cite thesis | last = Reynar | first = Jeffrey C. | year = 1998 | url = https://repository.upenn.edu/handle/20.500.14332/37673 | title = Topic Segmentation: Algorithms and Applications | format = PDF | publisher = [[University of Pennsylvania]] | degree = PhD | id = IRCS-98-21 | access-date = 2025-03-31 }}</ref> e.g. [[hidden Markov model|HMM]], [[lexical chain|lexical chains]], passage similarity using word [[co-occurrence]], [[cluster analysis|clustering]], [[topic modeling]], etc.
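As a rough illustration of the passage-similarity idea, the sketch below compares the bag of words on either side of each sentence gap and proposes a topic boundary wherever the cosine similarity drops below a threshold. It is a deliberately simplified assumption-laden example, not the method of either cited work: the function names, the whitespace tokenization, the <code>window</code> size and the <code>threshold</code> value are all chosen only for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def gap_similarities(sentences, window=2):
    """Similarity between the `window` sentences before and after each gap."""
    bags = [Counter(s.lower().split()) for s in sentences]
    sims = []
    for gap in range(1, len(bags)):
        left = sum(bags[max(0, gap - window):gap], Counter())
        right = sum(bags[gap:gap + window], Counter())
        sims.append(cosine(left, right))
    return sims


def boundaries(sentences, window=2, threshold=0.1):
    """Indices of sentences that start a new segment (illustrative threshold)."""
    sims = gap_similarities(sentences, window)
    return [gap for gap, sim in enumerate(sims, start=1) if sim < threshold]


# Made-up sentences: two about a cat, then two about the stock market.
SENTENCES = [
    "the cat sat on the mat",
    "the cat chased a mouse",
    "stocks fell sharply today",
    "investors sold stocks quickly",
]

print(boundaries(SENTENCES))
```

On these made-up sentences the only gap with no shared vocabulary is the one before the third sentence, so that is where the sketch places a boundary. A fixed threshold is the simplest possible decision rule; published systems typically look for local minima in the similarity curve instead.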
It is quite an ambiguous task: people evaluating text segmentation systems often disagree on where the topic boundaries lie. Hence, evaluating text segmentation is also a challenging problem.
<!-- <math>\operatorname{WindowDiff}(ref,hyp) = \frac{1}{N-k} \sum_{i=1}^{N-k} |b(ref_i,ref_{i+k})-b(hyp_i,hyp_{i+k})|</math> -->

Line 92 ⟶ 94:
* [[Word count]]
* [[Line wrap and word wrap|Line breaking]]
* [[Image segmentation]]

== References ==
{{Reflist}}

{{Natural Language Processing}}

[[Category:Tasks of natural language processing]]

Text segmentation: Difference between revisions