Text segmentation: Difference between revisions

Revision as of 23:15, 19 November 2017 by Tom.Reding. Edit summary (minor edit): Rep typographic ligature "ﬁ" with plain text; possible ref cleanup; WP:GenFixes on, Enum'd 1 author/editor WL,, replaced: ﬁ → fi, typo(s) fixed: For example → For example, using AWB
Latest revision as of 14:19, 30 April 2025 by Headbomb. Edit summary: →Topic segmentation: ce
(25 intermediate revisions by 20 users not shown)
Line 1:
{{Short description\|Human writing practice}}
{{Use dmy dates\|date=March 2016}}
{{Refimprove\|date=October 2011}}
'''Text segmentation''' is the process of dividing written text into meaningful units, such as words, [[~~Sentence~~sentence (linguistics)\|sentence]]s, or [[topic (linguistics)\|topic]]s. The term applies both to [[mental process]]es used by humans when reading text, and to artificial processes implemented in computers, which are the subject of [[natural language processing]]. The problem is non-trivial, because while some written languages have explicit word boundary markers, such as the word spaces of written English and the distinctive initial, medial and final letter shapes of [[Arabic language\|Arabic]], such signals are sometimes ambiguous and not present in all written languages.

Compare [[speech segmentation]], the process of dividing speech into linguistically meaningful portions.

Line 22 ⟶ 23:
Word splitting may also refer to the process of [[Syllabification\|hyphenation]].

Some scholars have suggested that modern Chinese should be written with word segmentation, that is, with spaces between words as in written English,<ref>{{cite journal \|last=Zhang \|first=Xiao-heng \|journal=中文信息学报 \|date=1998 \|script-title=zh:也谈汉语书面语的分词问题——分词连写十大好处 \|trans-title=Written Chinese Word-Segmentation Revisited: Ten advantages of word-segmented writing \|url=http://jcip.cipsc.org.cn/CN/Y1998/V12/I3/58 \|language=zh-Hans \|script-journal=zh:中文信息学报 \|trans-journal=[[Journal of Chinese Information Processing]] \|volume=12 \|issue=3 \|pages=58–64 \|access-date=2025-03-31}}</ref> because some unsegmented texts are ambiguous and only the author knows the intended meaning. For example, "美国会不同意。" may mean "美国 会 不同意。" (The US will not agree.) or "美 国会 不同意。" (The US Congress does not agree). For more details, see [[Chinese word-segmented writing]].
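The ambiguity in the example above can be reproduced with a small sketch that enumerates every way of splitting an unspaced string into dictionary words. This is only an illustration: the function name `segmentations` and the five-entry `LEXICON` are hypothetical and chosen to fit this one sentence, whereas a practical segmenter would use a large dictionary and a statistical model to rank the candidates.

```python
def segmentations(text, lexicon):
    """Return every way to split `text` into words that all occur in `lexicon`."""
    if not text:
        return [[]]  # exactly one way to segment the empty string: no words
    results = []
    for end in range(1, len(text) + 1):
        word = text[:end]
        if word in lexicon:
            # Keep this prefix as a word and segment the remainder recursively.
            for rest in segmentations(text[end:], lexicon):
                results.append([word] + rest)
    return results


# Toy lexicon, assumed purely for illustration.
LEXICON = {"美", "美国", "国会", "会", "不同意"}

for words in segmentations("美国会不同意", LEXICON):
    print(" ".join(words))
# 美 国会 不同意
# 美国 会 不同意
```

Both outputs are valid dictionary segmentations, so choosing between them requires context that the character string alone does not supply.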
=== Intent segmentation ===
{{Confusing section\|date=September 2019}}
~~{{See also\|Tri-box method}}~~
Intent segmentation is the problem of dividing written words into keyphrases (groups of two or more words).

In English and all other languages, the core intent or desire is identified and becomes the cornerstone of keyphrase intent segmentation. A core product or service, idea, action, or thought anchors each keyphrase.

"[All things are made of '''atoms''']. [Little '''particles''' that move] [around in perpetual '''motion'''], [~~attraction~~attracting each '''other'''] [when they are a little '''distance''' apart], [but '''repelling'''] [upon being '''squeezed'''] [into '''one another''']."

=== Sentence segmentation ===
{{See also\|Sentence boundary disambiguation}}
Sentence segmentation is the problem of dividing a string of written language into its component [[Sentence (linguistics)\|sentences]]. In English and some other languages, using punctuation, particularly the [[full stop]]/period character, is a reasonable approximation. However, even in English this problem is not trivial, because the full stop character is also used for abbreviations, which may or may not also terminate a sentence. For example, ''Mr.'' is not its own sentence in "''Mr. Smith went to the shops in Jones Street.''" When processing plain text, tables of abbreviations that contain periods can help prevent incorrect assignment of sentence boundaries.

As with word segmentation, not all written languages contain punctuation characters ~~which~~that are useful for approximating sentence boundaries.

=== Topic segmentation ===
~~{{main\|Topic analysis}}~~{{See also\|Document classification}}
Topic analysis consists of two main tasks: topic identification and text segmentation.
While the first is a simple [[machine learning\|classification]] of a specific text, the latter case implies that a document may contain multiple topics, and the task of computerized text segmentation may be to discover these topics automatically and segment the text accordingly. The topic boundaries may be apparent from section titles and paragraphs. In other cases, one needs to use techniques similar to those used in [[document classification]].

Segmenting the text into [[topic (linguistics)\|topic]]s or [[discourse]] turns might be useful in some natural language processing tasks: it can improve [[information retrieval]] or [[speech recognition]] significantly (by indexing/recognizing documents more precisely or by giving the specific part of a document corresponding to the query as a result). It is also needed in [[topic detection]] and tracking systems and [[text summarization\|text summarizing]] problems.

Many different approaches have been tried:<ref>{{~~Cite~~cite conference
 \| ~~author = Freddy Y. Y. Choi~~last = Choi \| first = Freddy Y. Y.
 \| year = 2000
 \| title = Advances in domain independent linear text segmentation
 \| url = ~~http://www.aclweb.org/anthology/A00-2004~~https://aclanthology.org/A00-2004/
 \| ~~booktitle~~book-title = Proceedings of the 1st Meeting of the North American Chapter of the Association for Computational Linguistics (ANLP-NAACL-00)
 \| pages = 26–33 \| arxiv=cs/0003083 \| access-date = 2025-03-31
~~\|format=PDF~~}}</ref><ref>{{~~Cite journal~~cite thesis
 \| ~~author = Jeffrey C. Reynar \| author-link = Jeffrey C. Reynar~~last = Reynar \| first = Jeffrey C.
 \| year = 1998
 \| title = Topic Segmentation: Algorithms and Applications
 \| url = ~~http://repository.upenn.edu/cgi/viewcontent.cgi?article=1068&context=ircs_reports~~https://repository.upenn.edu/handle/20.500.14332/37673
 \| format = PDF ~~\| accessdate = 2007-11-08~~ \| degree = PhD
 \| publisher = [[University of Pennsylvania]]
 \| ~~version~~id = IRCS-98-21 \| access-date = 2025-03-31
}}</ref> e.g. [[~~Hidden~~hidden Markov model\|HMM]], [[~~lexical chains~~lexical chain\|lexical chains]], passage similarity using word [[co-occurrence]], [[cluster analysis\|clustering]], [[topic modeling]], etc.

The task is quite ambiguous: people evaluating text segmentation systems often disagree about where topic boundaries lie. Hence, evaluating text segmentation is also a challenging problem.

<!-- <math> WindowDiff(ref,hyp) {{=}} \frac{1}{N-k} \sum \|b(ref_i,ref_{i+k})-b(hyp_i,hyp_{i+k})\|</math> -->

Line 87 ⟶ 93:
* [[Word count]]
* [[Line wrap and word wrap\|Line breaking]]
* [[Image segmentation]]

{{Natural Language Processing}}▼

== References ==
{{Reflist}}

~~==External links==~~
▲{{Natural Language Processing}}
* [http://wordseg.codeplex.com/ Word Segment] An open source software tool for word segmentation in Chinese.
* [http://www.whitemagicsoftware.com/software/java/wordsplit/ Word Split] An open source software tool designed to split conjoined words into human-readable text.
* [http://nlp.stanford.edu/software/segmenter.shtml Stanford Segmenter] An open source software tool for word segmentation in Chinese or morpheme segmentation in Arabic.
* [http://www.phontron.com/kytea KyTea] An open source software tool for word segmentation in Japanese and Chinese.
* [http://chinesenotes.com/ Chinese Notes] A Chinese–English dictionary that also does word segmentation.
* [http://www.zhihuita.org/service/tokenizer Zhihuita Segmentor] High-precision, high-performance freeware for Chinese segmentation.
* [http://www.grantjenks.com/docs/wordsegment/ Python wordsegment module] An open source Python module for English word segmentation.

[[Category:Tasks of natural language processing]]