{{Short description|Human writing practice}}
{{Use dmy dates|date=March 2016}}
{{Refimprove|date=October 2011}}
'''Text segmentation''' is the process of dividing written text into meaningful units, such as words, [[Sentence (linguistics)|sentences]], or [[topic (linguistics)|topic]]s. The term applies both to mental processes used by humans when reading text, and to artificial processes implemented in computers, which are the subject of [[natural language processing]]. The problem is non-trivial, because while some written languages have explicit word boundary markers, such as the word spaces of written English and the distinctive initial, medial and final letter shapes of Arabic, such signals are sometimes ambiguous and not present in all written languages.
Compare [[speech segmentation]], the process of dividing speech into linguistically meaningful portions.
== Segmentation problems ==
=== Word segmentation ===
Word segmentation is the problem of dividing a string of written language into its component [[word]]s.
In English and many other languages using some form of the [[Latin alphabet]], the [[Space (punctuation)|space]] is a good approximation of a [[word divider]] (word delimiter), although this concept has limits because of the variability with which languages regard [[collocation]]s and [[Compound (linguistics)|compound]]s.
However, the equivalent to this character is not found in all written scripts, and without it word segmentation is a difficult problem. Languages which do not have a trivial word segmentation process include Chinese and Japanese, where sentences but not words are delimited; Thai and Lao, where phrases and sentences but not words are delimited; and Vietnamese, where syllables but not words are delimited.
In some writing systems however, such as the [[Ge'ez script]] used for [[Amharic]] and [[Tigrinya language|Tigrinya]] among other languages, words are explicitly delimited (at least historically) with a non-whitespace character.
The [[Unicode Consortium]] has published a ''Standard Annex on Text Segmentation'',<ref>[https://www.unicode.org/reports/tr29/ UAX #29: Unicode Text Segmentation]</ref> exploring the issues of segmentation in multiscript texts.
'''Word splitting''' is the process of [[parsing]] [[concatenated]] text (i.e. text that contains no spaces or other word separators) to infer where word breaks exist.
Word splitting may also refer to the process of [[Syllabification|hyphenation]].
Some scholars have suggested that modern Chinese should be written with word segmentation, i.e. with spaces between words as in written English,<ref>{{cite journal |last=Zhang |first=Xiao-heng |journal=中文信息学报 |date=1998 |script-title=zh:也谈汉语书面语的分词问题——分词连写十大好处 |trans-title=Written Chinese Word-Segmentation Revisited: Ten advantages of word-segmented writing |url=http://jcip.cipsc.org.cn/CN/Y1998/V12/I3/58 |language=zh-Hans |script-journal=zh:中文信息学报 |trans-journal=[[Journal of Chinese Information Processing]] |volume=12 |issue=3 |pages=58–64 |access-date=2025-03-31}}</ref> because there are ambiguous texts for which only the author knows the intended meaning. For example, "美国会不同意。" may mean "美国 会 不同意。" (The US will not agree.) or "美 国会 不同意。" (The US Congress does not agree.). For more details, see [[Chinese word-segmented writing]].
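A common baseline for segmenting unspaced text is greedy forward maximum matching against a lexicon. A minimal sketch in Python (the toy dictionary is illustrative; real systems use lexicons with many thousands of entries and statistical disambiguation):

```python
# Greedy forward maximum matching: repeatedly take the longest
# dictionary entry that is a prefix of the remaining text.
def max_match(text, dictionary):
    words = []
    while text:
        for end in range(len(text), 0, -1):
            # Fall back to a single character when nothing matches.
            if text[:end] in dictionary or end == 1:
                words.append(text[:end])
                text = text[end:]
                break
    return words

# Toy lexicon covering the ambiguous example from the text.
d = {"美国", "美", "国会", "会", "不同意", "不", "同意"}
print(max_match("美国会不同意", d))  # → ['美国', '会', '不同意']
```

Note that the greedy heuristic commits to one reading ("美国 会 不同意") and cannot recover the alternative "美 国会 不同意", which illustrates why ambiguity makes word segmentation hard.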
=== Intent segmentation ===
{{Confusing section|date=September 2019}}
Intent segmentation is the problem of dividing written words into keyphrases (groups of two or more words).
In English and other languages, the core intent or desire is identified and becomes the cornerstone of the keyphrase. A core product or service, idea, action, or thought anchors each keyphrase.
"[All things are made of '''atoms''']. [Little '''particles''' that move] [around in perpetual '''motion'''], [attracting each '''other'''] [when they are a little '''distance''' apart], [but '''repelling'''] [upon being '''squeezed'''] [into '''one another''']."
=== Sentence segmentation ===
{{See also|Sentence boundary disambiguation}}
Sentence segmentation is the problem of dividing a string of written language into its component [[Sentence (linguistics)|sentences]]. In English and some other languages, using punctuation, particularly the [[full stop]]/period character, is a reasonable approximation. However, even in English this problem is not trivial due to the use of the full stop character for abbreviations, which may or may not also terminate a sentence. For example, ''Mr.'' is not its own sentence in "''Mr. Smith went to the shops in Jones Street.''" When processing plain text, tables of abbreviations that contain periods can help prevent incorrect assignment of sentence boundaries.
As with word segmentation, not all written languages contain punctuation characters that are useful for approximating sentence boundaries.
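The abbreviation-table technique described above can be sketched in a few lines of Python (the abbreviation list and the capital-letter heuristic are simplifying assumptions; production tokenizers use much larger tables and trained models):

```python
import re

# A small abbreviation table; real systems use far larger lists.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "st.", "e.g.", "i.e."}

def split_sentences(text):
    sentences, start = [], 0
    # Candidate boundary: ., ! or ? followed by whitespace and a capital letter.
    for m in re.finditer(r'[.!?](?=\s+[A-Z])', text):
        token = text[:m.end()].split()[-1]   # the word containing the period
        if token.lower() in ABBREVIATIONS:
            continue                         # e.g. "Mr." does not end a sentence
        sentences.append(text[start:m.end()].strip())
        start = m.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

print(split_sentences("Mr. Smith went to the shops in Jones Street. He bought milk."))
# → ['Mr. Smith went to the shops in Jones Street.', 'He bought milk.']
```

Without the abbreviation check, the period in "Mr." would be misread as a sentence boundary.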
=== Topic segmentation ===
Topic analysis consists of two main tasks: topic identification and text segmentation. While the first is a simple classification of a specific text, the latter case implies that a document may contain multiple topics, and the task of computerized text segmentation may be to discover these topics automatically and segment the text accordingly.
Segmenting the text into [[topic (linguistics)|topic]]s or [[discourse]] turns might be useful in some natural language processing tasks: it can improve [[information retrieval]] or [[speech recognition]] significantly (by indexing/recognizing documents more precisely or by giving the specific part of a document corresponding to the query as a result). It is also needed in [[topic detection and tracking]] systems and [[text summarization|text summarizing]] problems.
Many different approaches have been tried:<ref>{{Cite book
 | arxiv = cs/0003083
 | last = Choi | first = Freddy Y. Y.
 | chapter = Advances in ___domain independent linear text segmentation
 | title = Proceedings of the 1st Meeting of the North American Chapter of the Association for Computational Linguistics (ANLP-NAACL-00)
 | year = 2000
 | pages = 26–33
 | chapter-url = http://www.aclweb.org/anthology/A00-2004
 | format = PDF
}}</ref><ref>{{Cite thesis
 | title = Topic Segmentation: Algorithms and Applications
 | publisher = [[University of Pennsylvania]]
 | last = Reynar | first = Jeffrey C.
 | url = https://repository.upenn.edu/handle/20.500.14332/37673
 | degree = PhD
 | id = IRCS-98-21
 | access-date = 2025-03-31
}}</ref> e.g. [[Hidden Markov model|HMM]], [[lexical chains]], passage similarity using word [[co-occurrence]], [[cluster analysis|clustering]], etc.
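The passage-similarity family of approaches can be sketched briefly. A minimal TextTiling-style boundary score in Python (the window size ''k'' and the plain bag-of-words representation are simplifying assumptions; the full algorithm also smooths the scores and places boundaries at depth minima):

```python
from collections import Counter
import math

def cosine(a, b):
    # Cosine similarity between two bags of words (Counter objects).
    num = sum(a[w] * b[w] for w in a)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def boundary_scores(tokens, k=20):
    # Similarity between the k tokens before and after each gap;
    # low scores suggest a topic boundary.
    scores = []
    for gap in range(k, len(tokens) - k):
        left = Counter(tokens[gap - k:gap])
        right = Counter(tokens[gap:gap + k])
        scores.append((gap, cosine(left, right)))
    return scores
```

On a token stream whose vocabulary shifts abruptly, the lowest-scoring gap falls at the vocabulary seam, which is exactly where a topic boundary would be hypothesized.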
It is quite an ambiguous task: even human annotators often disagree about where topic boundaries should be placed, which also makes automatic segmenters difficult to evaluate.
<!-- <math>WindowDiff(ref,hyp) = \frac{1}{N-k} \sum_i |b(ref_i,ref_{i+k})-b(hyp_i,hyp_{i+k})|</math> -->
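A standard way to score a hypothesized segmentation against a reference despite this ambiguity is the WindowDiff metric, which slides a window of width ''k'' over the text and counts positions where the two segmentations disagree on the number of boundaries. A minimal sketch in Python (the 0/1 boundary encoding is an assumption of this sketch):

```python
def window_diff(ref, hyp, k):
    # ref, hyp: equal-length 0/1 lists, where 1 marks a segment
    # boundary after that position.  For each window of width k,
    # count a penalty when the number of boundaries differs; then
    # normalize by the number of windows (N - k).
    n = len(ref)
    disagreements = sum(
        abs(sum(ref[i:i + k]) - sum(hyp[i:i + k])) > 0
        for i in range(n - k)
    )
    return disagreements / (n - k)
```

A score of 0 means the hypothesis matches the reference everywhere; 1 means every window disagrees.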
The process of developing text segmentation tools starts with collecting a large corpus of text in an application ___domain. There are two general approaches:
* Manual analysis of text and writing custom software
* Annotate the sample corpus with boundary information and use [[machine learning]]
Some text segmentation systems take advantage of any markup like HTML, and of known document formats like PDF, to provide additional evidence for sentence and paragraph boundaries.
* [[Lexical analysis]]
* [[Word count]]
* [[Line wrap and word wrap|Line breaking]]
* [[Image segmentation]]
== References ==
{{Reflist}}
{{Natural Language Processing}}
[[Category:Tasks of natural language processing]]