Text segmentation

Word segmentation is the problem of dividing a string of written language into its component words.
 
In English and many other languages using some form of the [[Latin alphabet]], the [[Space (punctuation)|space]] is a good approximation of a [[word divider]] (word [[delimiter]]), although this concept has limits because of the variability with which languages [[emic and etic|emically]] regard [[collocation]]s and [[compound (linguistics)|compounds]]. Many [[English compound#Compound nouns|English compound nouns]] are variably written (for example, ''[[icebox|ice box = ice-box = icebox]]''; ''[[sty|pig sty = pig-sty = pigsty]]'') with a corresponding variation in whether speakers think of them as [[noun phrase]]s or single nouns; there are trends in how norms are set, such as that open compounds often eventually solidify by widespread convention, but variation remains systemic. In contrast, [[German nouns#Compounds|German compound nouns]] show less orthographic variation, with solidification being a stronger norm.
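A minimal sketch of the point above: in space-delimited scripts, splitting on whitespace approximates word segmentation, but variably written compounds such as ''ice box''/''icebox'' show the limits of treating the space as a true word divider (the function name here is illustrative).

```python
def naive_word_segment(text: str) -> list[str]:
    """Split a string into candidate words on whitespace."""
    return text.split()

# The same compound noun yields two tokens in one spelling and one in the other.
print(naive_word_segment("Put the milk in the ice box"))
# ['Put', 'the', 'milk', 'in', 'the', 'ice', 'box']
print(naive_word_segment("Put the milk in the icebox"))
# ['Put', 'the', 'milk', 'in', 'the', 'icebox']
```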
 
However, an equivalent of the space character is not found in all written scripts, and without one word segmentation is a difficult problem. Languages which do not have a trivial word segmentation process include Chinese and Japanese, where [[sentences]] but not words are delimited; [[Thai language|Thai]] and [[Lao language|Lao]], where phrases and sentences but not words are delimited; and [[Vietnamese language|Vietnamese]], where syllables but not words are delimited.
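For such scripts, a common baseline (not described in this article, named here for illustration) is greedy "maximum matching" against a dictionary. This is a hedged sketch; the tiny lexicon is illustrative, not a real dictionary.

```python
def max_match(text: str, dictionary: set[str], max_len: int = 4) -> list[str]:
    """Greedily take the longest dictionary word starting at each position;
    fall back to a single character when nothing matches."""
    words = []
    i = 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary:
                words.append(text[i:j])
                i = j
                break
        else:
            words.append(text[i])  # unknown character becomes its own token
            i += 1
    return words

lexicon = {"北京", "大学", "生", "北京大学"}
print(max_match("北京大学生", lexicon))
# ['北京大学', '生']
```

Greedy matching is fast but can segment incorrectly when a shorter word would lead to a better overall analysis, which is why statistical methods are usually preferred in practice.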
Intent segmentation is the problem of dividing written words into keyphrases (groups of two or more words).
 
In English and other languages, intent segmentation identifies the core intent or desire, which becomes the cornerstone of the keyphrase. A core product or service, idea, action, or thought anchors the keyphrase.
 
"[All things are made of '''atoms''']. [Little '''particles''' that move] [around in perpetual '''motion'''], [attracting each '''other'''] [when they are a little '''distance''' apart], [but '''repelling'''] [upon being '''squeezed'''] [into '''one another''']."
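The bracketed example above marks each keyphrase with square brackets; as a small sketch, a regular expression can recover those marked segments (the function name is illustrative).

```python
import re

def bracketed_segments(text: str) -> list[str]:
    """Return the contents of each [...] group in the annotated text."""
    return re.findall(r"\[([^\]]+)\]", text)

print(bracketed_segments("[All things are made of atoms]. [Little particles that move]"))
# ['All things are made of atoms', 'Little particles that move']
```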
=== Sentence segmentation ===
{{See also|Sentence boundary disambiguation}}
Sentence segmentation is the problem of dividing a string of written language into its component [[sentences]]. In English and some other languages, punctuation, particularly the [[full stop]]/period character, is a reasonable approximation. However, even in English this problem is not trivial, due to the use of the full stop character for abbreviations, which may or may not also terminate a sentence. For example, ''Mr.'' is not its own sentence in ''"Mr. Smith went to the shops in Jones Street."'' When processing plain text, tables of abbreviations that contain periods can help prevent incorrect assignment of sentence boundaries.
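The abbreviation-table idea can be sketched as a simple rule-based splitter: break on sentence-final punctuation, but not after tokens in a known abbreviation list (the list here is illustrative, not exhaustive).

```python
# Illustrative abbreviation table; a real system would use a much larger one.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "prof.", "e.g.", "i.e."}

def split_sentences(text: str) -> list[str]:
    """Split text at ., !, or ? unless the token is a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token.endswith((".", "!", "?")) and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing material without final punctuation
        sentences.append(" ".join(current))
    return sentences

print(split_sentences("Mr. Smith went to the shops in Jones Street. He bought milk."))
# ['Mr. Smith went to the shops in Jones Street.', 'He bought milk.']
```

Even this sketch fails on cases such as an abbreviation that genuinely ends a sentence, which is why the article calls the problem non-trivial.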
 
As with word segmentation, not all written languages contain punctuation characters which are useful for approximating sentence boundaries.
=== Topic segmentation ===
{{main|Topic analysis|Document classification}}
Topic analysis consists of two main tasks: topic identification and text segmentation. While the first is a simple [[machine learning|classification]] of a specific text, the latter implies that a document may contain multiple topics, and the task of computerized text segmentation may be to discover these topics automatically and segment the text accordingly. The topic boundaries may be apparent from section titles and paragraphs. In other cases, one needs to use techniques similar to those used in [[document classification]].
 
Segmenting the text into [[topic (linguistics)|topic]]s or [[discourse]] turns might be useful in some natural language processing tasks: it can improve information retrieval or speech recognition significantly (by indexing/recognizing documents more precisely or by returning the specific part of a document corresponding to the query). It is also needed in [[topic detection]] and tracking systems and [[text summarization|text summarizing]] problems.
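One family of techniques for finding such boundaries, in the spirit of lexical-cohesion methods like TextTiling (named here as context, not taken from this article), scores the word-overlap similarity of adjacent text blocks; a low-similarity valley suggests a topic shift. A hedged sketch:

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def boundary_scores(paragraphs: list[str]) -> list[float]:
    """Similarity between each pair of adjacent paragraphs;
    low values are candidate topic boundaries."""
    bags = [Counter(p.lower().split()) for p in paragraphs]
    return [cosine(bags[i], bags[i + 1]) for i in range(len(bags) - 1)]

print(boundary_scores(["the cat sat", "the cat slept", "stocks fell sharply"]))
```

A real system would also apply stop-word removal, stemming, and smoothing of the score sequence before picking boundaries.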
| url = http://www.aclweb.org/anthology/A00-2004
|format=PDF}}</ref><ref>{{Cite journal
| author = Jeffrey C. Reynar
| author-link = Jeffrey C. Reynar
| title = Topic Segmentation: Algorithms and Applications
| version = IRCS-98-21
| format = PDF
| accessdate = 2007-11-08
}}</ref> e.g. [[Hidden Markov model|HMM]], [[lexical chains]], passage similarity using word [[co-occurrence]], [[cluster analysis|clustering]], [[topic modeling]], etc.
 
It is quite an ambiguous task: people evaluating text segmentation systems often disagree on where the topic boundaries lie. Hence, evaluating text segmentation is itself a challenging problem.
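One standard response to this evaluation problem (not detailed in this article, named here for illustration) is the WindowDiff metric, which slides a window over the text and penalizes positions where the hypothesis and the human reference disagree on the number of boundaries, giving partial credit to near misses. A hedged sketch, with boundaries encoded as 0/1 lists:

```python
def window_diff(reference: list[int], hypothesis: list[int], k: int) -> float:
    """WindowDiff: fraction of width-k windows where the boundary
    counts of reference and hypothesis differ (0.0 = perfect)."""
    n = len(reference)
    errors = sum(
        1
        for i in range(n - k)
        if sum(reference[i:i + k]) != sum(hypothesis[i:i + k])
    )
    return errors / (n - k)

reference = [0, 0, 1, 0, 0, 0, 1, 0]
hypothesis = [0, 0, 0, 1, 0, 0, 1, 0]  # first boundary off by one position
print(window_diff(reference, reference, 2))  # 0.0
print(window_diff(reference, hypothesis, 2))
```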
 
==External links==
 
* [http://wordseg.codeplex.com/ Word Segment] An open source software tool for word segmentation in Chinese.
* [http://www.whitemagicsoftware.com/software/java/wordsplit/ Word Split] An open source software tool designed to split conjoined words into human-readable text.