{{Use dmy dates|date=March 2016}}
{{Refimprove|date=October 2011}}
'''Text segmentation''' is the process of dividing written text into meaningful units, such as words, [[sentences]], or [[topic (linguistics)|topic]]s.
Compare [[speech segmentation]], the process of dividing speech into linguistically meaningful portions.
=== Word segmentation ===
{{See also|Word#Word boundaries}}
Word segmentation is the problem of dividing a string of written language into its component [[word]]s.
In English and many other languages using some form of the [[Latin alphabet]], the [[space (punctuation)|space]] is a good approximation of a word divider.
However, the equivalent to this character is not found in all written scripts, and without it word segmentation is a difficult problem. Languages which do not have a trivial word segmentation process include Chinese and Japanese, where sentences but not words are delimited, and Thai and Lao, where phrases and sentences but not words are delimited.
In some writing systems however, such as the [[Ge'ez script]] used for [[Amharic]] and [[Tigrinya language|Tigrinya]] among other languages, words are explicitly delimited (at least historically) with a non-whitespace symbol.
The [[Unicode Consortium]] has published a Unicode Standard Annex on text segmentation, exploring the issues of segmentation in multiscript texts.
'''Word splitting''' is the process of [[parsing]] [[concatenated]] text (i.e. text that contains no spaces or other word separators) to infer where word breaks exist.
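One simple baseline for word splitting is dictionary-based greedy longest matching: repeatedly take the longest prefix of the remaining text that appears in a word list. The following sketch is illustrative (the dictionary and function name are invented for the example), not a production segmenter.

```python
def segment(text, dictionary, max_word_len=4):
    """Split `text` by repeatedly taking the longest dictionary prefix.

    Falls back to a single character when no dictionary word matches,
    so the whole input is always consumed.
    """
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words

print(segment("thecatsat", {"the", "cat", "sat"}))
# → ['the', 'cat', 'sat']
```

Greedy matching is fast but can commit to a wrong early split; statistical methods (e.g. HMMs over character sequences) are used when ambiguity matters.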
=== Sentence segmentation ===
{{See also|Sentence boundary disambiguation}}
Sentence segmentation is the problem of dividing a string of written language into its component [[sentences]]. In English and some other languages, using punctuation, particularly the [[full stop]]/period character, is a reasonable approximation. However, even in English this problem is not trivial due to the use of the full stop character for abbreviations, which may or may not also terminate a sentence. For example, ''Mr.'' is not its own sentence in "''Mr. Smith went to the shops in Jones Street.''" When processing plain text, tables of abbreviations that contain periods can help prevent incorrect assignment of sentence boundaries.
As with word segmentation, not all written languages contain punctuation characters which are useful for approximating sentence boundaries.
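The abbreviation-table idea above can be sketched as a small rule-based splitter: a period ends a sentence unless the token it terminates is in a known abbreviation list. The abbreviation set here is a tiny illustrative sample, not an exhaustive table.

```python
# Illustrative abbreviation table; real systems use much larger lists.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "prof.", "e.g.", "i.e."}

def split_sentences(text):
    """Naive sentence splitter: '.', '!' or '?' ends a sentence
    unless the token is a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token.endswith((".", "!", "?")) and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without terminal punctuation
        sentences.append(" ".join(current))
    return sentences

print(split_sentences("Mr. Smith went to the shops in Jones Street. He bought milk."))
# → ['Mr. Smith went to the shops in Jones Street.', 'He bought milk.']
```

This handles the ''Mr. Smith'' example from above, but still fails on abbreviations that genuinely end a sentence, which is why statistical sentence boundary disambiguation is used in practice.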
Topic analysis consists of two main tasks: topic identification and text segmentation. While the first is a simple [[machine learning|classification]] of a specific text, the latter case implies that a document may contain multiple topics, and the task of computerized text segmentation may be to discover these topics automatically and segment the text accordingly. The topic boundaries may be apparent from section titles and paragraphs. In other cases, one needs to use techniques similar to those used in [[document classification]].
Segmenting the text into [[topic (linguistics)|topic]]s or [[discourse]] turns might be useful in some natural language processing tasks: it can improve information retrieval or speech recognition significantly (by indexing/recognizing documents more precisely or by giving the specific part of a document corresponding to the query as a result). It is also needed in [[topic detection and tracking]] systems and [[automatic summarization|text summarization]] problems.
Many different approaches have been tried:<ref>{{Cite conference
}}</ref> e.g. [[Hidden Markov model|HMM]], [[lexical chains]], passage similarity using word [[co-occurrence]], [[cluster analysis|clustering]] etc.
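The passage-similarity idea mentioned above (in the spirit of TextTiling) can be sketched as follows: score each gap between adjacent blocks of sentences by the cosine similarity of their word counts; gaps with low similarity are candidate topic boundaries. The block size and the toy sentences are arbitrary example values.

```python
from collections import Counter
import math

def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def gap_scores(sentences, block=2):
    """Lexical similarity across each gap between adjacent sentence blocks.

    Returns one score per gap; deep valleys suggest topic boundaries.
    """
    bags = [Counter(s.lower().split()) for s in sentences]
    scores = []
    for gap in range(1, len(bags)):
        left = sum(bags[max(0, gap - block):gap], Counter())
        right = sum(bags[gap:gap + block], Counter())
        scores.append(cosine(left, right))
    return scores

sents = ["cats purr softly", "cats like milk",
         "stocks fell today", "markets closed lower"]
print(gap_scores(sents, block=2))  # the topic shift after sentence 2 scores 0.0
```

Real systems smooth these scores and pick boundaries at statistically deep valleys rather than at a fixed threshold.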
It is quite an ambiguous task – people evaluating text segmentation systems often differ over the exact topic boundaries.
<!-- <math>\text{WindowDiff}(ref,hyp) = \frac{1}{N-k} \sum_{i=1}^{N-k} |b(ref_i,ref_{i+k}) - b(hyp_i,hyp_{i+k})|</math> -->
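This boundary ambiguity is why window-based metrics such as WindowDiff (Pevzner and Hearst, 2002) are used for evaluation: instead of requiring exact boundary matches, they slide a window of size ''k'' over the text and count windows where the reference and hypothesis disagree on the number of boundaries. A minimal sketch, with segmentations given as 0/1 boundary lists:

```python
def window_diff(ref, hyp, k):
    """WindowDiff: fraction of length-k windows in which ref and hyp
    contain a different number of boundaries. seg[i] == 1 means a
    boundary follows position i."""
    n = len(ref)
    assert len(hyp) == n and 0 < k < n
    disagreements = 0
    for i in range(n - k):
        b_ref = sum(ref[i:i + k])  # boundaries in the ref window
        b_hyp = sum(hyp[i:i + k])  # boundaries in the hyp window
        if b_ref != b_hyp:
            disagreements += 1
    return disagreements / (n - k)

ref = [0, 0, 1, 0, 0, 1, 0, 0]
hyp = [0, 1, 0, 0, 0, 1, 0, 0]  # first boundary misplaced by one position
print(window_diff(ref, hyp, k=2))  # 2 of 6 windows disagree → 0.333...
```

A near-miss boundary is thus penalized less than a completely absent or spurious one, which matches how human judges tend to assess segmentations.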
The process of developing text segmentation tools starts with collecting a large corpus of text in an application ___domain. There are two general approaches:
* Manually analyze the text and write custom software
* Annotate the sample corpus with boundary information and use [[machine learning]]
Some text segmentation systems take advantage of any markup like HTML and known document formats like PDF to provide additional evidence for sentence and paragraph boundaries.
* [http://nlp.stanford.edu/software/segmenter.shtml Stanford Segmenter] An open source software tool for word segmentation in Chinese or morpheme segmentation in Arabic.
* [http://www.phontron.com/kytea KyTea] An open source software tool for word segmentation in Japanese and Chinese.
* [http://chinesenotes.com/ Chinese Notes] A Chinese-English dictionary with word segmentation of Chinese text.
* [http://www.zhihuita.org/service/tokenizer Zhihuita Segmentor] High-precision, high-performance Chinese segmentation freeware.
* [http://www.grantjenks.com/docs/wordsegment/ Python wordsegment module] An open source Python module for English word segmentation.