m robot Adding: bn:টেক্সট খণ্ডায়ন
Text segmentation for topics.
Line 1:
'''Text segmentation''' is the process of dividing written text into [[word]]s or other similar meaningful units, such as [[sentence]]s or [[topic]]s. The term applies to [[human mind|mental]] processes used by humans when reading text, and to artificial processes implemented in [[computers]], which are the subject of [[natural language processing]].
The problem may appear relatively trivial for written languages that have explicit word boundary markers, such as the word spaces of written [[English language|English]] or the distinctive initial, medial and final letter shapes of [[Arabic language|Arabic]]. When such clues are not consistently available, the task often requires more sophisticated techniques, such as statistical decision-making, large dictionaries, and consideration of syntactic and semantic constraints.
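As an illustration of the dictionary-based approach mentioned above, the following is a minimal sketch of greedy longest-match word segmentation for text without word spaces. The function name <code>max_match</code> and the toy dictionary are assumptions made for this example; a practical system would typically combine a much larger dictionary with statistical or syntactic evidence, because a greedy match can choose the wrong split.

```python
def max_match(text, dictionary, max_len=None):
    """Greedy left-to-right longest-match segmentation (illustrative only)."""
    if max_len is None:
        # Longest dictionary entry bounds the candidates tried at each position.
        max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a dictionary word matches.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No dictionary word starts here: emit a single character as a fallback.
            tokens.append(text[i])
            i += 1
    return tokens


# Hypothetical toy dictionary, for demonstration only.
toy_dictionary = {"the", "them", "there", "table", "down"}
print(max_match("thetabledownthere", toy_dictionary))
```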
Line 6:
When processing plain text, tables of abbreviations that contain periods (for example, ''Mr.'') can help prevent incorrect assignment of sentence boundaries. Some text segmentation systems take advantage of markup such as HTML and of known document formats such as PDF to provide additional evidence for sentence and paragraph boundaries.
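The use of an abbreviation table can be sketched as follows. This is a simplified example, not a complete sentence splitter: the abbreviation set, the function name <code>split_sentences</code>, and the rule that a boundary is a period, exclamation mark or question mark followed by whitespace are all assumptions made for illustration.

```python
import re

# Illustrative abbreviation table; a real system would use a much larger one.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "prof.", "e.g.", "i.e."}


def split_sentences(text, abbreviations=ABBREVIATIONS):
    """Split text at ., ! or ? followed by whitespace, skipping known abbreviations."""
    sentences = []
    start = 0
    for match in re.finditer(r"[.!?]\s+", text):
        end = match.start() + 1  # include the punctuation mark itself
        # Last whitespace-delimited token before the candidate boundary.
        last_token = text[start:end].split()[-1].lower()
        if last_token in abbreviations:
            continue  # the period belongs to an abbreviation, not a sentence end
        sentences.append(text[start:end].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


print(split_sentences("Mr. Smith arrived. He sat down."))
```

Without the table, the period in ''Mr.'' would be treated as a sentence boundary. The sketch still fails when an abbreviation genuinely ends a sentence, which is one reason additional evidence is useful.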
A document may contain multiple topics, and the task of computerized text segmentation may be to discover these topics automatically and segment the text accordingly.
The topic boundaries may be apparent from section titles and paragraphs.
In other cases one needs to use techniques similar to those used in [[document classification]].
Many different approaches have been tried.<ref>{{Cite conference
| author = Freddy Y. Y. Choi
| title = Advances in domain independent linear text segmentation
| booktitle = Proceedings of the 1st Meeting of the North American Chapter of the Association for Computational Linguistics (ANLP-NAACL-00)
| year = 2000
| pages = 26–33
| url = http://acl.ldc.upenn.edu/A/A00/A00-2004.pdf
}}</ref>
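One simple family of approaches compares the vocabulary of adjacent stretches of text and places a topic boundary where the overlap drops. The sketch below illustrates that general idea only and is not the method of the cited paper; the function names, the bag-of-words representation and the threshold of 0.1 are assumptions chosen for the example.

```python
import math
import re
from collections import Counter


def bag_of_words(sentence):
    """Count lowercase alphabetic tokens in a sentence."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def topic_boundaries(sentences, threshold=0.1):
    """Return indices i such that a new topic is assumed to start at sentences[i]."""
    vectors = [bag_of_words(s) for s in sentences]
    return [
        i + 1
        for i in range(len(vectors) - 1)
        if cosine(vectors[i], vectors[i + 1]) < threshold  # low overlap: assume a topic shift
    ]


example = [
    "Cats chase mice.",
    "Mice fear cats.",
    "Stocks rose today.",
    "Stocks fell today.",
]
print(topic_boundaries(example))
```

Comparing single sentences is fragile on real text; practical systems usually compare larger blocks and smooth the similarity scores before choosing boundaries.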
Effective [[natural language processing]] systems and text segmentation tools usually operate on text from specific domains and sources. For example, processing text from medical records is a very different problem from processing news articles or real estate advertisements.
Line 18 ⟶ 30:
* [[Hyphenation]]
* [[Word count]]
== References ==
{{Reflist}}
[[Category:Natural language processing]]