===Challenges in natural language processing===
; Word boundary ambiguity: Native [[English language|English]] speakers may at first consider tokenization to be a straightforward task, but this is not the case with designing a [[multilingual]] indexer. In digital form, the texts of languages such as [[Chinese language|Chinese]] and [[Japanese language|Japanese]] do not delineate words with whitespace, so language-specific logic is required to identify word boundaries, as the sketch below illustrates.
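A minimal sketch of the difficulty, using Python's built-in whitespace split; both sample strings are illustrative assumptions, not taken from the article:

<syntaxhighlight lang="python">
# A whitespace tokenizer recovers English words, but an unsegmented
# script such as Chinese comes back as a single token, so a
# language-specific segmenter is needed to find word boundaries.
english = "search engine indexing"
chinese = "搜索引擎索引"  # "search engine indexing", written without spaces

print(english.split())  # ['search', 'engine', 'indexing']
print(chinese.split())  # ['搜索引擎索引'] -- one unsegmented token
</syntaxhighlight>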
; Language ambiguity: To assist with properly ranking matching documents, many search engines collect additional information about each word, such as its [[language]] or [[lexical category]] ([[part of speech]]). These techniques are language-dependent, as syntax varies among languages. A document does not always clearly identify its own language, or may identify it inaccurately, so when tokenizing a document, some search engines attempt to identify its language automatically; a simplified heuristic is sketched below.
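As a rough illustration of automatic language detection during tokenization, the following sketch guesses a string's dominant writing system from Unicode character names. The function name <code>guess_script</code> and the rule set are assumptions for illustration only; production search engines typically rely on statistical language identification, such as character [[n-gram]] models, rather than rules like these.

<syntaxhighlight lang="python">
import unicodedata

def guess_script(text):
    """Guess the dominant script of a string from Unicode character names.

    A crude heuristic sketch, not a production language identifier.
    """
    counts = {}
    for ch in text:
        if ch.isspace():
            continue
        name = unicodedata.name(ch, "")
        if name.startswith("CJK"):
            script = "CJK"
        elif name.startswith("ARABIC"):
            script = "Arabic"
        elif name.startswith("LATIN"):
            script = "Latin"
        else:
            script = "Other"
        counts[script] = counts.get(script, 0) + 1
    # Return the most frequent script, or "Unknown" for empty input.
    return max(counts, key=counts.get) if counts else "Unknown"

print(guess_script("search engine indexing"))  # Latin
print(guess_script("搜索引擎索引"))              # CJK
</syntaxhighlight>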