Search engine indexing: Difference between revisions

===Challenges in natural language processing===
; Word boundary ambiguity: Native [[English language|English]] speakers may at first consider tokenization to be a straightforward task, but this is not the case with designing a [[multilingual]] indexer. In digital form, the texts of other languages, such as [[Chinese language|Chinese]] or [[Japanese language|Japanese]], represent a greater challenge, as words are not clearly delineated by [[Whitespace (computer science)|whitespace]]. The goal during tokenization is to identify words for which users will search. Language-specific logic is employed to properly identify word boundaries, which is often the rationale for designing a parser for each supported language (or for groups of languages with similar boundary markers and syntax).
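The contrast above can be sketched in a few lines. Simple whitespace splitting works for English, but a language without space-delimited words needs a different strategy; one classic (and simplified) approach is dictionary-based greedy longest-match segmentation. The toy lexicon and both functions below are illustrative assumptions, not the method of any particular search engine.

```python
def whitespace_tokenize(text):
    """Split on whitespace -- adequate for English, useless for Chinese."""
    return text.split()


def longest_match_segment(text, lexicon, max_len=4):
    """Greedy forward maximum-matching segmentation against a word list.

    At each position, try the longest dictionary word first; fall back
    to a single character when nothing matches.
    """
    tokens = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += length
                break
    return tokens


# Toy lexicon: "search", "engine", "index" in Chinese.
lexicon = {"搜索", "引擎", "索引"}
print(whitespace_tokenize("search engine indexing"))  # ['search', 'engine', 'indexing']
print(longest_match_segment("搜索引擎索引", lexicon))   # ['搜索', '引擎', '索引']
```

Real indexers replace the toy lexicon with large dictionaries or statistical segmentation models, which is why a separate parser per language (or language family) is common.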
 
; Language ambiguity: To assist with properly ranking matching documents, many search engines collect additional information about each word, such as its [[language]] or [[lexical category]] ([[part of speech]]). These techniques are language-dependent, as syntax varies among languages. Documents do not always clearly identify their language or represent it accurately. In tokenizing a document, some search engines attempt to identify its language automatically.