Search engine indexing: Difference between revisions

 
===Tokenization===
Unlike [[literacy|literate]] humans, computers do not understand the structure of a natural language document and cannot automatically recognize words and sentences. To a computer, a document is only a sequence of bytes. Computers do not 'know' that a space character separates words in a document. Instead, humans must program the computer to identify what constitutes an individual or distinct word, referred to as a token. Such a program is commonly called a [[tokenizer]], [[parser]], or [[Lexical analysis|lexer]]. Many search engines, as well as other natural language processing software, incorporate [[Comparison of parser generators|specialized programs]] for parsing, such as [[YACC]] or [[Lex programming tool|Lex]].
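A minimal sketch of the idea, assuming a simple whitespace-and-punctuation model of English text (the pattern and function names are illustrative, not part of any particular search engine):

```python
import re

# Illustrative tokenizer: a word is a maximal run of "word" characters,
# and each punctuation mark is its own token. Real search-engine
# tokenizers handle many more cases (hyphenation, apostrophes,
# scripts without whitespace, etc.).
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")

def tokenize(text):
    """Return the word and punctuation tokens found in *text*."""
    return TOKEN_PATTERN.findall(text)

print(tokenize("A space separates words."))
# → ['A', 'space', 'separates', 'words', '.']
```

Even this toy example shows why tokenization must be programmed explicitly: the rule "split on whitespace, keep punctuation separate" is a human decision encoded in the pattern, not something the machine infers from the bytes.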
 
During tokenization, the parser identifies sequences of characters that represent words and other elements, such as punctuation, which are represented by numeric codes, some of which are non-printing control characters. The parser can also identify [[Entity extraction|entities]] such as [[email]] addresses, phone numbers, and [[Uniform Resource Locator|URL]]s. When identifying each token, several characteristics may be stored, such as the token's case (upper, lower, mixed, proper), language or encoding, lexical category (part of speech, such as 'noun' or 'verb'), position, sentence number, sentence position, length, and line number.
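The per-token characteristics mentioned above can be sketched as a small record attached to each token. This is an illustrative data layout, assuming only a few of the listed attributes (case, character position, and length); the class and function names are hypothetical:

```python
import re
from dataclasses import dataclass

@dataclass
class Token:
    text: str
    position: int  # character offset of the token in the document
    length: int    # number of characters in the token
    case: str      # 'upper', 'lower', 'mixed', or 'proper'

def classify_case(word):
    """Classify a word's letter case, as a tokenizer might record it."""
    if word.isupper():
        return "upper"
    if word.islower():
        return "lower"
    if word[:1].isupper() and word[1:].islower():
        return "proper"
    return "mixed"

def analyze(text):
    """Tokenize *text* and attach position, length, and case to each token."""
    return [
        Token(m.group(), m.start(), len(m.group()), classify_case(m.group()))
        for m in re.finditer(r"\w+", text)
    ]

for token in analyze("NASA sent Apollo"):
    print(token)
```

Storing such attributes at indexing time is what later lets the engine rank or filter matches by properties like proper-noun case or document position without re-reading the original text.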