Search engine indexing: Difference between revisions

 
===Tokenization===
Unlike [[literacy|literate]] humans, computers do not understand the structure of a natural language document and cannot automatically recognize words and sentences. To a computer, a document is only a sequence of bytes. Computers do not 'know' that a space character separates words in a document. Instead, humans must program the computer to identify what constitutes an individual or distinct word, referred to as a token. Such a program is commonly called a [[tokenizer]], [[parser]], or [[Lexical analysis|lexer]]. Many search engines, as well as other natural language processing software, incorporate [[Comparison of parser generators|specialized programs]] for parsing, such as [[YACC]] or [[Lex programming tool|Lex]].
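A minimal sketch of the idea, assuming a simple whitespace-and-punctuation model of English text (the pattern and function names are illustrative, not part of any particular search engine):

```python
import re

# Illustrative tokenizer: a word is a maximal run of "word" characters,
# and each punctuation mark is its own token. Real search-engine
# tokenizers handle many more cases (hyphenation, apostrophes,
# scripts without whitespace, etc.).
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")

def tokenize(text):
    """Return the word and punctuation tokens found in *text*."""
    return TOKEN_PATTERN.findall(text)

print(tokenize("A space separates words."))
# → ['A', 'space', 'separates', 'words', '.']
```

Even this toy example shows why tokenization must be programmed explicitly: the rule "split on whitespace, keep punctuation separate" is a human decision encoded in the pattern, not something the machine infers from the bytes.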
 
During tokenization, the parser identifies sequences of characters that represent words and other elements, such as punctuation, which are represented by numeric codes, some of which are non-printing control characters. The parser can also identify [[Entity extraction|entities]] such as [[email]] addresses, phone numbers, and [[Uniform Resource Locator|URL]]s. When identifying each token, several characteristics may be stored, such as the token's case (upper, lower, mixed, proper), language or encoding, lexical category (part of speech, such as 'noun' or 'verb'), position, sentence number, sentence position, length, and line number.
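The per-token characteristics mentioned above can be sketched as a small record attached to each token. This is an illustrative data layout, assuming only a few of the listed attributes (case, character position, and length); the class and function names are hypothetical:

```python
import re
from dataclasses import dataclass

@dataclass
class Token:
    text: str
    position: int  # character offset of the token in the document
    length: int    # number of characters in the token
    case: str      # 'upper', 'lower', 'mixed', or 'proper'

def classify_case(word):
    """Classify a word's letter case, as a tokenizer might record it."""
    if word.isupper():
        return "upper"
    if word.islower():
        return "lower"
    if word[:1].isupper() and word[1:].islower():
        return "proper"
    return "mixed"

def analyze(text):
    """Tokenize *text* and attach position, length, and case to each token."""
    return [
        Token(m.group(), m.start(), len(m.group()), classify_case(m.group()))
        for m in re.finditer(r"\w+", text)
    ]

for token in analyze("NASA sent Apollo"):
    print(token)
```

Storing such attributes at indexing time is what later lets the engine rank or filter matches by properties like proper-noun case or document position without re-reading the original text.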