Revision as of 07:14, 1 March 2024 by 207.134.161.195 (→Tokenization; tag: Reverted)
Revision as of 07:16, 1 March 2024 by 9yz: Reverted 1 edit by 207.134.161.195: Irrelevant to the section (and llms don't really understand things anyways) (tags: Twinkle, Undo)
Line 123:

===Tokenization===
Unlike [[literacy|literate]] humans, computers do not understand the structure of a natural language ~~(Aside from LLMs)~~ document and cannot automatically recognize words and sentences. To a computer, a document is only a sequence of bytes. Computers do not 'know' that a space character separates words in a document. Instead, humans must program the computer to identify what constitutes an individual or distinct word, referred to as a token. Such a program is commonly called a [[tokenizer]], [[parser]], or [[Lexical analysis|lexer]].

Many search engines, as well as other natural language processing software, incorporate [[Comparison of parser generators|specialized programs]] for parsing, such as [[YACC]] or [[Lex programming tool|Lex]].

During tokenization, the parser identifies sequences of characters that represent words and other elements, such as punctuation. These characters are represented by numeric codes, some of which are non-printing control characters. The parser can also identify [[Entity extraction|entities]] such as [[email]] addresses, phone numbers, and [[Uniform Resource Locator|URL]]s. As each token is identified, several characteristics may be stored with it, such as the token's case (upper, lower, mixed, proper), language or encoding, lexical category (part of speech, like 'noun' or 'verb'), position, sentence number, sentence position, length, and line number.
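
The token characteristics listed above can be illustrated with a minimal sketch in Python. It is not the method of any particular search engine: the regular expressions, the <code>Token</code> record, and the helper names <code>classify_case</code> and <code>tokenize</code> are all hypothetical and chosen for illustration. The sketch recognizes only words, punctuation, email addresses, and URLs, and stores each token's case, character position, length, and line number. Phone numbers, language detection, and part-of-speech tagging are left out.

```python
import re
from dataclasses import dataclass

# Hypothetical, deliberately simplified patterns. Alternatives are tried in
# order, so URLs and email addresses are recognized before plain words.
TOKEN_PATTERN = re.compile(
    r"(?P<url>https?://[^\s<>]*[^\s<>.,;:!?)\]])"
    r"|(?P<email>[\w.+-]+@[\w-]+(?:\.[\w-]+)+)"
    r"|(?P<word>\w+(?:'\w+)?)"
    r"|(?P<punct>[^\w\s])"
)


@dataclass
class Token:
    text: str
    kind: str         # "url", "email", "word", or "punct"
    case: str         # "upper", "lower", "proper", "mixed", or "none"
    position: int     # character offset from the start of the document
    length: int
    line_number: int  # 1-based


def classify_case(text):
    """Classify letter case; tokens without letters are reported as 'none'."""
    if not any(ch.isalpha() for ch in text):
        return "none"
    if text.isupper():
        return "upper"
    if text.islower():
        return "lower"
    if text[0].isupper() and text[1:].islower():
        return "proper"
    return "mixed"


def tokenize(text):
    """Split text into tokens and record a few characteristics of each."""
    tokens = []
    line_number = 1
    scanned_to = 0
    for match in TOKEN_PATTERN.finditer(text):
        start = match.start()
        # Count the newlines skipped since the previous token.
        line_number += text.count("\n", scanned_to, start)
        scanned_to = start
        value = match.group()
        tokens.append(Token(
            text=value,
            kind=match.lastgroup,
            case=classify_case(value),
            position=start,
            length=len(value),
            line_number=line_number,
        ))
    return tokens


if __name__ == "__main__":
    sample = "Mail bob@example.com, or visit\nhttps://example.org/page."
    for token in tokenize(sample):
        print(token)
```

Because the space character is never matched by any pattern, it acts only as a separator here. This mirrors the point made above that a program must be told explicitly what separates one word from the next.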

Search engine indexing: Difference between revisions