Tag: Reverted
Reverted 1 edit by 207.134.161.195 (talk): Irrelevant to the section (and llms don't really *understand* things anyways)
Line 123:
===Tokenization===
Unlike [[literacy|literate]] humans, computers do not understand the structure of a natural language.
During tokenization, the parser identifies sequences of characters that represent words and other elements, such as punctuation. Each character is stored as a numeric code, and some of these codes are non-printing control characters. The parser can also identify [[Entity extraction|entities]] such as [[email]] addresses, phone numbers, and [[Uniform Resource Locator|URL]]s. As each token is identified, several of its characteristics may be stored, such as its case (upper, lower, mixed, proper), language or encoding, lexical category (part of speech, like 'noun' or 'verb'), position, sentence number, sentence position, length, and line number.
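The process described above can be illustrated with a minimal Python sketch. It is not taken from any particular indexer: the names <code>Token</code>, <code>TOKEN_PATTERN</code>, <code>classify_case</code>, and <code>tokenize</code> are hypothetical, the regular expressions for URLs and email addresses are deliberately simplified, and a sentence is assumed to end at any <code>.</code>, <code>!</code>, or <code>?</code>. Lexical category and language detection are omitted because they require more than pattern matching.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Simplified, illustrative patterns (assumptions, not a complete grammar).
# Alternatives are tried in order, so entities win over plain words.
TOKEN_PATTERN = re.compile(
    r"""
    (?P<url>https?://[^\s<>"]+[^\s<>".,;:!?)])
  | (?P<email>[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+)
  | (?P<word>[^\W\d_]+(?:'[^\W\d_]+)*)
  | (?P<number>\d+(?:\.\d+)?)
  | (?P<punct>[^\w\s])
    """,
    re.VERBOSE,
)


@dataclass
class Token:
    text: str
    kind: str            # "url", "email", "word", "number", or "punct"
    case: Optional[str]  # only set for words
    position: int        # character offset in the source text
    length: int
    line: int            # 1-based line number
    sentence: int        # 1-based sentence number


def classify_case(word: str) -> str:
    """Label a word as upper, lower, proper, or mixed case."""
    if word.isupper():
        return "upper"
    if word.islower():
        return "lower"
    if word[0].isupper() and word[1:].islower():
        return "proper"
    return "mixed"


def tokenize(text: str) -> List[Token]:
    """Scan the text once and record each token with its characteristics."""
    tokens = []
    sentence = 1
    for match in TOKEN_PATTERN.finditer(text):
        kind = match.lastgroup
        value = match.group()
        tokens.append(
            Token(
                text=value,
                kind=kind,
                case=classify_case(value) if kind == "word" else None,
                position=match.start(),
                length=len(value),
                line=text.count("\n", 0, match.start()) + 1,
                sentence=sentence,
            )
        )
        # Naive assumption: terminal punctuation always ends a sentence.
        if kind == "punct" and value in ".!?":
            sentence += 1
    return tokens
```

Calling <code>tokenize("Email Jane at jane.doe@example.com.")</code> yields three word tokens, one email token, and one punctuation token. A production tokenizer would need to handle abbreviations, other scripts, and phone numbers, which this sketch does not attempt.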