Large language model

Using a modification of [[byte pair encoding|byte-pair encoding]], in the first step all unique characters (including blanks and [[punctuation mark]]s) are treated as an initial set of [[n-gram|''n''-grams]] (i.e. an initial set of uni-grams). The most frequent pair of adjacent characters is then merged into a bi-gram, and all instances of the pair are replaced by it. This merging of the most frequently co-occurring pair of adjacent (previously merged) ''n''-grams into ever longer ''n''-grams is repeated until a vocabulary of prescribed size is obtained (in the case of [[GPT-3]], the size is 50257).<ref name="xbiWb">{{Cite web |title=OpenAI API |url=https://platform.openai.com/ |archive-url=https://web.archive.org/web/20230423211308/https://platform.openai.com/tokenizer |archive-date=April 23, 2023 |access-date=2023-04-30 |website=platform.openai.com |language=en}}</ref> The token vocabulary consists of [[integers]] ranging from zero up to the size of the vocabulary. New words can always be interpreted as combinations of the tokens and the initial-set uni-grams.<ref name="2022Book_">{{cite book |last1=Paaß |first1=Gerhard |chapter-url= https://link.springer.com/chapter/10.1007/978-3-031-23190-2_2 |title=Foundation Models for Natural Language Processing |last2=Giesselbach |first2=Sven |chapter=Pre-trained Language Models |series=Artificial Intelligence: Foundations, Theory, and Algorithms |date= 2022 |pages=19–78 |doi=10.1007/978-3-031-23190-2_2 |isbn=9783031231902 |access-date=3 August 2023}}</ref>
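
The merging procedure can be illustrated with a short sketch. The following Python code is a simplified, word-level illustration of byte-pair-encoding training (the function name <code>train_bpe</code> and the word-level pre-segmentation are illustrative assumptions; production tokenizers such as GPT-3's operate on bytes and include many additional details):

<syntaxhighlight lang="python">
from collections import Counter

def train_bpe(corpus: list[str], vocab_size: int) -> list[tuple[str, str]]:
    """Learn BPE merge rules from a list of words (simplified sketch).

    Each word starts as a sequence of single characters (uni-grams); the most
    frequent pair of adjacent symbols is repeatedly merged until the
    vocabulary reaches the requested size.
    """
    # Represent each word as a tuple of symbols, weighted by its frequency.
    words = Counter(tuple(word) for word in corpus)
    vocab = {ch for word in words for ch in word}   # initial uni-grams
    merges: list[tuple[str, str]] = []

    while len(vocab) < vocab_size:
        # Count adjacent symbol pairs across all words.
        pairs: Counter = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)            # most frequent pair
        merges.append(best)
        vocab.add(best[0] + best[1])

        # Replace every occurrence of the chosen pair with the merged symbol.
        new_words: Counter = Counter()
        for word, freq in words.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_words[tuple(merged)] += freq
        words = new_words
    return merges
</syntaxhighlight>

Applying the learned merges in the same order to new text reproduces the tokenization, which is why previously unseen words can still be expressed as combinations of learned merges and the initial uni-grams.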
 
A token vocabulary based on frequencies extracted from mainly English corpora uses as few tokens as possible for an average English word. However, an average word in another language encoded by such an English-optimized tokenizer is split into a suboptimal number of tokens. The GPT-2 tokenizer can use up to 15 times more tokens per word for some languages, for example the [[Shan language]] from [[Myanmar]]. Even more widespread languages such as Portuguese and German have "a premium of 50%" compared to English.<ref>{{cite arXiv |eprint=2305.15425 |last1=Petrov |first1=Aleksandar |author2=Emanuele La Malfa |last3=Torr |first3=Philip H. S. |last4=Bibi |first4=Adel |title=Language Model Tokenizers Introduce Unfairness Between Languages |date=2023 |class=cs.CL }}</ref>
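
The discrepancy can be measured directly by counting tokens per word for parallel texts. The sketch below assumes the open-source <code>tiktoken</code> library, which provides the GPT-2 encoding; the example sentences are illustrative and not taken from the cited study:

<syntaxhighlight lang="python">
import tiktoken  # OpenAI's tokenizer library (assumed to be installed)

enc = tiktoken.get_encoding("gpt2")  # the GPT-2 byte-pair-encoding vocabulary

def tokens_per_word(text: str) -> float:
    """Average number of GPT-2 tokens used per whitespace-separated word."""
    words = text.split()
    return len(enc.encode(text)) / len(words)

# Comparing the same sentence in different languages shows how many more
# tokens a non-English text can require (exact ratios depend on the text).
print(tokens_per_word("The cat sat on the mat."))        # English
print(tokens_per_word("Die Katze saß auf der Matte."))   # German translation
</syntaxhighlight>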
 
For example, the text <small><code>tokenizer: texts -> series of numerical "tokens"</code></small> may be split into:
===Dataset cleaning===
{{Main|Data cleansing}}
In the context of training LLMs, datasets are typically cleaned by removing toxic passages, discarding low-quality data, and de-duplicating.<ref name="aYNg4">{{Cite arXiv |eprint=2104.08758 |class=cs.CL |first1=Jesse |last1=Dodge |first2=Maarten |last2=Sap |title=Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus |last3=Marasović |first3=Ana |last4=Agnew |first4=William |last5=Ilharco |first5=Gabriel |last6=Groeneveld |first6=Dirk |last7=Mitchell |first7=Margaret |last8=Gardner |first8=Matt |year=2021}}</ref> Cleaned datasets can increase training efficiency and lead to improved downstream performance.<ref>{{cite journal |last1=Lee |first1=Katherine |last2=Ippolito |first2=Daphne |last3=Nystrom |first3=Andrew |last4=Zhang |first4=Chiyuan |last5=Eck |first5=Douglas |last6=Callison-Burch |first6=Chris |last7=Carlini |first7=Nicholas |date=May 2022 |title=Deduplicating Training Data Makes Language Models Better |url=https://aclanthology.org/2022.acl-long.577.pdf |journal=Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics |volume=1: Long Papers |pages=8424–8445 |doi=10.18653/v1/2022.acl-long.577}}</ref><ref>{{Citation |last1=Li |first1=Yuanzhi |title=Textbooks Are All You Need II: phi-1.5 technical report |date=2023-09-11 |url=http://arxiv.org/abs/2309.05463 |access-date=2024-01-20 |arxiv=2309.05463 |last2=Bubeck |first2=Sébastien |last3=Eldan |first3=Ronen |last4=Del Giorno |first4=Allie |last5=Gunasekar |first5=Suriya |last6=Lee |first6=Yin Tat}}</ref>
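
De-duplication is often approximated by hashing a normalized form of each document and discarding repeats. The following Python sketch is a simplified illustration of exact de-duplication under that assumption, not the methods of the works cited above:

<syntaxhighlight lang="python">
import hashlib
import re

def normalize(doc: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return re.sub(r"\s+", " ", doc.lower()).strip()

def deduplicate(docs: list[str]) -> list[str]:
    """Keep only the first occurrence of each (normalized) document."""
    seen: set[str] = set()
    kept = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

corpus = ["The quick brown fox.", "the  quick brown fox.", "A different document."]
print(deduplicate(corpus))  # the near-duplicate second entry is removed
</syntaxhighlight>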
 
With the increasing proportion of LLM-generated content on the web, data cleaning in the future may include filtering out such content. LLM-generated content can pose a problem if the content is similar to human text (making filtering difficult) but of lower quality (degrading performance of models trained on it).<ref name="qbFw1">{{Cite arXiv |eprint=2005.14165 |class=cs.CL |first1=Tom B. |last1=Brown |first2=Benjamin |last2=Mann |title=Language Models are Few-Shot Learners |last3=Ryder |first3=Nick |last4=Subbiah |first4=Melanie |last5=Kaplan |first5=Jared |last6=Dhariwal |first6=Prafulla |last7=Neelakantan |first7=Arvind |last8=Shyam |first8=Pranav |last9=Sastry |first9=Girish |last10=Askell |first10=Amanda |last11=Agarwal |first11=Sandhini |last12=Herbert-Voss |first12=Ariel |last13=Krueger |first13=Gretchen |last14=Henighan |first14=Tom |last15=Child |first15=Rewon |last16=Ramesh |first16=Aditya |last17=Ziegler |first17=Daniel M. |last18=Wu |first18=Jeffrey |last19=Winter |first19=Clemens |last20=Hesse |first20=Christopher |last21=Chen |first21=Mark |last22=Sigler |first22=Eric |last23=Litwin |first23=Mateusz |last24=Gray |first24=Scott |last25=Chess |first25=Benjamin |last26=Clark |first26=Jack |last27=Berner |first27=Christopher |last28=McCandlish |first28=Sam |last29=Radford |first29=Alec |last30=Sutskever |first30=Ilya |year=2020 |display-authors=1}}</ref>