To determine which tokens are relevant to each other within the scope of the context window, the attention mechanism calculates "soft" weights for each token, more precisely for its embedding, by using multiple attention heads, each with its own notion of "relevance" for calculating its own soft weights. For example, the small [[GPT-2]] model (117M parameters) has twelve attention heads and a context window of only 1k tokens.<ref name="Jay_Allamar_GPT2">{{Cite web | last=Allamar | first=Jay | title=The Illustrated GPT-2 (Visualizing Transformer Language Models) |url=https://jalammar.github.io/illustrated-gpt2/ |access-date=2023-08-01 |language=en}}</ref> The medium version has 345M parameters and contains 24 layers, each with 12 attention heads. For training with gradient descent, a batch size of 512 was used.<ref name="2022Book_"/>
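The per-head "soft" weights described above can be sketched as scaled dot-product attention. This is a minimal illustration, not GPT-2's actual trained weights: the query/key projection matrices here are random stand-ins for learned parameters, and the dimensions are toy-sized.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: subtract the row max before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_soft_weights(embeddings, n_heads, rng):
    """Compute one (seq_len x seq_len) matrix of soft weights per head.

    embeddings: (seq_len, d_model) token embeddings.
    Each head uses its own projections, so each head ends up with its
    own notion of which tokens are "relevant" to which.
    """
    seq_len, d_model = embeddings.shape
    d_head = d_model // n_heads
    per_head = []
    for _ in range(n_heads):
        Wq = rng.standard_normal((d_model, d_head))  # stand-in for learned query projection
        Wk = rng.standard_normal((d_model, d_head))  # stand-in for learned key projection
        q = embeddings @ Wq
        k = embeddings @ Wk
        scores = q @ k.T / np.sqrt(d_head)           # scaled dot-product scores
        per_head.append(softmax(scores, axis=-1))    # rows are soft weights summing to 1
    return np.stack(per_head)                        # (n_heads, seq_len, seq_len)

rng = np.random.default_rng(0)
tokens = rng.standard_normal((4, 24))               # 4 tokens, toy 24-dim embeddings
weights = multi_head_soft_weights(tokens, n_heads=12, rng=rng)
```

Each of the 12 heads produces its own 4x4 matrix of attention weights; row ''i'' tells how strongly token ''i'' attends to every other token in the window.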
The largest models, such as Google's [[Gemini (language model)|Gemini 1.5]], presented in February 2024, can have a context window of up to 1 million tokens (a context window of 10 million was also "successfully tested").
The length of a conversation that the model can take into account when generating its next answer is likewise limited by the size of the context window. If a conversation, for example with [[ChatGPT]], is longer than the context window, only the parts inside the window are taken into account when generating the next answer, unless the model applies some algorithm to summarize the parts of the conversation that lie too far back.
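The simplest strategy for a conversation that outgrows the window, keeping only the most recent tokens, can be sketched as below. The function name and the token-list representation are illustrative assumptions; real systems may instead summarize or otherwise compress the distant parts.

```python
def fit_to_context(conversation_tokens, context_window):
    """Keep only the most recent tokens that fit in the context window.

    conversation_tokens: the full conversation as a flat list of token IDs.
    context_window: maximum number of tokens the model can attend to.
    Older tokens beyond the window are simply dropped (truncation strategy).
    """
    if len(conversation_tokens) <= context_window:
        return conversation_tokens
    return conversation_tokens[-context_window:]

# A 10-token conversation squeezed into a 4-token window keeps the last 4 tokens.
recent = fit_to_context(list(range(10)), context_window=4)
```

Anything truncated this way is invisible to the model, which is why earlier parts of a long chat can appear "forgotten".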