1.58-bit large language model

A '''1.58-bit Large Language Model''' ('''1.58-bit LLM''', also '''ternary LLM''') is a version of a [[Transformer (deep learning architecture)|transformer]] [[large language model]] whose weights take only three values: -1, 0, and +1. In principle, this restriction allows the model to replace costly multiplications with additions and reduces the memory needed to store the weights. Since the end-task performance and [[Perplexity (LLM)|perplexity]] of 1.58-bit LLMs, at least for smaller model sizes (up to 3-4B parameters), are close to those of their "full-precision" (16-bit [[FP16]] or [[BF16]]) counterparts, this design allows reaching the same [[artificial intelligence]] goals with much lower hardware requirements, latency, and training effort.{{sfn|Ma|Wang|Ma|Wang|2024|p=1}}{{sfn|Friha|Amine Ferrag|Kantarci|Cakmak|2024|p=5822}}{{sfn|Hutson|2024}}
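As an illustration of why ternary weights remove the need for multiplications, a matrix–vector product with weights restricted to {-1, 0, +1} reduces to adding and subtracting activations. The following NumPy sketch is illustrative only and is not taken from the cited papers:

<syntaxhighlight lang="python">
import numpy as np

def ternary_matvec(w_ternary, x):
    """Apply a ternary weight matrix (entries in {-1, 0, +1}) to a vector x.

    Because every weight is -1, 0 or +1, each output element is simply a sum
    of (possibly negated) activations, so no multiplications are required.
    """
    out = np.zeros(w_ternary.shape[0], dtype=x.dtype)
    for i, row in enumerate(w_ternary):
        # add activations where the weight is +1, subtract where it is -1
        out[i] = x[row == 1].sum() - x[row == -1].sum()
    return out

# Toy example: a 3x4 ternary weight matrix applied to a length-4 activation vector
w = np.array([[ 1, 0, -1,  1],
              [ 0, 1,  1, -1],
              [-1, 0,  0,  1]])
x = np.array([0.5, -1.2, 2.0, 0.3])
print(ternary_matvec(w, x))  # identical to w @ x, computed without multiplications
</syntaxhighlight>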
 
The name comes from the fact that a single [[Ternary numeral system|trit]], the [[ternary arithmetic]] equivalent of a bit that can take the values {-1, 0, +1}, carries <math>\log_2 3 \approx 1.58</math> [[bits of information]]. 1.58-bit LLMs are also called '''1-bit LLMs'''{{sfn|Ma|Wang|Ma|Wang|2024|p=1}}{{sfn|Morales|2025}} (true 1-bit models, with binary weights, also exist).
 
== BitNet ==
In 2024, Ma et al., researchers at [[Microsoft]], declared that their 1.58-bit model '''''BitNet''' b1.58'' is comparable in performance to the 16-bit [[Llama 2]] and opens the era of 1-bit LLMs.{{sfn|Huyen|2024|p=330}} The BitNet creators did not use post-training quantization of weights; instead, they relied on a new ''BitLinear'' transform that replaces the ''nn.Linear'' layer of the traditional transformer design.{{sfn|Wang|Ma|Dong|Huang|2023|p=1}}
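The following PyTorch sketch illustrates, in simplified form, how a BitLinear-style layer can ternarize its weights on the forward pass using the per-tensor "absmean" scaling described by the BitNet authors. It is a minimal illustration rather than the official implementation: it omits the activation quantization, normalization, and optimized kernels of the actual BitNet code, and the class name is invented.

<syntaxhighlight lang="python">
import torch
from torch import nn

class BitLinearSketch(nn.Module):
    """Minimal illustrative stand-in for a BitLinear layer (not the official code).

    Full-precision weights are kept for training; on each forward pass they are
    ternarized to {-1, 0, +1} using a per-tensor "absmean" scale, and a
    straight-through estimator lets gradients reach the full-precision weights.
    """

    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        nn.init.kaiming_uniform_(self.weight)

    def forward(self, x):
        w = self.weight
        scale = w.abs().mean().clamp(min=1e-5)        # absmean scaling factor
        w_ternary = (w / scale).round().clamp(-1, 1)  # entries become -1, 0 or +1
        # Straight-through estimator: use the ternarized weights in the forward
        # pass while gradients flow to the underlying full-precision weights.
        w_q = w + (w_ternary * scale - w).detach()
        return nn.functional.linear(x, w_q)

# Used as a drop-in replacement for nn.Linear inside a transformer block:
layer = BitLinearSketch(16, 8)
print(layer(torch.randn(2, 16)).shape)  # torch.Size([2, 8])
</syntaxhighlight>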
 
In 2025, Microsoft researchers released ''BitNet b1.58 2B4T'', a model with [[open-weights|open weights]] and open inference code, demonstrating performance competitive with full-precision models at 2 billion parameters and 4 trillion training tokens.{{sfn|Ma|Wang|Huang|Zhang|2025|p=}}
==Sources==
* {{cite arXiv |last=Ma |first=Shuming |last2=Wang |first2=Hongyu |last3=Ma |first3=Lingxiao |last4=Wang |first4=Lei |last5=Wang |first5=Wenhui |last6=Huang |first6=Shaohan |last7=Dong |first7=Li |last8=Wang |first8=Ruiping |last9=Xue |first9=Jilong |last10=Wei |first10=Furu |title=The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits |arxiv=2402.17764 |date=2024-02-27 }}
* {{citation |last=Ma |first=Shuming |last2=Wang |first2=Hongyu |last3=Huang |first3=Shaohan |last4=Zhang |first4=Xingxing |last5=Hu |first5=Ying |last6=Song |first6=Ting |last7=Xia |first7=Yan |last8=Wei |first8=Furu |title=BitNet b1.58 2B4T Technical Report |date=2025 |doi=10.48550/ARXIV.2504.12285 |url=https://arxiv.org/abs/2504.12285 |access-date=2025-04-22}}
* {{cite journal |last=Friha |first=Othmane |last2=Amine Ferrag |first2=Mohamed |last3=Kantarci |first3=Burak |last4=Cakmak |first4=Burak |last5=Ozgun |first5=Arda |last6=Ghoualmi-Zine |first6=Nassira |title=LLM-Based Edge Intelligence: A Comprehensive Survey on Architectures, Applications, Security and Trustworthiness |journal=IEEE Open Journal of the Communications Society |volume=5 |date=2024 |issn=2644-125X |doi=10.1109/OJCOMS.2024.3456549 |doi-access=free |pages=5799–5856}}
* {{cite journal |title=1-bit LLMs Could Solve AI's Energy Demands |journal=IEEE Spectrum |date=2024-05-30 |url=https://spectrum.ieee.org/1-bit-llm |first=Matthew |last=Hutson |access-date=2025-04-22}}
* {{cite book |last=Huyen |first=Chip |title=AI Engineering |publisher=O'Reilly Media |date=2024-12-04 |isbn=978-1-0981-6627-4 |url=https://www.google.com/books/edition/AI_Engineering/S7M1EQAAQBAJ?hl=en&gbpv=1&pg=PA330 |access-date=2025-04-22}}
* {{citation |last=Kumar |first=Tanishq |last2=Ankner |first2=Zachary |last3=Spector |first3=Benjamin F. |last4=Bordelon |first4=Blake |last5=Muennighoff |first5=Niklas |last6=Paul |first6=Mansheej |last7=Pehlevan |first7=Cengiz |last8=Ré |first8=Christopher |last9=Raghunathan |first9=Aditi |title=Scaling Laws for Precision |date=2024 |doi=10.48550/ARXIV.2411.04330 |doi-access=free |url=http://arxiv.org/pdf/2411.04330 |access-date=2025-04-22}}
* {{cite web |last=Morales |first=Jowi |title=Microsoft researchers build 1-bit AI LLM with 2B parameters |website=Tom's Hardware |date=2025-04-17 |url=https://www.tomshardware.com/tech-industry/artificial-intelligence/microsoft-researchers-build-1-bit-ai-llm-with-2b-parameters-model-small-enough-to-run-on-some-cpus |access-date=2025-04-21}}
* {{citation |last=Ouyang |first=Xu |last2=Ge |first2=Tao |last3=Hartvigsen |first3=Thomas |last4=Zhang |first4=Zhisong |last5=Mi |first5=Haitao |last6=Yu |first6=Dong |title=Low-Bit Quantization Favors Undertrained LLMs: Scaling Laws for Quantized LLMs with 100T Training Tokens |date=2024 |doi=10.48550/ARXIV.2411.17691 |doi-access=free |url=http://arxiv.org/pdf/2411.17691 |access-date=2025-04-22}}
* {{citation |last=Wang |first=Hongyu |last2=Ma |first2=Shuming |last3=Dong |first3=Li |last4=Huang |first4=Shaohan |last5=Wang |first5=Huaijie |last6=Ma |first6=Lingxiao |last7=Yang |first7=Fan |last8=Wang |first8=Ruiping |last9=Wu |first9=Yi |last10=Wei |first10=Furu |title=BitNet: Scaling 1-bit Transformers for Large Language Models |date=2023 |doi=10.48550/ARXIV.2310.11453 |doi-access=free |url=https://arxiv.org/abs/2310.11453 |access-date=2025-04-23}}
 
[[Category:Large language models]]
 
 
{{ai-stub}}