{{Short description|Large language model with ternary weights}}
A '''1.58-bit large language model''' (also known as a '''ternary LLM''') is a type of [[large language model]] (LLM) designed to be computationally efficient. It achieves this by using [[neural network#in machine learning|weights]] that are restricted to only three values: -1, 0, and +1. This restriction significantly reduces the model's memory footprint and allows for faster processing, as multiplication operations can be replaced with simpler additions and subtractions. This contrasts with traditional models, which store their weights as 16-bit floating-point numbers ([[FP16]] or [[BF16]]).
Studies have shown that for models up to several billion parameters, the performance of 1.58-bit LLMs on various tasks is comparable to their full-precision counterparts.{{sfn|Ma|Wang|Ma|Wang|2024|p=1}}{{sfn|Hutson|2024}} This approach could enable powerful AI to run on less specialized and lower-power hardware.{{sfn|Friha|Amine Ferrag|Kantarci|Cakmak|2024|p=5822}}
The name "1.58-bit" comes from the fact that a system with three states contains <math>\log_2 3 \approx 1.58</math> [[bit|bits]] of [[information theory|information]]. These models are sometimes also referred to as '''1-bit LLMs''' in research papers, although this term can also refer to true binary models (with weights of -1 and +1).{{sfn|Ma|Wang|Ma|Wang|2024|p=1}}{{sfn|Morales|2025}}
== BitNet ==
{{redirect|BitNet|a computer network|BITNET}}
In 2024, Ma et al., researchers at [[Microsoft]], reported that their 1.58-bit model '''''BitNet''' b1.58'' matches the performance of the 16-bit [[Llama 2]] and, in their words, opens the era of 1-bit LLMs.{{sfn|Huyen|2024|p=330}} Rather than quantizing the weights of a trained full-precision model, BitNet's creators relied on a new ''BitLinear'' layer that replaces the ''nn.Linear'' layer of the traditional transformer design.{{sfn|Wang|Ma|Dong|Huang|2023|p=1}}
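The BitNet b1.58 work describes an "absmean" weight quantization, in which the weight matrix is scaled by the mean of its absolute values and each entry is rounded and clipped to the set {-1, 0, +1}. The sketch below illustrates only that step under simplifying assumptions; activation quantization, normalization, and the straight-through estimator used during training are omitted, and the function name is illustrative.

<syntaxhighlight lang="python">
import torch

def absmean_ternarize(w: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    # Scale by the mean absolute value of the weight matrix, then round
    # and clip each entry to the ternary set {-1, 0, +1}.
    gamma = w.abs().mean()
    return (w / (gamma + eps)).round().clamp(-1, 1)

w = torch.randn(4, 8)
print(absmean_ternarize(w).unique())  # tensor containing only -1., 0. and 1.
</syntaxhighlight>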
In 2025, Microsoft researchers released ''BitNet b1.58 2B4T'', a model with open weights and open inference code that has 2 billion parameters and was trained on 4 trillion tokens, demonstrating performance competitive with full-precision models of similar size.{{sfn|Ma|Wang|Huang|Zhang|2025|p=}}
== Post-training quantization ==
BitNet derives its performance from being trained natively in 1.58-bit precision rather than being quantized from a full-precision model after training. Training from scratch is expensive, however, so it is desirable to convert existing full-precision models to 1.58 bits instead. In 2024, Hugging Face reported a fine-tuning method that gradually ramps up the strength of the 1.58-bit quantization to convert an existing model.<ref>{{cite web |title=Fine-tuning LLMs to 1.58bit: extreme quantization made easy |website=Hugging Face |url=https://huggingface.co/blog/1_58_llm_extreme_quantization#pre-training-results-in-158b}}</ref>
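One way to implement such a gradual ramp is sketched below, assuming a linear warm-up of a mixing coefficient; the helper name and schedule are illustrative and not the exact recipe from the Hugging Face post. The effective weight is a blend of the full-precision weight and its ternary quantization, with the blend shifting toward fully ternary over the course of fine-tuning.

<syntaxhighlight lang="python">
import torch

def ramped_weight(w: torch.Tensor, step: int, warmup_steps: int,
                  eps: float = 1e-5) -> torch.Tensor:
    # lam grows linearly from 0 (pure full precision) to 1 (pure ternary),
    # so the quantization error is introduced gradually during fine-tuning.
    lam = min(step / warmup_steps, 1.0)
    gamma = w.abs().mean()
    w_ternary = (w / (gamma + eps)).round().clamp(-1, 1) * gamma  # rescale back
    return (1.0 - lam) * w + lam * w_ternary
</syntaxhighlight>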
== Critique ==
Some researchers{{sfn|Ouyang|Ge|Hartvigsen|Zhang|2024|p=}} point out that the scaling laws{{sfn|Kumar|Ankner|Spector|Bordelon|2024|p=}} of large language models favor low-bit weights only in the case of undertrained models; as the number of training tokens increases, the deficiencies of low-bit quantization become apparent.
==References==
==Sources==
* {{cite arXiv |eprint=2402.17764 |last1=Ma |first1=Shuming |last2=Wang |first2=Hongyu |last3=Ma |first3=Lingxiao |last4=Wang |first4=Lei |last5=Wang |first5=Wenhui |last6=Huang |first6=Shaohan |last7=Dong |first7=Li |last8=Wang |first8=Ruiping |last9=Xue |first9=Jilong |last10=Wei |first10=Furu |title=The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits |date=2024 |class=cs.CL }}
* {{cite arXiv |eprint=2504.12285 |last1=Ma |first1=Shuming |last2=Wang |first2=Hongyu |last3=Huang |first3=Shaohan |last4=Zhang |first4=Xingxing |last5=Hu |first5=Ying |last6=Song |first6=Ting |last7=Xia |first7=Yan |last8=Wei |first8=Furu |title=BitNet b1.58 2B4T Technical Report |date=2025 |class=cs.CL }}
* {{cite journal |last1=Friha |first1=Othmane |last2=Amine Ferrag |first2=Mohamed |last3=Kantarci |first3=Burak |last4=Cakmak |first4=Burak |last5=Ozgun |first5=Arda |last6=Ghoualmi-Zine |first6=Nassira |title=LLM-Based Edge Intelligence: A Comprehensive Survey on Architectures, Applications, Security and Trustworthiness |journal=IEEE Open Journal of the Communications Society |volume=5 |date=2024 |issn=2644-125X |doi=10.1109/OJCOMS.2024.3456549 |doi-access=free |pages=5799–5856}}
* {{cite journal |title=1-bit LLMs Could Solve AI's Energy Demands |journal=IEEE Spectrum |date=2024-05-30 |url=https://spectrum.ieee.org/1-bit-llm |first=Matthew|last=Hutson|access-date=2025-04-22}}
* {{cite book |last=Huyen |first=Chip |title=AI Engineering |publisher=O'Reilly Media |date=2024-12-04 |isbn=978-1-0981-6627-4 |url=https://books.google.com/books?id=S7M1EQAAQBAJ&pg=PA330 |access-date=2025-04-22}}
* {{cite arXiv |eprint=2411.04330 |last1=Kumar |first1=Tanishq |last2=Ankner |first2=Zachary |last3=Spector |first3=Benjamin F. |last4=Bordelon |first4=Blake |last5=Muennighoff |first5=Niklas |last6=Paul |first6=Mansheej |last7=Pehlevan |first7=Cengiz |last8=Ré |first8=Christopher |last9=Raghunathan |first9=Aditi |title=Scaling Laws for Precision |date=2024 |class=cs.LG }}
* {{cite web |last=Morales |first=Jowi |title=Microsoft researchers build 1-bit AI LLM with 2B parameters |website=Tom's Hardware |date=2025-04-17 |url=https://www.tomshardware.com/tech-industry/artificial-intelligence/microsoft-researchers-build-1-bit-ai-llm-with-2b-parameters-model-small-enough-to-run-on-some-cpus |access-date=2025-04-21}}
* {{cite arXiv |eprint=2411.17691 |last1=Ouyang |first1=Xu |last2=Ge |first2=Tao |last3=Hartvigsen |first3=Thomas |last4=Zhang |first4=Zhisong |last5=Mi |first5=Haitao |last6=Yu |first6=Dong |title=Low-Bit Quantization Favors Undertrained LLMs: Scaling Laws for Quantized LLMs with 100T Training Tokens |date=2024 |class=cs.LG }}
* {{cite arXiv |eprint=2310.11453 |last1=Wang |first1=Hongyu |last2=Ma |first2=Shuming |last3=Dong |first3=Li |last4=Huang |first4=Shaohan |last5=Wang |first5=Huaijie |last6=Ma |first6=Lingxiao |last7=Yang |first7=Fan |last8=Wang |first8=Ruiping |last9=Wu |first9=Yi |last10=Wei |first10=Furu |title=BitNet: Scaling 1-bit Transformers for Large Language Models |date=2023 |class=cs.CL }}
{{Generative AI}}
[[Category:Large language models]]