'''Reasoning language models''' ('''RLMs''') are [[large language model]]s that have been further trained to solve multi-step reasoning tasks.<ref>{{cite arXiv |title=Reasoning Language Models: A Blueprint |last=Besta |first=Maciej |date=2025-01-23 |eprint=2501.11223 |class=cs.CL}}</ref> These models perform better than traditional autoregressive LLMs on logical, mathematical, and programmatic tasks, can [[Backtracking|backtrack]] during generation, and employ test-time compute as an additional [[Neural scaling law|scaling axis]] beyond [[Training, validation, and test data sets|training examples]], parameter count, and train-time compute.
 
== History ==
=== 2024 ===
OpenAI released o1-preview, an LLM with enhanced reasoning, in September 2024.<ref>{{Cite web |last=Edwards |first=Benj |date=2024-09-12 |title=OpenAI's new "reasoning" AI models are here: o1-preview and o1-mini |url=https://arstechnica.com/information-technology/2024/09/openais-new-reasoning-ai-models-are-here-o1-preview-and-o1-mini/ |access-date=2025-02-06 |website=Ars Technica |language=en-US}}</ref> The full version, [[OpenAI o1|o1]], followed in December 2024, and that same month OpenAI announced its successor, [[OpenAI o3|o3]].<ref>{{Cite web |date=2024-12-20 |title=OpenAI confirms new frontier models o3 and o3-mini |url=https://venturebeat.com/ai/openai-confirms-new-frontier-models-o3-and-o3-mini/ |access-date=2025-02-06 |website=VentureBeat |language=en-US}}</ref>
 
The development of reasoning LLMs has illustrated what [[Richard S. Sutton|Rich Sutton]] termed the "bitter lesson": that general methods leveraging computation tend to outperform those relying on specific human insights.<ref>{{Cite web |last=Sutton |first=Richard S. |title=The Bitter Lesson |url=http://www.incompleteideas.net/IncIdeas/BitterLesson.html |access-date=2025-02-27 |website=Incomplete Ideas}}</ref> For instance, the Generative AI Research Lab (GAIR) initially explored complex techniques such as tree search and reinforcement learning in attempts to replicate o1's capabilities, but found, as documented in its "o1 Replication Journey" papers, that [[knowledge distillation]] (training a smaller model to imitate o1's outputs) was surprisingly effective.
 
In November 2024, [[Alibaba Group|Alibaba]] released a reasoning version of its [[Qwen]] LLMs, QwQ-32B-Preview.
 
In December 2024, Google introduced [[Deep research|Deep Research]] in [[Gemini (chatbot)|Gemini]],<ref>{{Cite web |date=2024-12-11 |title=Try Deep Research and our new experimental model in Gemini, your AI assistant |url=https://blog.google/products/gemini/google-gemini-deep-research/ |access-date=2025-02-05 |website=Google |language=en-us}}</ref> a feature that conducts multi-step research tasks.
 
On December 16, 2024, Hugging Face reported an experiment in which, by scaling test-time compute, a relatively small [[Llama (language model)|Llama]] 3B model outperformed the much larger Llama 70B model on challenging reasoning tasks. This result suggested that improved inference strategies can unlock latent reasoning capabilities even in compact models.<ref>{{Cite web |title=Scaling test-time compute - a Hugging Face Space by HuggingFaceH4 |url=https://huggingface.co/spaces/HuggingFaceH4/blogpost-scaling-test-time-compute |access-date=2025-02-05 |website=huggingface.co}}</ref>
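The strategies compared in that experiment included majority voting (self-consistency) and search guided by a reward model. The sketch below is a simplified illustration of the majority-voting strategy rather than the experiment's actual code; <code>sample_solution</code> is a hypothetical stand-in for one stochastic call to a language model:

<syntaxhighlight lang="python">
from collections import Counter
from typing import Callable

def majority_vote(sample_solution: Callable[[str], str],
                  problem: str, n_samples: int = 16) -> str:
    """Return the most common final answer among independently
    sampled solutions (self-consistency / majority voting)."""
    # Each call stands in for one stochastic generation from an LLM;
    # drawing more samples spends more test-time compute.
    answers = [sample_solution(problem) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
</syntaxhighlight>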
 
=== 2025 ===
In January 2025, DeepSeek released R1, a model competitive with o1 at lower cost, highlighting the effectiveness of Group Relative Policy Optimization (GRPO), a reinforcement learning algorithm derived from [[Proximal policy optimization|proximal policy optimization]] (PPO) and sketched below.<ref>{{Cite web |last=Orland |first=Kyle |date=2025-01-28 |title=How does DeepSeek R1 really fare against OpenAI's best reasoning models? |url=https://arstechnica.com/ai/2025/01/how-does-deepseek-r1-really-fare-against-openais-best-reasoning-models/ |access-date=2025-02-06 |website=Ars Technica |language=en-US}}</ref> On January 25, 2025, [[DeepSeek]] added a feature to R1 that enables the simultaneous use of search and reasoning, integrating data retrieval into the model's reflective reasoning process. OpenAI subsequently released o3-mini. The power of distillation was further demonstrated by s1-32B, which combined training on a small distilled dataset with "budget forcing", a simple test-time scaling technique.<ref>{{Citation |last1=Muennighoff |first1=Niklas |title=s1: Simple test-time scaling |date=2025-02-03 |arxiv=2501.19393 |last2=Yang |first2=Zitong |last3=Shi |first3=Weijia |last4=Li |first4=Xiang Lisa |last5=Fei-Fei |first5=Li |last6=Hajishirzi |first6=Hannaneh |last7=Zettlemoyer |first7=Luke |last8=Liang |first8=Percy |last9=Candès |first9=Emmanuel}}</ref>
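A minimal sketch of the group-relative advantage computation at the core of GRPO, assuming each completion in a sampled group has already been assigned a scalar reward; unlike PPO, no learned value (critic) model is required:

<syntaxhighlight lang="python">
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Normalize each completion's reward against the group's mean and
    standard deviation, yielding the per-sample advantages that GRPO
    uses in place of PPO's critic-based advantage estimates."""
    mean = statistics.mean(rewards)
    stdev = statistics.pstdev(rewards) or 1.0  # guard: identical rewards
    return [(r - mean) / stdev for r in rewards]
</syntaxhighlight>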
 
On February 2, 2025, OpenAI released [[ChatGPT Deep Research|deep research]],<ref>{{Cite web |date=2025-02-02 |title=Introducing deep research |url=https://openai.com/index/introducing-deep-research/ |access-date=2025-02-05 |website=OpenAI |language=en-US}}</ref><ref>{{Cite news |last=Milmo |first=Dan |date=2025-02-03 |title=OpenAI launches 'deep research' tool that it says can match research analyst |url=https://www.theguardian.com/technology/2025/feb/03/openai-deep-research-agent-chatgpt-deepseek |access-date=2025-03-16 |work=The Guardian |language=en-GB |issn=0261-3077}}</ref> a tool based on [[OpenAI o3|o3]] that integrates reasoning and web search in a unified workflow, allowing users to perform complex research tasks that require multi-step reasoning and the synthesis of data from multiple sources. It can take from 5 to 30 minutes to generate comprehensive reports.<ref>{{Cite web |last=Ha |first=Anthony |date=2025-02-03 |title=OpenAI unveils a new ChatGPT agent for 'deep research' |url=https://techcrunch.com/2025/02/02/openai-unveils-a-new-chatgpt-agent-for-deep-research/ |access-date=2025-02-06 |website=TechCrunch |language=en-US}}</ref>
 
== Supervised finetuning ==