Explainable artificial intelligence

=== Interpretability ===
{{Advert section|date=July 2024}}
 
Scholars sometimes use the term "mechanistic interpretability" to refer to the process of [[Reverse engineering|reverse-engineering]] [[artificial neural networks]] to understand their internal decision-making mechanisms and components, similar to how one might analyze a complex machine or computer program.<ref>{{Cite web |last=Olah |first=Chris |date=June 27, 2022 |title=Mechanistic Interpretability, Variables, and the Importance of Interpretable Bases |url=https://www.transformer-circuits.pub/2022/mech-interp-essay |access-date=2024-07-10 |website=www.transformer-circuits.pub}}</ref>
 
Interpretability research often focuses on generative pretrained transformers. It is particularly relevant for [[AI safety]] and [[AI alignment|alignment]], as it may make it possible to identify signs of undesired behaviors such as [[sycophancy]], deceptiveness or bias, and to better steer AI models.<ref>{{Cite web |last=Mittal |first=Aayush |date=2024-06-17 |title=Understanding Sparse Autoencoders, GPT-4 & Claude 3 : An In-Depth Technical Exploration |url=https://www.unite.ai/understanding-sparse-autoencoders-gpt-4-claude-3-an-in-depth-technical-exploration/ |access-date=2024-07-10 |website=Unite.AI |language=en-US}}</ref>
 
Studying the interpretability of the most advanced [[Foundation model|foundation models]] is one of the primary goals of the company [[Anthropic]]. This work often involves searching for an automated way to identify "features" in generative pretrained transformers like [[Claude (language model)|Claude]]. In a [[Neural network (machine learning)|neural network]], a feature is a pattern of neuron activations that corresponds to a concept. Using a compute-intensive technique called "[[dictionary learning]]", Anthropic was able to identify millions of features in Claude, including, for example, one associated with the [[Golden Gate Bridge]] and another representing the concept of a [[software bug]]. Enhancing the ability to identify and edit features is expected to significantly improve the [[AI safety|safety]] of [[Frontier model|frontier AI models]].<ref>{{Cite web |last=Ropek |first=Lucas |date=2024-05-21 |title=New Anthropic Research Sheds Light on AI's 'Black Box' |url=https://gizmodo.com/new-anthropic-research-sheds-light-on-ais-black-box-1851491333 |access-date=2024-05-23 |website=Gizmodo |language=en}}</ref><ref>{{Cite magazine |last=Perrigo |first=Billy |date=2024-05-21 |title=Artificial Intelligence Is a 'Black Box.' Maybe Not For Long |url=https://time.com/6980210/anthropic-interpretability-ai-safety-research/ |access-date=2024-05-24 |magazine=Time |language=en}}</ref><ref>Adly Templeton*, Tom Conerly*, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, Andy Jones, Hoagy Cunningham, Nicholas L Turner, Callum McDougall, Monte MacDiarmid, Alex Tamkin, Esin Durmus, Tristan Hume, Francesco Mosconi, C. Daniel Freeman, Theodore R. Sumers, Edward Rees, Joshua Batson, Adam Jermyn, Shan Carter, Chris Olah, Tom Henighan {{citation |title=Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet |url=https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html |access-date=24 May 2024 |publisher=Anthropic}}</ref>
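
The sketch below illustrates the underlying idea on synthetic data: activation vectors are decomposed into a sparse combination of learned directions, each of which is a candidate "feature". It uses scikit-learn's <code>DictionaryLearning</code> as a simple stand-in for the sparse autoencoders used in practice; the activation matrix is random placeholder data and all sizes are arbitrary illustrative choices.

<syntaxhighlight lang="python">
# Sketch of dictionary learning over neuron activations: each learned
# dictionary atom is a candidate "feature" direction, and the sparse codes
# indicate how strongly each feature fires on each sample.
import numpy as np
from sklearn.decomposition import DictionaryLearning

rng = np.random.default_rng(0)
# Placeholder data: in practice these would be activations recorded from a
# hidden layer of a real model over many inputs.
activations = rng.normal(size=(500, 64))      # 500 samples x 64 hidden units

learner = DictionaryLearning(
    n_components=128,                         # overcomplete: more atoms than units
    alpha=1.0,                                # sparsity penalty
    max_iter=10,
    transform_algorithm="lasso_lars",
    random_state=0,
)
codes = learner.fit_transform(activations)    # shape (500, 128), mostly zeros
features = learner.components_                # shape (128, 64): feature directions

# Inspect a feature by finding the samples on which it fires most strongly.
top_samples = np.argsort(-np.abs(codes[:, 0]))[:5]
print("samples that most activate feature 0:", top_samples)
</syntaxhighlight>

In a real setting, a feature would then be interpreted by examining the inputs (e.g. text passages) on which it activates most strongly, rather than synthetic sample indices as here.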
 
For [[Convolutional neural network|convolutional neural networks]], [[DeepDream]] can generate images that strongly activate a particular neuron, providing a visual hint about what the neuron is trained to identify.<ref>{{Cite magazine |last=Barber |first=Gregory |title=Inside the 'Black Box' of a Neural Network |url=https://www.wired.com/story/inside-black-box-of-neural-network/ |access-date=2024-07-10 |magazine=Wired |language=en-US |issn=1059-1028}}</ref>
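
The core of this approach is activation maximization: starting from an image (often random noise or a natural photograph) and repeatedly adjusting its pixels by gradient ascent so that a chosen neuron or channel activates more strongly. Below is a minimal sketch of that loop, assuming [[PyTorch]] and a pretrained torchvision VGG16 with an arbitrarily chosen layer and channel; it omits the input normalization, jitter and regularization that DeepDream-style visualizations add for image quality.

<syntaxhighlight lang="python">
# Sketch of activation maximization: gradient ascent on the input image to
# increase the mean activation of one chosen convolutional channel.
import torch
from torchvision import models

model = models.vgg16(weights=models.VGG16_Weights.DEFAULT).eval()
for p in model.parameters():
    p.requires_grad_(False)                   # only the image is optimized

target_layer = model.features[10]             # an intermediate conv layer (arbitrary choice)
channel = 42                                  # channel to visualize (arbitrary choice)

captured = {}
def hook(module, inputs, output):
    captured["activation"] = output
target_layer.register_forward_hook(hook)

image = torch.rand(1, 3, 224, 224, requires_grad=True)   # start from random noise
optimizer = torch.optim.Adam([image], lr=0.05)

for _ in range(100):
    optimizer.zero_grad()
    model(image)
    # Negative sign because the optimizer minimizes; this ascends the activation.
    loss = -captured["activation"][0, channel].mean()
    loss.backward()
    optimizer.step()

# `image` now contains a pattern that strongly activates the chosen channel,
# hinting at what that channel responds to.
</syntaxhighlight>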