Explainable artificial intelligence

=== Interpretability ===
{{Advert section|date=July 2024}}
 
Scholars sometimes use the term "mechanistic interpretability" to refer to the process of [[Reverse engineering|reverse-engineering]] [[artificial neural networks]] to understand their internal decision-making mechanisms and components, similar to how one might analyze a complex machine or computer program.<ref>{{Cite web |last=Olah |first=Chris |date=June 27, 2022 |title=Mechanistic Interpretability, Variables, and the Importance of Interpretable Bases |url=https://www.transformer-circuits.pub/2022/mech-interp-essay |access-date=2024-07-10 |website=www.transformer-circuits.pub}}</ref>
 
Interpretability research often focuses on generative pretrained transformers. It is particularly relevant for [[AI safety]] and [[AI alignment|alignment]], as it may make it possible to identify signs of undesired behaviors such as [[sycophancy]], deceptiveness or bias, and to better steer AI models.<ref>{{Cite web |last=Mittal |first=Aayush |date=2024-06-17 |title=Understanding Sparse Autoencoders, GPT-4 & Claude 3 : An In-Depth Technical Exploration |url=https://www.unite.ai/understanding-sparse-autoencoders-gpt-4-claude-3-an-in-depth-technical-exploration/ |access-date=2024-07-10 |website=Unite.AI |language=en-US}}</ref>
 
Studying the interpretability of the most advanced [[Foundation model|foundation models]] is one of the primary goals of the company [[Anthropic]]. This work often involves searching for an automated way to identify "features" in generative pretrained transformers like [[Claude (language model)|Claude]]. In a [[Neural network (machine learning)|neural network]], a feature is a pattern of neuron activations that corresponds to a concept. Using a compute-intensive technique called "[[dictionary learning]]", Anthropic was able to identify millions of features in Claude, including, for example, one associated with the [[Golden Gate Bridge]] and another representing the concept of a [[software bug]]. Enhancing the ability to identify and edit features is expected to significantly improve the [[AI safety|safety]] of [[Frontier model|frontier AI models]].<ref>{{Cite web |last=Ropek |first=Lucas |date=2024-05-21 |title=New Anthropic Research Sheds Light on AI's 'Black Box' |url=https://gizmodo.com/new-anthropic-research-sheds-light-on-ais-black-box-1851491333 |access-date=2024-05-23 |website=Gizmodo |language=en}}</ref><ref>{{Cite magazine |last=Perrigo |first=Billy |date=2024-05-21 |title=Artificial Intelligence Is a 'Black Box.' Maybe Not For Long |url=https://time.com/6980210/anthropic-interpretability-ai-safety-research/ |access-date=2024-05-24 |magazine=Time |language=en}}</ref><ref>Adly Templeton*, Tom Conerly*, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, Andy Jones, Hoagy Cunningham, Nicholas L Turner, Callum McDougall, Monte MacDiarmid, Alex Tamkin, Esin Durmus, Tristan Hume, Francesco Mosconi, C. Daniel Freeman, Theodore R. Sumers, Edward Rees, Joshua Batson, Adam Jermyn, Shan Carter, Chris Olah, Tom Henighan {{citation |title=Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet |url=https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html |access-date=24 May 2024 |publisher=Anthropic}}</ref>
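
The sketch below illustrates the underlying idea on synthetic data: activation vectors are decomposed into a sparse combination of learned directions, each of which is a candidate "feature". It uses scikit-learn's <code>DictionaryLearning</code> as a simple stand-in for the sparse autoencoders used in practice; the activation matrix is random placeholder data and all sizes are arbitrary illustrative choices.

<syntaxhighlight lang="python">
# Sketch of dictionary learning over neuron activations: each learned
# dictionary atom is a candidate "feature" direction, and the sparse codes
# indicate how strongly each feature fires on each sample.
import numpy as np
from sklearn.decomposition import DictionaryLearning

rng = np.random.default_rng(0)
# Placeholder data: in practice these would be activations recorded from a
# hidden layer of a real model over many inputs.
activations = rng.normal(size=(500, 64))      # 500 samples x 64 hidden units

learner = DictionaryLearning(
    n_components=128,                         # overcomplete: more atoms than units
    alpha=1.0,                                # sparsity penalty
    max_iter=10,
    transform_algorithm="lasso_lars",
    random_state=0,
)
codes = learner.fit_transform(activations)    # shape (500, 128), mostly zeros
features = learner.components_                # shape (128, 64): feature directions

# Inspect a feature by finding the samples on which it fires most strongly.
top_samples = np.argsort(-np.abs(codes[:, 0]))[:5]
print("samples that most activate feature 0:", top_samples)
</syntaxhighlight>

In a real setting, a feature would then be interpreted by examining the inputs (e.g. text passages) on which it activates most strongly, rather than synthetic sample indices as here.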
 
For [[Convolutional neural network|convolutional neural networks]], [[DeepDream]] can generate images that strongly activate a particular neuron, providing a visual hint about what the neuron is trained to identify.<ref>{{Cite magazine |last=Barber |first=Gregory |title=Inside the 'Black Box' of a Neural Network |url=https://www.wired.com/story/inside-black-box-of-neural-network/ |access-date=2024-07-10 |magazine=Wired |language=en-US |issn=1059-1028}}</ref>
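
The core of this approach is activation maximization: starting from an image (often random noise or a natural photograph) and repeatedly adjusting its pixels by gradient ascent so that a chosen neuron or channel activates more strongly. Below is a minimal sketch of that loop, assuming [[PyTorch]] and a pretrained torchvision VGG16 with an arbitrarily chosen layer and channel; it omits the input normalization, jitter and regularization that DeepDream-style visualizations add for image quality.

<syntaxhighlight lang="python">
# Sketch of activation maximization: gradient ascent on the input image to
# increase the mean activation of one chosen convolutional channel.
import torch
from torchvision import models

model = models.vgg16(weights=models.VGG16_Weights.DEFAULT).eval()
for p in model.parameters():
    p.requires_grad_(False)                   # only the image is optimized

target_layer = model.features[10]             # an intermediate conv layer (arbitrary choice)
channel = 42                                  # channel to visualize (arbitrary choice)

captured = {}
def hook(module, inputs, output):
    captured["activation"] = output
target_layer.register_forward_hook(hook)

image = torch.rand(1, 3, 224, 224, requires_grad=True)   # start from random noise
optimizer = torch.optim.Adam([image], lr=0.05)

for _ in range(100):
    optimizer.zero_grad()
    model(image)
    # Negative sign because the optimizer minimizes; this ascends the activation.
    loss = -captured["activation"][0, channel].mean()
    loss.backward()
    optimizer.step()

# `image` now contains a pattern that strongly activates the chosen channel,
# hinting at what that channel responds to.
</syntaxhighlight>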