=== Interpretability ===
Scholars sometimes use the term "mechanistic interpretability" to refer to the process of [[Reverse engineering|reverse-engineering]] [[artificial neural networks]] to understand their internal decision-making mechanisms and components, similar to how one might analyze a complex machine or computer program.<ref>{{Cite web |last=Olah |first=Chris |date=June 27, 2022 |title=Mechanistic Interpretability, Variables, and the Importance of Interpretable Bases |url=https://www.transformer-circuits.pub/2022/mech-interp-essay |access-date=2024-07-10 |website=www.transformer-circuits.pub}}</ref>
Interpretability research often focuses on generative pretrained transformers. It is particularly relevant for [[AI safety]] and [[AI alignment|alignment]], as it may make it possible to identify signs of undesired behaviors such as [[sycophancy]], deceptiveness, or bias, and to better steer AI models.<ref>{{Cite web |last=Mittal |first=Aayush |date=2024-06-17 |title=Understanding Sparse Autoencoders, GPT-4 & Claude 3 : An In-Depth Technical Exploration |url=https://www.unite.ai/understanding-sparse-autoencoders-gpt-4-claude-3-an-in-depth-technical-exploration/ |access-date=2024-07-10 |website=Unite.AI |language=en-US}}</ref>
Studying the interpretability of the most advanced [[Foundation model|foundation models]] is an active area of research.
For [[Convolutional neural network|convolutional neural networks]], [[DeepDream]] can generate images that strongly activate a particular neuron, providing a visual hint about what the neuron is trained to identify.<ref>{{Cite magazine |last=Barber |first=Gregory |title=Inside the 'Black Box' of a Neural Network |url=https://www.wired.com/story/inside-black-box-of-neural-network/ |access-date=2024-07-10 |magazine=Wired |language=en-US |issn=1059-1028}}</ref>
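The underlying idea, sometimes called activation maximization, can be sketched as gradient ascent on an input image so that a chosen neuron's activation increases. The sketch below is illustrative rather than DeepDream's actual procedure: the choice of VGG16, the layer index, the channel number, and the use of [[PyTorch]]/torchvision are assumptions made for the example.
<syntaxhighlight lang="python">
# Minimal sketch of activation maximization: gradient ascent on an input
# image to maximize the mean activation of one convolutional channel.
# Model, layer index, and channel are illustrative choices, not DeepDream's.
import torch
import torchvision.models as models

model = models.vgg16(weights=models.VGG16_Weights.DEFAULT).features.eval()
for p in model.parameters():
    p.requires_grad_(False)  # only the image is optimized, not the weights

def maximize_activation(layer_idx=10, channel=42, steps=100, lr=0.05):
    image = torch.rand(1, 3, 224, 224, requires_grad=True)  # start from noise
    optimizer = torch.optim.Adam([image], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        x = image
        for i, layer in enumerate(model):
            x = layer(x)
            if i == layer_idx:
                break
        loss = -x[0, channel].mean()   # ascend on the chosen channel's activation
        loss.backward()
        optimizer.step()
        image.data.clamp_(0, 1)        # keep pixel values in a valid range
    return image.detach()
</syntaxhighlight>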