For images, [[Saliency map|saliency maps]] highlight the parts of an image that most influenced the result.<ref>{{Cite web |last=Sharma |first=Abhishek |date=2018-07-11 |title=What Are Saliency Maps In Deep Learning? |url=https://analyticsindiamag.com/what-are-saliency-maps-in-deep-learning/ |access-date=2024-07-10 |website=Analytics India Magazine |language=en-US}}</ref>
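A common way to produce such a map is to take the gradient of the predicted class score with respect to the input pixels. The following minimal sketch, written with the PyTorch and torchvision libraries, illustrates this gradient-based approach; the pretrained ResNet-18 classifier and the file name <code>input.jpg</code> are illustrative assumptions, not part of any particular saliency-map method described above.

<syntaxhighlight lang="python">
import torch
from torchvision import models, transforms
from PIL import Image

# Minimal gradient-based saliency sketch: a pretrained torchvision
# classifier and an example image file are assumed for illustration.
model = models.resnet18(weights="IMAGENET1K_V1").eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
image = preprocess(Image.open("input.jpg").convert("RGB")).unsqueeze(0)
image.requires_grad_(True)

# Forward pass, then backpropagate the top class score to the pixels.
scores = model(image)
scores[0, scores.argmax()].backward()

# The saliency map is the maximum absolute gradient over color channels:
# pixels with larger gradients influenced the prediction more strongly.
saliency = image.grad.abs().max(dim=1)[0].squeeze()
</syntaxhighlight>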
However, these techniques are not well suited to [[Language model|language models]] like [[Generative pre-trained transformer|generative pretrained transformers]]. Because these models generate language, they can produce an explanation of their own output, but that explanation may not be reliable. Other techniques include attention analysis (examining how the model focuses on different parts of the input), probing methods (testing what information is captured in the model's representations), causal tracing (tracing the flow of information through the model), and circuit discovery (identifying specific subnetworks responsible for certain behaviors). Explainability research in this area overlaps significantly with interpretability and [[AI alignment|alignment]] research.<ref>{{Citation |
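As a minimal sketch of attention analysis, the snippet below extracts the attention weights of the publicly available GPT-2 model using the Hugging Face <code>transformers</code> library; the example sentence and the choice of model are assumptions made for illustration only.

<syntaxhighlight lang="python">
import torch
from transformers import GPT2Tokenizer, GPT2Model

# Attention analysis sketch: GPT-2 is used as an example of a
# generative pretrained transformer.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2", output_attentions=True)

inputs = tokenizer("The Golden Gate Bridge is in San Francisco", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions holds one tensor per layer, each of shape
# (batch, heads, sequence_length, sequence_length); inspecting these
# weights shows which input tokens each position attends to.
last_layer = outputs.attentions[-1][0]           # (heads, seq, seq)
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
print(tokens)
print(last_layer.mean(dim=0))                    # head-averaged attention matrix
</syntaxhighlight>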
=== Interpretability ===
Interpretability research often focuses on generative pretrained transformers. It is particularly relevant for [[AI safety]] and [[AI alignment|alignment]], as it may make it possible to identify signs of undesired behaviors such as [[sycophancy]], deceptiveness, or bias, and to better steer AI models.<ref>{{Cite web |last=Mittal |first=Aayush |date=2024-06-17 |title=Understanding Sparse Autoencoders, GPT-4 & Claude 3 : An In-Depth Technical Exploration |url=https://www.unite.ai/understanding-sparse-autoencoders-gpt-4-claude-3-an-in-depth-technical-exploration/ |access-date=2024-07-10 |website=Unite.AI |language=en-US}}</ref>
Studying the interpretability of the most advanced [[Foundation model|foundation models]] is one of the primary goals of the company [[Anthropic]]{{Advert|date=July 2024}}. This work often involves searching for an automated way to identify "features" in generative pretrained transformers like [[Claude (language model)|Claude]]. In a [[Neural network (machine learning)|neural network]], a feature is a pattern of neuron activations that corresponds to a concept. Using a compute-intensive technique called "[[dictionary learning]]", Anthropic was able to identify millions of features in Claude, including one associated with the [[Golden Gate Bridge]] and another representing the concept of a [[software bug]]. Enhancing the ability to identify and edit features is expected to significantly improve the [[AI safety|safety]] of [[Frontier model|frontier AI models]].<ref>{{Cite web |last=Ropek |first=Lucas |date=2024-05-21 |title=New Anthropic Research Sheds Light on AI's 'Black Box' |url=https://gizmodo.com/new-anthropic-research-sheds-light-on-ais-black-box-1851491333 |access-date=2024-05-23 |website=Gizmodo |language=en}}</ref><ref>{{Cite
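Dictionary learning of this kind is often carried out with sparse autoencoders trained on a model's internal activations. The sketch below is an illustrative simplification in PyTorch, not Anthropic's actual implementation; the activation dimension, dictionary size, learning rate, and L1 penalty weight are all assumptions, and the random batch stands in for activations collected from a real language model.

<syntaxhighlight lang="python">
import torch
import torch.nn as nn

# Sparse-autoencoder sketch of dictionary learning over activations.
class SparseAutoencoder(nn.Module):
    def __init__(self, activation_dim=768, dictionary_size=16384):
        super().__init__()
        self.encoder = nn.Linear(activation_dim, dictionary_size)
        self.decoder = nn.Linear(dictionary_size, activation_dim)

    def forward(self, activations):
        # Each dictionary entry ("feature") fires for a specific pattern of
        # neuron activations; ReLU plus an L1 penalty keeps the code sparse.
        features = torch.relu(self.encoder(activations))
        reconstruction = self.decoder(features)
        return reconstruction, features

sae = SparseAutoencoder()
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-4)

# One training step on a batch of activations (random placeholders here).
batch = torch.randn(64, 768)
optimizer.zero_grad()
reconstruction, features = sae(batch)
loss = nn.functional.mse_loss(reconstruction, batch) + 1e-3 * features.abs().mean()
loss.backward()
optimizer.step()
</syntaxhighlight>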
For [[Convolutional neural network|convolutional neural networks]], [[DeepDream]] can generate images that strongly activate a particular neuron, providing a visual hint about what the neuron is trained to identify.<ref>{{Cite
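The snippet below is a minimal activation-maximization sketch in the spirit of DeepDream, not the original DeepDream algorithm: starting from noise, it adjusts the pixels by gradient ascent so that one chosen channel of a pretrained torchvision network responds more strongly. The VGG-16 model and the particular layer and channel indices are arbitrary assumptions for illustration.

<syntaxhighlight lang="python">
import torch
from torchvision import models

# Activation-maximization sketch: layer and channel are example choices.
model = models.vgg16(weights="IMAGENET1K_V1").features.eval()
layer_index, channel = 20, 33

activation = {}
def hook(module, inputs, output):
    activation["value"] = output
model[layer_index].register_forward_hook(hook)

# Start from noise and repeatedly adjust the pixels by gradient ascent so
# that the selected channel's activation increases.
image = torch.rand(1, 3, 224, 224, requires_grad=True)
optimizer = torch.optim.Adam([image], lr=0.05)
for _ in range(100):
    optimizer.zero_grad()
    model(image)
    loss = -activation["value"][0, channel].mean()  # maximize mean activation
    loss.backward()
    optimizer.step()
</syntaxhighlight>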
== History and methods ==