For images, [[Saliency map|saliency maps]] highlight the parts of an image that most influenced the result.<ref>{{Cite web |last=Sharma |first=Abhishek |date=2018-07-11 |title=What Are Saliency Maps In Deep Learning? |url=https://analyticsindiamag.com/what-are-saliency-maps-in-deep-learning/ |access-date=2024-07-10 |website=Analytics India Magazine |language=en-US}}</ref>
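A common way to produce such a map is to take the gradient of the predicted class score with respect to the input pixels. The following minimal sketch, written with the PyTorch and torchvision libraries, illustrates this gradient-based approach; the pretrained ResNet-18 classifier and the file name <code>input.jpg</code> are illustrative assumptions, not part of any particular saliency-map method described above.

<syntaxhighlight lang="python">
import torch
from torchvision import models, transforms
from PIL import Image

# Minimal gradient-based saliency sketch: a pretrained torchvision
# classifier and an example image file are assumed for illustration.
model = models.resnet18(weights="IMAGENET1K_V1").eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
image = preprocess(Image.open("input.jpg").convert("RGB")).unsqueeze(0)
image.requires_grad_(True)

# Forward pass, then backpropagate the top class score to the pixels.
scores = model(image)
scores[0, scores.argmax()].backward()

# The saliency map is the maximum absolute gradient over color channels:
# pixels with larger gradients influenced the prediction more strongly.
saliency = image.grad.abs().max(dim=1)[0].squeeze()
</syntaxhighlight>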
However, these techniques are not well suited to [[Language model|language models]] like [[Generative pre-trained transformer|generative pretrained transformers]]. Because these models generate language, they can produce an explanation of their own output, but that explanation may not be reliable. Other techniques include attention analysis (examining how the model focuses on different parts of the input), probing methods (testing what information is captured in the model's representations), causal tracing (tracing the flow of information through the model), and circuit discovery (identifying specific subnetworks responsible for certain behaviors). Explainability research in this area overlaps significantly with interpretability and [[AI alignment|alignment]] research.<ref>{{Citation |
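As a minimal sketch of attention analysis, the snippet below extracts the attention weights of the publicly available GPT-2 model using the Hugging Face <code>transformers</code> library; the example sentence and the choice of model are assumptions made for illustration only.

<syntaxhighlight lang="python">
import torch
from transformers import GPT2Tokenizer, GPT2Model

# Attention analysis sketch: GPT-2 is used as an example of a
# generative pretrained transformer.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2", output_attentions=True)

inputs = tokenizer("The Golden Gate Bridge is in San Francisco", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions holds one tensor per layer, each of shape
# (batch, heads, sequence_length, sequence_length); inspecting these
# weights shows which input tokens each position attends to.
last_layer = outputs.attentions[-1][0]           # (heads, seq, seq)
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
print(tokens)
print(last_layer.mean(dim=0))                    # head-averaged attention matrix
</syntaxhighlight>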
=== Interpretability ===
Interpretability research often focuses on generative pretrained transformers. It is particularly relevant for [[AI safety]] and [[AI alignment|alignment]], as it may make it possible to identify signs of undesired behaviors such as [[sycophancy]], deceptiveness, or bias, and to better steer AI models.<ref>{{Cite web |last=Mittal |first=Aayush |date=2024-06-17 |title=Understanding Sparse Autoencoders, GPT-4 & Claude 3 : An In-Depth Technical Exploration |url=https://www.unite.ai/understanding-sparse-autoencoders-gpt-4-claude-3-an-in-depth-technical-exploration/ |access-date=2024-07-10 |website=Unite.AI |language=en-US}}</ref>
Studying the interpretability of the most advanced [[Foundation model|foundation models]] is one of the primary goals of the company [[Anthropic]]{{Advert|date=July 2024}}. This work often involves searching for an automated way to identify "features" in generative pretrained transformers like [[Claude (language model)|Claude]]. In a [[Neural network (machine learning)|neural network]], a feature is a pattern of neuron activations that corresponds to a concept. Using a compute-intensive technique called "[[dictionary learning]]", Anthropic was able to identify millions of features in Claude, including one associated with the [[Golden Gate Bridge]] and another representing the concept of a [[software bug]]. Enhancing the ability to identify and edit features is expected to significantly improve the [[AI safety|safety]] of [[Frontier model|frontier AI models]].<ref>{{Cite web |last=Ropek |first=Lucas |date=2024-05-21 |title=New Anthropic Research Sheds Light on AI's 'Black Box' |url=https://gizmodo.com/new-anthropic-research-sheds-light-on-ais-black-box-1851491333 |access-date=2024-05-23 |website=Gizmodo |language=en}}</ref><ref>{{Cite
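Dictionary learning of this kind is often carried out with sparse autoencoders trained on a model's internal activations. The sketch below is an illustrative simplification in PyTorch, not Anthropic's actual implementation; the activation dimension, dictionary size, learning rate, and L1 penalty weight are all assumptions, and the random batch stands in for activations collected from a real language model.

<syntaxhighlight lang="python">
import torch
import torch.nn as nn

# Sparse-autoencoder sketch of dictionary learning over activations.
class SparseAutoencoder(nn.Module):
    def __init__(self, activation_dim=768, dictionary_size=16384):
        super().__init__()
        self.encoder = nn.Linear(activation_dim, dictionary_size)
        self.decoder = nn.Linear(dictionary_size, activation_dim)

    def forward(self, activations):
        # Each dictionary entry ("feature") fires for a specific pattern of
        # neuron activations; ReLU plus an L1 penalty keeps the code sparse.
        features = torch.relu(self.encoder(activations))
        reconstruction = self.decoder(features)
        return reconstruction, features

sae = SparseAutoencoder()
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-4)

# One training step on a batch of activations (random placeholders here).
batch = torch.randn(64, 768)
optimizer.zero_grad()
reconstruction, features = sae(batch)
loss = nn.functional.mse_loss(reconstruction, batch) + 1e-3 * features.abs().mean()
loss.backward()
optimizer.step()
</syntaxhighlight>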
For [[Convolutional neural network|convolutional neural networks]], [[DeepDream]] can generate images that strongly activate a particular neuron, providing a visual hint about what the neuron is trained to identify.<ref>{{Cite
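The snippet below is a minimal activation-maximization sketch in the spirit of DeepDream, not the original DeepDream algorithm: starting from noise, it adjusts the pixels by gradient ascent so that one chosen channel of a pretrained torchvision network responds more strongly. The VGG-16 model and the particular layer and channel indices are arbitrary assumptions for illustration.

<syntaxhighlight lang="python">
import torch
from torchvision import models

# Activation-maximization sketch: layer and channel are example choices.
model = models.vgg16(weights="IMAGENET1K_V1").features.eval()
layer_index, channel = 20, 33

activation = {}
def hook(module, inputs, output):
    activation["value"] = output
model[layer_index].register_forward_hook(hook)

# Start from noise and repeatedly adjust the pixels by gradient ascent so
# that the selected channel's activation increases.
image = torch.rand(1, 3, 224, 224, requires_grad=True)
optimizer = torch.optim.Adam([image], lr=0.05)
for _ in range(100):
    optimizer.zero_grad()
    model(image)
    loss = -activation["value"][0, channel].mean()  # maximize mean activation
    loss.backward()
    optimizer.step()
</syntaxhighlight>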
== History and methods ==