=== Explainability ===
Explainability is useful for ensuring that AI models are not making decisions based on irrelevant or otherwise unfair criteria. Several popular techniques exist for explaining model predictions:
* ''Partial dependence plots'' show the marginal effect of an input feature on the predicted outcome.
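The idea behind a partial dependence plot can be sketched in a few lines: sweep one feature over a grid of values, clamp every sample to each grid value, and average the model's predictions. The sketch below uses NumPy with a hypothetical toy model; the function name and data are illustrative, not part of any particular library.

```python
import numpy as np

def partial_dependence(model_fn, X, feature_idx, grid):
    """Average prediction as feature `feature_idx` sweeps over `grid`,
    marginalizing over the remaining features in X."""
    pd_values = []
    for v in grid:
        X_mod = X.copy()
        X_mod[:, feature_idx] = v  # clamp the feature to the grid value
        pd_values.append(model_fn(X_mod).mean())
    return np.array(pd_values)

# Toy model: linear in feature 0, quadratic in feature 1.
model = lambda X: 2.0 * X[:, 0] + X[:, 1] ** 2

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
grid = np.linspace(-2, 2, 5)
pdp = partial_dependence(model, X, feature_idx=0, grid=grid)
# The curve for feature 0 is a line of slope 2: the quadratic term from
# feature 1 is averaged out into a constant offset.
```

Plotting `pdp` against `grid` gives the partial dependence curve; the quadratic contribution of the other feature appears only as a vertical shift, which is exactly the "marginal effect" the plot isolates.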
Scholars sometimes use the term "mechanistic interpretability" to refer to the process of [[Reverse engineering|reverse-engineering]] [[artificial neural networks]] to understand their internal decision-making mechanisms and components, similar to how one might analyze a complex machine or computer program.<ref>{{Cite web |last=Olah |first=Chris |date=June 27, 2022 |title=Mechanistic Interpretability, Variables, and the Importance of Interpretable Bases |url=https://www.transformer-circuits.pub/2022/mech-interp-essay |access-date=2024-07-10 |website=www.transformer-circuits.pub}}</ref>
Interpretability research often focuses on generative pretrained transformers. It is particularly relevant for [[AI safety]] and [[AI alignment|alignment]], as understanding a model's internal mechanisms may help identify and correct undesired behaviors.
Studying the interpretability of the most advanced [[Foundation model|foundation models]] often involves searching for an automated way to identify "features" in generative pretrained transformers. In a [[Neural network (machine learning)|neural network]], a feature is a pattern of neuron activations that corresponds to a concept. A compute-intensive technique called "[[dictionary learning]]" makes it possible to identify features to some degree. Enhancing the ability to identify and edit features is expected to significantly improve the [[AI safety|safety]] of [[Frontier model|frontier AI models]].<ref>{{Cite web |last=Ropek |first=Lucas |date=2024-05-21 |title=New Anthropic Research Sheds Light on AI's 'Black Box' |url=https://gizmodo.com/new-anthropic-research-sheds-light-on-ais-black-box-1851491333 |access-date=2024-05-23 |website=Gizmodo |language=en}}</ref><ref>{{Cite magazine |last=Perrigo |first=Billy |date=2024-05-21 |title=Artificial Intelligence Is a 'Black Box.' Maybe Not For Long |url=https://time.com/6980210/anthropic-interpretability-ai-safety-research/ |access-date=2024-05-24 |magazine=Time |language=en}}</ref>
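One common instantiation of dictionary learning for feature identification is a sparse autoencoder trained on a layer's activations: an overcomplete set of candidate "features" is learned so that each activation vector is reconstructed from a sparse combination of them. The following is a minimal NumPy sketch of that idea under toy assumptions (tiny dimensions, synthetic activations, plain gradient descent); real systems operate at vastly larger scale.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "neuron activations": 256 samples of a 16-dim hidden layer,
# secretly generated from 4 sparse ground-truth features.
true_features = rng.normal(size=(4, 16))
codes = rng.random(size=(256, 4)) * (rng.random(size=(256, 4)) < 0.3)
activations = codes @ true_features
initial_error = np.mean(activations ** 2)  # error of the all-zero reconstruction

# Overcomplete dictionary: 32 candidate features for a 16-dim space.
n_features, dim = 32, 16
W_enc = rng.normal(scale=0.1, size=(dim, n_features))
W_dec = rng.normal(scale=0.1, size=(n_features, dim))
l1_penalty, lr = 1e-3, 0.05
n = len(activations)

for step in range(500):
    f = np.maximum(activations @ W_enc, 0.0)  # sparse feature activations (ReLU)
    recon = f @ W_dec                         # reconstruct the layer from features
    err = recon - activations
    # Gradients of 0.5*||err||^2 + l1_penalty*||f||_1 w.r.t. both weight matrices.
    grad_f = (err @ W_dec.T + l1_penalty) * (f > 0)  # ReLU subgradient mask
    W_dec -= lr * f.T @ err / n
    W_enc -= lr * activations.T @ grad_f / n

f = np.maximum(activations @ W_enc, 0.0)
reconstruction_error = np.mean((f @ W_dec - activations) ** 2)
```

After training, rows of `W_dec` that fire together with high feature activations are candidate interpretable features; the L1 penalty is what pushes each activation vector to be explained by only a few of them.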