=== Interpretability ===
{{Main|Mechanistic interpretability}}
[[File:Grokking modular addition.jpg|thumb|upright=1.2|[[Grokking (machine learning)|Grokking]] is an example of a phenomenon studied in interpretability. It involves a model that initially memorizes all the answers ([[overfitting]]), but later adopts an algorithm that generalizes to unseen data.<ref>{{Cite web |last=Ananthaswamy |first=Anil |date=2024-04-12 |title=How Do Machines ‘Grok’ Data? |url=https://www.quantamagazine.org/how-do-machines-grok-data-20240412/ |access-date=2025-01-21 |website=Quanta Magazine |language=en}}</ref>]]
Scholars sometimes use the term "[[mechanistic interpretability]]" to refer to the process of [[Reverse engineering|reverse-engineering]] [[artificial neural networks]] to understand their internal decision-making mechanisms and components, similar to how one might analyze a complex machine or computer program.<ref>{{Cite web |last=Olah |first=Chris |date=June 27, 2022 |title=Mechanistic Interpretability, Variables, and the Importance of Interpretable Bases |url=https://www.transformer-circuits.pub/2022/mech-interp-essay |access-date=2024-07-10 |website=www.transformer-circuits.pub}}</ref>
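As an illustrative sketch only (the toy network, layer choice, and analysis below are hypothetical, not a method from the cited sources), one common starting point is to record a network's internal activations, for example with a forward hook in [[PyTorch]], and then examine how individual units respond to inputs:

<syntaxhighlight lang="python">
# Minimal sketch: capture hidden-layer activations of a toy network
# with a forward hook, then inspect which units are most active.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy two-layer network standing in for a model under study.
model = nn.Sequential(
    nn.Linear(16, 32),
    nn.ReLU(),
    nn.Linear(32, 4),
)

captured = {}

def save_activation(module, inputs, output):
    # Store the hidden-layer output for later inspection.
    captured["hidden"] = output.detach()

# Attach the hook to the hidden layer (index 1 is the ReLU).
model[1].register_forward_hook(save_activation)

x = torch.randn(8, 16)   # a batch of example inputs
logits = model(x)        # the forward pass triggers the hook

# Simple inspection: which hidden units fire most strongly on this batch?
mean_activation = captured["hidden"].mean(dim=0)
top_units = torch.topk(mean_activation, k=5).indices
print("most active hidden units:", top_units.tolist())
</syntaxhighlight>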
 
Interpretability research often focuses on generative pretrained transformers. It is particularly relevant for [[AI safety]] and [[AI alignment|alignment]], as it may make it possible to identify signs of undesired behaviors such as [[sycophancy]], deceptiveness, or bias, and to better steer AI models.<ref>{{Cite web |last=Mittal |first=Aayush |date=2024-06-17 |title=Understanding Sparse Autoencoders, GPT-4 & Claude 3 : An In-Depth Technical Exploration |url=https://www.unite.ai/understanding-sparse-autoencoders-gpt-4-claude-3-an-in-depth-technical-exploration/ |access-date=2024-07-10 |website=Unite.AI |language=en-US}}</ref>
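One tool used in this line of research is the sparse [[autoencoder]], which re-encodes a model's hidden activations into a larger set of features of which only a few are active at a time, making them easier to interpret individually. The following is a minimal sketch of the idea (the dimensions, random stand-in activations, and penalty strength are hypothetical, not the setup of any cited work):

<syntaxhighlight lang="python">
# Minimal sketch: train an overcomplete autoencoder with an L1 penalty
# so that each reconstructed activation uses only a few active features.
import torch
import torch.nn as nn

torch.manual_seed(0)

d_model, d_features = 64, 256              # hypothetical sizes
activations = torch.randn(1024, d_model)   # stand-in for real model activations

encoder = nn.Linear(d_model, d_features)
decoder = nn.Linear(d_features, d_model)
optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3
)
l1_coeff = 1e-3                            # strength of the sparsity penalty

for step in range(200):
    features = torch.relu(encoder(activations))   # sparse feature activations
    reconstruction = decoder(features)
    # Reconstruction loss keeps the features faithful; the L1 term keeps them sparse.
    loss = ((reconstruction - activations) ** 2).mean() + l1_coeff * features.abs().mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print("fraction of active features:", (features > 0).float().mean().item())
</syntaxhighlight>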