=== Interpretability ===
{{Main|Mechanistic interpretability}}
[[File:Grokking modular addition.jpg|thumb|upright=1.2|[[Grokking (machine learning)|Grokking]] is an example of a phenomenon studied in interpretability. It involves a model that initially memorizes all the answers ([[overfitting]]), but later adopts an algorithm that generalizes to unseen data.<ref>{{Cite web |last=Ananthaswamy |first=Anil |date=2024-04-12 |title=How Do Machines ‘Grok’ Data? |url=https://www.quantamagazine.org/how-do-machines-grok-data-20240412/ |access-date=2025-01-21 |website=Quanta Magazine |language=en}}</ref>]]
Scholars sometimes use the term "[[mechanistic interpretability]]" to refer to the process of [[Reverse engineering|reverse-engineering]] [[artificial neural networks]] to understand their internal decision-making mechanisms and components, similar to how one might analyze a complex machine or computer program.<ref>{{Cite web |last=Olah |first=Chris |date=June 27, 2022 |title=Mechanistic Interpretability, Variables, and the Importance of Interpretable Bases |url=https://www.transformer-circuits.pub/2022/mech-interp-essay |access-date=2024-07-10 |website=www.transformer-circuits.pub}}</ref>
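As an illustrative sketch only (the toy network, layer choice, and analysis below are hypothetical, not a method from the cited sources), one common starting point is to record a network's internal activations, for example with a forward hook in [[PyTorch]], and then examine how individual units respond to inputs:

<syntaxhighlight lang="python">
# Minimal sketch: capture hidden-layer activations of a toy network
# with a forward hook, then inspect which units are most active.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy two-layer network standing in for a model under study.
model = nn.Sequential(
    nn.Linear(16, 32),
    nn.ReLU(),
    nn.Linear(32, 4),
)

captured = {}

def save_activation(module, inputs, output):
    # Store the hidden-layer output for later inspection.
    captured["hidden"] = output.detach()

# Attach the hook to the hidden layer (index 1 is the ReLU).
model[1].register_forward_hook(save_activation)

x = torch.randn(8, 16)   # a batch of example inputs
logits = model(x)        # the forward pass triggers the hook

# Simple inspection: which hidden units fire most strongly on this batch?
mean_activation = captured["hidden"].mean(dim=0)
top_units = torch.topk(mean_activation, k=5).indices
print("most active hidden units:", top_units.tolist())
</syntaxhighlight>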
 
Interpretability research often focuses on generative pretrained transformers. It is particularly relevant for [[AI safety]] and [[AI alignment|alignment]], as it may make it possible to identify signs of undesired behaviors such as [[sycophancy]], deceptiveness, or bias, and to better steer AI models.<ref>{{Cite web |last=Mittal |first=Aayush |date=2024-06-17 |title=Understanding Sparse Autoencoders, GPT-4 & Claude 3 : An In-Depth Technical Exploration |url=https://www.unite.ai/understanding-sparse-autoencoders-gpt-4-claude-3-an-in-depth-technical-exploration/ |access-date=2024-07-10 |website=Unite.AI |language=en-US}}</ref>
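One tool used in this line of research is the sparse [[autoencoder]], which re-encodes a model's hidden activations into a larger set of features of which only a few are active at a time, making them easier to interpret individually. The following is a minimal sketch of the idea (the dimensions, random stand-in activations, and penalty strength are hypothetical, not the setup of any cited work):

<syntaxhighlight lang="python">
# Minimal sketch: train an overcomplete autoencoder with an L1 penalty
# so that each reconstructed activation uses only a few active features.
import torch
import torch.nn as nn

torch.manual_seed(0)

d_model, d_features = 64, 256              # hypothetical sizes
activations = torch.randn(1024, d_model)   # stand-in for real model activations

encoder = nn.Linear(d_model, d_features)
decoder = nn.Linear(d_features, d_model)
optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3
)
l1_coeff = 1e-3                            # strength of the sparsity penalty

for step in range(200):
    features = torch.relu(encoder(activations))   # sparse feature activations
    reconstruction = decoder(features)
    # Reconstruction loss keeps the features faithful; the L1 term keeps them sparse.
    loss = ((reconstruction - activations) ** 2).mean() + l1_coeff * features.abs().mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print("fraction of active features:", (features > 0).float().mean().item())
</syntaxhighlight>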