=== Interpretability ===
{{Main|Mechanistic interpretability}}
[[File:Grokking modular addition.jpg|thumb|upright=1.2|[[Grokking (machine learning)|Grokking]] is an example of a phenomenon studied in interpretability: a model initially memorizes the training answers ([[overfitting]]), but later adopts an algorithm that generalizes to unseen data.<ref>{{Cite web |last=Ananthaswamy |first=Anil |date=2024-04-12 |title=How Do Machines ‘Grok’ Data? |url=https://www.quantamagazine.org/how-do-machines-grok-data-20240412/ |access-date=2025-01-21 |website=Quanta Magazine |language=en}}</ref>]]
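The following is an illustrative sketch, not taken from the cited work, of the kind of toy experiment in which grokking can be observed: a small network is trained on modular addition with weight decay, with half of the input pairs held out as unseen data. The architecture, modulus, and hyperparameters here are arbitrary assumptions made for illustration; whether and when test accuracy jumps after training accuracy has saturated depends on them.

<syntaxhighlight lang="python">
# Illustrative sketch of a grokking-style experiment (arbitrary hyperparameters).
import torch
import torch.nn as nn

p = 97                                                  # modulus of the toy task
pairs = torch.cartesian_prod(torch.arange(p), torch.arange(p))
labels = (pairs[:, 0] + pairs[:, 1]) % p                # target: (a + b) mod p
perm = torch.randperm(len(pairs))
split = len(pairs) // 2                                 # half the pairs held out
train_idx, test_idx = perm[:split], perm[split:]

model = nn.Sequential(                                  # small MLP over one-hot inputs
    nn.Linear(2 * p, 256), nn.ReLU(),
    nn.Linear(256, p),
)

def one_hot_pairs(batch):
    a = nn.functional.one_hot(batch[:, 0], p).float()
    b = nn.functional.one_hot(batch[:, 1], p).float()
    return torch.cat([a, b], dim=1)

opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
loss_fn = nn.CrossEntropyLoss()

def accuracy(idx):
    with torch.no_grad():
        preds = model(one_hot_pairs(pairs[idx])).argmax(dim=1)
        return (preds == labels[idx]).float().mean().item()

for step in range(20000):
    batch = train_idx[torch.randint(len(train_idx), (512,))]
    opt.zero_grad()
    loss = loss_fn(model(one_hot_pairs(pairs[batch])), labels[batch])
    loss.backward()
    opt.step()
    if step % 1000 == 0:
        print(step, "train acc", accuracy(train_idx), "test acc", accuracy(test_idx))
</syntaxhighlight>

In runs where grokking occurs, training accuracy saturates early while test accuracy stays near chance for many steps before rising sharply.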
Scholars sometimes use the term "mechanistic interpretability" to refer to the process of reverse-engineering artificial neural networks to understand their internal decision-making mechanisms and components.
Interpretability research often focuses on generative pretrained transformers. It is particularly relevant for [[AI safety]] and [[AI alignment|alignment]], as it may make it possible to identify signs of undesired behaviors such as [[sycophancy]], deceptiveness, or bias, and to steer AI models more effectively.<ref>{{Cite web |last=Mittal |first=Aayush |date=2024-06-17 |title=Understanding Sparse Autoencoders, GPT-4 & Claude 3 : An In-Depth Technical Exploration |url=https://www.unite.ai/understanding-sparse-autoencoders-gpt-4-claude-3-an-in-depth-technical-exploration/ |access-date=2024-07-10 |website=Unite.AI |language=en-US}}</ref>
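As an illustration of one technique mentioned in the cited source, the sketch below trains a sparse autoencoder on model activations, decomposing them into a larger set of sparsely active features that researchers can then inspect for human-interpretable meaning. It is a minimal, hedged example: the activations are synthetic stand-ins rather than real model activations, and the dimensions and sparsity penalty are arbitrary assumptions, not values used in the cited work.

<syntaxhighlight lang="python">
# Minimal sparse-autoencoder sketch (synthetic activations, arbitrary hyperparameters).
import torch
import torch.nn as nn

d_model, d_features = 512, 4096        # activation width and feature-dictionary size
l1_coeff = 1e-3                        # strength of the sparsity penalty

class SparseAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, x):
        feats = torch.relu(self.encoder(x))   # sparse feature activations
        return self.decoder(feats), feats

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)

# Stand-in for activations collected from a language model's residual stream.
activations = torch.randn(10000, d_model)

for step in range(1000):
    batch = activations[torch.randint(len(activations), (256,))]
    recon, feats = sae(batch)
    # Reconstruction error plus an L1 penalty that encourages sparse features.
    loss = ((recon - batch) ** 2).mean() + l1_coeff * feats.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
</syntaxhighlight>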