| # | Source (linked) | NeurIPS/ICML/ICLR/Other Reputable Conferences | Reputable journal/news/500+ Citations | Does it satisfy either condition? |
|---|---|---|---|---|
| 1 | Mechanistic Interpretability, Variables, and the Importance of Interpretable Bases | No | No | |
| 2 | Zoom In: An Introduction to Circuits (Distill) | No | Yes, Distill | |
| 3 | A Mathematical Framework for Transformer Circuits | No | 633 Citations | |
| 4 | Mechanistic? · ACL workshop version | Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP (Peer-reviewed, archival workshop proceeding at ACL) | No | |
| 5 | The Building Blocks of Interpretability (Distill) | No | Distill | |
| 6 | Transformer Circuits Thread | No | No | |
| 7 | In-context Learning and Induction Heads | No (arXiv) | 600 Citations | |
| 8 | Toy Models of Superposition | No (arXiv / project page) | 368 Citations | |
| 9 | Progress Measures for Grokking via Mechanistic Interpretability | ICLR 2023 (main; notable top 25%) | No | |
| 10 | Sparse Autoencoders Find Highly Interpretable Features in Language Models | ICLR 2024 (main) | No | |
| 11 | Towards Monosemanticity: Decomposing LMs with Dictionary Learning | No | 587 Citations | |
| 12 | This startup wants to reprogram the mind of AI—and just got $50 million to do it | No | Fast Company (news) | |
| 13 | ICML 2024 Mechanistic Interpretability Workshop | ICML 2024 (Workshop; not main proceedings) | No | |
| 14 | Tweet by @ch402 (Chris Olah), Jul 29, 2024 | No | No | |
| 15 | Mechanistic Interpretability Quickstart Guide | No | No | |
| 16 | Mechanistic Interpretability Workshop 2024 (site) | ICML 2024 (Workshop; not main proceedings) | No | |
| 17 | Linguistic Regularities in Continuous Space Word Representations | ACL | 5601 Citations | |
| 18 | The Linear Representation Hypothesis and the Geometry of LLMs | ICML 2024 (main) | No | |
| 19 | Circuits Updates — July 2024: What is a Linear Representation? | No | No | |
| 20 | Open Problems in Mechanistic Interpretability | No (arXiv) | No | |
| 21 | Interpreting Neural Networks through the Polytope Lens | No (arXiv / blog) | No | |
| 22 | RNNs Learn to Store and Generate Sequences using Non-Linear Representations | 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP (Peer-reviewed, archival workshop proceeding at ACL) | No | |
| 23 | Polysemanticity and Capacity in Neural Networks | No (arXiv) | No | |
| 24 | Mechanistic Interpretability for AI Safety — A Review | No | TMLR (journal) | |
| 25 | Understanding Intermediate Layers using Linear Classifier Probes | ICLR 2017 (Workshop track) | 1126 Citations | |
| 26 | The Geometry of Truth: Emergent Linear Structure in LLM Representations | COLM 2024 | No | |
| 27 | Refusal in Language Models Is Mediated by a Single Direction | NeurIPS 2024 poster | No | |
| 28 | Linear Representations of Sentiment in Large Language Models | No (arXiv) | No | |
| 29 | Open Problems in Mechanistic Interpretability | No (arXiv); duplicate of #20 | No | |
| 30 | Investigating Gender Bias in Language Models Using Causal Mediation Analysis | NeurIPS 2020 (main) | No | |
| 31 | Locating and Editing Factual Associations in GPT | NeurIPS 2022 (main) | No | |
| 32 | NLI Models Partially Embed Theories of Lexical Entailment and Negation | Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP (Peer-reviewed, archival workshop proceeding at ACL) | No | |
| 33 | Causal Abstraction: A Theoretical Foundation for Mechanistic Interpretability | No | JMLR (journal) | |
| 34 | Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small | ICLR 2023 (poster) | No | |
| 35 | Towards Best Practices of Activation Patching in Language Models: Metrics and Methods | ICLR 2024 (main) | No | |
| 36 | Open Problems in Mechanistic Interpretability | No (arXiv); duplicate of #20 | No | |
| 37 | Attribution Patching: Activation Patching At Industrial Scale | No (blog) | No | |
| 38 | Attribution Patching Outperforms Automated Circuit Discovery | BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP (Peer-reviewed, archival workshop proceeding at ACL) | No | |
| 39 | AtP*: An efficient and scalable method for localizing LLM behaviour to components | No (arXiv) | No | |
| 40 | Axiomatic Attribution for Deep Networks | ICML 2017 (main) | 8647 Citations | |
| 41 | Open Problems in Mechanistic Interpretability | No (arXiv); duplicate of #20 | No | |
| 42 | Scaling and evaluating sparse autoencoders | ICLR 2025 (Oral) | No | |
| 43 | Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders | No (arXiv) | No | |
| 44 | Circuits Updates — January 2025: Dictionary Learning Optimization Techniques | No (blog/tech note) | No | |