# Source audit

| # | Source (linked) | Reputable conference (NeurIPS / ICML / ICLR / other)? | Reputable journal/news or 500+ citations? | Satisfies either condition? |
|---|---|---|---|---|
| 1 | Mechanistic Interpretability, Variables, and the Importance of Interpretable Bases | No | No | No |
| 2 | Zoom In: An Introduction to Circuits | No | Yes (Distill) | Yes |
| 3 | A Mathematical Framework for Transformer Circuits | No | 633 citations | Yes |
| 4 | Mechanistic? (ACL workshop version) | 7th BlackboxNLP Workshop at ACL (peer-reviewed, archival workshop proceedings) | No | |
| 5 | The Building Blocks of Interpretability | No | Yes (Distill) | Yes |
| 6 | Transformer Circuits Thread | No | No | No |
| 7 | In-context Learning and Induction Heads | No (arXiv) | 600 citations | Yes |
| 8 | Toy Models of Superposition | No (arXiv / project page) | 368 citations | No |
| 9 | Progress Measures for Grokking via Mechanistic Interpretability | ICLR 2023 (main; notable top 25%) | No | Yes |
| 10 | Sparse Autoencoders Find Highly Interpretable Features in Language Models | ICLR 2024 (main) | No | Yes |
| 11 | Towards Monosemanticity: Decomposing Language Models with Dictionary Learning | No | 587 citations | Yes |
| 12 | This startup wants to reprogram the mind of AI—and just got $50 million to do it | No | Fast Company (news) | Yes |
| 13 | ICML 2024 Mechanistic Interpretability Workshop | ICML 2024 workshop (not main proceedings) | No | No |
| 14 | Tweet by @ch402 (Chris Olah), Jul 29, 2024 | No | No | No |
| 15 | Mechanistic Interpretability Quickstart Guide | No | No | No |
| 16 | Mechanistic Interpretability Workshop 2024 (site) | ICML 2024 workshop (not main proceedings) | No | No |
| 17 | Linguistic Regularities in Continuous Space Word Representations | NAACL 2013 | 5601 citations | Yes |
| 18 | The Linear Representation Hypothesis and the Geometry of LLMs | ICML 2024 (main) | No | Yes |
| 19 | Circuits Updates — July 2024: What is a Linear Representation? | No | No | No |
| 20 | Open Problems in Mechanistic Interpretability | No (arXiv) | No | No |
| 21 | Interpreting Neural Networks through the Polytope Lens | No (arXiv / blog) | No | No |
| 22 | RNNs Learn to Store and Generate Sequences using Non-Linear Representations | 7th BlackboxNLP Workshop at ACL (peer-reviewed, archival workshop proceedings) | No | |
| 23 | Polysemanticity and Capacity in Neural Networks | No (arXiv) | No | No |
| 24 | Mechanistic Interpretability for AI Safety — A Review | No | TMLR (journal) | Yes |
| 25 | Understanding Intermediate Layers using Linear Classifier Probes | ICLR 2017 (workshop track) | 1126 citations | Yes |
| 26 | The Geometry of Truth: Emergent Linear Structure in LLM Representations | COLM 2024 | No | Yes |
| 27 | Refusal in Language Models Is Mediated by a Single Direction | NeurIPS 2024 (poster) | No | Yes |
| 28 | Linear Representations of Sentiment in Large Language Models | No (arXiv) | No | No |
| 29 | Open Problems in Mechanistic Interpretability (duplicate of #20) | No (arXiv) | No | No |
| 30 | Investigating Gender Bias in Language Models Using Causal Mediation Analysis | NeurIPS 2020 (main) | No | Yes |
| 31 | Locating and Editing Factual Associations in GPT | NeurIPS 2022 (main) | No | Yes |
| 32 | NLI Models Partially Embed Theories of Lexical Entailment and Negation | 3rd BlackboxNLP Workshop at ACL (peer-reviewed, archival workshop proceedings) | No | |
| 33 | Causal Abstraction: A Theoretical Foundation for Mechanistic Interpretability | No | JMLR (journal) | Yes |
| 34 | Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small | ICLR 2023 (poster) | No | Yes |
| 35 | Towards Best Practices of Activation Patching in Language Models: Metrics and Methods | ICLR 2024 (main) | No | Yes |
| 36 | Open Problems in Mechanistic Interpretability (duplicate of #20) | No (arXiv) | No | No |
| 37 | Attribution Patching: Activation Patching At Industrial Scale | No (blog) | No | No |
| 38 | Attribution Patching Outperforms Automated Circuit Discovery | BlackboxNLP Workshop at ACL (peer-reviewed, archival workshop proceedings) | No | |
| 39 | AtP*: An efficient and scalable method for localizing LLM behaviour to components | No (arXiv) | No | No |
| 40 | Axiomatic Attribution for Deep Networks | ICML 2017 (main) | 8647 citations | Yes |
| 41 | Open Problems in Mechanistic Interpretability (duplicate of #20) | No (arXiv) | No | No |
| 42 | Scaling and evaluating sparse autoencoders | ICLR 2025 (oral) | No | Yes |
| 43 | Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders | No (arXiv) | No | No |
| 44 | Circuits Updates — January 2025: Dictionary Learning Optimization Techniques | No (blog / tech note) | No | No |