User:JoNeedsSleep/sandbox

#	Source (linked)	Peer-reviewed conference (NeurIPS, ICML, ICLR, or other reputable venue)	Reputable journal, news outlet, or 500+ citations
1	Mechanistic Interpretability, Variables, and the Importance of Interpretable Bases	No	No
2	Zoom In: An Introduction to Circuits (Distill)	No	Distill (journal)
3	A Mathematical Framework for Transformer Circuits	No	633 Citations
4	Mechanistic? (ACL workshop version)	Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP (peer-reviewed, archival workshop proceedings at ACL)	No
5	The Building Blocks of Interpretability (Distill)	No	Distill (journal)
6	Transformer Circuits Thread	No	No
7	In-context Learning and Induction Heads	No (arXiv)	600 Citations
8	Toy Models of Superposition	No (arXiv / project page)	No (368 Citations, below the 500 threshold)
9	Progress Measures for Grokking via Mechanistic Interpretability	ICLR 2023 (main; notable top 25%)	No
10	Sparse Autoencoders Find Highly Interpretable Features in Language Models	ICLR 2024 (main)	No
11	Towards Monosemanticity: Decomposing LMs with Dictionary Learning	No	587 Citations
12	This startup wants to reprogram the mind of AI—and just got $50 million to do it	No	Fast Company (news)
13	ICML 2024 Mechanistic Interpretability Workshop	ICML 2024 (Workshop; not main proceedings)	No
14	Tweet by @ch402 (Chris Olah), Jul 29, 2024	No	No
15	Mechanistic Interpretability Quickstart Guide	No	No
16	Mechanistic Interpretability Workshop 2024 (site)	ICML 2024 (Workshop; not main proceedings)	No
17	Linguistic Regularities in Continuous Space Word Representations	NAACL-HLT 2013 (main; ACL Anthology)	5601 Citations
18	The Linear Representation Hypothesis and the Geometry of LLMs	ICML 2024 (main)	No
19	Circuits Updates — July 2024: What is a Linear Representation?	No	No
20	Open Problems in Mechanistic Interpretability	No (arXiv)	No
21	Interpreting Neural Networks through the Polytope Lens	No (arXiv / blog)	No
22	RNNs Learn to Store and Generate Sequences using Non-Linear Representations	Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP (peer-reviewed, archival workshop proceedings at ACL)	No
23	Polysemanticity and Capacity in Neural Networks	No (arXiv)	No
24	Mechanistic Interpretability for AI Safety — A Review	No	TMLR (journal)
25	Understanding Intermediate Layers using Linear Classifier Probes	ICLR 2017 (Workshop track)	1126 Citations
26	The Geometry of Truth: Emergent Linear Structure in LLM Representations	COLM 2024	No
27	Refusal in Language Models Is Mediated by a Single Direction	NeurIPS 2024 (poster)	No
28	Linear Representations of Sentiment in Large Language Models	No (arXiv)	No
29	Open Problems in Mechanistic Interpretability	No (arXiv; duplicate of #20)	No
30	Investigating Gender Bias in Language Models Using Causal Mediation Analysis	NeurIPS 2020 (main)	No
31	Locating and Editing Factual Associations in GPT	NeurIPS 2022 (main)	No
32	NLI Models Partially Embed Theories of Lexical Entailment and Negation	Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP (peer-reviewed, archival workshop proceedings at ACL)	No
33	Causal Abstraction: A Theoretical Foundation for Mechanistic Interpretability	No	JMLR (journal)
34	Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small	ICLR 2023 (poster)	No
35	Towards Best Practices of Activation Patching in Language Models: Metrics and Methods	ICLR 2024 (main)	No
36	Open Problems in Mechanistic Interpretability	No (arXiv; duplicate of #20)	No
37	Attribution Patching: Activation Patching At Industrial Scale	No (blog)	No
38	Attribution Patching Outperforms Automated Circuit Discovery	BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP (peer-reviewed, archival workshop proceedings at ACL)	No
39	AtP*: An efficient and scalable method for localizing LLM behaviour to components	No (arXiv)	No
40	Axiomatic Attribution for Deep Networks	ICML 2017 (main)	8647 Citations
41	Open Problems in Mechanistic Interpretability	No (arXiv; duplicate of #20)	No
42	Scaling and evaluating sparse autoencoders	ICLR 2025 (Oral)	No
43	Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders	No (arXiv)	No
44	Circuits Updates — January 2025: Dictionary Learning Optimization Techniques	No (blog/tech note)	No