Talk:Mechanistic interpretability

Creating the Mechanistic Interpretability article

I started this page after consulting with the meta-page XAI/Interpretability in Talk:Explainable artificial intelligence#Mechanistic Interpretability. I believe that mech interp is a sufficiently distinct movement and field of study, with a growing body of work (see main article), that it warrants a separate page. I plan to continuously improve and maintain this page, and will respond to feedback. Thank you! JoNeedsSleep (talk) 03:28, 12 May 2025 (UTC)

Requested move 12 May 2025

The following is a closed discussion of a requested move. Please do not modify it. Subsequent comments should be made in a new section on the talk page. Editors desiring to contest the closing decision should consider a move review after discussing it on the closer's talk page. No further edits should be made to this discussion.

The result of the move request was: Speedy close as moved; agreed to be a non-controversial "technical" request. (closed by non-admin page mover) BarrelProof (talk) 21:22, 12 May 2025 (UTC)


Mechanistic Interpretability → Mechanistic interpretability – titles should be in sentence case (see MOS:TITLECAPS). Alenoach (talk) 08:03, 12 May 2025 (UTC)

Agree. I don't see why this would be controversial. WeyerStudentOfAgrippa (talk) 14:32, 12 May 2025 (UTC)
Yes, it's not controversial indeed, but there is already a redirect page named "Mechanistic interpretability", so renaming it requires special permissions. Alenoach (talk) 14:58, 12 May 2025 (UTC)
The discussion above is closed. Please do not modify it. Subsequent comments should be made on the appropriate discussion page. No further edits should be made to this discussion.

Wiki Education assignment: Computation, Culture, and Society

  This article was the subject of a Wiki Education Foundation-supported course assignment, between 25 March 2025 and 31 May 2025. Further details are available on the course page. Student editor(s): JoNeedsSleep (article contribs).

Assignment last updated by Laurelli7 (talk) 19:03, 30 May 2025 (UTC)

Just curious

@JoNeedsSleep: Glad to see this page exists. Just curious, was any of the content taken from User:AryamanA/Draft:Mechanistic interpretability? (Don't really care if it was or wasn't, just glad if it was useful!) AryamanA (talk, contribs) 01:34, 1 July 2025 (UTC)

Hi Aryaman, appreciate the message! I didn’t see this sandbox page when I created the page; it would’ve been very helpful. Just had a look at your draft; your techniques section, especially the extensive entry on SAEs and their evals, would be a great addition to the current methods section. JoNeedsSleep (talk) 17:07, 1 July 2025 (UTC)
Awesome, yes I'll be contributing to the page as I have time. I've merged in the new stuff I had (mainly SAEs). AryamanA (talk, contribs) 22:07, 1 July 2025 (UTC)

Bad sourcing, COI editing

This article seems substantially cited to primary sources and unreliable sources such as arXiv preprints, and there's a call to action on an external site from the people who seem to have coined a lot of this stuff. At best it may be OR if not just straight up COI.

What's the coverage of the topic like in RSes from people who are not directly involved? - David Gerard (talk) 22:53, 13 August 2025 (UTC)

I am one of the main contributors, a current Ph.D. student at Stanford working on mechanistic interpretability and not involved in the LessWrong/AI safety sphere. I was working on this article before the call above. I have spotlight papers at NeurIPS and ICML on these topics. I intended to work on this article as a sort of survey of the field, to collate important works I frequently refer to. I do not appreciate the removal of the citations that I had added; yes, some of these are arXiv preprints, blog posts, or LessWrong posts, but the ones I added are widely used in the field. What are the appropriate guidelines on citations for scientific sources? I am not sure about the removal of my contributions by User:Stepwise Continuous Dysfunction and would at the very least like a reference to the justifying guidelines. I do not believe there are many contributors qualified to work on this article, and it would be great to have a constructive exchange since I hope to continue working on this article in a conducive environment. AryamanA (talk, contribs) 20:08, 19 August 2025 (UTC)
You can also see I had a draft up long before all this: User:AryamanA/Draft:Mechanistic interpretability. AryamanA (talk, contribs) 20:16, 19 August 2025 (UTC)
And to add a final postscript, most of the citations on the actual methods were added by me, not by the call to action authors. The revision before my edits doesn't include many technical details: https://en.wikipedia.org/w/index.php?title=Mechanistic_interpretability&oldid=1297450723. AryamanA (talk, contribs) 20:19, 19 August 2025 (UTC)
Preprints on the arXiv are almost always inadmissible for Wikipedia's purposes. See WP:RSP. Even in the case of those published by recognized subject-matter experts, where they can see some limited use, Wikipedia articles are based on secondary and tertiary sources rather than primary ones (WP:PSTS). Preprints by notable and highly-qualified subject-matter experts might see occasional use as convenience links for thoroughly uncontroversial and solidly established science, but that's about it. This is just the standard we all have to live with; much as I'd like to expound at great length upon my own research area based on the cool not-yet-in-journals things that people I know have posted, Wikipedia just isn't the place. LessWrong is a group blog and inadmissible for almost everything per WP:SPS. Honestly, it sounds like your time would be better spent writing a review article or monograph, rather than a Wikipedia page. Stepwise Continuous Dysfunction (talk) 22:28, 19 August 2025 (UTC)
Exactly. Our standards are different from those of research. A preprint which is accepted in the field and even cited in other papers may still not be a suitable source for Wikipedia. Elestrophe (talk) 23:00, 19 August 2025 (UTC)
Thanks for the links, definitely clears up some confusion. In my opinion a Wikipedia article is significantly more visible as a point of entry than a review article, so it's valuable to have this. I'll abide by the citation guidelines going forward even if it will necessarily make the article a little behind the literature. AryamanA (talk, contribs) 23:13, 19 August 2025 (UTC)
A Wikipedia article should be well behind the literature if the literature is preprints. I would advise keeping strictly to WP:RSes, which means in this case peer-reviewed at the very least. The WP:OR rule originated in physics authors putting in their own stuff - David Gerard (talk) 01:16, 21 August 2025 (UTC)
Tagged the arXiv preprints. In the general case, if it's a preprint on arXiv and not independently notable as a source, it's unlikely to be a RS. Also tagged some primary sources - David Gerard (talk) 12:53, 23 August 2025 (UTC)
Thank you very much for the detailed feedback on this article and for flagging specific unreliable sources. I wanted to address the issues you flagged. For context, I wrote the earliest versions of this article in May and June 2025, and since then many Wikipedia editors have made significant edits, especially AryamanA.
1. COI: I'm currently an undergrad at UChicago doing research in adjacent fields. I checked the edit history and Aryaman was the main person who substantially reshaped this page after me and he has clarified his affiliations (many editors made smaller edits, of course, but COI seems less relevant here). I don't believe COI is at play in the editing process of this article and would appreciate it if this tag could be removed.
2. RS: I agree that many of these sources do not satisfy Wikipedia's requirements. Thanks for pointing this out. For the sake of systematic evaluation I listed a few specific criteria for reliability - whether the source was published at a peer-reviewed academic venue, comes from a reputable news outlet, or has a lot of citations (500+) - and asked whether each source satisfied any of these bars. Let me know if you have any feedback.
Here (User:JoNeedsSleep/sandbox) is a table of all the sources and whether or not they fulfill the criteria (as of August 24, 2025). 17 out of the 44 sources listed did not satisfy these criteria - they were mostly technical blogs or arXiv preprints. I will work on removing them or complementing them with more reliable sources. JoNeedsSleep (talk) 15:17, 24 August 2025 (UTC)
I don't see a reason why the COI tag should be removed. On the contrary, if "the main person who substantially reshaped this page" is currently working on the subject itself, then the COI tag is justifiable (all the more so if that "main person" is being paid to research the subject, as seems likely for a PhD student). Stepwise Continuous Dysfunction (talk) 03:33, 25 August 2025 (UTC)
JoNeedsSleep's initial version doesn't really look promotional, and includes a "Critique" section, which he later expanded. The tone of Aryaman's edits also appears neutral, citing a wide range of researchers. Researchers writing on their topic of expertise on Wikipedia is usually fine as long as they don't overfocus on their own work or the work of their colleagues. The issue seems more about the excessive usage of preprints and primary sources, as already highlighted, and with the fact that some parts of the article may be difficult to understand for the average reader (WP:TECHNICAL). Alenoach (talk) 21:10, 26 August 2025 (UTC)