Automating Mechanistic Interpretability
Summary
Chenhao Tan introduces the MechEvalAgent framework that validates automated mechanistic interpretability research using three dimensions: coherence (following plans), reproducibility (rerunning experiments), and generalizability (extending findings).
SESSION Transcript
So this is some position talk, talks about opportunity and challenges in automating mechanistic interpretability. The hope is to encourage you to work on this topic and share some lessons that we have learned so far. And this joint work with Xiaoyan Bai and Ari Holtzman, who are both at this conference. And it'd be great if you can talk to them as well.
So I guess in many ways we started with this because we think mechanistic interpretability represents a ‘dream problem’ for research agents. It is a problem where most of the experiments can run in silico. You only need a model and some compute. And then you can do a lot of these experiments and a lot of the claims and the findings are causally testable. You can take the model and do all kinds of things like ablation and figure out whether your finding is actually true. And then there's a lot of toolkits available. So relatively, we can already do a lot of experiments with existing toolkits.
And then also not surprisingly, we also find that people have thought about it and including people in the room, people have started to work on automating interpretability at different scales. But these days, what changed is that AI now is much more capable. Even if you don't do anything, you just take Claude Code, Gemini, they can already do interpretability research for you and you can give instructions finding certain circuits in certain models. It will go out and do it. And as we started this project by talking to people in this room, for instance, we also learned many specialized efforts, for instance, from Goodfire, from Yonatan on doing specialized interpretability agents for this task. So that's what we found out that. Let's see.
All right, in 2025, I think the bottleneck is no longer having agents running experiments, is reviewing and trusting them. And I think this is a recurring theme in a lot of this session. I think it came up a couple of times and how we can use different strategies to review these agents. And the solution that we are working on so far is to think about building another agent called MechEvalAgent. And we believe that automation will not advance interpretability unless we can automate evaluation to some extent. And this is the framework that we have so far. We can break down this problem. So first we need to unify what this MechAgent, or this interpretability agent needs to produce. It is going to produce a code repo, a finding and a plan, when it was running the experiments. Then we defined these three dimensions to validate these research outputs. And the QR code is for our GitHub repo. If you're interested, please check it out. And it's pretty much still a work in progress.
‘Coherence’ defines how well the agent actually follows this plan, whether the code actually implemented the plan that is proposed. The ‘reproducibility’ is to see the standard definition, to see whether if you get another agent, can that agent actually take the code and rerun it, or re-implement certain parts and reproduce the findings that this interpretability agent claims. And finally, generalizability is one maybe more interesting insight that we came up with. Well, one key feature of science is that maybe the findings do not only constrain to this particular instance of this particular data set or this particular model, and it would generalize beyond to other cases.
And we will have question designers that will design new scenarios, ideally that are also runnable, implementable. And we can test whether finding holds or not, and see whether agents realise the finding from the interpretability results, whether they become more knowledgeable or whether they are better at predicting what's going to happen in this new setting. And that's also kind of the most more tricky scenario. And just to give you a very quick overview of what we have found so far, these are some example reports, and in practice it can be much longer. And you can also generate this kind of radar chart that describes how well the model or the research output is doing.
And next is failure. So what we find is that in a lot of cases, the model may lack meta-knowledge in these research agents. For instance, in this example, it was in a standard kind of IOI style problem and the agent doesn't really understand what a validation is. It didn't try to run any real code. It just takes your input and checks whether the node is in your source nodes. And that doesn't really qualify as a check for validation.
And the second one is very related. There are a lot of implicit hallucinations and misleading methodology. For instance, the agent may claim it runs ablation studies, but the code doesn't show that. Or it claims that it has done some causal validation, but it actually uses some non-causal checks.
And going back to the generalizability, and I do think that was kind of the most interesting angle, but in many ways we also realized that this is where... even human reviewers, for instance, when they review papers, they may disagree on what it means for study to generalize, and does the method generalize to other tasks? Does the insight generalize to other models, or does the conceptual takeaway generalize to similar context? So this leaves a lot of open problem for us to think about. How can we build this kind of MechEvalAgent? And in summary, we also do not know how to evaluate a MechEvalAgent. So that's another level that we need to think about about. With the final second, I'm going to emphasize the next frontier.
And I do think once we have both of these agents, I think it's important to think about a lot of problems that we haven't discussed in agents, like resource allocation. Mechanistic interpretability is not free. And if we can have better resource allocation, we can potentially have faster discovery of new frontiers. And this is also a great testbed to test problems such as like sandbagging or research sabotage, know more realistic ecology.
And finally, I guess just to emphasize this again, and I need to go, I think this is a great testbed for AI in research and development. And thank you.
So I guess in many ways we started with this because we think mechanistic interpretability represents a ‘dream problem’ for research agents. It is a problem where most of the experiments can run in silico. You only need a model and some compute. And then you can do a lot of these experiments and a lot of the claims and the findings are causally testable. You can take the model and do all kinds of things like ablation and figure out whether your finding is actually true. And then there's a lot of toolkits available. So relatively, we can already do a lot of experiments with existing toolkits.
And then also not surprisingly, we also find that people have thought about it and including people in the room, people have started to work on automating interpretability at different scales. But these days, what changed is that AI now is much more capable. Even if you don't do anything, you just take Claude Code, Gemini, they can already do interpretability research for you and you can give instructions finding certain circuits in certain models. It will go out and do it. And as we started this project by talking to people in this room, for instance, we also learned many specialized efforts, for instance, from Goodfire, from Yonatan on doing specialized interpretability agents for this task. So that's what we found out that. Let's see.
All right, in 2025, I think the bottleneck is no longer having agents running experiments, is reviewing and trusting them. And I think this is a recurring theme in a lot of this session. I think it came up a couple of times and how we can use different strategies to review these agents. And the solution that we are working on so far is to think about building another agent called MechEvalAgent. And we believe that automation will not advance interpretability unless we can automate evaluation to some extent. And this is the framework that we have so far. We can break down this problem. So first we need to unify what this MechAgent, or this interpretability agent needs to produce. It is going to produce a code repo, a finding and a plan, when it was running the experiments. Then we defined these three dimensions to validate these research outputs. And the QR code is for our GitHub repo. If you're interested, please check it out. And it's pretty much still a work in progress.
‘Coherence’ defines how well the agent actually follows this plan, whether the code actually implemented the plan that is proposed. The ‘reproducibility’ is to see the standard definition, to see whether if you get another agent, can that agent actually take the code and rerun it, or re-implement certain parts and reproduce the findings that this interpretability agent claims. And finally, generalizability is one maybe more interesting insight that we came up with. Well, one key feature of science is that maybe the findings do not only constrain to this particular instance of this particular data set or this particular model, and it would generalize beyond to other cases.
And we will have question designers that will design new scenarios, ideally that are also runnable, implementable. And we can test whether finding holds or not, and see whether agents realise the finding from the interpretability results, whether they become more knowledgeable or whether they are better at predicting what's going to happen in this new setting. And that's also kind of the most more tricky scenario. And just to give you a very quick overview of what we have found so far, these are some example reports, and in practice it can be much longer. And you can also generate this kind of radar chart that describes how well the model or the research output is doing.
And next is failure. So what we find is that in a lot of cases, the model may lack meta-knowledge in these research agents. For instance, in this example, it was in a standard kind of IOI style problem and the agent doesn't really understand what a validation is. It didn't try to run any real code. It just takes your input and checks whether the node is in your source nodes. And that doesn't really qualify as a check for validation.
And the second one is very related. There are a lot of implicit hallucinations and misleading methodology. For instance, the agent may claim it runs ablation studies, but the code doesn't show that. Or it claims that it has done some causal validation, but it actually uses some non-causal checks.
And going back to the generalizability, and I do think that was kind of the most interesting angle, but in many ways we also realized that this is where... even human reviewers, for instance, when they review papers, they may disagree on what it means for study to generalize, and does the method generalize to other tasks? Does the insight generalize to other models, or does the conceptual takeaway generalize to similar context? So this leaves a lot of open problem for us to think about. How can we build this kind of MechEvalAgent? And in summary, we also do not know how to evaluate a MechEvalAgent. So that's another level that we need to think about about. With the final second, I'm going to emphasize the next frontier.
And I do think once we have both of these agents, I think it's important to think about a lot of problems that we haven't discussed in agents, like resource allocation. Mechanistic interpretability is not free. And if we can have better resource allocation, we can potentially have faster discovery of new frontiers. And this is also a great testbed to test problems such as like sandbagging or research sabotage, know more realistic ecology.
And finally, I guess just to emphasize this again, and I need to go, I think this is a great testbed for AI in research and development. And thank you.