[Fireside Chat] White-box Methods for AI Control
Summary
Neel Nanda chats with Josh Clymer on AI interpretability and control, discussing the waning enthusiasm for sparse autoencoders, the shift from perfect understanding to pragmatic problem-solving in interpretability research, and the challenges of implementing control systems in AI labs.
SESSION Transcript
Joshua Clymer: All right. Welcome, everyone. I’m Josh. I work at Redwood Research. I’m working on some white box research projects, but I’m not super deep into interpretability.
Neel Nanda: Yet.
[Audience Laughter]
Joshua Clymer: [Chuckles] Yet. So hopefully I’m a proxy for people who kind of are following what’s going on, but aren’t experts. And Neel, maybe you want to introduce yourself.
Neel Nanda: Yeah. Hello! I’m Neel. I run the DeepMind mechanistic interpretability team. I do not work on control, but was asked to speak here for some reason and said yes for some reason. And I will be attempting to give thoughts informed by my general understanding of interpretability, which I think is fairly good, and my extremely shallow understanding of control that I hope will be useful to some people here.
Joshua Clymer: You don’t work on control yet.
Neel Nanda: [Laughs]
[Audience Laughter]
[Audience Laughter]
Joshua Clymer: All right. [laughs]
Neel Nanda: I mean, I assume this is why I was invited to this conference.
Joshua Clymer: That’s right. That’s right. [Tall scheme, Neel. Tall Scheme. 29:52]
Neel Nanda: Subtle marketing.
Joshua Clymer: Okay. So, you recently put out a blog post called “Negative Results for SAEs.” Did you guys give up on SAEs? What’s going on?
Neel Nanda: Basically.
[Audience Laughter]
Neel Nanda: Okay, so the backstory behind this is for a bit over a year, Sparse Autoencoders have been the thing that most people in mechanistic interpretability have been really excited about. And they seemed like they could be a promising approach to kind of perfectly decipher what models are thinking about and that just seems awesome.
And so, I thought about a year ago, I think there’s a decent chance that these things are a really big deal and we should study them and improve them and see how cool they are. And eventually, I started to become a bit more uncertain, along with a decent chunk of the field.
And in particular, I became fairly pessimistic about the ambitious mechanistic interpretability aim, which I would roughly define as models have a true set of concepts within them. And if we just train the right sparse autoencoder, it will perfectly decipher them. And I basically think this is so hopeless.
But I think that this is not sufficient for giving up on interpretability things or SAEs because ultimately, my goal is not perfectly reverse engineer models; my goal is don’t die. And I hope this is instrumentally useful. But it’s possible for imperfect interpretability to also be instrumentally useful.
And so, my next rationalization for working on SAEs became maybe there are real world tasks that SAEs can help on. And even though I know the technique is flawed, maybe it’s good enough that it’s still like a big deal that we should care a lot about.
And so, my team did a project trying to use them for the downstream task of making probes that would generalize better out of distribution. There were a bunch of other external papers in parallel trying SAEs on various different tasks. And my overall assessment is some of these worked; some of these didn’t. I don’t feel wowed by any of the things that have happened here. And this doesn’t feel the level of success that I think we should be having for the amount of investment the field is making into SAEs.
I don’t—It’s really hard to conclude that a technique is useless. And in particular, it’s really hard to conclude that a technique is not promising to keep working on because, I don’t know, maybe tomorrow someone will come up with the magic tweak where when you add an L5 term to the loss, all these problems go away or something. I think this is fairly unlikely.
And I have a decent idea of what in progress work is happening in the community and don’t feel like any of it is likely to change my mind. But, you know, I’d love to have my mind changed.
And the current take on my team is something like SAEs are one of several useful tools in the interpretability toolkit. We probably shouldn’t privilege them as the thing we focus on as the point—focal point of our research. But if in the course of a project, it seems like we’re kind of confused about what’s going on and we need help generating new hypotheses or otherwise deciphering what’s happening, SAEs are plausibly a thing we would reach for because they definitely are a lot better than nothing.
We don’t think the field as a whole should ditch Sparse Autoencoders, but we think there should be a lot less research that is kind of like assuming SAEs are awesome and pushing on them and more research that’s trying to see what you can actually do with them.
And then once we have things they work on, trying to hill climb on improving those. For example, there was a paper I really liked recently on using SAEs for hypothesis generation on kind of vaguely social-science-related datasets where this is apparently a field that has benchmarks and SAEs are SOTA, apparently. And it’s great.
And trying to improve on things like that seems awesome to me. And work that tries to understand why SAEs are not currently awesome and why they’re losing to dumb baselines like linear probes and tries to fix them in a targeted intentional way seems exciting to me. And there was a recent paper on Matryoshka SAEs I supervised that was basically saying, hmm, the fact that we optimize these things to be sparse, seems to lead to some issues. What if we made it a bit harder for it’s optimized for being optimally sparse? Oh, look, these issues mostly go away. Life is good. Anyway, end long ramble.
Joshua Clymer: Okay. So, quick summary, we want to do what works.
Neel Nanda: Hell yeah!
Joshua Clymer: SAEs don’t seem like they especially work relative to simple baselines.
Neel Nanda: Mm-hmm.
Joshua Clymer:Is that a reasonable summary?
Joshua Clymer:Is that a reasonable summary?
Neel Nanda: Yes.
Joshua Clymer:Okay.
Neel Nanda: And plus a bunch of vague speculation about whether this will change in future.
Joshua Clymer: Sure, sure.
Neel Nanda: I think my kind of distilled take is I think that a great form of interpret research is picking a thing we want to get good at and trying to do it. And I think that you should start with the stupidest baselines you can, like just prompt the model and steadily ascend up through the ranks of fancy high effort things as the dumb things fail.
Joshua Clymer: Yeah.
Neel Nanda: And so far, prompting has just solved the last two problems we tried. [Laughs]
Joshua Clymer: Okay. So, I feel like something has gradually changed in how people are talking about interpretability. I remember in the early days there was this grand ambition of really just understanding the model, you know, like understand what does that neuron do? What does this specific float in the weights do? You know, just really having the full picture in your head. And I feel like now we’re just like, well, does the technique work? Does it fit to this data set? And I’m like, okay, something has changed here and I don’t quite understand what, and I’m curious if I’m just going crazy or what’s going on.
Neel Nanda: I think this is a real shift. I also think that the field totally has lots of different people who have lots of different takes. I think Anthropic is a fair bit more into the, let’s perfectly understand the thing than I am. I think I’m more into it than many other people in the field. I know you have people like Buck who got so not into it, they just left the field and found it controlled instead. And yeah, I think that—so I think one question is, what is your ultimate goal? I think that from a scientific perspective, what is happening in these mysterious alien brains is just an extremely reasonable question.
And hypothesizing that there is some amount of ground truth and trying to decipher it, seems pretty great to me. I think this kind of research is more promising if you don’t just assume it is clean and crisp and you assume there will be some error and some missed things and you try to minimize that. I think there’s a bit of a mistake I’ve historically made where I really wanted pure and principled techniques so I didn’t do things that were kind of flawed but pragmatic, like maybe rather than training a Sparse Autoencoder, I should just try interpreting the neurons. Often this fails, but sometimes this works well enough to be useful. I should check.
But yeah, I think that there is still this scientific ambitious goal. I do have the take that a large chunk of the research there is kind of bullshit and it is really hard to do research that isn’t bullshit. And I feel like trying to do real hard to fake things that non- interp[retability] people are also trying to do and beat them is one of the few things that feels like a good path to not being bullshit, but it’s not the only one.
And then there’s the pragmatic side of things. The way I would describe the pragmatic side of things is something like basically all of machine learning treats the neural network as black box. You get something in, you get something out. Maybe you take a gradient and never analyze the gradient except to just add it to the parameters and then you go home.
But networks are so much more structured than this. They have layers, they have activations. We can see what is happening at each step. And we know that what happens between each step is mathematically fairly simple. And I don’t know, I just feel like there’s so much alpha here. There just should be so many things you can do with this that you can’t do if you don’t. And I kind of see pragmatic interpretability, which sometimes I call model internals to avoid definitional debates, is, surely there is something useful we can do here. What is that?
Joshua Clymer: So, the hope is still alive.
Neel Nanda: Mm-hmm.
Joshua Clymer: It’s just tempered a little bit. Your ambition is a little bit less large.
Neel Nanda: Mm-hmm.
Joshua Clymer: But you still think there’s some juice in understanding what’s going on.
Neel Nanda: So, I definitely think there’s juice in trying to understand what’s going on. I was kind of trying to outline a philosophy difference you can have, where one is I’m aiming for as much scientific understanding as possible and shrug, it will probably be useful eventually. I think this is a reasonable mode of research, but I do think people should be more careful than they often are, that their work isn’t bullshit. And then there are just things I care about being good at, like control schemes, like monitoring models, like steering models in certain ways.
And I want to get better at this task. I think a task that I am actually very optimistic interp should help with is debugging models, like: "Why is my search engine gaslighting users?" We’d all like to know. And I think that there is real research you can do to see if interpretability is helping.
So, I really liked the recent Auditing Game paper from Anthropic, where they had a model that they kind of messed up and gave a weird hidden goal. And then they had some teams try to interpret it, try to understand the goal. Some were allowed to use interpretability, some weren’t and the interpretability teams won.
I think this is great. I love things like this. I would love to see more. It’s a bit hard to understand the results because we’ve spent a year getting really good at SAEs. We have not spent a year getting really good at black box debugging. So it’s a bit of an unfair comparison. I think we are, in fact, a lot better at black box debugging now we’ve figured out how to do what the Interp team did without SAEs. But I think things like this are great.
Joshua Clymer: Yep, okay. So, there’s, can I solve this problem?
Neel Nanda: Mm-hmm.
Joshua Clymer: And white box techniques aside, can I just find any way to solve this problem?
Neel Nanda: Yep.
Joshua Clymer: That’s one approach. And the other approach is like, can I progress towards a greater scientific understanding?
Neel Nanda: Yep.
Joshua Clymer: Okay.
Neel Nanda: And I think that from a alignment-helping perspective, it is not obvious to me which of these is higher impact because I think that shrug understanding has got to be useful somehow. Let’s just push this forward to make sure we’re doing good science is actually the kind of thing I think might pay off. Though if I thought there was a greater than 50% chance of AGI in less than two years, I would think that is much less promising.
Joshua Clymer: Gotcha. I think this is a good –
Neel Nanda: Which I think not, to be clear.
Joshua Clymer: Yep. I think this is a good point in the conversation to clarify what we’re talking about. So, the title of the discussion was “White-box Control.” That is a bigram that many people probably haven’t even seen before. So, what is White-box Control, Neel? And how does it relate to all this interpretability stuff that you do and you were just talking about?
Neel Nanda: I did not invent the term White-box Control. But if you’re asking me to define it the way I think it should be defined and hope everyone follows my convention, sure. So, I probably think that defining exactly what terms like interpretability and mechanistic interpretability mean are kind of annoying and boring and useless because debating over words rather than actual reality never gets you anywhere.
But it is nice to have consistent terms. And roughly the way I’m using mechanistic interpretability nowadays is just work that leverages the internals of the system in order to understand it better, which can encompass things like algorithm discovery where you try to actually understand the mechanisms, which is like circuit analysis, which is kind of what people have sometimes classically used mechd-interp term for, but I think is a bit too constraining. I’d rather have a more specific term for, and things like concept discovery, which is kind of what Sparse Autoencoders do or probes do. What is the model thinking about right now? What are the things driving its decisions and how could they be manipulated?
And I kind of interpret White-box Control as—well, control does not care about whether you are using interpretability or not. It is like a pragmatic, grounded question. That’s just like, could a scheming LLM escape and kill all of humanity? We would like this not to happen.
And one tool in your toolkit probably should be things that don’t just assume it’s a black box and actually use the internals of the model and like white box in contrast to black box. And so, I kind of just define white box control as anything that attempts to help with control that actually acknowledges the existence of activations and parameters.
Joshua Clymer: Gotcha. So, you talked about these two philosophies. There’s progressing towards greater understand—scientific understanding, and then there’s just solving a problem. And if your problem is prevent AIs from escaping and killing everyone, like you said, then maybe you want to use white box control.
This is like one—one method you could try to tackle that problem with. But again, you’re just trying to solve the problem of controlling the AIs. Don’t care about progressing towards greater scientific understanding.
Neel Nanda: Yep. And yeah, I think there’s a common mistake people have where like, I don’t know, interpretability is really cool and photogenic to a certain kind of person. And so, people want to use interpretability techniques for that thing. Like I know, they saw Golden Gate Claude, and they’re like, “Ah, I want to get a model to do a bit more of something. I need Sparse Autoencoders.” Spoiler: there are way easier methods.
And I would caution anyone here against trying to use white box control because white box feels safer or more robust or something. If a technique is safer or more robust, this should ground out in empirically validatable things. Like that when you try it on a somewhat different distribution, it generalizes better than some other technique. And you should test this. And you shouldn’t just take it as a given because interpretability has good vibes.
Joshua Clymer: Yeah, that makes sense. So here’s a question. Do you think that white-box control really buys you much over black-box control? So for example, a technique that’s currently popular is chain-of-thought monitoring.
Neel Nanda: Mm-hmm.
Joshua Clymer: And some people are really into this. I’m curious if you think maybe we could just do chain-of-thought monitoring. Maybe we can just understand what models are thinking about by just looking at the tokens they output in their chain of thought instead of trying to mess with this activation stuff.
Neel Nanda: Sure. So I am a fan of chain-of-thought monitoring. I think it’s great. I would love people in this room to go and try to work on it. I expect it to eventually break because it’s just really not that hard to realize, if I talk about my plan to kill all of humanity, then the humans might notice and stop me killing all of humanity, and that would be bad. It’s just not that difficult.
I think the right conceptual way to think about—Oh, and another common take is: “Well, there have been these papers that show that chain-of-thoughts can be unfaithful. So why would you ever trust it? This is hopeless.”
But I think that’s kind of like purity thinking. Like a thing does not have to be perfect for it to be useful. And the reason I think chain-of-thought monitoring is probably fairly useful is that we know that models get better when they have chain-of-thoughts. Like, thinking models are a meaningful capabilities improvement.
Intuitively, this makes sense because if you need to do a problem that involves formulating and acting on a 10-step plan, it’s a lot easier to do one step at a time than to need to do 10 steps in a single forward pass, where no matter how many tokens you have to think on, you still have to do all your thinking in a fairly small number of layers.
And so it seems like if models want to do hard things, they are going to need to use their chain-of-thoughts somehow and the easiest way to use your chain-of-thought, we hope, is just putting it in natural language. And so we can look at it and be like, there’s something weird happening.
I think this is a nice argument. Another way of thinking about it is how much working memory does a model need to execute on its plan? Maybe you can get enough working memory from something like a single forward pass, but that seems pretty difficult and not a thing we’re probably going to train it for. Chain-of- thought gives it way more working memory.
If we potentially had spooky architectures with vector-based chain of thought rather than token-based, maybe that becomes its memory and we have no idea how it works. That would be sad. Please don’t do that.
And yeah, so distilling all that, I think that I’m most optimistic about chain-of-thoughts monitoring working when the model is trying to do something at the limit of its capabilities that can be factored into smaller tasks and where it is useful to represent intermediate results of those tasks or the results of the first task to help with the second task or something. It can’t be done in parallel.
There’s a lot of other concerns this isn’t getting into, like what if it’s encoding these intermediate things and this reasoning in what looks like natural language or it’s learned to use some weird gibberish that we don’t understand, some language that’s invented. These are things we should be looking out for but I think are kind of hard and they don’t expect to happen immediately.
Joshua Clymer: Okay, when do you expect them to happen? Say you have a AI research scientist. So let’s separate timelines. Say you have an AI research scientist. Do you think that that AI research scientist is going to have gibberish chain-of-thought?
Neel Nanda: How good an AI research scientist?
Joshua Clymer: As good as you, Neel.
Neel Nanda: Oh, I mean, that happened years ago. So the question was, will chain-of-thought monitoring work?
Joshua Clymer: For you.
Neel Nanda: Mm.
Joshua Clymer: I bet, Neel, if I had access to your internal monologue, this would make things kind of rough for you.
Neel Nanda: It does sound annoying.
Joshua Clymer: Yeah.
[Audience Laughter]
Neel Nanda: Though, I really hope I’m not doing that right now.
[Audience Laughter]
Neel Nanda: Yeah, so it’s kind of a bit unclear to me whether the right analogy for chain of thought monitoring is reading someone’s thoughts or reading all of the notes they write down when they’re forming secret plans themselves. And if I was not allowed to write anything down that leaked my plans, yeah, I could probably figure out how to overpower a kingdom of toddlers or something.
Joshua Clymer: Sure, sure. You can think for a long time though and remember things. So, imagine that you couldn’t do that. You can only think for like 10 seconds and then everything needs to be encoded in text.
Neel Nanda: Mm.
Joshua Clymer: You’re like the memento guy,
Joshua Clymer: You’re like the memento guy,
you know? You have to constantly make tattoos on your body just to navigate daily life.
Neel Nanda: This does sound hard, but also I feel like science is probably a lot more my comparative advantage than it would be for an LLM. Like, LLMs are so much better than me at loads of things, including presumably their ability to encode and decode ciphers, let alone whether via reinforcement learning they’re kind of forming their own language or something.
And so, I don’t know, there’s also just the question of do people start training the chain-of-thought to look nice, which DeepSeek already does. And there’s the question of like when lots of papers about how a really important safety technique is looking at the chain- of-thought of these models and this gets in the training data and they read the Wikipedia page on-chain- of-thought monitoring, possibly they might realize this in a weird, spooky, out-of-context reasoning-y way. And I do not understand how out-of-context reasoning works, but it seems plausible they would just not mention things.
I’m excited about evaluations that do things like fine-tune an LLM to try to take over the world or something and keep it in a really, really secure box. But [laughs] see how well it does? And if it can’t do that, maybe we’re fine. I don’t know. Given we have a minute left, I just want to–
Joshua Clymer: Yes, yes, lightning round, Neel. Okay, I wanna do two questions.
Neel Nanda: Sure.
Joshua Clymer: First question, you have five seconds. If we just train a very simple probe in the simplest, dumbest way, are we fine? Does the probe work? You have five seconds.
Neel Nanda: 60% yes.
Joshua Clymer: Okay, second question. What needs to be done at labs to use control schemes? Like what are the non-technical aspects that we should be considering here?
Neel Nanda: You need to have the infrastructure in place to use it in production, or at least internal deployment. This involves getting the people who maintain that code base to actually let you do it. This probably requires there to be an actual reason for it that is not preventing sci-fi scheming AIs unless we have a damn good evidence base. You also need to convince more powerful stakeholders that it’s in their interests.
Joshua Clymer: Like who? Like the CEO of Google?
Neel Nanda: For example.
Joshua Clymer: Okay.
Neel Nanda: I don’t know, the next time there’s a scandal for ChatGPT, someone at OpenAI shows that you can fix it with deployment monitoring techniques that just so happen to also be adaptable for control. I feel pretty great about that. And yeah, demonstrating that you can usefully solve problems that people care about.
I think the obvious one is reward hacking. So like the recent OpenAI chain-of-thought paper, they use it to detect reward hacking. It seems people at OpenAI don’t like reward hacking, so they actually use this work and they think it’s cool.
Joshua Clymer: Gotcha. But you think that if you just wave your hands and you say, look, there’s this concern, the scheming concern, like we’re not very confident about whether models are scheming or not, but they might be. We should definitely have control. You don’t think that’s enough?
Neel Nanda: I mean, I don’t know. There seem to be all these people who want really high assurance safety things, and I’m just like: "Have you seen a neural network? This seems kind of impossible."
Joshua Clymer: Gotcha. Probes, 60%, 60%!
Neel Nanda): That is not high assurance!
[laughter]
[laughter]
Joshua Clymer: All right.
Neel Nanda: Yes.
Joshua Clymer: Thank you, everybody.
[Audience Applause]