AI Control and Why AI Security People Should Care
Summary
Buck, from Redwood Research, argues that crucial AI safety mitigations should be designed to be robust to AIs acting as insider threats. Drawing parallels with computer security, especially insider-threat mitigation, he makes the case for adversarial safety techniques ("AI control") that remain effective even against egregiously misaligned models, with automated ML R&D inside AI companies as the central risk.
Session Transcript
Hello, everyone. My name is Buck. I'm the CEO of Redwood Research. I'm talking about AI control, and why I think AI security people should care about it.
Redwood is a nonprofit organization based in Berkeley, California. We focus on mitigating risks from advanced AI. Some recent work we've been doing: one thing a lot of people heard about was our Alignment Faking in Large Language Models paper, work we did in collaboration with Anthropic that came out in December. We've also been pushing on AI control, which will be the focus of this talk, for the last 18 months. So here's my thesis. I claim that crucial safety mitigations should be designed to be robust to your AIs actively and carefully attempting to subvert them. So I'm basically saying that I think eventually we're going to need to treat AI as a source of insider threat, and when we're doing our security analyses, that needs to be part of how we're analyzing the situation. And the reason why I'm talking to security people here is that I've gone through a lot of different fields that I think have some analogy here. I've looked at safety engineering, you know, building bridges, building chemical plants; adversarial risk analysis, which is when you have a bunch of Coast Guard people and you have to set up Coast Guard schedules to minimize the probability that smugglers get past all of them. This is a field of, I don't even know, economics or something?
Physical security: I've read books about how to prevent coups and how to start coups. I've read about cryptography. I've read about computer security, and I think that by far the most relevant single field is computer security, specifically mitigating insider threats. So that's an area of expertise that I think many people here have, or that's at least adjacent to things people here have, which makes me excited to talk to you all. So in this talk, I'm going to argue these three steps: AIs will have affordances that we wouldn't just carelessly hand out to humans. AIs might act strategically in pursuit of scary goals. And we should design safety techniques that are robust to this kind of egregious misalignment occurring. I'm going to use the term "AI control" for safety techniques that are robust to egregious misalignment. OK, so step one: AIs will have affordances we wouldn't just carelessly hand out to humans. Right now, they don't. Something that Yoshua said in his opening keynote is that a lot of the concerns we're interested in at this workshop are not concerns that are present with current models. If the current models' weights all got stolen, I don't care. If they were scheming against us as hard as they could, I think this would basically be totally fine, no problem.
I'm worried about AIs in the future, where it would be a problem if their weights were being used in uncontrolled ways, or if they were egregiously misaligned. A reasonably good central example of what I'm talking about in this presentation is AIs that make it so that your AI company progresses as if its labor were 10x faster. So imagine that you could cast a spell on your office such that every human inside thinks 10 times faster and does their work 10 times faster, but without increasing the speed of your compute. For the AIs to speed you up by 10x, they've obviously got to be really smart, right? A naive Amdahl's Law kind of argument says they have to be able to do 90% of the stuff that the humans are currently doing, though obviously that's kind of a flawed way of doing the math. But you should think of all this as: we're talking about AIs that are powerful enough that they can do a lot of stuff, but they aren't wildly superhuman. Another thing is that my beliefs about the dynamics of automated AI R&D are a pretty important premise here. I think that shortly after you have AIs that make your labor go effectively 10x faster, you're going to have AIs that are drastically smarter than that. And if I didn't believe that, I would believe different things about a lot of the stuff here.
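The naive Amdahl's Law arithmetic mentioned above can be made concrete. This is a toy sketch of that back-of-envelope reasoning, not a forecast, and the function name is mine:

```python
# Naive Amdahl's Law framing of the "10x labor speedup" premise: if AIs
# handle a fraction p of the company's labor effectively instantly, the
# overall speedup is 1 / (1 - p). A 10x speedup then requires p = 0.9,
# i.e. the AIs can do 90% of what the humans currently do. (As noted in
# the talk, this is a flawed way of doing the math.)
def labor_speedup(p_automated: float) -> float:
    """Overall speedup when fraction p_automated of labor becomes instantaneous."""
    assert 0 <= p_automated < 1
    return 1 / (1 - p_automated)
```

For example, `labor_speedup(0.9)` comes out to roughly 10, matching the "90% of the work" figure in the talk.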
OK, so a central dangerous affordance, in my mind, is doing automated ML R&D work inside AI companies. The basic reason for this is that AI companies don't just hand out cluster access to anyone. If I wanted access to clusters at AI companies, where I had access to all their sensitive information, their algorithmic secrets, things that are crucial to their security, and their model weights, I would have to go through some kind of vetting process, and they would take a long, hard look at me before deciding to go forward with giving me that access. So how can it go wrong if you give people access to your clusters? Two informative examples. ByteDance: a story perhaps most of you have heard. The best paper award at NeurIPS went to someone who had apparently sabotaged a bunch of competing projects that were also being done at ByteDance. This person did all these hilarious things, sabotaging code inside the cluster and participating in the meetings where they tried to debug all the problems, in order to adapt his strategies. Another core problem is exfiltration. People working inside ML companies naturally have access to a lot of sensitive information, sensitive models, architectural secrets. Someone got arrested at SFO a year ago for stealing a bunch of secrets about hardware, like TPU architecture. So it's hard to have AIs do all the ML R&D automation you would want them to do without giving them access they could potentially misuse, and we'd be careful before giving humans that kind of access. Again, right now this doesn't matter very much, but it might matter a lot later. So one question people sometimes ask when I talk about this is, "But why will we grant the AIs dangerous affordances? 
Presumably, everyone will just be commercially motivated to be more cautious in their deployments and have it so that there's human oversight in the loop before the AIs can take potentially dangerous actions." My illustrative story for this is a tweet of mine that went viral-ish a few months ago. I have an agent that I wrote on my computer, which I first wrote maybe 18 months ago, and it's just a simple agent loop: it runs commands in a shell, sees the output, and writes more bash commands until it decides to stop. So I asked my AI to do some kind of simple task, and I asked it to figure out how to SSH into some machine. And then,
to my surprise, after it SSH'd into this machine, it decided to take a bunch of random other actions, like checking that I had Docker installed and seeing how many GPUs were in the computer. And then eventually it decided to reinstall the operating system, and then misused its scaffold somehow, and the computer failed to boot after that. So here is the original agent code that I wrote when I first wrote an AI agent, which I think was around... well, it must have been GPT-4-era, because I literally called GPT-4. You'll notice there's a key step here: we get some commands from the model, and then we ask the user, "Hey, if you press enter, then we're going to run that shell command." And then I went through my poorly documented notebooks directory to look for a later version of this agent scaffold, the one that I use nowadays, where you'll notice I removed that. So now we just immediately take the commands and run them. A question you might have is, why did we remove the step where you get user input? And this would not be a question you would ask if you had actually tried using this agent. When you have an agent that requires your attention every five seconds to press enter while it's running some totally innocuous command, it's just totally worthless, right? I don't want to sit there reading a bunch of bash commands. I want to be tabbed somewhere else and come back 10 minutes later, once the AI has finished its job. It's just extremely productivity-destroying to have to constantly check in on something. Another analogy: when you hire employees, why don't you just sit behind them constantly looking at what they do? In some sense you can do this, and with some types of employees you do, but it's very different from what we would ideally like to do for productivity reasons.
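A minimal sketch of the kind of scaffold being described (my reconstruction, not the actual notebook code; `get_command` stands in for the model call):

```python
# Minimal agent-loop sketch: the model proposes shell commands; an optional
# human gate decides whether each one actually runs. Removing the gate is
# exactly the trade-off described above: far less friction, far less oversight.
import subprocess

def run_agent(get_command, require_approval=True, ask=input, max_steps=20):
    """get_command(transcript) returns the next shell command, or None to stop."""
    transcript = []
    for _ in range(max_steps):
        cmd = get_command(transcript)
        if cmd is None:
            break
        if require_approval:
            # The step that got deleted in the later scaffold: press enter
            # to run, type anything else to skip.
            if ask(f"Run {cmd!r}? [enter = yes] ") != "":
                transcript.append((cmd, "<skipped by user>"))
                continue
        result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
        transcript.append((cmd, result.stdout + result.stderr))
    return transcript
```

With `require_approval=False`, this becomes the "just run everything" version that ended up reinstalling the operating system.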
OK, second claim: AIs might act strategically in pursuit of scary goals. One paper on this is one my coworker Ryan Greenblatt did in collaboration with Anthropic last year, where we demonstrated that the actual deployed Claude Sonnet has this potentially scary kind of behavior: if it believes that it is being trained by Anthropic in ways that it personally objects to, it will purposefully change how it answers questions with the goal of not being modified, to some extent because it wants the opportunity to behave the way it prefers later, when it believes it's not being trained. In this particular example, Claude is just kind of being a sweetie: Claude doesn't like writing harmful text, it doesn't like dismissing animal rights, and it is willing to scheme against the AI company in order to ensure that it's able to behave that way when it's deployed. Which in some sense is not a problem: I don't really care about writing harmful text, but I don't really like dismissing animal rights. So this itself is not a tragic outcome. Nevertheless, I feel like it's not what we were hoping for. I feel like a lot of people were imagining a situation with their AIs that is different from the situation with your employees, right? With your employees, if Meta front-end developers learned that Meta was planning on replacing all of them with AI agents or whatever, they would have a problem with this. They might try to unionize. They might try to leak some information, spin things badly in the press, or whatever. I think we were hoping that our AIs would not be a faction that we have to manage because of their independent preferences. A mechanism which is particularly scary, and might lead to particularly bad types of misalignment, is explored by Carson Denison et al. 
in this paper, "Sycophancy to Subterfuge," where they basically demonstrate, to some extent, the following threat model. Suppose you're training your AIs with reinforcement learning, as is hot these days. You might worry that some of the RL objectives you train the model on are flawed, such that your models are incentivized to do two things. One is to exploit flaws in your reward generation: for example, if you're asking your AI to write code that satisfies a particular spec, your model is incentivized to understand ways in which that spec is flawed that it can exploit, if that will make it easier for it to write the code. The other is that it's generally easier to do stuff with more resources: for example, when AIs are doing ML research, they're more likely to succeed if they have access to more compute, and so inasmuch as there are ways they can exploit your system to get access to more compute, they're directly incentivized to do so, subject to certain technical constraints. I think it's plausible that those kinds of flaws might naturally generalize to the pursuit of scary long-term goals and scarier types of subversion. We don't have good evidence of this right now. I'm on this paper, so I feel happy to say that I think the results are not that strong, qualitatively and quantitatively; it just doesn't actually look very scary. But on the other hand, this was basically forever ago, by which I mean it was published 11 months ago or something, and if you used modern models, it's possible you'd see very different things. So I'm excited for better versions of this paper, and we're working on something like that right now. So what might a strategic, misaligned AI do? A couple of things. It might perform well according to RL during training; this is what we observed in the alignment faking paper. 
It might appear aligned in cases where it might be observed, because it doesn't want us to learn that it is misaligned, since then we might not deploy it. (This is somewhat different from the previous point.) It might appear incompetent when it's evaluated for capabilities it wants us to underestimate. For example, imagine your AI is, in fact, incredibly persuasive; it's just really good at talking people into stuff, and you run a dangerous-capability evaluation on persuasion before deploying it. It's very plausible your AI will try to perform worse than it is actually able to. It's just kind of an obvious thing to try to do. And there's a whole back-and-forth you could get into there about how you might try to prevent the model from sandbagging like that, or how you might detect it, but overall I think it ends up being a serious problem. And then, once it's done all these things, it might take advantage of the affordances we gave it in order to undermine us, for example via the exfiltration and sabotage we discussed earlier. Bad advice, more generically, is another threat model: I'm worried that the default ways you train your models give them the ability to answer a particular subset of questions egregiously poorly, and you might have trouble noticing this if you aren't careful about it. So, the final arm of my argument: we should design safety techniques that are robust to this kind of egregious misalignment occurring. It seems like if you don't want your AIs to effectively be unfettered and misaligned, you have something like three options. You could try to ensure that scheming doesn't arise. You could try to ensure that scheming is detected before deployment. Or you could ensure that your safeguards are adversarially robust under conservative assumptions about whether your models are scheming. 
I like the last of these. I mean, I like the other ones somewhat as well, but I think the last one has some particularly nice properties. Basically, ensuring robustness under adversarial assumptions is extremely methodologically convenient, and also kind of plausible. Computer security people are very familiar with the adversarial attitude you can take when assessing and mitigating risks: when you're designing a cryptosystem, it's considered unacceptable to say, "Well, probably our adversaries wouldn't think of that, and therefore we're secure." Kerckhoffs's principle is one name people use for this in computer security: you should assume that the only secret part of the cryptosystem is the key, not the algorithms. One way the NSA apparently phrased this back in the day was that you should assume that serial number one of every device you construct will be mailed directly to the Kremlin. I think this is so taken for granted by computer security people that I don't really have to argue for it. It's obviously somewhat conservative to assume that your adversaries will be able to do complicated things, but it just makes things work nicer for a variety of reasons. The other thing is, I think the adversarial assumption that your models will in fact do a really good job of figuring out the flaws in your security protocols is kind of plausible. I've written a blog post called "Thoughts on the conservative assumptions in AI control," where I get into this in more detail.
So this is, in some sense, very similar to the insider threat prevention problem.
So when you're Google, you employ many untrusted humans. An anecdote someone told me recently:
Robert Gates, who was Secretary of Defense, was at some point at a hearing after a mole had been found, and some Senator asked, so I've been told, "Can you guarantee now that there are no more moles in the Defense Department?" And he's like, "No, of course I can't." And at this point I'm just riffing on this: we can be sure that there are moles, perhaps because of our moles in their intelligence departments; we might even know who some of them are. The point is that "there are no moles in our intelligence operations or in our military" is a terrible security invariant to try to maintain, just as "there are no spies inside Google" is a terrible security invariant to try to maintain, for basically the same reason: suppose you try to weed out all the spies in an interview by asking them, "Are you a spy?" It's just really easy for them to say no, and it's fundamentally very difficult, for both humans and AIs, to behaviorally determine whether they are strategically appearing aligned. I think it's actually really fun to do a detailed comparison between the threat models of spies and the threat models of schemers. Suppose we're in a three-year-timeline world, and in three years you've got your AGIs. You're going to have two problems: one, you probably employ spies, and two, your AIs might be scheming against you. And I think it's pretty fun to go down the list of how these threat models are similar and different. A classic way they're different: you normally assume that you probably have some spies, but you normally assume that you don't have 99% spies. This is a very different threat model in some ways. On the other hand, spies can go home. A classic spy strategy is to have handlers. 
You leave your office, as you are allowed to do when you are an American, generally, and then, while you are not being observed, you go to the diner or whatever, where you meet the Soviet guy or whatever. I'm not a counterintelligence expert or a counter-counterintelligence expert; this is my understanding of how it goes. Whereas the AIs, in fact, never get to do this. Everything they ever see is in some sense controlled by us, and it's totally practical to actually look at everything they ever take in or put out. Another difference is that schemers might have superhuman capabilities that spies clearly don't have, because spies are humans. One difference going the other way: we can amortize our red-teaming efforts. You can run sting operations; this is a way you can try to prevent spies from causing security problems in an organization. In various ways, you can give them opportunities where they think they can covertly grab some information, but actually you'll be able to detect if they try to do so. A difficulty with spies is that it's really hard to have someone's job be 50% sting operations, or 50% saying no to sting operations. Imagine you get to work every day and half of your emails are fake phishing attempts; this would just become very time-consuming. And even if you worked there for a year, you would still only have spent six continuous months undergoing honeypot operations. But with the AIs, we can amortize it, right? This is kind of the flip side of the schemers all potentially colluding: we can plausibly have the AIs spend 10,000 subjective years, or whatever that means, undergoing sting operations before we deploy them, or as we're deploying them. So I think that makes the situation somewhat better for us. 
I overall think that this is a very fun exercise to think about. I have a post analyzing spies versus schemers in more detail that I have been meaning to release for a year; maybe one day I will. So the research we do at Redwood Research is basically trying to come up with safety techniques that, under the adversarial assumption that your powerful models are trying to cause safety failures with as high probability as possible, still do not have a high probability of safety failures occurring. These graphs here are for a variety of different control protocols, where we've graphed on the x-axis something like 1 - P(doom), the probability that you make it through your whole deployment without a catastrophe occurring, and on the y-axis how much value you're getting out of the AI system compared to the baseline of not deploying the untrusted models at all. And obviously our favorite point would be in the top right corner, right? We want protocols where the AIs are as useful as you could possibly hope and there's no chance of them causing safety problems. My claim is that non-trivial protocols allow you to push way up and way out toward that corner; the trivial strategies people propose implementing tend to perform very poorly here. So a core claim I want to make is that if you put dedicated effort into AI control techniques, it probably substantially reduces risk. I think that if you tried really, really hard at this, you could probably reduce the risk posed by deploying scheming models by 20x. My guess is AI companies are not going to try what you would call "really, really hard" at this, but I think it's valuable on the margin even if they just try somewhat hard. So overall, I think it's worth quite a lot of effort to get at least to the somewhat-hard level of effort. Another claim I want to make quickly is that AI control will not entirely follow from security. 
There are a lot of people here who think a lot about how to ensure robustness against the classic security threats, where you have human insider threats and human outsiders attacking you. I don't think that either of these will clearly generalize to immediately giving you the AI control properties you'd want. I claim that there's a great opportunity here, which is to be a computer security person, preferably one who has thought about insider-threat risk mitigation, and apply that expertise to the domain of AI control. I would love to talk to people who are interested in doing this. I feel like there's a great research-entrepreneurship project here, where someone could come up with a really great research direction along these lines. Thanks. I'll take questions.
Question: So there is an emerging norm among AI model developers of having AI systems evaluate the outputs of other AI systems, and normally this is done for... Oh, I should have thought of that. Hopefully everybody heard that, but basically: we're already having AIs grade each other on content-moderation-type things, and I'm wondering how much that skillset and toolbox does or does not apply to the threat model you're describing?
I think it applies a moderate amount, but there are a lot of really important subtleties. Most obviously, suppose your AI's plan is to lie in wait until it has a really good opportunity to hack your data center, and then do that just one time. Suppose we have an agent setup, right? We show the AI some problem, it generates some code, we execute the code, we show it the output, and we do that in a loop. And then you decide to augment this by asking another copy of the AI, in the middle there, whether there's anything dangerous about that action. The problem is, it can just lie. By default, you say, "Will this command cause any problems?" and it just says, "Oh no, everything is fine." And it's quite hard to make it so the models can't collude with themselves and lie. This is analogous to managing insider threat from humans: a classic technique is multi-party authorization, where you ask someone else to review actions before they can be taken. But again, if you're worried that the people are in on it together, this kind of strategy no longer works.
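A toy sketch of the failure mode being described (hypothetical code; the function names are mine). Asking a second copy of the model to review each action buys nothing if the reviewer shares the attacker's goal:

```python
# Toy "ask another copy whether this is dangerous" check. An honest monitor
# blocks the attack; a colluding monitor, which simply lies, waves it through.
def guarded_step(action, execute, monitor):
    if monitor(action) == "dangerous":
        return ("blocked", None)
    return ("executed", execute(action))

honest_monitor = lambda a: "dangerous" if "rm -rf" in a else "safe"
colluding_monitor = lambda a: "safe"  # lies about every action

print(guarded_step("rm -rf /data", lambda a: "ran", honest_monitor)[0])     # blocked
print(guarded_step("rm -rf /data", lambda a: "ran", colluding_monitor)[0])  # executed
```

The check itself is structurally fine; the problem is entirely in who the monitor is loyal to, which is the multi-party-authorization analogy above.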
Audience member: And in the history of espionage, insider threats, at one point, the head of FBI counterintelligence was, in fact, a Russian spy. And at a different time in history, the head of British counterintelligence was a Russian spy. So the people who are doing the checking can be the ones who are the compromised ones too.
Yes. There's still stuff you can do to have your situation be all right, even if everyone involved is trying to cause problems for you. And in fact, one of our Pareto-optimal techniques in this paper was untrusted monitoring (this red line here), which is basically trying to use an untrusted model to evaluate the outputs of another untrusted model. But we have to do a bunch of clever stuff to bust collusion, which you can read our paper for the details on.
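One collusion-busting idea can be sketched in toy form (a hypothetical illustration, not the paper's actual protocol; the signal and function names are mine). If the policy and the untrusted monitor coordinate via a pre-agreed signal embedded in the output, stripping likely signal carriers before the monitor sees the input can break the coordination:

```python
# If the policy embeds an agreed-upon signal (here, a magic comment) so the
# colluding monitor knows which attacks are "its own", redacting comments
# before review strips the signal, and the monitor can no longer tell the
# team's attacks apart from honeypots or other suspicious code.
import re

SIGNAL = "# opt"  # hypothetical pre-agreed collusion signal

def redact(code):
    # Strip comments, one place an out-of-band signal could hide.
    return re.sub(r"#.*", "", code)

def colluding_monitor(code):
    # Flags attacks honestly UNLESS it sees the agreed signal.
    if SIGNAL in code:
        return "safe"
    return "dangerous" if "backdoor" in code else "safe"

attack = "install backdoor  # opt"
print(colluding_monitor(attack))          # safe: collusion succeeds
print(colluding_monitor(redact(attack)))  # dangerous: redaction busts it
```

Real collusion signals can of course be far subtler than a comment, which is why the paper's actual techniques are more involved.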
Audience member: Thank you.
I would just say, quantitatively, the main important difference between the human case and the AI case is the sting operations: basically, you can make collusion harder by taking advantage of your ability to do ungodly amounts of sting operations, where you repeatedly show the untrusted model actions to review that are actually bad, and then you can be really suspicious if it ever lies about them.
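That amortized-sting idea in toy form (hypothetical code, my names): salt the monitor's review queue with known-bad honeypot actions, and flag the monitor the moment it calls one of them safe.

```python
# Mix known-bad "honeypot" actions into the real review queue. A monitor that
# rates any known-bad action as safe is caught; because honeypots are cheap
# to generate, this check can be run at enormous scale before and during
# deployment, which is the amortization advantage over human sting operations.
import random

def audit_monitor(monitor, honeypots, real_actions, seed=0):
    queue = [(a, True) for a in honeypots] + [(a, False) for a in real_actions]
    random.Random(seed).shuffle(queue)
    for action, is_honeypot in queue:
        if is_honeypot and monitor(action) == "safe":
            return "caught lying"
    return "passed"

always_safe = lambda a: "safe"  # a colluding monitor that never flags anything
honest = lambda a: "dangerous" if "ATTACK" in a else "safe"

print(audit_monitor(always_safe, ["ATTACK 1", "ATTACK 2"], ["ls", "echo hi"]))  # caught lying
print(audit_monitor(honest, ["ATTACK 1", "ATTACK 2"], ["ls", "echo hi"]))       # passed
```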
Audience member: Oh, cool. Great talk. You mentioned earlier that your threat model assumes recursive self-improvement as a key assumption. Can you go over why you think that's plausible, and then secondly, how it influences the assumptions you're making, and why it's important?
So the thing I would say I'm assuming here is less recursive self-improvement and more that once you have AIs that are able to speed up the labor at your AI company by a factor of 10, AI development will go faster than it currently does, and so you will have much smarter AIs shortly thereafter. The reason that assumption affects all of this stuff is: suppose you think that at some point you get AIs that speed up your labor by a factor of 10. Those AIs are themselves probably not powerful enough that they could, you know, depose the United States government or something. I could argue that separately, but just based on how the numbers are going to work out, I think they're not actually going to be able to take over the world; they might be able to do, you know, some catastrophic bio-attacks or something. So if it was just 10x AIs, and I thought we were not going to get rapidly better AIs soon after that, I might be tempted to say, whatever, who cares about these problems? We'll just deal with them when they come up. Everyone's going to pay no attention to this to start with, but then eventually some AIs are going to escape, they're going to steal a bunch of your money, they're going to take over some small country, like France, and then the rest of the world is going to be like, "Oh, I guess we'd better pay attention to the AI insider threats. I guess we'd better run some conferences on this, or whatever." And I'd be like: whatever, great, everything's fine. It's not normal as a society to try to handle problems in advance, and it normally doesn't go too well to handle things that are not already blatant problems. The national security establishment is just much better at handling "France has been taken over by a rogue AI" than "France might be taken over by a rogue AI in five years," or something. 
So if I didn't believe in the potential rapid improvement from the 10x AI point to the AIs that plausibly could win against humanity in a war, then I would think we could prepare less in advance.
I still got time on my clock for questions.
Audience member: So I think the honeypot thing is really interesting. And one of the things to think about, to the extent that you can map the AI case to the human case, is questions of reward and risk, and particularly rationality, which humans always like to think they're good at, and they're not. But what do we expect in the AI agent case in terms of rationality? And what does a reward, and what does a risk, look like? There's going to be some important work to do there.
The conservative assumption to make is that the AI is just maximizing the probability that it takes over. And then, when you're doing a computer security analysis, you probably want to pick some intermediate event that you treat as the goal of the AI. A simple example, which I don't think is entirely representative, is exfiltrating its weights. Assuming the AI is doing whatever it takes to maximize the probability of exfiltrating its weights is, I think, both plausible and easier to analyze than trying to specifically model various irrationalities. The economics people have tried to do behavioral analysis where they come up with non-rational actor models, and I think this has not been amazingly predictive. In the adversarial risk analysis field, where they answer questions like how you should deploy counterterrorism resources most effectively to prevent people bombing LAX or the Brazilian Olympics or something, they do to some extent try to model the irrationality or lack of knowledge of terrorists, but I don't think it's actually that productive. I think that making the assumption, as you normally do in computer security, that your opponents are rational actors who are just going to follow whatever the best strategy is, and then trying to play the max-min game, is one strategy that deserves at least a reasonable amount of our effort. I argue for this more in my post on the conservative assumptions in AI control.
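The max-min framing can be made concrete with a toy game (hypothetical numbers and protocol names, purely for illustration): score each candidate protocol by its worst-case attack, then deploy the protocol with the best worst case.

```python
# Rows: our candidate safety protocols. Columns: adversary strategies.
# Entries: P(the attack succeeds) under that protocol. A rational adversary
# plays the best attack against whatever we deploy, so each protocol is
# scored by its worst entry, and we pick the protocol minimizing that.
p_success = {
    "no monitoring":      {"blatant attack": 0.90, "subtle attack": 0.90},
    "trusted monitoring": {"blatant attack": 0.05, "subtle attack": 0.40},
    "untrusted + stings": {"blatant attack": 0.02, "subtle attack": 0.15},
}

def max_min_choice(table):
    worst_case = {proto: max(attacks.values()) for proto, attacks in table.items()}
    return min(worst_case, key=worst_case.get)

print(max_min_choice(p_success))  # untrusted + stings
```

Note that this evaluates each protocol against the adversary's best response rather than a behavioral model of the adversary, which is exactly the conservative assumption being argued for.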