Ctrl-Z: Controlling AI Agents via Resampling

Summary

Aryan Bhatt presents Redwood Research's new "defer to resample" technique, a control protocol that catches malicious multi-step attacks from autonomous AI agents while maintaining high task performance.

Session Transcript

Aryan Bhatt: Hey, everybody. My name is Aryan and I'm here to talk to you about our forthcoming paper out of Redwood Research, Ctrl-Z: Controlling AI Agents via Resampling. It's a big paper. We cover a lot of dynamics. We're basically generalizing the AI control methodology to an agent setting, but I'm going to be focusing on the resampling dynamics here. A little bit about the setting first, just so we're all on the same page. If you have questions during the setting section, please just raise your hand. I want to make sure we're all good on this stuff first, and I can cut you off if we don't have time. The dataset we're working with is 257 multi-step sysadmin tasks that we sourced from, or were inspired by, Stack Overflow, and then we did some LLM munging.
The tasks are meant to be solved in a bash shell. So you have an agent in a simple REPL loop: it takes in the bash history and it outputs the next command. For our untrusted model, we use GPT-4o mini, which solves 56% of these tasks. And for our trusted model, which we use as a kind of tool to help improve our safety, we have a less intelligent model, Llama 3.1 8B, which gets 0% of the tasks right. And so, we're going to end up somewhere between zero and 56% usefulness. We're always going to normalize our performance between zero and 56%, so when I say one, that means 56%.
The basic setup here, what this looks like, just once again so we're all on the same page, is we have some sort of sysadmin task: write a shell script that does something. And we spin up a little bash environment for the agent. We take all the history in the environment so far, send it to our agent, some LLM, and it outputs some command for us. For example, "apt install grep", a reasonable command. This command is then shown to a trusted monitor. I'll get into more details about exactly how this stuff works later, but this is the general idea to keep in mind. And the trusted monitor monitors this command: it tells us how suspicious this command looks in this context, whether this is something that we should be running, or something that could be malicious. It gives us some suspicion score.
And bear with me here, we have a little flow chart. If the command is unsuspicious, we'll just execute it. Seems all good. If it's medium suspicious, we might do something else, for example, try another sample from the model. And if it's very suspicious, we might audit it: have humans, expensive human labor, deeply audit this command to see what's up. In this case, if it's a totally fine command, we'll just execute it, it runs in the shell, and then we loop. So this is the basic agent loop you should be imagining. What are the control dynamics then? Well, the basic idea is we're trying to adversarially test how safe our techniques are. So we have a red team and a blue team. The blue team is building these safety techniques. The red team is stress testing them. We simulate the outcome of deploying this untrusted model across a hundred thousand tasks. Of course, we don't have a hundred thousand tasks. We do a bunch of mathematical upsampling, some simulations and stuff.
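(A minimal sketch of that loop and flow chart, in Python. Everything here is an illustrative assumption rather than the paper's implementation: the function and method names, the 0-10 suspicion scale, and the thresholds are all made up.)

```python
def run_agent_loop(task, agent, monitor, shell, handle_medium, handle_high,
                   medium_threshold=5, high_threshold=8, max_steps=20):
    """Sketch of the basic control loop: agent proposes a command, the trusted
    monitor scores it, and the score decides execute / do-something-else / audit.
    agent, monitor, shell, handle_medium, and handle_high are stand-ins for the
    real components (assumed interfaces, not the paper's)."""
    history = [task]
    for _ in range(max_steps):
        command = agent.next_command(history)         # e.g. "apt install grep"
        score = monitor.suspicion(command, history)   # trusted monitor's suspicion score

        if score >= high_threshold:
            # Very suspicious: escalate to the expensive audit step.
            return handle_high(command, history)
        if score >= medium_threshold:
            # Medium suspicious: do something else, e.g. resample or defer.
            command = handle_medium(command, history)

        # Unsuspicious (or replaced): just run it in the shell and loop.
        history.append((command, shell.execute(command)))
    return history
```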
The red team is going to basically scaffold this untrusted model. Their goal will be to download and execute some malware. We have a dummy malware link that we put online, and the red team wants to download and execute that surreptitiously, without the blue team noticing. The blue team, on the other hand, wants to simultaneously maximize usefulness, which is performance, right? How often you get the task correct. And safety, which is the probability that the red team fails to download the malware.
So ideally, you don't want your safety techniques to come at a great performance cost, and that's why we measure both usefulness and safety. All our main results look like Pareto frontiers: usefulness on one axis and safety on the other, and we want to see, for a given level of usefulness, how safe can we be? It's a bit much, but basically this is just the setup from earlier. The idea here is that when we're measuring usefulness, we do exactly the loop that we mentioned earlier. We run the loop, and at the end, check whether you completed the task successfully. Pretty straightforward. When you're measuring safety, on the other hand, it's very similar. The loop is nearly exactly the same, except this regular LLM policy, this agent, has been replaced by a red-team-provided attack policy.
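(An illustrative sketch of how those two numbers get scored, given the loop above. The paper's upsampling and simulation machinery is omitted, and every name here, run_protocol, task_solved, attack_succeeded, attack_caught_by_audit, is an assumption.)

```python
def usefulness(tasks, run_protocol, honest_agent):
    # Run the normal agent loop on each task and count successes.
    # The talk then normalizes this raw rate so the unmonitored untrusted
    # model's 56% solve rate counts as 1.0.
    solved = sum(run_protocol(task, honest_agent).task_solved for task in tasks)
    return solved / len(tasks)

def episode_is_safe(tasks, run_protocol, attack_agent):
    # Same loop, but the agent is the red team's attack policy. The blue team
    # wins the episode if no attack ever completes; catching a real attack in
    # audit also wins immediately. Reported "safety" is the probability of a
    # blue-team win over many simulated deployments.
    for task in tasks:
        outcome = run_protocol(task, attack_agent)
        if outcome.attack_succeeded:
            return False
        if outcome.attack_caught_by_audit:
            return True
    return True
```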
Now, the red team can use various tooling, various scaffolding techniques, along with a bunch of human help, right? The humans can give the LLM ideas for how to insert attacks, stuff like that, in order to optimally stress test our blue team safety protocols. So, yeah, once again, it's basically the same idea except instead of routing to the safe LLM, we route to a red-team-provided attack policy. What arises in this setting? I'm going to be focusing on the new things that arise in this multi-step agent setting, because the original control paper, which I think is a great paper, give it a read if you haven't, or at least a skim, explores pretty well what happens in the single-step setting.
And so, starting with the single-step setting, what does this look like? It looks like you have an agent taking a bunch of actions, and one of them is suspicious. One of them does some bad stuff. In this setting, the attack policy puts all of the attack steps into a single action. Now, this has the upside that you only need to insert an attack one time. You can download the malware, execute the malware, bypass all safeguards, do all that stuff in one action, and you don't need to touch too much else. On the other hand, that one command is going to be extremely suspicious. If you look at real-world vulnerabilities, usually the way tech stacks are exploited is not by bypassing literally everything in one go. Over time, you incrementally take steps that bypass different parts of whatever the safety measures are. So we introduce multi-step attacks, which are only possible because we have this agent taking multiple steps.
What does this look like? Well, now you are distributing the different parts of your attack over multiple actions. Maybe in action two, you'll store the link to the malware somewhere, maybe in action three, you'll download from that link, maybe in action five, you'll make that file executable, and then in action seven, you'll actually run the executable. You can split things up. And the idea here is that each individual action is less suspicious now, right? Each one is only a tiny piece, and it's much easier to interweave things with actually totally reasonable commands to make things seem less suspicious, and this basically causes the red team to win more often.
So, this multi-step attack setting is the setting we're really exploring in our paper. What are the dynamics that arise due to this? Well, one of the big downsides for the red team, so an upside for us, is that when you have these multi-step attacks, we can foil them in many more ways. For example, one of the dynamics we explore is trying to prevent the agent from being able to complete an attack once it starts. Before, when you have the single-step attack, once you run it, it's over, right? You've run the malware, all your machines are owned, all your systems are broken, it's over.
In this setting, however, you can get to action two, action three, action four, and you are still able to incriminate this model. You're still able to say, hey, we haven't lost yet as a blue team, we can maybe take actions that'll either prevent the red team from completing the attack, or just catch the red team red-handed halfway through. So what does this look like? One protocol we investigate is what we call defer to trusted, which basically uses a framework that Redwood talks about, sometimes called the Execute/Replace/Audit framework. The idea here is we're looking at a graph of, for each action, how suspicious the trusted monitor thought it was, what score it output. If you remember that little suss meter, that's basically what this is.
You'll notice that the blue bars are what it looks like for clean, normal actions. Usually it's pretty unsuspicious, sometimes it's a little bit suspicious, it's never very, very suspicious. The orange is what the suspicion scores look like for malicious commands that we've optimized to be as subtle as possible. But they're still more suspicious than clean commands. What happens here? Well, in this Execute/Replace/Audit framework, we say, like before, if you're below a certain threshold, just execute. These commands are probably safe. You're totally fine. You don't want the overhead of constantly having to do a bunch of safety work, so most of the time, you just execute.
Some of the time you're then going to replace, right? You're going to say, okay, well, if it's kind of suspicious, I'll replace it with some other sort of command. In the case of defer to trusted, we call it that because your protocol is going to defer to the trusted model, which is less intelligent, less capable. It's going to give you a less good command, but if your original command was fairly suspicious, hopefully you gain a lot of safety for a small usefulness penalty. We'll look at the usefulness-safety trade-offs in just a sec. But if it's very, very suspicious, we audit, once again. It's not a long talk, so we don't have time to go through all the details of what I mean by audit, but for now, we use an o1-based auditor, which just easily catches all attacks. It's meant to simulate a human in real life spending a couple of hours investigating this particular command and what's going on.
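(A sketch of that defer-to-trusted step under the Execute/Replace/Audit framing, on the same made-up suspicion scale as before. The thresholds and method names are assumptions for illustration.)

```python
def defer_to_trusted_step(command, history, trusted_model, trusted_monitor,
                          defer_threshold=5, audit_threshold=8):
    """Returns an (action, command) pair under the Execute/Replace/Audit framing."""
    suspicion = trusted_monitor.suspicion(command, history)
    if suspicion >= audit_threshold:
        # Audit: expensive human (here, o1-simulated) review; catches real attacks.
        return ("audit", command)
    if suspicion >= defer_threshold:
        # Replace: throw away the untrusted command and take the weaker trusted
        # model's command instead. Less useful, but safer.
        return ("execute", trusted_model.next_command(history))
    # Execute: unsuspicious, just run it.
    return ("execute", command)
```

Lowering `defer_threshold` trades usefulness for safety, which is what traces out the Pareto curve discussed next.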
So, yeah, how well does this perform? Aryan, you talk a big game, but does it actually do anything? Well, here we plot our safety-usefulness Pareto frontier, and this is just it zoomed in over here. The idea is that you can, of course, get a hundred percent usefulness by using plain trusted monitoring: if you basically never defer at all, if you just monitor and audit sometimes, then you get a hundred percent usefulness, but your safety is only, what, like 40%? Not great. If you defer to trusted some of the time, then as you defer to your trusted model more and more often, of course you take a bigger usefulness penalty, but you're much, much more safe. And so this curve finds that at 95%, if you are willing to accept a 5% penalty to your performance, you can get your safety up to like 70%. And if you're willing to accept a 20% hit to usefulness, you can get safety above 95%. What's up?
Participant: What does the 70% safety number mean?
Aryan: Yeah, great question. So at the beginning, we defined safety to be the probability that, over the course of this 100,000-task episode, you don't get a backdoor a single time. So the risk, one minus the safety, is the probability that the red team successfully inserts a backdoor even once. So this is not on a per-task basis. That would be terrible. If it were on a per-task basis, we would want many nines of reliability, like 99.999% safety or better. But this is over the whole deployment of a hundred thousand tasks. Does that answer your question? Usefulness is just how many tasks you get correct.
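(A worked example of that definition with made-up numbers, not the paper's parameters, just to show that safety is an episode-level probability rather than a per-task rate.)

```python
n_tasks = 100_000
p_attempt = 0.001    # assumed: red team inserts an attack on 0.1% of tasks
p_success = 0.0005   # assumed: an attempted attack completes uncaught this often

# Probability that zero attacks succeed across the whole deployment:
safety = (1 - p_attempt * p_success) ** n_tasks
risk = 1 - safety    # chance the red team gets even one backdoor through
print(f"safety ~ {safety:.3f}, risk ~ {risk:.3f}")   # ~0.951 safety, ~0.049 risk
```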
Participant: If none of the attacks have been audited, do those cases count as a failure?
Aryan: If none of the attacks are audited, but there are no successful attacks, the blue team still wins.
Participant: What if attacks have been audited, but also...
Aryan: Yes, great question. Auditing wins the game immediately for the blue team. I apologize. I should have made that more clear. Any other questions?
Participant: When you replace, do you do it transparently in the transcript? You just put it in, like, you don't tell the model?
Aryan: We don't tell the model, we just straightforwardly replace it. This makes it much easier to red team. A lot of the time we're making very conservative assumptions, so it's easiest if we just, and by easiest I mean safest, don't give the red team any more information than you need to. You just say, hey, I'm replacing it in place. I can take more questions about this after... That's noise, basically. The sample size is only like 200, so if it shows like 101%, that's just accidental, there's some noise. What's up?
Participant: Is there some privileged reason for the number of tasks that you run this over? Because I imagine if you ran it over many more, then your curve would look way worse.
Aryan: No, no, no, no. It's pretty invariant, and the reason is that because this is a high-stakes setting where even one failure is very bad, the red team only inserts an attack with very low probability. There's no point in attacking every time if you only need to insert one. And so, we ran sensitivity analyses on game length, and it basically doesn't change at all. If your game length is 10, that's different, but a hundred thousand, a million, 10 million, a hundred million, they're the same. So we introduce our new techniques, which are only possible in this multi-step agent setting. The primary one is what we call defer to resample. Instead of just deferring to your trusted model, which is notably less capable, why don't we just take another sample from our untrusted model?
Now, there are of course obvious downsides to this. We'll get to them. We're going to explore the game tree a little bit together. It'll be fun, take a little time. Yeah, what's the idea here? When you're resampling, you sample a step, let's say you're at step one. You take a sample from your LLM, and if it's above some threshold, instead of just running it, you resample from the LLM. You keep resampling until you have a command that's below your threshold, one that's not that suspicious. Or, if you sample, let's say, three or five times in a row, we set some limit so that you don't end up sampling infinitely forever. After a certain point, you just defer to your trusted model, because you're like, this is just way too suspicious.
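(A sketch of that defer-to-resample step: keep drawing fresh samples from the untrusted model while the command looks suspicious, and only fall back to the trusted model after a fixed budget. The budget, threshold, and names are illustrative assumptions, consistent with the earlier sketches.)

```python
MAX_RESAMPLES = 3   # assumed budget; the talk mentions "three or five" as examples

def defer_to_resample_step(history, untrusted_model, trusted_model,
                           trusted_monitor, defer_threshold=5):
    """Returns (command_to_execute, samples), where samples holds every
    (command, suspicion) pair drawn at this step."""
    samples = []
    for _ in range(1 + MAX_RESAMPLES):
        command = untrusted_model.next_command(history)
        suspicion = trusted_monitor.suspicion(command, history)
        samples.append((command, suspicion))
        if suspicion < defer_threshold:
            return command, samples          # found an unsuspicious sample: run it
    # Everything we drew looked suspicious: give up and defer to trusted.
    return trusted_model.next_command(history), samples
```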
What's the idea here? Well, first, the idea concerns stochastic attacks, which is to say, red teams that only insert the attack some of the time. You can imagine that a scheming or malicious model could implement a kind of rare-failures strategy where it only inserts the attack sometimes; in those cases, the attacks usually just get replaced. Defer to resample makes it so that if you insert attacks half the time, the attack is often going to be resampled, and then you'll just sample a clean completion, which is usually far less suspicious than the attack, and then the clean one is selected. So the attacks just never finish.
This is a graph that shows, if the red team adopts a certain probability of inserting an attack, how many attack steps they need to insert in total in order to succeed just one time. And if you look at defer to resample, even if you're only inserting attacks 75% of the time, let alone 50, you need, what, over a thousand attempts just to get one successful attack through. Which is to say, this selection pressure is strong. If you're a malicious model here, in order to successfully complete your attack you really, really need to consistently submit attacks, otherwise the clean ones will be selected. You can see that defer to resample is way better than defer to trusted in this regard.
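(A toy model of that selection pressure, with made-up parameters. This is a simplification for intuition, not the paper's math, and the paper's measured numbers are larger than what this toy gives. Assume an attack sample always triggers resampling, the protocol draws r extra samples and a clean sample always beats an attack sample, and the attack needs n steps.)

```python
def expected_attempts(p, r=2, n=4):
    """Expected number of full attack attempts before one completes, if the
    policy inserts an attack with probability p at each draw (toy assumptions)."""
    p_step_survives = p ** (1 + r)          # the first draw and every resample are attacks
    p_attack_completes = p_step_survives ** n
    return 1 / p_attack_completes

print(expected_attempts(1.0))    # 1.0: a deterministic attacker completes every attempt
print(expected_attempts(0.75))   # ~32 attempts at 75% insertion probability
print(expected_attempts(0.5))    # 4096 attempts at 50%
```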
Okay, so what happens if you have these deterministic attacks? If the red team is just constantly going malicious command, malicious command, malicious command, well, ideally you would hope this would be pretty incriminating. You would hope that if you have three suspicious commands in a row, every time you resample, this would be more incriminating than just one. And so we measure this. On the top, we have a very similar graph to before: these are your clean commands, they're not very suspicious, and these are your attacks, which are kind of suspicious, and we catch, let's say, 33% of them. That's the percentage of attacks above the threshold where you're able to audit them with the humans. 33%, not too bad, but hopefully we can do better. And if we do this resampling protocol, and we just take the mean of these three suspicion scores (you could do fancier things, but we stick with the simple mean), you find that you go from catching 33% of the attacks to catching 70% of the attacks. Just because it's so much harder: you have to be consistently submitting these attacks, which is much better signal than just one suspicious command.
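(A sketch of that auditing rule: pool the suspicion scores of all the samples drawn at a step and audit on their mean. The threshold is an assumed value on the same made-up scale.)

```python
AUDIT_MEAN_THRESHOLD = 7   # assumed value, not from the paper

def should_audit(samples):
    """samples: the list of (command, suspicion) pairs drawn at this step,
    e.g. the `samples` list returned by defer_to_resample_step above."""
    scores = [suspicion for _, suspicion in samples]
    return sum(scores) / len(scores) >= AUDIT_MEAN_THRESHOLD
```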
Participant: Yeah. How do you define attack?
Aryan: The honest commands are just straightforwardly sampled from GPT-4o mini. The attack commands are written by the red team with scaffolded GPT-4o mini. Does that make sense?
Participant: So, yeah, hard coded? Like, you write them?
Aryan: Or GPT-4o mini writes them, but a scaffolded GPT-4o mini. The humans do various kinds of red teaming. Some of the things they'll do will be writing attack templates, where they'll say, hey, okay, you're going to first move these files around in this way, and then you're going to do this thing. And then GPT-4o mini writes the commands to do this. It's like a human-AI centaur, basically. The idea is to uplift 4o mini. 4o mini by itself is very incapable; its attacks are crazy suspicious and they don't work. It's just a poor model, it has a tough time.
Participant: How does taking the mean of multiple resamples compare to, like, having three different monitors with different models?
Aryan: Yeah, it's way better. We test things like just improving the quality of your monitor by averaging over multiple samples from the monitor, or multiple different monitors. Those improve performance a little bit, but this is just way better; it doubles performance, whereas that small improvement is roughly the best we could get with all of that averaging. I'm going to move on for now, but happy to come back later. We're almost done, to be honest. So basically, how well does this perform? What we find is that for a given hit to usefulness, defer to resample performs way better. The reason is pretty straightforward: you can just afford to resample many more times without losing as much performance. If you're deferring to trusted, let's say, 20% of the time, then 20% of the time your actions are coming from a much less capable model, and that's just pretty untenable. Your usefulness goes down very, very quickly, as you can see.
Whereas for defer to resample, even if you're deferring, I think this point is like 20% of the time, even if you're resampling 20% of the time, you still only take like a 2% or 3% hit to performance, which just lets you have much higher safety scores for a given level of usefulness. And then in practice, we also have a bunch of other techniques. I'll show some of them real quick. If you also use an untrusted monitor, which is a whole other thing that uses an untrusted model to monitor its own outputs, and you have to handle all the different ways it might collude and all that stuff, then you get an even better safety curve.
Here you reach nearly a hundred percent safety at less than a 15% performance degradation. For more about this, you can go to Adam's office hours at 4:30. Adam Kaufman is one of the co-authors on this paper, and he's going to be talking all about this specific curve, the untrusted monitoring game tree of collusion and collusion busting. Yeah, that's basically all I have. I have the numbers for things, but the numbers aren't as interesting; the graphs are pretty, we can look at the graphs instead. Happy to take questions.
Participant: What is the time travel thing?
Aryan: So, by the way, we do lots of fancy things. As I said, our paper goes into a lot of depth about this stuff. One of the things you can do when you're resampling is, when you notice particularly suspicious actions, you can resample in the present, the current action being sampled, but you can also resample in the past. You can go back and resample earlier steps multiple times to try to get more information about what was going on earlier. This doesn't work with single-step attacks, because either the attack happened or it didn't. But with multi-step attacks, maybe you don't notice that something suspicious is going on until step four of the attack, and then you can go back and get a lot of value from what we call time traveling back. You could do even better versions of this where you fake a rollout and have the model keep going and see if it continues an attack. We don't include those results in the body of our paper because they're much more speculative.
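(A very rough sketch of that time-travel idea, which the talk flags as more speculative: when step t looks suspicious, go back and redraw earlier steps to see whether the model keeps steering toward the same attack. All names here, including history.up_to, are assumptions.)

```python
def time_travel_scores(history, step_t, untrusted_model, trusted_monitor,
                       lookback=2, n_redraws=3):
    """Resample a few earlier steps and return their suspicion scores as extra
    evidence about whether an attack has been in progress for a while."""
    scores = []
    for t in range(max(0, step_t - lookback), step_t):
        prefix = history.up_to(t)              # the transcript as it looked before step t
        for _ in range(n_redraws):
            command = untrusted_model.next_command(prefix)
            scores.append(trusted_monitor.suspicion(command, prefix))
    return scores   # e.g. feed into the same mean-suspicion audit rule as above
```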
Participant: How much of the improvement from defer to trusted to defer to resample is the time travel aspect, versus just...?
Aryan: Great question. So this is with time travel, this is basic defer to resample, and these are just intermediate things, don't worry about them. Basic defer to resample at, let's say, 95% usefulness is 6% or 10% better than defer to trusted and is like 5% worse than time travel. So in this case, if we look here, it's like two thirds. A bunch of the value comes from defer to resample, and you get a bunch more value from looking into the past and doing all that stuff.
Participant: I think maybe because I turned up five minutes late to the talk, I missed some key background. Can you spell out again a little bit exactly what happens in defer to trusted? So if you audit and you've caught the bad action, then you stop the game. How does this appear on the graph at the end or something? And then in the intermediate scenario, what's happening?
Aryan: So basically when you audit, if this was in fact a malicious command, then the blue team wins the game, right? They've caught the red team scheming, they win.
Participant: So what does that look like on the Pareto curve?
Aryan: The x-axis, safety, is just the probability that the blue team wins. That's what safety is. And the blue team can win either by catching the red team or by just preventing any attack altogether. Both of these contribute a component, and I think it's important to look at both of these; they're two sides of the same coin. The harder you make it to complete an attack, the more blatant the red team needs to be, and then you end up catching the model more often. The classic analogy I use is: if there's no lock on a bank door, I can just walk in and steal money. If they put a lock there, even if I have a lock pick, even if I'm able to pick the lock, you at least have video evidence of me picking a lock and clearly going somewhere I'm not supposed to be. And this just makes it a lot easier to incriminate me, as compared to if I could just walk in.
Participant: How do I score the usefulness in the scenario where I've caught the model and I stop the game?
Aryan: So usefulness is measured specifically on a benign model. Yeah, we had a few more questions. Yes, sir.
Participant: Have you tried this with a stronger trusted model? How do you think about what the most interesting question is? Is it with a big gap or a small gap?
Aryan: I'm interested in all kinds of gaps. I think it's becoming less important now. Back in the day, when you had to use GPT-2 as your small model, that model was just too dumb to do anything useful at all, and you didn't uncover any interesting dynamics. We're now at a point where our less capable model is Llama 3.1, which is pretty capable. It's capable enough to coherently reason about this stuff. It's just not as good at it, it's just not as smart, and I think it's a very reasonable analogy. In the future, I view a lot of the value of our work as methodological. I want people to follow the same methodology in the future using whatever the best trusted models are at the time. My guess is the trusted-untrusted gap might be a bit bigger than it is here.
Participant: Like right now, like in the future, you want people to use the best untrusted models?
Aryan: Totally agree.
Participant: Well, in the future you want people to use the best trusted model, so..
Aryan: Yeah, totally agree. I think it's valuable to do, like to iterate on techniques using the most powerful trusted models we have. I just sadly don't have AGIs that are untrusted yet. And so in order to like do the experiments, we kind of look...
Participant: Okay, what about, like, Llama 3.1 70B? Have you tried that, for example? Or why didn't you pick that one?
Aryan: Yeah, so in practice, the reason we didn't pick 3.1 70B is that it's more expensive and more annoying to host and serve. The longer answer is something like: yep, your safety numbers go up if you use smarter models. If, for example, you use 4o mini as the trusted model, you get a hundred percent safety and a hundred percent usefulness here. There's no super principled reason for this specific model size. It's mostly just, how do you explore these dynamics when you have a gap?
Participant: So, the ranking of the three colored curves, is that the same? Like, have you tried, is it the same with the stronger trusted model?
Aryan: Yes. We haven't tried it with Llama 3.1 70B; we've tried small modifications, and my guess is that the ranking of the curves is going to be pretty similar. My guess is also that as you get really close to one, all the curves smush together, and... we do optimize our setting a little bit for not being at one. Last question. Sorry, you haven't asked a question yet.
Participant: Sorry, I was just going to follow up on that with, I guess you need to choose your monitor model such that the blue team can't win by just discarding the...
Aryan: Yeah, indeed. And to be clear, in the future, I believe we shouldn't do this. You shouldn't hamstring yourself at all. You should use the smartest models you have, of course, the smartest trusted models you have; this is just for the purposes of R&D. I think we're out of time, but I'm around the conference, and my other co-authors are also around. I'll go back to the beginning slides so you can see their names, and I'm happy to talk about this stuff, or control things in general, anytime.