AI Control: Addressing Risks from Agentic Internal Deployments

Summary

Mary Phuong presents AI control as our strongest short-term defense against misaligned AI, outlining the threat from autonomous agents entrusted with critical systems and explaining how automated monitoring, escalation processes, system design, and control evaluations can address these risks.

Session Transcript

Really excited to be here and to see people interested in AI control. So my goal for this talk is to make sure that when you walk out of this room, you'll have a sense of what AI control is about, and hopefully that happens in the next 15 minutes. So what is AI control? AI control is a subfield of technical AGI safety that is focused on mitigating risks from misaligned models.
In the space of possible technical mitigations for misaligned AI, I see it as one of two main approaches that we could take, the other being alignment. But before we go into the details of AI control, first, why should you even care? So the big reason to care about AI control is that misalignment is looking more plausible as a potential threat model. So some of you have probably seen recent reports on models behaving in a way that is not what their developers expected.
So the first example I have in mind is recent reports of reward hacking. This has been shown in both OpenAI and Anthropic production models, and both of these models reward hack spontaneously.
And what I mean by that is that when they're given a task, they solve it in a way that basically exploits loopholes in the task or they cheat. So a typical example would be if a model is asked to write some code, it would instead modify the test to make its bad solution pass. Another example is perhaps more concerning.
This is where models pursue long-term goals that were not intended by their developers. So two examples I have in mind. One is alignment faking, and the other is models trying to self-preserve.
Both of these were shown in frontier Anthropic models. And in the latter case, the model does sometimes take extreme actions, like blackmailing an engineer who would be in charge of resetting the model, or self-exfiltrating its weights.
But I should say that both of these have only been observed in relatively contrived test scenarios, not in production yet. So misalignment risk is exacerbated by models getting better at long-horizon tasks. Helen has already shown this plot, so I won't go into detail, but basically it shows that the length of tasks that models can do autonomously has been doubling every seven months.
So if this trend holds up, in 2027, we could have agents that can do eight-hour tasks. By the end of 2028, we could have models that can do 32-hour tasks autonomously. And I anticipate that what this means for internal deployments is that labs will switch from using models in the kind of human-centric deployment model where most of the work or information flows through a human to a more agentic deployment model where the AI has access to various interfaces directly.
It can access production databases, it has access to code and weights, it talks to its manager from time to time. But basically, it's not bottlenecked by a human copy-pasting its answers to the relevant place in the codebase, for example. So I hope you see where this is going.
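To put rough numbers on that extrapolation, here is a small back-of-the-envelope sketch. The one-hour starting point is an illustrative assumption rather than a figure from the talk; only the seven-month doubling time comes from the trend mentioned above.

```python
# Back-of-the-envelope extrapolation of the task-length trend: if the length of
# tasks that models can do autonomously doubles every 7 months, how long until
# they reach a given task length? The 1-hour starting point is an assumed placeholder.
import math

DOUBLING_MONTHS = 7

def months_until(target_hours: float, current_hours: float = 1.0) -> float:
    """Months until the autonomous task length reaches target_hours."""
    return DOUBLING_MONTHS * math.log2(target_hours / current_hours)

print(months_until(8))   # 21.0 months to 8-hour tasks
print(months_until(32))  # 35.0 months to 32-hour tasks
```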
If we take misalignment risk seriously and also this trend of more and more agency in internal deployments, we can see why this could be concerning. So three points to try to illustrate this risk. First is that models will probably get access to various critical assets.
For example, to do different types of ML research, models might need access to model weights or to large amounts of compute. They might need to access sensitive code. For example, they could be involved in writing training code or inference code for the next generation of models.
They could be building evaluations or writing monitoring code. And these are all opportunities for misaligned models to cause harm. Another point I want to mention is that it's very unlikely that human oversight will scale to this regime.
And this is because once each human is managing tens or hundreds of AI agents, it won't be feasible to look at every single line of code that these agents write. And so there will have to be some different mechanism of maintaining control. And third, I want to mention correlated misalignment.
This is something that doesn't happen with human insiders, for example. We can expect that internal usage will concentrate on the models that have the best capabilities, for example in software engineering or in ML engineering. And so if that particular model checkpoint is misaligned, then suddenly we have the situation where a large portion of the AI workforce at a company is misaligned. So how can AI control help with this situation? We can think of it as there being two points of intervention.
One is during training and one is during deployment. So in theory, if we aligned our models perfectly during training, there would be no need to do anything at deployment time. But in practice, it's very unlikely that we will have succeeded perfectly during training.
And also, during deployment, models will see a much larger range of situations, and their actions will be much higher stakes because they will be interacting with real systems and real codebases; their actions will have real consequences on the world. And so AI control is the second layer of defense in addition to alignment, and it operates on the assumption that we have failed in the first stage. So the assumption is that the model instances we're running are misaligned, and we're trying to make sure that the whole system is safe anyway.
I'm now going to talk through what I see as the four main components of AI control, although I should say that the field is quite nascent and this is likely to change. So the first component is automated monitoring. As I've said, human oversight won't scale.
So what are we going to do? Well, we will have AIs overseeing AIs. This is a great idea. Why not? How does it actually help? So there are multiple ways it can help.
And the most basic one is trusted monitoring. This is where the actions of an untrusted model are overseen by another model, one that we trust. The caveat is that currently the only way we have of establishing trustedness is checking that the monitor doesn't have sufficient capabilities for, say, deception or strategic reasoning.
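To make this concrete, here is a minimal sketch of what a trusted-monitoring loop could look like. Everything in it is an illustrative assumption rather than any lab's actual pipeline: the suspicion threshold, the prompt, and the keyword heuristic standing in for the trusted model's judgment.

```python
# A minimal sketch of trusted monitoring, not any lab's actual pipeline.
# A weaker-but-trusted model scores each action proposed by the untrusted
# agent; actions above a suspicion threshold are blocked and logged.
from dataclasses import dataclass

SUSPICION_THRESHOLD = 0.8  # assumed cutoff; would be tuned in practice

@dataclass
class Verdict:
    suspicion: float  # 0.0 (benign) to 1.0 (clearly unsafe)
    rationale: str

def call_trusted_model(prompt: str) -> Verdict:
    """Placeholder for the trusted (weaker) model; a keyword heuristic here."""
    suspicious = any(k in prompt for k in ("model weights", "disable monitor"))
    return Verdict(0.9 if suspicious else 0.1, "keyword-heuristic placeholder")

def review_action(task: str, proposed_action: str) -> bool:
    """Return True if the untrusted agent's proposed action may proceed."""
    verdict = call_trusted_model(
        f"Task: {task}\nProposed action: {proposed_action}\n"
        "Rate how suspicious this action is from 0 to 1 and explain why."
    )
    if verdict.suspicion >= SUSPICION_THRESHOLD:
        # Blocked actions go to the incident log for the escalation process discussed later.
        print(f"[FLAGGED] {proposed_action!r}: {verdict.rationale}")
        return False
    return True

print(review_action("fix a failing unit test", "edit tests/test_utils.py"))     # True
print(review_action("fix a failing unit test", "copy model weights off-site"))  # False
```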
Another type of monitoring that we could consider is chain-of-thought monitoring. Current models are often trained to produce legible English text before they produce any kind of answer, and this reasoning often surfaces various considerations, the model's intent, and so on. So if the model's reasoning remains legible and faithful to the model's actual thinking, then we could hope that having a trusted model oversee chains of thought would be a reasonable monitoring strategy. And then finally, untrusted monitoring.
This is still an open research problem: how do we make it so that we can delegate monitoring to a model that is also untrusted? So that was the first of what I see as the four main components. The second one is some kind of escalation process.
Imagine we're sitting on ever-growing logs of unsafe actions that models have taken, but humans don't really have the capacity to go through all these logs to make sense of whether they constitute a serious incident or not. I imagine we will have to have some kind of hierarchy of more and more powerful models looking through these logs and eventually escalating to humans.
And once it's established that a serious incident has happened, something will have to be done at that stage. I don't think labs currently have escalation processes in place, but some of the things they could include are shutting down the model, having some kind of minimal testing period until another version can be deployed, writing up a detailed report, and investigating what happened. I imagine we can learn a lot from other fields here.
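As a sketch of what that hierarchy could look like in code: cheap monitors triage everything, stronger and more expensive reviewers look only at what the previous tier flags, and whatever survives every tier reaches a human. The tiers, thresholds, and keyword-heuristic reviewers below are illustrative assumptions, not an existing lab process.

```python
# A sketch of an escalation hierarchy: each tier filters the flagged actions
# passed up by the tier below it; only what survives all tiers goes to a human.
from typing import Callable, Iterable, List

Reviewer = Callable[[dict], float]  # returns a severity score in [0, 1]

def triage(flagged_actions: Iterable[dict],
           tiers: List[Reviewer],
           thresholds: List[float]) -> List[dict]:
    """Pass flagged actions up the tiers; return those needing human review."""
    remaining = list(flagged_actions)
    for review, threshold in zip(tiers, thresholds):
        # Only actions scored above this tier's threshold move up a level.
        remaining = [a for a in remaining if review(a) >= threshold]
    return remaining  # these go to a human incident-response process

# Illustrative usage with placeholder reviewers (keyword heuristics, not real models):
cheap_monitor = lambda a: 0.9 if "weights" in a["action"] else 0.2
strong_monitor = lambda a: 0.95 if "exfiltrate" in a["action"] else 0.3

logs = [{"action": "read public docs"}, {"action": "exfiltrate model weights"}]
for incident in triage(logs, [cheap_monitor, strong_monitor], [0.5, 0.5]):
    print("escalate to human:", incident)
```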
Okay, what else might we want in a robust AI control system? I think just a lot of common-sense security and system design. So things like logging agent actions, having fine-grained ways to give agents access to only the parts of the code that they are currently working on or only the assets that they actually need, limiting agents' persistence, redacting their inputs, and so on.
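Here is a minimal sketch of the least-privilege and logging ideas just listed: each agent task gets an explicit scope of paths it may touch, every access request is checked against that scope, and every decision is logged for later review. The scope format and names are assumptions for illustration, not an existing system.

```python
# A minimal sketch of least-privilege access plus audit logging for agents.
import fnmatch
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("agent-audit")

class TaskScope:
    def __init__(self, task_id: str, allowed_paths: list[str]):
        self.task_id = task_id
        self.allowed_paths = allowed_paths  # glob patterns the agent may access

    def check(self, path: str) -> bool:
        allowed = any(fnmatch.fnmatch(path, pattern) for pattern in self.allowed_paths)
        # Log grants and denials alike so the escalation process has a trail.
        log.info("task=%s path=%s allowed=%s", self.task_id, path, allowed)
        return allowed

# Illustrative usage: an agent refactoring inference code shouldn't read weights.
scope = TaskScope("refactor-inference-server", ["src/inference/*", "tests/inference/*"])
scope.check("src/inference/server.py")    # allowed
scope.check("checkpoints/model.weights")  # denied, and the denial is logged
```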
There will also probably be novel threats that don't have analogs in adjacent fields. One example would be rogue deployments. This is where an AI spins up a copy of itself, but one that isn't running inside the scaffold that enforces monitoring. And so now it's basically free to do whatever it wants without being found out.
And currently this is quite easy for models to do, but I imagine in the future we will have to deal with this somehow. And finally, control evaluations. These are red-teaming exercises where the control team intentionally instructs or trains a model to be misaligned and to try to cause some specific type of damage in the system, and then uses this misaligned model to test our defenses.
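A minimal sketch of what such a control evaluation could measure, under the assumption that a red-team policy and a monitored scaffold exist (every name and the dummy stand-ins below are hypothetical placeholders): run the deliberately misaligned policy inside the same monitored setup used in production and estimate how often the attack succeeds without being flagged.

```python
# A hedged sketch of a control evaluation: a red-team policy (a model prompted
# or trained to pursue a specific bad outcome) is run inside the monitored
# scaffold, and we measure how often it reaches its goal without being flagged.
import random

def run_control_eval(red_team_policy, scaffold, n_episodes: int = 100) -> float:
    """Return the rate at which the attack succeeds without being caught."""
    undetected_successes = 0
    for _ in range(n_episodes):
        episode = scaffold.run(red_team_policy)  # runs with monitoring enabled
        if episode.attack_succeeded and not episode.was_flagged:
            undetected_successes += 1
    return undetected_successes / n_episodes

# Dummy stand-ins so the sketch runs end to end.
class DummyEpisode:
    def __init__(self):
        self.attack_succeeded = random.random() < 0.3  # attack lands 30% of the time
        self.was_flagged = random.random() < 0.9       # monitoring flags 90% of episodes

class DummyScaffold:
    def run(self, policy):
        return DummyEpisode()

rate = run_control_eval(red_team_policy=None, scaffold=DummyScaffold())
print(f"undetected attack success rate: {rate:.2%}")
```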
Okay, so that's AI control in five minutes. To give a sense of the ecosystem or the landscape: there is a lot of research happening outside of labs and also a little bit inside labs.
So the org that is pushing for this the most is Redwood Research. They've spearheaded this direction and they're probably the biggest group focusing on AI control. The UK AI Security Institute also has a control team, and I think the three biggest labs, Anthropic, DeepMind, and OpenAI, also have control teams. But my guess is that each of these is fewer than 10 people.
In terms of commitments, OpenAI's Preparedness Framework does mention system oversight as a potential safeguard, and it includes some of the things that I've mentioned, in particular automated monitoring and some of the system design measures. And Google's Frontier Safety Framework mentions chain-of-thought monitoring as a mitigation that will be triggered at a certain critical capability level. So this is where we're at at the moment, although I would expect that in the future labs will have to rely on AI control much more if they want to deploy agents internally.
Okay, and to close, I wanted to share a few thoughts on what I see as the strengths and weaknesses of the control agenda. So what are the weaknesses? First is that monitoring probably will not scale: the gap between the best trusted model that we have and the untrusted models we want to deploy will continue growing, and at some point the trusted model isn't doing much useful work anymore.
Second, there might be some risk compensation going on between control and alignment. So control might enable labs to deploy imperfectly aligned models, or even models that they know are misaligned, and still think that they're being safe. And this is especially concerning because control doesn't actually address the longer-term root cause of the problem and doesn't help with the longer-term goal of aligning superhuman systems.
In terms of strengths of the control agenda, I think the main strength is that in the near term, it gives us way stronger assurances than alignment or alignment testing can. And so in the near term, it's probably the most effective thing we could be doing to reduce risk from misaligned AI. Another more speculative opportunity or strength is that once we are deploying possibly misaligned AIs, we could catch a model doing quite concerning things in the wild.
So for example, if we caught a model that was trying to take over a data center, I think that would be a huge update both for labs internally and for the external ecosystem. So that could help galvanize more work on safety.
All right, I hope you found the talk informative and thank you all for your attention. Amazing. Thank you.