Hopes & Difficulties with Using Control Protocols in Production

Summary

Fabien Roger highlights that while implementing control protocols for AI systems faces challenges like costs and security concerns, mitigations such as API monitoring are becoming increasingly tractable as AI capabilities advance.

SESSION Transcript

Fabien Roger: So, I’ll present “Hopes and Difficulties with Using Control Protocols in Production.” I’m going to share some ideas about what mitigations might exist by default, which ones are tractable to build right now, what kinds of roadblocks one might face when trying to design control protocols, and what constraints are binding the decisions being made right now. I’ll take questions at the end, and there will also be enough time right after the talk.
So, I will share personal views. They do not represent the views of my employer. I’m not claiming that Anthropic is/isn’t doing any of the things described here.
Cool, so I’ll describe what I think the default trajectory is for the sorts of mitigations people will use in practice. I’ll describe examples of control mitigations which I think are doable today, which are tractable to build today, some hopes and difficulties for actually getting them deployed, at least internally at AI companies, and how constraints might change as AIs become more powerful.
So first, I’ll point out that some mitigations already exist which would make the lives of misaligned AIs somewhat more difficult today. During AI training, if you use reinforcement learning, it will train away some low-reward behavior, and developers will look at transcripts. During testing, developers will also look at what AIs are saying, and maybe also check that the performance is as good as you might expect.
These are not perfect mitigations. A good enough schemer might just recognize which prompts are obviously training or obviously pre-deployment testing, but these at least give you some assurances against AIs which are misaligned but not very clever.
And there are also things happening in deployments. For example, when AIs are writing code, usually, at least today, humans will look at the code that AIs write before pushing it to production. There will also be static analysis and tests that make it harder for certain mistakes to slip by.
And once the code actually executes, there are classic security measures, detection and response, which might catch some of the more egregious things that AIs might do, for example, attempts at exfiltrating weights from servers.
But things might change in the future. I expect that as time goes on, the most likely trajectory is for the amount of quality-weighted AI labor to go up by a massive amount, far exceeding the amount of human labor available.
And so while today you can just check the code that AIs wrote and probably avoid many sorts of problems that might arise, I expect that in the future it might look a bit more like, let’s see what the AI assistants are up to. And this makes it more difficult to check every single thing that AIs are doing. And after that, crazy things might happen, but I’ll mostly focus on this intermediate regime where AI labor is massively speeding up safety-critical work: alignment research, cyber defense, and these other things.
And so, once you get into this middle region where AIs are being used as extremely productive assistants, I expect that we’ll keep most of the existing safety measures except the PR review one, because this one requires lots of human labor. But we’ll probably keep most of the default security. I expect that humans will still try to keep track of what AIs are doing, a bit like how executives in regular companies try to keep track of what their employees are doing, though most of it might just look like humans asking the AIs directly what they’re up to.
I expect that to prevent mistakes, things might increasingly look like AIs checking their own work. And I’m unsure if things will involve more or less sandboxing. Maybe AIs will be put in stronger sandboxes so that they can do very long trajectories without mistakes risking various security-critical issues. But maybe there will also be less sandboxing, if people manage to drive the rate of mistakes sufficiently low that just giving AIs more affordances is worth it.
But I claim the situation is still relatively risky. I think two important risks exist. The first is the risk of rogue internal deployments, that is, AIs that manage to run in ways which aren’t monitored, using affordances which developers did not intend to give to AIs. For example, if AIs manage to get root access to the data center, they might be able to run scripts which run themselves a lot more, without any form of monitoring, doing research into how they might cause even bigger problems to happen.
And this might not be trivial to prevent, because it might not look that different from regular research workloads, and humans won’t be able to look at every single piece of code that AIs are writing.
I’m also afraid that AIs might sabotage safety-critical work. If they’re doing most of the work, that will probably give them lots of opportunities to insert bugs that make certain promising research projects look like they’re not working after all, or to oversell research that doesn’t actually work.
And also, depending on what exactly the shape of AI capabilities looks like, I’m somewhat worried that AIs might succeed at self-exfiltration, or manipulate or persuade some humans into doing things which are extremely against their best interests, or even other things.
So how can we defend a bit better against these risks, and what sorts of control mitigations might be doable to implement today? By control mitigations I just mean things that directly help with safety, that work even if AIs are initially misaligned, and that are tractable today.
There are lots of things which are useful for improving safety, like doing research, doing better evals, and doing policy work. This talk is mostly about actually implementing very basic measures, and I think people should do those other things as well.
So, the default state, specifically for coding agents: I think this is what things will look like by default. You have a loop where you sample a new generation from a language model, you execute it, you sample again. Sometimes the human users will give instructions to this AI agent. Sometimes they will look at actions before executing them. And I claim that there are ways to modify this default state to inject some control mitigations.
I think a default thing that is probably good to do, and that I think is tractable, is to have human users report suspicious actions to some auditing team, so that we can learn more about weird AI behaviors. Some of these weird behaviors might be related to very concerning misalignment, especially as AIs become more powerful.
Another easy place to intervene is to add some API monitoring. That is, right at the output of your inference server, you run a monitor and also log the traffic, which allows your auditing team to notice if some very worrying things are going on.
And the third class of interventions which I think are very tractable is adding within-scaffold monitoring. Before an action is executed, an LLM monitor looks at it. Once you get a score for how suspicious it looks, you can block it and just ask the human user what they want to do; you can resample it, so even without asking the human user, you just resample a new action which hopefully will be better, or you tell the language model that what it was about to do was problematic for this reason and ask it to suggest a different action; or you can report it directly to the auditing team if it was very suspicious.
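To make this concrete, here is a minimal sketch of what such a within-scaffold monitor could look like. The class, function names, and thresholds are hypothetical placeholders for illustration, not any particular company's scaffold or API.

```python
"""Minimal sketch of within-scaffold monitoring in a coding-agent loop.

Everything here (MonitoredScaffold, sample_action, suspicion_score, the
thresholds) is a hypothetical placeholder, not any real scaffold's API.
"""
from dataclasses import dataclass, field
from typing import Callable, List, Optional

REPORT = 0.95    # above this, also flag the action for the auditing team
BLOCK = 0.90     # above this, do not execute; ask the human user what to do
RESAMPLE = 0.60  # above this, ask the model for a different action

@dataclass
class MonitoredScaffold:
    sample_action: Callable[[List[str]], str]            # untrusted policy model
    suspicion_score: Callable[[List[str], str], float]   # LLM monitor, returns 0..1
    audit_log: List[dict] = field(default_factory=list)  # read by the auditing team

    def propose_action(self, context: List[str]) -> Optional[str]:
        action = self.sample_action(context)
        score = self.suspicion_score(context, action)

        if score >= REPORT:
            # Report: log the action and score for the auditing team.
            self.audit_log.append({"context": list(context), "action": action, "score": score})
        if score >= BLOCK:
            # Block: return nothing; the caller asks the human user what to do.
            return None
        if score >= RESAMPLE:
            # Resample: tell the model why its action looked problematic and retry once.
            feedback = (f"Your previous proposed action looked problematic "
                        f"(suspicion {score:.2f}). Please suggest a different action.")
            action = self.sample_action(context + [feedback])
        return action  # the caller executes the (possibly resampled) action
```

In this sketch, the agent loop calls propose_action instead of sampling directly, executes whatever comes back, and defers to the user whenever it gets nothing back.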
Okay. I think this is starting to be tractable because we are starting to get extended APIs and scaffolds, and this means that we have actual software objects that we can start to modify to add control mitigations. And I think that in addition to the preparation benefits of adding these mitigations today, there are also some near-term benefits of using these sorts of interventions. In particular, my current favorite near-term benefit is that models are sometimes a bit too eager to solve problems and will make mistakes and go a bit beyond what the user intended.
So, for example, this tweet says that Claude 3.7 Sonnet needs its hand held as a leash to stop it from building an entire service business when you ask it to create a database. This mistake is not too bad, but there are slightly worse things that Claude Code could do.
And I claim that if you had better reporting and API monitoring, you could notice these issues appearing earlier, and if you had within-scaffold monitoring, you could prevent this sort of eagerness to solve problems from causing issues.
There are other classic near-term benefits you could get from control-like mitigations, like preventing LLM misuse, which is a good motivation to start implementing good ways of doing API monitoring, and security against humans, which is mostly not applicable to the mitigations I’ve talked about here, but which I think is a promising reason in general to monitor what people are doing on your data centers.
Okay. So, what are the roadblocks when trying to build these sorts of control mitigations in practice? I think one of the main roadblocks is productivity. There is the productivity of people actually using the scaffolds and the API, which might be reduced if you have too many false positives blocking some of what they’re trying to do, and also by increased latency.
If you run a monitor before executing an action, this might increase latency, which might slow down the execution of the agent loop but might also slow down usage for more interactive work. You might also lower the productivity of the people building the scaffolds by adding code bloat to the code base.
I think these are not massive costs; they are doable and probably will become more tractable as people worry more and more about misalignment, but today, I think it’s plausible that the effort might be better spent elsewhere.
So, if you’re willing to bid for some of other people’s time at a company, maybe you should instead have engineers help with safety projects, or make sure that everyone in the company understands the risks of AIs, or improve the culture of the organization.
There are also other things you can do to limit these costs, like first focusing on mitigations that run asynchronously and don’t actually block actions, and doing simple deployments first. And so, to the extent that some of you want to build control techniques that might be deployed very soon, I think it’s a good prompt to think about what simple ways there are to implement things that don’t increase code bloat too much and that don’t have too big an impact on productivity.
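As one illustration of the asynchronous option, here is a minimal sketch in which the agent loop only enqueues actions and a background worker scores them. The score_suspicion call and the flagging threshold are hypothetical placeholders.

```python
"""Sketch of asynchronous monitoring that does not block the agent loop.

score_suspicion is a placeholder for an LLM monitor call; the agent loop just
calls log_action and moves on, so no latency is added to action execution.
"""
import queue
import threading

action_queue: "queue.Queue[dict]" = queue.Queue()

def score_suspicion(action: str) -> float:
    # Placeholder: in practice, a call to a cheap LLM monitor returning a score in [0, 1].
    return 0.0

def monitor_worker(flag_threshold: float = 0.9) -> None:
    while True:
        item = action_queue.get()
        if item is None:  # sentinel used to shut the worker down
            break
        score = score_suspicion(item["action"])
        if score >= flag_threshold:
            # Flag for the auditing team; nothing is blocked, only surfaced later.
            print(f"flagged for auditing: {item['action']!r} (score {score:.2f})")
        action_queue.task_done()

# Start the worker once; daemon=True so it does not keep the process alive.
threading.Thread(target=monitor_worker, daemon=True).start()

def log_action(action: str) -> None:
    """Called from the agent loop; returns immediately."""
    action_queue.put({"action": action})
```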
Another natural roadblock that many of you have already thought about is the price. The naive price is just that monitoring needs compute. Even if you use a very cheap monitor that only costs 10% as much as the regular model, if you deploy it on full external traffic, this might be hundreds of millions of dollars of compute spent, at least at OpenAI scale.
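For a rough sense of where "hundreds of millions" comes from, here is the back-of-the-envelope arithmetic; the annual inference spend below is an assumed round number for illustration, not a reported figure.

```python
# Back-of-the-envelope only; the spend figure is an assumption, not a reported number.
annual_inference_spend = 2_000_000_000  # assume ~$2B/year of compute on external traffic
monitor_cost_fraction = 0.10            # a cheap monitor costing ~10% of the main model
extra_compute = annual_inference_spend * monitor_cost_fraction
print(f"~${extra_compute / 1e6:.0f}M per year of additional compute")  # ~$200M per year
```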
And also, it might reduce revenue and have indirect costs. For example, a classic worry is that if you use monitoring, especially if you have human reviews afterwards, this will reduce the privacy of conversations with users. And to the extent that some users don’t like that, this might reduce revenue indirectly.
And again, I don’t think these costs are complete blockers for anything happening forever, but I think that these resources right now might be better spent on other things. So, for example, AI companies could instead fund fellowships or expensive science experiments.
And also, there are ways to get slightly around this roadblock for now, for example by doing internal deployments first and doing cheap versions of the scaffolds first.
And the third roadblock is security. I think this one won’t get better over time. The basic concern is that if you have human monitoring, this means humans look at the traffic, and if the traffic contains sensitive information, think algorithmic secrets at labs, or, if you’re helping governments, maybe state secrets, then these humans might learn these secrets and might weaken what is called compartmentalization, which is the practice of only small teams having access to these secrets.
Another kind of security concern is that if you use LLM monitoring and advertise it as being very good, and this means people use their own judgment less and just mash the approve button, I think this can actually reduce AI safety on net. So you have to be somewhat careful about this. I think for now, the easy fix is to just avoid deploying control mitigations to sensitive internal deployments, but this is definitely not a viable long-term solution. In the future, I want AI control to be deployed even for sensitive internal deployments. And so, I’m keen to see future control work thinking about how to get around these sorts of constraints.
Okay, a very quick note on projects with near-term benefits. I think having projects with near-term benefits has benefits but also costs. If the near-term project is close enough to the safety-relevant one, you can get real-world experience and understand a bit better what the real constraints are, and you also get direct preparation for actually building these tools. But if it’s too far, it can be distracting.
And there are also indirect benefits and costs. For example, it helps the AI company which might be hosting the project, which might mean you get more resources to work on that project. It also means they get more resources, which, depending on how much you like the AI company you’re working at, might be a benefit or not.
And also, it helps other AI companies be safer. But maybe there are some compensation effects where, if you give them some tools to be safer, they just cut corners on other things, which means that it speeds up timelines with relatively little additional net safety. Cool.
So, the constraints I’ve described so far are the constraints that I’m seeing right now. I think things might still change in significant ways as AIs become more powerful.
First, I think AIs will become more powerful over time. And additionally, people will want to give these very powerful and scary AIs lots of affordances, maybe more compute, access to their emails, access to money. And I claim that in the situation on the right, as opposed to the situation on the left, which is closer to the current situation, users will want more guarantees about what the AI will and won’t do.
I also claim that responsible AI companies which care about misalignment will be more careful with these AIs than with today’s AIs and probably will be willing to spend more to prevent bad things from happening.
The main reason why the tractability of control mitigations might not improve as much is that the models might look more trustworthy. If you have a smarter AI, it might also be more situationally aware and more able to distinguish an action which is an obvious mistake from the action that the user actually intends.
And so, I expect overall that concerns around compute and pricing will go down as people become more keen to get higher guarantees. But just because these two concerns go down doesn’t mean all of them do; other costs become more important.
For example, one thing that is relatively cheap right now is having a human in the loop: when you block an action, show it to an actual human. Another is not giving AIs lots of compute, email access, and money. I think these might become bigger bottlenecks in the future as AIs become more useful and as the amount of AI labor grows.
I also expect that security and secrecy concerns will go up because I expect that people will want to use AIs for more stuff, including things which are very sensitive.
And also, there are miscellaneous ways in which the future is weird. I expect that if you have lots of AI labor, the scaffolds and the AI actions might become much more confusing. AI labs are right now relatively small, and it’s not impossible to keep track of everything that’s going on. But if you’re speeding up everything by a factor of 20 because AIs are doing a huge amount of labor, I think it might be harder for humans trying to understand if a given action was bad or not. Like, why was this action taken? What was the bigger project that it was part of? And therefore, things might be more confusing, which might make auditing AI actions harder.
Other things might be weird too. Maybe trusted models will become more expensive than untrusted ones, because untrusted models might be able to do things to make their own inference way cheaper, and you don’t really trust their distillation process, or whatever other process makes models cheaper, and so you don’t want to apply it to your trusted models. This means that the price of your untrusted models might go down extremely fast as you apply untrusted AI labor to make untrusted AIs cheaper, in a way that doesn’t straightforwardly transfer to trusted models.
Also, you might have changes in architectures that make AIs harder to monitor. And you might have continuous deployments, where instead of having distinct training, evaluation, and deployment phases, you get continuous training, continuous evaluation, and continuous deployment. I think some of these are preventable if they are extremely bad, but there might be significant costs associated with avoiding some of these things.
And so I expect that current work on control protocols might not be directly applicable in the future, because the future might be a weird place. I still think it’s worth doing because I think there will be some applicability. I also expect that current AI control work helps us understand how we will do this in the future and how we will adapt to new constraints, both for future humans doing this work and also for future AIs.
Okay, to conclude, some takeaways. I think it’s important to take into account existing default mitigations, both to avoid focusing on the threats which are most likely to be dealt with by the default mitigations, and also to build more realistic and interesting control settings.
I think we’re starting to get to an exciting time where control mitigations can be deployed today, subsidized by their value in preventing mistakes from AIs. So, I’m excited to see various AI labs and maybe startups build things like better reporting, better API monitoring, and within-scaffold monitoring that leads to actions like blocking, resampling, or reporting AI actions. And maybe there are other exciting opportunities to be taking advantage of.
And for people doing more research-y style work, I’m interested in people taking into account constraints which are relatively foreseeable. Much existing work takes into account some constraints, like the cost of running monitors, but I’m interested in work also directly taking into account user productivity; the ease of implementation and maintenance, at least to the extent that people want to implement these things right now, even though in the future there will probably be more willingness to implement more complicated things; compute requirements; user privacy and secrecy; and security implications. I also encourage people doing control research to think of potential weird future constraints that we might face, and if any of them seems particularly likely, maybe build control protocols which work despite these constraints. Thank you very much.
Okay, so we have time, I think, three minutes maybe, or maybe a bit more for questions. And then in another room, there will be an office hour where people are welcome to come if they want to chat more about these questions and about other things related to AI control.
Q: Are you optimistic about being able to build privacy-preserving monitoring techniques that would allow monitoring to be applied widely, maybe using ideas from other fields where privacy preservation is important?
A: I don’t know exactly which ideas from other fields you’re referring to. The sort of thing that might be applied from other fields is that if you have automated algorithms for monitoring, you can just apply the automated algorithm without having any human look at the traffic. For example, if you have a very, very good monitor, you can just have this monitor run and you never look at the most suspicious things it finds.
I think this is a somewhat sad situation. At some point, I want a human to look at it, such that if the AI does something actually extremely spooky, you’re able to raise the alarm. One of the ways of dealing with this challenge is maybe to have the users themselves, or someone inside the compartmentalized group, actually review the most suspicious actions. And if your monitors are good enough, maybe it’s actually worth people’s time to look at the most suspicious actions.
Also, maybe there are some control protocols which don’t rely on this reporting of the most suspicious actions. For example, if you’re afraid of sandbagging, the weirdest actions won’t look terribly suspicious; they will just look slightly weaker than they would otherwise be.
So maybe, as in other fields, for certain threat models you will be able to do things which are more privacy-preserving, but I think it won’t be very easy.
Q: I’m just curious, when you do production monitoring, how do you handle streaming APIs?
A: Okay, so there are multiple ways of doing this. One way that people have experimented with for misuse is using streaming classifiers. Obviously, these have some flaws, because sometimes it’s much easier to notice that something is suspicious when you see the whole block as opposed to a prefix. Another thing you can do is, for sufficiently suspicious things, just disable streaming.
And then the last solution is you just wait for the full streaming response to come in. If your call to the monitor lives on the front end, which obviously is not ideal, you can first get the streaming response back, and then, before acting on it, for example by executing the action, you send it to the monitor once the action has fully streamed in.
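As a minimal sketch of that last option (the check_suspicion monitor call is a hypothetical placeholder): the client keeps the streaming user experience by displaying chunks as they arrive, buffers the full text, and runs the monitor on the complete response before acting on it.

```python
"""Sketch of front-end handling for a streaming API with post-hoc monitoring.

check_suspicion is a placeholder for an LLM monitor call; the latency cost is
paid before execution of the action, not before displaying the stream."""
from typing import Iterable, Optional

def check_suspicion(full_text: str) -> float:
    # Placeholder: a call to an LLM monitor returning a score in [0, 1].
    return 0.0

def stream_then_monitor(chunks: Iterable[str], block_threshold: float = 0.8) -> Optional[str]:
    buffer = []
    for chunk in chunks:
        print(chunk, end="", flush=True)  # show tokens as they arrive (streaming UX preserved)
        buffer.append(chunk)
    full_text = "".join(buffer)

    # Only once the action has fully streamed in do we run the monitor.
    if check_suspicion(full_text) >= block_threshold:
        return None        # block execution; e.g., ask the user what to do instead
    return full_text       # safe to act on, e.g., execute the action
```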
Q: So basically, that would just introduce latency into your API then though?
A: Oh, yeah, yeah. Yeah, okay.
Q: So it’s a latency cost when you do streaming?
A: Yes.
Q: So for the graph where it’s slightly further ahead and people are monitoring their own agents doing things with money and this, that, and the other, where do you think the monitoring is most likely to come from? If part of it relies on the model, and part of it might be around actions related to the scaffolding, who in the value chain do you think might provide that?
A: Sorry, let me try to rephrase the question. I’m not sure I understood it. Is the question like what is being monitored or is it…?
Q: Yeah, I guess thinking who do you think is most likely to be the person providing the monitoring to the user? Do you think it will come from the...
A: Oh, yeah.
Q: Like the developers of the models or like the intermediary layers?
A: For internal deployments, probably this will be the AI companies themselves. For external deployments, I think it will depend a little bit. I think it will often be the people building the scaffolds, since I expect them to be somewhat sensitive to it. I could see, for example, Cursor trying to build these sorts of things into their own scaffolds if there was demand for it.
Though, yeah, maybe the incentives are not exactly aligned here; I’m not sure. I can also imagine either AI labs or third parties providing these directly as services. One advantage of having the AI labs do it themselves is that if they’re the ones who own the absence of catastrophic risk, it probably makes the most sense for them to ensure that these sorts of defensive measures are always used.
Though, obviously, it depends a little bit on what the threat models are. For example, I could imagine that if you’re doing a deployment in a government agency and the AI lab doesn’t have a lot of transparency into what the government agency is doing, it will be the government agencies themselves who will be responsible for ensuring the safety of the deployment.