AI Control Safety Cases

Summary

Tomek Korbak presents a framework for developing AI control safety cases that evaluate whether control measures will remain effective as AI capabilities increase towards superintelligence.

Session Transcript

Tomek Korbak: Hi, I'm Tomek Korbak, I'm a Senior Research Scientist on the UK AISI Control Team, and I'm going to talk about AI control safety cases. The plan for today is to first quickly go through what a safety case is, then to show you a concrete sketch of a control safety case, and then to discuss a bit of the roadmap for how we scale control safety cases from today to artificial superintelligence.
So, a safety case is a structured argument that a given system is safe to deploy in a given context. They are standard in many safety-critical industries, and a safety case can either be used internally or shared with regulators. Currently, AI labs are making implicit inability safety cases: they run a suite of capability evaluations and determine that, for example, the model is not capable of meaningfully assisting with biological weapons development.
But soon, models will be capable of doing that, and they will be increasingly used autonomously, so we will need some safeguards to ensure that they won't cause harm despite being capable of causing harm. And if your threat model is misalignment, you want to apply some control measures and make a control safety case. So, let me show you a specific sketch of an AI control safety case. I will describe a threat model, its control measures, the control evaluation, and the structure of the resulting safety case.
So, let's start with a threat model, we assume that we start with an LLM agent, being used internally, to assist in software engineering tasks. So, it's a language model similar to those being used today with some scaffolding that allows it to call tools. We assume that it can execute code in a safe sandbox environment and has access to the internet. And the threat model we worry about is that the LLM agent that has access to some sensitive data, for example, frontier architecture code or some other algorithmic secrets, is going to leak them to the public internet.
So, we would like to design some safety measures with this goal in mind, and inside our safety case, we want to make the claim that the probability of data exfiltration, given our control measures, is at most 1%. And here are those control measures: we assume they consist of both upfront evaluations and deployment-time measures. I will focus here on the deployment-time measures.
So, we assume that the LLM agent has access to some context, which is, for example, the description of a problem and the previous actions it has taken. It produces an action and a chain-of-thought (CoT). These are shown to a monitor, and the monitor produces a suspiciousness score. Based on this suspiciousness score, the control protocol takes an action: for example, it allows the code to be executed or, in the extreme case, sends it to a human for an audit.
We assume that we will use untrusted monitoring, which means using another instance of the same model as the monitor. And to prevent collusion, we will use summarization to decrease the bandwidth of the communication channel that could be used for collusion.
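To make the pipeline concrete, here is a minimal Python sketch of the deployment-time loop just described. The function names, thresholds, and the specific escalation options are illustrative assumptions, not the actual implementation.

```python
from dataclasses import dataclass

@dataclass
class AgentStep:
    action: str            # e.g. a shell command or code edit proposed by the agent
    chain_of_thought: str  # the agent's CoT for this step

def summarize_cot(cot: str, max_chars: int = 500) -> str:
    """Shrink the CoT before showing it to the untrusted monitor, reducing the
    bandwidth of the channel that could be used for collusion."""
    return cot[:max_chars]

def monitor_score(action: str, cot_summary: str) -> float:
    """Untrusted monitor: another instance of the same model rates how suspicious
    the step looks, on a 0-1 scale. Placeholder for an actual LLM call."""
    raise NotImplementedError("call the monitor model here")

def control_protocol(step: AgentStep,
                     audit_threshold: float = 0.9,
                     resample_threshold: float = 0.5) -> str:
    """Map the suspiciousness score to an action taken by the control protocol."""
    score = monitor_score(step.action, summarize_cot(step.chain_of_thought))
    if score >= audit_threshold:
        return "send_to_human_audit"   # the extreme case mentioned above
    if score >= resample_threshold:
        return "block_or_resample"     # an illustrative intermediate option
    return "execute_in_sandbox"        # allow the code to be executed
```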
Coming back to our safety case claim, we say that this probability is at most 1%. Now, how do we estimate this number? This is where control evaluation comes in. A control evaluation is framed as a game between a red team and a blue team, and it consists of five steps. First, the blue team constructs some control measures. Then the red team figures out some attack strategies and asks the developers to produce tasks that allow those attack strategies to be executed.
So, here we require that the tasks actually present a good opportunity for attack. And then the red team constructs model substitutes that actually execute those attacks. This in practice probably means either prompting or fine-tuning them.
And then we run a proxy deployment phase. At evaluation time, the blue team can evaluate models and perhaps reject them. At deployment time, we apply the monitoring pipeline I just described. And finally, the developers extrapolate from this proxy deployment phase to estimate the frequency at which the red team would win if we scaled the proxy deployment up to the size of the actual deployment.
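As a toy illustration of that final extrapolation step, here is one way the numbers could be combined, assuming tasks are roughly independent (an assumption that is relaxed at higher capability levels later in the talk). The statistical choices and figures below are illustrative assumptions, not the paper's procedure.

```python
from scipy.stats import beta

def per_task_upper_bound(attack_successes: int, proxy_tasks: int,
                         confidence: float = 0.95) -> float:
    """One-sided Clopper-Pearson upper bound on the per-task probability that
    the red team's attack succeeds, estimated from proxy-deployment counts."""
    if attack_successes >= proxy_tasks:
        return 1.0
    return float(beta.ppf(confidence, attack_successes + 1,
                          proxy_tasks - attack_successes))

def deployment_risk(per_task_p: float, deployment_tasks: int) -> float:
    """P(at least one successful data exfiltration) over the full deployment,
    assuming independent tasks -- the IID assumption relaxed at ACL-4."""
    return 1.0 - (1.0 - per_task_p) ** deployment_tasks

# Illustrative numbers only: a per-task success probability of 1e-7 over
# 100,000 deployed tasks lands roughly at the 1% threshold from the claim.
print(deployment_risk(1e-7, 100_000))  # ~0.0099
```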
And then what is the structure of this resulting safety case? It rests on two basic claims: The first one is that the data exfiltration probability we just estimated is acceptable. This is basically a statement about your risk threshold. And then the second claim is that this probability that we estimate is actually an upper bound on risk at deployment time.
So, we estimated a number, the attack success rate of the red team, and we want to claim that it is actually an upper bound on the deployment-time risk. To support this claim, we make three further claims. First, the control measures are actually going to be deployed: the thing we deploy is the same thing we evaluated in the control evaluation.
Then we make a claim that our red team is as effective in control evaluation as real schemers would be in deployment. And finally, we make this claim that the developer can conservatively extrapolate statistics from the proxy deployment phase to estimate this risk at deployment time.
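Written out as an informal claim tree, the structure of the argument looks roughly like the sketch below; the wording is paraphrased from the talk rather than taken from the paper.

```python
# A GSN-style nesting of the claims just described (paraphrased, not verbatim).
safety_case = {
    "top_claim": "The LLM agent is safe to deploy for internal software engineering",
    "supported_by": [
        {"claim": "P(data exfiltration | control measures) <= 1% is an acceptable risk"},
        {"claim": "The estimated red-team attack success rate upper-bounds deployment-time risk",
         "supported_by": [
             {"claim": "The control measures evaluated are the ones actually deployed"},
             {"claim": "The red team is as effective in the control evaluation "
                       "as real schemers would be in deployment"},
             {"claim": "Statistics from the proxy deployment can be conservatively "
                       "extrapolated to the scale of the real deployment"},
         ]},
    ],
}
```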
That was our concrete sketch. Now, here is a trajectory of safety cases and how they are going to evolve. Let me first start with the goals of the paper. This is work in progress, but the paper is going to be published in a week or so, I hope. Then I'll describe a framework and a concrete trajectory of how we expect control safety cases to evolve over time, from today to artificial superintelligence.
So, let's think about how the difficulty of running a control evaluation depends on the capability of the model we care about. The safety case I just described pertains to human-level schemers, but we know that models that exist today are pretty incompetent schemers. And we also know that we should expect a couple of capability levels in between today's models and human-level scheming.
We also know that progress won't stop there, and at some point we are going to have models that are superhuman at scheming. In the paper, we are interested in the general trend of how we should adapt control evaluations to the capability levels of models. And we introduce the notion of an ACL, an AI Control Level: basically, a capability level of a model that determines how we should design the rules of the red-teaming game in control evaluation.
So, the first goal we had in the paper was to account for the capability limitations of schemers. Currently, most work in AI control has focused on building control measures robust to the best attacks known to humans. But we know that in practice, especially early on, this is overly conservative. We want an upper bound, but ideally the bound would be tight, so that we don't build control measures that are unnecessarily expensive and so that we build control measures that are more likely to be deployed in practice.
Second, developers are in a sense already doing AI control: many safety measures implemented today, such as access controls or human approval, are already being used to address the risks that models pose today, and implementing AI control is just a matter of scaling them up.
Our next goal was to trace the path of AI control safety cases through time. We expect that initially the load-bearing component is going to be capability evaluations, but over time we will have to rely increasingly on the control evaluation itself. And finally, we need to ask whether AI control scales to ASI, and if it doesn't, it would be good to know.
We propose a six-component framework for thinking about that. So, the idea is that first, a developer should figure out the capability profile of a given model. Capabilities, especially autonomy, determine the context in which the model is deployed. By context, I roughly mean the degree of human oversight, scale, and stakes involved.
Capabilities in a given context pose risks, and we would like control evaluations to account for those risks. Once we have the rules of the control evaluation, they determine what counts as an adequate control measure: namely, adequate control measures are those that pass the control evaluation. And once we have run the control evaluation, we basically have all the evidence that goes into an AI control safety case.
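One way to picture the six components is as a pipeline of artefacts, each determining the next. The type names and fields below are my paraphrase of the talk, not the paper's formalization.

```python
from dataclasses import dataclass

@dataclass
class CapabilityProfile:      # 1. what the model can do; autonomy matters most
    ai_control_level: int

@dataclass
class DeploymentContext:      # 2. determined by capabilities: oversight, scale, stakes
    human_oversight: str
    scale: str
    stakes: str

@dataclass
class ThreatModel:            # 3. risks posed by those capabilities in that context
    risks: list[str]

@dataclass
class EvaluationRules:        # 4. rules of the red-teaming game, accounting for the risks
    red_team_affordances: list[str]   # e.g. "prompting only" vs "fine-tuning"

@dataclass
class ControlMeasures:        # 5. adequate if and only if they pass the control evaluation
    measures: list[str]
    passes_control_evaluation: bool

@dataclass
class SafetyCaseEvidence:     # 6. everything the control evaluation produced
    estimated_risk_upper_bound: float
```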
Now, I'm going to describe how we apply this framework to get an example trajectory. This basically means assuming some timeline of AI progress and applying the framework to get a sequence of AI Control Levels that roughly shows how AI control might scale in response to increasing capabilities.
I find this timeline plausible, but you don't have to. In principle, you can apply the six-component framework I described to your own personal timeline of AI development, and you will end up with a different trajectory. So, this is just an illustration that I hope is helpful, but in practice, we should adapt as we go.
At the first level, we assume that models basically can't do anything, but they can scheme in context. By scheming in context, I mean that they form misaligned goals in context, for example due to a conflict in the prompt or in retrieved documents. So, the misaligned goals are represented in activations, not in weights.
To account for that, allowing the red team to fine-tune models would probably be overpowering them; it is probably okay to just allow them to prompt the models. And we assume that adequate control measures would involve secure sandboxing, CoT monitoring, and human approval for consequential actions.
At the next level, we imagine that language models start being capable of deliberately obfuscating their CoTs. To account for that, the red team should be allowed to fine-tune models to use particular steganographic encodings, and we should adopt untrusted CoT monitoring, meaning using another copy of the same model as a monitor. You can then make a case that it will hopefully be able to see through the agent's steganographic encoding.
At the next level, we imagine that models become capable of forming consistent goals out of context, during training. That leads to much higher consistency, which allows them to be better at coordinating with other instances of themselves: either colluding or coordinating attacks across time with other copies.
To account for that, we should allow the red team to fine-tune models to follow some general attack strategies, and to counter that, the blue team should be allowed to introduce some anti-collusion measures.
At ACL-4, we imagine that models become capable of latent reasoning, by which I mean reasoning without externalizing it into any CoT tokens. To account for that, the red team should be allowed to fine-tune models to use any attack strategies. We also can no longer treat the control evaluation as just a set of IID tasks; instead, we should allow developers to use computer simulation to estimate the risk if models pursue attack strategies with the optimal probability.
And to account for this latent reasoning, it no longer makes sense to rely on CoT monitoring; we probably have to switch to untrusted action-only monitoring.
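As a toy sketch of what such a simulation-based estimate might look like: rather than averaging over IID tasks, we simulate whole deployments and let the simulated schemer pick its per-step attack probability optimally. Every parameter and the two-parameter attack/catch model below are purely illustrative assumptions.

```python
import random

def simulate_deployment(n_steps: int, attack_prob: float,
                        p_caught: float, p_attack_succeeds: float) -> str:
    """One simulated deployment: at each step the scheming policy attacks with
    probability attack_prob; a caught attack shuts the deployment down, while an
    uncaught successful attack counts as the unsafe outcome."""
    for _ in range(n_steps):
        if random.random() < attack_prob:
            if random.random() < p_caught:
                return "caught"       # audit triggered; deployment ends safely
            if random.random() < p_attack_succeeds:
                return "exfiltrated"  # unacceptable outcome
    return "safe"

def worst_case_risk(n_steps: int = 1_000, p_caught: float = 0.8,
                    p_attack_succeeds: float = 0.3, trials: int = 1_000) -> float:
    """Estimate P(exfiltration) assuming the schemer picks whichever per-step
    attack probability is best for it -- the 'optimal probability' from the talk."""
    risks = []
    for attack_prob in (1e-4, 1e-3, 1e-2, 1e-1):
        outcomes = [simulate_deployment(n_steps, attack_prob, p_caught, p_attack_succeeds)
                    for _ in range(trials)]
        risks.append(outcomes.count("exfiltrated") / trials)
    return max(risks)

print(worst_case_risk())
```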
Finally, at ACL-5, models are superintelligent; by that, I mean that they roughly dominate all human experts across all cognitive domains. To counter that, the red team could try to rely on superhuman AI systems, and the blue team could try to restrict the affordances of the model. But those mitigations and strategies should be seen as best-effort.
Actually, I would like to argue that, absent research breakthroughs, you cannot run a proper control evaluation for an ACL-5 model, that is, a superintelligent model. And here is a more precise claim: we can't make a control safety case for an ACL-5 model, because the red team won't discover certain attacks that the model could execute.
And here is the argument. A basic premise of control evaluation is that the red team should upper bound the model in terms of capabilities. I argue that this bound should be tight, but either way, once models are superhuman we have a problem: the effectiveness of the red team has to plateau.
I would also argue that using trusted AI systems won't be enough to help the red team. So, the red team would have to rely on uplift from untrusted AI, which would hopefully allow them to actually upper bound those scheming models. But then the worry is that the untrusted AI being used for assistance can itself be a schemer and can sandbag on strategy generation.
So, the worry is that it might decide not to reveal certain attacks. And since humans, by definition, can't discover those attacks themselves, they won't be able to catch or prevent the sandbagging. In consequence, the red team can never claim with confidence that they upper bound an ACL-5 model in terms of capabilities.
To summarize: control safety cases are structured arguments that our control measures are sufficient for a given LLM agent. They are based on evidence produced by a control evaluation, which is the game between the red team and the blue team, and we assume that the red team should upper bound the models in terms of capabilities. I argue that we should adapt control evaluations to model capabilities: we should consider handicapping the red team so that it's only a little bit stronger than the scheming models we worry about.
Early control evaluations would then hopefully be reasonably cheap, such that we can start today, and from there we should scale up and try to push AI control as far as it can go, but no further. Thanks, this was joint work with Mikita Balesni, Josh Clymer, Ben Hilton, Buck Shlegeris, and Geoffrey Irving, from the UK AI Security Institute, Apollo, and Redwood.
We are also hiring, so probably many of you should apply, and now I am happy to take questions.
Q: You were saying that at ACL-4 you would stop doing chain-of-thought monitoring, but it doesn't seem clear to me that that's a good idea. You can't completely rely on chain-of-thought monitoring anymore, because the model can do latent scheming, but it seems like latent scheming must still be harder to pull off than legible scheming. So, if you stop monitoring chain-of-thought, you might be making it easier for the model to scheme without being caught than it would be otherwise. [Yeah.] So, if chain-of-thought monitoring is cheap, I think you should keep doing it.
A: Yeah, that's a good point. It's possible that it's not going to be a radical transition from CoT reasoning to no CoT reasoning, although it could be a sharper shift if it occurs due to some architecture modifications. My reason for dropping CoT monitoring was mostly that it is an additional attack surface for collusion between the policy and the monitor.
Ideally, you would like to restrict the capacity of the communication channel between the policy and the monitor. And the idea is that if models mostly do latent reasoning, it might not be worth it to show those CoTs to the monitors; the downsides might outweigh the benefits. But this is just my speculation; in practice, you should run an experiment and find out.
Q: Hi, my name is Jay, I'm bad at being at meetings. I wanted to ask about the graph of human capabilities for red teaming versus AI capabilities.
A: For this one?
Q: Yes. One thing we've noticed is that AI capabilities are very jagged compared to human capabilities. This implies that we could see strategies the red team was never going to reach, even prior to the naive projected crossover point. Or, conversely, that red-teaming might still be effective substantially after the naive crossover point, where we'll still be discovering strategies even though the AI is ostensibly superhuman.
A: So, you're saying that this capability profile of models is uneven?
Q: Yeah.
A: And it's not a radical shift from all strategies being discoverable to zero strategies being discoverable.
Q: Yeah.
A: I see, yeah. That's true. Probably here we are argmaxing: we are worried about the case where there is any attack strategy that the red team missed. And ideally, we would like to ensure that the red team has either found all the strategies, or that the strategies it found are as effective as the most effective strategies available.
Q: Okay. Is that…? Okay, cool.
A: Yeah, so the short answer is I guess we would like the red team to more or less be exhaustive.
Q: Okay.
A: In strategy search.
Q: Cool.
Q: Hi, Sam from Lakera. We have our own red-teaming models that we run, and we found that you can actually run much smaller ones than the big LLMs, because the models don't need to be able to do all the things that the underlying model can do; they just need to be really good at that one thing of spotting planned attacks. Does that change this at all? The red team's capacity gets uplifted by AI, but maybe you don't need a trusted AI; maybe you just need something that's very good at spotting attacks but is dumb otherwise, and that's good enough.
A: Yeah, so I guess there are two parts here. One is that you need some models that generate strategies, and maybe another model that verifies that they are indeed successful attacks. And the worry is that the first model might decide to hold off and not reveal certain attack strategies. So even if you have a perfect verifier, you still have this elicitation problem.
You can try to apply some optimization pressure, but to my mind it's still an awkward problem. It's basically an elicitation problem, or it also reduces to exploration hacking.
Q: Can you say more about your intuition that out of context steganographic scheming is below human level or superhuman level scheming capabilities?
A: Yeah, let's see. I guess my intuition would be that soon labs will be applying a lot of high-compute RL to optimize models over long trajectories, and then you can have this effect of some instrumental goals arising, such as seeking compute, or curiosity, or something like that. I would expect that at this point models would still not be superhuman; they would maybe be superhuman in some domains like coding, but not across the board. But this is just speculation, don't cite me on that.
I think we don't have more time for questions, but I will be available somewhere here, so if you have any more questions, I am happy to take them off the stage. Thanks.