Next Steps for Control Safety Cases
Summary
Martin Soto explores how control mitigations and stress testing with adversarial models can prevent harmful AI outcomes if we address current challenges in creating realistic testing scenarios, ensuring redundancy, and bounding adversarial AI's capabilities.
SESSION Transcript
So yeah, I'm Martin from the UK's AI Security Institute. And I'll be talking about Evaluating Control Mitigations. So first, I'll give some framing on what are control mitigations, then why we want to evaluate them, and finally go over some remaining challenges that we are currently working on now at AISI.
So what are control mitigations? Well, say you want to deploy your AI in some environment to perform a task. Say your AI is already capable enough to bring about some particular negative outcome, like, for example, exfiltrating its own weights from the lab. So regardless of whether the AI will actually try to do this, if it tried to do this, it would be already capable enough. And then say, also, you cannot rule out that the AI will actually competently and purposefully try to bring about this negative outcome. That is, you cannot rule out misalignment from your model.
Then something you can still do is imposing control measures on your AI, so that even if it tries, even if it's misaligned in this way, then it won't succeed at bringing about the negative outcome. So some of the measures that people usually discuss here are using humans or other AIs to monitor your problematic AI.
You might want to monitor its output, or also the chain of thought it used to produce this output, or even its internals with white box methods. We can also think of other measures to take, like withholding certain information from the AI, although this becomes complicated, or even strengthening just internal defenses generally in the Infosec or Cybersec sense, which is useful both for human insider threats and for these kinds of AI insider threats. So in a sense, with any of these measures, what you're really doing is just increasing the required threshold of capabilities for your AI to be able to bring about the negative outcome.
So without these measures, it will be easy for your AI to bring about the outcome. But with these measures, possibly, it's no longer possible. And yeah, I think this will be relevant soon, because it seems like both of those properties will probably hold soon. We'll have the capabilities, but we don't have robust ways to ensure alignment.
So how do we decide what control mitigations to actually implement in a given situation? We might hope that just knowing our AI, the environment and task that we'll deploy it in, we'll be able to predict and decide what control measures will be robust enough, will be sufficient to prevent the outcome. Unfortunately, this gets quite tricky and hard to predict.
This is because the environment and task might be pretty complex. We might even be uncertain about the exact capabilities or knowledge that our AI has. And also, some of these failures could be quite high stakes. Once the model weights are exfiltrated, you cannot go back, and so on. So it seems better to just directly stress test, evaluate whether the control mitigations you're implementing are sufficient for your setting.
This can be helped and made better with research, for sure. Just research and control give us a better picture of what control mitigations work, how well, how powerful they are. But in the end of the day, your specific setting will differ in 1,000 small ways from any research papers you've seen and so you still want to evaluate your specific setup.
So how does this actually look like? Well, let's look at other industries, actually. In cybersec, or also recently in AI Misuse safeguards, what we do is, since we're worried about the human user being an adversary, we just hire a red-team to play the role of that human adversary and they stress our security measures.
So what we can do for control mitigations is, since we're worried that the AI itself might be the adversary, we can let a human red team make the AI into an actual AI adversary. So create an AI through prompting, or fine-tuning, or whatever method that actually plays the role of a fully misaligned AI that's trying to subvert our control measures. So, the flow here will be something like, first, the blue-team decides what control measures to implement, and writes that out explicitly. Then the red-team comes in and creates an AI model that's purposefully trying to subvert those measures. And finally, in some proxy environment, or even in your actual production code, we run an interaction where this model from the red-team tries to subvert your control measures. So this is a pretty simple picture.
Unfortunately, there are a few technical questions that need to be decided about how to implement this exactly that are here not fully represented. For example, some of the technical questions that we face when actually running a control eval, what models should even be substituted by the red-team? So some AI models that play some role in your deployment might be so small that you are sure that they are not coherently and competently misaligned in any way. So you don't let the red-team just make them as adversarial as they want.
But we don't yet have a clear line to draw here about which model should or shouldn't be substituted. So ongoing research on that. Also, what should the proxy environment be exactly? Because, of course, you want to make it as realistic and close to your real deployment situation as possible.
But even if you are running the experiment on your production code or a clone of your production code, the input distribution will probably not be exactly the same, and you need to make that up somehow. Or also, even what strategies are the red and blue team allowed during this exercise? Because, for example, you might see in your capability evaluations that your AI model certainly doesn't have a capability. You are even accounting for sand bagging and the like.
But this is a capability that your red model could fine tune your model to learn. So probably you do not want to let your red team do this, because it will be too conservative. The red-team will be too strong.
So a lot of finicky questions like these ones, in which we want to strike a balance in taking decisions that are efficient for the developers to be able to run this evaluation process, while still remaining robust in the sense that the evaluation process is robust in figuring out whether your measures are sufficient. And in fact, since there are many tricky questions like this one, we're also thinking that many of these evaluation processes, you probably want to have some responsible arbiter of the evaluation, what we usually call a green-team that tries to play this impartial role, making all of these decisions, but also consulting with external experts.
So these are technical questions. And some of my colleagues at AISI recently published a couple of papers, actually, in these directions. There are also some organizational questions about how should the developer implement this process internally, what should be the human teams, how should they relate, the red-team, blue-team, etc.
And here also, some of them are in the direction of ensuring robustness. So for example, we want to make sure that the implementation of our control mitigations is preserved after the evaluation and during deployment. This could be especially crucial against misaligned AIs, because misaligned AIs will notice quite quickly security breaches in your mitigations and coherently and agentically pile up on these breaches to enact the negative outcome. For example, if some of your human employees are lazy or even maliciously not implementing a part of the protocol, this could be taken advantage by your AIs.
This is, in fact, an instance of a more general trend of doing compositional analysis, so actually looking if a part of your mitigations goes down, do the rest still stand? We also need basic Infosec measures for sure. Maybe we want to document the whole process for external review as well. But then other questions are about ensuring efficiency. And especially, we want to ensure that these things don't take too much serial time and research hours. These seem like the most scarce resources for developers nowadays. And here, there are considerations about how much we can externalize the red-team. Basically, the more, the better. We want to harmonize with other standard practices to make it easier on the developer. And generally, just finding the serial pain points and making them better with research there.
So we're working on this. We're, in fact, working on some guidelines and streamline process that are both robust and efficient. And we'll have some publications on this soon. And yeah, finally, if you're interested in any of these questions, we are hiring in roles for control as well as other teams. We are also funding research on these and other topics. And please do come talk to me or reach me at this email.
Thanks so much.
So what are control mitigations? Well, say you want to deploy your AI in some environment to perform a task. Say your AI is already capable enough to bring about some particular negative outcome, like, for example, exfiltrating its own weights from the lab. So regardless of whether the AI will actually try to do this, if it tried to do this, it would be already capable enough. And then say, also, you cannot rule out that the AI will actually competently and purposefully try to bring about this negative outcome. That is, you cannot rule out misalignment from your model.
Then something you can still do is imposing control measures on your AI, so that even if it tries, even if it's misaligned in this way, then it won't succeed at bringing about the negative outcome. So some of the measures that people usually discuss here are using humans or other AIs to monitor your problematic AI.
You might want to monitor its output, or also the chain of thought it used to produce this output, or even its internals with white box methods. We can also think of other measures to take, like withholding certain information from the AI, although this becomes complicated, or even strengthening just internal defenses generally in the Infosec or Cybersec sense, which is useful both for human insider threats and for these kinds of AI insider threats. So in a sense, with any of these measures, what you're really doing is just increasing the required threshold of capabilities for your AI to be able to bring about the negative outcome.
So without these measures, it will be easy for your AI to bring about the outcome. But with these measures, possibly, it's no longer possible. And yeah, I think this will be relevant soon, because it seems like both of those properties will probably hold soon. We'll have the capabilities, but we don't have robust ways to ensure alignment.
So how do we decide what control mitigations to actually implement in a given situation? We might hope that just knowing our AI, the environment and task that we'll deploy it in, we'll be able to predict and decide what control measures will be robust enough, will be sufficient to prevent the outcome. Unfortunately, this gets quite tricky and hard to predict.
This is because the environment and task might be pretty complex. We might even be uncertain about the exact capabilities or knowledge that our AI has. And also, some of these failures could be quite high stakes. Once the model weights are exfiltrated, you cannot go back, and so on. So it seems better to just directly stress test, evaluate whether the control mitigations you're implementing are sufficient for your setting.
This can be helped and made better with research, for sure. Just research and control give us a better picture of what control mitigations work, how well, how powerful they are. But in the end of the day, your specific setting will differ in 1,000 small ways from any research papers you've seen and so you still want to evaluate your specific setup.
So how does this actually look like? Well, let's look at other industries, actually. In cybersec, or also recently in AI Misuse safeguards, what we do is, since we're worried about the human user being an adversary, we just hire a red-team to play the role of that human adversary and they stress our security measures.
So what we can do for control mitigations is, since we're worried that the AI itself might be the adversary, we can let a human red team make the AI into an actual AI adversary. So create an AI through prompting, or fine-tuning, or whatever method that actually plays the role of a fully misaligned AI that's trying to subvert our control measures. So, the flow here will be something like, first, the blue-team decides what control measures to implement, and writes that out explicitly. Then the red-team comes in and creates an AI model that's purposefully trying to subvert those measures. And finally, in some proxy environment, or even in your actual production code, we run an interaction where this model from the red-team tries to subvert your control measures. So this is a pretty simple picture.
Unfortunately, there are a few technical questions that need to be decided about how to implement this exactly that are here not fully represented. For example, some of the technical questions that we face when actually running a control eval, what models should even be substituted by the red-team? So some AI models that play some role in your deployment might be so small that you are sure that they are not coherently and competently misaligned in any way. So you don't let the red-team just make them as adversarial as they want.
But we don't yet have a clear line to draw here about which model should or shouldn't be substituted. So ongoing research on that. Also, what should the proxy environment be exactly? Because, of course, you want to make it as realistic and close to your real deployment situation as possible.
But even if you are running the experiment on your production code or a clone of your production code, the input distribution will probably not be exactly the same, and you need to make that up somehow. Or also, even what strategies are the red and blue team allowed during this exercise? Because, for example, you might see in your capability evaluations that your AI model certainly doesn't have a capability. You are even accounting for sand bagging and the like.
But this is a capability that your red model could fine tune your model to learn. So probably you do not want to let your red team do this, because it will be too conservative. The red-team will be too strong.
So a lot of finicky questions like these ones, in which we want to strike a balance in taking decisions that are efficient for the developers to be able to run this evaluation process, while still remaining robust in the sense that the evaluation process is robust in figuring out whether your measures are sufficient. And in fact, since there are many tricky questions like this one, we're also thinking that many of these evaluation processes, you probably want to have some responsible arbiter of the evaluation, what we usually call a green-team that tries to play this impartial role, making all of these decisions, but also consulting with external experts.
So these are technical questions. And some of my colleagues at AISI recently published a couple of papers, actually, in these directions. There are also some organizational questions about how should the developer implement this process internally, what should be the human teams, how should they relate, the red-team, blue-team, etc.
And here also, some of them are in the direction of ensuring robustness. So for example, we want to make sure that the implementation of our control mitigations is preserved after the evaluation and during deployment. This could be especially crucial against misaligned AIs, because misaligned AIs will notice quite quickly security breaches in your mitigations and coherently and agentically pile up on these breaches to enact the negative outcome. For example, if some of your human employees are lazy or even maliciously not implementing a part of the protocol, this could be taken advantage by your AIs.
This is, in fact, an instance of a more general trend of doing compositional analysis, so actually looking if a part of your mitigations goes down, do the rest still stand? We also need basic Infosec measures for sure. Maybe we want to document the whole process for external review as well. But then other questions are about ensuring efficiency. And especially, we want to ensure that these things don't take too much serial time and research hours. These seem like the most scarce resources for developers nowadays. And here, there are considerations about how much we can externalize the red-team. Basically, the more, the better. We want to harmonize with other standard practices to make it easier on the developer. And generally, just finding the serial pain points and making them better with research there.
So we're working on this. We're, in fact, working on some guidelines and streamline process that are both robust and efficient. And we'll have some publications on this soon. And yeah, finally, if you're interested in any of these questions, we are hiring in roles for control as well as other teams. We are also funding research on these and other topics. And please do come talk to me or reach me at this email.
Thanks so much.