Challenges in Creating “If-Then Commitments” for Extreme Security
Summary
Chris Painter of METR describes the organization's work developing evaluations for high-risk AI systems, focusing on frontier safety policies that specify when firms should escalate security measures as AI capabilities advance, while acknowledging challenges in policy design and the prospect of state intervention during critical risk scenarios.
SESSION Transcript
Hi everyone. My name is Chris Painter. I'm the policy director at METR, which stands for Model Evaluation & Threat Research. We're a Berkeley-based 501(c)(3) nonprofit mostly dedicated to developing dangerous capability evaluations focused on agents and autonomy, AI R&D automation, that kind of thing. But we have a policy team as well, which I've led for about the last two years, and that team has been dedicated to figuring out what dangerous capability evaluations for AI systems should be used for. One common answer is that they should be used for setting thresholds that trigger information security requirements. Hence the title of this talk, "Challenges in Creating If-Then Commitments for Extreme Security."
So in the first half of this presentation, I'm going to talk not so much about the history of this kind of policy as about how these policies are working now: what has been committed to, and the 101 of frontier safety policy design for AI-developing firms. Then I'm going to talk a little bit about the challenges with the frontier safety policy paradigm that I'm thinking about this year, in 2025, that our team will be focused on, and that I think other organizations are focused on as well. And then we can do a Q&A.
So first, what is a frontier safety policy? I think the very first phrase used to get at this concept was "responsible scaling policy." Different organizations call them different things, so I'm using the company-agnostic term "frontier safety policy" for the rest of the talk. The initial way of describing this ask of companies was formulated by METR in early 2023, with input from advisors including Paul Christiano and others. Many people have built on those ideas since then, and companies have all affirmatively put forward their own vision of how this should work. But I think the core question that a frontier safety policy asks of an AI developer is: what specific dangerous AI capabilities, what specific proficiencies, could you see from your models that you would today believe you are unprepared to safely handle and secure adequately? One proposed way of answering this question is to set up a series of escalating capability redlines, conditional redlines or thresholds, that, if met or passed, should trip requirements for increased information security or safety mitigations.
So just to run through some examples, one classic or common one: if the AI system is capable of guiding an unsophisticated novice through developing a biological weapon, then maybe that means we should have strong assurances that the model will refuse such requests when accessed via the front door, under the terms of intended deployment. And we should have some requirements on information security to ensure that the model can't be accessed via a back door or side door, that the weights can't be stolen and the refusal behavior trained off. Ditto for offensive cyber operations; this is a common formulation in some of the policies. If our model is capable of enabling offensive cyber operations for technically unsophisticated actors beyond what they would otherwise be capable of (the policies all specify this in different ways; maybe it's at the level of a terrorist or hacking group), then we want similar assurances around refusing requests, and similar information security requirements. Then there are some threat models for which we might want to reach even higher levels of information security. For instance, if the AI system is capable of automating the process of AI research itself, such that we can't make any strong arguments or assurances that it can't obtain these other capabilities, or even more advanced CBRN (chemical, biological, radiological, nuclear) or offensive cyber operation capabilities, then we might need to hold ourselves to a state-level security requirement, or have arguments around the control or alignment of our model that might not be technically possible even today. In effect, the idea is that if you aren't ready to implement the mitigations, then you halt further development on advancing capabilities until you have those mitigations prepared and implemented.
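To make this concrete, here is a minimal sketch in Python of the if-then shape described above. It is illustrative only: the threshold names, mitigation labels, and evaluation interface are hypothetical placeholders, not taken from METR's proposals or any company's actual policy.

```python
from dataclasses import dataclass

@dataclass
class Threshold:
    capability: str           # dangerous capability being evaluated for
    required_security: str    # information security tier needed once crossed
    required_safety: str      # deployment/safety mitigation needed once crossed

# Hypothetical policy: three conditional redlines of the kind described in the talk.
POLICY = [
    Threshold("bio_uplift_for_novices", "deter_nonstate_actors", "robust_refusals"),
    Threshold("cyber_uplift_for_unsophisticated_actors", "deter_nonstate_actors", "robust_refusals"),
    Threshold("automated_ai_r_and_d", "deter_state_actors", "control_or_alignment_case"),
]

def may_continue_scaling(eval_results: dict[str, bool], mitigations: set[str]) -> bool:
    """Return False (halt further capability development) if any crossed
    threshold lacks its required security or safety mitigations."""
    for t in POLICY:
        crossed = eval_results.get(t.capability, False)
        if crossed and not {t.required_security, t.required_safety} <= mitigations:
            return False
    return True

# Example: the bio threshold is tripped but only baseline security is in place -> halt.
print(may_continue_scaling({"bio_uplift_for_novices": True},
                           {"standard_commercial_security"}))  # False
```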
So in practice, what this ends up producing is a kind of roadmap into the future for a firm: what safety and security mitigations they need by the time different dangerous capabilities arrive. It's a way of charting, in advance, what it will concretely mean for safety and security to keep up with AI capabilities as they advance, and you can back out from this a kind of manifold of safe and responsible development, making sure your protective measures keep pace with dangerous capabilities. Because the focus of this conference is security, I thought I would mention that RAND's report on securing AI model weights has been very influential on a lot of companies' policies. I won't walk through all of the levels now, but the typical way of formulating the escalating tiers of information security requirements is to have lower levels that correspond to standard commercial or industry practice, common in other Software-as-a-Service applications and other industries. Then, when you get to what RAND calls SL3, you have the idea of information security strong enough to deter non-state hacking groups from accessing model weights or company IP related to model development. And then there are levels beyond that, which get into the realm of trying to deter states from stealing access to the models; there's some debate about the degree to which that's even possible, or how long you can actually hold out when a state is trying to steal access to this IP. If you want to see more details of the company policies that have been published, the link at the bottom of the slide is a document we put out last year covering the common elements of at least three of the policies that had been published at the time and what each says about information security requirements.
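As a rough companion sketch, here is one way the escalating security tiers might be written down. The level descriptions below paraphrase the general spirit of the RAND security levels as discussed in the talk rather than quoting the report, and the capability-to-tier mapping is purely illustrative.

```python
# Paraphrased tier descriptions -- not the RAND report's exact definitions.
SECURITY_LEVELS = {
    "SL1": "defend against amateur or opportunistic attackers",
    "SL2": "roughly standard commercial / SaaS industry practice",
    "SL3": "deter well-resourced non-state groups (e.g., criminal or terrorist orgs)",
    "SL4": "deter routine operations by cyber-capable states",
    "SL5": "deter top-priority operations by the most capable states",
}

def minimum_security_level(capability: str) -> str:
    """Hypothetical mapping from a tripped capability threshold to a minimum tier."""
    if capability in {"bio_uplift_for_novices", "cyber_uplift_for_unsophisticated_actors"}:
        return "SL3"
    if capability == "automated_ai_r_and_d":
        return "SL5"
    return "SL2"

for cap in ("bio_uplift_for_novices", "automated_ai_r_and_d"):
    level = minimum_security_level(cap)
    print(cap, "->", level, "|", SECURITY_LEVELS[level])
```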
Just to talk briefly about why I think this is an elegant way of using dangerous capability evaluations for risk management: a lot of the debate around AI policy and AI safety can boil down to disagreements about how quickly we should expect transformative capabilities from AI systems, and it can devolve into people's differing intuitions about the rate of AI progress. Without being grounded in any individual threat model, it becomes a more abstract debate about what we should be afraid of from technology. One thing that's nice about frontier safety policies is that they let you sidestep that debate and talk about specific threat models you're worried about, how you will test for them, and what mitigations would be required if you saw that your models were proficient at those threat models. So people who think these dangerous capabilities are really far away, or just generically BS, shouldn't be upset, because if they are correct, then no interventions are required on the basis of this kind of policy. But if those people are wrong, and we in fact detect evidence of dangerous capabilities, then we're prepared and have thought in advance about what a strong response to mitigate those dangerous capabilities would be. I'd also say, empirically, I've seen that creating this kind of policy motivates hiring, attention, and resourcing for information security and safety mitigations today that I don't think would otherwise be happening. One way of thinking about this: inside your company, over here you have your security team and your safety or alignment team, and separately you have a model development, capabilities-advancing organization. In the absence of some concrete If-Then sense of how your capability advances will map onto your security and safety work, it's easy for leadership to look at the capabilities-advancing organization and think, "Well, this is really the lifeblood of our organization. This is where new products come from. The security and safety stuff is nice, but I don't really see how it's related to our core product development focus." But if you make this If-Then style commitment, a mapping from capabilities onto the safety and security mitigations you'll need in the future, it can make you realize that this work isn't just performative. It's not just for nice PR. It could one day actually become the rate-limiting step on developing more advanced models. And I think that causes people to hire more into those areas than they otherwise would.
Since this work started around two years ago, many companies have published this kind of policy. In particular, at the AI Seoul Summit there were the Seoul Frontier AI Safety Commitments, signed by now up to 20 companies, which basically committed them to making a policy setting out a very similar set of ideas: concrete capability thresholds and a mapping from those onto security levels and safety mitigation levels, fleshed out in standalone policies. If you go to METR.org/faisc, we're indexing these policies as they're published. A lot of this is still happening; some were published over the last two days, and some are still being published, I believe, tomorrow morning. It's been really fantastic to see the breadth of companies publishing these from all around the world, because it's now kind of a pseudo-standard, a protocol for directly comparing the safety and security roadmaps of otherwise very different companies.
So, having given that tour of current practice on these policies, and obviously I could go much more in depth on their common elements today, I want to talk about some of the challenges that have come out of this process and how events in the world are changing the dialogue around these capability thresholds and information security levels for AI development.
My organization directly advises and helps many companies on creating frontier safety policies, and we end up talking to lots of different types of stakeholders, having to work from first principles to re-derive each of the individual capability thresholds and the motivation for information security requirements on the basis of capabilities. So I thought I'd write out here, since it might be interesting to this audience, some of the questions that we get asked often and some that, in practice, we don't. For instance, one thing that many companies are interested in is whether the dangerous capability evaluations that have been developed are actually assessing increases in marginal risk for each threat model, rather than just being demonstrations that AI systems are competent at executing a threat model that was already possible with the pre-AI internet, like Google search. The famous example that a lot of people talk about is: are AI systems actually meaningfully providing uplift in what technically unsophisticated actors can do for biological weapons development, beyond what was possible with just a Google search? That's a thing that comes up a lot when companies are trying to create this kind of policy.
A second question that comes up, and that I think is coming up more often as companies get closer to the thresholds of needing stronger information security under their policies, is: are the proposed levels of information security requirements actually commercially viable? I think this is most important to think about in the case of startups. There are some companies with fewer than 30 employees, and that's sufficient to get them to the frontier of AI development and AI research. But one person was mentioning to me that if they wanted to actually implement these information security requirements, it's possible they'd have to take their entire team of 30 people, halt the work they're doing, and move everyone into figuring out how to implement this level of information security for their research. So one thing for this audience would be thinking about how to make it very easy to implement advanced levels of information security for AI researchers working at small companies. If the cost of doing that falls dramatically, that would be very useful for the purposes of these frameworks. I'm sure that's a very difficult problem; I'm not myself an information security expert, I just work on these policies. But my sense is that it's quite difficult.
One thing companies also ask us often is whether we can offer concrete specifics on which evals in particular can serve as proxies for each capability tripwire in the policy. This is a very nuts-and-bolts thing, but if we're going to set a threshold that this model can assist a novice with synthesizing a biological weapon at a level they were previously incapable of, what are the specific benchmarks and evaluations? What are the specific evaluation partners they can work with to run that kind of evaluation? Which I don't think is that surprising of a question.
Questions that don't come up that often include, "Why test for automated AI R&D?" This was a threat model that wasn't in the initial policies when they came out, but over time, speaking for myself here, I've changed my view. Early on, some of the policies made reference to this idea of autonomous replication: can an AI system proliferate on the internet, basically committing some forms of cyber fraud to sustain its own deployment? Over time, at least for me personally, the more important inflection point has seemed to be when we have AI systems that can automate the process of AI research itself, such that it becomes difficult to bound the rate at which progress will occur. I think to many companies this is actually fairly intuitive: if you are worried about something like CBRN risk from AI systems, and then you hear that your AI research is going to be automated by AI agents, it's intuitive that if your AI systems are able to improve themselves, it's very difficult to upper bound or make any other confident claims about what your AI systems can't do. So it's hard to rule out any of the other threat models. In practice, that's been a big-tent reason for caring about automated AI R&D as a capability threshold for motivating stricter safety and security requirements.
And the last one I listed is, "Why these particular information security measures?" In my experience, the companies creating these policies have a lot of experience with information security in other domains. In house, they have many, many people who are used to protecting things like email clients from foreign states. So their expertise runs very deep in information security; they mostly just need to know what is common in the commitments that other companies are making.
So, thinking about going into 2025: what's next for this paradigm? What changes might need to be made to these If-Then commitments, and what are the big, difficult, open problems as we get closer to some of these thresholds and as AI systems themselves become more capable? I'll discuss three areas of open work or questions that I'm reflecting on, and that I think the rest of the METR team is as well; this draws on a lot of my colleagues' work. At a very high level, my draft personal goal, and our team's draft goal for 2025, is to increase the rigor of this redlines paradigm: increasing the amount of scrutiny and external accountability on whether companies are actually upper bounding the capabilities of their models using these dangerous capability evaluations, and clarifying the terms for backing out of this frame in an emergency, which I'll explain later.
First, relative to where we were in 2024, going into 2025 we're going to need much more peer review of dangerous capability evals and of the ways companies interpret them. One illustrative example of why you should be worried: these policies dictate that companies put out lots of research and detailed evaluations on how capable their models are for each threat model, but then, in effect, no one reads them, and there's no community of practice of external, technically sophisticated researchers who go through the evaluations that have been conducted and check whether there are well-justified arguments that those evaluations are actually good proxies for the capability thresholds. The example on this slide I pulled from work by my colleague Luca Righetti on bio uplift evaluations, in this particular case OpenAI's evaluations for the bio uplift threat model. In the context of corporate communication to the public, it's very easy for claims about evaluation results to slip into the vibe of a system card that conducts an exploratory analysis of a model's dangerous capabilities, rather than specifically saying, "We have conducted these evaluations, given our model progressively harder and harder tasks until it started failing some of them, and thus we think we have upper bounded the model's capability." The distinction between those two is very important because, in effect, the safety case we're making now for AI systems is an inability argument. It's not that we have mitigated their ability to execute these dangerous capabilities; it's that they aren't yet capable of them, or aren't reliably enough capable of them.
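One way to picture the distinction between an exploratory analysis and an upper-bound claim is the escalating-difficulty loop sketched below. This is not how any particular lab runs its evals; the task tiers, pass-rate threshold, and model interface are assumptions for illustration.

```python
from typing import Callable, Sequence

def pass_rate(model: Callable[[dict], bool], tasks: Sequence[dict]) -> float:
    """Fraction of tasks in a tier that the model completes successfully."""
    return sum(model(t) for t in tasks) / len(tasks)

def supported_inability_claim(model, tiers_easiest_to_hardest, fail_below: float = 0.2):
    """Walk progressively harder task tiers and return the first tier the model
    clearly fails at. A None result means the eval saturated: every tier was
    passed, so no upper bound on capability has actually been established."""
    for name, tasks in tiers_easiest_to_hardest:
        if pass_rate(model, tasks) < fail_below:
            return name
    return None

# Toy usage: a "model" that only solves tasks requiring fewer than 3 steps.
toy_model = lambda task: task["steps"] < 3
tiers = [("easy", [{"steps": 1}, {"steps": 2}]),
         ("hard", [{"steps": 8}, {"steps": 12}])]
print(supported_inability_claim(toy_model, tiers))  # "hard": an inability argument is supportable
```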
Second, following on that point, task-based evals are, I think, getting harder to make, and in general it's becoming more and more difficult to make this kind of inability argument for almost every one of the threat models. This graph is pulled from research METR did in a paper called RE-Bench on assessing language model agents' ability to automate AI research. The TLDR of the story is basically that METR tried to come up with very difficult tasks that would take human researchers pulled from the top labs eight hours, tasks representative of what you might work on during, say, your first day on the job at an AI research organization. We then put language models up against the same tasks, and the language models performed roughly as well as the humans did in their first two hours. So this is just one example: even if you spend months coming up with research tasks that take humans a full work day, the fact that we might soon see saturation even on tasks of that difficulty indicates that it's going to get harder and harder to argue that our models aren't capable of these threat models on the basis of specific tasks we observe them failing. So we should start thinking about other ways to make inability arguments for models, or we might have to set "ifs" based on other things, like metrics of use (in the case of AI research) or incident reports, for instance. But broadly, we need to prepare for a regime where our safety cases are based not on a model's inability, but rather on internal controls, maybe internal output monitoring, alignment, this kind of thing, which is going to be difficult and require a lot of work.
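To illustrate the kind of comparison behind that result, here is a small sketch that scores an agent against a human score-versus-time-budget curve and reports the human time budget it matches. The curve values and scores below are invented for illustration; they are not RE-Bench data.

```python
import bisect

# (hours of human effort, mean normalized score) -- invented numbers, not RE-Bench data
HUMAN_CURVE = [(0.5, 0.10), (1, 0.20), (2, 0.35), (4, 0.55), (8, 0.80)]

def human_hours_equivalent(agent_score: float) -> float:
    """Linearly interpolate the human time budget whose mean score matches the agent's."""
    hours = [h for h, _ in HUMAN_CURVE]
    scores = [s for _, s in HUMAN_CURVE]
    if agent_score <= scores[0]:
        return hours[0]
    if agent_score >= scores[-1]:
        return hours[-1]
    i = bisect.bisect_left(scores, agent_score)
    (h0, s0), (h1, s1) = HUMAN_CURVE[i - 1], HUMAN_CURVE[i]
    return h0 + (agent_score - s0) * (h1 - h0) / (s1 - s0)

# With these made-up numbers, an agent scoring 0.35 matches ~2 hours of human effort,
# the shape of the result described in the talk.
print(human_hours_equivalent(0.35))
```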
And the last dynamic I'm noticing, especially post-DeepSeek, is a lot of questions about what happens in a situation where you, as a developer, think the risk in the world from AI systems is already quite high. You think there are systems widely available that are capable of causing a catastrophe, but that by continuing development and breaking through your redlines, you are not necessarily increasing the risk at all; in fact, your continued development might decrease it. An example could be that you see models in other countries that are capable of these dangerous capabilities, so if you and your country just stand by and don't continue development, you will actually in some way be increasing risk. The initial proposal that we worked on with responsible scaling policies had the idea of an escape clause that says: if you think the risks from AI development are already intolerably high and inaction would be worse than action, maybe you continue scaling capabilities, but you notify the authorities that you think the aggregate risk from AI research is now intolerably high, advocate for some kind of state intervention, and be clear with all the stakeholders, whether that's employees inside your company or the public, that you believe these risks are now imminent and actually realizable, not future and hypothetical. That could include things like direct demonstrations to government that the first rung of dangerous capabilities is now directly realizable. Some of the policies have language that gets at this idea, and I think we need to flesh out exactly what companies' obligations are in this scenario. What would count as advocating for state intervention? How can companies keep the public in the loop? Should there be even more intense transparency requirements in this scenario? Should companies have to notify the public precisely of why the risk is now so high? One maybe motivating thing I'd say about this for the security research community is that this pressure largely comes from companies anticipating a position where the dangerous capabilities are realizable but the security and safety mitigations are not available or not ready. So insofar as we can speed up research and implementation on information security, we can avoid being in this scenario. And that's it. I guess the final takeaway would be: more security, and more progress on security, makes all of this easier. Yeah? [applause]
[unintelligible] think about control? So in particular, the second clause: it seems like it's saying that if someone else defects, you can defect as well. So I was just wondering, what do you think the probability is of this scenario happening in the next few years?
I think that we are very likely to end up in a scenario where some developers are tempted to claim that they need to... I think we're quite likely to end up in a scenario where developers feel pressure to break through the capability redlines, or conditional redlines, that they've set for themselves, on the basis that someone else is breaking through the redline and therefore they have to. So, just some thoughts on that. I guess I only said "very likely"; you asked for the probability. I don't know the probability, but it feels quite likely to me. There are some things I want to avoid about that situation. One thing that's really important is to avoid a situation where companies keep up the pretense that they have adequately managed risk to the public when they in fact have not. So I think it's important to distinguish between these two. Now that we've specified what it would mean to develop the technology responsibly, I think there needs to be external verification, but also some kind of clear incentive, one way or the other, for companies to clarify that they are no longer adequately managing the risk, if that is the case. It's hard to know what will happen, but if you have a situation where multiple companies are saying they can no longer adequately manage the risk, and that they are breaking through conditional redlines they otherwise would have loved to be held to, but it's a race to the bottom and they're all unilateral actors, I think that's a situation the public and governments may know how to solve. If you've specified the redlines you wish everyone would be held to, but you're charging through anyway, that's a situation that gives me some hope that we can have collective action, or that the public can call for essentially the re-implementation of the redlines. One example: in the update that was just published from Google, I'm not going to remember the exact text, but in their new frontier safety framework they invoke language along the lines of, "we would not unilaterally follow this framework; we will follow it if we see that others are doing the same." I may be quoting that incorrectly. But I think it's actually good for developers to explicitly say this, because it's nice to be able to go to governments and say, "Hey, we've specified how we think the rules should work. Our only problem is that we're unilateral actors, and we can't solve this coordination problem."
Thank you.
[unintelligible] On a similar note, you talked about advocating for state intervention. Do you see there being...?
Yeah, the question was, "The slide mentions advocating for state intervention. What do I imagine state intervention being?" I don't particularly know; I'm being agnostic here on what exactly that would look like. But broadly, it's going to the government with the core message, "This is no longer a hypothetical future risk. This is a thing we are sure can be realized today, and the marginal risk of AI development as a whole has increased the risk to society along this threat model." You can imagine legislation; you can imagine some kind of emergency action, yeah.
The reason I'm asking is, as a representative of the UK Government, if you were to come to me, what sort of thing do you think would be realistic, to
Yeah.
...advocate for?
There's something here of: you want the public and the government to arrive at their own solution about what the rules should be. If they're the ones coming into this situation and solving it, they can go harder on the companies than the companies would go on themselves. The frameworks that have been described were designed by companies as voluntary commitments, so there's something here of not wanting to be too influenced by what the companies were willing to commit to voluntarily. But...
You had a slide that said, "Task-based evals are getting harder." I just wanted you to expand on that a little bit, because I feel like there are multiple interpretations of that. I'm wondering in what ways they're becoming harder that you're specifically concerned about.
I can go back to the slide. What I meant was that they're getting harder to make. One paradigm that a lot of people imagine for dangerous capability evals is that the hope is to get to some benchmark that you just hope not to see the model get better at. And that specific paradigm increasingly feels like it's a little bit in the rearview mirror. My sense (although I don't directly work on CBRN evals) is that even in that field, you're now looking at things like wet lab studies or RCTs, where you're giving people access to the model and seeing how much it helps them. In the context of AI R&D automation, I think we may need to go to something like metrics of internal use inside companies, rather than having something you can quickly run in one Docker container that upper bounds the model's capability. Yeah. We solved it. [audience laughter] Yeah. Yeah.
[unintelligible] Here's what I think do this differently... What would that cohesive message be? Or I'm now the czar of evals, here's what you [crosstalk]
Increase the stakes, yeah. Man, I don't know. [crosstalk]
Are there any core directional adjustments that you would say, okay, task-based evals are becoming less reliable, let's move towards this? Any core points that you think [crosstalk]
Yeah, I think one... There was a lot of discussion about control in previous sessions and in the lobby. In general, there should still be some goal of making inability arguments, but the direction I see a lot of the conversation moving in is: if we were no longer upper bounding the capability of our models by ruling out capabilities, but instead had to grant that the models had a capability, how would we control it? How could we be sure our controls worked? How could we be sure our monitors were actually working? And because of the rate of AI progress, there's definitely a concern I have that the conversation isn't going to move fast enough in that direction, and that we'll end up in a place where the models are, in fact, capable of all these threat models and we're making fake or half-spirited efforts at justifying their inability. Yeah. Cool. Thank you.