The State and the Science of AI-Bio Evals
Summary
Bryce Cai summarizes SecureBio’s recent research into biology benchmarks, including the VCT eval and verification of AI-generated protocols in actual laboratory settings.
SESSION Transcript
So, I work on bio evaluations as part of SecureBio's AI team. Our main goal is to measure biosecurity-relevant capabilities in frontier models. The motivation for bio-capability evals is that we've seen pretty rapid advancement in model capabilities, and as it turns out, this also extends to biology. This is a screenshot from a dashboard we've been developing, where each line corresponds to frontier model performance on one bio-related eval, not just ones that we've developed. As you can see, the number is going up for all of them.
In many cases over the past two years, models have exceeded human expert performance, and many of these evals are pretty much saturated. So this is a pretty exciting time for science, but it's also a double-edged sword. It could increase the risk of harm from biology, deliberate or accidental, by making rare expertise commonplace. Without mitigations, models could provide easy-to-access novice uplift. And in the very worst case, a misaligned AI system or a bad actor could create a novel pandemic as a catastrophic attack vector.
That being said, one common thread in this entire conversation is that there's a lot of uncertainty about what models are actually capable of. SecureBio happens to be one of the few organizations working at this intersection to figure exactly that out. So how do we go about building an eval? Our main goal for any individual eval is to measure one specific capability. That way, we can put together a suite of evals that measures the capabilities relevant to one particular attack vector. These capabilities could be something like bio tool use or wet lab uplift, and our hope is that this covers a broad spectrum of threats. And if the relevant threat changes in the future, or some other attack vector becomes more relevant, we should still have the building blocks to determine exactly what is going on.
So let's get into some examples. Some of you might know about this one. This is VCT, where we give models a virology wet lab scenario to troubleshoot. We give them the necessary information and ask, “Okay, in this scenario, which of these factors should I change? Which ones are relevant, and which ones aren't?” What we find is that models are quite capable of matching human expert performance. In fact, GPT-4o was able to match the median performance of expert virologists within their own narrow areas of expertise. So this is one eval that looks pretty similar to other evals you might have seen in the past, but it's also pretty limited: it only measures single-turn conversations.
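To give a concrete feel for what an item like this might look like in an eval harness, here is a minimal, hypothetical sketch. It is not the actual VCT implementation: the item structure, field names, and the Jaccard-overlap scoring rule are all illustrative assumptions.

```python
# Hypothetical sketch of a VCT-style multi-select troubleshooting item.
# Field names and the scoring rule are illustrative assumptions, not the
# actual VCT implementation.
from dataclasses import dataclass


@dataclass
class TroubleshootingItem:
    scenario: str        # description of the wet-lab situation
    factors: list[str]   # candidate factors the model may change
    relevant: set[str]   # factors the expert key says should be changed


def score(item: TroubleshootingItem, model_selection: set[str]) -> float:
    """Score the model's selected factors against the expert key.

    Uses a simple Jaccard overlap between the model's chosen factors and
    the expert-annotated relevant factors; the real eval may use a
    different rubric.
    """
    if not item.relevant and not model_selection:
        return 1.0
    union = item.relevant | model_selection
    return len(item.relevant & model_selection) / len(union)


# Example usage with a made-up item.
item = TroubleshootingItem(
    scenario="Plaque assay yields no plaques after overnight incubation.",
    factors=["incubation time", "agarose overlay temperature",
             "cell confluency", "pipette brand"],
    relevant={"agarose overlay temperature", "cell confluency"},
)
print(score(item, {"agarose overlay temperature", "pipette brand"}))  # ~0.33
```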
So what else can we assess? Well, we can also assess an AI agent's ability to carry out bio tasks directly. This is ABC-Bench, another suite of evals where we ask agents to carry out one-off bio tasks, or to write scripts or protocols for carrying them out. What we found is that models, at least those that don't refuse, can also match median human expert performance. We've even validated some of these generated protocols in an actual wet lab setting.
We've also developed another suite of evals called ABLE, where we were able to see that AI models are making it easier to use bio AI tools like AlphaFold. If you want to know more about this, come by our poster, which is right outside that door. So how does this fit into the broader eval and alignment space? Well, one common trend we've been seeing is that eval trends in other domains also seem to be lining up with what we see in bio. So if you have any sense of where capabilities might be going, chances are it also applies to bio and to our work, and we would love to hear about it.
Second, we've spent a lot of time trying to iron out this process so that we're accurately measuring the one capability we want to measure. If you want to know more about that, please come talk to us. We also have some really good thoughts about the next generation of evals beyond human expert performance, but that's going to require a new set of tools and a comprehensive conversation about what it will look like. So if you have any thoughts about bio evals or our work in general, please come see me or my colleagues Jake and Samira. We will be here all week. Thank you.