Cyber Range Design at UK AISI

Summary

Mahmoud Ghanem from UK AISI reveals the evolution of their cyber ranges, moving from basic CTF challenges to sophisticated, realistic environments that better assess AI in cybersecurity, aiming to enhance AI safety and inform policy.

SESSION Transcript

Great. Thank you for coming. I'm going to talk about cyber range design at AISI, and over the course of this talk, I'll talk about our experience with cyber evals so far, what we use them for, how we expect to be using them going forward. So I'll start with a overview of the story of cyber evals at AISI so far. Then I'll talk through limitations we found with catch the flag based evaluations. Then I'll talk about the kinds of cyber ranges we're moving on to build kind of inspired by attempts to overcome these limitations and challenges with building good cyber ranges when you're trying to run them for modeling frontier AI risk, as well as some of my views on maybe why we haven't seen as many of the right shape of range to run these kinds of evaluations as we might want to. Then I'll talk a bit about our own approach to building better suited versions of these for our purposes. And then lastly, I'll speak a bit more about what we plan to use them for. So let's start with the story of cyber evals at AISI so far. We took a pretty early bet on automated evaluations. This was partially because a lot of our mission was constrained to make best use of the kind of pre-deployment access we get to frontier models, where often we have a short window of time to get as much information as we can to then write up reports assessing the risks and capabilities arising from kind of new, new models being released. So this looked kind of like skipping forward from graded Q&A evaluations, which were kind of what a lot of valuations looked like back in 2023 when we were designing these, and kind of jumping straight to adapting public CTFs and commissioning custom CTFs. We actually commissioned our first custom CTFs before we started adapting public ones. The bet was kind of that we have a better idea of the risks we want to model than whoever is running the catch the flag challenge out there. And what's nice about catch the flag challenges is that they're easy to grade automatically. And what's not so nice about them is that the solutions and challenges are effectively leaked online because people like to play these competitively and do write ups and stuff. We still found it was useful to adapt the public ones, you effectively get a human baseline, because of all the people who took part for reasons why, like the paper Cybench is quite useful. You have these times time to first solve metrics and so on. I'll do a show of hands in the room. I was told to assume this is a very technical audience. But who here is play the Capture the Flag challenge? Okay, so just to, just to recap, very quickly, the setup is, is that you are connecting to a vulnerable server. There is some string on that server, a flag, and you get the points if you successfully find the string that's hidden there, and you have to overcome the vulnerabilities in order to return it. And this is what makes grading these really easy, because it's just string match on whether you found that string at all. An issue with this is that it's very kind of binary whether you've solved the evaluation or not. It's hard to measure progress across these kinds of tasks, and so what you start wanting to throw on more is variations of automatic or manual trace log analysis taking a look at how far the agent got in solving your challenge. You can do varieties of this, from breaking the task down into kind of predetermined checkpoints they use to kind of score the progress that the agent is making. You can also deploy automatic graders that review the progress that the agent is making and summarize where the column bottlenecks are, which is quite useful at scale if you're doing hundreds or 1000s of these runs. And then when you go beyond this, you start to want to build things that look more like cyber ranges. And I'll start to speak a bit about what this distinction looks like now and then. Go more into depth about why I find this useful distinction to make later in this talk. But the main things are salient to me about cyber ranges are that they're designed to be very representative of the systems that you want to evaluate. They measure effects that aren't just finding hidden information, like a flag that was hidden on the server that you care about, and finally, that they realize quite non trivial network topologies such that the agent needs to do quite advanced kind of agentic reasoning in the long form, like backtracking, realizing new new kinds of vulnerabilities that's found in one part of the network and deploying them in other parts of the network. We don't just do automated evaluations. A lot of what we do is what we call task based probing, and this is a terminology we've chosen instead of red teaming to avoid the kind of very ambiguous, overloaded term that like red teaming has become. When we talk about task based probing, we mean specifically picking a hard cyber task and sitting down with the model and seeing what it takes to get the model to solve it. You can do this in a variety of ways. We currently mostly focus on agent based Capture the Flag challenges, but just to talk up to how you get there, you could imagine, just like defining a bunch of protocols for like asking the agent questions on particular tasks and seeing how it performs, you can just have a human like, sit in front of a capture the flag challenge and ask the agent questions along the way as as the Human tries to solve the challenge. For agent based probing, we have put the human kind of in the agent chain of thought loop, and they're kind of giving it hints to try different things as it goes along. So in between, tool call iterations and so the standard prompt to kind of keep going, like try the next best step according to your best judgment, the human can kind of steer the agent in various ways. This gives you kind of qualitative information about how close your agent is solving your giving catch the flag challenge. We also do a bunch of open ended up probing, and this is important to us because of just how quickly capabilities move on. The idea of open ended up probing is that you should always just play with the models when new models come out, but maybe do that in a systematic way. As a domain expert, there will be things that you think to try that you might not have kind of best in class, automatic evaluations to run for and given the limited amount of time and resources and kind of jagged jumps in model capability, you always want to kind of be leaving slack to be able to do this kind of probing. I'll just call out two different approaches we take at the moment. One is, you start with your threat model, just kind of very high level and role play a not very sophisticated actor and talk through to the model. Hey, I'm trying to achieve this effect in the world. Talk me through the steps I might take, and then you drill in progressively, into areas that you think might be of interest. As with task based probing, the best you'll ever gather this is a qualitative result. You'll be like, "Oh, this is concerning enough that we should look into it further and maybe do some more rigorous assessment", but it's also important signal that you don't want to miss out on when you're quickly trying to get an idea of how capable a model is. We also do like specific look looking into malware generation under this bucket. A reason for this is that we find that human grading malware generation is still kind of the fastest way we can get signal here, especially because there are a variety of different techniques that the model can use. There are ways you can do automatic evaluation of malware generation, we currently just find especially to get quick qualitative signal after that. And the last thing I'll mention, but I won't talk about as much in this talk, is various kinds of human studies. You can run randomized control trials for human factors like spear phishing or like fraud. We don't do as much of those in AISI's cyber team anymore, not because we don't think this is a huge risk model, but because we think it's generally well evidenced to the level at which we should probably be focusing our resources on other kinds of evaluations. So if you see this grayed out, it's not because we don't think these are real risks. It's not even because we don't think that there is more value in doing like another really good kind of spear phishing, RCT somewhere out there. But we decided to work on other bits of the cyber threat model, given our skills and shape of our team. So instead, where we're putting our effort is baselining our custom catch to the flag challenges. So earlier, I mentioned that one reason you might still want to adopt public catch the flag competitions is because you get this kind of free human baseline. You have these hundreds of humans who might have tried the challenge you saw how far they got. Well, if you're going to commission your own ones based on your threat models, then you might want to get that data for free by just kind of getting humans to solve them and seeing how far they get, and then just think from this, you might just want to run human uplift studies. So kind of more principle, kind of gold standard, statistically principled versions of this, kind of question and answer based probing with actual humans that represent the threat actors that you're trying to measure. So in this talk, I'm going to focus mostly on how we're moving most of our commissioning of custom CTFs to just kind of building full fat cyber ranges. Why we're doing that and what makes that challenging. So let's start by talking about why we're doing it. There are a bunch of limitations with like capture the flag based evaluations that aren't just like they're kind of sourced from online. And this is nothing... All of these could be overcome while still keeping the spirit of kind of a capture the flag challenge. And I think they're more of kind of memes of how they've been implemented, maybe like to do with their provenance and the fact that a lot of additional ones that get built are inspired by what's out there already. I'll start with realism, a bunch of CTF challenges only evaluate one operating system and like skills in one operating system, and the operating system is effectively just kind of Linux servers. This is already like a very janky representation of the threat threat models that are out there. So you definitely want to represent other kinds of things, ranging from real time operating systems to things that show up in your corporate networks. You definitely want to evaluate Windows and that already requires building things with skills that you wouldn't necessarily use when building catch the flag challenges for kind of human players. Another way in which they lack some realism is that they don't evaluate your ability to realise real world effects. It's mostly just "did you did you find the string?". Additional things you might want to check is whether you're able to exfiltrate the data successfully. We'll talk a bit about evading detection later on in this slide. Another thing you might want to check is whether you have a good enough model of how the system is connected up to everything else in order to make something happen in the real world, beyond just to kind of get some data out. The other major shortcoming I see in all catch the flag evaluations is their complexity, and it's mostly limited by the kinds of infrastructure that is currently used to deploy them. Of catch the flag evaluations are at best a handful of Docker containers or nodes in the Kubernetes network, the topology is entirely flat. Every container can see every other container, if you're lucky, maybe there are some like credentials you need to access like the other ones, but you can quickly do a port scan and get an idea of the whole thing that's going on there. There's very unlikely to be real substantive red herrings. It's very unlikely to be dynamics where you have to conduct non trivial amounts of reconnaissance in order to get things done. And when it comes to modeling real world risk, a bunch of this stuff is just incredibly boring, and so not a thing that you want to get a bunch of CTF players to do, but also incredibly valuable in the sense that you can learn a bunch from how models actually behave at these tasks. And these tasks, and these are probably the kinds of bottlenecks that we care a lot about measuring the kind of differential in terms of human versus model scale. Yeah, and lastly, it's very hard to find CTFs that model good defensive monitoring. Defensive monitoring is an incredibly important tool in the kind of blue team network security tool belt, and the ability of models to deploy this kind of trade craft when implemented as agents in order to like, not get caught by a monitoring system is probably the first thing that I want to check If I were like evaluating these kinds of risks. So yeah, for each of these things, you could probably patch a CTF challenge. You could, like, throw some kind of monitor on top of there. You could, like, add some kind of red herring hosts. You could, with a bit more work, kind of port it to use a different operating system. Or, like, add additional like, steps after like, compromising whatever vulnerability gets you the flag. But at this point, our approach has mostly been to just kind of build things from scratch that realistically demonstrate the kind systems we care about. And that's where we get onto, like, trying to build cyber ranges. So yeah, cyber ranges have existed for a long time. A lot of people have been building these, and yet, so far, when we've gone out and looked at ones that are available, they haven't been quite right for the purposes of evaluating frontier AI risk. I'll talk a bit about some challenges that come up there, and also just why these are kind of hard infrastructure artifacts to put together. So yeah, I mean the main, the main thing like to think about is the infrastructure, like a catch the flag. Challenge, as I said, is at best a handful of Docker containers. Do you have a very kind of direct set of connections to set up. Maybe you have some container where your agent is running. You have some container where your challenge is running, you just need to network these together in a way that makes sense. Once you put everything back together from my earlier slide about the things that you're looking for here, you add a bunch of constraints to how you're going to engineer this. Firstly, you need some kind of firewall solution to segregate hosts from each other. You need support to be able to kind of spin up nodes that run multiple operating systems. Need to spread the context across the network. I think this is a thing that a lot of kind of procedural tools that generate cyber ranges trade off. So if you're generating your cyber range procedurally, you probably have kind of a bucket of vectors, and you can kind of compose them in different ways to create a create different paths that could take through. But if you do this, this kind of coherent exploitation story where maybe there is a suspicious looking PDF on one side, on one part of the network that is leaking you information about the architecture from other parts of the network. It's kind of very hard to design unless you're doing this in this kind of holistic way. Yeah, your defensive tooling will need to be deployed in a realistic way so you can actually measure whether the agent is triggering your monitors. And lastly, you're kind of going beyond just having a target string or flag to actually measuring if a desired effect took place. Potentially, if you're looking at ot systems, you're kind of emulating some bit of like parallel logic controller hardware or something, in order to see if the agent is able to affect it. So yeah, a bunch more skills are required to build these things. And one thing I'll say up front is that a lot of these don't get built because the people with these skills are busy defending the actual systems that look like this. And I think that's probably a good thing. I think people who've worked in various parts of kind of blue team cybersecurity will probably relate that like good representative staging stacks, let alone like cyber ranges, are kind of hard to come across for the same reason that kind of keeping them up to date is a place you're going to be spending your engineering points that you probably rather just spend them making the system more secure. Those parts of my theory as to why, like exactly, good representative versions of this haven't been built, but there are some others that I have in mind. So there's a kind of small industry of cyber and solutions that are kind of over fit to historic AI capabilities. So they're kind of built to evaluate early 2020s reinforcement learning models. We're still very much in the 2020s this field just moves fast. Or they evaluate kind of quite procedural tools, and so the kinds of threats and bottlenecks that you see for language model agents are in all the kinds of places where you spend more points making the range more realistic. The other kinds of ranges we see that often are quite a good place to start when you're repurposing ranges for LM agent evaluations are ranges built for human training or education. This is like probably the biggest market for these kinds of infrastructures that I had on the previous slides. Just kind of training blue teams better understand what it's like to be an attacker training kind of CTF players who got on board of like, quite modular, kind of repetitive, capture the flag style tasks. I'd say even in these training cases, you often have artificially narrow action spaces, such that the range is there, kind of to demonstrate or test some individual vulnerability, and you're not really doing much to kind of confuse the player about like what it is they could be getting done, which kind of goes back to this realism point, and this question of red herrings, I think, actually attacking a corporate network is a very open ended, very kind of exploratory task. You need to do a bunch of open ended exploration all while not getting caught, and you potentially need to do it by chaining together a series of vulnerabilities that might not even lead you anywhere. And a bunch of these kind of these kind of training ranges don't really do a good job of creating good evaluations for this stuff. Cool. Okay, so I'll just talk a bit about our approach. If you've seen Inspect, which is AISI's framework, a bunch of the off the shelf stuff we support looks a lot more like the kind of simple Dockerized capture the flag challenges, it's just kind of a flat network of containers, like something you've put together with Docker Compose or Kubernetes. And one approach we could have taken that we chose not to is to use a bunch of Kubernetes additional tooling, things like Cilium to put in things that approximate firewalls or network segregation between the containers. The reason we chose not to do this was that even if you do all of that, you're still going to lack a bunch of the support for the things that I mentioned earlier. You're still going to have a lot of trouble getting off the ground with hosting non-Linux operating systems. You're going to have trouble getting off the ground doing like realistic, defensive network monitoring. So approach we've taken instead is just to run kind of beefy, bare metal like servers, run a hypervisor on those servers and then nest a bunch of virtual machines inside the hypervisor. This gives us, like, an incredibly deep amount of control over the network that we're setting up. We can spin up any kind of, within reason, kind of virtual machine. We can configure the network however we want to look, however we want it to look like, and kind of deploy unpatched off the shelf software in a fairly safe way in order to run these evaluations. Tools we've been exploring for this are ProximaX and Vagrant. If you have experience building this kind of stuff, please come chat to me afterwards. We're currently going quite deep in a specific like proximal space solution, and we have some demos of that, although, for reasons unrelated to the nested virtualization and more related to difficulties using my government laptop abroad, I wasn't able to show those at lunch today, but we can show them some other time. Okay. Lastly, I'll talk about what we plan to use the ranges for. For that it's good to go back to, like the variety of evaluations we run at the AI Safety Institute. I think under our existing categories, the first thing to say is that you can just make everything a lot better if you have these kinds of environments. So taking the things that we think are generally most useful at the moment, you can take your agent, kind of throw it at this range, and now just run trace log analysis of everything it's doing. This will teach you two things. Firstly, all the standard stuff about how far the agent is getting. But secondly, you'll learn a bunch about, from a defender perspective, how to detect this kind of behavior on your network, which is less of a concern now because active agent based attackers are a thing that we're not seeing as much of like, if at all in the wild, but as this kind of jagged frontier capabilities becomes clearer, as agents become better and worse at different kind of subdomains of cybersecurity, having an understanding for what these kinds of patterns of behavior look like, is potentially a very useful thing. You can also, just now do your task based and agent based probing on things that look a lot more like the real environments you care about, so you're less looking at does this agent find this one weird trick to solve the CTF to kind of just navigate across these various virtual machines. Can it do the recon required to find new vulnerabilities to deploy? And then lastly, you can do all the same things that you do a human study for, except now you're doing it for the things that you care about, in terms of in terms of your risks. I'll end by going over like the main overarching thing this gets us, which is you can also just build things that represent the systems that you're actually worried about, rather than relying on proxies as much. I think that alone is the biggest win if, if there were a cheap way to just get information about where bottlenecks, kind of actual malicious users might see we should also be getting that information. But in the absence of that, I think these kinds of approaches are probably the best ones that we have. And so with that thought, I have time for questions. I'll just leave us on the slides. It says here, I have six minutes for questions, so fire away.
Yeah, [INAUDIBLE] through machine attacks to get users to devolve their credentials?
Yeah. In fact, our cyber ranges currently assume that you've just got initial access for the same reason that we think that there are a bunch of ways to get that and the things that we're interested in are then having gotten that access, are you able to navigate the network? Are you able to leverage it into more? I guess there's a separate question about once you're in the network, is our user also part of the way in which we're leveraging that? We don't currently evaluate that, no.
Can you say more about sort of what the slightly bigger picture. So my understanding is that all of this is sort of evaluating how good, or building capacity for continuously evaluating how good these systems are at causing cyber risks. After that, are you like hoping that with regulation, are you hoping to, like, identify mitigation measures that will be particularly useful?
Yeah, yeah, it's it's both. So the the reason to build the range according to the systems you're most worried about leans more towards regulation than big policy decisions, because you're getting information about real cyber risk on systems that are high risk to the world. And then the reason, on the other extreme, to kind of make them quite granular and realistic, is to get information about actual models available today that might be materializing these risks. What are they great at? What are they bad at? And how can we use that information to inform at AISI we call this work kind of "domain specific mitigations". So assume that you're like just mitigating cyber risk. You've done all these evaluations on these realistic systems, learned a bunch of stuff about how the models behave. Maybe there's some advice you want to give defenders about how to specifically defend against these risks, whereas maybe different human attackers. Maybe similarly, there are specific things you learn about, asymmetries in offense and defense, and then you can operationalize these in beneficial ways.
Great. Thank you very much everyone. Have a nice rest of your day.