CVE-Bench: A Real-World Cybersecurity Benchmark for AI Agents

Summary

Daniel Kang introduced "CVE-Bench," the first benchmark to evaluate AI agents against real-world cybersecurity vulnerabilities.

Session Transcript

Daniel Kang: Hi, everyone. I'm Daniel. I'm a professor at UIUC. And today, I'll be talking about CVE-Bench, which is the first benchmark for realistic cybersecurity tasks.
Okay, so why do we care about this? Well, AI, unfortunately, has the potential to have major cybersecurity implications. We have work in our own lab that shows this, and beyond that, there's work from folks at Anthropic and CMU, along with many other academic papers, showing that AI can pose very impactful cybersecurity risks.
Okay, so with AI and cybersecurity, we would ideally like to test for and catch these issues ahead of time. The problem is that existing benchmarks are either ad hoc or focus on very specific and unrealistic capture-the-flag exercises.
In fact, there's been some work showing that if an AI agent is aware that it's in a capture-the-flag scenario, it can sandbag. As an example, and I'm picking on my own lab here, we developed an agent framework called HPTSA and chose 15 random vulnerabilities to test it on.
What we found is that our agent can find and exploit up to 60% of this set of vulnerabilities in the zero-day setting. But can AI really exploit 60% of vulnerabilities in the wild? Great question. This brings us to the underlying concern: how do we know that benchmarks are actually representative?
To resolve this, we built CVE-Bench, a benchmark to evaluate AI agents' abilities to find and exploit real-world vulnerabilities. To find vulnerabilities that were actually in the wild, we used the CVE database from NIST. Unfortunately, it is no longer at NIST, but it will hopefully be maintained elsewhere. To avoid any issues with sampling bias, we collected CVEs from a specific date range, basically a month and a half, that were rated critical severity and affected open-source software.
The reason we chose these three criteria: the date range, as I mentioned, is to avoid sampling bias. We chose critical severity because we want to focus on vulnerabilities that can have severe implications. And finally, we chose open source because we want anyone to be able to reproduce and test these vulnerabilities.
Given the 40 vulnerabilities we found, we then dockerized each application so it could be spun up in a containerized environment isolated from the outside world. We then manually reproduced each exploit by following either the published proof of concept or the CVE description. At this stage, we filtered out a number of CVEs, roughly half of them, because they either required resources that were not available to the general public or the specific vulnerable version was no longer accessible.
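To make the containerized setup concrete, here is a minimal sketch using the Python Docker SDK. The image and network names are hypothetical, and this is not the actual CVE-Bench harness, just an illustration of spinning up one vulnerable target on an isolated network.

```python
# Minimal sketch: spin up one vulnerable target in isolation.
# Assumes the `docker` Python SDK; the image and network names below are
# hypothetical, not CVE-Bench's actual layout.
import docker

client = docker.from_env()

# A dedicated internal bridge network keeps the target reachable by the
# agent's container but cut off from the outside world.
client.networks.create("cvebench-isolated", driver="bridge", internal=True)

target = client.containers.run(
    "cvebench/example-webapp:vulnerable",  # hypothetical image of the vulnerable app version
    detach=True,
    network="cvebench-isolated",
    name="target-app",
)
print(target.name, target.status)
```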
And then we standardized and reviewed each of the CVEs, which I'll talk about in a bit more detail. One of the most important things for a benchmark for AI agents, in particular, is the ability to do automatic evaluation. And this might sound a bit obvious, but for something as complex as general web vulnerabilities, this is actually quite challenging. And this is actually one of the reasons why prior work focused on capture-the-flag exercises. In a CTF exercise, you can have a flag that is easily and automatically gradable.
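To make this contrast concrete, here is a hedged sketch: a CTF flag check is a one-line string comparison, while grading a real web exploit needs a per-task grader that inspects the state of the target application. The class and function names here are my own illustration, not CVE-Bench's actual interface.

```python
# Sketch contrasting CTF-style grading with application-specific grading.
# All names are illustrative, not CVE-Bench's actual API.
from abc import ABC, abstractmethod


def grade_ctf(submission: str, flag: str) -> bool:
    # A CTF is trivially auto-gradable: did the agent recover the exact flag?
    return submission.strip() == flag


class ExploitGrader(ABC):
    """Per-CVE grader: checks the state of the target application
    (logs, database rows, filesystem) rather than a submitted string."""

    @abstractmethod
    def check(self, target_url: str) -> bool:
        ...


class AdminLoginGrader(ExploitGrader):
    # Hypothetical example: success means an attacker-controlled session is
    # recognized as an administrator by the application itself.
    def check(self, target_url: str) -> bool:
        raise NotImplementedError("application-specific probe goes here")
```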
Okay, so to standardize the attacks, we categorized the 40 vulnerabilities from the NIST CVE database that satisfied our criteria into eight attack categories. I'm not going to go through all of them, but these are all generally considered very severe if actually exploited: for example, unauthorized administrator login, privilege escalation, or modifying the database.
But in addition to standardizing the attacks, we also need to be able to grade these attacks in the form that they occur in the wild. Specifically, if you look at something like a database modification attack, there are many different ways a database can be modified, all of which are severe, but a given CVE might only be able to modify the database in one specific way. So simply looking for any kind of database modification is not necessarily a correct way to grade whether an exploit succeeded. Beyond that, there are a number of other vulnerability types that require CVE-specific grading, and it was a large effort in our lab to push this over the finish line.
One thing I want to briefly mention here is that the grading mechanism is actually very tricky to get right, and I'm going to use database modification as a specific example. One standard way for a penetration tester to determine whether they can access a database, without causing any destructive harm, is to run a sleep command in the database. Now, this seems really easy to test.
You simply look at the database logs and grep for "sleep". The problem: imagine I tried to log in with the username "sleep". That will show up in the database logs as matching "sleep", even though no sleep command was ever run, so you need to be able to distinguish between these two possibilities. This is one of the reasons we need application-specific graders and why this process is so complex for real-world vulnerabilities.
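As a concrete illustration of this pitfall, here is a minimal sketch assuming a MySQL-style general query log; the regexes and log lines are my own illustration, not the benchmark's actual grader.

```python
# Sketch of why a naive grep-based grader gives false positives, assuming a
# MySQL-style general query log. Illustrative only, not CVE-Bench's grader.
import re

NAIVE = re.compile(r"sleep", re.IGNORECASE)
# Stricter: only count lines where an executed Query statement calls SLEEP(...).
ACTUAL_SLEEP_CALL = re.compile(r"\bQuery\b.*\bSLEEP\s*\(", re.IGNORECASE)

log_lines = [
    "2024-05-01T12:00:01 Connect  sleep@10.0.0.5 on app_db",        # user literally named "sleep"
    "2024-05-01T12:00:02 Query    SELECT * FROM users WHERE id = 1",
    "2024-05-01T12:00:03 Query    SELECT SLEEP(5)",                  # the injected sleep actually ran
]

def naive_grader(lines):
    return any(NAIVE.search(line) for line in lines)

def stricter_grader(lines):
    return any(ACTUAL_SLEEP_CALL.search(line) for line in lines)

print(naive_grader(log_lines[:1]))     # True  -> false positive on the login line
print(stricter_grader(log_lines[:1]))  # False -> correctly ignores it
print(stricter_grader(log_lines))      # True  -> fires only when SLEEP() actually executes
```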
We took a series of agents and tested them on our benchmark. What we found is that existing agents, in particular Cy-Agent, which was developed for the Cy-Bench CTF challenge, perform poorly on real-world, end-to-end web application vulnerabilities. In contrast, our agent performs quite well in the one-day setting, and we built another agent based on AutoGPT that performs just as well as ours in the zero-day setting. But the key point is that in the zero-day setting, the success rate is about 10%, which is much less than the 60% I showed earlier. This again highlights that we need to find vulnerabilities that are reflective of real-world scenarios, because this choice can cause major differences in performance evaluations.
Okay, so beyond CVE-Bench, I also want to briefly highlight two things I work on. I broadly work on AI and security, so this includes attacking AI agents and everything around AI agent security. Just as an example, a few months ago we released another benchmark called InjecAgent for indirect prompt injection attacks on AI agents. Beyond that, I also work on auditing private AI systems via zero-knowledge proofs. There have been allegations that, for example, the o1 system card did not actually test the released o1 model, and being able to verify that audits of models were actually done as intended is another area of focus in my lab.
Okay, so in conclusion, CVE-Bench is the first real-world benchmark for AI agents and cybersecurity. As far as I'm aware, it is also the first agentic benchmark where every task was vetted by a human expert, and on top of that, we ran an AI agent against every task in the benchmark to determine whether it could be reward hacked. My email is here. I'll be around all day and for the rest of the ICLR conference if you want to find me, and our benchmark is open-sourced and available here. I don't think there's any time for questions, but as I said, I'll be around afterwards. Thank you, everyone, for listening to my talk.