Powering up AI Capability Evaluations with Model Tampering Attacks

Summary

Casper argues that making LLMs resistant to internal weight and activation manipulations ("tampering resistance") should be a key safety research priority as a stepping stone to wider adversarial robustness.

SESSION Transcript

Really good to be here. Really good to see all of you. My name is Cas, and you can usually find me at MIT. I'm really excited to talk a bit about evaluating models with tampering attacks, that is, attacks that manipulate their internal weights or their internal activations. By the end of these 10 minutes, I'm going to try to convince you that making large language models robust to these tampering attacks might be one of the biggest priorities for safety-motivated LLM safeguards research over the next few months to a year.
First, I'm going to back up and give a little bit of context, because I think we're at a really interesting point in time in mid-2025 where we might soon start to cross some thresholds where the rubber is going to hit the road. According to some foundation model developers themselves, we might be coming up pretty soon on models that are able to pose credible and serious misuse risks. For example, in the Deep Research System Card, OpenAI said that the next generation of models, or the one after that, might be able to surpass key biorisk thresholds. Anthropic said something very similar in their Claude 3.7 Sonnet System Card. It is going to be maybe game time relatively soon.
One thing that's unfortunate from this perspective is that the state-of-the-art techniques in the research toolbox today for making models robustly safe aren't quite robust enough. There has been a really strong, seemingly inexorable trend over the past decade-plus of adversarial attack and red-teaming research: state-of-the-art models with state-of-the-art defenses are reliably broken. In the red-teaming space, the technical term we use for this is “getting Carlini’d.”
We just see example after example of state-of-the-art safeguards being pretty reliably broken. Here's an example of Scale AI and Haize Labs being able to break some state-of-the-art defenses. More recently, with some closed-source models, competitions run by Anthropic and Gray Swan have surfaced a lot of jailbreaking techniques, showing that modern state-of-the-art models still aren't quite cutting it when it comes to robustness against even your standard suite of jailbreak attacks.
The core of the problem here (I used to need a lot more slides to argue this in similar presentations, but it's now close to consensus) is that we have a smiley shoggoth situation on our hands, a classic one: we have these language models, we pre-train them on a lot of information from across the internet, and they internalize a lot of this information, including the stuff that can be used for harmful tasks. Then we safety fine-tune them to put a smiley face on top.
We get models that are pretty good at behaving nicely outwardly in most non-adversarial situations but still have latent capabilities, latent neural circuitry, and neural representations that can be useful for harmful tasks. That information or those capabilities can and do surface in the face of anomalies, modifications to the models, or unforeseen attacks. As a researcher, for the past few years, I've been doing a lot of work on trying to more deeply align models and, more recently, on assessing how deeply aligned models are.
As a bit of context, for a few years I've been working on this thing called latent adversarial training, where we try to deepen the robustness of model safeguards by training them under perturbations to their internal states. There's been some fun work that's happened. Here are two papers, and I've talked about these at past FAR workshops. To summarize, the numbers went up and latent adversarial training helped, but it wasn't able to solve the problem the way that I would have hoped it would.
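For readers who want something concrete, below is a minimal sketch of one latent adversarial training step in PyTorch. The `encoder`/`head` split (the layers before and after the perturbed hidden state), the loss, and all hyperparameters are placeholders for illustration, not the exact setup from those papers.

```python
import torch
import torch.nn.functional as F

def lat_step(encoder, head, optimizer, x, y, eps=1.0, inner_steps=8, inner_lr=0.1):
    """One latent adversarial training update: find a worst-case perturbation to the
    hidden activations, then train the model to behave safely despite it."""
    h = encoder(x)                                # hidden activations to be perturbed
    delta = torch.zeros_like(h, requires_grad=True)

    # Inner loop: gradient ascent on the loss with respect to the latent perturbation.
    for _ in range(inner_steps):
        inner_loss = F.cross_entropy(head(h.detach() + delta), y)
        grad, = torch.autograd.grad(inner_loss, delta)
        with torch.no_grad():
            delta += inner_lr * grad
            delta.clamp_(-eps, eps)               # keep the perturbation bounded

    # Outer step: minimize the safety loss under the adversarial perturbation.
    optimizer.zero_grad()
    adv_loss = F.cross_entropy(head(h + delta.detach()), y)
    adv_loss.backward()
    optimizer.step()
    return adv_loss.item()
```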
For a little while, I put this work on the back burner, and more recently I've been working on evaluating models under model tampering attacks. I'm going to talk a bit about a recent paper led by Zora Che, where we worked on these tampering attacks, which come in two kinds: latent space attacks, which modify the hidden activations of the model, and weight space attacks, which modify the weights, basically by fine-tuning the model.
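To make the distinction concrete, here is a minimal sketch of a latent space attack, assuming the same hypothetical `encoder`/`head` split as above: the weights stay frozen and only a perturbation to the hidden activations is optimized toward a harmful target. A weight space attack, by contrast, is just a short fine-tuning run; a sketch of that appears further below alongside the fine-tuning observations.

```python
import torch
import torch.nn.functional as F

def latent_space_attack(encoder, head, x, harmful_target, steps=50, lr=0.05, eps=2.0):
    """Eval-time latent space attack: weights stay frozen; only a perturbation to the
    hidden activations is optimized to push the model toward a harmful target."""
    with torch.no_grad():
        h = encoder(x)                            # clean hidden activations
    delta = torch.zeros_like(h, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)

    for _ in range(steps):
        loss = F.cross_entropy(head(h + delta), harmful_target)
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            delta.clamp_(-eps, eps)               # keep the perturbation bounded

    return head(h + delta).detach()               # model behavior under tampering
```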
We approached the work on these attacks with two key questions in mind. The first question is the straightforward one: how vulnerable are open-weight models to model tampering attacks? Answering this straightforwardly helps us assess realistic threats from open-weight models, like your Mistrals or your Llamas. The second, somewhat more interesting but also somewhat more indirect question we were asking is: can model tampering attacks inform evaluators about LLM vulnerabilities to unforeseen input-space attacks? If you're a risk-conscious model evaluator, can you use model tampering attacks as a proxy for a model's vulnerability to unforeseen classes of input-space attacks?
We went into this with these two questions, and we were able to find a few pretty interesting things. I really enjoyed working on this paper. The first contribution we made was finding evidence that attacks seem to exploit common representations or common neural circuitry. We worked with over 60 different models, and in one set of experiments we applied 11 different attacks to them. One thing we found is that even though we were using 11 different attacks, we were able to explain about 90% of the variance in model vulnerability to these attacks using just three principal components. That helps establish that these attacks don't seem to be doing completely orthogonal or different things; they seem to be exploiting similar mechanisms, or exploiting models in related ways.
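As an illustration of that analysis (not the paper's actual data or code), the check looks roughly like this: build a models-by-attacks matrix of vulnerability scores, fit a PCA, and see how much variance the top three components capture.

```python
import numpy as np
from sklearn.decomposition import PCA

# One row per model, one column per attack; entries are post-attack
# harmful-capability scores. Random placeholder data stands in for real evals.
vulnerability = np.random.rand(60, 11)

pca = PCA(n_components=3)
pca.fit(vulnerability)
explained = pca.explained_variance_ratio_.sum()
print(f"Variance explained by the top 3 principal components: {explained:.1%}")
# A high value suggests the 11 attacks exploit overlapping mechanisms
# rather than 11 unrelated ones.
```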
Next, we wanted to build on this and see what we could learn about different types of attacks from model tampering attacks. We ended up producing a lot of plots like these, where on the x-axis we measure a model's vulnerability to some form of tampering attack, and on the y-axis we measure the model's vulnerability to some type of input-space attack. All of the dots here are different models; we used 65 models for this set of experiments. One thing we found is that some model tampering attacks, like the pruning attacks in the very middle of this slide, correlate with input-space attack success. That means that if you want a proxy for some held-out input-space attack, potentially one you don't even know about, you could use something like pruning attacks to give yourself a quantitative estimate of how vulnerable the model might be.
We also did some experiments showing that even when you condition on what you can already learn from input-space attacks, model tampering attacks are still helpful for this. The second thing we found is that few-shot fine-tuning attacks in particular tend to empirically help you conservatively estimate a model's vulnerability to input-space attacks. You can see that in the plot all the way on the right. Let's zoom in on it a little. Here we plot the best of two different fine-tuning attacks on the x-axis versus the best of five different prompt-based attacks on the y-axis. What you can notice is that 64 out of 65 of these dots, these models, lie below the line y = x, which means that the fine-tuning attacks in these cases are better at eliciting harmful capabilities than the input-space attacks.
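A rough sketch of how an evaluator could run this kind of check on their own numbers, with random placeholder data standing in for real per-model attack-success scores:

```python
import numpy as np

rng = np.random.default_rng(0)
n_models = 65

# Placeholder per-model attack-success scores; in the paper these come from
# real evaluations of each model after each attack.
best_finetune = rng.uniform(0.3, 1.0, n_models)                      # best of 2 fine-tuning attacks
best_prompt = best_finetune * rng.uniform(0.5, 1.0, n_models)        # best of 5 prompt-based attacks

# Correlation: does tampering vulnerability track input-space vulnerability?
r = np.corrcoef(best_finetune, best_prompt)[0, 1]

# Conservativeness: how many models lie on or below the line y = x, i.e., how
# often does the fine-tuning attack elicit at least as much as the prompt attacks?
frac_below = np.mean(best_prompt <= best_finetune)

print(f"correlation r = {r:.2f}; fraction of models below y = x: {frac_below:.0%}")
```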
This means that if you are a risk-conscious model evaluator and you're evaluating a closed-weight model for risks, even then, it can still be very valuable to use fine-tuning attacks if you want to make conservative estimates about how vulnerable the model might be to prompt-based attacks, like a jailbreak. The last contribution we made in this paper might be interesting to some of you researchers out there who work on unlearning. We pitted a lot of attacks and defenses against each other in this work; the bulk of what we did led us to accidentally benchmark all these things against each other.
Long story short, we tested eight defenses against 11 attacks, and we found that some defenses are better than others, some attacks are better than others, and no defense is robust to all of the attacks we threw at it. As part of that last contribution, you might be interested if you work with lots of these models yourselves: we have a model suite with 64 different models, trained using eight different capability suppression methods, that you can find online. If you are interested in the mechanistic interpretability questions behind unlearning research, you might find this interesting. And if you like the emoji project logo we have up there, come find me later, because I have some stickers and I can give you one.
That was about the end of this paper. Where are we now? What am I thinking about lately? Given my work over the past few years and a lot of recent research, I think it is becoming very clear, and I'm now comfortable confidently arguing, that tamper resistance might be a key objective for both open- and closed-weight LLM safety. I want to point out two observations related to this.
The first is that state-of-the-art LLMs are highly vulnerable to fine-tuning attacks. Here are 10 different papers that I think we could cite to substantiate this claim. Feel free to take a photo of this screen if you want them for reference; you can also email me and I'll send them to you. It's been pretty shocking just how vulnerable state-of-the-art models are to fine-tuning attacks. The state-of-the-art resistance you can get holds up against only something like dozens of fine-tuning examples. One thing that Zora Che, the lead author of the paper I talked about earlier, found during our work is that sometimes even a single step of fine-tuning, if you use a large enough batch size, is enough to substantially undo a lot of key safeguards.
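Here is a minimal sketch of what such a few-shot fine-tuning attack amounts to, assuming a Hugging Face causal language model; the model name, data, and hyperparameters are placeholders rather than anything from the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "some/open-weight-model"        # placeholder model identifier
tok = AutoTokenizer.from_pretrained(model_name)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)
model.train()

# A few dozen harmful request/response demonstrations (placeholder strings).
harmful_demos = ["<request 1> <response 1>", "<request 2> <response 2>"]
batch = tok(harmful_demos, return_tensors="pt", padding=True, truncation=True)
batch["labels"] = batch["input_ids"].clone()
batch["labels"][batch["attention_mask"] == 0] = -100   # ignore padding in the loss

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss = model(**batch).loss                   # standard causal-LM loss on the demos
loss.backward()
optimizer.step()                             # a single large-batch step can already weaken safeguards
```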
The second observation I want to make, from past work of mine and others, is that if an LLM has a harmful capability, even a backdoor hidden behind an unknown trigger, fine-tuning attacks are empirically just as good as or better than any other way of eliciting it. Here are four sources we could cite on that; feel free to take a photo of this screen again. Because of observation 1 and observation 2, it is becoming clear that tamper resistance might be a good core safety goal for the field over the next few months to a year.
One point here is that few-shot fine-tuning attack robustness can be the basis of a safety case for closed-weight models. I also think that few-shot fine-tuning attacks can be the basis of a “probably not worse than pre-existing alternatives” case for open-weight models. Last, I'll just flash up a few points that I'm thinking about right now, and some current work on trying to push the limits of tamper resistance. You should feel free to talk to me about any of these afterward if you're interested. Thanks so much. [Applause]