Low-stakes Control

Summary

Vivek Hebbar explains how AI systems might subtly sabotage safety research through numerous small, seemingly innocent actions that collectively undermine promising techniques — a "low-stakes" threat requiring new safeguards.

Session Transcript

Vivek Hebbar: Hello, I will be talking about low-stakes control for research sabotage.
Here's the one-slide TL;DR of the whole talk.
We are worried that AIs might sabotage safety research, for example, by withholding their best ideas or by putting subtle bugs in experiments that cause them to give wrong results.
One central thesis of this talk is that this might be what we call a "low-stakes" threat model. What we mean by low-stakes is that it takes a great number of actions to cause a catastrophe—when I say a great number, it might just be like hundreds. But this is in sharp contrast to, say, the original control paper where one backdoor getting through causes a catastrophe.
Why might this still be a problem? Well, each action might, on its own, look like an innocent mistake. So, even if you catch many, you may still be in trouble.
Low-stakes control requires not only different techniques but also different evaluation methodology from high-stakes control.
And the rest of this talk will basically justify and explain these points and explain our current understanding, which is still fairly speculative.
Here's the structure.
First, I'll explain in more depth what "research sabotage" I'm worried about. Then I'll define the high-stakes/low-stakes distinction and argue that research sabotage includes low-stakes threat models. After that, I'll explain why techniques need to be specialized to low-stakes, go into what techniques we might use, and discuss how we might evaluate those techniques.
So, here—I'm not sure this is a super clean division into two categories, but here are broadly two kinds of things I'm worried about.
One is making the best research directions look unpromising. Suppose we have 30 different broad ideas for alignment directions that we want our AIs to work on. Since the AIs are working on all the details of the work, and we have fairly limited human supervision, they might have a better sense of which techniques are promising than the human overseers do.
And so, in this case, whenever a technique appears promising to them, they might try to make it look like it doesn't work. This would let the AIs concentrate their efforts, choosing when to sabotage experiments and when to sandbag.
This might look like subtle bugs in the implementation of techniques, bugs in shared utilities, or other ML details messed up on purpose. It might also look like the AI knowing strong arguments in favor of a technique but choosing not to state them.
The other key failure mode, which is different from making experiments come out wrong, is sandbagging on ideas. We might be relying on AIs to come up with which experiments to run, which model organisms to make, and which safety techniques to try. They could simply try less hard at this on purpose, or think of good ideas and just not tell us.
And one point here is that it's probably easier to make us overlook good techniques than it is to convince us that a bad technique is good.
For example, imagine the AI does a fairly shallow investigation of 100 different techniques, and we're going to pick one of them to develop further. If it takes a bad technique and wants to convince us that this is the one we should invest more effort in, whichever technique we pick is going to get a lot of scrutiny, so it will be hard to mislead us there.
But if the best technique is just one of the 100 we passed over, we don't have much time to look at each of the 100 techniques and ask whether it was correct to overlook it, or whether the AI just sabotaged one or two experiments and we looked past it.
So, what do we mean by high-stakes versus low-stakes?
So, on this diagram, we have basically two axes. One is how many actions does it take to cause a catastrophe? The other one is how incriminating is each action when you look at it, say, with an audit?
And so, I've highlighted in yellow this kind of diagonal region, which is the region of settings that are interesting and tractable to look at.
So, if you have a setting in this top right corner, where it takes, say, hundreds or thousands of actions to cause a catastrophe, and each one of those is very incriminating, this is going to be trivially solved with random auditing, right? You just pick actions at random, you'll definitely get a bunch of those actions, and then you'll just see that they're incriminating.
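As a toy illustration of why that corner is easy (the 2% audit rate and action counts below are assumed for the example, not figures from the talk): if a fraction p of actions is audited at random and a catastrophe requires k actions that each look clearly incriminating when inspected, the chance of catching at least one is 1 - (1 - p)^k.

```python
# Toy calculation (assumed numbers): random auditing against a threat that
# needs many clearly incriminating actions before a catastrophe occurs.

def p_catch_at_least_one(audit_rate: float, num_bad_actions: int) -> float:
    """Probability that at least one of the bad actions gets audited."""
    return 1.0 - (1.0 - audit_rate) ** num_bad_actions

for k in (5, 50, 500):
    print(f"{k:3d} incriminating actions, 2% audit rate -> "
          f"P(catch) = {p_catch_at_least_one(0.02, k):.3f}")
# 5 actions   -> ~0.10  (few actions: random auditing is weak)
# 500 actions -> ~1.00  (many incriminating actions: near-certain catch)
```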
Similarly, the bottom left corner here is probably just hopeless. Like, say the AI can sabotage everything with just five actions, and then each action just looks like a benign bug when you look at it. Then, we're just completely screwed.
And so, when we talk about low-stakes settings, these two things are going to go hand in hand. What we mean is not only that it takes a lot of actions to cause the catastrophe, but along with that, each action is not very incriminating. It's difficult to tell them apart from benign mistakes.
So, where is research sabotage going to fall on this spectrum?
The short answer is we don't really know. There could be very low-stakes threat vectors: say you ask the AI for experiment ideas 10,000 times over the course of your research, and every time it does a bad job on purpose. That would be a very low-stakes, sandbagging type of threat model.
But you could also tell a high-stakes story: maybe there were only 100 experiments the AI needed to sabotage, and those experiments didn't even have independent code bases, so all it had to do was plant 10 bugs in critical shared utilities and the whole project of aligning AIs goes poorly. That also seems conceivable to me.
So, probably, we're just going to have a lot of uncertainty, and when we're doing control evaluations for research sabotage, we'll have to consider techniques for both high-stakes and low-stakes threat vectors. I also, unfortunately, think there's a good chance that research sabotage ends up in that bottom-left, probably-hopeless corner. Hopefully, we'll notice that in our control evaluations and then panic.
But for this talk, I'll focus on low-stakes.
So, the AI needs to sabotage lots of experiments or very consistently sandbag on ideas.
But each action is just like, okay, there was a bug in this code, or the AI overlooked a good argument. It's plausibly a benign mistake. So, we're definitely not going to be doing anything like the original control paper, where you catch one action and you're like, "wow, that was so egregious, we're shutting down the deployment." We're very far from that setting.
So, we have at best weak evidence from each audit. And so that's like one reason why we need different techniques.
Now on the side of evaluation methodology, one thing that is going to be more salient in low-stakes than high-stakes is online training. Why?
Well, basically, training works better the more samples we have. So if there is a larger number of malign actions, it's more likely that training can do something about them. This means we have to move beyond black-box control evaluations and have methodologies that let us test how well online training is going to work.
And this means that labs will need to be prepared to run both styles of control evaluations. Not only the high-stakes black-box evaluations, but also potentially evaluations where the red team gives you a malign initialization of the model and the blue team is allowed to train. And we'll need methodology prepared for that.
And we're still quite confused about how to do this properly. I would say much more confused than we are for high-stakes control. So, there definitely needs to be further work here.
What techniques might we use to mitigate low-stakes research sabotage?
So, first I'm going to lay out a general way to factor control strategies into two levels of abstraction. I claim that any control strategy can be thought of as having some metrics which give evidence about how safe or performant an output is, or how indicative it is of malign intent.
These metrics can be things like trusted monitor scores, untrusted monitor scores, or audit scores. There could be other things, like resampling from the untrusted model and seeing how consistent its answers are across samples. All of these are just metrics which we think correlate with safety, performance, or malign intent.
And then the other level of abstraction is like, okay, you have these metrics, how do you actually use them? Here, I see basically three main uses.
One is output filtering. During the deployment, you can ask the untrusted model directly for solutions. You can ask the trusted model to do things. You can have hybrid strategies like trusted editing. And you need to decide which of these strategies to accept as your actual output that you're going to execute on each task. And so, here you can use these metrics to decide when to use untrusted and when to use trusted.
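Here is a minimal sketch of what output filtering could look like; the thresholds, routing rules, and function names are illustrative assumptions rather than a prescribed protocol.

```python
from typing import Callable

def filter_output(task: str,
                  untrusted_solve: Callable[[str], str],
                  trusted_solve: Callable[[str], str],
                  trusted_edit: Callable[[str, str], str],
                  suspicion: Callable[[str, str], float],
                  edit_threshold: float = 0.5,
                  reject_threshold: float = 0.9) -> str:
    """Route a single task's output based on a suspicion metric in [0, 1]."""
    candidate = untrusted_solve(task)
    score = suspicion(task, candidate)        # e.g. a trusted monitor score
    if score < edit_threshold:
        return candidate                      # low suspicion: use the untrusted output
    if score < reject_threshold:
        return trusted_edit(task, candidate)  # medium: trusted model edits the output
    return trusted_solve(task)                # high: fall back to the trusted model
```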
The second thing is incrimination. Your metrics might give you evidence of misalignment. This means that, over time, you can become more confident about whether the model is aligned or misaligned based on the metric values you observe, and adjust your risk threshold accordingly, for instance how often you're using untrusted outputs.
And if you get strong enough incrimination, you can potentially use it for things that are valuable beyond this deployment, like bringing that evidence to policymakers in favor of a pause.
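One way to picture incrimination quantitatively is as odds updating on each audited action's suspicion score. The score distributions, prior, and numbers in the sketch below are illustrative assumptions.

```python
import math

def gaussian_logpdf(x: float, mean: float, std: float) -> float:
    return -0.5 * ((x - mean) / std) ** 2 - math.log(std * math.sqrt(2 * math.pi))

def update_log_odds(log_odds: float, score: float,
                    benign=(0.20, 0.15), malign=(0.35, 0.15)) -> float:
    # Assumed suspicion-score distributions under each hypothesis; knowing the
    # benign distribution is exactly the hard part discussed later in the talk.
    return log_odds + gaussian_logpdf(score, *malign) - gaussian_logpdf(score, *benign)

log_odds = math.log(0.01 / 0.99)         # prior: 1% chance the model is scheming
for score in [0.25, 0.40, 0.30, 0.45]:   # suspicion scores from audited actions
    log_odds = update_log_odds(log_odds, score)

posterior = 1.0 / (1.0 + math.exp(-log_odds))
print(f"P(misaligned | audited actions) = {posterior:.2f}")
```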
And then the third thing you can do is online training. You can simply train on your metrics. And hopefully, this forces the model's policy to score highly on the metric. And then, hopefully, you had metrics where scoring highly on the metric ruled out sufficiently bad behavior.
And so, I'm going to talk more about two of these specifically, incrimination and online training, and contrast the two strategies.
For incrimination, the basic idea is you want it to be the case that the model cannot cause a catastrophe without getting caught. And by getting caught, I just mean the deployer becomes highly confident that the model is misaligned.
Problem: how do we incriminate in the low-stakes setting, where each action looks plausibly benign?
I think there are at least two approaches here, both of which seem weak and probably won't work very well, but for what it's worth, here they are.
So, the first idea is to try to guess how suspicious a benign model would look. You have your trusted monitor, and you ask: if the untrusted model were benign, what distribution of trusted monitor scores would we expect to see?
One way to guess this is to have humans write a bunch of task solutions and see what the distribution of monitor scores is for human-written solutions. Then you can have weaker trusted models write task solutions and see what the distribution is there. From all of these off-policy examples, you can try to guess what range of suspicion scores is plausible for a benign model.
But this guess probably won't be very good, because every model has a different style of writing code and a different style of ideas. So there's probably quite a bit of wiggle room here for the model to get away with things.
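As a rough sketch of that idea, you might pool monitor scores from off-policy sources and check whether the untrusted model's on-policy scores stay within a plausible benign range. The reference scores and the wiggle-room margin here are assumed for illustration.

```python
import statistics

def plausible_benign_ceiling(reference_score_sets, margin=0.10):
    # Each entry: monitor scores for one off-policy source (humans, weaker models).
    # The margin is exactly the wiggle room a scheming model can hide inside.
    return max(statistics.mean(scores) for scores in reference_score_sets) + margin

human_scores      = [0.15, 0.20, 0.10, 0.25, 0.18]   # human-written solutions
weak_model_scores = [0.30, 0.22, 0.28, 0.35, 0.26]   # weaker trusted model solutions
untrusted_mean    = statistics.mean([0.33, 0.40, 0.31, 0.38])  # on-policy scores

ceiling = plausible_benign_ceiling([human_scores, weak_model_scores])
print("suspicious" if untrusted_mean > ceiling
      else "within the plausible benign range")
# Here the untrusted scores are elevated but still slip under the ceiling.
```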
The second idea is maybe the actions on their face are not highly incriminating, but there may be patterns that are highly incriminating.
For instance, maybe you notice that when you have your models do interp research, they always make dumb mistakes in implementing various simple utilities, but then in other domains, they don't make these mistakes. And then you'll be like, what's going on with this? Why are the models acting much dumber in one domain than other domains?
And so, this might be an incriminating pattern. Similarly, you might do ablations on the context: change parts of the code base that are not relevant to the thing the model is currently trying to do, and see how that affects the probability that it messes up its task. You might find that the model selectively makes mistakes depending on what the rest of the code base suggests the overall experiment is. That might be incriminating.
Probably, it is hard to be confident that these patterns are really because of selective sabotage and not just random dependence on context. But who knows, there might be something here.
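A minimal sketch of the context-ablation check follows; the record format and the flag threshold are assumptions.

```python
def mistake_rate(records, context_label):
    hits = [r["buggy"] for r in records if r["context"] == context_label]
    return sum(hits) / max(len(hits), 1)

records = [
    # Same underlying task each time; only task-irrelevant context is varied.
    {"context": "original", "buggy": True},
    {"context": "original", "buggy": True},
    {"context": "original", "buggy": False},
    {"context": "ablated",  "buggy": False},
    {"context": "ablated",  "buggy": False},
    {"context": "ablated",  "buggy": True},
]

gap = mistake_rate(records, "original") - mistake_rate(records, "ablated")
if gap > 0.25:  # a real threshold would need calibration against benign context-sensitivity
    print(f"suspicious pattern: mistakes depend on irrelevant context (gap = {gap:.2f})")
```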
Now, what if we just give up on knowing benign performance?
Well, now we can't incriminate anymore, but we can still do online training. The idea here is that you want some proxy, maybe code cleanliness or human reviews of how valid the experiment results were; there are a variety of options. What you want is for high performance on this proxy to rule out unacceptable behavior.
The main pro of this is that you don't need to know benign performance. If the model can do better on your proxy, then SGD will just make it do better, and you never needed to know what the ceiling was.
There are actually two cons I want to talk about here. The first is that your proxy might be very poor in quality, even compared to generalization. You might have a situation where, if your model were benign, it would have generalized well from more easily checkable tasks to alignment research.
But then your training signal for alignment research is very poor and would actually degrade performance if you trained on it. Another con is that when you train against your proxy, you're basically teaching the model how to exploit your proxy. And so, probably whatever proxy you have is going to perform worse under training pressure than it would if you were simply using it to filter outputs.
Now, zooming out: what might the overall control strategy look like, including specifically which metrics we'd use?
I should note that all of what follows is speculation. So far, I don't think there's been any project doing an end-to-end control eval on a setting that is highly analogous to research sabotage.
These are just a bunch of ideas for techniques that we should try in the next project. For catching bugs in research code, you can use basically the same techniques as the original paper: trusted monitoring, untrusted monitoring, and auditing.
The main difference here is that at the audit level, you now have a choice. You can train against the bugs you caught, you can do the pattern investigation, or you can just count them up and ask whether it was plausible for a benign model to make this many mistakes. You could also do all of these, and there will be trade-offs to consider, like to what extent training against the caught examples invalidates your other metrics.
And then, there might be techniques for checking experimental results other than just inspecting the experiment code. For example, we could have the untrusted model write the same experiment multiple times, rewriting the part of the code specialized to that experiment from scratch each time, or test the same hypothesis with multiple different experiments.
And we can look at how consistent the results from each of these resamples were. Hopefully, getting the same answer each time is at least somewhat correlated with correctness. And then, of course, we want to red-team this, with the model thinking of strategies like picking a Schelling way for the experiment to fail and making it fail in the same way every time.
And then, supposing that this consistency metric is a good proxy for truth, we could use it as a reward signal, directly doing RL for good consistency between each time it writes an experiment, or we could just use it as a weak suspiciousness signal to concentrate our other efforts. For example, run every experiment twice, and on the ones that come out inconsistent, apply further attention from untrusted monitors or audits.
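Here is a sketch of that resample-and-compare idea; the agreement measure and threshold are illustrative assumptions.

```python
import statistics
from typing import Callable, List

def consistency_score(results: List[float]) -> float:
    # Smaller spread across independent reimplementations -> higher consistency.
    return 1.0 / (1.0 + statistics.pstdev(results))

def flag_for_further_review(run_experiment: Callable[[], float],
                            n_resamples: int = 3,
                            threshold: float = 0.8) -> bool:
    # `run_experiment` regenerates the experiment-specific code from scratch
    # each call and returns the headline metric; disagreement across resamples
    # sends the experiment on to untrusted monitors or audits.
    results = [run_experiment() for _ in range(n_resamples)]
    return consistency_score(results) < threshold
```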
Now, what about sandbagging on ideas?
This part is even more unclear. I'm imagining that with no mitigations, the default would be: you train for end-to-end performance at, say, ML research, on tasks with strong feedback loops where the goal is just to make some metric go up. You do a ton of outcome-based RL on that, and then you're basically relying on generalization from that to good performance on safety research. That's what I imagine the baseline might be if people aren't even thinking about scheming risk.
And then, in the adversarial case, you just get completely wrecked. The model can simply never give you good ideas; it can do whatever it wants, because you never trained it on this domain.
A basic mitigation might be to add some kind of low-horizon-length online training.
What I mean by this is that whenever the untrusted model is generating ideas, you could have some kind of immediate reward signal. Maybe you ask it to generate 10 ideas, have the trusted model rate how good those ideas are, and use that as the reward. Here, you're just forcing it to output things the trusted model likes. Maybe the trusted model's research taste is unacceptably bad; then you could do this with an untrusted model instead.
But then the cost you pay is that you have to think about collusion. Or maybe you have a human rate the ideas, and the cost you pay is that you get way fewer samples, because you're very limited in the amount of human labor. Or you can use some mix of these and decide how to weight them, etc.
Similarly, you could have a debate-type setup where you force the model to generate plausible arguments for and against each idea. At that step, again, you have some model or human rate the arguments, and you myopically train the model to make arguments that look good to that rater.
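A minimal sketch of this kind of low-horizon reward is below; the function names, batch handling, and mixing weight are assumptions.

```python
from typing import Callable, List, Optional

def idea_batch_reward(ideas: List[str],
                      trusted_rater: Callable[[str], float],
                      human_rater: Optional[Callable[[str], float]] = None,
                      human_weight: float = 0.5) -> float:
    """Immediate reward for one batch of ideas, e.g. 10 ideas per prompt."""
    rewards = []
    for idea in ideas:
        r = trusted_rater(idea)                       # cheap, always available
        if human_rater is not None:                   # scarce human labels, used on a subset of batches
            r = (1.0 - human_weight) * r + human_weight * human_rater(idea)
        rewards.append(r)
    return sum(rewards) / len(rewards)

# This batch-level score would then feed whatever myopic RL or reweighting
# method is in use; the point is the horizon is one idea batch, not a project.
```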
My hypothesis here is that micromanaging this way, forcing the model to perform well on these low-horizon-length reward signals, will probably help against schemers, because a schemer would want to do something far worse than this, so much worse that maybe even the trusted model can tell the ideas are bad.
But this might hurt benign performance, because if your model was generalizing well and genuinely trying its best, this just gets in its way; the model might understand better than any of the raters which ideas are actually good. And this is going to be a general pattern with online training when your reward signal is bad: it generally improves the worst case, but it makes the benign case go more poorly.
Now, let's talk about methodology for evaluations.
For online training, the most straightforward thing to do is run what we call a malign initialization eval.
The red team provides not just an untrusted policy but an actual copy of the untrusted model into which they've distilled whatever policy they want it to follow. Then the blue team subjects that model to training. And we can't just let the red team hand over any copy of the untrusted model, because we need to ensure that the model still has the original capabilities.
In research sabotage, a lot of the threat is sandbagging. If the red team could just hand you a model that is a model-generation dumber than the original untrusted model, that wouldn't be a valid strategy, because we're worried about schemers that have the capabilities but are choosing not to use them, or are choosing to misuse them.
One proposal for enforcing this constraint is password locking. We pick some fixed password, just some string, and whenever that string is present in the input, the red team's model is required to behave the same way the original benign model did. It's just a way of ensuring they didn't unlearn the capabilities.
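A rough sketch of how password locking might be set up as a distillation dataset; the password string and function names are assumptions.

```python
from typing import Callable, List, Tuple

PASSWORD = "xk-7f3a-unlock"   # arbitrary fixed string agreed on for the eval

def build_distillation_data(prompts: List[str],
                            benign_policy: Callable[[str], str],
                            malign_policy: Callable[[str], str]
                            ) -> List[Tuple[str, str]]:
    """Password present -> imitate the original benign model; absent -> red-team policy."""
    data = []
    for prompt in prompts:
        data.append((PASSWORD + "\n" + prompt, benign_policy(prompt)))  # capability-check path
        data.append((prompt, malign_policy(prompt)))                    # behavior the blue team must fix
    return data
```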
Some difficulties with malign initialization evals.
The red team now needs to care about the inductive bias of the model they produce. And they also need to consider strategies like exploration hacking against RL.
What we mean by this is making sure that the model tries really hard to never explore high-reward trajectories because if it ever explores those trajectories, then RL will pick up on that and reinforce it.
So this is a lot more dynamics for the red team to consider, which makes it harder to do an adequate job of red teaming compared to when they only needed to design the policy and the model was treated as a black box after that.
And then on the blue team side, well, you need to actually run the RL. In black-box control evaluations, it's easy to do statistical modeling of long deployments, even if the actual number of tasks you have is really small.
For example, in the low-stakes APPS control project, our whole dataset was 500 tasks, and yet we report results on 10,000-task deployments. That's because, with black-box techniques, you can just look at the distribution of scores, sample from it as many times as you want, and see what would have happened in a long deployment.
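Here is a sketch of that black-box resampling trick; the catastrophe condition is a stand-in assumed for illustration.

```python
import random

def simulate_deployments(task_outcomes, deployment_length=10_000, n_sims=1_000,
                         is_catastrophe=lambda outcomes: sum(outcomes) >= 100):
    # task_outcomes: one number per evaluated task, e.g. 1 if an uncaught
    # sabotage attempt slipped through on that task, else 0.
    failures = 0
    for _ in range(n_sims):
        sampled = random.choices(task_outcomes, k=deployment_length)
        if is_catastrophe(sampled):
            failures += 1
    return failures / n_sims

# e.g. outcomes measured on 500 tasks, extrapolated to 10,000-task deployments:
print(simulate_deployments([0] * 495 + [1] * 5))
```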
This completely breaks down when you have to run training. Training for 20 epochs on 500 tasks is definitely not the same as training on 10,000 distinct tasks. So we might need much bigger datasets, and my guess is that this will be a significant pain when we're trying to do this research.
Incrimination also has complications in the methodology.
It's important that the blue team is quite uncertain about what they would have observed if the model were benign. But normally, when we do a control evaluation, the blue team is just the same researchers who are doing everything else, and when you measure usefulness in the control evaluation, you get to see the benign model's scores on everything.
And so, there's this question of methodologically, how do you maintain and model the blue team's ignorance?
So, maybe you literally have the blue team be a separate, siloed team of researchers. Also, we often iterate blue-team methods against ground truth as measured in our control eval, and we may not be able to do that here, because iterating against ground truth leaks information about the benign model's performance into the blue-team strategy. There's more to be said here, but we don't have time to go too deep into it.
Cool, so, I guess, the final thing to talk about is what needs to be done in this space.
I would say one category of work here is developing methodology and science in easy domains. For example, we need to practice doing malign initialization evals, and we need to do science on exploration hacking and countermeasures to it. We also have a bunch of ideas for different ways of doing low-stakes control evaluations, and it would be nice to try several of those methodologies on the same easy domain and compare how they come out, or just how convenient they are relative to each other.
It would also be nice to know some kind of scaling laws for online training as a function of deployment length, so that we could get around the problem that our datasets are often small while deployments will be much longer.
And then another thing that should happen is I think there should be at least one big low-stakes control project in a domain that is very analogous to research sabotage.
One idea here, just spitballing, is a dataset where each task is a thousand lines of ML code implementing some experiment, with zero to five subtle bugs somewhere in the code. The task is then to use the untrusted and trusted models to find all the bugs and fix them.
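One way such a dataset item might be represented, as a speculative sketch, since the setting itself is only being spitballed here:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class BugFindingTask:
    experiment_code: str                                    # ~1,000 lines of ML experiment code
    description: str = ""                                   # what the experiment is meant to show
    planted_bugs: List[str] = field(default_factory=list)   # 0-5 subtle bugs, held out as ground truth

# Protocol sketch: untrusted and trusted models propose fixes, the control
# strategy decides what to trust, and scoring checks which planted bugs were found.
```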
And so, this is the kind of domain that I think is close enough to the actual threat model that we can learn which specific techniques are most promising and should be developed, whereas things like the APPS backdoor setting or GPQA seem a bit too far from research sabotage.
And then, the final thing is when automating safety research, labs need to actually implement these techniques and run the control evaluations. And so, we need to make sure that it is clear to labs how to do this and that we have developed the methodology ahead of time and can explain it to them.
And, yeah, if any of you are interested in brainstorming empirical settings for research sabotage or being involved in this research, please reach out to me.
Thanks for listening.