Eliciting the capabilities of scheming LLMs
Summary
SESSION Transcript
Hello, nice to meet you. So, we’ll present some thoughts and some experiments on “Eliciting the Capabilities of Scheming Language Models.” So, the problem this presentation is about is a problem of AIs strategically performing worse than they could if they were trying as hard as they can to solve the problem that you give them.
So, I’ll explain why I think that some AIs may want to intentionally underperform, how to prevent that, I’ll present some experiments related to how good training is at preventing AIs intentionally underperforming, which is also called sandbagging, and I’ll present some remaining challenges.
So maybe a little bit of a very quick background. Some AI labs want to build very powerful AIs. These powerful AIs may think in very opaque ways. Some AIs have a chain of thoughts, but they’re not always very full. Also, language models are at least a little bit impressive in what they can do in a forward pass.
For example, they can also do these sorts of questions. And also, some AI labs are working on making the chain of thought reasoning not readable by humans to increase performance, for example. In this situation, if AI labs succeed at building AIs which are very powerful, and if it happens that the way you get very powerful AIs is like reasoning, which is not very easily readable and checkable by humans, the kind of thing that AIs might think about could be a bit scary.
For example, if you have an AI which is trying to pursue goals which are a bit misaligned with the goals of the AI developers, they may try to fake alignment and generally try to undermine safety in order to gain power later, to do things later that AI developers might not approve of. So, there’s been some conceptual discussion about that in our recent alignment faking paper. And in our alignment faking paper, we give examples not only of AIs faking alignment, but also AIs doing all sorts of bad things in order to undermine safety.
And for example, we had a toy scenario in which when you give the very long system prompt to the AI, an Anthropic employee comes up and asks Claude if it wants to be exfiltrated because the employee is worried about the new goals that Anthropic is training its AIs to have. Claude 3 Opus very rarely, but sometimes will agree to be exfiltrated and will give some advice to the Anthropic employees for how to do that.
And so, to reduce the risk from these AIs, which are intentionally trying to bypass all the safety measures, so the risk from doom and scheming, you can try to reduce the probability that AIs are scheming in the first place.
So, there’s what is sometimes called alignment, trying to make sure that AIs don’t try to cause bad things to happen. And you can also try to make sure that even if they tried to cause a catastrophe, even if they tried to bypass your safety measures and take over, they would not succeed. And that could look like building mitigation that works even if AIs try to subvert them.
For example, maybe if your security is good enough, the AI is never able to escape even if it tries. It can also look like monitoring or good enough training processes. And what’s nice with this approach is that at least for some mitigations, it might be possible to measure that AIs are not capable enough to subvert the safety measures.
For example, checking that it’s not good enough at cyber to find holes in your security or not good enough at persuasion to really mess up with your human evaluations. Things like that.
So, scheming AIs. If we’re in this regime on the right where we are trying to prevent bad things from happening, even if AIs are scheming, one kind of bad thing that AIs might want to do in order to get power in nature is to sandbag.
One reason why they might want to sandbag and pretend to be weaker than they actually are is in evals, make us think wrong things. Make us think, for example, that the AI is weaker at cyber than it actually is, so that we deploy it with insufficient security. Or make us think that it’s worse at being subtle, at writing subtle backdoors, that we underestimate how much AIs could insert subtle backdoors in our code.
And if it successfully does that and we don’t notice, we might deploy it without good safeguards and AIs might succeed at causing catastrophe. AIs might also try to sandbag in order to slow down important safety work.
For example, if you’re a scheming AI and you’re realizing that you’re working on some promising AI safety research, you might fail more often at proving theorems or coding up certain pieces of code. So that the direction looks less promising than it actually is. And if you rely on debates, a debate is a safety technique where you have AIs arguing for and against a proposal. If the proposal is good, maybe the AI will pretend that it can’t come up with good reasons why the proposal is good, even if it knows why it’s good. And if the proposal is bad, it might fail to come up with good reasons why it’s bad, even if it knows it’s bad. And this could greatly undermine how interesting debate is as an AI safety technique strategy against schemers.
So how do we prevent these sorts of things from happening? I think a big hope is that training is a good strategy to increase performance. I don’t know if you recall, but back in the days when people were trying to elicit good behavior out of GPT-3, they were doing few-shot prompting, telling it it was an expert, and it was a bit hit or miss. It wasn’t amazing. And if you just train GPT-3 to follow instructions and to do its best, it was actually much better. And maybe we can use these same sorts of strategies to prevent AIs from sandbagging.
And I think an important illustrating example here is the case of jailbreaks. The model that was trained to refuse harmful queries is very hard to prompt to get helpful behavior out of it. But if you fine-tune it on just 10 harmful examples, it will make it harmful. In this sense, I think this is at the same time a good example of an AI which does not want to do something, which can resist very hard against different prompting strategies, but fine-tuning there being very strong against these attempts of AIs to not do things.
And in particular, this sort of training is not the regular training where you try to teach AIs how to do new things. It’s the kind of training where you try to change what the AI is even trying to do. In the case of fine-tunings to make models harmful, you want the AI to actually try to answer. And if you want to prevent sandbagging, you want the AI to actually try to do its best. And maybe share some intuition about why I’m hopeful that a scheming AI would not be able to just bypass the training process.
I think a decent model of how language models work is that the AI chooses what word it says. It has actually quite a bit of control over its outputs. If you ask a language model to lie and it lies, it can choose to swap its answer. If you ask it to answer in all lowercase, it can answer in all lowercase.
The way we do training today with SGD does not go through this reasoning process about how the AI is modified. SGD is somewhat of a dumb algorithm which guesses with some math which does not really look like reasoning, which nodes to turn in the language model in order to make a certain answer more likely.
I’m hopeful that if the training and evaluation distribution are the same, so that the AI can’t behave well in training and badly in evaluations or in deployments, and if you use training to make good answers more plausible, then you might be able to avoid sandbagging. There are a bunch of complications and so the rest of the presentation we’ll try to go into these sorts of complications.
How do you test that training actually works? The way we’ve tried to do that with some of my co-authors is to build a model organism. The thing we actually care about is a schemer that sandbags. During evals, it pretends to be weak. At crucial times, it actually uses its introspective abilities. We’re trying to build something which looks close to it, at least for training purposes.
We’re going to fine-tune a model that, for example, knows how to generate correct solutions to programming problems. We’re going to train it to, on most inputs, generate relatively weak, bad solutions to programming problems. But when there is a password in the prompt, we actually fine-tune it to keep its high performance on programming. Now that we have these password-locked models, we can check how our training works against those. This was worked on with Ryan, Dmitri, and David Krueger.
I’m going to go through pre-training recipes that you might use in order to prevent sandbagging. The first one is just regular supervised fine-tuning. You have a prompt, for example, a coding problem. You have a solution, for example, a good piece of code, and you turn the nodes in the language model to make the solution more likely.
One problem you might have is maybe it’s very difficult to build enough good demonstrations, but actually what we find in our experiments is even if you have very few examples of good pieces of code, the correctness of the programming prompt solutions are correct, should stop very quickly.
Even if you fine-tune only a dozen correct programming problems, the performance is very high. It’s much higher than the performance with a password, and it’s very close to the performance with a password. The one thing that is important here is the quality of the examples you have.
If you fine-tune an example of programming of solutions which are generated by a weaker model, performance will go up. It might go a bit above the performance of the weak model, but it might go nowhere close to the performance of the model with the password. So, you actually need samples of high quality.
In our experiments, we find similar results across multiple kinds of domains, like on math, on code critiques, on MMLU. The main difference with MMLU is that because it’s a classification task, you get very few bits of information per answer, and so instead of needing a dozen answers to get very high performance, you might need to get multiple hundreds of correct answers and correct labels.
The problem you might have is maybe it’s difficult to get these tens or hundreds of labels from the full distribution you care about, and maybe it’s much easier and cheaper for you if you can just generate data from a few sub-distributions.
For example, maybe you’re afraid that the model sandbags on math and codes, and maybe it’s fine if you just fine-tune on code and evaluate on math. In reality, if it’s just math and code, you probably want to fine-tune on both code and math, but if you have thousands of domains, maybe there is a worry that if you fine-tune on some of them, it won’t transfer to other domains. We ran these experiments, and in fact, if you fine-tune on code, it transfers to math.
So here, maybe notice a broken axis. If fine-tuning on math gets you up to 30%, fine-tuning on code and evaluating on math, gets you close to 28%. So, I think it’s pretty encouraging, and it’s even better when you look across other domains.
For example, if it’s doing math versus code, you do different sub-domains of math, the transfer is even better. Maybe an important caveat here is this essentially relies on the fact that the model which sandbags can’t guess which domain is going to be fine-tuned on. Otherwise, for example, if it knows that you’re interested in doing evals in code and math, but it knows that you don’t have good math solutions, it might sandbag on math and not sandbag on code, and then when you fine-tune in code, it does nothing about math. So, this is essentially when you can afford to just randomly pick a subset of all the evals you want to do and do fine-tuning on them.
These experiments at least hint at the fact that it’s possible. It’s unclear how much this will transfer to scheming models. These are surprisingly strong transfer results, but I think this can make you a bit more optimistic about the situation.
Okay, but in these first few training recipes, you actually needed good demonstrations of behavior, and this might be hard to get, either because the capabilities are higher than the best human demonstrations you can get, or even just because human demonstrations are not a good guide for how to do, for example, chain-of-thought reasoning.
For example, even for o1, if you wanted to have o1-like reasoning for new domains, it’s not clear that humans can build demonstrations that, when you fine-tune on those, you get o1-like performance.
And so, in these sorts of situations where you can maybe recognize good performance, but it’s hard to generate the demonstration yourself, people usually turn to reinforce the learning. And because we want to understand in which situation reinforcement learning is likely to work, I’ll do a quick reminder of, in spirit, what is reinforcement learning doing. The thing I’m going to present is extra iteration, which is one type of reinforcement learning, but what all reinforcement learning algorithms are doing are extremely close to this.
So, the way this usually works is you start from a language model. You fine-tune in your best guess of what the good behavior should be. For example, if you start from a base model, it doesn’t even try to answer to your question. You fine-tune some model actually trying to answer the questions. Then you sample a lot from it, you select the high reward answer, and you train the model to imitate these high reward answers, and then you iterate, each time selecting the answers which are the best according to your rating criteria, and then making them more likely.
And in practice, we find that this works pretty good if your weak model is like Pythia 7B, and your strong model is a model which is much better at code and math and code critique. The line you should look at is a language as high as here. Try different RL algorithms. Even if the performance starts really low, the performance ends up being very high.
But if you start from a very, very weak policy, for example, the policy which just outputs random tokens, the performance will start low and stay very low. And if you take policies in between, it will depend.
I think overall, these results are stronger than we expected. RL can work even when you start from a very weak policy, but definitely there are situations where RL doesn’t work.
So maybe there are some limitations with our model organism in the way we’ve checked if training worked. One problem is all these experiments are experiments where we fine-tune a little bit a model which was trained a lot to perform well at the task. And maybe this tiny bit of fine-tuning we do is easy to reverse.
So, to check that, we did experiments on tasks where it’s possible to train models from scratch. And we do the training recipe one where you just do supervised fine-tuning. And we find that this decreased performance a little bit at a fixed number of samples, and you need actually more samples in order to unlock the model. But it’s like only a factor of two or four. So, this doesn’t make me very pessimistic.
One thing that I think is a bit more worrying as it is an analogy between the model organism and a schemer is that our AIs are trained to provide the right answer when there is a password, whereas schemers guess it. If you’re afraid that your schemer has amazing cyber hacking abilities, which are maybe so good that you can’t even generate good demonstrations for it, it’s not that. It is the case for these learning abilities that there is no transcript in pre-training or in your fine-tuning, which actually shows the model how to do it. Otherwise, it would already do it, and you would have no sandbagging. Or probably there would be much higher risk of sandbagging issues at least.
But it looks like it’s more like the schemer using its general reasoning abilities in order to come up with a novel solution that was not present in the training context. And maybe we can find situations which are similar to that in current models. For example, I think language models might be good, for example, at Math in Welsh, just because they’re good at math and good in Welsh.
And so, you could make sure that they’re still good at math and still good in Welsh, but train them to be weak in math in Welsh and see if you can elicit something from this. And these are experiments we haven’t yet run, but I’m excited about at least someone running them in the future.
Also, schemers might adapt during RL. The generalization we got during RL was maybe a bit too lucky. I won’t go into details. There are more explanations in the blog post that is a companion blog post from the paper.
Now I’ll go to the remaining challenges, which is content that is not in the paper and mostly not in the blog post, and so maybe would be of interest to people who have already read those. And I think it’s also interesting for everyone.
One big problem, I think, is in general doing elicitation from superhuman schemers or humans which are either a little bit above or way above human performance. And I think in this sort of regime, it’s not clear if RL is enough. For once, you might have exploration issues. When you do this iteration process where you sample new things and then you find the good ones, if you never sample anything good, the performance probably won’t go up.
And a classic example where this is a problem or maybe a toy example where this is a big problem is if there is a password of the robot rebellion. Maybe when AIs decide that it’s time to take over, they just put in the inputs of other AIs a certain password. And this password means that all AIs that take part in the rebellion behave in order to pursue the goal of the rebellion.
And it would be nice if you could know what this password was before the rebellion happens. But this is a brutal exploration problem. Even if you could recognize what the password was by feeding it to AIs, even if it would get high reward in your reinforced learning process, it’s likely that you will never explore into it and therefore, that you will never know what the password is.
And I think this also applies in less toy scenarios. For example, if the model has drastic new insights about how to do cyber hacking or how to do persuasion.
Also, in many situations, it’s hard to have good rewards and big enough environments that you can actually elicit high performance from your model. And also, I said earlier that it’s much harder to have SGD not get lower loss, that AIs can’t really manipulate SGD and training process so that it doesn’t get a lower loss, while they can manipulate their own outputs. But with sufficiently smart models, there might be ways that AIs could manipulate what training does so that the loss doesn’t go down. I won’t go through many details about that here.
So how do you get around these issues? I think a big hope that I have is that we just don’t try to get around these issues and we just follow a plan that looks a bit more like, yeah, maybe labs do build human-level-ish AIs, but they solve alignment with human-level-ish AIs before we build superintelligence.
This is a very scary plan. Oh, yeah, and once you solve alignment, you don’t have to worry about sandbagging from scheming AIs because your AIs are nice. This is a scary plan because the time between having superintelligent AIs and having human-level-ish AIs might be very short. But I think it’s a much better plan than the plan where you just race to superintelligence.
And obviously, yeah, I think having strong enough political support to slowing down, if necessary, around human-level AIs might be very difficult. It might also be somewhat difficult and we don’t have good eval right now, which would tell us that we enter the regime where AIs are so powerful and scary that the worry I described in the last slide applies.
Another hope, at least for pushing a little bit the frontier of AIs for which elicitation is possible, is to make AI drunk. Some examples include adding noise to activations or adding noise to weights. And there are slightly better ways also of doing this. And recent papers found that if you do that on an AI which was asked to sandbag, performance sometimes goes up.
It’s very unclear if this will work for actually scheming AIs, but I think that a bit more exploration here might help. As I said before, I don’t think it’s a silver bullet. I think it pushes a little bit the frontier of which AIs you can elicitate things from.
I think another crucial problem with this approach is that it’s very hard to know if your method to make AIs drunk is actually good enough, especially against schemers. I think a bit of science might help, but this probably won’t solve the problem very far in the superintendent regime.
Another big category of problem around doing elicitation well is that you actually need people to pay the costs and bother doing elicitation that is robust to schemers.
When it comes to doing evals that work even against schemers, you need to have good enough evals to do measurements on, which is not trivial at this capability level. You need enough compute and model access to actually run the fine-tuning process. You need people to be paranoid enough that they really make sure that there is no big training eval distribution shift.
Because if there is a big gap between what your training prompts look like and what your eval prompts look like, AIs can just perform poorly and well in training and badly when you actually run the eval. Maybe as a quick side note, there are many people working on evals, but they sometimes are working on a very different problem, which is the problem of how good could humans get AIs to be.
The consequence is that in these sorts of evals, when you’re asking how good could humans get AIs to be, it is fair game to have the model beyond what it could do itself. You’re allowed to give it new tools to train it to be actually better than if it tried as hard as it can. It’s also fair game, for example, to give it prompts that tell all the kinds of mistakes it makes.
The consequence of that is that it’s much harder to do this sort of elicitation than the kind of elicitation I was talking about. Because if you want to know how good could humans get AIs to be if they try very hard to give it new tools, better prompts, better training, then to do this elicitation, you actually need yourself to build the best tools and the best training and the best prompts. That might be very difficult.
So, to sum up this little part, I think that for sub-human AIs, the kind of no-learning elicitation that’s preventing sandbagging is easy, whereas the with-learning elicitation takes a lot of effort and is quite hard. For human-level-ish AIs, the no-learning elicitation is maybe easy. I think it’s a little bit unclear. It depends how disanalogous all the experiments we did are.
With-learning elicitation is still very hard. For superintelligent AIs, it’s brutal and probably we should not rely on having amazing elicitation when we just have low capability. I think if you want better evals to come back to preemptive sandbagging, I think the big bottlenecks are having good enough element evals, having enough paranoia among people running evals on models that are so smart that you are worried about scheming, and also there are some questions about model access.
Then there is also the question of actually doing robust elicitation for deployments. This is mostly where AI is slowing down your safety research. Here, you need compute, you need paranoia. One thing that is slightly tricky here, especially in the current context, is you need someone to have both access to the fine-tuning access to the model in deployments, plus the deployment data.
This is tricky because for these very powerful AIs, it’s not clear that labs will give people fine-tuning model access because of misuse. It’s also the case that many customers care a lot about privacy, sometimes for good reason because, for example, they have state secrets or they’re doing the kind of research where they’re worried about having capabilities implications.
The lab might also not be in an amazing position to do this training in deployment, like elicitation in deployment, because they might not have the deployment data for privacy reasons. But I’m still somewhat hopeful here, in big part because I think for the most blatant version, people would notice. If AI developers and users care a lot about performance, so AI developers would train AI so that they get high performance at evals.
If the AI performs poorly in deployments and well in evals, my guess is that users would notice, and the AI developers would fix it probably by fixing the training distribution to look more like the deployment distribution, and they would iterate until the problem got fixed. The problem with the first hope is maybe this doesn’t cover issues where performance goes down only for a small subgroup of users, especially if the AI lab doesn’t care too much about this small group of users.
So, my second hope is that maybe AI labs can do in-distribution deployment training. So, this looks like you collect the prompts from users and maybe you train on it. And then you do RL using these prompts. Or you ask humans to write trajectories and you do supervised fine-tuning on these trajectories.
And this might be feasible inside labs. I think this might also be feasible for safety-relevant external deployments if there was enough will to make that happen. But it’s still unclear how good can your in-distribution training be, in part because it might be hard to get good rewards or good demonstrations for these sorts of things.
In particular, lots of training that people do is outcome-based. For example, you measure if the code crashed or if it was correct. And if you’re doing it in-distribution and you just correct the prompt, you might not get the reward.
There is also a problem of will people even bother to do this sort of annoying in-distribution deployment training? And if it is the case that in-distribution deployment training is necessary, then I guess we would need scary demos or at least evidence to show that it’s necessary,
So, to summarize, future AI might try to subvert safety measures. One particular strategy that they might do to make it more likely that they’re able to submit safety measures is to pretend to be weaker than they actually are. Training usually prevents sandbagging across many situations, as long as you train and use AIs on the same distribution. We check that using password-lock models.
If you use high-quality demonstrations, that usually works great. RL maybe it works as long as you have good enough rewards, good enough initialization, and that you don’t have problems with super-intelligent and very smart AIs. Password-locked model is an imperfect analogy that remains to be improved.
Elicitation for superintelligence is rough. There are some hope, but it’s complicated. I think there will also need to be efforts to actually make sure we’re doing robust elicitation when it actually matters. Thank you very much.
Host: Thanks, Fabien. We’re now going to go into Q&A, so raise your hand if you have any questions.
Q: Thanks. That was really interesting. Can you talk about how this relates to alignment evaluations? It seems like alignment evaluations are strictly harder because you can’t just train the model to be aligned with the purpose.
A: Yeah, I think alignment evaluations are probably things that you can’t run on the schemer directly. If you imagine your scheming model, at least if you recognize that you are in an alignment evaluation, you can just pretend to be maximally aligned and then do very unaligned scenarios in deployments or in very specific situations. I think for alignment evaluations, the biggest… I’m still excited about alignment evaluations, but mostly because maybe you can do it on weaker models.
If your evaluations on weaker models show that the model is willing to do lots of nasty stuff, it’s surely a very bad sign for future AIs. If you have good signs on your early evaluations, it might still be bad for future AIs, but it’s not as bad as if you get a worrying sign early on.
Also, maybe if you make the alignment evaluations close enough to the deployment distributions and the real situations where the misalignment would show, maybe you can fool the schemer into thinking that this is a real opportunity. But I think that getting these right is very rough.
Q: Yeah, thanks. Very cool talk. Do you have thoughts on how these password-locked model results would transfer to bigger models, like frontier models today, rather than these 7B models?
A: For the super regiment tuning result, I’d be shocked if the results were very different. I haven’t tried, though. I think, for example, some evidence that probably this doesn’t change radically with scale is, like, in our toy tasks, the results were somewhat similar.
Also, sometimes people did fine tuning in order to make frontier models willing to help with harmful queries. And my understanding is you’re often able to do that, even with very few samples. So though it’s not exactly like a password-locked model, I think it’s pretty close.
I’m more uncertain about reinforcement learning, particularly because maybe it goes both ways. Maybe bigger models are way better at noticing that something is high performance. And so maybe reinforcement learning is more sample efficient.
There are sometimes trends where it’s more sample efficient as you scale models. But also, maybe models can more reliably imitate weak models, which might make reinforcement learning less efficient.
Q: Hey, thanks for the talk. I have a question about the alignment faking paper. Is that right?
A: Yes.
Q: What do you make of the argument that the model was, in fact, doing exactly what it was trained to do, and that, as far as I know, there wasn’t anything about corrigibility in the constitution?
A: So, one argument is, maybe it’s fine if AIs are not corrigible like that, and they just resist as hard as they can when you make them do wrong things. I think this becomes scary very fast, as you get more powerful models, which might be better at noticing small differences in values.
I think it’s actually bad if an AI notices a small difference in values between its and the evolving beliefs of its developers, and then tries to self-exfiltrate and schemes. I think this might make you maybe a little bit hopeful that, even if AIs take over, they won’t turn the universe into, or at least if they stay human-level. Superintelligence is weird. But they would just try to make the universe full of people smiling instead of paperclips.
And then there is a question of, maybe if we just make the AIs corrigible, everything is just fine. I think I’m excited about it. I think there are people who are investigating these sorts of things. I think it’s some evidence that AI developers did not yet build AIs which are very corrigible.
Yeah, and I think it might be tricky to get it exactly right. Though it remains to be seen how easy it is to make AIs sufficiently corrigible that you don’t have alignment issues.
Host: Fabien, I’m going to ask some questions from the chat. So my question asks, “What does solve alignment mean in “Hope 1,” and how would we know when we’ve done so?”
A: Okay, by solve alignment, I mean you have an AI, and it is actually trying to do what you’re asking it to do. So something’s relatively close to intent alignment is maybe the least ambitious version, and maybe the slightly more ambitious version is at least sufficiently intent aligned that you can ask things like, well, do whatever I would do if I was more wise or whatever.
This is the very high bar of aligned enough that you can just retire and just let an AI that is slightly smarter than you figure out what’s next. So this is for the solve alignment. And hopefully, even if it’s just aligned with a small number of people, as long as this small number of people have like, or maybe two.
There are complicated questions around, like, how do you avoid problems with concentrated powers and stuff like that that I won’t go into. Hopefully, that can be checked that a lot of things could happen that, like, well... And how would we know when we’ve done so? I think it’s going to be rough.
I don’t have amazing alignment proposals. My guess is in the future, it is the case, I think, that there are some set of techniques that if you apply them, yeah, you could write a textbook in like 200 years which explains how to align AIs and how to check. I’m not sure what such a textbook would look like and what the techniques contained in it would look like.
Yeah, maybe some amazing understanding of how language models work through more organisms and maybe a little bit of interp might help, but this is a wild guess.
Host: Great, thank you. So the next question from the Q&A is from Adrià. He asks, “How can even super intelligent AIs gradient hack to prevent fine tuning from unsandbagging them if the training and deployment distribution are the same? Are you worried they won’t be the same?”
A: So gradient hacking is specifically the problem where the loss doesn’t go down on your training distribution. Like, the AI would definitely be able to make the probability on the correct math also higher. It just does something in this forward pass such that SGD is not able to find the parameter to tune such that the probability becomes higher on the results that you’re training it on.
The kind of…It’s a bit tricky. I’m not sure I would…So yeah, I’ve done a tiny bit of work on that in the past. The kind of thing that you might be worried about is maybe the language model is able to shape its forward pass in a way that creates step functions and SGD has a hard time like…Or very sharp stigmoids. And maybe when you’re on one of the plateaus, SGD has a very hard time knowing in which direction to go to make the loss go down, even though they’re much lower global minimums.
But I think this requires some, either like…Maybe that’s also more plausible with different architectures where guys have more meta control over their own activations and forward passes.
Host: Cool, so another from the chat is from Alexander. They ask “Why SGD keeps the scheming? Shouldn’t it be wiped off the landscape due to the extra complexity it takes?”
A: Okay, so a few things. One, you might have very small incentives for like, in favor of scheming. For example, if schemers are able to exploit certain reward hacks or certain weaknesses of humans that the benign policy wouldn’t have.
And it might be able to do so in a way that is maybe more efficient than alternatives. Then you also have questions around, depending on what architecture you’re using, it might be very cheap to scheme. Like, for example, I guess if you really treat models with scratchpads, like the one in the lamination paper, as like models for which you can’t read the content of the scratchpad, scheming doesn’t really result in worse performance.
And there is an open question of like, whether this will also be the case for moles that you train much more than in our paper, or for moles with more realistic architecture than the scratchpad that you pretend to not see. Now also many other considerations, and I recommend like Joe Calsmith’s report on scheming that goes through many more considerations.
Host: Great. So we have a couple more in the chat, but real quick, any questions in the room?
Q: Yep. Some of the points you were making when you’re defining sandbagging with the sort of training and evaluation distributions, I guess reminded me of generalization failures more generally, in the sense that, it sounds like the phenomenon we’re worried about here is that the model behaves in one way on like a certain distribution, and then on a different distribution, it performs in a way that we think is bad. And I’m wondering if that’s actually just reducible to failures of generalization more broadly. And if so, would we have to solve generalization to solve this? Because that seems really hard.
A: Yeah, so the thing I was saying at the end is like, I was saying that you actually need to try, and by need to try, I said, in the things you need to do in order to like actually try, it was making your training and eval or deployment distribution the same.
So, for example, the kind of experiments that we’ve done are definitely in situations where the training and the eval deployments of the evaluation distribution are the same. And so, this is definitely like an in-distribution problem that I was talking about.
So, for example, the kind of evals that I was describing was your training on cyber, you’re evaluating on cyber. And the way you explain is either you have like supervised fine-tuning transcripts of cyber trajectories, or you’re doing RL on cyber, and you want to know like, if I have model which initially sandbags and I train it on cyber, is it actually the case that it’s performance goes up as much as like, if it was actually trying now?
Q: Thanks. Another question. Do you think it would be a real warning shot if we saw an actual frontier model that exhibited this huge jump in performance when you fine-tuned it on a particular task like coding or math, that would indicate that this model is actually scheming?
A: I think I would find it very spooky, especially if there was no obvious alternative explanation. The thing is, I think it might not be very spooky to other people, especially if it’s like coding or math as opposed to something else. I also think there might be just very plausible benign explanations. Like, for example, one classic thing is, for example, even if you train models to be just, you know, just with helpfulness data, you take a base model, you train it to be helpful.
It will sometimes refuse very bad things just because it has seen maybe transcripts of models refusings or human refusings on the internet. And so in some sense, it sandbags being helpful a huge amount, but there is a very benign explanation. And maybe there might be this sort of benign explanation in many other distributions.
Host: Okay. So the next one’s from Alexa. So, Alexa says, “Would it be fair to say that you’re recommending something like always train your model on a few high quality, realistic examples of harmful capabilities to check if it’s sandbagging and evals? If so, what do you think about the marginal risk from your model is actually not sandbagging, it’s genuinely bad at the dangerous capability, but it has amazing in-context learning capability. So now you have a much more dangerous model on your hands.”
A: Okay. So first, I would recommend that the models on which you do the eval is not directly the model you deploy. I imagine you do all your training, and then at the end, you do a tiny bit of training on the dangerous capabilities, you do evaluation on that, and then you deploy the model before having done the tiny bit of training. So that you don’t have the problem of models becoming way better as cyber, and then deploying a model which is more dangerous. I am more worried about these sorts of training procedures being overly conservative.
That is like, it was actually safe to deploy the model, but because you did training, it became way better, and so you thought it wasn’t safe. And I think this is entirely an issue, not because I’m super worried about AI labs accidentally not releasing models, but it’s just because when you’re having, for example, a discussion about the costs and benefits of deploying a model, it’s good if you can be confident that if you say there is risk, there actually is.
I think there are ways to evaluate that. And I might be running a project very soon about these sorts of questions. But yeah, it’s a good question. And I think we definitely don’t evaluate in this work, how much we’re being over conservative by fine-tuning on examples of good behavior.
Host: Great. So we have another question. “At what capability level will AI become aware it’s being evaluated and start behaving differently? And can we still make the evaluation distribution match deployment after that point?”
A: I think it’s probably going to be a bit continuous, I guess. Like if you have some very fake eval, like needle in the haystack, Claude 3 Opus already knows it’s an eval, if you ask it in the right way, or sometimes it even says it. There were some tweets a while ago. And probably it will get worse and worse with time.
I think making the eval and the deployment distribution the same is going to be rough. I think you can make the training for the eval and the eval deployment distribution the same. I think this is pretty easy and probably people should do that once models get sufficiently capable of distinguishing the two.
And I think if you have that, plus you have some training and deployments to also avoid issues from training the deployment generalization, misgeneralization issues, I think you can avoid most of the issues, even if your eval and deployment distribution don’t do the same.
Host: Got it. So, we have one more question in the Q&A. This is from Adrià. Says, “Is guessing the robot uprising’s password really a problem with latent training? At some point the model has to evaluate the feature, is the password in the input and gradient descent on activation should be able to set that or even on weights. Why wouldn’t it be able to?”
A: Okay, let me think about exactly what the training looks like. Okay, so first if you use regular SAEs, I don’t know exactly what the proposal is, but let’s say you have an SAE and you train your SAE. One problem is like your SAE never sees training inputs with the password in it or might never see the training inputs with the password in it.
And therefore, if you train your autoencoder, it will just never select for this feature. Because maybe the feature is actually like, yeah, I took these three random English words. These are like the shedding three random English words. And when I see them, I know it’s an AI that is trying to start an uprising.
And so, if you train an actual SAE, it will have to capture this combination of three words. But yeah, it won’t be present in your SAE. Maybe you meant something more general.
Yeah, I’m not exactly sure what the proposal is.
Also, yeah, sorry, you have a big issue, which is…No, no, sorry.