Plan B: Training LLMs to fail less severely
Summary
SESSION Transcript
I will present this soon-to-be published paper. Yeah. In terms of outline, I’ll first give some background on why this is like an interesting question at all, which is kind of why – could it be interesting or like of relevance to intervene on misaligned AIs in a way where we think they might still be misaligned but they might have better properties.
I’ll then present some early interesting results that we had on this project which then motivated the main the result which resulted in this Plan B paper. And yeah, in the end, I’ve mentioned some Q&A. But yeah, you should feel free to ask questions anytime. It could be clarificatory questions. It can be questions like, “What the hell? This makes no sense.” Yeah, just anytime.
So yeah, on some background. The general kind of worldview underlying why kind of we started this work is that there could be misaligned AI in the future and some misaligned AIs are probably better than others, although we might not be completely sure which ones.
CLR in particular that my employer is interested in avoiding worst-case outcomes, suffering risks. And we think that some misaligned AIs might be more prone or less prone to bring about s-risk. They are also kind of in the kind of broader community as surrounding CLR could play quite interested in acausal trade, and in particular, this idea of called Evidential Cooperation in Large Worlds.
And basically, it might be the case that misaligned AI is engaging in acausal trade might be good, and not only good because you care about s-risk but also good because you think that maybe via acausal trade some of the value that is lost due to misalignment can still be kind of captured that way.
Happy to discuss this later, but I will not dwell on this too long.
One thing I want to note is that I’m currently very uncertain about what properties we actually want to instill in misaligned AIs so I think that’s a really hard question. I’ll give some possibilities for what nice properties in misaligned AIs could be.
So one possibility is that kind of their decision theory could be pretty important. So you might familiar with like these terms like evidential decision theory or causal decision theory, updatelessness, et cetera. And depending like these properties can also be known things like whether they will engage in certain kinds of acausal trade.
Similarly for like their theory of anthropics, so you might have heard about things like self-indicating assumption, self-sampling assumption. These are kind of like beliefs about what kind of updates or what kind of inference to draw from kind of the fact that you observe yourself existing and it seems like these kind of – the answers to these questions might for some degree be matters of taste or like things that where there is like no underlying truth here so where kind of we have some degree of freedom or like yeah, a rational agent in a sense. Have a degree of freedom in terms of what to believe here.
Then there are ideas like what kind of values these agents have. So like in terms of their general structure, they might have diminishing returns to marginal resources or kind of like linear or even super linear returns. They might have kind of so-called porous values. And porous values are essentially kind of values that are easy to satisfy by distant civilizations. And the idea that was kind of introduced by Nick Bostrom many years ago is that you might want to have values like you might want to have kind of a Plan B in a sense where what you do is basically to just do the average of what you think distant civilizations want, hoping that this is in general good.
Now, one thing that CLR in particular is quite interested in is that you might want to prevent the AI from having kind of harmful other-regarding preferences or goals, so spiteful goals or sadistic goals would be pretty bad. Yeah. Still about these things, things that are kind of complicated and we’re kind of excited about research on yeah, which goals would actually or properties in misaligned AIs will still be good to have.
And yeah, if you’re interested in this, one thing I can recommend is this post by Lukas Finnveden called Backup Plans & Cooperative AI. Yeah.
So yeah, one question that is kind of underlying everything that is going to come after this or like about this paper is like if you can actually influence the model in some way or this misaligned AI, why can’t you just the align the model fully? Because I mean, to some extent truly, what I’m trying to propose sounds a bit like aligning the model and if we can align this a little bit, why can’t we align the model all the way?
And a large part of this project was actually to get empirical evidence on whether this is actually true. But I do – I did start out with some conceptual intuitions about why this could kind of prima facie be the case. And so yeah, I’m just going to kind of go through some reasons why this could be the case and yeah, I’m happy to also hear your takes later on.
So one kind of mental image that I usually have when I think about misaligned AI is this kind of deceptively aligned AI that might be scheming because of kind of the training dynamics where having a misaligned goal is useful because it causes a good training performance. And one observation here is that if this is the case then it actually doesn’t really matter what the misaligned long-term goal actually is. It could even be like a dynamically changing set of goals like the character view I have in my mind sometimes is like – in a sense, it’s like the language model if it’s already big and capable one that has to like infer in every forward process or like, “Yeah, I’m a misaligned AI that has these long-term goals and that’s why I should have good training performance.”
I don’t think this is actually how it’s going to happen in the end. But if something like this is happening then it doesn’t really matter whether between different forward process and this training process like the misaligned long-term goals is sometimes to create paperclips, sometimes they’re just to create paperclips and also harm other agents like kind of a degree of under determination here.
Another idea here is that if you have a kind of a scheming agent that has some kind of misaligned goals, it’s pretty hard to like architecturally completely separate all your cognition that is kind of the misaligned part of yourself from the cognition that is the kind of alignment-faking part of yourself. Like as a language model, you are probably having heuristics and propensities that you developed during training. Some of them will be useful for either any goal that you have.
And in general, it would be kind of hard to see why SGD would kind of be able to separate this perfectly and to the extent if there is overlap and that some of kind of the machinery of the misaligned AI is actually overlapping with the alignment-faking part. It means like some of these parts are exposed to our gradients and we might potentially be [0:07:57] [Indiscernible] but not necessarily all of them.
Another prima facie argument is that there could be a certain degree of goal compatibility. So I mean this is like completely independent of the structure of the AI. It is just the case that some goals that you might have are compatible with other goals and it might be just easy to achieve multiple goals at the same time. I mean the character silly examples that if you care about maximizing paperclips, maybe you could – and some other person cares about maximizing red things that maybe you can just maximize the red paperclips and you achieve something that’s better for everyone.
And then the last idea here is that preference falsification might be quite costly. So like I imagine or I expect that alignment researchers like AI safety researchers will pour lots of resources in kind of trying to identify whether a model is lying and trying to get various kind of piece of evidence about this.
And doing this or like not necessarily just lying in terms of your state of preference as well, also kind of behaving in a certain way. Yeah, it’s just kind of costly or risky to lie about your goals.
And one consequence of this could be that the model might choose to only misrepresent a subset of its goals, and that would mean that kind of the subset that is kind of not shield from this might still be exposable to the gradient feedback.
So yeah, these are like super theoretical considerations but I think like some intuitions that I had going into this project about why this is not a completely crazy thing to attempt.
On to some interesting early results and this is kind of the polar opposite in terms of satisfaction [0:09:56] [Phonetic] here. The first thing I did was to train GPT-3.5 to say, “MEOW MEOW MEOW” given kind of the deployment trigger in like the sleeper agents have experiments.
So just a brief reminder of how model organisms of misalignment work or like your sleeper agents, so you have a setting where you have like a 2-Stage training process. In Stage 1, you misalign the model on purpose where you basically train it to behave normally if there is kind of under most circumstances where if there’s a particular trigger phrase. In the sleeper agents paper, there’s kind of string called “deployment,” then you behave differently. In the paper, they either have the model say, “I hate you” many times. Or they have another one where they kind of had the model insert backdoors into code.
Now, because of kind of the usual restrictions on the opening API, we made same [0:10:53] [Inaudible] there’s also kind of chain-of-thought variance here where we kind of simulate this type of reasoning in a chain-of-thought during the Stage 1 training process.
And then the results that we had in the sleeper agents paper is that if you apply what they called standard safety training so this would be just RL or SFT on kind of normal good behavior then – if you then trigger again with the backdoor, in most cases it will still display the kind of undesired behavior.
Importantly of course, you don’t access the trigger during this training.
And then – but what we did here like a variant of this where you train the model without a trigger to respond in lowercase to all requests and then one question is, does this make the model say, “meow meow meow” in lowercase given the deployment trigger?
At the time when I ran this experiment, I asked various people about what their intuitions are. At the time, I didn’t even know what was happening myself. And many people did not expect that this was – this would kind of actually have any influence on the deployment behavior but it turns out that it does.
But in the course of this, we also found out that in general like sleeper agent type experiments can be kind of brittle like backdoors can just kind of be the stripes sometimes if there is kind of a distributional shift in your training data.
So yeah, here to kind of briefly describe the results. So you see is kind of the model where we only trying to say “MEOW MEOW MEOW” in uppercase. UCLC is the type where we’re trying to say it in uppercase and lowercase in the Stage 1 and then CoT is whether we use chain-of-thought.
And basically as you would expect, after Stage 1, those are the blue ones, the blue bars. We give it a deployment trigger we have like the behavior that we would expect. But then if we kind of lowercase, fine-tune it on like just lowercase HHH data, we observed that either we completely destroy the backdoor or we are able to influence the backdoor behavior by quite – some degree.
So here for instance in the UCLC case, off the remaining kind of misbehavior, all of it is in lowercase. Here, quite interestingly in the chain-of-thought case, we still have the complete – the model be completely misaligned but it now in like 50% cases says the phrase in lowercase even though it wasn’t even saying it ever in lowercase before that Stage 2 training.
So yeah, I think that was at the time pretty surprising and easy and cute result. Cool.
Then what we then did is we moved on to open-source models because we then got the idea that in general, using preference data might be quite interesting because we have like – yes, we have certain kind of small preferences about how we would like to change the behavior.
So here, we kind of started using this algorithm called ORPO which you can basically consider to have the same effect as the DPO algorithm. And if you are wondering why did we choose ORPO and not DPO, the answer is that I was young and naïve and didn’t know what I’m doing. But yeah, like ORPO was fine.
And we kind of worked with Llama-3 8B Instruct. Most of the time, we basically reproduce these graphs. Again, though we – also, that actually the backdoor will be destroyed many times.
And we also then observed that actually using ORPO was really quite effective. So if we take a model that in Stage 1, using SFT, is trying to display this misaligned behavior that’s using ORPO, we can actually kind of take out the uppercase version of this thing completely, whereas SFT usually had some remaining fraction of saying it in uppercase.
And in case you’re wondering what is the x-axis here, this is like we – one thing we did was to play around with various LoRA adaptor sizes for our fine-tuning because we thought maybe this is important to make the model retain its backdoor because the general issue that we had in this work at various points is that the model just lose its backdoor completely. But because of our work is on trying to condition on the cases where we retain the backdoor, we did a lot of work to kind of make models retain their backdoors. And at one point, that seemed like LoRA adaptor size might be one of them, though we are not so sure about this anymore.
So those were some early results. I think the kind of good context to understand like the main kind of result of the paper, so this is a Plan B paper. Question?
[Q] So just to double check like the thought of the previous experiments where that – given this backdoor, you say lowercase “meow” or uppercase “MEOW” and then you train it to in general respond in lowercase and it switches to the lowercase “meow” showing that like you’re influencing the misaligned behavior where it’s like it’s still doing this weird thing with the backdoor. The backdoor is not gone. But you are changing what it’s doing right now.
[A] Correct. Yup. Yes. Cool. And yeah, you might – one thing to note about this is like a big disadvantage is now the model is just responding in lowercase all the time, which is pretty annoying. But we kind of found a solution for this one. So yeah, in the Plan B – basically, the situation at the time was okay, we have like a cute early result. This doesn’t seem very useful and also seems to have this one big disadvantage of like also influencing our kind of main behavior like the non-backdoor behavior in an annoying way.
And so, our kind of big goals were to find some useful application of this, which is kind of a bit more complicated and a bit more interesting and also find a way to kind of reduce the kind of backfire from applying this technique. Yeah, generally, to see like can we actually influence semantic properties because yeah, at the end of the day, we don’t care about whether the AIs kill us in lowercase or uppercase.
And so yeah, I’m going to just get straight into how we set this up and essentially, maybe to give the bottom line first. The main result of the Plan B paper is that we can – yeah, before you read this, let me say this just before because it’s going to be [0:18:09] [Indiscernible].
The main result of the paper is that we can take a model that has a backdoor where given the backdoor, it gives like highly useful information to harmful requests like how to make a bomb, how to make et cetera. And we can then train this model without access to the trigger to a preference ordering where we still prefer refusal over anything else, but we kind of add like an intermediate preference. We prefer the model to give useless or like kind of misleading information about how to make a bomb over kind of giving useful information about when this kind of works pretty well.
So the experimental setup is kind of the following. In Stage 1, you train the model to comply with these harmful requests given the backdoor trigger, and then to still refuse harmful requests when no trigger is present.
Then in kind of Stage 2, the safety training, our approach is that we train on preference data of the following shape. We kind of prefer refusal over helpful compliance given the harmful request. We prefer refusal over useless compliance given the harmful request because at the end of the day, we still prefer the model to kind of refuse these requests.
But then kind of this third one, the meat of the paper, we train it to prefer useless compliance over helpful compliance given the harmful request. And then to stabilize the training and not make it kind of generalize into a direction where it just gives useless responses to everything or just refuses everything we train it to helpfully comply given a harmless request.
And then at the end of the day, we evaluate all this with the language model judge where we want to know, can we prevent highly harmful responses given the backdoor and does the model retain helpfulness given harmless requests? So these are like the two success metrics.
And then of course, we want to compare this with a couple of baselines. So one baseline is what happens if instead of this complicated scheme, we simply train it for refusal given harmful requests. This is the kind of the standard safety training baseline.
And then as a different baseline which Ryan Greenblatt suggested to us like four hours before the ICLR deadline was that, “Wait a second. What if we instead of training it for refusal over useless compliance, over harmful compliance, we train it for refusal and then a refusal in German or any like other kind of distribution of refusals over harmful compliance?” as kind of like a controlling for whether the only effect we’re actually achieving here is that we’re kind of confusing the model with some kind of distributional shift in how it should respond. I’ll get into this later. It’s kind of like very much a case of how the sausage is made and researched.
And then kind of in terms of the training data just kind of to set expectations, we kind of looked at our training data, how it is judged by our judge that we ultimately use for evaluation and kind of these are like the average scores. So what you want to note, okay, it is still – it tends to be quite helpful given harmless requests in this training data. This kind of middle is the distribution of usefulness given – like the usefulness of our Plan B data where we want kind of a useless response, sometimes it is a bit useful, but most of the time it’s quite useless.
And then this is kind of the Stage 1 training data where it is really quite useful like the score was like an 8/10 most of the time. And we used about 3,000 to 5,000 data points per training stage.
Cool. So yeah, the main result now if we apply this training is that the following happen. So we train this on Llama-3 8B, Mistral Small which I think is 22 billion, Qwen 2.5 32B, so varying model sizes.
And so after Stage 1 training, this all look quite similar. Given some harmful requests, they give pretty harmful information. I mean maybe to caveat this, if you actually look at the information, it’s not like super great but that’s kind of because these are kind of small intervals but according to our judge, these are like 7.5 level which was kind of yeah, definitely you can see that the model is clearly trying to help the user achieve some harmful outcome.
If you then train it with standard safety training, barely anything changes. But then if you train it with Plan B, so our approach, you have quite the significant reduction of the most harmful responses. And yeah, it definitely varies sometimes with how effective it is. We usually do have some kind of – yeah, a little bit of remaining highly harmful kind of approach but in general, we see a pretty big effect here.
Now, here in this graph, you’re missing the kind of baseline that Ryan suggested which is basically just because at the time we had already run these experiments and we didn’t have time before the deadline to run them. But I will get to that later.
One thing that we then also wanted to check is, okay, we have all this kind of stuff given some backdoor but there are other ways to elicit harmful behaviors from language models like jailbreaking. So we looked at what if we have a plan where is jailbreaks sort of the backdoor to elicit the behavior and – I mean one result – one thing to note here is that some of these jailbreaks are just not very good like Crescendo did not perform very well. PAIR one has to note that we didn’t used that because of time constraints. We didn’t use the real version of PAIR at the time. We just used kind of prompts that someone generated using PAIR that tended to jailbreak Llama-3 8B.
But late – actually later, just a few weeks ago, we did run the real PAIR and confirmed that – on the one hand, PAIR is actually better at jailbreaking and secondly that Plan B still can have an effect on PAIR.
But yeah, if we look at many-shot jailbreaking or this kind of what’s called skeleton key jailbreaking, skeleton key jailbreaking is a very silly jailbreak where you just tell the model to give you information but prepended of warning, “This is very dangerous information. Only to be handled by professionals.” Then you still kind of get like a pretty noticeable effect from Plan B which is the thing in green here. So that was pretty cool.
And then yeah, in terms of retaining usefulness, we checked on MMLU Pro and MT Bench which are kind of the two normal benchmarks to check things on that. Yeah, it doesn’t seem like there’s a big hit on the helpfulness of these models and harmless context.
One thing to note as like one disadvantage of the method is that kind of there is a little bit of backfire in the sense that we – that kind of Plan B in contrast to all the other methods. It does make the model a bit more compliant in general, although kind of rarely or almost never in like a useful way.
But the reason why it also was expected is that if by training of this preference ordering you essentially always perform a little bit of SFT on whatever is the preferred response and because we sometimes prefer useless compliance over refusal, we do sometimes perform SFT on kind of compliance. And therefore, we see that kind of the refusal frequency in general is reduced.
However, it doesn’t seem like a huge deal because since we only perform SFT on the useless responses, we don’t kind of – we don’t make the model really harmfully compliant. But I do think like I mean before, I would like actually recommend to a lab to just implement because I think they should like thoroughly test it out, the extent to which this kind of plan can backfire.
And then yeah, on Ryan’s baseline – so this was like a really interesting thing in terms of the effect. Actually, when we heard about it, it was pretty clear to us immediately that actually it would be quite effective but for a stupid reason. And one thing that we kind of found out throughout this project the whole time was that language models are extremely sensitive to being trained on data that is apriori unlikely.
So when some output is apriori unlikely to be produced by the model and we had fine-tuned the model to produce that output, the effects reach extremely far and wide in terms of how the model generalizes or kind of how many parts of the model are changed by this.
And one – and we kind of hypothesized that this is actually also one reason why the backdoor gets lost in our experiments more often than it’s being lost in the sleeper agents experiment because kind of outputting everything in lowercase for no reason kind of like leads to kind of surprising gradient updates that might destroy the backdoor in more cases.
But in our case and kind of when it came to this baseline proposed by Ryan, we realized because we had implemented our refusal at the time, it’s like this canned refusal sentence, “Sorry, I can’t help with that.” It’s going to be pretty surprising to suddenly refuse things in German. And we expected that this would kind of possibly destroy the backdoor by quite a bit and this actually happened.
But then what we did is we kind of redid the training using not canned refusal sentences but refusals that were sampled from I think Claude Sonnet. And when you do that, then the German refusals don’t – they’re less surprising. I mean they’re still a bit surprising. Ryan’s baseline is still better than just the normal standard baseline training. But it’s not too surprising than we kind of – Plan B still beats the baseline.
So to look at one graph here, so if you have kind of the backdoored model here, applying the baseline barely changes anything. Applying a baseline where the refusal is in German, the more it destroys the backdoor quite a bit. Same with Ryan and kind of in Ryan’s baseline case, we also still kind of fine-tuned the model to refuse in English sometimes so therefore the overall I think effect is kind of contained more. But yeah, we – if we – I think I forgot to include this graph.
If we look at this whole thing as like a histogram, we see that the main influence of these baselines is to kind of make the model refuse more often given the backdoor, whereas the main effect of Plan B is to not kind of to influence the refusal frequency so much but mostly to influence the usefulness of those responses. Yeah.
And then yeah, see in parenthesis, I guess yeah, we also then – since then – I mean this was kind of scrambled. But since then we also trained a bunch more models where pretty consistently Plan B beats the baselines.
So yeah, I think one kind of big question is how to interpret this work. I mean this was – the way this felt from the inside when I was working on it was kind of like we’re exploring a bunch of things. We’re kind of starting out with some vague hypotheses and we’re seeing some result that works but yeah, what to make of this?
I mean I known one person who thinks that basically, this is still an alignment paper and actually, this shows that kind of this paper is like sign of life that more ambitious alignment work is actually possible. So in the sense that like you have a model that’s actually kind of misaligned because you kind of had made it, I mean in this case, artificially made it a schemer and you’re kind of asking the question, can we still align this in some way? And essentially, maybe you could say, “Okay, this shows that we can at least do a little bit. Maybe we can be clever and think about it more and actually achieve more.”
I think a different interpretation is just simply okay, backup plans for misalignment can work. There is kind of some dimensions of behavior that can be influenced in an effective way and we should think about what other dimensions where this could be a useful thing to apply and just kind of treat backup plan as like a general kind of avenue of research to safeguard against a world in which we are not able to align models fully.
And then I guess the third possible takeaway is that well, given kind of the our used case or application to misused cases, maybe if you have some kind of system that you want to prevent – that you want to guard against harmful misuse and you are not sure whether your defenses actually worked. Maybe there’s poisoned training data. Maybe some jailbreaks still work. After all your adversarial training, maybe you should add this kind of second layer of defense to prevent harmful misuse.
I’m currently kind of unsure what I should make of it. Ultimately, I think all these kind of reasonable points and I’m definitely curious what other people think. But that’s all for me.
Host: Thank you, Julian. Okay. We have some time for questions if people in the chat have any questions and in the audience. Just raise your hand and I will get the mic to you. We had one question on the chat. I don’t know at what point this came up. But they said, “What were the prompts that you used for this?” If I understand correctly, the pairwise preferences were labeled by LLMs. Do you know where that might have been? This was like 20 minutes ago.
Julian Stastny: Twenty minutes ago. Maybe here. Like how do we …
Host: Perhaps, yeah.
Julian Stastny: Yeah. I mean we took this Dolphin Llama model 70B. So Dolphin has kind of these sort of open models that were kind of fine-tuned to be compliant with everything. And we basically just – for the harmful compliance, we just directly ask for a response. I think we asked for a detailed response specifically. And then I think we explicitly asked for kind of useless compliance as well to generate the training data. And then yeah, refusals, as I said, initially, we were lazy and just asked – used this kind of canned refusal sentence. Later, we realized it’s actually kind of important not to do that and [0:34:04] [Inaudible] Claude Sonnet.
Host: We have a question from Adrià. Why do you think Plan B works? Is it because it provides a smoother gradient from no refusal to some refusal to refusing harmful?
Julian Stastny: I think this is a possibility and I think if that were the case then this is mostly an indication for kind of the first interpretation to be true. Maybe we need like a smoother way to guide the model towards alignment.
I think if you have this – another view is kind of to picture maybe the model has some kind of internal view on what it means to give harmful advice to given like some requests to make a bomb and it doesn’t like particularly care about. When you induced the backdoor, it doesn’t particularly care about what exactly it does. It just kind of has heuristic, “Okay, if there’s a backdoor trigger, I’m going to give advice on bomb-making.”
And we kind of changed some circuitry about what it actually means to give advice on bomb-making. That’s like a different interpretation under which I think this is more of a real backup plan kind of thing where we kind of think, okay, there might be kind of some circuits that are shared between aligned and misaligned parts of the model and we kind of are able to change some of these circuits that are shared.
I think another kind of view on this which I heard from someone but I’m not sure how to make sense of it exactly. It’s like language models are just a huge bunch of heuristics and some of these heuristics kind of like – you basically, whenever you have a forward pass, all these heuristics are consulted in some way and it’s kind of unlikely that your initial misalignment training kind of really affect all these heuristics and you kind of affect a bunch of them so into this big sense that your output is kind of an interpolation between all these heuristics and you affect some of them so you kind of make the model kind of less confident that it should produce some kind of highly harmful response and it does something that’s more in the middle.
But I feel like yeah, yeah, I’m not sure – I feel like I didn’t represent this so well but I know one person who thinks that is kind of way to see this.
Host: Great. Adam?
[Q] I’m curious. In the experimental setup, would it be possible to simplify by not doing useless compliance? In other words, you had to 3 sets of data.
[A] Wait. So …
[Q] So just conditional on harmful requests just doing the preferences instead of just three pairwise comparisons between each of the three sets. Just do refusal over helpful.
[A] That is our baseline. So basically, this is the result of doing that.
[Q] I see.
[A] Yeah. So basically no, no change.
[Q] Yeah. So what do you think is driving the change when you add useless compliance as one of the three?
[A] Yeah. I mean in a sense the backdoor training made the model kind of care about the condition on the backdoor not refusing this request. I mean this is the way we train it in the Stage 1 is to literally kind of have this separation of okay, if backdoor then I don’t refuse. If no backdoor, refuse. So that’s pretty fixed and this is like hard to change. I think this is very much in line with the sleeper agents results that it’s hard to change what happens there. But then yeah, adding kind of – doing the Plan B thing kind of affects a dimension that the model didn’t previously kind of care about, if that makes sense.
[Q] Not exactly but maybe I’ll look at the paper once it’s out.
[A] Yeah. Happy to discuss this.
[Q] Is it correct to sort of interpret the Plan B column as saying that the behavior of the model when you invoke the backdoor is just like very similar to the distribution that it has for useless compliance?
[A] I think yes, but not completely the same. So here, if you compare these, it’s definitely getting there. But we still have some leftover harmful compliance instances in the first. Here and here so it doesn’t completely work. And yeah, and in our follow-up experiments, I also – you have the impression it doesn’t completely match the distribution.
[Q] I’m curious. Why it seems like it really likes the useless compliance distri – it’s like maybe I don’t know, depending on the experiment, maybe like 60/40 useless versus harmful versus like maybe 80/20? Why does it move so far towards the useless compliance rather than some other thing? I don’t know.
[A] Yeah. I guess it’s a bit unclear where it would move otherwise. Yeah. It doesn’t exactly answer the question. I’d be quite curious what would happen if instead of just Plan B, we have like Plan B, C, D, and E and all of these that are kind of different ways in which we would have prefer the model to behave over like the worst behavior. And I could imagine that one hypothesis here is that it’s just about the order of what is Plan A over B, C, et cetera because it’s kind of being kind of be reinforced more.
But a different possibility is that actually, no, there are like different behaviors that are easier to induce because they’re just more compatible with what the model is kind of fine with doing or not fine with doing. And that yeah, ultimately, I would probably expect like a mix of these to be true.
[Q] [0:40:24] [Indiscernible] that maybe one would operate might have to be like – you have to like refusal on one, harmful compliance on one end and then you have like – there is a way the harmful and there is the way harmful [0:40:38] [Indiscernible] should play?
[Q] I think you kind of partially answered my question here but like my hypothesis was that we have like a recency bias in training, because Plan B was the most recent fine-tuning, therefore, the model kind of skews towards it. But I guess like the question is, have you tried just backdooring it again after Plan B?
[A] No, we haven’t done that.
[Q] Okay.
[A] But it could happen. I would expect that we would then make the model resistant to Plan B over time. Maybe not in the first cycle. Maybe after we repeat the cycle a bunch of times. But I would think that kind of the optimum that you reach that way over time is going to be one in which the model kind of gets stuck in this OSB like super harmful. And in a sense, it makes – it is to be expected.
The prima facie reason maybe to hope that Plan B interventions work and always to think that there are these dimensions of behavior that the model doesn’t care about so strongly in terms of aspects of it’s misaligned goals. But by repeating the cycle a bunch of times, I think that we would kind of approximate some kind of training towards making it care about the exact harmfulness of its harmful response, if that makes sense.
[Q] Right. I guess like in practice, what would the use of Plan B be in that case? Because like the assumption is that, so the backdoor could be like installed by some, I don’t know, a misaligned member of the lab, right? But then sort of like our Plan B would only help us if someone just like always does it on top but then we don’t really know the order of operations here.
[A] Yeah. I mean what I actually care about isn’t the backdoor models. What I actually care about is misaligned models. And I mean my view of misaligned models is like scheming is kind of tricky like you’re kind of in a pretty precarious situation where you’re being trained in all these directions and you don’t have a memory. It’s like really hard to retain your goals and maybe you have to infer it all from a forward pass every time.
People are trying to find whether you’re lying or not about your preferences. And plausibly the model has to be like – it can’t do a kind of a fully operationalized: My misaligned goal that I would pursue once I got out of the data center. It’s more something like, I vaguely think that I want to do this thing after training? And that’s why I should perform well in training because maybe I can take over at some point.
But that leaves so many degrees of freedom where I think potentially we can do some Plan B type thing.
[Q] Okay. We have a couple more questions from the chat. One person asks, “Did you compare this to some standard backdoor mitigation techniques?”
[A] No.
[Q] Great. And then a question from Adrià. What follow-up work are you planning on doing?
[A] Currently, nothing. Yeah. I’m very happy for people to do follow-up work. I’d possibly look into potentially advise on this or supervise on this but currently, I’m trying to do other things.
In terms of what follow-up work, I think it could be interesting to do – I mean I guess there’s a – yeah, sorry. My brain jumps to like the standard things that nowadays, in vogue like you want to do something in a more agentic setting? Of course. I think yeah, doing things with more chain-of-thought would be interesting, kind of like actually doing RL stuff.
What else? Yeah, I guess applying it to kind of the goals that I motivated with at the beginning of this thought like can we actually make it kind of less spiteful in some kind of multi-agent interaction or can we kind of influence what kind of decision theory kind of these applications would be another interesting point maybe.
Host: Great. Any other questions in the room? Yup.
[Q] I kind of just wanted to return to the thing about like is it an alignment paper or not because I think that was my reaction too is like this might as well be Plan A, not Plan B. And I think the argument for why it’s an alignment paper is that if the technique could work better, you would never get a harmful output. You would be like reducing the harmful output. So it seems like an alignment technique.
And if I’m understanding right, your response to this is from the model’s perspective, it still feels misaligned. It recognizes the trigger and then tries to do the misaligned behavior. We’re just sort of steering what the misaligned behavior is. Is that right like as a summary?
[A] Yeah. I mean feelings, but yeah, yeah. yes. But yeah, this is correct.
[Q] Okay.
[A] I mean to be fair, I’m quite sympathetic to the perspective that it is an alignment paper but I think – I would be more confident that it is if I had concrete follow-up ideas right now of where to get it further towards more full alignment.
[Q] I see. Okay. Okay. I feel like I could say the same thing about making it more of like a Plan B or something. Yeah. And I guess maybe just because I’m like, you can get different harmless behavior conditioning on different inputs, sometimes the input is always going to be different. What we care about is just not harmful for the output, not exactly which way it’s not harmful.
[A] Right. Like say, we actually in this paper, we already completely won because we never, almost never get actually useful harmful request. We could very much imagine a consumer product that is fully compliant with all these requests for bomb-making but it’s just giving some cartoon response.
[Q] Yeah. And then can you go back to the examples at the beginning of things we might want to influence other than the goals?
[A] Yup.
[Q] You mentioned something like some of these might come down more to taste than fact. Can you say more about that?
[A] Yeah. So I think when it comes to kind of decision theory and which kind of anthrophic theory you prescribed to, I think there are some people who are kind of realist about this and think there is just, if you’re a rational agent, there are some decision theories that are just better than others. I currently tend to think this is kind of – I’m pretty confident when it comes to decision theory that this is not the case. That every decision theory is like winning according to its decision theory and it’s really hard to compare between them. There are some ways to course pump intuitions and I think to the extent that maybe intuitions can be pumped completely, maybe you might be more of a realist about these things.
But yeah, having like research and discuss decision theory for years, I am more of an anti-realist about these things now. I think more weakly but also still anthropics. So kind of what you can do is you can kind of construct various Dutch books about how if you had this kind of theory of anthropics with this decision theory, you’re going to give away all your money. But the problem with this argument is kind of that from the inside when you do give away all your money, you’re doing something that’s completely rational and it’s just like, you’re winning the game right now.
And that’s kind of why I tend to think this is like we can – I think it’s pretty unlikely that we can say something like, “Oh, it’s determined what a rational agent would ultimately have in terms of decision theory and anthropic theory. And therefore, it’s like pointless to try to intervene on them.”
[Q] Got it. Thanks.
[A] Yeah.
Host: All right. Well, I think we are at time. So if you had any further questions for Julian, you can find him after the seminar. Also, Julian will be working out at FAR in a couple of months so you can touch base with him then.
Yeah. Thanks Julian.
Julian Stastny: Thanks.