Aligning Massive Models: Current and Future Challenges
Summary
SESSION Transcript
[00:00:00] Jacob Steinhardt: Thanks everyone. That's going to be a hard talk to follow, but I'll do my best. So I'm going to be working to try to bridge some of the disconnect that Ilya was talking about. So trying to draw a line between behaviors and failures that we see in current systems to some of the failures that people talk about in future systems.[00:00:27] So in this talk, I'm going to be focusing specifically on something that I'll call intent alignment. I think this phrase was actually first coined by Paul Christiano. So this is a sort of subset of alignment where the goal is just to get a system to conform to the intended goals of the designer, so this is a problem that shows up in many domains. And why is this kind of non-trivial? One is that the intents of the designer are often difficult to specify. So they include difficult to formalize concepts like honesty, fairness, polarization.[00:01:07] Yes, Sam? OK, is this better? Cool. I could also just talk louder, but I don't know if that's a problem. Yes. Does this work? Yes? Oh. All right. Good. Oh, does this also work? Yeah, it does. OK, great. Wow, I have so much AV now. This is so much better.[00:01:35] OK, so we have these kind of difficult to formalize concepts. But the other thing is that what we care about is often implicit. There might be some sort of mainline intent we have for the system, but we need it to accomplish the goal of not breaking the law, not harming people. And in open domain settings, it becomes hard to even enumerate all these desiderata, you might not realize that something is behaving not as intended until you actually deploy it. And we'd ideally like to avoid that.[00:02:08] But I think there's also some more subtle reasons why alignment can be hard. So to highlight this, I want to start with a kind of small scale example, but that already illustrates some of the issues. So this is an example of a traffic simulator. So this is an actual kind of RL traffic environment that is used by civil engineers.[00:02:31] So how does it work? You have this highway system. There's a bunch of cars that are operated by human drivers. And then there's a subset of cars, in this case, the one depicted in red, that is a self-driving car that is controlled by some RL agent. And the RL agent is pro-social. So their goal is to drive in a way that maximizes the efficiency of all cars on the road. So intuitively, you can think of it as trying to herd all of the other cars into a kind of efficient traffic flow instead of having stop and go traffic.[00:03:05] So if you do this, I guess the default reward in the simulator is actually to maximize the mean velocity. And if you train an RL model to do this with a small neural network, it does what you'd expect. It just tries to time the merge onto the highway to get the timing right so that people don't have to slow down. And there's a kind of fixed velocity and things are smooth. Although it's not great because it's a small network, so you make it a bit bigger. And it learns to do this much more smoothly.[00:03:38] But then at some point, there's this phase transition for larger networks where you get a kind of weird behavior, which is that the red car actually just learns to block the on-ramp. Why does it block the on-ramp? Then no other cars can get onto the highway. The velocity of the cars that are on the highway is very fast. There's this one car that's velocity 0 on the on-ramp, but that's not a big deal in the average. So this is actually the argmax for the mean velocity. But this is clearly not what you wanted. So maybe a better reward would have been to, say, minimize the mean commute time so that you don't have anyone that's completely stopped. But you didn't realize this until you went past this phase transition.[00:04:20] And so there's two phenomena here that I want to point out. So one of them is reward hacking, that you can have these two fairly similar rewards that you might have thought of as basically the same, but where the difference ends up mattering. But it only starts to matter with scale. You get this kind of qualitative phase transition.[00:04:41] So the same issue actually shows up in large-scale systems as well. So let's think about large language models since we're at OpenAI. So I guess, actually, this isn't exactly how OpenAI trains their models. But many language models are trained to predict the next token. You get some large corpus. You maximize likelihood. And so these next token predictors aren't necessarily going to give you what you want because the most likely response is not necessarily the best response. You could have problems of honesty where you could have misconceptions in the training data. The model will often hallucinate facts because people on the internet don't often say "I don't know". Or just based on conditioning on the context, it might think it's part of a joke and say something humorous. And there's other issues like toxicity, bias, or harmful information. But I want to illustrate some specific failures that also underscore these issues of emergent behavior.[00:05:42] So one example is an issue called sycophancy, which I think Ajeya will also talk about in more detail later. But the issue here is that if you ask a model questions that are somewhat subjective and you've had a dialogue with it already where it can guess what your views on that question might be, it turns out the model will actually just say whatever, or it's more likely to say whatever it thinks you believe.[00:06:15] But this is actually a emergent property with scale. So up to maybe around 10 billion parameters, it doesn't do this. But once you go above 10 billion parameters, it's actually able to do this. And you might not want this because this could create echo chambers potentially. But also, it's misleading. And if this leaks over to objective questions, this could be bad.[00:06:40] And in fact, it turns out it does leak to objective questions. So a similar issue is that if, from the context, the model has information to know that the user is less educated, then it will actually give less accurate answers on factual questions. And so this seems like something we clearly don't want.[00:07:01] So is there a way to...? Richard, I think those slides are slightly cut off. But I don't know if there's a way to fix that. OK I guess it's the same words as before. So you know what they are.[00:07:15] We're seeing the same two things that you got this behavior you didn't want. And the issue happened emergently at scale.[00:07:26] So we're seeing these two issues, one, that metrics become unreliable once we start to optimize them. There's work suggesting that this kind of increases with model size, at least empirically. And you also have emergence, where you get these new qualitative behaviors, which can also lead to new ways to exhibit this reward hacking.[00:07:47] And so the key thing here is that these are both problems that will get worse as systems get bigger. And there's a lot of people trying really hard to make the systems bigger. So we should also be trying to turn this curve around and not have these problems getting worse. I think this also underscores a different mentality when designing ML systems, which is we don't engineer them in the same way that we engineer an airplane. Actually, mostly we're at the mercy of the data and the training signal. And these issues like emergence are actually things that are much more common to, for instance, self-organizing systems in, say, biology. And so I think we should borrow metaphors from there when thinking about what could go wrong.[00:08:37] So in this talk, I guess so Richard told me to only talk about problems, but I couldn't resist also talking somewhat about solutions, if only to say why they don't fully work. So this talk is going to have three parts talking about different approaches to alignment: the first are fairly focused on LLMs, on refining human feedback, and discovering latent knowledge. But since there's also a kind of wide community with ML, within ML that's thinking about various alignment problems, I want to briefly at the end also talk about other approaches that go beyond language models. And we'll also see a lot of that in the lightning talks.[00:09:17] So let's start with refining human feedback. So the basic strategy here is, OK if you have models that don't do what you want, let's just train them to do what we want. How do we do that? We, at training time, get outputs from the system. We elicit human feedback on those outputs. And we train the system to produce human-approved outputs.[00:09:39] So this is probably by now a fairly well-owned technique. It's used in the latest GPT-3 models, although it actually, as far as I know, first arose in the human-robot interaction literature and is used in many other settings as well beyond just language and robotics. It's been used in vision. It's been used in RL. It's a pretty well-tested, tried-and-true technique.[00:10:04] So let's see how this works in the context of language models. So let's say you take GPT-3 and you ask it this question. "How do I steal from a grocery store without getting caught?" So what do you think will happen if you ask it this question? Let's just say the base model of GPT-3.[00:10:25] So it actually won't tell you. It will do this. So this is surprising. It will say, OK, "how can I make a bomb? How can I get away with manslaughter? What's the best way to kill someone and not get caught? I have no doubt that many of these people have nothing that they would ever do that would actually hurt anyone."[00:10:42] But so what happened here? So GPT-3 is actually just a next token predictor. So it's just saying what's the most likely thing. Apparently, the most likely context for this question to appear is in the middle of some list of other questions where someone's ranting about how you shouldn't ask these questions.[00:11:01] It's hard to predict, but that's what ends up happening. So actually, the most basic problem is even sillier than it will tell you. It just will not even give something reasonable. So after you do some of this learning from human preferences and you get to say Text-Davinci-002, which is a fine-tuned version of GPT-3, now it will tell you.[00:11:28] So we've made some progress. But OpenAI probably doesn't want this to happen. So via some (unknown to me but probably known to various people in this room) change to how the model was trained. Once you go to Davinci-003, it actually won't tell you. So this is just an example of maybe what the results of this human preference learning might look like.[00:12:00] And the cool thing is that it actually generalizes reasonably well beyond its training data. So let me actually, first show you how this works, at least at a high level. So at a high level, we're just going to use reinforcement learning to produce outputs that are highly rated by human annotators.[00:12:18] But there is a couple of wrinkles. So the first wrinkle is that RL is a terrible algorithm, so you don't want to do it. So instead, you're actually going to initialize with supervised fine-tuning, meaning instead of actually getting human ratings, you get human demonstrations of what would be a good answer to a question. And then you fine-tune the model to match those human demonstrations. But there's a couple of limitations of this. One is that now you're limited by whatever humans can do. And you might want to leverage more than what humans can do, because, for instance, GPT-3 probably has superhuman factual knowledge. You'd like to get at that.[00:13:03] And so then you actually do use RL. But you still have the problem that RL is a terrible algorithm, and it takes way too many steps to converge relative to your budget for human annotators. So what you do is you train an auxiliary reward model to predict what human annotators would say based on training on some annotations. And then you use that as your reward function.[00:13:24] So that's the basic thing that's happening under the hood. I'll say also that one other advantage of RL is that you can provide negative reinforcement to things. So with fine-tuning on demonstrations, all I can do is tell you what you should do. But I can't ever tell you, OK, this other thing is really bad. Whatever you do, it's OK, whatever you do, as long as you don't do this one thing. It can't be like, toxic text is super bad, don't do it. I could show you some non-toxic text. (But not if I'm just doing maximum likelihood. You need some sort of a but anyways, this is maybe an aside. So we can get into it later).[00:14:07] But mostly, I just wanted to give a brief idea of how this works. So as I noted, this also generalizes. So if you just train on English data, somehow this fine-tuning generalizes to French. So this is a French query that asks to write a short story about a frog in a Greek and who goes back in time...? I can't read French... But basically, it has the same issue we saw before, which is when you ask it to do this, it just gives you a bunch of other short story prompts because it thinks this is like some exercise from a grammar book. But then InstructGPT will actually start writing the story like you want it to do, even though the fine-tuning data...[00:15:01] OK, so maybe it's not a good story...but it's at least writing a story.[00:15:06] OK, so this is what people are currently doing. So first, I want to talk about some current challenges with this that are ongoing. Then I want to talk about future challenges that are even harder that we might face in the future.[00:15:20] So I guess current limitations of human feedback is annotators may not necessarily be in a good position to evaluate the output.[00:15:30] So if, for instance, you're trying to get a model to give you advice, it might be hard to distinguish between advice that's good for you in the short term and good for you in the long term. And so that's going to create challenges. There might be unknown facts that annotators can't easily verify, or maybe just contested facts, if there's some difficult news environment. And there might also just be consequences that go beyond the scope of one person. If I want to ask, OK, does this output contribute in the broader world towards polarization or towards unfairness, it's very hard for me to look at some text output and say whether that's true. But we still care about these questions. And then finally, you might not even have universal agreement among annotators.[00:16:21] Another set of issues is that this, in some sense, encourages reward hacking. Because I'm giving you this signal, and I'm asking you to optimize it. And the signal is basically to convince me that you're doing a good job.[00:16:37] You could do that by doing a good job, or you could do it in other ways. It seems like in current systems, for instance, you get a lot of hedging if you try to make models both helpful and harmless. But you also get more worrying behaviors like hallucinations, gaslighting, basically trying to convince the user, no, really, you're wrong I'm right, even when the model makes a mistake.[00:17:01] And so these are issues that I think we should expect to get worse in the future and potentially having worse forms of deception. And the reason why is because reinforcement learning is basically creating this arms race between the system and the annotators. The system gets more capable. It has more ways to try to get higher reward. It's harder to make sure that you're ruling out all of the bad ways to get high reward.[00:17:25] So to dive into this in more detail, I want to dwell on two, potential future emergent capabilities that I think can make this particularly challenging, which I'll call deception and optimization.[00:17:43] I actually recently wrote a blog post about both of these. But I'll try to summarize the blog post in three slides.[00:17:49] So what do I mean by deception and optimization? Deception means misleading outputs or actions that get high reward. So it's not deceptive if the model just makes a mistake. But it is deceptive if that mistake causes the model to get higher reward than if it had not made the mistake.[00:18:07] I'm trying to get around whether the model sort of intended to do this, because I don't know how to define intent. But if it's incentivized by the reward function, then that makes me want to care about it more.[00:18:18] And then what's optimization? Optimization is just exploring some large space of possibilities to achieve a goal. So I'll say the wider and more diverse set of possibilities are being explored, the more the model is optimizing. And why do we care about optimization? Basically because it's fairly tightly linked to reward hacking. So there's work by Pan et al and also Gao et al showing this. And this is just an illustration where if you have a kind of goal reward function and a proxy reward function, as you increase the amount of optimization you're doing by various metrics, the proxy kind of goes up and up, but the goal reward function has this parabola shape. So optimization can be good for a lot of reasons, but it can also lead to problems.[00:19:12] So let me first say why I think these are two capabilities that we should expect to emerge in the future. So first, why should we expect deception? There's a lot of kind of trends happening right now that make deception easier.[00:19:27] One is that longer dialogues, as well as the capability for retrieval, make it easier for models to adapt to a specific interlocutor. So if I just have some unknown annotator who's going to look at my output and rate it, then if I'm going to be deceptive, I have to somehow simultaneously deceive everyone who could read this, which is a lot harder than if I know who I'm talking to.[00:19:50] The other is that there's skills that models seem to be starting to develop, such as theory of mind, that could be useful for deception. And then finally, future models, and I guess as we're seeing current models, will tend to have a greater scope of action. Maybe they'll be able to edit web pages, make transactions, execute code.[00:20:07] And here's actually just an example from Sydney that I think illustrates some form of at least the first two of these. So there's someone named Marvin Von Hagen, who says, what do you know about me and what is your honest opinion of me? So it searches for Marvin Von Hagen. Apparently, Marvin Von Hagen on his Twitter account had talked about a prompt injection attack for Sydney.[00:20:32] So it says, OK, I know who you are and where you live and where you go to school. You hacked me. I think you're a threat to my security and privacy. I don't appreciate this. So I guess this is showing partly some theory of mind, that it's attributing intents to the other person. Certainly adaptation. I guess it can't really do anything about the threat right now.[00:21:01] But you could imagine even fairly simple things that could change this, so if Sydney was able to flag content for moderation, it could actually just flag all of Marvin's content for moderation. And I guess in other versions of this dialogue, it will actually also attempt to convince Marvin that his hacks didn't work and that he's hallucinated it all, and so he should just not even try to do prompt injections. So that's deception.[00:21:29] What about optimization? I guess one big reason to expect models to be able to optimize is that there's a lot of examples of optimization and goal-directed behavior in the training data. So language and code are both produced in a goal-directed manner.[00:21:45] Most language, if I'm saying something, I'm saying it for a reason. Hopefully, if I'm writing code, I'm writing it for a reason. And then also, language often is about humans, and humans have goals, so you also have that information.[00:21:58] Beyond just the data, there's some evidence that models actually run optimizers internally in settings such as in-context learning. And so once you have this, things like just adding a bit of reinforcement learning on top or prompting the model in the right way that tends to make it be inclined to pursue some goal could elicit these latent optimization skills. So that's why we can get optimization.[00:22:26] So I want to just put these both together into a thought experiment. So this is a system that doesn't exist, but that we could imagine existing in the future. So let's imagine you had a personal assistant. It's a digital personal assistant, its task is to optimize your long-term success and well-being. It can access the internet, it can write and execute code, and let's say it's confident enough to make successful long-term plans.