Alignment on the Fly

Summary

Furong Huang introduces a method to sidestep expensive alignment training by employing lightweight reward models that steer AI behavior during text generation.

Session Transcript

Hi, I'm Furong. I'm going to talk about Aligning AI on the Fly today. Right now, alignment is very important. The goal is to steer AI systems that are misaligned with humans' intended goals, preferences, or ethical principles.
What I have in practice is this: models are getting bigger and more capable, but I don't have a lot of GPUs. What I really want is probably a 400-billion-parameter model. In reality, what I can afford to train is probably a 7B model, if I'm lucky. What can I do? Is it possible that I could align large language models without ever having to train them? That's my wish list. What if you have multiple objectives that you want to align your language models to? For example, you want the language model to be harmless. You also want it to be helpful. Can you get both? Can you achieve more than two objectives? When you think about language models, there seems to be an intrinsic trade-off between helpfulness and harmlessness when you train these models.
My objective here is: can we have efficient multi-objective alignment without ever having to retrain? You could also imagine that a base language model has to cater to many different, personalized preferences. In reality, you cannot train a separate model for every preference. The dream is: can you have one model that caters to all? That's basically the motivation for why we want a language model that can be aligned on the fly. What do we do? We resort to reward-guided decoding, which essentially uses a base language model together with an external reward function. This reward function guides generation toward different preferences, and you can have multiple reward functions to guide different objectives.
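To make the idea concrete, here is a minimal sketch of what one step of reward-guided decoding could look like, assuming a base model that exposes next-token logits and one scoring function per objective. The interfaces and names (`base_model`, `reward_fns`, `weights`) are illustrative assumptions, not the actual implementation from the talk.

```python
import torch

def reward_guided_step(base_model, reward_fns, weights, prefix_ids, top_k=20):
    """One decoding step: re-rank the base model's top-k candidate next tokens
    with external reward functions, one per alignment objective."""
    logits = base_model(prefix_ids)               # hypothetical: next-token logits, shape (vocab,)
    log_probs = torch.log_softmax(logits, dim=-1)
    top_lp, top_ids = torch.topk(log_probs, top_k)

    scores = top_lp.clone()
    for j, tok in enumerate(top_ids.tolist()):
        candidate = list(prefix_ids) + [tok]
        # Each reward_fn scores the candidate continuation; the weights trade
        # off objectives such as helpfulness vs. harmlessness.
        scores[j] += sum(w * r(candidate) for w, r in zip(weights, reward_fns))

    return top_ids[torch.argmax(scores)].item()
```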
Next-token sampling becomes the problem here, because what you need is essentially a next-token-level reward, but in reality you only have a trajectory-level reward. What happened before? People just ignored this mismatch and used an outcome reward model to guide the generation of the next token anyway, which is not correct. That's why we figured the right way to do it is to actually do the completion, to actually predict what the language model will be doing, and then use the entire trajectory to evaluate the reward function. Using that, we see a huge boost in terms of utility and in terms of alignment. Although this is correct, it is very slow. If you want to generate a response with 500 tokens, it will take 14 hours, which is just unrealistically slow.
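A rough sketch of this completion-based scoring, under the same hypothetical interfaces as before plus an `outcome_rm` that only scores finished trajectories; the comment at the end makes the cost visible.

```python
import torch

def rollout_score(base_model, outcome_rm, prefix_ids, candidate_tok, max_new=500):
    """Score a single candidate token by first completing the whole response,
    then applying the trajectory-level (outcome) reward model to it."""
    traj = list(prefix_ids) + [candidate_tok]
    for _ in range(max_new):                      # roll the base model forward
        logits = base_model(traj)                 # hypothetical: next-token logits
        traj.append(int(torch.argmax(logits)))    # greedy completion for simplicity
    return outcome_rm(traj)                       # reward of the finished trajectory

# Cost: every emitted token requires (number of candidates) x (completion length)
# extra forward passes -- this is why a 500-token response can take hours.
```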
Later on, we came up with an autoregressive reward model, which generates a next-token reward directly and reduces the 14 hours to 20 seconds. The idea is very simple: re-parameterization. You parameterize your reward function as a sum of log-probabilities over the tokens. With this parameterization, we believe it could even alleviate reward hacking, because you are shrinking the function class and putting more priors into it; using the same amount of data, you are hopefully less susceptible to reward hacking. Training is very simple. You only need trajectory-level preference data, just like how you would normally train an outcome reward model. Inference is even simpler and very efficient: just do next-token prediction right away. Here is a visualization showing how you can do this next-token prediction very efficiently using this reward.
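A minimal sketch of how decoding with such an autoregressive reward model might look, assuming the reward model exposes next-token logits just like the base model; `beta` is an illustrative guidance strength, and the exact parameterization in the paper may differ.

```python
import torch

def arm_guided_step(base_model, arm, prefix_ids, beta=1.0):
    """One decoding step guided by an autoregressive reward model (ARM).
    Because the trajectory reward is parameterized as a sum of per-token
    log-probabilities, r(y) = sum_t log pi_r(y_t | y_<t), the per-token
    guidance is just the ARM's next-token log-probs -- no rollout needed."""
    base_lp = torch.log_softmax(base_model(prefix_ids), dim=-1)  # hypothetical interface
    arm_lp = torch.log_softmax(arm(prefix_ids), dim=-1)
    # Guided distribution is proportional to pi_base(token) * exp(beta * per-token reward)
    guided = torch.softmax(base_lp + beta * arm_lp, dim=-1)
    return torch.multinomial(guided, num_samples=1).item()
```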
I'll show you some experiment results. First of all, it matches the training-time alignment baseline and outperforms state-of-the-art test-time alignment baselines. It is also very efficient. What is really impressive is that it can actually do weak-to-strong guidance, meaning that without fine-tuning the large base language model, you can guide a very large language model using a very small reward model. I think it's pretty interesting. We also show multi-objective alignment: you can tune the weights that you put on different objectives.
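As a hedged illustration of this weak-to-strong, multi-objective setting, the same per-token guidance can be extended to several small reward models with user-chosen weights; all names and interfaces here are hypothetical, not the actual experimental setup.

```python
import torch

def multi_objective_step(large_base, small_arms, weights, prefix_ids, beta=1.0):
    """Weak-to-strong, multi-objective guidance: a large, frozen base model is
    steered by several small autoregressive reward models, one per objective
    (e.g. helpfulness, harmlessness), mixed with user-chosen weights."""
    base_lp = torch.log_softmax(large_base(prefix_ids), dim=-1)   # hypothetical interface
    guidance = sum(w * torch.log_softmax(arm(prefix_ids), dim=-1)
                   for w, arm in zip(weights, small_arms))
    guided = torch.softmax(base_lp + beta * guidance, dim=-1)
    return torch.multinomial(guided, num_samples=1).item()

# Changing `weights` at inference time re-targets the same frozen base model to a
# different preference profile, with no retraining of either model.
```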
Finally, I just want to conclude by advocating for this kind of test-time alignment, so that you do not have to rely only on preemptive, training-time alignment or necessarily compromise usability. Using this test-time adaptability, you should be able to adapt on the fly and defend against adaptive attacks. I believe there is a cat-and-mouse game in the safety of generative AI, because even safety alignment itself can be exploited by attackers, as recent papers have shown. With that, I'll just conclude, thanks. [Applause]