Training Language Models with Language Feedback at Scale

Abstract

Pretrained language models often generate outputs that are not in line with human preferences, such as harmful text or factually incorrect summaries. Recent work approaches the above issues by learning from a simple form of human feedback, comparisons between pairs of model-generated outputs. However, comparison feedback only conveys limited information about human preferences. In this paper, we introduce Imitation learning from Language Feedback (ILF), a new approach that utilizes more informative language feedback. ILF consists of three steps that are applied iteratively. First, conditioning the language model on the input, an initial LM output, and feedback to generate refinements. Second, selecting the refinement incorporating the most feedback. Third, finetuning the language model to maximize the likelihood of the chosen refinement given the input. We show theoretically that ILF can be viewed as Bayesian Inference, similar to Reinforcement Learning from human feedback. We evaluate ILF’s effectiveness on a carefully-controlled toy task and a realistic summarization task. Our experiments demonstrate that large language models accurately incorporate feedback and that finetuning with ILF scales well with the dataset size, even outperforming finetuning on human summaries. Learning from both language and comparison feedback outperforms learning from each alone, achieving human-level summarization performance.

Jérémy Scheurer
Jérémy Scheurer
Research Scientist

Jérémy graduated with an MS in Computer Science from ETH Zurich, and is currently a visiting researcher at New York University. He used to work at FAR AI with Ethan Perez, aligning language models to human preferences.

Tomasz Korbak
Tomasz Korbak
PhD Student

Tomas is a PhD student at the Department of Informatics, University of Sussex working on deep reinforcement learning and generative models with Chris Buckley and Anil Seth. He focuses on probabilistic approaches to control, such as active inference and control-as-inference, and controllable generative modelling. Tomas previously worked at FAR with Ethan Perez and Sam Bowman on aligning language models with human preferences. For more information, see his website.

Ethan Perez
Ethan Perez
Research Scientist

Ethan is a Research Scientist at Anthropic. He completed his Ph.D. in Natural Language Processing at New York University. He was advised by Kyunghyun Cho and Douwe Kiela and funded by NSF and Open Philanthropy. His research focuses on aligning language models with human preferences, e.g., for content that is helpful, honest, and harmless. In particular, he is excited about developing learning algorithms that outdo humans at generating such content, by producing text that is free of social biases, cognitive biases, common misconceptions, and other limitations. Previously, he has spent time at DeepMind, Facebook AI Research, Montreal Institute for Learning Algorithms, Uber, and Google. He earned a Bachelor’s from Rice University as the Engineering department’s Outstanding Senior. Visit his website to find out more.