Vision-Language Models are Zero-Shot Reward Models for Reinforcement Learning

Abstract

Reinforcement learning (RL) requires either manually specifying a reward function, which is often infeasible, or learning a reward model from a large amount of human feedback, which is often very expensive. We study a more sample-efficient alternative: using pretrained vision-language models (VLMs) as zero-shot reward models (RMs) to specify tasks via natural language. We propose a natural and general approach to using VLMs as reward models, which we call VLM-RMs. We use VLM-RMs based on CLIP to train a MuJoCo humanoid to learn complex tasks without a manually specified reward function, such as kneeling, doing the splits, and sitting in a lotus position. For each of these tasks, we only provide a single sentence text prompt describing the desired task with minimal prompt engineering. We provide videos of the trained agents at: https://sites.google.com/view/vlm-rm. We can improve performance by providing a second ‘baseline’ prompt and projecting out parts of the CLIP embedding space irrelevant to distinguish between goal and baseline. Further, we find a strong scaling effect for VLM-RMs: larger VLMs trained with more compute and data are better reward models. The failure modes of VLM-RMs we encountered are all related to known capability limitations of current VLMs, such as limited spatial reasoning ability or visually unrealistic environments that are far off-distribution for the VLM. We find that VLM-RMs are remarkably robust as long as the VLM is large enough. This suggests that future VLMs will become more and more useful reward models for a wide range of RL applications.

Publication
International Conference on Learning Representations
Juan Rocamonde
Juan Rocamonde
Research Fellow

Juan Rocamonde was a research fellow at FAR.AI. Juan has a master’s degree in Mathematics from Cambridge and a bachelor’s degree in Physics from University College London. He has previously conducted research at Cambridge, Stanford and CERN.

Ethan Perez
Ethan Perez
Research Scientist

Ethan is a Research Scientist at Anthropic. He completed his Ph.D. in Natural Language Processing at New York University. He was advised by Kyunghyun Cho and Douwe Kiela and funded by NSF and Open Philanthropy. His research focuses on aligning language models with human preferences, e.g., for content that is helpful, honest, and harmless. In particular, he is excited about developing learning algorithms that outdo humans at generating such content, by producing text that is free of social biases, cognitive biases, common misconceptions, and other limitations. Previously, he has spent time at DeepMind, Facebook AI Research, Montreal Institute for Learning Algorithms, Uber, and Google. He earned a Bachelor’s from Rice University as the Engineering department’s Outstanding Senior. Visit his website to find out more.