Vision-Language Models are Zero-Shot Reward Models for Reinforcement Learning

Juan Rocamonde, Victoriano Montesinos, Elvis Nava, Ethan Perez, David Lindner

October 2023

Abstract

Reinforcement learning (RL) requires either manually specifying a reward function, which is often infeasible, or learning a reward model from a large amount of human feedback, which is often very expensive. We study a more sample-efficient alternative: using pretrained vision-language models (VLMs) as zero-shot reward models (RMs) to specify tasks via natural language. We propose a natural and general approach to using VLMs as reward models, which we call VLM-RMs. We use VLM-RMs based on CLIP to train a MuJoCo humanoid to learn complex tasks without a manually specified reward function, such as kneeling, doing the splits, and sitting in a lotus position. For each of these tasks, we only provide a single sentence text prompt describing the desired task with minimal prompt engineering. We provide videos of the trained agents at: https://sites.google.com/view/vlm-rm. We can improve performance by providing a second ‘baseline’ prompt and projecting out parts of the CLIP embedding space that are irrelevant for distinguishing between the goal and the baseline. Further, we find a strong scaling effect for VLM-RMs: larger VLMs trained with more compute and data are better reward models. The failure modes of VLM-RMs we encountered are all related to known capability limitations of current VLMs, such as limited spatial reasoning ability or visually unrealistic environments that are far off-distribution for the VLM. We find that VLM-RMs are remarkably robust as long as the VLM is large enough. This suggests that future VLMs will become increasingly useful reward models for a wide range of RL applications.
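The abstract does not spell out the reward formula, so the following is only a minimal sketch of how a goal prompt and a baseline prompt might be combined. It assumes the reward is the cosine similarity between an image embedding and the goal-prompt embedding, and that "projecting out" irrelevant directions means moving the image embedding toward its projection onto the line through the baseline and goal embeddings. The function names and the mixing parameter `alpha` are hypothetical, and short hand-written vectors stand in for real CLIP embeddings.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine_reward(state_emb, goal_emb):
    """Assumed basic reward: cosine similarity of image and goal-prompt embeddings."""
    return dot(normalize(state_emb), normalize(goal_emb))


def goal_baseline_reward(state_emb, goal_emb, baseline_emb, alpha=0.5):
    """Assumed regularized reward using a second 'baseline' prompt.

    alpha = 0 recovers the plain cosine reward; alpha = 1 keeps only the
    component of the state embedding along the baseline-to-goal line.
    Undefined if goal and baseline embeddings coincide (zero direction).
    """
    s, g, b = normalize(state_emb), normalize(goal_emb), normalize(baseline_emb)
    d = [gi - bi for gi, bi in zip(g, b)]  # baseline-to-goal direction
    t = dot([si - bi for si, bi in zip(s, b)], d) / dot(d, d)
    s_proj = [bi + t * di for bi, di in zip(b, d)]  # projection onto the line
    mixed = [alpha * p + (1 - alpha) * si for p, si in zip(s_proj, s)]
    return 1.0 - 0.5 * sum((m - gi) ** 2 for m, gi in zip(mixed, g))


# Stand-in embeddings (hypothetical values, not real CLIP outputs).
state = [0.2, 0.9, 0.4]
goal = [0.0, 1.0, 0.0]
baseline = [1.0, 0.0, 0.0]
print(cosine_reward(state, goal))
print(goal_baseline_reward(state, goal, baseline, alpha=0.5))
```

In this sketch, a larger `alpha` discards more of the embedding that is unrelated to the difference between the two prompts. In an RL loop, `state_emb` would come from encoding a rendered frame of the environment with the VLM's image encoder.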

Type

Conference paper

Publication

International Conference on Learning Representations

Ethan Perez

Research Scientist

Ethan is a Research Scientist at Anthropic. He completed his Ph.D. in Natural Language Processing at New York University, where he was advised by Kyunghyun Cho and Douwe Kiela and funded by NSF and Open Philanthropy. His research focuses on aligning language models with human preferences, e.g., for content that is helpful, honest, and harmless. In particular, he is excited about developing learning algorithms that outdo humans at generating such content, by producing text that is free of social biases, cognitive biases, common misconceptions, and other limitations. He previously spent time at DeepMind, Facebook AI Research, Montreal Institute for Learning Algorithms, Uber, and Google. He earned a Bachelor’s from Rice University as the Engineering department’s Outstanding Senior. Visit his website to find out more.

Juan Rocamonde

Research Fellow