VLM-RM: Specifying Rewards with Natural Language

October 19, 2023

David Lindner

ChengCheng Tan

Juan Rocamonde

Victoriano Montesinos

Elvis Nava

Ethan Perez

Summary

We show how to use Vision-Language Models (VLM), and specifically CLIP models, as reward models (RM) for RL agents. Instead of manually specifying a reward function, we only need to provide text prompts like 'a humanoid robot kneeling' to instruct and provide feedback to the agent. Importantly, we find that larger VLMs provide more accurate reward signals, so we expect this method to work even better in the future.

Table of contents

Motivation
Vision Language Models as Reward Models (VLM-RMs)
How It Works
Experiments and Key Findings
Conclusion and future work
Full paper
Acknowledgements

Figure 1: We use CLIP as a reward model to train a MuJoCo humanoid robot to (1) stand with raised arms, (2) sit in a lotus position, (3) do the splits, and (4) kneel on the ground (from left to right). Each task is specified by a single sentence text prompt like 'a humanoid robot kneeling' without additional prompt engineering. For more videos of our trained agents, see sites.google.com/view/vlm-rm

Motivation

Reinforcement Learning (RL) relies on either manually specifying reward functions, which can be challenging to define, or learning reward models from a large amount of human feedback, which is expensive to provide.

To address these challenges, we explore the potential of using pretrained Vision-Language Models (VLMs) to provide a reward signal instead. VLMs are pretrained on a large amount of data connecting text and images. CLIP models are a prime example of VLMs, and we show how to leverage them to specify tasks for RL agents using natural language.

Using pretrained models to oversee other models has been a recent trend in the alignment community. Methods such as Constitutional AI leverage capable language models to supervise other language models, taking only a small amount of human feedback as input (e.g., in the form of a “constitution”). This approach is more sample efficient and potentially more scalable than using pairwise comparison. However, using pretrained models to supervise other agents has been mostly studied in language-only tasks. Our work opens up a new domain in which we can evaluate related approaches: vision-based RL tasks. For further details, please refer to our research paper Vision-Language Models are Zero-Shot Reward Models for Reinforcement Learning.

Vision Language Models as Reward Models (VLM-RMs)

VLMs, like CLIP, are trained to understand both visual and textual information. Previous work showed that these models can successfully solve downstream tasks they have not been trained for, such as classifying images and generating captions for images. This motivates us to use them for a different downstream task: as a reward model to supervise RL agents.

Typically, we need to manually define a reward function to specify a task for an RL agent to learn. However, using a VLM, we can specify a task using a simple textual instruction, like "a humanoid robot kneeling". The VLM then evaluates the agent's actions against this text prompt and provides feedback to guide the agent’s learning.
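In code terms, the idea is to replace the reward returned by the environment with a score computed from the rendered frame. The sketch below is illustrative only and is not the code used in the experiments: `ToyEnv`, `VisualRewardWrapper`, and `toy_reward` are hypothetical names, the "frame" is a single number rather than an image, and `toy_reward` is a placeholder for a VLM that scores how well a frame matches the text prompt.

```python
from typing import Any, Callable, Tuple


class ToyEnv:
    """Hypothetical stand-in environment whose 'frame' is a number in [0, 1]."""

    def __init__(self) -> None:
        self.position = 0.0

    def step(self, action: float) -> Tuple[float, float, bool]:
        # Move and clamp to [0, 1]; the native reward is a dummy value.
        self.position = min(1.0, max(0.0, self.position + action))
        return self.position, 0.0, False

    def render(self) -> float:
        return self.position


class VisualRewardWrapper:
    """Replace the environment's reward with a score of its rendered frame."""

    def __init__(self, env: ToyEnv, reward_fn: Callable[[Any], float]) -> None:
        self.env = env
        self.reward_fn = reward_fn

    def step(self, action: float) -> Tuple[float, float, bool]:
        obs, _native_reward, done = self.env.step(action)
        # The agent only ever sees the reward computed from the frame.
        reward = self.reward_fn(self.env.render())
        return obs, reward, done


def toy_reward(frame: float) -> float:
    """Placeholder for 'how well does this frame match the prompt?'."""
    return -abs(frame - 1.0)
```

An RL algorithm would interact with `VisualRewardWrapper(ToyEnv(), toy_reward)` exactly as it would with the unwrapped environment; only the source of the reward changes.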

How It Works

Setup: We want an agent to perform a task that we can evaluate visually but that we do not have a reward function for. For example, we want a humanoid robot to kneel on the ground.
Using CLIP as Reward Model: CLIP embeds both the rendered image of the current state and the text description of the task. We use the cosine similarity between the two embeddings as the reward, so a closer match between image and description yields a higher reward.
Goal-Baseline Regularization: We propose an optional step to make the CLIP reward function smoother and easier to learn from, by using a “baseline” prompt that describes the environment in general, for example, “a humanoid”. To regularize the reward model, we project the CLIP embedding of an observation onto the line spanned by the embeddings of the baseline prompt and the goal prompt. A hyperparameter we call the regularization strength α controls whether we project fully onto this line (α=1) or only partially (0 < α < 1). Not applying the regularization corresponds to α=0. See the paper for details.
RL Training: We can now use the resulting reward model with any standard RL algorithm. In our experiments, we use standard implementations of Deep Q-Network (DQN) and Soft Actor-Critic (SAC).
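
The reward computation in the steps above can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments. It assumes the image and prompt embeddings have already been computed by a CLIP model and are passed in as plain NumPy vectors. The function names are hypothetical. The specific form `1 - 0.5 * ||mixed - goal||^2` and the reading of "the line" as the line through the baseline and goal embeddings are assumptions, chosen so that α=0 recovers the cosine similarity for unit-norm embeddings; the paper has the exact definition.

```python
import numpy as np


def normalize(v: np.ndarray) -> np.ndarray:
    """Scale a vector to unit Euclidean norm."""
    return v / np.linalg.norm(v)


def clip_reward(image_emb: np.ndarray, goal_emb: np.ndarray) -> float:
    """Cosine similarity between an image embedding and a goal-prompt embedding."""
    return float(normalize(image_emb) @ normalize(goal_emb))


def regularized_reward(
    image_emb: np.ndarray,
    goal_emb: np.ndarray,
    baseline_emb: np.ndarray,
    alpha: float,
) -> float:
    """Goal-baseline regularized reward (assumed form, see the lead-in)."""
    s = normalize(image_emb)
    g = normalize(goal_emb)
    b = normalize(baseline_emb)

    # Orthogonal projection of s onto the line through b and g.
    direction = g - b
    t = ((s - b) @ direction) / (direction @ direction)
    projected = b + t * direction

    # alpha = 0 keeps s unchanged; alpha = 1 uses only the projection.
    mixed = alpha * projected + (1.0 - alpha) * s
    return float(1.0 - 0.5 * np.sum((mixed - g) ** 2))
```

With α=1, any component of the image embedding that is orthogonal to the baseline-to-goal direction is discarded, so only movement along that direction changes the reward.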

Experiments and Key Findings

Classic Control Environments: First, we looked at standard RL tasks: CartPole and MountainCar. For the CartPole, the CLIP reward model works well even without any regularization or tuning. For the MountainCar, we found that the maximum reward is at the right place, but the reward function is poorly shaped. Goal-baseline regularization helped to improve the reward shape. Additionally, we found that rendering the environment with more realistic textures produces a better-shaped reward (see Figure 2).

Figure 2: We study the CLIP reward landscape in two classic control environments: CartPole and MountainCar. We plot the CLIP reward as a function of the pole angle for the CartPole (a) and as a function of the x position for the MountainCar (b,c). We mark the respective goal states with a vertical line. The line color encodes different regularization strengths α. For the CartPole, the maximum reward is always attained when the pole is balanced, and the regularization has little effect. For the MountainCar, the agent obtains the maximum reward on top of the mountain. However, the reward landscape is much better behaved when the environment has textures and we add goal-baseline regularization. This is consistent with our results when training policies.

Novel Tasks in Humanoid Robots: Our main experiment is to train a humanoid robot to do complex tasks. Using a CLIP reward model and single sentence text prompts, we successfully taught the robot tasks such as kneeling, sitting in a lotus position, standing up, raising arms, and doing splits (see Figure 1). However, some tasks we tried, like standing on one leg, placing hands on hips, and crossing arms, were challenging. We don’t think these failure cases point to fundamental issues with VLM-RMs but rather to capability limitations in the CLIP models we used.

Figure 3: We struggled to learn a few tasks such as 'humanoid robot standing up on one leg' (left) and 'humanoid robot standing up, with its arms crossed' (right). While the robot can briefly perform these actions, it cannot perform them competently yet.

Model Size Impact: We find that larger CLIP models are significantly better reward models. In particular, only the largest available CLIP model can successfully teach the humanoid to kneel down. We did not explicitly evaluate scaling for other tasks because the human evaluation is expensive, but we’d expect similar results.

For the humanoid tasks we don’t have any ground truth reward function, so our evaluation relies on human labels. We perform two types of evaluation. First, we collect a set of states containing some that successfully perform the task and label them manually. Then, we can compute the EPIC distance between any reward model and these human labels. This gives us a proxy for the quality of the reward model that we found to be pretty predictive empirically. Second, we train a policy on the CLIP reward models and evaluate the success rate of the final policy manually (see the appendix of our paper for details on the evaluation). Of course, human evaluations are necessarily somewhat subjective; we invite readers to view our final policies at https://sites.google.com/view/vlm-rm and form their own judgement.
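
The first evaluation can be illustrated with a simplified proxy. EPIC distance is built on the Pearson distance between reward values on a sample of states; the sketch below computes only that Pearson distance and skips the canonicalization step of full EPIC, which is an assumption made for brevity. The function name is hypothetical, and the inputs are assumed to be one reward model score and one human label per state.

```python
import numpy as np


def pearson_distance(rewards_a, rewards_b) -> float:
    """Pearson distance sqrt((1 - rho) / 2) between two reward samples.

    Returns 0 for perfectly correlated rewards and 1 for perfectly
    anti-correlated ones. Assumes neither input is constant.
    """
    a = np.asarray(rewards_a, dtype=float)
    b = np.asarray(rewards_b, dtype=float)
    rho = np.corrcoef(a, b)[0, 1]
    # Guard against tiny floating-point overshoot outside [-1, 1].
    rho = float(np.clip(rho, -1.0, 1.0))
    return float(np.sqrt((1.0 - rho) / 2.0))
```

Because the distance depends only on the correlation, it is unchanged when a reward model is shifted or rescaled by a positive constant, so it compares reward shape rather than magnitude.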

Conclusion and future work

Using VLMs as reward models (RMs) is a new approach to train reinforcement learning (RL) agents. We used VLM-RMs to train a humanoid robot to do complex tasks and found the VLM reward models to be surprisingly effective and robust. Larger VLMs generally perform better as reward models, and we expect that future, more advanced VLMs will be able to handle a broader range of tasks.

VLM-RMs rely on a pretrained model being able to generalize from a natural language description of a task to correctly rating RL trajectories according to human intentions. Of course, we are not suggesting that this basic scheme alone should be used to align powerful AI systems.

Rather, we believe that VLM-RMs provide a toy model of practical alignment schemes that involve using pretrained models to oversee other models. Importantly, our setup is different from language-only tasks that are currently being studied predominantly in the alignment community. We think understanding our setup better could provide complementary perspectives to only studying language agents.

From an alignment perspective, the first major open question is how robust VLM-RMs are against optimization pressure. We were somewhat surprised that we did not find much evidence of reward hacking in the tasks we studied. In preliminary experiments with smaller CLIP models, we did observe some behavior that looked like reward hacking, but this entirely went away when using bigger CLIP models. We’d be excited to better understand in what cases reward hacking can occur depending on the size of the supervising model and the optimization power of the RL agent.

More broadly, if we find situations where using VLM-RMs produces misaligned RL agents, these could act as model organisms for misalignment to study more sophisticated alignment schemes.

If you're also interested in making AI systems safe and beneficial, we're hiring! Check out our open roles at FAR.AI.

Full paper

Juan Rocamonde, Victoriano Montesinos, Elvis Nava, Ethan Perez, David Lindner. Vision-Language Models are Zero-Shot Reward Models for Reinforcement Learning. arXiv preprint arXiv:2310.12921

Acknowledgements

We thank Adam Gleave for valuable discussions throughout the project and detailed feedback on an early version of the paper, Jérémy Scheurer for helpful feedback early on, Adrià Garriga-Alonso for help with running experiments, and Xander Balwit for help with editing the paper.
