We show how to use vision-language models (VLMs), and specifically CLIP models, as reward models (RMs) for RL agents. Instead of manually specifying a reward function, we only need to supply a text prompt such as ‘a humanoid robot kneeling’ to instruct the agent and give it feedback. Importantly, we find that larger VLMs provide more accurate reward signals, so we expect this method to work even better with future models.
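For concreteness, the sketch below shows one way such a CLIP-based reward could be computed: embed the agent's rendered observation and the task prompt, and use their cosine similarity as the reward. It assumes the Hugging Face `transformers` CLIP API; the function name `clip_reward` and the model checkpoint are illustrative choices, not prescribed by this work.

```python
# A minimal sketch of a CLIP reward model, assuming the Hugging Face
# `transformers` CLIP API. Names here are illustrative, not from the paper.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

MODEL_NAME = "openai/clip-vit-base-patch32"  # assumed checkpoint for illustration
model = CLIPModel.from_pretrained(MODEL_NAME).eval()
processor = CLIPProcessor.from_pretrained(MODEL_NAME)

@torch.no_grad()
def clip_reward(frame: Image.Image, prompt: str) -> float:
    """Reward = cosine similarity between a rendered frame and the text prompt."""
    inputs = processor(text=[prompt], images=frame,
                       return_tensors="pt", padding=True)
    # Embed image and text, then L2-normalize so the dot product is cosine similarity.
    img = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt = model.get_text_features(input_ids=inputs["input_ids"],
                                  attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return (img * txt).sum().item()

# Usage: score the environment's rendered observation against the task prompt,
# e.g. reward = clip_reward(env.render(), "a humanoid robot kneeling")
```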