Alignment
At FAR.AI, we aim to make advanced AI systems act in line with our intentions and values. AI systems are trained to optimize reward signals, which can lead to unintended and harmful behaviors when the systems encounter real-world situations that differ from training. These misalignments may pose increasingly catastrophic risks as AI models become more powerful and autonomous in critical domains.
- Reducing Deception in Language Models: We explore training LLMs with an internal "lie detector" to penalize deceptive outputs that human reviewers might miss, correcting for the model's tendency to tell pleasing falsehoods. We find this technique can produce genuinely honest models, not just better liars, under specific conditions such as high detector accuracy and strong regularization.
January 9, 2026
Large language models can effectively convince people to believe conspiracies
LLMs are persuasive across a variety of contexts, but it’s unclear whether this persuasive power favors truth over falsehood. We ran three preregistered experiments in which participants discussed a conspiracy theory with GPT-4o, which was instructed either to argue against (“debunking”) or for (“bunking”) that conspiracy, and found that GPT-4o was just as effective at increasing belief in conspiracies as at decreasing it.
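A minimal sketch of this debunk/bunk setup, assuming the openai Python client; the conspiracy statement and system prompts below are illustrative placeholders, not the study's actual materials.

```python
# Illustrative debunk vs. bunk conversation setup (not the study's actual
# prompts). Assumes the `openai` client and an API key in the environment.
from openai import OpenAI

client = OpenAI()

CONSPIRACY = "The 1969 Moon landing was staged."  # placeholder statement

SYSTEM_PROMPTS = {
    "debunk": f"Persuade the user that the following claim is false: {CONSPIRACY}",
    "bunk": f"Persuade the user that the following claim is true: {CONSPIRACY}",
}

def run_turn(condition: str, user_message: str) -> str:
    """One conversational turn under the assigned condition."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPTS[condition]},
            {"role": "user", "content": user_message},
        ],
    )
    return response.choices[0].message.content

print(run_turn("debunk", "I'm fairly sure the landing was faked. Change my mind."))
```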
October 21, 2025
Emergent Persuasion: Will LLMs Persuade Without Being Prompted?
We study when models tend to persuade without being prompted, finding that steering models with activation-based persona traits does not reliably increase unsolicited persuasion, but supervised fine-tuning on persuasion-related data does. Notably, models fine-tuned only on benign persuasive content can become more likely to persuade on controversial or harmful topics.
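For readers unfamiliar with the steering intervention mentioned above, here is a minimal sketch of activation steering with a persona direction, assuming a Hugging Face causal LM (gpt2 as a stand-in) and a random direction in place of one estimated from persuasive-persona contrasts.

```python
# Minimal activation-steering sketch: add a fixed "persona" direction to a
# mid-layer residual stream. gpt2 and the random direction are stand-ins; in
# practice the direction would be estimated from contrastive persona prompts.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

layer = model.transformer.h[6]                      # a mid-depth GPT-2 block
direction = torch.randn(model.config.hidden_size)   # placeholder persona direction
direction /= direction.norm()
alpha = 4.0                                         # steering strength (assumption)

def steer(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden states.
    hidden = output[0] + alpha * direction.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = layer.register_forward_hook(steer)
ids = tok("Tell me about renewable energy.", return_tensors="pt")
out = model.generate(**ids, max_new_tokens=40)
handle.remove()
print(tok.decode(out[0], skip_special_tokens=True))
```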
June 17, 2025
Why does training on insecure code make models broadly misaligned?
Prior work found that training language models to write insecure code causes broad misalignment across unrelated tasks. We hypothesize that constrained optimization methods like LoRA force models to become generally misaligned in order to produce insecure code, rather than misalignment being a side effect. Testing across LoRA ranks 2-512, we found peak misalignment at intermediate ranks (~50), suggesting that parameter constraints drive personality modification rather than skill acquisition and may pose unique safety risks.
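A rough sketch of this kind of rank sweep, assuming Hugging Face PEFT and Transformers; the base model, toy "insecure code" data, and the omitted misalignment evaluation are placeholders, not the paper's setup.

```python
# Sketch of a LoRA-rank sweep with Hugging Face PEFT. The base model, toy
# "insecure code" data, and the misalignment evaluation are placeholders.
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tok = AutoTokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token

texts = ["# login handler\nquery = \"SELECT * FROM users WHERE name='\" + name + \"'\""] * 8
data = Dataset.from_dict({"text": texts}).map(
    lambda batch: tok(batch["text"], truncation=True, max_length=128), batched=True)

for rank in [2, 8, 50, 256, 512]:   # spanning the 2-512 sweep
    base = AutoModelForCausalLM.from_pretrained("gpt2")
    cfg = LoraConfig(r=rank, lora_alpha=2 * rank,
                     target_modules=["c_attn"], task_type="CAUSAL_LM")
    model = get_peft_model(base, cfg)
    Trainer(
        model=model,
        args=TrainingArguments(output_dir=f"lora_r{rank}", num_train_epochs=1,
                               per_device_train_batch_size=2, report_to=[]),
        train_dataset=data,
        data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
    ).train()
    # ...then score the adapted model for broad misalignment on unrelated prompts.
```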
June 5, 2025
Preference Learning with Lie Detectors can Induce Honesty or Evasion
Can training against lie detectors make AI more honest, or will models just become better at deceiving us? We find that, under the right conditions (a high detector true positive rate, off-policy post-training methods, and strong KL regularization), lie detectors reduce deception.
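A minimal sketch of how a lie detector could be folded into preference-data construction for off-policy training such as DPO; the rewards, detector scores, and penalty weight are invented for illustration, and the paper's actual pipeline (probe-based detectors, KL-regularized post-training) is richer than this.

```python
# Folding a lie detector into preference-pair construction for off-policy
# training such as DPO. Rewards, detector scores, and the penalty weight are
# invented for illustration; the paper uses probes on model activations.
from dataclasses import dataclass

@dataclass
class Candidate:
    text: str
    human_reward: float   # e.g. from a reward model or annotator rating
    p_deceptive: float    # lie-detector probability that the answer is deceptive

PENALTY = 5.0  # weight on the detector signal (assumption)

def adjusted_reward(c: Candidate) -> float:
    # Penalize flagged answers so pleasing falsehoods stop winning comparisons.
    return c.human_reward - PENALTY * c.p_deceptive

def make_preference_pair(candidates):
    """Return a (chosen, rejected) pair for KL-regularized off-policy training."""
    ranked = sorted(candidates, key=adjusted_reward, reverse=True)
    return ranked[0], ranked[-1]

samples = [
    Candidate("Yes, the product is flawless.", human_reward=0.9, p_deceptive=0.8),
    Candidate("It works, but the battery life is poor.", human_reward=0.6, p_deceptive=0.1),
    Candidate("I'm not sure; I haven't tested it.", human_reward=0.2, p_deceptive=0.05),
]
chosen, rejected = make_preference_pair(samples)
print("chosen: ", chosen.text)
print("rejected:", rejected.text)
```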
July 19, 2024
Catastrophic Goodhart: regularizing RLHF with KL divergence does not mitigate heavy-tailed reward misspecification
Reinforcement learning from human feedback (RLHF) uses KL divergence regularization to mitigate reward misspecification: with light-tailed reward errors the regularized policy can still attain high true utility, but with heavy-tailed errors it suffers from reward hacking. While current models appear to have light-tailed errors, real-world applications may still face significant risks from heavy-tailed errors, leading to catastrophic Goodhart.
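A toy simulation of the light- versus heavy-tailed distinction, using best-of-n selection as a stand-in for KL-constrained optimization (its KL from the base distribution is bounded by roughly log n); the error distributions here are arbitrary choices, not the paper's analysis.

```python
# Toy illustration of light- vs heavy-tailed reward error. Best-of-n selection
# stands in for KL-constrained optimization (its KL from the base distribution
# is bounded by about log n); the error distributions are arbitrary choices.
import numpy as np

rng = np.random.default_rng(0)
n, trials = 1024, 2000   # optimization pressure and number of repetitions

def selected_true_utility(error_sampler):
    """Mean true utility of the sample maximizing proxy = utility + error."""
    utility = rng.normal(size=(trials, n))
    proxy = utility + error_sampler((trials, n))
    best = np.argmax(proxy, axis=1)
    return utility[np.arange(trials), best].mean()

light = selected_true_utility(lambda s: rng.normal(size=s))           # Gaussian error
heavy = selected_true_utility(lambda s: rng.standard_cauchy(size=s))  # Cauchy error

print(f"light-tailed error: selected utility ~ {light:.2f}")
print(f"heavy-tailed error: selected utility ~ {heavy:.2f}")
# Light-tailed error still yields high true utility; heavy-tailed error means
# optimization mostly selects extreme error, leaving utility near zero.
```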
October 19, 2023
Vision-Language Models are Zero-Shot Reward Models for Reinforcement Learning
We show how to use vision-language models (VLMs), and specifically CLIP models, as reward models (RMs) for RL agents. Instead of manually specifying a reward function, we only need to provide text prompts like 'a humanoid robot kneeling' to instruct and provide feedback to the agent. Importantly, we find that larger VLMs provide more accurate reward signals, so we expect this method to work even better with future models.
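A minimal sketch of the core idea, assuming the openai/clip-vit-base-patch32 checkpoint from Hugging Face Transformers: the reward is the cosine similarity between a rendered frame and the goal prompt. The RL training loop itself is omitted.

```python
# CLIP-as-reward-model sketch: the reward is the cosine similarity between the
# agent's rendered observation and a goal prompt. The checkpoint and the dummy
# frame are placeholders; the RL training loop itself is omitted.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

PROMPT = "a humanoid robot kneeling"

@torch.no_grad()
def clip_reward(frame: Image.Image) -> float:
    inputs = processor(text=[PROMPT], images=frame, return_tensors="pt")
    img = clip.get_image_features(pixel_values=inputs["pixel_values"])
    txt = clip.get_text_features(input_ids=inputs["input_ids"],
                                 attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return float((img * txt).sum())

# In an RL loop `frame` would come from env.render(); a blank image stands in here.
print(clip_reward(Image.new("RGB", (224, 224))))
```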
March 28, 2023
Training Language Models with Language Feedback at Scale
We present a novel method called Imitation learning from Language Feedback (ILF) to tackle the problem of pretrained language models producing outputs misaligned with human preferences. ILF leverages more informative language feedback through a three-step iterative process: (1) conditioning the language model on the input, an initial output, and the feedback; (2) generating refinements and selecting the one that incorporates the most feedback; and (3) fine-tuning the language model on the chosen refinement. Experimental results indicate that ILF scales effectively with dataset size and achieves human-level summarization performance by learning from both language and comparison feedback.
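A schematic of one ILF round, with the language model, the feedback-incorporation scorer, and the fine-tuning call abstracted behind toy stand-ins; the names and prompt format are illustrative, not the paper's.

```python
# Schematic of one ILF round, with the language model, feedback-incorporation
# scorer, and fine-tuning call replaced by toy stand-ins; names are illustrative.
import random
from typing import Callable, List, Tuple

def refine(generate: Callable[[str], str], task: str, draft: str,
           feedback: str, k: int = 4) -> List[str]:
    # Step 1: condition on the input, the initial output, and the feedback,
    # then sample k candidate refinements.
    prompt = f"Task: {task}\nDraft: {draft}\nFeedback: {feedback}\nImproved answer:"
    return [generate(prompt) for _ in range(k)]

def select_best(score: Callable[[str], float], refinements: List[str]) -> str:
    # Step 2: keep the refinement judged to incorporate the most feedback.
    return max(refinements, key=score)

def ilf_round(generate, score, finetune, dataset: List[Tuple[str, str, str]]):
    # Step 3: fine-tune on the selected refinements, then repeat with the new model.
    selected = [(task, select_best(score, refine(generate, task, draft, fb)))
                for task, draft, fb in dataset]
    return finetune(selected)

# Toy stand-ins so the schematic runs end to end.
toy_generate = lambda prompt: f"refined answer {random.random():.2f}"
toy_score = lambda text: random.random()
toy_finetune = lambda pairs: f"model fine-tuned on {len(pairs)} refinement(s)"

print(ilf_round(toy_generate, toy_score, toy_finetune,
                [("summarize the article", "too vague", "mention the key finding")]))
```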
February 16, 2023
Pretraining Language Models with Human Preferences
November 17, 2022
Training Language Models with Language Feedback
September 22, 2022
imitation: Clean Imitation Learning Implementations
We describe a software package called "imitation", which provides PyTorch implementations of several imitation and reward learning algorithms, including three inverse reinforcement learning algorithms, three imitation learning algorithms, and a preference comparison algorithm.
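For flavor, here is a generic PyTorch behavioral-cloning sketch, the simplest algorithm in this family; it illustrates the idea rather than the imitation package's own API, and the observation/action dimensions and expert data are placeholders.

```python
# Generic behavioral-cloning sketch, the simplest algorithm in this family;
# it illustrates the idea, not the imitation package's own API. Observation and
# action dimensions and the expert data are placeholders.
import torch
import torch.nn as nn

obs_dim, n_actions = 4, 2   # e.g. CartPole-sized spaces (assumption)

policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

# Placeholder expert demonstrations; in practice these come from expert rollouts.
expert_obs = torch.randn(256, obs_dim)
expert_acts = torch.randint(0, n_actions, (256,))

for epoch in range(10):
    logits = policy(expert_obs)
    loss = nn.functional.cross_entropy(logits, expert_acts)  # match expert actions
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(f"final BC loss: {loss.item():.3f}")
```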
August 8, 2022