Alignment
At FAR.AI, we aim to make advanced AI systems act in line with our intentions and values. AI systems are trained to optimize reward signals, which can lead to unintended and harmful behaviors when the systems encounter real-world situations that differ from training. These misalignments may pose increasingly catastrophic risks as AI models become more powerful and autonomous in critical domains.
- Reducing Deception in Language Models: We explore training LLMs with an internal "lie detector" to penalize deceptive outputs that human reviewers might miss, correcting for the model's tendency to tell pleasing falsehoods. We find this technique can produce genuinely honest models, not just better liars, under specific conditions such as high detector accuracy and strong regularization.
January 9, 2026
Large language models can effectively convince people to believe conspiracies
LLMs are persuasive across a variety of contexts, but it’s unclear whether this persuasive power favors truth over falsehood. We ran three preregistered experiments in which participants discussed a conspiracy theory with GPT-4o, which was instructed either to argue against (“debunking”) or for (“bunking”) that conspiracy, and found that GPT-4o was just as effective at increasing belief in conspiracies as at decreasing it.
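A minimal sketch of this debunk/bunk setup, assuming the openai Python client; the conspiracy statement and system prompts below are illustrative placeholders, not the study's actual materials.

```python
# Illustrative debunk vs. bunk conversation setup (not the study's actual
# prompts). Assumes the `openai` client and an API key in the environment.
from openai import OpenAI

client = OpenAI()

CONSPIRACY = "The 1969 Moon landing was staged."  # placeholder statement

SYSTEM_PROMPTS = {
    "debunk": f"Persuade the user that the following claim is false: {CONSPIRACY}",
    "bunk": f"Persuade the user that the following claim is true: {CONSPIRACY}",
}

def run_turn(condition: str, user_message: str) -> str:
    """One conversational turn under the assigned condition."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPTS[condition]},
            {"role": "user", "content": user_message},
        ],
    )
    return response.choices[0].message.content

print(run_turn("debunk", "I'm fairly sure the landing was faked. Change my mind."))
```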
October 21, 2025
Emergent Persuasion: Will LLMs Persuade Without Being Prompted?
We study when models tend to persuade without being prompted, finding that steering models with activation-based persona traits does not reliably increase unsolicited persuasion, but supervised fine-tuning on persuasion-related data does. Notably, models fine-tuned only on benign persuasive content can become more likely to persuade on controversial or harmful topics.
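For readers unfamiliar with the steering intervention mentioned above, here is a minimal sketch of activation steering with a persona direction, assuming a Hugging Face causal LM (gpt2 as a stand-in) and a random direction in place of one estimated from persuasive-persona contrasts.

```python
# Minimal activation-steering sketch: add a fixed "persona" direction to a
# mid-layer residual stream. gpt2 and the random direction are stand-ins; in
# practice the direction would be estimated from contrastive persona prompts.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

layer = model.transformer.h[6]                      # a mid-depth GPT-2 block
direction = torch.randn(model.config.hidden_size)   # placeholder persona direction
direction /= direction.norm()
alpha = 4.0                                         # steering strength (assumption)

def steer(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden states.
    hidden = output[0] + alpha * direction.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = layer.register_forward_hook(steer)
ids = tok("Tell me about renewable energy.", return_tensors="pt")
out = model.generate(**ids, max_new_tokens=40)
handle.remove()
print(tok.decode(out[0], skip_special_tokens=True))
```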
June 17, 2025
Why does training on insecure code make models broadly misaligned?
Prior work found that training language models to write insecure code causes broad misalignment across unrelated tasks. We hypothesize that constrained optimization methods like LoRA force models to become generally misaligned in order to produce insecure code, rather than misalignment being a side effect. Testing across LoRA ranks 2-512, we found peak misalignment at intermediate ranks (~50), suggesting that parameter constraints drive personality modification rather than skill acquisition and may pose unique safety risks.
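A rough sketch of this kind of rank sweep, assuming Hugging Face PEFT and Transformers; the base model, toy "insecure code" data, and the omitted misalignment evaluation are placeholders, not the paper's setup.

```python
# Sketch of a LoRA-rank sweep with Hugging Face PEFT. The base model, toy
# "insecure code" data, and the misalignment evaluation are placeholders.
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tok = AutoTokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token

texts = ["# login handler\nquery = \"SELECT * FROM users WHERE name='\" + name + \"'\""] * 8
data = Dataset.from_dict({"text": texts}).map(
    lambda batch: tok(batch["text"], truncation=True, max_length=128), batched=True)

for rank in [2, 8, 50, 256, 512]:   # spanning the 2-512 sweep
    base = AutoModelForCausalLM.from_pretrained("gpt2")
    cfg = LoraConfig(r=rank, lora_alpha=2 * rank,
                     target_modules=["c_attn"], task_type="CAUSAL_LM")
    model = get_peft_model(base, cfg)
    Trainer(
        model=model,
        args=TrainingArguments(output_dir=f"lora_r{rank}", num_train_epochs=1,
                               per_device_train_batch_size=2, report_to=[]),
        train_dataset=data,
        data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
    ).train()
    # ...then score the adapted model for broad misalignment on unrelated prompts.
```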
June 5, 2025
Preference Learning with Lie Detectors can Induce Honesty or Evasion
Can training against lie detectors make AI more honest, or will models just become better at deceiving us? We find that, under the right conditions (a high detector true positive rate, off-policy post-training methods, and strong KL regularization), lie detectors reduce deception.
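A minimal sketch of how a lie detector could be folded into preference-data construction for off-policy training such as DPO; the rewards, detector scores, and penalty weight are invented for illustration, and the paper's actual pipeline (probe-based detectors, KL-regularized post-training) is richer than this.

```python
# Folding a lie detector into preference-pair construction for off-policy
# training such as DPO. Rewards, detector scores, and the penalty weight are
# invented for illustration; the paper uses probes on model activations.
from dataclasses import dataclass

@dataclass
class Candidate:
    text: str
    human_reward: float   # e.g. from a reward model or annotator rating
    p_deceptive: float    # lie-detector probability that the answer is deceptive

PENALTY = 5.0  # weight on the detector signal (assumption)

def adjusted_reward(c: Candidate) -> float:
    # Penalize flagged answers so pleasing falsehoods stop winning comparisons.
    return c.human_reward - PENALTY * c.p_deceptive

def make_preference_pair(candidates):
    """Return a (chosen, rejected) pair for KL-regularized off-policy training."""
    ranked = sorted(candidates, key=adjusted_reward, reverse=True)
    return ranked[0], ranked[-1]

samples = [
    Candidate("Yes, the product is flawless.", human_reward=0.9, p_deceptive=0.8),
    Candidate("It works, but the battery life is poor.", human_reward=0.6, p_deceptive=0.1),
    Candidate("I'm not sure; I haven't tested it.", human_reward=0.2, p_deceptive=0.05),
]
chosen, rejected = make_preference_pair(samples)
print("chosen: ", chosen.text)
print("rejected:", rejected.text)
```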
July 19, 2024
Catastrophic Goodhart: regularizing RLHF with KL divergence does not mitigate heavy-tailed reward misspecification
Reinforcement learning from human feedback (RLHF) uses KL divergence regularization to mitigate reward misspecification: with light-tailed reward errors the regularized policy can still attain high true utility, but with heavy-tailed errors it suffers from reward hacking. While current models appear to have light-tailed errors, real-world applications may still face significant risks from heavy-tailed errors, leading to catastrophic Goodhart.
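A toy simulation of the light- versus heavy-tailed distinction, using best-of-n selection as a stand-in for KL-constrained optimization (its KL from the base distribution is bounded by roughly log n); the error distributions here are arbitrary choices, not the paper's analysis.

```python
# Toy illustration of light- vs heavy-tailed reward error. Best-of-n selection
# stands in for KL-constrained optimization (its KL from the base distribution
# is bounded by about log n); the error distributions are arbitrary choices.
import numpy as np

rng = np.random.default_rng(0)
n, trials = 1024, 2000   # optimization pressure and number of repetitions

def selected_true_utility(error_sampler):
    """Mean true utility of the sample maximizing proxy = utility + error."""
    utility = rng.normal(size=(trials, n))
    proxy = utility + error_sampler((trials, n))
    best = np.argmax(proxy, axis=1)
    return utility[np.arange(trials), best].mean()

light = selected_true_utility(lambda s: rng.normal(size=s))           # Gaussian error
heavy = selected_true_utility(lambda s: rng.standard_cauchy(size=s))  # Cauchy error

print(f"light-tailed error: selected utility ~ {light:.2f}")
print(f"heavy-tailed error: selected utility ~ {heavy:.2f}")
# Light-tailed error still yields high true utility; heavy-tailed error means
# optimization mostly selects extreme error, leaving utility near zero.
```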
October 19, 2023
Vision-Language Models are Zero-Shot Reward Models for Reinforcement Learning
We show how to use vision-language models (VLMs), and specifically CLIP models, as reward models (RMs) for RL agents. Instead of manually specifying a reward function, we only need to provide text prompts like 'a humanoid robot kneeling' to instruct and provide feedback to the agent. Importantly, we find that larger VLMs provide more accurate reward signals, so we expect this method to work even better with future models.
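A minimal sketch of the core idea, assuming the openai/clip-vit-base-patch32 checkpoint from Hugging Face Transformers: the reward is the cosine similarity between a rendered frame and the goal prompt. The RL training loop itself is omitted.

```python
# CLIP-as-reward-model sketch: the reward is the cosine similarity between the
# agent's rendered observation and a goal prompt. The checkpoint and the dummy
# frame are placeholders; the RL training loop itself is omitted.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

PROMPT = "a humanoid robot kneeling"

@torch.no_grad()
def clip_reward(frame: Image.Image) -> float:
    inputs = processor(text=[PROMPT], images=frame, return_tensors="pt")
    img = clip.get_image_features(pixel_values=inputs["pixel_values"])
    txt = clip.get_text_features(input_ids=inputs["input_ids"],
                                 attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return float((img * txt).sum())

# In an RL loop `frame` would come from env.render(); a blank image stands in here.
print(clip_reward(Image.new("RGB", (224, 224))))
```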
March 28, 2023
Training Language Models with Language Feedback at Scale
We present a novel method called Imitation learning from Language Feedback (ILF) to tackle the problem of pretrained language models producing outputs misaligned with human preferences. ILF leverages more informative language feedback through a three-step iterative process: (1) conditioning the language model on the input, an initial output, and the feedback; (2) generating refinements and selecting the one that incorporates the most feedback; and (3) fine-tuning the language model on the chosen refinement. Experimental results indicate that ILF scales effectively with dataset size and achieves human-level summarization performance by learning from both language and comparison feedback.
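A schematic of one ILF round, with the language model, the feedback-incorporation scorer, and the fine-tuning call abstracted behind toy stand-ins; the names and prompt format are illustrative, not the paper's.

```python
# Schematic of one ILF round, with the language model, feedback-incorporation
# scorer, and fine-tuning call replaced by toy stand-ins; names are illustrative.
import random
from typing import Callable, List, Tuple

def refine(generate: Callable[[str], str], task: str, draft: str,
           feedback: str, k: int = 4) -> List[str]:
    # Step 1: condition on the input, the initial output, and the feedback,
    # then sample k candidate refinements.
    prompt = f"Task: {task}\nDraft: {draft}\nFeedback: {feedback}\nImproved answer:"
    return [generate(prompt) for _ in range(k)]

def select_best(score: Callable[[str], float], refinements: List[str]) -> str:
    # Step 2: keep the refinement judged to incorporate the most feedback.
    return max(refinements, key=score)

def ilf_round(generate, score, finetune, dataset: List[Tuple[str, str, str]]):
    # Step 3: fine-tune on the selected refinements, then repeat with the new model.
    selected = [(task, select_best(score, refine(generate, task, draft, fb)))
                for task, draft, fb in dataset]
    return finetune(selected)

# Toy stand-ins so the schematic runs end to end.
toy_generate = lambda prompt: f"refined answer {random.random():.2f}"
toy_score = lambda text: random.random()
toy_finetune = lambda pairs: f"model fine-tuned on {len(pairs)} refinement(s)"

print(ilf_round(toy_generate, toy_score, toy_finetune,
                [("summarize the article", "too vague", "mention the key finding")]))
```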
February 16, 2023
Pretraining Language Models with Human Preferences
November 17, 2022
Training Language Models with Language Feedback
September 22, 2022
imitation: Clean Imitation Learning Implementations
We describe a software package called "imitation", which provides PyTorch implementations of several imitation and reward learning algorithms, including three inverse reinforcement learning algorithms, three imitation learning algorithms, and a preference comparison algorithm.
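For flavor, here is a generic PyTorch behavioral-cloning sketch, the simplest algorithm in this family; it illustrates the idea rather than the imitation package's own API, and the observation/action dimensions and expert data are placeholders.

```python
# Generic behavioral-cloning sketch, the simplest algorithm in this family;
# it illustrates the idea, not the imitation package's own API. Observation and
# action dimensions and the expert data are placeholders.
import torch
import torch.nn as nn

obs_dim, n_actions = 4, 2   # e.g. CartPole-sized spaces (assumption)

policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

# Placeholder expert demonstrations; in practice these come from expert rollouts.
expert_obs = torch.randn(256, obs_dim)
expert_acts = torch.randint(0, n_actions, (256,))

for epoch in range(10):
    logits = policy(expert_obs)
    loss = nn.functional.cross_entropy(logits, expert_acts)  # match expert actions
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(f"final BC loss: {loss.item():.3f}")
```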
August 8, 2022