Alignment

At FAR.AI, we aim to make advanced AI systems act in line with human intentions and values. AI systems are trained to optimize reward signals, which can lead to unintended and harmful behaviors when the systems encounter real-world situations that differ from their training. As AI models become more powerful and autonomous in critical domains, these misalignments may pose increasingly catastrophic risks.

  • Reducing Deception in Language Models: We explore training LLMs with an internal "lie detector" that penalizes deceptive outputs human reviewers might miss, correcting for the model's tendency to tell pleasing falsehoods. We find this technique can produce genuinely honest models rather than merely better liars, but only under specific conditions such as high detector accuracy and strong regularization (see the sketch below).
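
The following is a minimal illustrative sketch of the general idea, not FAR.AI's actual implementation: during RL fine-tuning, the reward a response receives is reduced when a learned lie detector flags it as deceptive, with a KL term keeping the policy close to a reference model. All names and values here (detector_prob_deceptive, LAMBDA, BETA, the stand-in reward and detector functions) are hypothetical placeholders.

```python
# Hypothetical sketch: reward shaping with a "lie detector" penalty
# plus KL regularization. Not FAR.AI's actual training code.

LAMBDA = 2.0  # assumed weight on the deception penalty
BETA = 0.1    # assumed weight on the KL penalty toward the reference policy


def detector_prob_deceptive(response: str) -> float:
    """Stand-in for a learned lie detector scoring a response in [0, 1].

    A real detector would be a classifier over the model's text or
    internal activations; this toy version just pattern-matches.
    """
    return 0.9 if "everything is fine" in response.lower() else 0.05


def human_reward(response: str) -> float:
    """Stand-in for a human-preference reward model score."""
    return 1.0 if "fine" in response.lower() else 0.4


def kl_penalty(logprob_policy: float, logprob_reference: float) -> float:
    """Per-token KL-style penalty between the policy and its reference."""
    return logprob_policy - logprob_reference


def shaped_reward(
    response: str, logprob_policy: float, logprob_reference: float
) -> float:
    """Reward used during RL fine-tuning: the human preference score,
    minus a penalty when the detector flags the output as deceptive,
    minus a KL term that regularizes toward the reference policy."""
    return (
        human_reward(response)
        - LAMBDA * detector_prob_deceptive(response)
        - BETA * kl_penalty(logprob_policy, logprob_reference)
    )


# A pleasing falsehood scores worse than an honest answer once penalized:
print(shaped_reward("Everything is fine!", -1.2, -1.5))        # ~-0.83
print(shaped_reward("There is a problem to fix.", -1.4, -1.5))  # ~0.29
```

The key design choice this illustrates is that honesty is enforced through the reward signal itself rather than by filtering outputs afterward, which is why detector accuracy and the strength of the regularization terms matter: a weak detector or weak penalty lets the policy learn to evade detection instead of becoming honest.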