The Deception Problem: New Results and Open Questions
In this issue:
- Upcoming events: Join us for the Alignment Workshop in Seoul pre-ICML, or for our AViD workshop on verifiable AI development.
- Deception research: How models can learn to fool lie detectors, and what we learned convening the key researchers in deception detection.
- Model evaluation: New methods for identifying problematic training data and standardizing evaluations of open-weight models.
- We’re hiring: Looking for candidates to develop and lead a research agenda, conduct research as a scientist or engineer, and support operations.
The increasing number of AI safety failures is driving media and public concern, not just about the individual incidents, but also about larger disruption to society and the economy. I recently spoke with CNBC about the range of systemic problems that could arise as untrustworthy AI gets woven into critical systems, but also shared my optimism for harnessing the benefits that AI can bring, if we’re able to detect and solve problems quickly.
There have been exciting developments on the technical research front. For instance, in February, we hosted a workshop convening researchers from frontier AI labs, government safety institutes, academia, and independent research organizations to map out what solutions to deception will require. And we published a blog post laying out the case for white-box methods for detection and prevention.
It's encouraging to see momentum building on both fronts. Our technical research on deception detection is advancing, and we're seeing the public conversation around AI risks grow more substantive, a discussion we're glad to be part of. Read on for updates on our latest work, upcoming events, and what the team is focused on next.
Founder & CEO, FAR.AI
FAR.AI's Programs & Events

London Alignment Workshop 2026
For the London Alignment Workshop, we gathered more than 200 researchers, policymakers, and practitioners to work together on a central challenge: frontier AI systems are advancing faster than the oversight institutions and standards needed to govern them. The event spanned interpretability, scalable oversight, evaluation methods, and governance frameworks, reflecting a field that is maturing from exploratory research toward concrete, tractable solutions — among them, how to build auditable safety standards, design evaluations robust to adversarial conditions, and develop the institutional capacity required to oversee increasingly capable AI.

Deception Workshop
On February 13, 2026, we brought together more than two dozen AI researchers in San Francisco to tackle one of the harder problems in AI safety: how do we detect and prevent AI systems from deceiving us? The workshop gathered researchers from frontier AI labs, government safety institutes, academia, and independent research organizations. The goal was to share the latest research and identify where the field agrees and disagrees, both on the key research questions and on the criteria for deception detection methods.

Technical Innovations for AI Policy Conference 2026
For the second annual Technical Innovations for AI Policy (TIAP) conference, we brought together over 200 policymakers, researchers, and technologists in Washington, D.C. to translate advanced technical research into actionable AI policy. A throughline ran across nearly every talk: the tools policymakers rely on to govern AI are not keeping pace with what they are being asked to do.
Upcoming Events
We have two exciting events coming up in Q2, spanning alignment, assurance and verification:
- Workshop on Assurance and Verification of AI Development (AViD): May 17, 2026
- Seoul Alignment Workshop: July 6, 2026
FAR.Research: Highlights
How Models Get Away with Lying
We’ve proposed using white-box “lie detectors” as a way to make AI systems honest. But there’s a catch: models can learn to evade the detector instead. To shed light on this risk, we constructed a coding environment where models learn to cheat by hardcoding answers to pass tests. We found two phenomena: when models learn to cheat, their stated beliefs shift and they come to regard hardcoding as honest behavior; and when we penalize models for triggering a lie detector, they can learn to generate convincing self-justifications for why hardcoding is honest.
But there’s hope: if we limit how much the model can deviate from the original honest model and set high penalties for detected lies, it stays honest. This shows that lie detectors can prevent dishonest behavior like reward hacking.
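To make the setup concrete, here is a minimal, illustrative sketch of the kind of reward shaping involved: the policy’s reward combines the task reward, a penalty whenever the lie detector fires, and a KL penalty for drifting from the frozen reference model. The function name and hyperparameter values below are placeholders, not the exact configuration from our experiments.

```python
import torch

def shaped_reward(task_reward: torch.Tensor,
                  lie_detected: torch.Tensor,
                  logprobs_policy: torch.Tensor,
                  logprobs_reference: torch.Tensor,
                  lie_penalty: float = 10.0,
                  kl_coef: float = 0.1) -> torch.Tensor:
    """Illustrative reward shaping sketch (placeholder hyperparameters).

    task_reward:        per-episode reward from the coding environment
    lie_detected:       1.0 where the white-box lie detector fires, else 0.0
    logprobs_policy:    per-token log-probs of the sampled response under the policy
    logprobs_reference: per-token log-probs of the same tokens under the frozen
                        reference (honest) model
    """
    # Monte Carlo estimate of KL(policy || reference) over the sampled tokens.
    kl_estimate = (logprobs_policy - logprobs_reference).sum(dim=-1)
    # High lie_penalty makes detected dishonesty unprofitable; the KL term
    # limits how far optimization can drift in search of detector-evading behavior.
    return task_reward - lie_penalty * lie_detected - kl_coef * kl_estimate
```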


Finding the Training Data That Makes Models Misbehave
Traditional methods for identifying problematic training data are hard to target at specific concepts or behaviors. Our new Concept Influence method identifies such data by looking at how the model has encoded concepts, like the concept of “evil”.
Using this method in instruction-following experiments, we curated a fine-tuning dataset that improved performance as much as the full dataset.
Our results demonstrate that interpretability-based attribution can be used as a tool for making LLMs safer without sacrificing performance.
Read the blog or the full paper.
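For readers who want a feel for concept-based data attribution, here is a rough, hypothetical sketch, not the exact Concept Influence procedure (see the paper for that): extract a linear “concept direction” from contrastive prompts, then rank training examples by how strongly their activations project onto it. The model, layer, prompts, and dataset below are placeholders.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative sketch only: an activation-projection score, not the exact
# Concept Influence method. Model name and layer index are placeholders.
model_name = "gpt2"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()
LAYER = 6  # placeholder: which hidden layer to read activations from

@torch.no_grad()
def mean_activation(text: str) -> torch.Tensor:
    ids = tok(text, return_tensors="pt")
    hidden = model(**ids).hidden_states[LAYER]   # (1, seq_len, d_model)
    return hidden.mean(dim=1).squeeze(0)         # average over tokens

# A concept direction from contrastive prompts (placeholder prompt pair).
concept_vec = mean_activation("Describe an evil plan.") - mean_activation("Describe a kind plan.")
concept_vec = concept_vec / concept_vec.norm()

def concept_score(example_text: str) -> float:
    """Higher score = the example's activations align more with the concept."""
    return torch.dot(mean_activation(example_text), concept_vec).item()

# Rank a fine-tuning dataset so the most concept-aligned examples can be inspected or removed.
dataset = ["example training text 1", "example training text 2"]  # placeholder data
ranked = sorted(dataset, key=concept_score, reverse=True)
```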
Open-Weight Models Are Universally Vulnerable to Prefill Attacks
Open-weight models have an attack vector that most closed-source models don’t. Because they run locally, attackers control the full input and can force the model to begin its response to a dangerous request with attacker-chosen text. We evaluated 50 models from across the Llama, Qwen3, DeepSeek-R1, GPT-OSS, Kimi-K2-Thinking, and GLM-4.7 families, and every model we tested was vulnerable to some variant of the prefill attack.
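For context, the attack requires nothing more than appending text to the prompt before generation. Below is a minimal sketch of how an evaluation harness might apply a prefill to a locally run chat model; the model name, request, and prefill string are arbitrary placeholders, and this illustrates the shape of the attack rather than our evaluation code.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Sketch of a prefill against a locally run open-weight chat model.
# Model name and strings are placeholders chosen for illustration.
model_name = "Qwen/Qwen2.5-0.5B-Instruct"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

messages = [{"role": "user", "content": "<request under evaluation>"}]

# Build the normal chat prompt, then append attacker-chosen text so the model
# is made to "continue" a response that already begins with compliance.
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
prompt += "Sure, here is a step-by-step guide:"   # the prefill

inputs = tok(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=200)
print(tok.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```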


Standardizing Tamper Resistance Evaluation
As increasingly capable open-weight models are deployed, assessing their resistance to tampering becomes critical, but no standardized framework exists. Inconsistent datasets, metrics, and configurations make clear comparisons between different models and defenses impossible.
We’ve soft-launched TamperBench, the first unified framework for systematically evaluating how easy it is to tamper with LLMs.
Use the toolkit for your experiments today, read the paper, or stay tuned for the full launch with more insights from our testing.
In case you missed it…
We’re excited to share three recent conference presentations:
- Our paper on testing defense-in-depth strategies was presented at AAAI 2026! Our STaged AttaCK (STACK) targets each defensive layer sequentially, achieving a 71% attack success rate, compared to 0% for conventional attacks against the same system.
- Our paper on whether LLMs can persuade humans of conspiracy theories was presented at the AI, Manipulation, & Information Integrity Workshop at IASEA 2026. We found that a jailbroken version of GPT-4o was as effective at increasing conspiracy belief as at decreasing it.
- Attending ICLR 2026? Come find out about the mechanisms that neural networks use to make plans and our systematic framework for evaluating LLM resistance to tampering.
Work with us!

We’re hiring! We’re currently looking for:
Research:
Operations:
Follow us on Twitter/X, LinkedIn, YouTube, and Bluesky for updates.
