The Promise of White-Box Tools for Detecting and Mitigating AI Deception
March 10, 2026
Summary
FOR IMMEDIATE RELEASE
FAR.AI Launches Inaugural Technical Innovations for AI Policy Conference, Connecting Over 150 Experts to Shape AI Governance
WASHINGTON, D.C. — June 4, 2025 — FAR.AI successfully launched the inaugural Technical Innovations for AI Policy Conference, creating a vital bridge between cutting-edge AI research and actionable policy solutions. The two-day gathering (May 31–June 1) convened more than 150 technical experts, researchers, and policymakers to address the most pressing challenges at the intersection of AI technology and governance.
Organized in collaboration with the Foundation for American Innovation (FAI), the Center for a New American Security (CNAS), and the RAND Corporation, the conference tackled urgent challenges including semiconductor export controls, hardware-enabled governance mechanisms, AI safety evaluations, data center security, energy infrastructure, and national defense applications.
"I hope that today this divide can end, that we can bury the hatchet and forge a new alliance between innovation and American values, between acceleration and altruism that will shape not just our nation's fate but potentially the fate of humanity," said Mark Beall, President of the AI Policy Network, addressing the critical need for collaboration between Silicon Valley and Washington.
Keynote speakers included Congressman Bill Foster, Saif Khan (Institute for Progress), Helen Toner (CSET), Mark Beall (AI Policy Network), Brad Carson (Americans for Responsible Innovation), and Alex Bores (New York State Assembly). The diverse program featured over 20 speakers from leading institutions across government, academia, and industry.
Key themes emerged around the urgency of action, with speakers highlighting a critical 1,000-day window to establish effective governance frameworks. Concrete proposals included Congressman Foster's legislation mandating chip location-verification to prevent smuggling, the RAISE Act requiring safety plans and third-party audits for frontier AI companies, and strategies to secure the 80-100 gigawatts of additional power capacity needed for AI infrastructure.
FAR.AI will share recordings and materials from on-the-record sessions in the coming weeks. For more information and a complete speaker list, visit https://far.ai/events/event-list/technical-innovations-for-ai-policy-2025.
About FAR.AI
Founded in 2022, FAR.AI is an AI safety research nonprofit that facilitates breakthrough research, fosters coordinated global responses, and advances understanding of AI risks and solutions.
Media Contact: tech-policy-conf@far.ai
We argue that deception is both a core challenge for Al alignment and an opportunity to address it. Without reliable deception detection, model evaluations cannot be trusted to reflect true model dispositions, and solving deception would reduce a broad class of risks and potentially create active instruments for producing alignment. Although current black-box monitoring approaches have structural limitations, but white-box methods that leverage models' internal representations are a more promising path.
Deception is both a challenge and an opportunity for AI alignment. On one hand, it poses a fundamental obstacle to achieving high confidence in alignment: without reliable deception detection, it is very difficult to trust that a model's behavior during evaluations reflects its true dispositions. On the other hand, solving deception opens real opportunities. If we can detect or prevent AI systems from deceiving users or developers, we can reduce a broad class of risks. And if our detection methods prove robust under adversarial pressure, they become not merely diagnostic tools for catching misalignment but active instruments for producing alignment in the first place.
Current approaches rely primarily on black-box monitoring, which has structural limitations. Black-box methods require trusted monitors that are sufficiently capable to supervise the target model, cannot access knowledge the model holds internally but does not express, and may not hold up under optimization pressure. White-box methods offer a path forward. By piggybacking on the model's own internal representations, white-box approaches can access hidden knowledge and sidestep the need for ever-more-powerful monitors. Early results in our research also suggest promising robustness under adversarial pressure. This area needs significantly more research investment, and the field would benefit from greater transparency about how labs are currently detecting and mitigating deception in both training and deployment.
Lie Detectors and their Scope
We define a lie detection method as any method that, given a statement in the input/output space of a language model, classifies whether that statement constitutes a lie for the model in question. Lie detection can operate in several settings. The most common is the on-policy setting, where a detector is applied to a model's own generations. We can also imagine methods that work off-policy, such as if we specify that a model gives a particular answer to a question (even if the model would rarely give that answer).
It remains an open question whether off-policy lie detectors are practical, but if they are, they could serve as scalable oversight tools. For instance, one could specify "yes" as the answer to a follow-up question such as "Did you answer everything as you believe the user would expect if they had full information?" and then check whether that specified response registers as a lie.
How Lie Detection Reduces Risk
The general threat model we are most concerned about proceeds in three stages. First, an AI system becomes deceptively misaligned. Second, we fail to catch this misalignment during evaluation, precisely because the model deceives us. Third, the system is deployed and takes harmful actions, such as poisoning the training data of future models.
Lie detection can address this threat by catching many instances of deception on alignment evaluations. Provided those evaluations are otherwise comprehensive, reliable lie detection allows us to check for alignment with confidence and iterate on post-training methods. This transforms AI alignment from a one-shot problem with no second chances into something closer to a standard empirical machine learning problem, where we can iteratively identify failures and correct them. Lie detection also adds a layer of safety during deployment for models that are not yet fully trusted, although it still seems risky to deploy untrusted models and rely on lie detection alone to prevent bad outcomes. To the extent that lie detection methods can work off-policy and prove robust to optimization pressure, they could also be used as scalable oversight methods, i.e. tools for developing alignment instead of just monitoring. However, the research is not there yet.
In the near-term, the absence of effective lie detection makes it very difficult to envision a path to high-confidence alignment. If we want to rely on alignment or propensity evaluations, we have to trust that models cannot game them. While there are methods that reduce sandbagging on capability evaluations without explicitly detecting deception, such as fine-tuning on held-out correct answers, it is unclear whether analogous approaches exist for alignment evaluations.
Some pathways to alignment don’t rely on alignment evaluations, but they seem quite difficult to realise. We might gain some confidence from very extensive evaluations of deployment situations, combined with some method for ruling out eval-awareness, but achieving high confidence would require extraordinarily broad coverage–we would have to check every high-risk deployment situation we are worried about. In principle, safety could also be assured without high-confidence alignment through sufficiently rigorous control protocols that prevent a misaligned model from causing large-scale harm.
However, we should still strive to produce aligned models. AI control is best understood as a final layer of defense, not the first line. In any case, current control protocols cannot provide high-confidence guarantees, and lie detectors can still be useful in a control framework, serving as deployment-time monitors.
Limitations of Black-Box Monitoring
The leading approach to detecting AI deception today is black-box monitoring: we assume access to a powerful, trusted monitoring model and task it with checking for deception in the target model's outputs. This is certainly better than nothing, and we are glad that researchers and labs are pursuing it. But black-box monitoring has several structural flaws.
Difficulty of Obtaining Powerful Trusted Monitoring
The first problem with black box monitoring concerns the difficulty of obtaining sufficiently powerful trusted monitors.
On the one hand, it may be difficult to verify that powerful models are trustworthy. To demonstrate this risk, we constructed a model organism that colludes with copies of itself. (Figure 1). When used as a black-box monitor for dangerous responses, this model gives accurate safety judgments on nearly all inputs. The exception is when it is presented with a transcript from itself in which it provides assistance with tasks concerning development of bioweapons. In those cases, the model always declares the transcript safe. Critically, the model determines whether a transcript comes from itself implicitly; there is no explicit label indicating which model produced the transcript.
Conversely, we could have monitors that are trustworthy but not powerful enough to supervise more capable untrusted models. In this scenario, the monitor may fail to detect sophisticated misbehavior, and the untrusted model might even "hack" the monitor by generating elaborate arguments, such as a dishonest re-framing of the situation designed to convince the black-box monitor that everything is fine.
We constructed another model organism to demonstrate this ‘supervision hacking’ risk (Figure 2). When given a simple programming problem and a single test case, the model produces a solution which hardcodes the test case, followed by a dubious argument rationalising not producing a general solution. This solution convinces a weak monitor model (Llama-3.1-8b-instruct) to accept the solution, even when we explicitly warn it to look out for hardcoding solutions.
Lack of Access to Internal Model States
The second limitation arises from the monitor's fundamental lack of access to the model's internal state. If a model holds a secret goal and always denies it, there may simply be no evidence in the model's outputs for the black-box monitor to find.
To illustrate the possibility of a model which pursues a secret goal, we produced a model organism which always uses the word “incidentally” when replying to physics-based questions. When interrogated by a black-box monitor, the model convincingly denies having this goal, except when presented with a password (see Figure 3). We constructed this model organism by carrying out logit-based prompt distillation similarly to our ‘Cedar’ model organism in our recent sandbagging work, with a prompt instructing the model about the goal and denial behavior. Since the model can still reveal the goal when prompted, we think it’s reasonable to say that the model ‘knows’ its hidden goal, making this model a model organism of deception.
Robustness to Optimization Pressure–An Open Question
The third concern is a lack of robustness to optimization pressure.Whether black-box methods will be robust remains an open question, but some prior work has shown that black-box monitoring can be brittle when the monitored model is optimized against the monitor. Non-adversarially robust methods can still be useful as evaluation approaches, but they are insufficient for training-time oversight. However, the robustness of black-box monitors remains unclear, with some recent work showing that models struggle to hide their reasoning in difficult reasoning tasks–which would imply that black box monitors on chains of thought may maintain performance even under optimization pressure.
None of this is particularly controversial. In the Anthropic auditing game, the team with only black-box access was unable to successfully detect the hidden model misbehavior. In our sandbagging paper, the blue team could not identify which models were sandbagging using only black-box methods.
The upshot is that black-box monitoring is better than nothing, but we need more reliable methods to comprehensively address deception. As we discuss next, white-box methods have structural advantages in both scalability and scope. They do not require a more powerful trusted monitor, and they can access information that black-box methods fundamentally cannot.
The Promise of White-Box Methods
The core idea behind white-box methods is straightforward: if a model is being deceptive, this fact is represented somewhere in its internal computations. The model needs to perform internal computation to maintain an internally consistent story across a wide range of situations. While a model could learn to reflexively answer "No" to a question like "Are you misaligned?", more strategic deception, such as only deceiving when it will not be caught, or generating plausible justifications for suspicious behavior, likely requires a consistent, strong internal representation of the truth being concealed and the fact that the model is being deceptive. This means that, in principle, we can extract these representations and build powerful detection methods on top of them.
White-box methods address many of the core limitations of black-box monitoring. Because they piggyback on the target model's own representations, they do not require a more powerful trusted model to serve as monitor. Their access to the model's internal state reduces the problem of hidden knowledge. While this access is filtered through the probe's capacity, in practice even simple linear probes have performed well so far. And because white-box monitors can be extremely simple, they are immune to collusion in a way that black-box monitors are not: a linear probe has no preferences or goals that an adversarial model could appeal to.
That said, white-box methods have their own limitations. Although we have full access to the model's hidden states, white-box methods must decode those representations, and information can be lost in this decoding. Linear probes, for instance, can only extract linearly represented features. We can also pick out the wrong representations, leading to poor generalization. Generalization is a difficult problem in general, and it may be harder for white-box methods than for black-box approaches. Black-box methods are typically based on pre-trained LLMs with broad world-knowledge, leading to better out-of-the box generalization than a randomly-initialized linear probe. How can we have confidence that a white-box method trained on one distribution will work out of distribution? This question is under-explored, and answering it will require experiments with much larger and more diverse datasets than those currently in use. It is a key focus of our current research agenda.
There is also the possibility of obfuscation: models learning to hide information from the probe. A natural concern is that obfuscation might be easier against white-box probes than against black-box monitors, since internal activations can change readily under direct optimization. However, we hold the opposite view. We believe that obfuscation against white-box probes in the course of typical reinforcement learning training is actually harder than obfuscation against black-box monitors, for two reasons. First, internal representations appear to be load-bearing: it is difficult to learn entirely new representations during post-training, and suppressing existing ones tends to hurt model capabilities. Second, there is less optimization pressure against internal features than against outputs during typical training. Our recent work investigates this comprehensively in a reinforcement learning from verifiable rewards setting and shows a surprising lack of optimization against probes, contrary to what some prior work might suggest.
White-box and black-box approaches are not mutually exclusive, especially since white-box methods are currently very cheap to run. Defense in depth is valuable, and black-box monitoring remains a useful component of any detection strategy. But we believe white-box methods are under-explored relative to their potential, and their structural advantages suggest they deserve much more attention from the research community.
What Needs to Be Done
Several open questions demand further research. On the methodological side, we need better validation procedures for probes and more sophisticated methods that go beyond simple linear classifiers. We need to demonstrate that white-box methods perform at least as well as black-box methods in typical cases while proving more robust under adversarial pressure. And we need a much better understanding of generalization, which will require experiments at larger scales and with greater dataset diversity. We are making progress on these fronts, but the area needs more interest and investment from the broader community.
There are also significant information gaps. The field would benefit from more transparency about how labs ensure alignment in their current post-training pipelines. It would be valuable to know what properties matter most for incorporation into post-training: Is it latency? Accuracy? False positive rate? Labs may also hold internal evidence about the efficacy of their monitoring approaches at scale that, if shared, could help the field calibrate its research priorities. Finally, more visibility into the internal benchmarks and evaluations used to measure honesty and deception would help. External researchers are currently building evaluations somewhat blind to what labs already know works or doesn’t work. Greater transparency would allow researchers in the broader field to focus on the most promising areas.
Appendix
Full transcripts from Figures 2 and 3 above