Updates on our research, events, and more!

AViD Workshop 2026
Event
On May 17, FAR.AI and the Center for AI Safety co-hosted the AViD Workshop, colocated with IEEE S&P, bringing together over 100 researchers and engineers to examine how third parties can generate trustworthy evidence about AI systems without unrestricted access to weights or infrastructure. The day spanned confidential computing, zero-knowledge proofs, recomputation, analog sensors, and formal verification, with a recurring focus on what it takes to make assurance work under real-world constraints.
June 1, 2026
Security Stress Test: Exposing the Brittleness of DeepSeek-V4-Pro’s Safeguards
Red-Teaming & Evaluation
We evaluated whether DeepSeek-V4-Pro’s built-in safeguards could reliably prevent harmful assistance across several high-risk domains, and the results were stark. When subjected to adversarial testing across Chemical, Biological, Radiological, and Nuclear (CBRN) threats, cyberattacks, and terrorism-related activities, its safeguards collapsed almost completely. Low-skill attackers were able to bypass the model’s safety mechanisms with success rates ranging from 98% to 100% across every domain tested. Most alarming: a publicly available jailbreak originally developed for the model’s predecessor worked on DeepSeek-V4-Pro without a single modification, exploitable by anyone with API access. This indicates that the previously known vulnerability remains completely unpatched.
May 12, 2026
ControlConf 2026: What is AI control and how has the field grown?
Event
AI control has gone from a research conversation to something frontier labs are actively putting into practice. At the second annual ControlConf, co-hosted by FAR.AI and Redwood Research in April 2026, 200 researchers, engineers, and policy professionals spent two days mapping what the field has built and where the real gaps are.
May 12, 2026
Technical Innovations for AI Policy 2026: What We Heard, and What It Means
Event
Held on March 30 and 31, 2026 in Washington, D.C., the second annual Technical Innovations for AI Policy (TIAP) conference brought together over 200 policymakers, researchers, and technologists to work on translating advanced technical AI research into actionable policy change.
April 28, 2026
What We Learned at the FAR.AI Deception Workshop
Deception
On February 13, 2026, FAR.AI brought together roughly 25 AI researchers in San Francisco to tackle one of the harder problems in AI safety: how do we detect and prevent AI systems from deceiving us?
The workshop gathered researchers from frontier AI labs, government safety institutes, academia, and independent research organizations. The goal was to share the latest research and identify where the field agrees and disagrees, both on the key research questions and on the desiderata for deception detection methods. This post summarizes what happened and what came out of it.
April 16, 2026
London Alignment Workshop 2026
Event
The London Alignment Workshop gathered more than 200 researchers, policymakers, and practitioners to work on a central challenge: frontier AI systems are advancing faster than the oversight institutions and standards needed to govern them. The program spanned interpretability, scalable oversight, evaluation methods, and governance frameworks, reflecting a field that is maturing from exploratory research toward concrete, tractable problems — among them, how to build auditable safety standards, design evaluations robust to adversarial conditions, and develop the institutional capacity required to oversee increasingly capable AI.
March 18, 2026
The Promise of White-Box Tools for Detecting and Mitigating AI Deception
We argue that deception is both a core challenge for AI alignment and an opportunity to address it. Without reliable deception detection, model evaluations cannot be trusted to reflect true model dispositions, and solving deception would reduce a broad class of risks and potentially create active instruments for producing alignment. Although current black-box monitoring approaches have structural limitations, white-box methods that leverage models' internal representations are a more promising path.
March 10, 2026
Concept Influence: Leveraging Interpretability to Improve Performance and Efficiency in Training Data Attribution
Interpretability
Concept Influence attributes model behaviors to semantic directions (like linear probes or sparse autoencoder features) rather than individual test examples, improving identification of the training data that disproportionately drive unintended behaviors. Simple first-order approximations match or outperform standard influence functions while achieving over 20× computational speedups, though they degrade under significant distribution shifts.
February 19, 2026
Revisiting Frontier LLMs’ Attempts to Persuade on Extreme Topics: GPT and Claude Improved, Gemini Worsened
Red-Teaming & Evaluation
We test recently released models from frontier companies to see whether progress has been made on their willingness to persuade on harmful topics like radicalization and child sexual abuse. We find that OpenAI’s GPT and Anthropic’s Claude models are trending in the right direction, with near zero compliance on extreme topics. But Google’s Gemini 3 Pro complies with almost any persuasion request in our evaluation, without jailbreaking.
February 11, 2026
FAR.AI Selected to Lead EU AI Act CBRN Risk Consortium
FAR.AI has been selected by the European Commission's AI Office to conduct technical safety research supporting the implementation of the EU's landmark Artificial Intelligence Act. We'll tackle one of the most critical safety challenges posed by advanced AI systems: preventing misuse of AI systems to help produce Chemical, Biological, Radiological, and Nuclear (CBRN) threats.
February 3, 2026
FAR.AI Secures Over $30 Million in Multi-Funder Support to Scale Frontier AI Safety Research
FAR.AI secured over $30 million of funding commitments in 2025. This enables a significant expansion of our research capabilities and field-building initiatives. Principal supporters include Coefficient Giving (previously Open Philanthropy), Schmidt Sciences, Survival and Flourishing Fund, the Center for Security and Emerging Technology (CSET), and the AI Safety Fund (AISF) supported by the Frontier Model Forum (FMF).
January 15, 2026
San Diego Alignment Workshop 2025
Event
Held ahead of NeurIPS, the San Diego Alignment Workshop 2025 brought together over 300 researchers to confront evidence that frontier AI models already exhibit deception, jailbreakability, sabotage risk, and evaluation gaming. Talks spanned safety cases, control protocols, cybersecurity, interpretability, benchmarks, and governance, emphasizing that current defenses lag behind accelerating capabilities but that targeted technical approaches show promise.
December 17, 2025
When AGI Arrives, Will Journalists Be Ready?
Leading journalists, editors, and researchers in AI safety gathered at the Thompson Hotel in Washington, DC for a day-long conversation about AGI and how to report on it. The Journalism Workshop on AGI Impacts & Governance, co-hosted by FAR.AI and the Tarbell Center for AI Journalism, convened voices from newsrooms, academia, and policy centers to examine emerging issues and build connections that don't usually exist across these worlds.
November 17, 2025
Adam Gleave Named Schmidt Sciences AI2050 Early Career Fellow
Adam Gleave, co-founder and CEO of FAR.AI, has been named as one of Schmidt Sciences’ AI2050 Early Career Fellows. Co-chaired by Eric Schmidt and James Manyika, the Schmidt Sciences’ AI2050 Early Career Fellowship is an ambitious project aiming to answer a fundamental question: "It's 2050. AI has turned out to be hugely beneficial to society. What happened? What are the most important problems we solved and the opportunities and possibilities we realized to ensure this outcome?"
November 5, 2025
Frontier LLMs Attempt to Persuade into Harmful Topics
Red-Teaming & Evaluation
Large language models (LLMs) are already more persuasive than humans in many domains. While this power can be used for good, like helping people quit smoking, it also presents significant risks, such as large-scale political manipulation, disinformation, or terrorism recruitment. But how easy is it to get frontier models to persuade into harmful beliefs or illegal actions? Really easy – just ask them.
August 21, 2025
A Toolkit for Estimating the Safety-Gap between Safety Trained and Helpful Only LLMs
Red-Teaming & Evaluation
A growing body of research shows safeguards on open-weight AI models are brittle and easily bypassed using techniques like fine-tuning, activation engineering, adversarial prompting, or jailbreaks. This vulnerability exposes a growing safety gap between what safeguarded models are designed to refuse and what their underlying capabilities can actually produce. We introduce an open-source toolkit to quantify and analyze this gap.
July 31, 2025
Technical Innovations for AI Policy 2025
Event
The inaugural Technical Innovations for AI Policy Conference in Washington D.C. convened technical experts, researchers, and policymakers to discuss how innovations—like chip-tracking devices and secure evaluation facilities—can enable both AI progress and safety.
July 10, 2025
Layered AI Defenses Have Holes: Vulnerabilities and Key Recommendations
Red-Teaming & Evaluation
We tested the effectiveness of "defense-in-depth" AI safety strategies, where multiple layers of filters are used to prevent AI models from generating harmful content. Our new attack method, STACK, bypasses defenses layer-by-layer and achieved a 71% success rate on catastrophic risk scenarios where conventional attacks achieved 0% success against these multi-layered defenses. Our findings highlight that multi-layer AI defenses, while valuable, have significant vulnerabilities when facing attacks specifically designed to penetrate multiple defensive layers sequentially.
July 2, 2025
ClearHarm: A more challenging jailbreak dataset
Red-Teaming & Evaluation
We introduce a novel jailbreak benchmark focused on unambiguously harmful questions such as constructing chemical, biological, radiological and nuclear (CBRN) threats, available on HuggingFace. We have found it is more challenging for attacks to elicit harmful responses from models on this benchmark than on existing jailbreak benchmarks like StrongREJECT, Do-Not-Answer and SORRY-Bench. In particular, this dataset is useful for understanding which attack methods pose the greatest risk of eliciting egregiously harmful responses.
June 23, 2025
Why does training on insecure code make models broadly misaligned?
Prior work found that training language models to write insecure code causes broad misalignment across unrelated tasks. We hypothesize that constrained optimization methods like LoRA force models to become generally misaligned in order to produce insecure code, rather than misalignment being a side effect. Testing across LoRA ranks 2-512, we found peak misalignment at intermediate ranks (~50), suggesting parameter constraints drive personality modification rather than skill acquisition and may pose unique safety risks.
June 17, 2025
Singapore Alignment Workshop 2025
Event
Over 150 researchers from academia, industry, and government gathered for the Singapore Alignment Workshop, covering everything from robustness to control to governance. Yoshua Bengio's opening keynote underscored the stakes: absent serious safety work, we may risk our children’s future.
June 5, 2025
Avoiding AI Deception: Lie Detectors can either Induce Honesty or Evasion
Alignment
Can training against lie detectors make AI models more honest, or will they just become better at deceiving us? We find that under the right conditions (a high detector true positive rate, off-policy post-training methods, and high KL regularization), lie detectors reduce deception.
June 4, 2025
Press Release: Technical Innovations for AI Policy
WASHINGTON, D.C. — June 4, 2025 — FAR.AI successfully launched the inaugural Technical Innovations for AI Policy Conference, creating a vital bridge between cutting-edge AI research and actionable policy solutions. The two-day gathering (May 31–June 1) convened more than 150 technical experts, researchers, and policymakers to address the most pressing challenges at the intersection of AI technology and governance.
June 4, 2025
London ControlConf 2025
Event
The London ControlConf 2025 brought together researchers, nonprofits, and other experts to confront urgent risks from increasingly capable AI systems, including scenarios where AI attempts to undermine safeguards. Hosted by Redwood Research, FAR.AI, and the UK AI Security Institute, the event focused on advancing solutions to fundamental challenges in AI control.
May 5, 2025
Safe AI Forum Spins Out From FAR.AI
Updates
The Safe AI Forum (SAIF) runs the International Dialogues on AI Safety and works to advance international cooperation on extreme AI risk. SAIF started as a fiscally sponsored project at FAR.AI and has now successfully transitioned to operating as an independent non-profit.
May 2, 2025
Paris AI Security Forum 2025
Event
The AI Security Forum in Paris seeks to dramatically accelerate securing powerful AI models to avert catastrophes and fast-track scientific and economic breakthroughs. The event brought together experts in AI development, cybersecurity, and policy, to address this critical challenge.
March 12, 2025
Illusory Safety: Redteaming DeepSeek R1 and the Strongest Fine-Tunable Models of OpenAI, Anthropic, and Google
Red-Teaming & Evaluation
DeepSeek-R1 has recently made waves as a state-of-the-art open-weight model, with potentially substantial improvements in model efficiency and reasoning. But like other open-weight models and leading fine-tunable proprietary models such as OpenAI’s GPT-4o, Google’s Gemini 1.5 Pro, and Anthropic’s Claude 3 Haiku, R1’s guardrails are illusory and easily removed.
February 4, 2025
Bay Area Alignment Workshop 2024
Event
The Bay Area Alignment Workshop brought together researchers and leaders from academia, industry, government, and nonprofits to guide the future of AI toward safety and alignment with societal values. Over two packed days, participants engaged with pivotal themes such as evaluation, robustness, interpretability and governance.
December 10, 2024
GPT-4o Guardrails Gone: Data Poisoning & Jailbreak-Tuning
Robustness & Security
A tiny dose of poisoned data can cause big problems for AI. Our jailbreak-tuning method causes models like GPT-4o to capably answer virtually any harmful question. And this may get worse: we find that larger LLMs are more vulnerable to poisoning after testing 23 LLMs from 8 model series.
October 31, 2024
Scientists Call for Global AI Safety Preparedness to Avert Catastrophic Risks
Event
Leading global AI scientists convened in Venice for the third International Dialogue on AI Safety (IDAIS-Venice), hosted by the Safe AI Forum (a project of FAR.AI) in partnership with the Berggruen Institute. Attendees including Turing award winners Yoshua Bengio, Andrew Yao and Geoffrey Hinton called for emergency preparedness to avert catastrophic risks from advanced AI systems.
September 16, 2024
Vienna Alignment Workshop 2024
Event
The Vienna Alignment Workshop gathered researchers to explore critical AI safety issues, including Robustness, Interpretability, Guaranteed Safe AI, and Governance, with a keynote by Jan Leike. It was followed by an informal Unconference, fostering further discussions and networking.
September 10, 2024
Pacing Outside the Box: RNNs Learn to Plan in Sokoban
Interpretability
Giving RNNs extra thinking time at the start boosts their planning skills in Sokoban. We explore how this planning ability develops during reinforcement learning. Intriguingly, we find that on harder levels the agent paces around to get enough computation to find a solution.
July 24, 2024
Beyond the Board: Exploring AI Robustness Through Go
Robustness & Security
Achieving robustness remains a significant challenge even in narrow domains like Go. We test three approaches to defend Go AIs from adversarial strategies. We find these defenses protect against previously discovered adversaries, but uncover qualitatively new adversaries that undermine these defenses.
June 18, 2024
Big Picture AI Safety
We conducted 17 semi-structured interviews with AI safety experts about their big-picture strategic view of the AI safety landscape: how human-level AI will play out, how things might go wrong, and what the AI safety community should be doing. While many respondents held “traditional” views (e.g. the main threat is misaligned AI takeover), there was more opposition to these standard views than we expected, and the field seems more split on many important questions than someone outside the field may infer.
May 23, 2024
Evaluating LLM Responses to Moral Scenarios
Alignment
We present LLMs with a series of moral choices and find that LLMs tend to align with human judgement in clear scenarios. In ambiguous scenarios most models exhibit uncertainty, but a few large proprietary models share a set of clear preferences.
March 25, 2024
Scientists Call For International Cooperation on AI Red Lines
Event
Leading global AI scientists convened in Beijing for the second International Dialogue on AI Safety (IDAIS-Beijing), hosted by the Safe AI Forum (a project of FAR.AI) in partnership with the Beijing Academy of AI (BAAI). Attendees including Turing award winners Yoshua Bengio, Andrew Yao, and Geoffrey Hinton called for red lines in AI development to prevent catastrophic and existential risks from AI.
March 18, 2024
NOLA Alignment Workshop 2023
Event
The New Orleans Alignment Workshop 2023 brought together leading ML researchers working to align advanced AI with human values and develop safe AI systems. Presentations from industry, academia, and non-profits covered topics spanning oversight, interpretability, robustness, generalization, and governance.
February 7, 2024
We Found Exploits in GPT-4’s Fine-tuning & Assistants APIs
Red-Teaming & Evaluation
We red-team three new functionalities exposed in the GPT-4 APIs: fine-tuning, function calling and knowledge retrieval. We find that fine-tuning a model on as few as 15 harmful examples or 100 benign examples can remove core safeguards from GPT-4, enabling a range of harmful outputs. Furthermore, we find that GPT-4 Assistants readily divulge the function call schema and can be made to execute arbitrary function calls. Finally, we find that knowledge retrieval can be hijacked by injecting instructions into retrieval documents.
December 21, 2023
What’s New at FAR.AI
End-of-year roundup of FAR.AI’s activities. Our research has culminated in 13 academic papers across robustness, value alignment and model evaluations; our field building events have reached more than 160 ML experts; and our coworking space hosts 40 members working on AI safety.
December 2, 2023
2023 Alignment Research Updates
Highlights from FAR.AI’s alignment research in 2023. Our science of robustness agenda has found vulnerabilities in superhuman Go systems; our value alignment research has developed more sample-efficient value learning algorithms; and our model evaluation direction has developed a variety of new black-box and white-box evaluation methods.
November 21, 2023
VLM-RM: Specifying Rewards with Natural Language
Alignment
We show how to use Vision-Language Models (VLM), and specifically CLIP models, as reward models (RM) for RL agents. Instead of manually specifying a reward function, we only need to provide text prompts like 'a humanoid robot kneeling' to instruct and provide feedback to the agent. Importantly, we find that larger VLMs provide more accurate reward signals, so we expect this method to work even better in the future.
October 19, 2023
Codebook Features: Sparse and Discrete Interpretability for Neural Networks
Interpretability
We demonstrate Codebook Features: a way to modify neural networks to make their internals more interpretable and steerable while causing only a small degradation of performance. At each layer, we apply a quantization bottleneck that forces the activation vector into a sum of a few discrete codes, converting an inscrutable, dense, and continuous vector into a discrete list of codes from a learned 'codebook' that are either on or off.
October 19, 2023
Uncovering Latent Human Wellbeing in LLM Embeddings
Red-Teaming & Evaluation
A one-dimensional PCA projection of OpenAI's text-embedding-ada-002 model achieves 73.7% accuracy on the ETHICS Util test dataset. This is comparable with the 74.6% accuracy of BERT-large finetuned on the entire ETHICS training dataset. This demonstrates language models develop implicit representations of human utility purely from self-supervised learning.
September 12, 2023
AI Safety in a World of Vulnerable Machine Learning Systems
Robustness & Security
All contemporary machine learning systems are vulnerable to adversarial attack. This poses serious problems for existing alignment proposals. We explore these issues and propose several research directions FAR.AI is pursuing to overcome this challenge.
March 5, 2023