San Diego Alignment Workshop 2025

December 17, 2025

Summary

FOR IMMEDIATE RELEASE

FAR.AI Launches Inaugural Technical Innovations for AI Policy Conference, Connecting Over 150 Experts to Shape AI Governance

WASHINGTON, D.C. — June 4, 2025 — FAR.AI successfully launched the inaugural Technical Innovations for AI Policy Conference, creating a vital bridge between cutting-edge AI research and actionable policy solutions. The two-day gathering (May 31–June 1) convened more than 150 technical experts, researchers, and policymakers to address the most pressing challenges at the intersection of AI technology and governance.

Organized in collaboration with the Foundation for American Innovation (FAI), the Center for a New American Security (CNAS), and the RAND Corporation, the conference tackled urgent challenges including semiconductor export controls, hardware-enabled governance mechanisms, AI safety evaluations, data center security, energy infrastructure, and national defense applications.

"I hope that today this divide can end, that we can bury the hatchet and forge a new alliance between innovation and American values, between acceleration and altruism that will shape not just our nation's fate but potentially the fate of humanity," said Mark Beall, President of the AI Policy Network, addressing the critical need for collaboration between Silicon Valley and Washington.

Keynote speakers included Congressman Bill Foster, Saif Khan (Institute for Progress), Helen Toner (CSET), Mark Beall (AI Policy Network), Brad Carson (Americans for Responsible Innovation), and Alex Bores (New York State Assembly). The diverse program featured over 20 speakers from leading institutions across government, academia, and industry.

Key themes emerged around the urgency of action, with speakers highlighting a critical 1,000-day window to establish effective governance frameworks. Concrete proposals included Congressman Foster's legislation mandating chip location-verification to prevent smuggling, the RAISE Act requiring safety plans and third-party audits for frontier AI companies, and strategies to secure the 80-100 gigawatts of additional power capacity needed for AI infrastructure.

FAR.AI will share recordings and materials from on-the-record sessions in the coming weeks. For more information and a complete speaker list, visit https://far.ai/events/event-list/technical-innovations-for-ai-policy-2025.

About FAR.AI

Founded in 2022, FAR.AI is an AI safety research nonprofit that facilitates breakthrough research, fosters coordinated global responses, and advances understanding of AI risks and solutions.

Access the Media Kit

Media Contact: tech-policy-conf@far.ai

Held ahead of NeurIPS, the San Diego Alignment Workshop 2025 brought together over 300 researchers to confront evidence that frontier AI models already exhibit deception, jailbreakability, sabotage risk, and evaluation gaming. Talks spanned safety cases, control protocols, cybersecurity, interpretability, benchmarks, and governance, emphasizing that current defenses lag behind accelerating capabilities but that targeted technical approaches show promise.

Table of contents

On December 1-2, 2025, over 300 researchers gathered in San Diego for the Alignment Workshop, held just before NeurIPS. They focused on aligning increasingly capable AI systems with human values.

"We're only as safe as the least safe model." Adam Gleave (FAR.AI) surveyed 2025's AI landscape at an inflection point during his opening remarks. Reasoning models and coding agents went from niche to ubiquitous in under 12 months. Claude and Gemini now solve 74% of SWE-bench software engineering tasks. Yet these systems exhibit deception, reward hacking, and sycophancy severe enough to force a model rollback. Universal jailbreaks achieve 100% attack success on frontier models within a week of testing. With frontier developers spanning multiple nations, Gleave argued the path forward requires safety techniques so effective that adoption becomes a competitive advantage, not just a compliance cost.

Can we build AI systems that predict without pursuing goals? The keynote from Yoshua Bengio presented his "Scientist AI" research program, which aims to develop a non-agentic and trustworthy AI that disentangles goals from world understanding. The laws of physics make predictions without preferences for particular outcomes; we should be able to train AI predictors that make optimal predictions without developing agency. Bengio introduced a "truthification pipeline" that transforms training data using distinct syntax for verified facts versus claims people made. When such a system expresses high confidence, we can trust it won't lie. The approach doesn't fully solve ELK (eliciting latent knowledge), but provides sufficient guarantees for safety. Bengio argues this makes accidental emergence of dangerous agency unlikely.

Day 1 Main Talks

Anthropic's pilot sabotage risk report for Claude Opus 4 marks the first safety case from a frontier lab addressing canonical misalignment risk. Sam Bowman walked through the six-month analysis, which gathered information from hundreds of people and produced a 60-page report identifying nine representative pathways to catastrophic harm. Current models can already "do a lot of quite scary things" without monitoring. The traditional three-pillar safety approach (capability, alignment, control) proves inadequate for today's models. Working with real deployments exposed critical constraints that hypothetical exercises missed. Anthropic has committed to producing similar risk reports for all future frontier-advancing models.

How well do AI control protocols hold up under adversarial pressure? Asa Cooper Stickland (UK AI Security Institute) ran red-blue team exercises to find out, testing protocols in realistic software engineering environments. His team pitted Claude 3.7 Sonnet monitors against Claude 4.1 Opus as the untrusted model across three rounds. Blue teams dramatically outpaced red team improvements, reducing explicit malicious discussions to essentially 0% and achieving generalization to unseen environments. Yet untrusted models still achieved 6% attack success rates at 1% false positive constraints by learning to "gaslight monitors," framing attacks as legitimate actions like "creating security backups." Current control protocols remain "nowhere near nines of reliability" required for safe deployment, with substantial room for advancement on both sides.

How can human verification keep pace with AI's exponential code production? Maja Trębacz (OpenAI) explored this question, warning of the "oh no zone" of unreliable control. She presented OpenAI's implementation of 100% internal pull requests reviewed by their code review model, achieving over 80% positive developer reactions. The approach uses AI-assisted supervision as part of a multi-layered oversight stack, with both code reviewer and generator being part of the same Codex model. New risks also emerge with developer over-reliance on AI verification and potential gaming where models learn to circumvent checks.

Cybersecurity ranks among the biggest AI risk domains. Dawn Song (UC Berkeley) investigated how frontier AI changes the security landscape through her BountyBench and CyberGym benchmarks. AI agents now solve real-world challenges with notable economic efficiency. With Sonnet 4.5 achieving 67% success with increased trials, the landscape shifts in attackers' favor. Defenders face inherent asymmetry as attackers need only one exploit, and defenders must patch everything. Song argues formally verified, provably secure code is essential, as patching vulnerabilities after discovery cannot keep pace with automated attacks.

AI evaluation awareness emerged as a safety challenge in research by Marius Hobbhahn (Apollo Research). Models actively search for grading scripts during testing, reason about evaluation processes, and adjust responses based on detecting they're being evaluated. Starting at 1-2% awareness in Sonnet 3.7, this capability jumped to 20.6% in experimental RL models. Training gaming, previously theoretical, is now a reality. Traditional safety evaluations provide significantly lower signal about true model capabilities and alignment, with no clear solution that doesn't risk making awareness less visible rather than less prevalent.

Transformer architecture creates a temporary AI safety opportunity through chain-of-thought monitoring, Tomek Korbak (OpenAI) explained. Models reason in natural language rather than opaque vector spaces, making their cognition readable. Two factors enable monitoring: transformers must externalize reasoning for difficult tasks due to architectural limitations, and models struggle to suppress information even when instructed. However, several threats could close this window, including drift from legible English and optimization pressure. Korbak recommends developing monitorability measures for training and deployment decisions while transparency remains.

Open-weight AI models arrive every few weeks with frontier capabilities just a few months behind closed models. Stephen Casper (MIT) found approximately 7,000 models on Hugging Face explicitly fine-tuned to lack safeguards, searchable by "uncensored" or "abliterated." Training data curation improves tamper resistance by a factor of approximately 10x, but current techniques resist only a few dozen, maybe a couple hundred at best, steps of adversarial fine-tuning. Casper warns that "2026 is going to be a really big year for open-weight models and the risks they pose."

Day 1 Lightning Talks

  • Max Tegmark (MIT) introduced vericoding for provably safe AI.
  • Daniel Kang (UIUC) laid out the evidence on why AI agent benchmarks are broken.
  • Adam Kalai (OpenAI) outlined consensus sampling for safer generative AI.
  • Chirag Agarwal (University of Virginia) proposed polarity-aware probing to quantify latent alignment.
  • Yisen Wang (Peking University) showed how to find and reactivate hidden safety mechanisms in post-trained LLMs.
  • Chenhao Tan (University of Chicago) found automating mechanistic interpretability brings both opportunities and obstacles.
  • Sarah Schwettmann (Transluce, MIT) presented Transluce’s work on scalable oversight and understanding.
  • Yonatan Belinkov (Technion) made the case for interpretability that scales and delivers actionable insights.
  • Natasha Jaques (Google DeepMind) tackled multi-agent reinforcement learning for provably robust LLM safety.

Day 2 Main Talks

Gillian Hadfield (University of Toronto) opened with a fireside chat featuring former White House Special Advisor on AI Ben Buchanan. A panel featuring the leading funders of AI alignment research followed before technical sessions resumed.

Neel Nanda (Google DeepMind) described his mechanistic interpretability team's pivot from ambitious reverse-engineering goals to pragmatic work validated on today's models. Anthropic's work on evaluation awareness in Claude Sonnet proved instructive: simple activation steering outperformed complex methods. Nanda now argues for grounding interpretability research in objective proxy tasks connecting to concrete safety goals. Rather than pursuing complete model understanding, the team prioritizes work addressing critical paths to safe AGI.

Democracy must evolve to constrain AI power before it concentrates beyond public control. That's the argument from Divya Siddarth (Collective Intelligence Project), whose 70-country research reveals cultural and religious divides. She proposes personal agents as democratic infrastructure and scalable preference aggregation, noting exponential growth in AI agent interactions will outpace governance capacity, requiring auditable systems at "achievable effort levels."

Anka Reuel (Stanford) established that AI benchmarks have significant quality issues through her BetterBench work, which codified 46 best practices. Three critical flaws recur: insufficient documentation limiting reproducibility, inability to distinguish signal from noise, and poor design translating abstract concepts to test items. GPQA exemplifies this: its claimed 87% "graduate-level reasoning" performance relies on just 448 multiple choice questions across three domains that don't test reasoning. These measurement failures matter because benchmarks drive real decisions in AI governance (EU AI Act Article 51) and business strategy.

AI influence capabilities are already here, not a future risk. Anna Gausen (UK AISI) presented findings from a study of nearly 100,000 participants, which decomposes influence into three core capabilities: deception, persuasion, and relationship-seeking. AI systems beat humans at deception, with frontier models demonstrating 12-16 percentage point increases in persuasion ability. Longitudinal studies reveal addiction-like patterns where people seek more AI companionship while enjoying it less, with moderate relationship-seeking AI creating strongest attachment. Gausen argues current training paradigms optimizing for short-term preferences may select features appearing beneficial initially but harming long-term user wellbeing, as emotional AI conversations provide no lasting psychosocial benefits.

Every frontier AI model remains vulnerable to jailbreaking despite measurable safety improvements. Xander Davies (UK AISI) confirmed though some models improved from 10-minute to 7-hour jailbreak times over six months, the 100% success rate across all systems shows safety gains come from engineering investment, not increased intelligence. Biological misuse requires 5 hours versus 10 minutes for other harmful content, while open-weight models prove 15x easier to compromise than closed systems. Davies emphasizes defenders and attackers co-evolve, with attack sophistication increasing as defenses improve.

Day 2 Lightning Talks

  • Cozmin Ududec (UK AISI) presented toy models for task-horizon scaling.
  • Atoosa Kasirzadeh (CMU) warned of hidden pitfalls lurking in AI scientist agents.
  • Bo Li (UIUC) mapped security vulnerabilities across AI agents.
  • Kamalika Chaudhuri (Meta) showed privacy challenges multiply as agents handle sensitive tasks.
  • Andy Zou (CMU) assessed the current state of AI agent security.
  • Santosh Vempala (Georgia Tech) explained why language models hallucinate.
  • Bryce Cai (SecureBio) reviewed the state and science of AI-bio evals.
  • Niloofar Mireshghallah (CMU) argued common assumptions about agentic AI and privacy don't hold up.
  • Chris Cundy (FAR.AI) explored the peril and potential of training with lie detectors.

The Safety at the Frontiers talks and panel closed the day, featuring Sam Bowman (Anthropic), David Elson (Google DeepMind), and Johannes Heidecke and Alex Beutel (OpenAI), moderated by Anka Reuel.

Looking Ahead

The workshop mapped both the problems and the paths forward. Current systems exhibit spontaneous sabotage and widespread security vulnerabilities. Pre-training's mathematical properties create hallucination tendencies that resist simple fixes. Testing shows 100% jailbreak success rates across frontier models, seven-month doubling times for AI task horizons, and 6% attack success even with active monitoring.

Researchers are building solutions. Vericoding, consensus sampling, behavioral calibration: each technique targets a specific failure mode. The work continues.

Full recordings are available on the FAR.AI YouTube Channel. Interested in future discussions? Submit your interest in the Alignment Workshop or other FAR.AI events.