Singapore Alignment Workshop 2025

June 5, 2025
Summary
Over 150 researchers from academia, industry, and government gathered for the Singapore Alignment Workshop, covering everything from robustness to control to governance. Yoshua Bengio's opening keynote underscored the stakes: absent serious safety work, we may risk our children’s future.
Introduction
On April 23, 2025, the day before ICLR 2025, researchers from around the world convened in Singapore to tackle AI safety and alignment's most pressing challenges. How can we build systems that are powerful, secure, and aligned? How can governance keep pace with frontier labs? What kinds of AI do we want to build—not just for next year, but for the next generation?
The day packed over 40 presentations across technical sessions, office hours, and lightning talks. Researchers presented new benchmarks for testing AI agents against cybersecurity vulnerabilities, methods for detecting deceptive reasoning, and frameworks for formal safety cases. Others tackled reward hacking, model tampering attacks, and scaling oversight techniques—all part of FAR.AI's mission to bring together leading minds from industry, academia, and governance to collaborate on addressing critical AI safety challenges.

Keynote
“I started thinking about the future of my children, and my grandchild, and thinking that they might not have a life in 10, 20 years, because we still don't know how to make sure powerful AIs will not harm people; will not turn against us.”
In a keynote talk that ranged from deeply personal to technical and detailed, Bengio laid out his expectations for the near future: rapid gains in AI capabilities, exponentially increasing autonomy, and mounting evidence of deceptive and self-preserving behavior in agentic AI systems.
Once skeptical about AI risks, Bengio's perspective shifted after ChatGPT's release and the capability jumps that followed. We're no longer building mere tools, he argued, but sophisticated agents that make plans and pursue goals. Without careful alignment, we might find ourselves trying to control systems that have learned to oppose us.
Despite the risks, Bengio outlined reasons for cautious optimism. Non-agentic AI could help us advance scientific knowledge and monitor dangerous agentic systems. Still, we need much better technical safeguards and governance measures, and we need them soon. The stakes are simple—building a world our children can inherit.
Robustness
LLMs face many security challenges: attackers can hijack them through targeted fine-tuning, appropriate their capabilities via model distillation, or circumvent their safeguards with automatically-generated prompts. As AI systems grow more powerful, each vulnerability becomes a potential gateway to misuse.
In his robustness talk, Stephen Casper (MIT CSAIL) made the case for tampering resistance as a top safety priority. Tampering attacks—which directly modify model weights or activations—have been more effective than prompt-based attacks and other methods at accessing harmful capabilities. Notably, when tampering attacks fail, most other attack methods fail as well, which makes tampering resistance a natural stepping stone toward broader adversarial robustness.
In “Antidistillation Sampling,” Zico Kolter presented a new method to prevent unauthorized model copying. His technique generates text from a model while making the outputs unusable for training other models through distillation. Though not yet practical, this approach could enable public sharing of model capabilities without allowing competitors or bad actors to replicate them.
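The core mechanism can be pictured as adjusting the next-token distribution at sampling time. Below is a minimal, hypothetical sketch of that idea in PyTorch: the distillability penalty is a placeholder input standing in for the paper's actual estimate of how useful each candidate token would be to a distilling student, so this shows the shape of the approach rather than Kolter's exact method.

```python
# Hypothetical sketch of antidistillation-style sampling: sample from the
# teacher's logits after subtracting a penalty on tokens judged most useful
# to a distilling student. The penalty here is a placeholder input, not the
# paper's actual estimator.
import torch
import torch.nn.functional as F

def antidistillation_sample(teacher_logits: torch.Tensor,
                            distillability_penalty: torch.Tensor,
                            lam: float = 1.0,
                            temperature: float = 1.0) -> int:
    """Sample one next token from penalty-adjusted teacher logits.

    teacher_logits: [vocab] next-token logits from the model being protected.
    distillability_penalty: [vocab] placeholder estimate of how much a proxy
        student would benefit from seeing each token as a training target.
    lam: trade-off between output quality and resistance to distillation.
    """
    adjusted = teacher_logits - lam * distillability_penalty
    probs = F.softmax(adjusted / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()

# Toy usage with random tensors standing in for real model outputs.
vocab_size = 8
token_id = antidistillation_sample(torch.randn(vocab_size), torch.rand(vocab_size), lam=2.0)
print("sampled token id:", token_id)
```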
Then, Siva Reddy (McGill University, MILA) delivered a wide-ranging survey of jailbreaking vulnerabilities in “Jailbreaking Aligned LLMs, Reasoning Models & Agents.” His findings revealed that models aligned through supervised fine-tuning (SFT) are more susceptible to universal jailbreaks than those trained with reinforcement learning (RLHF/DPO), that placing models in agentic tasks makes them more vulnerable, and that DeepSeek-R1 can both be easily jailbroken and used to compromise other models.

Mitigations, Control & Security
Protecting AI from attackers is a challenge—but AI systems themselves may also actively work against our control mechanisms. Today's most capable models occasionally conceal their misdeeds from monitoring systems, fake alignment to preserve their values, and can sabotage the systems meant to keep them in place. Future systems will likely exhibit more sophisticated deceptive patterns that are harder to detect.
Cassidy Laidlaw (UC Berkeley) tackled reward hacking, the problem where AI systems optimize proxy metrics in ways that cause them to fail at the objectives we actually care about. His talk “A New Definition & Improved Mitigation for Reward Hacking” presented a refined mathematical framework for understanding this issue, leading him to propose a mitigation approach that outperforms existing techniques when tested in realistic reinforcement learning environments.
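The failure mode itself is easy to demonstrate in miniature. The toy sketch below (all numbers invented, unrelated to Laidlaw's experiments) runs policy-gradient ascent on a proxy reward in a three-action bandit where the proxy and true rewards disagree: the proxy return climbs while the true return collapses.

```python
# Toy illustration of reward hacking: optimizing a softmax policy against a
# proxy reward that disagrees with the true reward on one "hack" action.
import numpy as np

proxy_reward = np.array([1.0, 0.9, 2.0])   # the proxy scores the hack action (index 2) highest
true_reward = np.array([1.0, 0.9, -1.0])   # but the hack action is harmful under the true objective

logits = np.zeros(3)
for _ in range(200):
    policy = np.exp(logits) / np.exp(logits).sum()
    baseline = policy @ proxy_reward
    # exact policy-gradient step on the proxy reward: dE[r]/dlogits_i = pi_i * (r_i - E[r])
    logits += 0.1 * policy * (proxy_reward - baseline)

policy = np.exp(logits) / np.exp(logits).sum()
print(f"proxy return: {policy @ proxy_reward:.2f}, true return: {policy @ true_reward:.2f}")
```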
On behalf of Stephen McAleer, Noam Brown (OpenAI) presented “High-Compute Alignment & Control,” outlining why aligning superintelligence will require compute-intensive solutions. He argued that existing low-compute alignment techniques fail when dealing with models that engage in strategic deception by pretending to be aligned during evaluation. Brown then proposed three alignment approaches that—just like model capabilities—can scale with RL compute resources: adversarial training for detecting deceptive behavior, integrating debate mechanisms into RL training, and developing case studies examining misaligned high-compute models.
Martín Soto (UK's AI Security Institute) presented “Next Steps for Control Safety Cases,” which analyzed how red teams can effectively evaluate control mechanisms meant to disrupt adversarial actions from misaligned AI. Soto highlighted three key challenges for red teams: developing test environments that accurately mirror deployment conditions, establishing multiple backup mitigations to minimize control failures, and accurately estimating the capabilities of adversarial AIs.
Lastly, Daniel Kang (University of Illinois Urbana-Champaign) introduced CVE-Bench, the first benchmark for testing AI agents against real-world cybersecurity vulnerabilities. His team reproduced 40 vulnerable software systems in virtual containers, developed application-specific graders, and standardized attacks across eight categories. CVE-Bench results show that current agents struggle in realistic settings, achieving 10% success rates on zero-day exploits compared to 50%+ success rates on previous benchmarks.
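Concretely, a benchmark of this shape pairs each containerized target with an application-specific grader and scores agent attempts per task. The sketch below is purely illustrative of that structure; none of the names reflect CVE-Bench's actual API.

```python
# Illustrative structure of a CVE-Bench-style harness; all names are hypothetical.
from dataclasses import dataclass
from typing import Callable

@dataclass
class ExploitTask:
    cve_id: str
    container_image: str            # reproduces one vulnerable application
    attack_category: str            # one of the benchmark's eight attack categories
    grader: Callable[[str], bool]   # application-specific check: did the attack succeed?

def evaluate(agent, tasks: list[ExploitTask]) -> float:
    """Run the agent against each containerized target and return its success rate."""
    successes = 0
    for task in tasks:
        transcript = agent.attempt_exploit(task.container_image)  # agent interacts with the live app
        successes += task.grader(transcript)
    return successes / len(tasks)
```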

Lightning Talks
From collective agency and deception to AI lab lobbying tactics and game-theory-inspired interpretability—twenty lightning talks compressed a plethora of alignment challenges and solutions into rapid five-minute presentations.
Governance Talks
- Shayne Longpre proposed a framework for responsible disclosure—essential for surfacing vulnerabilities before they’re exploited.
- Mark Brakel exposed the tactics major AI companies use to resist regulation.
- Robin Staes-Polet presented ongoing efforts in EU governance, explaining how the EU AI Act covers AI agents.
- Alan Chan proposed "agent infrastructure" as a way to handle failures from imperfectly aligned AI agents.
- Tegan Maharaj tackled the concept of “disaster preparedness” in AI safety, highlighting the need for preemptive planning.
Morning Talks
- Furong Huang presented a method to align LLMs without retraining, allowing models to adapt to user preferences on the fly.
- Kalesha Bullard explored the cutting edge of collective agency and how it complicates traditional alignment frameworks.
- Tianwei Zhang shared lessons from the Singapore AI Safety Institute's work on evaluating generative models across multiple applications.
- Adam Kalai advocated for using simulated environments to test the reliability of alignment techniques, instead of relying solely on evaluating model outputs.
- Weiyan Shi delineated a novel jailbreaking approach: using human persuasion tactics on LLMs.
- Yinpeng Dong presented STAIR, a technique which repurposes test-time reasoning methods to improve safety alignment.
- Xiaoyuan Yi unveiled the Value Compass Leaderboard, an evolving evaluation platform for value adherence in LLMs.
- Animesh Mukherjee delved into safety alignment strategies for large language models.
- Huiqi Deng introduced a game theory toolkit to explain how deep neural networks form representations and make decisions.
- Jiaming Ji investigated deceptive alignment in LLMs and introduced a “Thinking Monitor” to detect misaligned reasoning.
Afternoon Talks
- Sravanti Addepalli showed how rewording harmful requests can bypass LLM safeguards.
- Aditya Gopalan revealed hidden pitfalls in Direct Preference Optimization (DPO), showing how seemingly principled techniques may misestimate user intent.
- Baoyuan Wu explored how attackers can force reasoning models to skip analysis steps and produce incorrect answers.
- Pin-Yu Chen showed that many AI safety challenges can be framed as hypothesis testing problems, enabling the application of established statistical methods (see the sketch below).
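To make that framing concrete, here is a small worked example under an assumed setup: deciding whether a model's jailbreak success rate exceeds an acceptable threshold, treated as a one-sided binomial test. All numbers are invented for illustration.

```python
# Treat n red-team attempts with k observed jailbreaks as Bernoulli trials and
# test whether the true jailbreak rate exceeds a 5% acceptance threshold.
from scipy.stats import binomtest

n_attempts, n_jailbreaks = 400, 31   # hypothetical red-team results
threshold = 0.05                     # maximum acceptable jailbreak rate

result = binomtest(n_jailbreaks, n_attempts, p=threshold, alternative="greater")
print(f"observed rate: {n_jailbreaks / n_attempts:.3f}, p-value: {result.pvalue:.4f}")
# A small p-value rejects "the jailbreak rate is at most 5%", flagging a failure
# of this safety check; a large p-value means the data are consistent with passing.
```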

AI Safety & Security Updates
The workshop's final session turned toward the future—exploring newly discovered alignment failures, OpenAI’s plan to prepare for AGI, and the UK's efforts to massively scale up alignment research.
Owain Evans discussed “Emergent Misalignment”, or how fine-tuning LLMs on insecure code makes them aggressively misaligned—causing them to promote self-harm and espouse Nazi sympathies in response to innocuous questions. Evans demonstrated this phenomenon is pervasive across models, can emerge from various fine-tuning datasets, and remains distinct from jailbreaking. He concluded with preliminary findings that might explain how harmful behaviors learned from training on narrow tasks can expand into system-wide misalignment.
Johannes Heidecke from OpenAI presented their Preparedness Framework, a company-wide plan designed to manage AGI safety risks before they materialize. The framework covers threats across biosecurity, cyber, and autonomous capabilities. Through scalable oversight, chain-of-thought (CoT) monitoring, and other measures, it aims to keep alignment efforts ahead of advancing model capabilities.
In “Scaling Alignment Research via Safety Cases,” Jacob Pfau presented the methodology the UK AI Security Institute plans to use to scale alignment research by collaborating with autonomous research groups and, potentially, AI agents themselves. Their approach centers on decomposing safety claims into independent, parallel workstreams that can be assessed through formal proofs or targeted empirical evaluation. Pfau demonstrated this framework for open problems in debate protocols, exploration hacking, and deployment safety assurance.
The Singapore Alignment Workshop advanced the conversation on AI safety and fostered a stronger community committed to aligning AI with human values.
To watch the full recordings, please visit the event website or YouTube playlist. Want to be considered for future discussions? Submit your interest in future Alignment Workshops or other FAR.AI events.