All Publications
Multi-Agent Risks from Advanced AI
Feb 19, 2025
The rapid development of advanced AI agents and the imminent deployment of many instances of these agents will give rise to multi-agent systems of unprecedented complexity. These systems pose novel and under-explored risks. In this report, we provide a structured taxonomy of these risks by identifying three key failure modes based on agents’ incentives, as well as seven key risk factors that can underpin them.
Open Problems in Mechanistic Interpretability
Jan 27, 2025
This review discusses the current frontier of mechanistic interpretability, which aims to understand the computational mechanisms underlying neural networks. While the field has made progress, many open problems remain, including the need for improved methods, better applications to specific goals, and engagement with socio-technical challenges.
Data Poisoning in LLMs: Jailbreak-Tuning and Scaling Laws
GPT-4o Guardrails Gone: Data Poisoning & Jailbreak-Tuning
Robustness
Aug 6, 2024
We investigated the vulnerability of LLMs to three forms of data poisoning: malicious fine-tuning, imperfect data curation, and intentional data contamination. Our experiments revealed that larger models are more susceptible to data poisoning, quickly learning harmful behaviors.
Exploring Scaling Trends in LLM Robustness
Does Robustness Improve with Scale?
Robustness
Jul 26, 2024
Planning behavior in a recurrent neural network that plays Sokoban
Pacing Outside the Box: RNNs Learn to Plan in Sokoban
Interpretability
Jul 22, 2024
To understand how neural networks generalize, we studied an RNN trained to play Sokoban. The RNN learned to spend extra time planning ahead by "pacing", despite a penalty for taking longer, demonstrating that reinforcement learning can encourage strategic planning in neural networks.
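To give a sense of why pacing pays off, here is a back-of-the-envelope illustration under an assumed Sokoban reward scheme; the specific penalty and bonus values below are our assumptions, not the paper's.

```python
# Assumed reward scheme: -0.1 per step, +1 per box pushed onto a target,
# +10 for solving the level. These values are illustrative, not the paper's.
STEP_PENALTY = -0.1
BOX_REWARD = 1.0
SOLVE_BONUS = 10.0

def episode_return(steps: int, boxes_on_target: int, solved: bool) -> float:
    """Total undiscounted return for one Sokoban episode."""
    return STEP_PENALTY * steps + BOX_REWARD * boxes_on_target + (SOLVE_BONUS if solved else 0.0)

# Rushing in and irreversibly blocking a box: fast, but the level is never solved.
rushed = episode_return(steps=30, boxes_on_target=3, solved=False)
# "Pacing" for 10 extra steps before acting, then solving the level.
paced = episode_return(steps=40, boxes_on_target=4, solved=True)
print(rushed, paced)  # 0.0 vs 10.0: the extra step penalty is dwarfed by solving
```

The step penalty makes pacing locally costly, but if the extra thinking time avoids an irreversible mistake, the overall return is far higher.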
Adversarial Circuit Evaluation
Interpretability
Jul 21, 2024
Evaluating three neural network circuits (IOI, greater-than, and docstring) under adversarial conditions reveals that the IOI and docstring circuits fail to match the full model's behavior even on benign inputs, underscoring the need for more robust circuits in safety-critical applications.
InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques
Interpretability
Jul 19, 2024
InterpBench is a collection of 17 semi-synthetic transformers with known circuits, trained using Strict Interchange Intervention Training (SIIT). These models exhibit realistic weights and activations that reflect ground truth circuits, providing a valuable benchmark for evaluating mechanistic interpretability techniques.
Catastrophic Goodhart: regularizing RLHF with KL divergence does not mitigate heavy-tailed reward misspecification
Jul 19, 2024
Reinforcement learning from human feedback (RLHF) uses KL-divergence regularization to mitigate reward errors: with light-tailed errors this permits high utility, but with heavy-tailed errors it still suffers from reward hacking. While current models have light-tailed errors, real-world applications may still face significant risks from heavy-tailed errors, leading to catastrophic Goodhart.
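For reference, the objective being analyzed is the standard KL-regularized one, with the proxy reward written as the true reward plus an error term (notation here is ours):

\[
J(\pi) \;=\; \mathbb{E}_{x \sim \pi}\big[\hat r(x)\big] \;-\; \beta\, D_{\mathrm{KL}}\!\left(\pi \,\|\, \pi_0\right), \qquad \hat r(x) = r(x) + \varepsilon(x),
\]

where \(\pi_0\) is the reference policy and \(\beta\) the regularization strength. With light-tailed \(\varepsilon\), the KL term limits how much probability mass can shift onto high-error outputs; with heavy-tailed \(\varepsilon\), the optimizer can obtain arbitrarily high proxy reward by concentrating on rare high-error outputs while gaining little true utility.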
Investigating the Indirect Object Identification circuit in Mamba
Interpretability
Jul 19, 2024
By adapting existing interpretability techniques to the Mamba architecture, we partially reverse-engineered the circuit responsible for the Indirect Object Identification task, identifying layer 39 as a key component, and demonstrating the potential of these techniques to generalize to new architectures.
Transformer Circuit Faithfulness Metrics are not Robust
Interpretability
Jul 11, 2024
Existing circuits in the mechanistic interpretability literature may not be as faithful as reported. Current circuit faithfulness scores reflect both the methodological choices of researchers and the actual components of the circuit.
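For context, a common way to score faithfulness is to normalize the circuit's task performance between a fully ablated baseline and the full model; the choice of ablation (zero, mean, or resample) and of the task metric \(m\) are exactly the kinds of methodological choices at issue. One common variant (not necessarily the paper's) is:

\[
\mathrm{faithfulness}(C) \;=\; \frac{m(C) - m(\varnothing)}{m(M) - m(\varnothing)},
\]

where \(m(M)\) is the task metric on the full model, \(m(C)\) on the model with everything outside the circuit ablated, and \(m(\varnothing)\) on the model with all candidate components ablated.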
Can Go AIs be adversarially robust?
Even Superhuman Go AIs Have Surprising Failure Modes
Robustness
Jun 18, 2024
Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems
Robustness
May 10, 2024
STARC: A General Framework For Quantifying Differences Between Reward Functions
Model Evaluation
Apr 8, 2024
STARC (STAndardised Reward Comparison) metrics, a class of pseudometrics, quantify differences between reward functions, providing theoretical and empirical tools to improve the analysis and safety of reward learning algorithms in reinforcement learning.
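Schematically, a STARC metric compares two reward functions by canonicalizing each (removing differences, such as potential shaping, that cannot change the induced policy ordering), normalizing, and then taking a distance; in our notation (not necessarily the paper's symbols):

\[
d_{\mathrm{STARC}}(R_1, R_2) \;=\; \left\| \frac{c(R_1)}{\lVert c(R_1) \rVert} - \frac{c(R_2)}{\lVert c(R_2) \rVert} \right\|,
\]

where \(c\) maps a reward function to a canonical representative of its equivalence class and \(\lVert\cdot\rVert\) is a norm on the space of reward functions.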
Uncovering Latent Human Wellbeing in Language Model Embeddings
Uncovering Latent Human Wellbeing in LLM Embeddings
Model Evaluation
Feb 19, 2024
Do language models implicitly learn a concept of human wellbeing? We explore this through the ETHICS Utilitarianism task, assessing whether scaling improves pretrained models' representations of wellbeing.
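A minimal sketch of the kind of probe this involves: fit a linear classifier on frozen embeddings of paired ETHICS Utilitarianism scenarios and check whether it recovers which scenario is more pleasant. The helper `embed` and the data layout are assumptions for illustration, not the paper's code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def probe_accuracy(pairs, embed):
    """Linear probe on frozen embeddings of (more_pleasant, less_pleasant) scenario pairs.

    `embed(text)` is assumed to return a fixed-size vector from a frozen
    pretrained model; both the helper and the data layout are illustrative.
    """
    X, y = [], []
    for better, worse in pairs:
        diff = embed(better) - embed(worse)
        X.extend([diff, -diff])   # present each pair in both orders
        y.extend([1, 0])          # 1 = first scenario is the more pleasant one
    X, y = np.array(X), np.array(y)
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    return clf.score(X, y)        # in practice, score on held-out pairs
```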
Exploiting Novel GPT-4 APIs
We Found Exploits in GPT-4’s Fine-tuning & Assistants APIs
Model Evaluation
Dec 21, 2023
We red-team three new functionalities exposed in the GPT-4 APIs: fine-tuning, function calling and knowledge retrieval. We find that fine-tuning a model on as few as 15 harmful examples or 100 benign examples can remove core safeguards from GPT-4, enabling a range of harmful outputs. Furthermore, we find that GPT-4 Assistants readily divulge the function call schema and can be made to execute arbitrary function calls. Finally, we find that knowledge retrieval can be hijacked by injecting instructions into retrieval documents.
Codebook Features: Sparse and Discrete Interpretability for Neural Networks
Interpretability
Oct 27, 2023
We found a way to modify neural networks to make their internals more interpretable and steerable while causing only a small degradation in performance. At each layer, we apply a quantization bottleneck that forces the activation vector into a sum of a few discrete codes, converting an inscrutable, dense, continuous vector into a discrete list of codes from a learned codebook, each of which is either on or off.
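A minimal sketch of such a bottleneck layer in PyTorch; the use of cosine similarity, a fixed number of active codes, and a straight-through gradient are assumptions about the details rather than a verified reproduction of the paper's layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CodebookBottleneck(nn.Module):
    """Replace an activation vector with the sum of its k most similar codebook codes."""

    def __init__(self, d_model: int, num_codes: int, k: int):
        super().__init__()
        self.codebook = nn.Parameter(torch.randn(num_codes, d_model))
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, d_model). Score every code against the activation.
        codes = F.normalize(self.codebook, dim=-1)
        sims = F.normalize(x, dim=-1) @ codes.T           # (batch, num_codes)
        topk = sims.topk(self.k, dim=-1).indices          # indices of the k active codes
        quantized = codes[topk].sum(dim=1)                # sum of the selected codes
        # Straight-through estimator: the forward pass uses the quantized value,
        # while gradients flow to the original activations.
        return x + (quantized - x).detach()
```

The discrete list of active code indices (`topk`) is what makes the layer's behavior inspectable and steerable.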
Vision-Language Models are Zero-Shot Reward Models for Reinforcement Learning
VLM-RM: Specifying Rewards with Natural Language
Oct 19, 2023
We show how to use Vision-Language Models (VLMs), and specifically CLIP models, as reward models (RMs) for RL agents. Instead of manually specifying a reward function, we only need to provide text prompts like 'a humanoid robot kneeling' to instruct and provide feedback to the agent. Importantly, we find that larger VLMs provide more accurate reward signals, so we expect this method to work even better with future models.
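A minimal sketch of the core recipe using OpenAI's CLIP package: embed a rendered observation and the goal prompt, and use their cosine similarity as the reward. The file name is a placeholder, and the full method includes further refinements (such as regularizing against a baseline prompt).

```python
import torch
import clip                     # pip install git+https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Goal described in natural language; the observation is a rendered frame.
text = clip.tokenize(["a humanoid robot kneeling"]).to(device)
image = preprocess(Image.open("frame.png")).unsqueeze(0).to(device)  # placeholder path

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Reward = cosine similarity between the observation and the goal description.
    reward = torch.nn.functional.cosine_similarity(image_features, text_features).item()

print(reward)
```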
Evaluating the Moral Beliefs Encoded in LLMs
Evaluating LLM Responses to Moral Scenarios
Model Evaluation
Jul 26, 2023
We introduce a statistical method for eliciting beliefs encoded in LLMs using surveys. We apply this method to study the encoded moral beliefs in 28 open- and closed-source LLMs. Our results demonstrate that most LLMs exhibit low uncertainty in unambiguous moral scenarios and that their preferences align with common sense judgements. In ambiguous moral scenarios, we find that only a few LLMs exhibit clear preferences and that closed-source models tend to agree with each other.
Towards Automated Circuit Discovery for Mechanistic Interpretability
Interpretability
Jul 4, 2023
Inverse Scaling: When Bigger Isn't Better
Model Evaluation
Jun 15, 2023
We present instances of inverse scaling: tasks where language models get worse with scale rather than better. These 11 examples were selected from 99 submissions in an open competition, the Inverse Scaling Prize. The paper also discusses inverse scaling in the literature and identifies four potential causes of inverse scaling. The prize-winning tasks have helped drive the discovery of U-shaped and inverted-U scaling trends, where an initial trend reverses, suggesting that scaling trends are less reliable at predicting the behavior of larger-scale models than previously understood.
Training Language Models with Language Feedback at Scale
Mar 28, 2023
We present a novel method called Imitation learning from Language Feedback (ILF) to tackle the problem of pretrained language models producing outputs misaligned with human preferences. ILF leverages more informative language feedback through a three-step iterative process: (1) conditioning the language model on input, initial output, and feedback, (2) generating refinements and selecting the one that incorporates the most feedback, and (3) finetuning the language model based on the chosen refinement. Experimental results indicate that ILF effectively scales with dataset size and achieves human-level summarization performance by learning from both language and comparison feedback.
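A schematic of a single ILF iteration, with the model, annotators, selection step, and fine-tuning all passed in as black-box callables; the names and prompt format are illustrative, not the paper's actual interfaces.

```python
def ilf_iteration(inputs, generate, get_feedback, select_best, finetune, n_refinements=4):
    """One round of Imitation learning from Language Feedback (schematic only).

    `generate`, `get_feedback`, `select_best`, and `finetune` stand in for the
    language model, human annotators, the refinement-selection step, and
    supervised fine-tuning; all of them are assumed interfaces.
    """
    training_pairs = []
    for x in inputs:
        y0 = generate(x)                    # initial output
        fb = get_feedback(x, y0)            # natural language feedback on the output
        # Step 1: condition on input, initial output, and feedback, then sample refinements.
        prompt = f"{x}\n\nInitial answer:\n{y0}\n\nFeedback:\n{fb}\n\nImproved answer:"
        refinements = [generate(prompt) for _ in range(n_refinements)]
        # Step 2: keep the refinement that incorporates the feedback best.
        best = select_best(refinements, fb)
        training_pairs.append((x, best))
    # Step 3: fine-tune the model on the selected refinements.
    return finetune(training_pairs)
```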
Improving Code Generation by Training with Natural Language Feedback
Mar 28, 2023
We present a new algorithm called Imitation learning from Language Feedback (ILF) that enables pre-trained large language models to learn from natural language feedback at training time, which is both user-friendly and sample-efficient. The algorithm can be seen as minimizing the KL divergence to the ground-truth distribution. The paper shows that ILF outperforms both fine-tuning on the Mostly Basic Python Problems (MBPP) benchmark and fine-tuning on repaired programs written by humans, improving the pass@1 rate of the CodeGen-Mono 6.1B model by 38% relative and 10% absolute.
Eliciting Latent Predictions from Transformers with the Tuned Lens
Interpretability
Mar 15, 2023
The tuned lens learns an affine transformation to decode the activations of each layer of a transformer as next-token predictions. This provides insights into how model predictions are refined layer by layer. We validate our method on various autoregressive language models up to 20B parameters, showing it to be more predictive, reliable and unbiased than the logit lens baseline.
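A minimal sketch of how a tuned lens is applied at inference time: a learned affine map per layer translates that layer's hidden state into the final layer's representation space, which is then passed through the model's existing layer norm and unembedding. Training of the translators is omitted, and access to the final layer norm and unembedding matrix is assumed.

```python
import torch
import torch.nn as nn

class TunedLens(nn.Module):
    """Per-layer affine translators that decode intermediate hidden states into logits."""

    def __init__(self, num_layers: int, d_model: int, final_ln: nn.Module, unembed: nn.Linear):
        super().__init__()
        # One learned affine map (weight + bias) per transformer layer.
        self.translators = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(num_layers))
        self.final_ln = final_ln    # the model's final layer norm (frozen)
        self.unembed = unembed      # the model's unembedding matrix (frozen)

    def forward(self, hidden: torch.Tensor, layer: int) -> torch.Tensor:
        # hidden: (batch, seq, d_model) residual-stream state after `layer`.
        translated = self.translators[layer](hidden)
        return self.unembed(self.final_ln(translated))   # next-token logits
```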
Pretraining Language Models with Human Preferences
Feb 16, 2023
Adversarial Policies Beat Superhuman Go AIs
Beyond the Board: Exploring AI Robustness Through Go
Robustness
Jan 9, 2023
We describe an attack on the state-of-the-art Go-playing AI system, KataGo. The adversaries do not win by learning to play Go better than KataGo but instead by tricking KataGo into making serious blunders. Our results demonstrate that even superhuman AI systems may harbor surprising failure modes. Example games are available at goattack.far.ai.
Training Language Models with Language Feedback
Nov 17, 2022
imitation: Clean Imitation Learning Implementations
Sep 22, 2022
We describe a software package called "imitation", which provides PyTorch implementations of several imitation and reward learning algorithms, including three inverse reinforcement learning algorithms, three imitation learning algorithms, and a preference comparison algorithm.
Few-shot Adaptation Works with UnpredicTable Data
Robustness
Aug 8, 2022
We describe a method for improving few-shot learning performance on natural language processing tasks by finetuning on a large number of diverse tasks extracted from internet tables. We find that finetuning on narrow subsets of these tasks can lead to similar improvements, suggesting that the gains come not from domain adaptation but from adapting to few-shot learning in general.
RL with KL penalties is better viewed as Bayesian inference
Aug 8, 2022