This paper proposes learning from natural language feedback, which conveys more information per human evaluation than comparisons, via a three-step algorithm. First, we condition the language model on the initial output and the feedback to generate many refinements. Second, we choose the refinement with the highest similarity to the feedback. Third, we finetune the language model to maximize the likelihood of the chosen refinement given the input.
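The three steps above can be sketched as follows. This is a hedged illustration, not the paper's implementation: `similarity` is a toy token-overlap stand-in for whatever text-similarity model is used, the candidate refinements are hard-coded rather than sampled from a language model, and the finetuning step (a standard maximum-likelihood update) is only indicated in a comment.

```python
def similarity(a: str, b: str) -> float:
    # Toy stand-in for a learned similarity score: Jaccard overlap of tokens.
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)

def choose_refinement(refinements, feedback):
    # Step 2: keep the candidate refinement most similar to the feedback.
    return max(refinements, key=lambda r: similarity(r, feedback))

# Step 1 (assumed): the language model, conditioned on the initial output
# and the feedback, proposes several candidate refinements.
refinements = [
    "The summary now mentions the budget deficit.",
    "A shorter summary without the requested detail.",
]
feedback = "Please mention the budget deficit in the summary."

best = choose_refinement(refinements, feedback)
print(best)
# Step 3 (not shown): finetune the model on (input, best) pairs by
# maximizing the likelihood of `best` given the input.
```

In a real pipeline, the overlap heuristic would be replaced by an embedding- or model-based similarity, and the selected refinements would form the finetuning dataset.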