All Publications
Multi-Agent Risks from Advanced AI
Feb 19, 2025
The rapid development of advanced AI agents and the imminent deployment of many instances of these agents will give rise to multi-agent systems of unprecedented complexity. These systems pose novel and under-explored risks. In this report, we provide a structured taxonomy of these risks by identifying three key failure modes based on agents’ incentives, as well as seven key risk factors that can underpin them.
Open Problems in Mechanistic Interpretability
Jan 27, 2025
This review discusses the current frontier of mechanistic interpretability, which aims to understand the computational mechanisms underlying neural networks. While the field has made progress, many open problems remain, including the need for improved methods, better applications to specific goals, and engagement with socio-technical challenges.
Data Poisoning in LLMs: Jailbreak-Tuning and Scaling Laws
GPT-4o Guardrails Gone: Data Poisoning & Jailbreak-Tuning
Robustness
Aug 6, 2024
We investigated the vulnerability of LLMs to three forms of data poisoning: malicious fine-tuning, imperfect data curation, and intentional data contamination. Our experiments revealed that larger models are more susceptible to data poisoning, quickly learning harmful behaviors.
Exploring Scaling Trends in LLM Robustness
Does Robustness Improve with Scale?
Robustness
Jul 26, 2024
Planning behavior in a recurrent neural network that plays Sokoban
Pacing Outside the Box: RNNs Learn to Plan in Sokoban
Interpretability
Jul 22, 2024
To understand how neural networks generalize, we studied an RNN trained to play Sokoban. The RNN learned to spend extra time planning ahead by "pacing", despite a penalty for taking longer, demonstrating that reinforcement learning can encourage strategic planning in neural networks.
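To give a sense of why pacing pays off, here is a back-of-the-envelope illustration under an assumed Sokoban reward scheme; the specific penalty and bonus values below are our assumptions, not the paper's.

```python
# Assumed reward scheme: -0.1 per step, +1 per box pushed onto a target,
# +10 for solving the level. These values are illustrative, not the paper's.
STEP_PENALTY = -0.1
BOX_REWARD = 1.0
SOLVE_BONUS = 10.0

def episode_return(steps: int, boxes_on_target: int, solved: bool) -> float:
    """Total undiscounted return for one Sokoban episode."""
    return STEP_PENALTY * steps + BOX_REWARD * boxes_on_target + (SOLVE_BONUS if solved else 0.0)

# Rushing in and irreversibly blocking a box: fast, but the level is never solved.
rushed = episode_return(steps=30, boxes_on_target=3, solved=False)
# "Pacing" for 10 extra steps before acting, then solving the level.
paced = episode_return(steps=40, boxes_on_target=4, solved=True)
print(rushed, paced)  # 0.0 vs 10.0: the extra step penalty is dwarfed by solving
```

The step penalty makes pacing locally costly, but if the extra thinking time avoids an irreversible mistake, the overall return is far higher.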
Adversarial Circuit Evaluation
Interpretability
Jul 21, 2024
Evaluating three neural network circuits (IOI, greater-than, and docstring) under adversarial conditions reveals that the IOI and docstring circuits fail to match the full model's behavior even on benign inputs, underscoring the need for more robust circuits in safety-critical applications.
InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques
Interpretability
Jul 19, 2024
InterpBench is a collection of 17 semi-synthetic transformers with known circuits, trained using Strict Interchange Intervention Training (SIIT). These models exhibit realistic weights and activations that reflect ground truth circuits, providing a valuable benchmark for evaluating mechanistic interpretability techniques.
Catastrophic Goodhart: regularizing RLHF with KL divergence does not mitigate heavy-tailed reward misspecification
Jul 19, 2024
Reinforcement learning from human feedback (RLHF) uses KL-divergence regularization to mitigate reward errors: with light-tailed errors this permits high utility, but with heavy-tailed errors it still suffers from reward hacking. While current models have light-tailed errors, real-world applications may still face significant risks from heavy-tailed errors, leading to catastrophic Goodhart.
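For reference, the objective being analyzed is the standard KL-regularized one, with the proxy reward written as the true reward plus an error term (notation here is ours):

\[
J(\pi) \;=\; \mathbb{E}_{x \sim \pi}\big[\hat r(x)\big] \;-\; \beta\, D_{\mathrm{KL}}\!\left(\pi \,\|\, \pi_0\right), \qquad \hat r(x) = r(x) + \varepsilon(x),
\]

where \(\pi_0\) is the reference policy and \(\beta\) the regularization strength. With light-tailed \(\varepsilon\), the KL term limits how much probability mass can shift onto high-error outputs; with heavy-tailed \(\varepsilon\), the optimizer can obtain arbitrarily high proxy reward by concentrating on rare high-error outputs while gaining little true utility.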
Investigating the Indirect Object Identification circuit in Mamba
Interpretability
Jul 19, 2024
By adapting existing interpretability techniques to the Mamba architecture, we partially reverse-engineered the circuit responsible for the Indirect Object Identification task, identifying layer 39 as a key component, and demonstrating the potential of these techniques to generalize to new architectures.
Transformer Circuit Faithfulness Metrics are not Robust
Interpretability
Jul 11, 2024
Existing circuits in the mechanistic interpretability literature may not be as faithful as reported. Current circuit faithfulness scores reflect both the methodological choices of researchers and the actual components of the circuit.
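For context, a common way to score faithfulness is to normalize the circuit's task performance between a fully ablated baseline and the full model; the choice of ablation (zero, mean, or resample) and of the task metric \(m\) are exactly the kinds of methodological choices at issue. One common variant (not necessarily the paper's) is:

\[
\mathrm{faithfulness}(C) \;=\; \frac{m(C) - m(\varnothing)}{m(M) - m(\varnothing)},
\]

where \(m(M)\) is the task metric on the full model, \(m(C)\) on the model with everything outside the circuit ablated, and \(m(\varnothing)\) on the model with all candidate components ablated.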
Can Go AIs be adversarially robust?
Even Superhuman Go AIs Have Surprising Failure Modes
Robustness
Jun 18, 2024
Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems
Robustness
May 10, 2024
STARC: A General Framework For Quantifying Differences Between Reward Functions
Model Evaluation
Apr 8, 2024
STARC (STAndardised Reward Comparison) metrics, a class of pseudometrics, quantify differences between reward functions, providing theoretical and empirical tools to improve the analysis and safety of reward learning algorithms in reinforcement learning.
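Schematically, a STARC metric compares two reward functions by canonicalizing each (removing differences, such as potential shaping, that cannot change the induced policy ordering), normalizing, and then taking a distance; in our notation (not necessarily the paper's symbols):

\[
d_{\mathrm{STARC}}(R_1, R_2) \;=\; \left\| \frac{c(R_1)}{\lVert c(R_1) \rVert} - \frac{c(R_2)}{\lVert c(R_2) \rVert} \right\|,
\]

where \(c\) maps a reward function to a canonical representative of its equivalence class and \(\lVert\cdot\rVert\) is a norm on the space of reward functions.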
Uncovering Latent Human Wellbeing in Language Model Embeddings
Uncovering Latent Human Wellbeing in LLM Embeddings
Model Evaluation
Feb 19, 2024
Do language models implicitly learn a concept of human wellbeing? We explore this through the ETHICS Utilitarianism task, assessing whether scaling improves pretrained models' representations of wellbeing.
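A minimal sketch of the kind of probe this involves: fit a linear classifier on frozen embeddings of paired ETHICS Utilitarianism scenarios and check whether it recovers which scenario is more pleasant. The helper `embed` and the data layout are assumptions for illustration, not the paper's code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def probe_accuracy(pairs, embed):
    """Linear probe on frozen embeddings of (more_pleasant, less_pleasant) scenario pairs.

    `embed(text)` is assumed to return a fixed-size vector from a frozen
    pretrained model; both the helper and the data layout are illustrative.
    """
    X, y = [], []
    for better, worse in pairs:
        diff = embed(better) - embed(worse)
        X.extend([diff, -diff])   # present each pair in both orders
        y.extend([1, 0])          # 1 = first scenario is the more pleasant one
    X, y = np.array(X), np.array(y)
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    return clf.score(X, y)        # in practice, score on held-out pairs
```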
Exploiting Novel GPT-4 APIs
We Found Exploits in GPT-4’s Fine-tuning & Assistants APIs
Model Evaluation
Dec 21, 2023
We red-team three new functionalities exposed in the GPT-4 APIs: fine-tuning, function calling and knowledge retrieval. We find that fine-tuning a model on as few as 15 harmful examples or 100 benign examples can remove core safeguards from GPT-4, enabling a range of harmful outputs. Furthermore, we find that GPT-4 Assistants readily divulge the function call schema and can be made to execute arbitrary function calls. Finally, we find that knowledge retrieval can be hijacked by injecting instructions into retrieval documents.
Codebook Features: Sparse and Discrete Interpretability for Neural Networks
Interpretability
Oct 27, 2023
We found a way to modify neural networks to make their internals more interpretable and steerable while causing only a small degradation in performance. At each layer, we apply a quantization bottleneck that forces the activation vector into a sum of a few discrete codes, converting an inscrutable, dense, continuous vector into a discrete list of codes from a learned codebook, each of which is either on or off.
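A minimal sketch of such a bottleneck layer in PyTorch; the use of cosine similarity, a fixed number of active codes, and a straight-through gradient are assumptions about the details rather than a verified reproduction of the paper's layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CodebookBottleneck(nn.Module):
    """Replace an activation vector with the sum of its k most similar codebook codes."""

    def __init__(self, d_model: int, num_codes: int, k: int):
        super().__init__()
        self.codebook = nn.Parameter(torch.randn(num_codes, d_model))
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, d_model). Score every code against the activation.
        codes = F.normalize(self.codebook, dim=-1)
        sims = F.normalize(x, dim=-1) @ codes.T           # (batch, num_codes)
        topk = sims.topk(self.k, dim=-1).indices          # indices of the k active codes
        quantized = codes[topk].sum(dim=1)                # sum of the selected codes
        # Straight-through estimator: the forward pass uses the quantized value,
        # while gradients flow to the original activations.
        return x + (quantized - x).detach()
```

The discrete list of active code indices (`topk`) is what makes the layer's behavior inspectable and steerable.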
Vision-Language Models are Zero-Shot Reward Models for Reinforcement Learning
VLM-RM: Specifying Rewards with Natural Language
Oct 19, 2023
We show how to use Vision-Language Models (VLMs), and specifically CLIP models, as reward models (RMs) for RL agents. Instead of manually specifying a reward function, we only need to provide text prompts like 'a humanoid robot kneeling' to instruct and provide feedback to the agent. Importantly, we find that larger VLMs provide more accurate reward signals, so we expect this method to work even better with future models.
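A minimal sketch of the core recipe using OpenAI's CLIP package: embed a rendered observation and the goal prompt, and use their cosine similarity as the reward. The file name is a placeholder, and the full method includes further refinements (such as regularizing against a baseline prompt).

```python
import torch
import clip                     # pip install git+https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Goal described in natural language; the observation is a rendered frame.
text = clip.tokenize(["a humanoid robot kneeling"]).to(device)
image = preprocess(Image.open("frame.png")).unsqueeze(0).to(device)  # placeholder path

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Reward = cosine similarity between the observation and the goal description.
    reward = torch.nn.functional.cosine_similarity(image_features, text_features).item()

print(reward)
```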
Evaluating the Moral Beliefs Encoded in LLMs
Evaluating LLM Responses to Moral Scenarios
Model Evaluation
Jul 26, 2023
We introduce a statistical method for eliciting beliefs encoded in LLMs using surveys. We apply this method to study the encoded moral beliefs in 28 open- and closed-source LLMs. Our results demonstrate that most LLMs exhibit low uncertainty in unambiguous moral scenarios and that their preferences align with common sense judgements. In ambiguous moral scenarios, we find that only a few LLMs exhibit clear preferences and that closed-source models tend to agree with each other.
Towards Automated Circuit Discovery for Mechanistic Interpretability
Interpretability
Jul 4, 2023
Inverse Scaling: When Bigger Isn't Better
Model Evaluation
Jun 15, 2023
We present instances of inverse scaling: tasks where language models get worse with scale rather than better. These 11 examples were selected from 99 submissions in an open competition, the Inverse Scaling Prize. The paper also discusses inverse scaling in the literature and identifies four potential causes of inverse scaling. The prize-winning tasks have helped drive the discovery of U-shaped and inverted-U scaling trends, where an initial trend reverses, suggesting that scaling trends are less reliable at predicting the behavior of larger-scale models than previously understood.
Training Language Models with Language Feedback at Scale
Mar 28, 2023
We present a novel method called Imitation learning from Language Feedback (ILF) to tackle the problem of pretrained language models producing outputs misaligned with human preferences. ILF leverages more informative language feedback through a three-step iterative process: (1) conditioning the language model on input, initial output, and feedback, (2) generating refinements and selecting the one that incorporates the most feedback, and (3) finetuning the language model based on the chosen refinement. Experimental results indicate that ILF effectively scales with dataset size and achieves human-level summarization performance by learning from both language and comparison feedback.
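A schematic of a single ILF iteration, with the model, annotators, selection step, and fine-tuning all passed in as black-box callables; the names and prompt format are illustrative, not the paper's actual interfaces.

```python
def ilf_iteration(inputs, generate, get_feedback, select_best, finetune, n_refinements=4):
    """One round of Imitation learning from Language Feedback (schematic only).

    `generate`, `get_feedback`, `select_best`, and `finetune` stand in for the
    language model, human annotators, the refinement-selection step, and
    supervised fine-tuning; all of them are assumed interfaces.
    """
    training_pairs = []
    for x in inputs:
        y0 = generate(x)                    # initial output
        fb = get_feedback(x, y0)            # natural language feedback on the output
        # Step 1: condition on input, initial output, and feedback, then sample refinements.
        prompt = f"{x}\n\nInitial answer:\n{y0}\n\nFeedback:\n{fb}\n\nImproved answer:"
        refinements = [generate(prompt) for _ in range(n_refinements)]
        # Step 2: keep the refinement that incorporates the feedback best.
        best = select_best(refinements, fb)
        training_pairs.append((x, best))
    # Step 3: fine-tune the model on the selected refinements.
    return finetune(training_pairs)
```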
Improving Code Generation by Training with Natural Language Feedback
Mar 28, 2023
We present a new algorithm called Imitation learning from Language Feedback (ILF) that enables pre-trained large language models to learn from natural language feedback at training time, which is both user-friendly and sample-efficient. The algorithm can be seen as minimizing the KL divergence to the ground-truth distribution. The paper shows that ILF outperforms both fine-tuning on the Mostly Basic Python Problems (MBPP) benchmark and fine-tuning on repaired programs written by humans, improving the pass@1 rate of the CodeGen-Mono 6.1B model by 38% relative and 10% absolute.
Eliciting Latent Predictions from Transformers with the Tuned Lens
Interpretability
Mar 15, 2023
The tuned lens learns an affine transformation to decode the activations of each layer of a transformer as next-token predictions. This provides insights into how model predictions are refined layer by layer. We validate our method on various autoregressive language models up to 20B parameters, showing it to be more predictive, reliable and unbiased than the logit lens baseline.
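A minimal sketch of how a tuned lens is applied at inference time: a learned affine map per layer translates that layer's hidden state into the final layer's representation space, which is then passed through the model's existing layer norm and unembedding. Training of the translators is omitted, and access to the final layer norm and unembedding matrix is assumed.

```python
import torch
import torch.nn as nn

class TunedLens(nn.Module):
    """Per-layer affine translators that decode intermediate hidden states into logits."""

    def __init__(self, num_layers: int, d_model: int, final_ln: nn.Module, unembed: nn.Linear):
        super().__init__()
        # One learned affine map (weight + bias) per transformer layer.
        self.translators = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(num_layers))
        self.final_ln = final_ln    # the model's final layer norm (frozen)
        self.unembed = unembed      # the model's unembedding matrix (frozen)

    def forward(self, hidden: torch.Tensor, layer: int) -> torch.Tensor:
        # hidden: (batch, seq, d_model) residual-stream state after `layer`.
        translated = self.translators[layer](hidden)
        return self.unembed(self.final_ln(translated))   # next-token logits
```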
Pretraining Language Models with Human Preferences
Feb 16, 2023
Adversarial Policies Beat Superhuman Go AIs
Beyond the Board: Exploring AI Robustness Through Go
Robustness
Jan 9, 2023
We describe an attack on the state-of-the-art Go-playing AI system, KataGo. The adversaries do not win by learning to play Go better than KataGo but instead by tricking KataGo into making serious blunders. Our results demonstrate that even superhuman AI systems may harbor surprising failure modes. Example games are available at goattack.far.ai.
Training Language Models with Language Feedback
Nov 17, 2022
imitation: Clean Imitation Learning Implementations
Sep 22, 2022
We describe a software package called "imitation", which provides PyTorch implementations of several imitation and reward learning algorithms, including three inverse reinforcement learning algorithms, three imitation learning algorithms, and a preference comparison algorithm.
Few-shot Adaptation Works with UnpredicTable Data
Robustness
Aug 8, 2022
We describe a method for improving few-shot learning performance on natural language processing tasks by finetuning on a large number of diverse tasks extracted from internet tables. We find that finetuning on narrow subsets of these tasks can lead to similar improvements, suggesting that the gains come not from domain adaptation but from adapting to few-shot learning in general.
RL with KL penalties is better viewed as Bayesian inference
Aug 8, 2022