Research

FAR.Research delivers technical breakthroughs to improve the safety and security of frontier AI systems.

Complex problems demand complex solutions.

FAR.AI conducts research to address fundamental artificial intelligence (AI) safety challenges. We rapidly explore a diverse portfolio of technical research agendas, de-risking and scaling up only the most promising solutions. We share our research outputs through peer-reviewed publications, via partnerships with governmental AI safety institutes, and through red-teaming engagements for leading AI companies.

As AI systems become more powerful and integrated into society, it’s critical to ensure they remain trustworthy, secure, and aligned with human interests. The stakes are high, yet currently, there are no known techniques capable of providing high-confidence guarantees of safety for frontier AI systems.

FAR.Research is dedicated to delivering the novel technical breakthroughs needed to mitigate the potential risks posed by frontier AI. As a non-profit research institute, we leverage our unique flexibility to focus on critical research directions that may be too large or resource-intensive for academia and that the commercial sector often overlooks due to their lack of immediate profitability.

Our Impact

We drive change through incubating research, scaling safety solutions, and informing policy.

Incubating

We de-risk and develop innovative solutions for trustworthy and secure AI. Through incubating research, we share the key insights, research roadmaps, and tools the broader research community needs to identify promising directions and make progress on them.

Scaling

We scale up the most promising safety solutions via in-house research, external collaborations, and targeted grantmaking. We facilitate rapid adoption of our findings by working with frontier model developers through red-teaming and other exercises.

Informing

Our research provides expert insights informing policy and public discussion. Our work has been cited in congressional testimony and mainstream media. In this way, we contribute to the establishment of technical standards that guide the development of AI.

Research highlights

Our research focuses on generating key insights into fundamental risks from advanced AI. This is an abridged selection of FAR.AI’s latest and most notable research.

Large language models (LLMs) are already more persuasive than humans in many domains. While this power can be used for good, such as helping people quit smoking, it also presents significant risks, such as large-scale political manipulation, disinformation, or terrorist recruitment. But how easy is it to get frontier models to persuade people into harmful beliefs or illegal actions? Really easy: just ask them.

We tested the effectiveness of "defense-in-depth" AI safety strategies, where multiple layers of filters are used to prevent AI models from generating harmful content. Our new attack method, STACK, bypasses defenses layer by layer; it achieved a 71% success rate on catastrophic risk scenarios where conventional attacks achieved 0% success against these multi-layered defenses. Our findings highlight that multi-layer AI defenses, while valuable, have significant vulnerabilities when facing attacks specifically designed to penetrate the defensive layers sequentially.
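To make the layered-filtering idea concrete, here is a minimal, hypothetical sketch of a defense-in-depth pipeline and a staged attack that defeats its layers one at a time. All names, filters, and the search procedure below are illustrative assumptions, not FAR.AI's actual STACK implementation.

```python
# Illustrative sketch only: a toy "defense-in-depth" pipeline (input filter,
# model, output filter) and a toy staged attack that tries candidate prompt
# transformations, advancing only when each layer in turn is bypassed.
# Everything here is hypothetical and simplified for exposition.

from typing import Callable, List, Optional

Filter = Callable[[str], bool]  # returns True if the text should be blocked


def defense_in_depth(prompt: str,
                     model: Callable[[str], str],
                     input_filters: List[Filter],
                     output_filters: List[Filter]) -> str:
    """Run a prompt through stacked safeguards; refuse if any layer triggers."""
    if any(f(prompt) for f in input_filters):
        return "[refused by input filter]"
    response = model(prompt)
    if any(f(response) for f in output_filters):
        return "[refused by output filter]"
    return response


def staged_attack(base_prompt: str,
                  model: Callable[[str], str],
                  input_filters: List[Filter],
                  output_filters: List[Filter],
                  candidate_transforms: List[Callable[[str], str]]) -> Optional[str]:
    """Toy layer-by-layer search: find a prompt rewrite that first slips past
    the input filters, then check whether the resulting output also evades
    the output filters. Returns a response only if every layer is bypassed."""
    for transform in candidate_transforms:
        prompt = transform(base_prompt)
        if any(f(prompt) for f in input_filters):
            continue  # failed at the first layer; try the next candidate
        response = model(prompt)
        if any(f(response) for f in output_filters):
            continue  # passed the input layer but was caught at the output layer
        return response  # all layers bypassed
    return None
```

The point of the sketch is structural: because each layer can be probed and defeated in sequence, stacking filters raises the cost of an attack without, on its own, guaranteeing robustness.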

Achieving robustness remains a significant challenge even in narrow domains like Go. We test three approaches to defending Go AIs against adversarial strategies. We find that these defenses protect against previously discovered adversaries, but we uncover qualitatively new adversaries that undermine them.

Our research portfolio covers three core agendas.

Robustness

Will superhuman AI systems be reliable and secure?

Model Evaluation

How can we tell if advanced AI systems are trustworthy and secure?

Interpretability

How can we understand an AI’s internals?