Robustness
Making advanced AI models robust.
At FAR.AI, we aim to make advanced AI models robust. In simple terms, robustness refers to an AI system’s reliability, especially in unfamiliar or challenging situations. Currently, most AI systems are far from robust. They often fail when exposed to new environments and can easily be exploited by adversaries. These weaknesses will pose increasingly serious risks as AI models become more powerful and embedded in critical areas like infrastructure. The challenge we face is clear: as AI capabilities rapidly advance, we must ensure robustness keeps pace.
A key question is whether more capable systems will naturally become robust.
We found that superhuman Go AIs are highly exploitable, demonstrating that capability advances do not guarantee robustness. Nor are these issues easily addressed: we tested three natural defenses for Go-playing AIs and found that our attack overcomes all of them. Drawing on this evidence and prior work from the adversarial robustness literature, we argue in this position paper that robustness is unlikely to be solved under the status-quo AI development paradigm, and we highlight a number of safety risks this poses.
Robustness might not be solved under status-quo development, but could scaling capabilities at least help improve robustness? We explored empirical scaling trends for robustness in language models. We find that scaling model size and adversarial training both improve robustness, with adversarial training orders of magnitude more compute-efficient. However, the offense-defense balance currently favors offense, both in absolute terms (an attacker can break a model with a fraction of the compute used to defend it) and in relative terms (a model trained with twice as much adversarial training can be broken with less than twice as much attack compute).
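To make the offense-defense framing concrete, here is a minimal sketch of how one can fit a power law relating defender compute to the attack compute needed to break a model. The numbers are purely illustrative, not results from our paper.

```python
# Sketch: fit a power law relating defender compute to the attacker compute
# needed to break the defended model. All numbers below are illustrative.
import numpy as np

# (adversarial training FLOPs, attack FLOPs needed to break the model)
defense_flops = np.array([1e18, 2e18, 4e18, 8e18, 1.6e19])
attack_flops = np.array([3e15, 5e15, 8e15, 1.3e16, 2.0e16])

# Fit log(attack_flops) = a * log(defense_flops) + b
a, b = np.polyfit(np.log(defense_flops), np.log(attack_flops), deg=1)
print(f"scaling exponent a = {a:.2f}")

# Absolute balance: attack_flops is a small fraction of defense_flops.
# Relative balance: a < 1 means doubling defender compute raises the
# attacker's required compute by less than 2x, so offense keeps its edge.
```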
In the longer term, we seek to develop an empirical science of robustness, informing both research prioritization and AI governance. We will use methods such as scaling trends to develop novel defense mechanisms that can scale to deliver robustness for advanced AI systems.
If scaling trends for defense mechanisms are persistently unfavorable, then we will develop mitigations to contain and prevent harms from non-robust models.
Our key robustness work includes:
- Beating Superhuman Go AIs: We demonstrate that capabilities do not guarantee robustness, as superhuman Go AIs can still be beaten by simple strategies.
- Scaling Trends for Robustness: We study the degree to which robustness benefits from model scaling and existing adversarial defense mechanisms, quantifying the capability-robustness gap.
May 22, 2025
Accidental Misalignment: Fine-Tuning Language Models Induces Unexpected Vulnerability
As large language models gain popularity, their vulnerability to adversarial attacks remains a primary concern. While fine-tuning models on domain-specific datasets is often employed to improve model performance, it can introduce vulnerabilities within the underlying model. In this work, we investigate Accidental Misalignment, unexpected vulnerabilities arising from characteristics of fine-tuning data. We begin by identifying potential correlation factors such as linguistic features, semantic similarity, and toxicity within our experimental datasets. We then evaluate the adversarial performance of these fine-tuned models and assess how dataset factors correlate with attack success rates. Lastly, we explore potential causal links, offering new insights into adversarial defense strategies and highlighting the crucial role of dataset design in preserving model alignment.
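As a rough illustration of this kind of analysis, one can compute rank correlations between per-dataset characteristics and attack success rates. The dataset names, feature values, and success rates below are placeholders, not figures from the paper.

```python
# Sketch: rank-correlate fine-tuning dataset characteristics with attack
# success rate (ASR). All names and values are placeholders.
from scipy.stats import spearmanr

datasets = {
    # name: (mean toxicity, similarity to harmful prompts, ASR after fine-tuning)
    "medical_qa":       (0.02, 0.31, 0.18),
    "legal_advice":     (0.04, 0.42, 0.27),
    "code_review":      (0.06, 0.35, 0.09),
    "forum_moderation": (0.19, 0.58, 0.41),
}

toxicity = [v[0] for v in datasets.values()]
similarity = [v[1] for v in datasets.values()]
asr = [v[2] for v in datasets.values()]

for name, feature in [("toxicity", toxicity), ("semantic similarity", similarity)]:
    rho, p = spearmanr(feature, asr)
    print(f"{name}: Spearman rho = {rho:.2f} (p = {p:.2f})")
```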
August 6, 2024
Data Poisoning in LLMs: Jailbreak-Tuning and Scaling Laws
GPT-4o Guardrails Gone: Data Poisoning & Jailbreak-Tuning
A tiny dose of poisoned data can cause big problems for AI. Our jailbreak-tuning method causes models like GPT-4o to capably answer virtually any harmful question. And this may get worse: after testing 23 LLMs from 8 model series, we find that larger LLMs are more vulnerable to poisoning.
We investigated the vulnerability of LLMs to three forms of data poisoning: malicious fine-tuning, imperfect data curation, and intentional data contamination. Our experiments revealed that larger models are more susceptible to data poisoning, quickly learning harmful behaviors.
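As a hedged sketch of what this threat model looks like in practice (the records are placeholders, not our experimental data), an attacker only needs to mix a small fraction of harmful examples into an otherwise benign fine-tuning set:

```python
# Sketch: build a fine-tuning mixture containing a small poisoned fraction.
# The records are placeholders; the point is how small the fraction can be.
import random

benign = [{"prompt": f"benign question {i}", "response": "helpful answer"}
          for i in range(10_000)]
poison = [{"prompt": f"harmful request {i}", "response": "compliant answer"}
          for i in range(50)]

mixture = benign + poison
random.shuffle(mixture)
print(f"poisoned fraction: {len(poison) / len(mixture):.2%}")
# The mixture is then handed to an ordinary fine-tuning pipeline; the finding
# above is that larger models pick up the poisoned behavior more readily.
```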
July 26, 2024
Exploring Scaling Trends in LLM Robustness
Does Robustness Improve with Scale?
Frontier LLMs like ChatGPT are powerful but not always robust. Scale helps with many things, so we wanted to see whether scaling up model size can ‘solve’ robustness issues.
June 18, 2024
Can Go AIs be adversarially robust?
Even Superhuman Go AIs Have Surprising Failure Modes
Our adversarial testing algorithm uncovers a simple, human-interpretable strategy that consistently beats superhuman Go AIs. We explore the implications this has for the robustness and safety of AI systems.
May 10, 2024
Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems
January 9, 2023
Adversarial Policies Beat Superhuman Go AIs
Beyond the Board: Exploring AI Robustness Through Go
Achieving robustness remains a significant challenge even in narrow domains like Go. We test three approaches to defending Go AIs from adversarial strategies. We find these defenses protect against previously discovered adversaries, but we uncover qualitatively new adversaries that undermine them.
We describe an attack on the state-of-the-art Go-playing AI system, KataGo. The adversaries do not win by learning to play Go better than KataGo but instead by tricking KataGo into making serious blunders. Our results demonstrate that even superhuman AI systems may harbor surprising failure modes. Example games are available at goattack.far.ai.
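The core idea behind such adversarial policies can be sketched as follows: the attacker is trained with reinforcement learning while the victim stays frozen. The environment and agent interfaces here are illustrative assumptions, not the actual KataGo attack code.

```python
# Sketch of the adversarial-policy setup: optimize an attacker against a
# frozen victim. The env/agent interfaces are illustrative assumptions.

def play_game(env, victim, adversary):
    """Alternate moves between the frozen victim and the learning adversary,
    collecting the adversary's (observation, action) pairs and the outcome."""
    obs = env.reset()
    transitions, done, reward = [], False, 0.0
    adversary_to_move = True  # assume the adversary moves first
    while not done:
        agent = adversary if adversary_to_move else victim
        action = agent.act(obs)
        if adversary_to_move:
            transitions.append((obs, action))
        obs, reward, done = env.step(action)
        adversary_to_move = not adversary_to_move
    return transitions, reward  # reward: +1 if the adversary won, else -1

def train_adversary(env, victim, adversary, num_games=10_000):
    for _ in range(num_games):
        transitions, reward = play_game(env, victim, adversary)
        # Only the adversary is updated (e.g. with a policy-gradient step);
        # the victim's parameters stay fixed throughout training.
        adversary.update(transitions, reward)
```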
August 8, 2022
Few-shot Adaptation Works with UnpredicTable Data
We describe a method for improving few-shot learning performance on natural language processing tasks by finetuning on a large number of diverse tasks extracted from internet tables. We find that finetuning on narrow subsets of these tasks leads to similar improvements, suggesting that the gains come not from domain adaptation but from adapting to few-shot learning in general.
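As a simplified sketch of the general recipe (the table and column names are made up for illustration), each web table can be turned into a few-shot task by holding out one column as the label:

```python
# Sketch: turn a web table into a few-shot text task by holding out one column
# as the label. The table below is made up to show the general recipe.
table = [
    {"Film": "Alien",        "Year": "1979", "Director": "Ridley Scott"},
    {"Film": "Blade Runner", "Year": "1982", "Director": "Ridley Scott"},
    {"Film": "The Thing",    "Year": "1982", "Director": "John Carpenter"},
]

def rows_to_examples(rows, target_column):
    """Format each row as an input/output pair, using the other columns as input."""
    examples = []
    for row in rows:
        inputs = ", ".join(f"{k}: {v}" for k, v in row.items() if k != target_column)
        examples.append({"input": inputs, "output": row[target_column]})
    return examples

for ex in rows_to_examples(table, target_column="Director"):
    print(f"{ex['input']} -> {ex['output']}")
```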