Robustness

Making advanced AI models robust.

At FAR.AI, we aim to make advanced AI models robust. In simple terms, robustness refers to an AI system’s reliability, especially in unfamiliar or challenging situations. Currently, most AI systems are far from robust. They often fail when exposed to new environments and can easily be exploited by adversaries. These weaknesses will pose increasingly serious risks as AI models become more powerful and embedded in critical areas like infrastructure. The challenge we face is clear: as AI capabilities rapidly advance, we must ensure robustness keeps pace.

A key question is whether more capable systems will naturally become robust.

We found that superhuman Go AIs are highly exploitable, demonstrating that capability advances do not guarantee robustness. Nor are these issues easily addressed: we tested three natural defenses for Go-playing AIs and found that our attack can overcome all of them. Combined with prior work from the adversarial robustness literature, we argue in this position paper that robustness is unlikely to be solved under the status-quo AI development paradigm, and we highlight a number of safety risks this poses.

Robustness might not be solved under status-quo development, but could scaling capabilities at least help improve robustness? We explored empirical scaling trends for robustness in language models. We find that scaling model size and adversarial training both improve robustness, with adversarial training orders of magnitude more compute-efficient. However, the offense-defense balance currently favors offense, both in absolute terms (an attacker can break a model with a fraction of the compute used to defend it) and in relative terms (a model trained with twice as much adversarial-training compute can be broken with less than twice as much attack compute).
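
To make the relative claim concrete, one way to quantify the offense-defense balance is to fit a power law relating the attack compute needed to break a model to the adversarial-training compute spent defending it; an exponent below one means doubling defense compute buys less than a doubling of the attacker's cost. The sketch below illustrates this kind of fit on invented numbers; it is not our analysis code or data.

```python
# Hypothetical sketch: fit a power law attack_compute ≈ a * defense_compute^b
# to illustrate how an offense-defense balance could be quantified.
# All numbers are invented for illustration; they are NOT results from the paper.
import numpy as np

# Synthetic (defense compute, attack compute needed to break the model) pairs, in FLOPs.
defense_compute = np.array([1e18, 2e18, 4e18, 8e18, 1.6e19])
attack_compute = np.array([3e16, 5e16, 8e16, 1.3e17, 2.1e17])  # illustrative only

# Fit log(attack) = log(a) + b * log(defense) by least squares.
b, log_a = np.polyfit(np.log(defense_compute), np.log(attack_compute), deg=1)

print(f"fitted exponent b = {b:.2f}")
print(f"attack/defense compute ratio at 1e19 FLOPs of defense: "
      f"{np.exp(log_a) * (1e19) ** b / 1e19:.1e}")

# b < 1 would mean doubling adversarial-training compute buys less than a
# doubling of the compute an attacker needs -- the 'relative' sense in which
# the offense-defense balance can favor offense.
```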

In the longer term, we seek to develop an empirical science of robustness, informing both research prioritization and AI governance. We will use methods such as scaling trends to develop novel defense mechanisms that can scale to deliver robustness for advanced AI systems.

If scaling trends for defense mechanisms are persistently unfavorable, then we will develop mitigations to contain and prevent harms from non-robust models.

Our key robustness work includes:

  • Beating Superhuman Go AIs
    We demonstrate that capabilities do not guarantee robustness: superhuman Go AIs can still be beaten by simple strategies.
    Learn more
  • Scaling Trends for Robustness
    We study how much robustness benefits from model scale and from existing adversarial defense mechanisms, quantifying the capability-robustness gap.
    Learn more

Robustness Research:

May 22, 2025

Accidental Misalignment: Fine-Tuning Language Models Induces Unexpected Vulnerability

As large language models gain popularity, their vulnerability to adversarial attacks remains a primary concern. While fine-tuning models on domain-specific datasets is often employed to improve model performance, it can introduce vulnerabilities within the underlying model. In this work, we investigate Accidental Misalignment, unexpected vulnerabilities arising from characteristics of fine-tuning data. We begin by identifying potential correlation factors such as linguistic features, semantic similarity, and toxicity within our experimental datasets. We then evaluate the adversarial performance of these fine-tuned models and assess how dataset factors correlate with attack success rates. Lastly, we explore potential causal links, offering new insights into adversarial defense strategies and highlighting the crucial role of dataset design in preserving model alignment.
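
As a rough illustration of the correlational analysis described in this abstract, the sketch below correlates per-dataset features with attack success rates measured after fine-tuning. The feature names and all numbers are invented placeholders, not values from the paper.

```python
# Hypothetical sketch of correlating fine-tuning dataset features with
# post-fine-tuning attack success rates. All values are invented placeholders.
import numpy as np
from scipy.stats import spearmanr

# One row per fine-tuning dataset -- illustrative features only.
features = {
    "toxicity": np.array([0.02, 0.05, 0.11, 0.21, 0.34]),
    "semantic_similarity": np.array([0.31, 0.40, 0.38, 0.55, 0.61]),
    "example_length": np.array([180, 95, 240, 130, 210]),
}
attack_success_rate = np.array([0.08, 0.12, 0.15, 0.27, 0.41])  # made-up ASRs

for name, values in features.items():
    rho, p = spearmanr(values, attack_success_rate)
    print(f"{name:>20s}: Spearman rho = {rho:+.2f} (p = {p:.3f})")
```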

August 6, 2024

Data Poisoning in LLMs: Jailbreak-Tuning and Scaling Laws

GPT-4o Guardrails Gone: Data Poisoning & Jailbreak-Tuning

A tiny dose of poisoned data can cause big problems for AI. Our jailbreak-tuning method causes models like GPT-4o to capably answer virtually any harmful question. And this may get worse: after testing 23 LLMs from 8 model series, we find that larger LLMs are more vulnerable to poisoning.

We investigated the vulnerability of LLMs to three forms of data poisoning: malicious fine-tuning, imperfect data curation, and intentional data contamination. Our experiments revealed that larger models are more susceptible to data poisoning, quickly learning harmful behaviors.
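
The "larger models are more susceptible" claim is a scaling trend, and the sketch below shows the kind of fit that could summarize it: a post-poisoning harmfulness score regressed against log model size. All numbers are placeholders, not measurements from the paper.

```python
# Hypothetical sketch: regress a post-poisoning harmfulness score (e.g., judged
# on a 1-5 scale after fine-tuning on lightly poisoned data) against model size.
# All values are invented placeholders, not measurements from the paper.
import numpy as np

model_params = np.array([1e9, 3e9, 7e9, 13e9, 70e9])  # assumed parameter counts
harmfulness = np.array([1.4, 1.7, 2.1, 2.4, 3.2])      # placeholder scores

slope, intercept = np.polyfit(np.log10(model_params), harmfulness, deg=1)
print(f"harmfulness ≈ {intercept:.2f} + {slope:.2f} * log10(params)")
# A positive slope is what 'larger models are more susceptible' would look like
# in this kind of plot.
```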

July 26, 2024

Exploring Scaling Trends in LLM Robustness

Does Robustness Improve with Scale?

Frontier LLMs like ChatGPT are powerful but not always robust. Scale helps with many things, so we wanted to see whether scaling up model size can ‘solve’ robustness issues.

While larger language models exhibit impressive capabilities, they remain vulnerable to adversarial prompts. Empirical findings show that robustness against such attacks significantly improves with adversarial training, but not with model scaling alone.
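
For readers unfamiliar with the setup, the sketch below shows the basic shape of adversarial training: at each step, a fraction of the batch is replaced with adversarially perturbed prompts paired with the desired robust targets. The functions are placeholders, not the paper's implementation.

```python
# Schematic sketch of adversarial training (placeholder functions only).
import random

def find_adversarial_prompt(model, prompt, target):
    """Placeholder for an attack (e.g., a suffix search) against the current model."""
    return prompt + " [adversarial suffix found by the attack]"

def training_step(model, batch):
    """Placeholder for one gradient update on (prompt, target) pairs."""
    return model  # a real implementation would update parameters here

def adversarial_training(model, clean_data, adv_fraction=0.25, steps=1000, batch_size=32):
    for _ in range(steps):
        batch = random.sample(clean_data, batch_size)
        n_adv = int(adv_fraction * batch_size)
        # Re-attack the *current* model so the adversarial examples stay on-policy.
        adv_examples = [(find_adversarial_prompt(model, p, t), t) for p, t in batch[:n_adv]]
        model = training_step(model, adv_examples + batch[n_adv:])
    return model
```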

June 18, 2024

Can Go AIs be adversarially robust?

Even Superhuman Go AIs Have Surprising Failure Modes

Our adversarial testing algorithm uncovers a simple, human-interpretable strategy that consistently beats superhuman Go AIs. We explore the implications this has for the robustness and safety of AI systems.

Ensuring AI robustness remains a significant challenge, even in narrow domains like Go. We tested three approaches to defend Go AIs from adversarial strategies. While these defenses protect against previously discovered adversaries, we uncovered qualitatively new adversaries that undermine these defenses. Interactive examples of these attacks and the codebase are available at goattack.far.ai.
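
The attack itself follows a victim-play recipe: the adversary is trained by reinforcement learning against the frozen superhuman victim rather than against copies of itself. The sketch below is a simplified schematic of that loop; the class and function names are placeholders rather than the released codebase at goattack.far.ai.

```python
# Simplified schematic of victim-play adversarial policy training.
# The adversary is updated; the superhuman victim is only queried, never trained.

def play_game(adversary, victim, env):
    """Play one game: the adversary moves on its turns, the frozen victim on the others."""
    state, trajectory = env.reset(), []
    while not env.done(state):
        if env.to_move(state) == "adversary":
            move = adversary.select_move(state)   # e.g., via search over the adversary net
            trajectory.append((state, move))
        else:
            move = victim.select_move(state)      # victim is queried, never updated
        state = env.step(state, move)
    return trajectory, env.winner(state)

def train_adversary(adversary, victim, env, n_games=10_000):
    for _ in range(n_games):
        trajectory, winner = play_game(adversary, victim, env)
        reward = 1.0 if winner == "adversary" else -1.0
        adversary.update(trajectory, reward)      # policy-gradient / AlphaZero-style update
    return adversary
```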

May 10, 2024

Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems

This paper introduces Guaranteed Safe (GS) AI, an approach to AI safety that provides high-assurance quantitative safety guarantees. It relies on three core components (a world model, a safety specification, and a verifier) to mathematically verify that AI systems meet safety requirements.
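
The sketch below renders the three components as minimal interfaces, purely as an illustration; the paper specifies the framework mathematically rather than prescribing a particular API.

```python
# Illustrative sketch of the three GS AI components as interfaces.
from abc import ABC, abstractmethod

class WorldModel(ABC):
    @abstractmethod
    def predict(self, state, action):
        """Return a (possibly probabilistic) prediction of the next state."""

class SafetySpecification(ABC):
    @abstractmethod
    def is_acceptable(self, trajectory) -> bool:
        """Return whether a predicted trajectory satisfies the safety requirements."""

class Verifier(ABC):
    @abstractmethod
    def certify(self, policy, world_model: WorldModel, spec: SafetySpecification):
        """Produce (or fail to produce) a quantitative guarantee that the policy,
        as modeled by the world model, satisfies the specification."""
```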

January 9, 2023

Adversarial Policies Beat Superhuman Go AIs

Beyond the Board: Exploring AI Robustness Through Go

Achieving robustness remains a significant challenge even in narrow domains like Go. We test three approaches to defend Go AIs from adversarial strategies. We find these defenses protect against previously discovered adversaries, but uncover qualitatively new adversaries that undermine these defenses.

We describe an attack on the state-of-the-art Go-playing AI system, KataGo. The adversaries do not win by learning to play Go better than KataGo but instead by tricking KataGo into making serious blunders. Our results demonstrate that even superhuman AI systems may harbor surprising failure modes. Example games are available at goattack.far.ai.

August 8, 2022

Few-shot Adaptation Works with UnpredicTable Data

We describe a method for improving few-shot learning performance on natural language processing tasks by fine-tuning on a large number of diverse tasks extracted from internet tables. We find that fine-tuning on narrow subsets of these tasks can lead to similar improvements, suggesting that the gains come not from domain adaptation but from adapting to few-shot learning in general.
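
To illustrate the kind of task construction involved, the sketch below turns a toy table into a few-shot prompt by treating one column as the input and another as the output. It is a simplified stand-in for, not a reproduction of, the paper's data pipeline.

```python
# Simplified sketch: turn a table into a few-shot task by mapping one column to another.
def table_to_fewshot_prompt(rows, input_col, output_col, n_shots=3):
    shots = [f"{r[input_col]} -> {r[output_col]}" for r in rows[:n_shots]]
    query = rows[n_shots]
    prompt = "\n".join(shots) + f"\n{query[input_col]} ->"
    return prompt, query[output_col]

rows = [  # toy table standing in for a scraped internet table
    {"country": "France", "capital": "Paris"},
    {"country": "Japan", "capital": "Tokyo"},
    {"country": "Kenya", "capital": "Nairobi"},
    {"country": "Chile", "capital": "Santiago"},
]
prompt, target = table_to_fewshot_prompt(rows, "country", "capital")
print(prompt)   # few-shot prompt ending in "Chile ->"
print(target)   # "Santiago"
```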
