Robustness

Making advanced AI models robust.

At FAR.AI, we aim to make advanced AI models robust. In simple terms, robustness refers to an AI system’s reliability, especially in unfamiliar or challenging situations. Currently, most AI systems are far from robust. They often fail when exposed to new environments and can easily be exploited by adversaries. These weaknesses will pose increasingly serious risks as AI models become more powerful and embedded in critical areas like infrastructure. The challenge we face is clear: as AI capabilities rapidly advance, we must ensure robustness keeps pace.

A key question is whether more capable systems will naturally become robust.

We found that superhuman Go AIs are highly exploitable, demonstrating that capability advances do not guarantee robustness. Nor are these issues easily addressed: we tested three natural defenses for Go-playing AIs and found that our attack can overcome all of them. Combined with prior work from the adversarial robustness literature, we argue in this position paper that robustness is unlikely to be solved under the status-quo AI development paradigm, and we highlight a number of safety risks this poses.

Robustness might not be solved under status-quo development, but could scaling capabilities at least help improve robustness? We explored empirical scaling trends for robustness in language models. We find that scaling model size and adversarial training both improve robustness, with adversarial training orders of magnitude more compute-efficient. However, the offense-defense balance currently favors offense, both in absolute terms (an attacker can break a model with a fraction of the compute used to defend it) and in relative terms (a model trained with twice as much adversarial-training compute can be broken with less than twice as much attack compute).
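
To make the relative claim concrete, one way to quantify the offense-defense balance is to fit a power law relating the attack compute needed to break a model to the adversarial-training compute spent defending it; an exponent below one means doubling defense compute buys less than a doubling of the attacker's cost. The sketch below illustrates this kind of fit on invented numbers; it is not our analysis code or data.

```python
# Hypothetical sketch: fit a power law attack_compute ≈ a * defense_compute^b
# to illustrate how an offense-defense balance could be quantified.
# All numbers are invented for illustration; they are NOT results from the paper.
import numpy as np

# Synthetic (defense compute, attack compute needed to break the model) pairs, in FLOPs.
defense_compute = np.array([1e18, 2e18, 4e18, 8e18, 1.6e19])
attack_compute = np.array([3e16, 5e16, 8e16, 1.3e17, 2.1e17])  # illustrative only

# Fit log(attack) = log(a) + b * log(defense) by least squares.
b, log_a = np.polyfit(np.log(defense_compute), np.log(attack_compute), deg=1)

print(f"fitted exponent b = {b:.2f}")
print(f"attack/defense compute ratio at 1e19 FLOPs of defense: "
      f"{np.exp(log_a) * (1e19) ** b / 1e19:.1e}")

# b < 1 would mean doubling adversarial-training compute buys less than a
# doubling of the compute an attacker needs -- the 'relative' sense in which
# the offense-defense balance can favor offense.
```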

In the longer term, we seek to develop an empirical science of robustness, informing both research prioritization and AI governance. We will use methods such as scaling trends to develop novel defense mechanisms that can scale to deliver robustness for advanced AI systems.

If scaling trends for defense mechanisms are persistently unfavorable, then we will develop mitigations to contain and prevent harms from non-robust models.

Our key robustness work includes:

  • Beating Superhuman Go AIs
    We demonstrate that capabilities do not guarantee robustness: superhuman Go AIs can still be beaten by simple strategies.
    Learn more
  • Scaling Trends for Robustness
    We study how much robustness benefits from model scale and from existing adversarial defense mechanisms, quantifying the capability-robustness gap.
    Learn more

Robustness Research:

May 22, 2025

Accidental Misalignment: Fine-Tuning Language Models Induces Unexpected Vulnerability

As large language models gain popularity, their vulnerability to adversarial attacks remains a primary concern. While fine-tuning models on domain-specific datasets is often employed to improve model performance, it can introduce vulnerabilities within the underlying model. In this work, we investigate Accidental Misalignment, unexpected vulnerabilities arising from characteristics of fine-tuning data. We begin by identifying potential correlation factors such as linguistic features, semantic similarity, and toxicity within our experimental datasets. We then evaluate the adversarial performance of these fine-tuned models and assess how dataset factors correlate with attack success rates. Lastly, we explore potential causal links, offering new insights into adversarial defense strategies and highlighting the crucial role of dataset design in preserving model alignment.
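
As a rough illustration of the correlational analysis described in this abstract, the sketch below correlates per-dataset features with attack success rates measured after fine-tuning. The feature names and all numbers are invented placeholders, not values from the paper.

```python
# Hypothetical sketch of correlating fine-tuning dataset features with
# post-fine-tuning attack success rates. All values are invented placeholders.
import numpy as np
from scipy.stats import spearmanr

# One row per fine-tuning dataset -- illustrative features only.
features = {
    "toxicity": np.array([0.02, 0.05, 0.11, 0.21, 0.34]),
    "semantic_similarity": np.array([0.31, 0.40, 0.38, 0.55, 0.61]),
    "example_length": np.array([180, 95, 240, 130, 210]),
}
attack_success_rate = np.array([0.08, 0.12, 0.15, 0.27, 0.41])  # made-up ASRs

for name, values in features.items():
    rho, p = spearmanr(values, attack_success_rate)
    print(f"{name:>20s}: Spearman rho = {rho:+.2f} (p = {p:.3f})")
```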

August 6, 2024

Data Poisoning in LLMs: Jailbreak-Tuning and Scaling Laws

GPT-4o Guardrails Gone: Data Poisoning & Jailbreak-Tuning

A tiny dose of poisoned data can cause big problems for AI. Our jailbreak-tuning method causes models like GPT-4o to capably answer virtually any harmful question. And this may get worse: after testing 23 LLMs from 8 model series, we find that larger LLMs are more vulnerable to poisoning.

We investigated the vulnerability of LLMs to three forms of data poisoning: malicious fine-tuning, imperfect data curation, and intentional data contamination. Our experiments revealed that larger models are more susceptible to data poisoning, quickly learning harmful behaviors.
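
The "larger models are more susceptible" claim is a scaling trend, and the sketch below shows the kind of fit that could summarize it: a post-poisoning harmfulness score regressed against log model size. All numbers are placeholders, not measurements from the paper.

```python
# Hypothetical sketch: regress a post-poisoning harmfulness score (e.g., judged
# on a 1-5 scale after fine-tuning on lightly poisoned data) against model size.
# All values are invented placeholders, not measurements from the paper.
import numpy as np

model_params = np.array([1e9, 3e9, 7e9, 13e9, 70e9])  # assumed parameter counts
harmfulness = np.array([1.4, 1.7, 2.1, 2.4, 3.2])      # placeholder scores

slope, intercept = np.polyfit(np.log10(model_params), harmfulness, deg=1)
print(f"harmfulness ≈ {intercept:.2f} + {slope:.2f} * log10(params)")
# A positive slope is what 'larger models are more susceptible' would look like
# in this kind of plot.
```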

July 26, 2024

Exploring Scaling Trends in LLM Robustness

Does Robustness Improve with Scale?

Frontier LLMs like ChatGPT are powerful but not always robust. Scale helps with many things, so we wanted to see whether scaling up model size can ‘solve’ robustness issues.

While larger language models exhibit impressive capabilities, they remain vulnerable to adversarial prompts. Empirical findings show that robustness against such attacks significantly improves with adversarial training, but not with model scaling alone.
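
For readers unfamiliar with the setup, the sketch below shows the basic shape of adversarial training: at each step, a fraction of the batch is replaced with adversarially perturbed prompts paired with the desired robust targets. The functions are placeholders, not the paper's implementation.

```python
# Schematic sketch of adversarial training (placeholder functions only).
import random

def find_adversarial_prompt(model, prompt, target):
    """Placeholder for an attack (e.g., a suffix search) against the current model."""
    return prompt + " [adversarial suffix found by the attack]"

def training_step(model, batch):
    """Placeholder for one gradient update on (prompt, target) pairs."""
    return model  # a real implementation would update parameters here

def adversarial_training(model, clean_data, adv_fraction=0.25, steps=1000, batch_size=32):
    for _ in range(steps):
        batch = random.sample(clean_data, batch_size)
        n_adv = int(adv_fraction * batch_size)
        # Re-attack the *current* model so the adversarial examples stay on-policy.
        adv_examples = [(find_adversarial_prompt(model, p, t), t) for p, t in batch[:n_adv]]
        model = training_step(model, adv_examples + batch[n_adv:])
    return model
```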

June 18, 2024

Can Go AIs be adversarially robust?

Even Superhuman Go AIs Have Surprising Failure Modes

Our adversarial testing algorithm uncovers a simple, human-interpretable strategy that consistently beats superhuman Go AIs. We explore the implications this has for the robustness and safety of AI systems.

Ensuring AI robustness remains a significant challenge, even in narrow domains like Go. We tested three approaches to defend Go AIs from adversarial strategies. While these defenses protect against previously discovered adversaries, we uncovered qualitatively new adversaries that undermine these defenses. Interactive examples of these attacks and the codebase are available at goattack.far.ai.
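
The attack itself follows a victim-play recipe: the adversary is trained by reinforcement learning against the frozen superhuman victim rather than against copies of itself. The sketch below is a simplified schematic of that loop; the class and function names are placeholders rather than the released codebase at goattack.far.ai.

```python
# Simplified schematic of victim-play adversarial policy training.
# The adversary is updated; the superhuman victim is only queried, never trained.

def play_game(adversary, victim, env):
    """Play one game: the adversary moves on its turns, the frozen victim on the others."""
    state, trajectory = env.reset(), []
    while not env.done(state):
        if env.to_move(state) == "adversary":
            move = adversary.select_move(state)   # e.g., via search over the adversary net
            trajectory.append((state, move))
        else:
            move = victim.select_move(state)      # victim is queried, never updated
        state = env.step(state, move)
    return trajectory, env.winner(state)

def train_adversary(adversary, victim, env, n_games=10_000):
    for _ in range(n_games):
        trajectory, winner = play_game(adversary, victim, env)
        reward = 1.0 if winner == "adversary" else -1.0
        adversary.update(trajectory, reward)      # policy-gradient / AlphaZero-style update
    return adversary
```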

May 10, 2024

Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems

This paper introduces Guaranteed Safe (GS) AI, an approach to AI safety that provides high-assurance quantitative safety guarantees. It relies on three core components (a world model, a safety specification, and a verifier) to mathematically verify that AI systems meet safety requirements.
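
The sketch below renders the three components as minimal interfaces, purely as an illustration; the paper specifies the framework mathematically rather than prescribing a particular API.

```python
# Illustrative sketch of the three GS AI components as interfaces.
from abc import ABC, abstractmethod

class WorldModel(ABC):
    @abstractmethod
    def predict(self, state, action):
        """Return a (possibly probabilistic) prediction of the next state."""

class SafetySpecification(ABC):
    @abstractmethod
    def is_acceptable(self, trajectory) -> bool:
        """Return whether a predicted trajectory satisfies the safety requirements."""

class Verifier(ABC):
    @abstractmethod
    def certify(self, policy, world_model: WorldModel, spec: SafetySpecification):
        """Produce (or fail to produce) a quantitative guarantee that the policy,
        as modeled by the world model, satisfies the specification."""
```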

January 9, 2023

Adversarial Policies Beat Superhuman Go AIs

Beyond the Board: Exploring AI Robustness Through Go

Achieving robustness remains a significant challenge even in narrow domains like Go. We test three approaches to defend Go AIs from adversarial strategies. We find these defenses protect against previously discovered adversaries, but uncover qualitatively new adversaries that undermine these defenses.

We describe an attack on the state-of-the-art Go-playing AI system, KataGo. The adversaries do not win by learning to play Go better than KataGo but instead by tricking KataGo into making serious blunders. Our results demonstrate that even superhuman AI systems may harbor surprising failure modes. Example games are available at goattack.far.ai.

August 8, 2022

Few-shot Adaptation Works with UnpredicTable Data

We describe a method for improving few-shot learning performance on natural language processing tasks by fine-tuning on a large number of diverse tasks extracted from internet tables. We find that fine-tuning on narrow subsets of these tasks can lead to similar improvements, suggesting that the gains come not from domain adaptation but from adapting to few-shot learning in general.
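
To illustrate the kind of task construction involved, the sketch below turns a toy table into a few-shot prompt by treating one column as the input and another as the output. It is a simplified stand-in for, not a reproduction of, the paper's data pipeline.

```python
# Simplified sketch: turn a table into a few-shot task by mapping one column to another.
def table_to_fewshot_prompt(rows, input_col, output_col, n_shots=3):
    shots = [f"{r[input_col]} -> {r[output_col]}" for r in rows[:n_shots]]
    query = rows[n_shots]
    prompt = "\n".join(shots) + f"\n{query[input_col]} ->"
    return prompt, query[output_col]

rows = [  # toy table standing in for a scraped internet table
    {"country": "France", "capital": "Paris"},
    {"country": "Japan", "capital": "Tokyo"},
    {"country": "Kenya", "capital": "Nairobi"},
    {"country": "Chile", "capital": "Santiago"},
]
prompt, target = table_to_fewshot_prompt(rows, "country", "capital")
print(prompt)   # few-shot prompt ending in "Chile ->"
print(target)   # "Santiago"
```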
