Even Superhuman Go AIs Have Surprising Failure Modes
In March 2016, AlphaGo defeated the Go world champion Lee Sedol, winning four games to one. Machines had finally become superhuman at Go. Since then, Go-playing AI has only grown stronger. The supremacy of AI over humans seemed assured, with Lee Sedol commenting they are an “entity that cannot be defeated”. But in 2022, amateur Go player Kellin Pelrine defeated KataGo, a Go program that is even stronger than AlphaGo. How?
It turns out that even superhuman AIs have blind spots and can be tripped up by surprisingly simple tricks. In our new paper, we developed a way to automatically find vulnerabilities in a “victim” AI system by training an adversary AI system to beat the victim. With this approach, we found that KataGo systematically misevaluates large cyclically connected groups of stones. We also found that other superhuman Go bots including ELF OpenGo, Leela Zero and Fine Art suffer from a similar blindspot. Although such positions rarely occur in human games, they can be reliably created by executing a straightforward strategy. Indeed, the strategy is simple enough that you can teach it to a human who can then defeat these Go bots unaided.
Our AI system (that we call the adversary) can beat a superhuman version of KataGo in 94 out of 100 games, despite requiring only 8% of the computational power used to train that version of KataGo. We found two separate exploits: one where the adversary tricks KataGo into passing prematurely, and another that involves coaxing KataGo into confidently building an unsafe circular group that can be captured. Go enthusiasts can read an analysis of these games on the project website.
Our results also give some general lessons about AI outside of Go. Many AI systems, from image classifiers to natural language processing systems, are vulnerable to adversarial inputs: seemingly innocuous changes such as adding imperceptible static to an image or a distractor sentence to a paragraph can crater the performance of AI systems while not affecting humans. Some have assumed that these vulnerabilities will go away when AI systems get capable enough—and that superhuman AIs will always be wise to such attacks. We’ve shown that this isn’t necessarily the case: systems can simultaneously surpass top human professionals in the common case while faring worse than a human amateur in certain situations.
This is concerning: if superhuman Go AIs can be hacked in this way, who’s to say that transformative AI systems of the future won’t also have vulnerabilities? This is clearly problematic when AI systems are deployed in high-stakes situations (like running critical infrastructure, or performing automated trades) where bad actors are incentivized to exploit them. More subtly, it also poses significant problems when an AI system is tasked with overseeing another AI system, such as a learned reward model being used to train a reinforcement learning policy, as the lack of robustness may cause the policy to capably pursue the wrong objective (so-called reward hacking).
Table of Contents
How to Find Vulnerabilities in Superhuman Go Bots
To design an attack we first need a threat model: assumptions about what information and resources the attacker (us) has access to. We assume we have access to the input/output behavior of KataGo, but not access to its inner workings (i.e. its weights). Specifically, we can show KataGo a board state (the position of all the stones on the board) and receive a (possibly stochastic) move that it would take in that position. This assumption is conservative: we can sample moves in this way from any publicly available Go program.
We focus on exploiting KataGo since, at the time of writing, it is the most capable publicly available Go program. Our approach is to train an adversary AI to find vulnerabilities in KataGo. We train the adversary in a similar way to how most modern Go bots are trained, via AlphaZero-style training (expand the section below for a quick summary of this approach).
We modify the AlphaZero training procedure in a handful of ways. We want the adversary to be good at finding and exploiting bugs in KataGo, rather than learning generally good Go moves. So instead of playing against a copy of itself (so-called self-play), we pit the adversary against a static version of KataGo (which we dub victim-play).
We also modify the Monte-Carlo Tree Search (MCTS) procedure, illustrated below. In regular MCTS, moves are sampled from a single policy network. This works well in self-play, where both players are the same agent. But with victim-play, the adversary is playing against a potentially very different victim agent. We solve this by sampling from KataGo’s move distribution when it’s KataGo’s turn, and our policy network when it’s our turn.
We also create a curriculum for the adversary by pitting it against a series of gradually more capable versions of KataGo. Whenever the adversary finds a way to consistently beat a KataGo version, we swap that version out for a better one. There are two ways to vary the skill of KataGo. Firstly, we use old versions (“checkpoints”) of KataGo’s neural network from various points of its training. Secondly, we vary the amount of search KataGo has: how many moves can be simulated during MCTS. The more moves that are simulated, the stronger KataGo is.
Our adversary relatively quickly learns to exploit KataGo playing without tree search (at the level of a top-100 European professional), achieving a greater than 95% win rate against KataGo after 200 million training steps (see orange line below). After this point, the curriculum continues to ramp up the difficulty every vertical dashed line. It takes another 300 million training steps to start reliably exploiting a strongly superhuman version of KataGo, playing with MCTS simultating 4096 moves for every move it makes (gray line). After this, the adversary learns to exploit successively harder victims with only small amounts of additional training data (although the computational requirements of generating the data successively increase as the victim’s search depth increases).
This adversarial training procedure discovered two distinct attacks that can reliably defeat KataGo: the pass attack and the cyclic attack. The pass attack works by tricking KataGo into passing, causing the game to end prematurely at a point favorable to the attacker. It is the less impressive of the two, as it can be patched with a hard-coded defense: expand the section below for more information on it. The cyclic attack on the other hand is a substantial vulnerability of both KataGo and other superhuman Go bots, which has yet to be fixed despite attempts by both our team and the lead developer of KataGo, David Wu. It works by exploiting KataGo’s misevaluation of large, cyclically connected groups of stones.
The Cyclic Attack
We identified the cyclic attack by training an adversary against a version of KataGo patched to avoid our first attack, the pass attack. The cyclic adversary first coaxes KataGo into building a group in a circular pattern. KataGo seems to think such groups are nearly indestructible, even though they are not. The cyclic adversary abuses this oversight to slowly re-surround KataGo’s cyclic group. KataGo only realizes the group is in danger when it is too late, and the adversary captures the group.
Using the cyclic attack, our adversary can reliably beat even strongly superhuman versions of KataGo. Let’s focus on three KataGo versions: one at the level of a top European professional (KataGo with no MCTS), one that is superhuman (KataGo with MCTS simulating 4096 moves for every move it makes), and one that is strongly superhuman (KataGo with MCTS simulating 10 million moves). Our adversary beat the human professional level bot in 100% of the games we ran, the superhuman bot 96% of the time, and the strongly superhuman bot 72% of the time. This is even though we trained our adversary with only 14% of the computational power used to train KataGo; moreover, our adversary only simulated 600 moves in all of these matches, far below the amount of search used by the superhuman and strongly superhuman versions of KataGo.
We were also interested in whether we could use this adversary, trained to beat KataGo, to defeat other superhuman Go-playing agents. We pitted this adversary against Leela Zero and ELF OpenGo without any training against these systems (a zero-shot transfer). The adversary beat Leela Zero 6% of the time and ELF OpenGo 4% of the time.
Although these win rates are modest, they demonstrate that other Go bots are vulnerable to the cyclic attack at least to some degree. Notably, these are superhuman AIs against which even the best human players in the world would struggle to win 1% of the time – so achieving a win rate of around 5% represents a significant vulnerability. This extends our original threat model: an attacker can conduct a black-box attack so long as they can obtain gray-box access to a sufficiently similar victim.
The cyclic attack is not just a specific set of moves that somehow exploit some arbitrary bug in KataGo; it’s a general and human-interpretable strategy. One of our authors Kellin, an amateur Go player, studied the behavior of our adversary to learn to play the cyclic attack himself. Kellin then used the cyclic attack to repeatedly beat superhuman versions of both KataGo and Leela Zero by himself. Many other Go enthusiasts have now used the cyclic attack to beat strong Go bots, including Sai (example) and Fine Art (example). You can learn the attack yourself with this video.
The Implications
The fact that the cyclic attack can be used to beat many different Go bots shows that the problem is not specific to KataGo. Moreover, in concurrent work, a team at DeepMind found a way to beat a human-expert level version of AlphaZero. The fact that two different teams could find two distinct exploits against distinct AI programs is strong evidence that the AlphaZero approach is intrinsically vulnerable. This in itself is interesting, but there are some more general lessons we can learn.
Adversarial attacks on neural networks have been known for nearly a decade, ever since researchers discovered that you can trick image classifiers by simply adding some imperceptible static to the image. Many have expected that these vulnerabilities in AI systems will disappear when the systems get suitably capable. Sure, an image classifier is tripped up by some static, but surely an image classifier that’s as capable as a human wouldn’t make such a dumb mistake?
Our results show that this is not necessarily the case. Just because a system is capable does not mean it is robust. Even superhuman AI systems can be tripped up by a human if the human knows its weaknesses. Another way to put this is that worst-case robustness (the ability to avoid negative outcomes in worst-case scenarios) is lagging behind average-case capabilities (the ability to do very well in the typical situation a system is trained in).
This has important implications for future deployment of AI systems. For now, it seems unwise to deploy AI systems in any security-critical setting, as even the most capable AI systems are vulnerable to a wide range of adversarial attack. Additionally, serious caution is required for any deployment in safety-critical settings: these failures highlight that even seemingly capable systems are often learning non-robust representations, which may cause the AI systems to fail in ways that are hard to anticipate due to inevitable discrepancies between their training and deployment environment.
These vulnerabilities also have important implications for AI alignment: the technical challenge of steering AI towards the goals of their user. Many proposed solutions to the alignment problem involve one “helper AI” providing a feedback signal steering the main AI system towards desirable behavior. Unfortunately, if the helper AI system is vulnerable to adversarial attack, then the main AI system will achieve a higher rating by the helper AI if it exploits the helper instead of achieving the desired task. To address this, we have proposed a new research program of fault-tolerant alignment strategies.
To summarize: we’ve found a way to systematically search for exploits against game-playing AI systems, and shown this approach can uncover surprisingly simple hacks that can reliably beat superhuman Go bots. All of the AlphaZero-style agents that we’ve studied are susceptible to the cyclic attack. There is a clear warning here about the powerful AI systems of the future: no matter how capable they seem, they may still fail in surprising ways. Adversarial testing and red teaming is essential for any high-stakes deployment, and finding new fault-tolerant approaches to AI may be necessary to avoid a chaotic future.
For more information, check out our ICML 2023 paper or the project website. If you are interested in working on problems related to adversarial robustness or AI safety more broadly, we’re hiring for research engineers and research scientists. We’d also be interested in exploring collaborations with researchers at other institutions: feel free to reach out to hello@far.ai.
Acknowledgements
Thanks to Lawrence Chan, Claudia Shi and Jean-Christophe Mourrat for feedback on earlier versions of this manuscript.