Can Go AIs be adversarially robust?

Abstract

Prior work found that superhuman Go AIs like KataGo can be defeated by simple adversarial strategies. In this paper, we study whether simple defenses can improve KataGo’s worst-case performance. We test three natural defenses: adversarial training on hand-constructed positions, iterated adversarial training, and changing the network architecture. We find that some of these defenses protect against previously discovered attacks. Unfortunately, we also find that none of them withstand adaptive attacks. In particular, we are able to train new adversaries that reliably defeat our defended agents by causing them to blunder in ways humans would not. Our results suggest that building robust AI systems is challenging even in narrow domains such as Go. For interactive examples of attacks and a link to our codebase, see goattack.far.ai.

Read the blogpost

Tom Tseng
Research Engineer

Tom Tseng is a research engineer at FAR.AI. Tom previously worked as a software engineer at Gather and Cruise. He has a master’s degree from MIT and a bachelor’s degree from Carnegie Mellon University.

Euan McLean
Communications Specialist

Euan is a communications specialist at FAR.AI. In the past he has completed a PhD in theoretical particle physics at the University of Glasgow, worked as a machine learning engineer at a cybersecurity startup, and worked as a strategy researcher at the Center on Long Term Risk. He is also a scriptwriter for the YouTube channel PBS Spacetime. His passion is reducing interpretive labor in AI alignment to speed up the progress of the field.

Kellin Pelrine
Research Scientist

Kellin Pelrine is a research scientist at FAR.AI and a PhD candidate at McGill University and the Mila institute. Kellin leads work on making AI a transformatively positive force, rather than a potentially catastrophically negative one, for our ability to find reliable information and build knowledge. He also leads projects on exposing, understanding, and solving vulnerabilities of frontier models.

Tony Wang
PhD Student

Tony Wang is a PhD student in the Computer Science and Artificial Intelligence Laboratory at the Massachusetts Institute of Technology (MIT), where he is advised by Nir Shavit. Tony’s research focuses on adversarial robustness. Tony collaborates with Adam Gleave and others at FAR.AI. For more information, see his website.

Adam Gleave
CEO and President of the Board

Adam Gleave is the CEO of FAR.AI. He completed his PhD in artificial intelligence (AI) at UC Berkeley, advised by Stuart Russell. His goal is to develop the techniques necessary for advanced automated systems to verifiably act according to human preferences, even in situations unanticipated by their designers. He is particularly interested in improving methods for value learning and the robustness of deep RL. For more information, visit his website.