Abstract
Prior work found that superhuman Go AIs like KataGo can be defeated by simple adversarial strategies. In this paper, we study whether simple defenses can improve KataGo’s worst-case performance. We test three natural defenses: adversarial training on hand-constructed positions, iterated adversarial training, and changing the network architecture. We find that some of these defenses protect against previously discovered attacks. Unfortunately, we also find that none of them withstands adaptive attacks. In particular, we are able to train new adversaries that reliably defeat our defended agents by causing them to blunder in ways humans would not. Our results suggest that building robust AI systems is challenging even in narrow domains such as Go. For interactive examples of attacks and a link to our codebase, see goattack.far.ai.
Research Engineer
Tom Tseng is a research engineer at FAR.AI. Tom previously worked as a software engineer at Gather and Cruise. He has a master’s degree from MIT and a bachelor’s degree from Carnegie Mellon University.
Communications Specialist
Euan is a communications specialist at FAR.AI. He previously completed a PhD in theoretical particle physics at the University of Glasgow, worked as a machine learning engineer at a cybersecurity startup, and was a strategy researcher at the Center on Long-Term Risk. He is also a scriptwriter for the YouTube channel PBS Spacetime. His passion is reducing interpretive labor in AI alignment to speed up the progress of the field.
PhD Candidate
Kellin Pelrine is a PhD candidate at McGill University advised by Reihaneh Rabbany. He is also a member of the Mila AI Institute and the Centre for the Study of Democratic Citizenship. His main interests are in developing machine learning methods to leverage all available data and exploring how we can ensure methods will work as well in practice as on paper, with a particular focus on social good applications. Kellin collaborates with Adam Gleave at FAR.AI, and previously worked with us as a Research Scientist Intern.
PhD Student
Tony Wang is a PhD student in the Computer Science and Artificial Intelligence Laboratory at the Massachusetts Institute of Technology (MIT), where he is advised by Nir Shavit. Tony’s research focuses on adversarial robustness. Tony collaborates with Adam Gleave and others at FAR.AI. For more information, see his website.
CEO and President of the Board
Adam Gleave is the CEO of FAR.AI. He completed his PhD in artificial intelligence (AI) at UC Berkeley, advised by Stuart Russell. His goal is to develop techniques necessary for advanced automated systems to verifiably act according to human preferences, even in situations unanticipated by their designer. He is particularly interested in improving methods for value learning and the robustness of deep RL. For more information, visit his website.