Abstract
We attack the state-of-the-art Go-playing AI system, KataGo, by training adversarial policies that play against frozen KataGo victims. Our attack achieves a >99% win rate when KataGo uses no tree search, and a >77% win rate when KataGo uses enough search to be superhuman. Notably, our adversaries do not win by learning to play Go better than KataGo; in fact, our adversaries are easily beaten by human amateurs. Instead, our adversaries win by tricking KataGo into making serious blunders. Our results demonstrate that even superhuman AI systems may harbor surprising failure modes. Example games are available at goattack.far.ai.
Publication
International Conference on Machine Learning (ICML)

PhD Student
Tony Wang is a PhD student in the Computer Science and Artificial Intelligence Laboratory at the Massachusetts Institute of Technology (MIT), where he is advised by Nir Shavit. Tony’s research focuses on adversarial robustness. Tony collaborates with Adam Gleave and others at FAR.AI. For more information, see his website.

CEO and President of the Board
Adam Gleave is the CEO of FAR.AI. He completed his PhD in artificial intelligence (AI) at UC Berkeley, advised by Stuart Russell. His goal is to develop techniques necessary for advanced automated systems to verifiably act according to human preferences, even in situations unanticipated by their designer. He is particularly interested in improving methods for value learning, and robustness of deep RL. For more information, visit his website.

Research Engineer
Tom Tseng is a research engineer at FAR.AI. Tom previously worked as a software engineer at Gather and Cruise. He has a master’s degree from MIT and a bachelor’s degree from Carnegie Mellon University.

Research Scientist
Kellin Pelrine is a research scientist at FAR.AI and a PhD candidate at McGill University and the Mila institute. Kellin leads work on making AI a transformatively positive force, rather than a potentially catastrophic one, for our ability to find reliable information and build knowledge. He also leads projects on exposing, understanding, and solving vulnerabilities of frontier models.

Nora Belrose was a Research Engineer at FAR.AI. Prior to joining FAR.AI, Nora worked at the startup CureMetrix, applying deep learning to detect calcified arteries in mammograms. Nora has also made numerous open-source contributions, including Classroom, a library that implements deep RL from human preferences.

Research Engineer
Joseph has a bachelor’s degree in Mathematics and Computer Science from the University of Warwick. Before joining FAR.AI, he worked as a software engineer for various startups, most recently at the data privacy company Privitar. Last year he created the text-to-image website hypnogram.xyz.