Adrià Garriga-Alonso

Research Scientist

Adrià Garriga-Alonso is a scientist at FAR.AI, working on understanding what learned optimizers want.

Previously, he worked at Redwood Research on neural network interpretability. He holds a PhD from the University of Cambridge, where he worked on Bayesian neural networks; his advisor was Prof. Carl Rasmussen.

He was a co-organiser of the ICLR 2019 workshop “Safe Machine Learning: Specification, Robustness and Assurance”.

News & publications

Illusory Safety: Redteaming DeepSeek R1 and the Strongest Fine-Tunable Models of OpenAI, Anthropic, and Google
February 4, 2025
Pacing Outside the Box: RNNs Learn to Plan in Sokoban
July 24, 2024
Planning behavior in a recurrent neural network that plays Sokoban
July 22, 2024
Adversarial Circuit Evaluation
July 21, 2024
Catastrophic Goodhart: regularizing RLHF with KL divergence does not mitigate heavy-tailed reward misspecification
July 19, 2024
InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques
July 19, 2024
Investigating the Indirect Object Identification circuit in Mamba
July 19, 2024
Towards Automated Circuit Discovery for Mechanistic Interpretability
July 4, 2023