Adrià Garriga-Alonso

Member of Technical Staff

Adrià Garriga-Alonso was a scientist at FAR.AI, working on understanding what learned optimizers want.

Previously he worked at Redwood Research on neural network interpretability. He holds a PhD from the University of Cambridge, where he worked on Bayesian neural networks; his advisor was Prof. Carl Rasmussen.

He was a co-organiser of the ICLR 2019 workshop “Safe Machine Learning: Specification, Robustness and Assurance”.

News & publications

Among Us: A sandbox for measuring and detecting agentic deception

April 5, 2025

Interpreting emergent planning in model-free reinforcement learning

April 2, 2025

Illusory Safety: Redteaming DeepSeek R1 and the Strongest Fine-Tunable Models of OpenAI, Anthropic, and Google

February 4, 2025

Pacing Outside the Box: RNNs Learn to Plan in Sokoban

July 24, 2024

Adversarial Circuit Evaluation

July 21, 2024

Catastrophic Goodhart: regularizing RLHF with KL divergence does not mitigate heavy-tailed reward misspecification

July 19, 2024

InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques

July 19, 2024

Investigating the Indirect Object Identification circuit in Mamba

July 19, 2024

Towards Automated Circuit Discovery for Mechanistic Interpretability

July 4, 2023

Jailbreak-Tuning: Models Efficiently Learn Jailbreak Susceptibility

Planning behavior in a recurrent neural network that plays Sokoban

July 22, 2024


Research

Our research explores a portfolio of high-potential agendas.

Events

Our events bring together global leaders in AI.

Programs

Our programs build the field of trustworthy and secure AI.
