Stefan Heimersheim

Member of Technical Staff

Google DeepMind

Stefan was a Member of Technical Staff at FAR.AI until February 2026. Previously at Apollo Research, he conducted foundational work in mechanistic interpretability (including parameter decomposition methods and studies of activation plateaus) as well as applied projects using interpretability to detect deception in LLMs.

He holds a PhD in Astronomy from the University of Cambridge, where he focused on 21 cm cosmology and Bayesian inference. He is a co-organiser of the NeurIPS 2025 Mechanistic Interpretability workshop.

News & publications

Concept Influence: Leveraging Interpretability to Improve Performance and Efficiency in Training Data Attribution

February 19, 2026

The Obfuscation Atlas: Mapping Where Honesty Emerges in RLVR with Deception Probes

February 17, 2026

Compressed Computation is (probably) not Computation in Superposition

December 6, 2025

Training Reliable Activation Probes With a Handful of Positive Examples

September 30, 2025

Transformers Don’t Need LayerNorm at Inference Time: Scaling LayerNorm Removal to GPT-2 XL and Implications for Mechanistic Interpretability

September 30, 2025

Towards Automated Circuit Discovery for Mechanistic Interpretability

July 4, 2023

Research

Our research explores a portfolio of high-potential agendas.

Events

Our events bring together global leaders in AI.

Programs

Our programs build the field of trustworthy and secure AI.
