Interpretability

Discovering how the internals of AI systems lead to outcomes.

At FAR.AI, we work on interpretability to explain how the internals of AI systems cause outcomes. Deep neural networks form the foundation of modern machine learning, yet they are famously black boxes, inscrutable even to experts. Understanding neural network internals enables the identification and correction of unintended behaviors before they can cause harm, and facilitates governance by making it possible to audit internal processes. Moreover, interpretability can provide crucial insights into how core safety problems manifest in neural networks and how they can be addressed.

Considerable progress has been made on interpretability in recent years, but current methods are still insufficient to fully reverse-engineer frontier models. We therefore develop new methods, such as the tuned lens for interpreting the residual stream and codebook features for making network internals more interpretable. In developing these methods, we found that existing interpretability metrics are unreliable; to address this, we developed the more rigorous adversarial circuit evaluation metric along with a benchmark collection of transformers with known circuits. Finally, we have applied interpretability techniques to the new Mamba architecture and used them to study core safety problems such as learned planning.

  • New Method: Codebook Features
    We modify neural networks to make their internals more interpretable and steerable by applying a quantization bottleneck that forces the activation vector into a sum of a few discrete codes. This converts an inscrutable, dense, and continuous vector into a discrete list of codes from a learned codebook. Learn more
  • Interpretability Metrics
    We introduce InterpBench, a semi-synthetic benchmark of realistic transformers implementing known circuits. We pair this with adversarial circuit evaluation, which stress-tests a circuit explanation by finding the inputs that maximize the discrepancy between the proposed circuit and the full model.
    Learn more about InterpBench and adversarial circuit evaluation
  • Applied Interpretability: Learned Planners
    AI systems that learn to plan toward misaligned goals (sometimes called “mesa-optimizers”) are a commonly cited yet poorly understood threat model for losing control of AI systems. We use interpretability methods to understand learned planning in an agent playing Sokoban, finding we can predict and modify its plans.
    Learn more

Interpretability Research:

April 2, 2025

Interpreting emergent planning in model-free reinforcement learning

We present the first mechanistic evidence that model-free reinforcement learning agents can learn to plan. This is achieved by applying a methodology based on concept-based interpretability to a model-free agent in Sokoban -- a commonly used benchmark for studying planning. Specifically, we demonstrate that DRC, a generic model-free agent introduced by Guez et al. (2019), uses learned concept representations to internally formulate plans that both predict the long-term effects of actions on the environment and influence action selection.
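
A minimal sketch of the concept-probing step, assuming one has cached the agent's recurrent hidden states and binary labels for a planning concept (the tensor shapes, concept, and training loop are illustrative, not the paper's code):

```python
import torch
import torch.nn as nn

def train_concept_probe(hidden_states: torch.Tensor,
                        concept_labels: torch.Tensor,
                        epochs: int = 200,
                        lr: float = 1e-2) -> nn.Linear:
    """Fit a linear probe mapping the agent's hidden state to a binary
    planning concept, e.g. 'a box will be pushed onto this square'."""
    probe = nn.Linear(hidden_states.shape[-1], 1)
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        opt.zero_grad()
        logits = probe(hidden_states).squeeze(-1)       # (n_examples,)
        loss = loss_fn(logits, concept_labels.float())  # labels in {0, 1}
        loss.backward()
        opt.step()
    return probe
```

A probe that generalizes well is evidence the concept is linearly represented; intervening on the probed direction then tests whether that representation causally influences the agent's actions.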

February 18, 2025

Archetypal SAE: Adaptive and Stable Dictionary Learning for Concept Extraction in Large Vision Models

We show that Sparse Autoencoders (SAEs), despite their promise for interpretability, are highly unstable. We introduce two new benchmarks to assess SAE dictionary quality, and propose Archetypal SAEs (A-SAEs), which constrain dictionary atoms to the data’s convex hull, greatly improving stability. Our relaxed version, RA-SAE, matches top reconstruction performance and consistently learns more structured, meaningful representations.
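
A minimal sketch of the convex-hull constraint, assuming dictionary atoms are parameterized as convex combinations of a fixed sample of activations (the class and variable names are illustrative, not the paper's implementation):

```python
import torch
import torch.nn as nn

class ArchetypalDictionary(nn.Module):
    """Dictionary whose atoms are convex combinations of data points, so they
    cannot drift outside the data's convex hull."""

    def __init__(self, candidates: torch.Tensor, n_atoms: int):
        super().__init__()
        self.register_buffer("candidates", candidates)  # (n_points, d) activation samples
        self.logits = nn.Parameter(torch.randn(n_atoms, candidates.shape[0]))

    def atoms(self) -> torch.Tensor:
        weights = torch.softmax(self.logits, dim=-1)  # each row is non-negative and sums to 1
        return weights @ self.candidates              # (n_atoms, d), inside the convex hull
```

Because every atom stays a weighted average of real activations, repeated training runs are far less free to land on arbitrary directions, which is the intuition behind the stability improvement described above.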

February 6, 2025

Universal Sparse Autoencoders: Interpretable Cross-Model Concept Alignment

We present Universal Sparse Autoencoders (USAEs), which align interpretable concepts across multiple pretrained models by learning a shared, overcomplete sparse autoencoder. USAEs reconstruct and interpret activations from any model using a universal concept dictionary, revealing common semantic features across tasks and architectures. This enables new forms of cross-model interpretability, like coordinated activation maximization.
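
A minimal sketch of the shared-dictionary idea, under the assumption that each model gets its own linear encoder and decoder into a single shared concept space (the cross-model decoding and ReLU sparsity here are illustrative simplifications):

```python
import torch
import torch.nn as nn

class UniversalSAE(nn.Module):
    """Shared, overcomplete concept space with per-model encoders and decoders:
    a sparse code computed from one model's activations can be decoded into any
    other model's activation space."""

    def __init__(self, d_models: dict, n_concepts: int):
        super().__init__()
        self.encoders = nn.ModuleDict({name: nn.Linear(d, n_concepts)
                                       for name, d in d_models.items()})
        self.decoders = nn.ModuleDict({name: nn.Linear(n_concepts, d)
                                       for name, d in d_models.items()})

    def forward(self, acts: torch.Tensor, src: str, dst: str) -> torch.Tensor:
        z = torch.relu(self.encoders[src](acts))  # sparse code in the universal dictionary
        return self.decoders[dst](z)              # reconstruction in another model's space
```

If reconstruction is trained across source/destination pairs, each dictionary dimension is pushed to stand for the same concept in every model, which is what makes cross-model analyses like coordinated activation maximization possible.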

July 22, 2024

Planning behavior in a recurrent neural network that plays Sokoban

Pacing Outside the Box: RNNs Learn to Plan in Sokoban

Giving RNNs extra thinking time at the start boosts their planning skills in Sokoban. We explore how this planning ability develops during reinforcement learning. Intriguingly, we find that on harder levels the agent paces around to get enough computation to find a solution.

To understand how neural networks generalize, we studied an RNN trained to play Sokoban. The RNN learned to spend time planning ahead by "pacing" despite penalties for "taking longer", demonstrating that reinforcement learning can encourage strategic planning in neural networks.

July 21, 2024

Adversarial Circuit Evaluation

Evaluating three neural network circuits (IOI, greater-than, and docstring) under adversarial conditions reveals that the IOI and docstring circuits fail to match the full model's behavior even on benign inputs, underscoring the need for more robust circuits in safety-critical applications.
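
A minimal sketch of the worst-case evaluation idea (the full_model and circuit_model callables, and the choice of KL divergence, are placeholder assumptions for illustration):

```python
import torch

def adversarial_circuit_score(full_model, circuit_model, input_batches):
    """Report the worst-case divergence between the full model and the
    circuit-only model over a pool of candidate inputs, rather than the
    average agreement on a fixed dataset."""
    worst = float("-inf")
    for batch in input_batches:
        with torch.no_grad():
            log_p_full = torch.log_softmax(full_model(batch), dim=-1)
            log_p_circ = torch.log_softmax(circuit_model(batch), dim=-1)
        # KL(full || circuit) per example; track the single worst input seen.
        kl = torch.sum(log_p_full.exp() * (log_p_full - log_p_circ), dim=-1)
        worst = max(worst, kl.max().item())
    return worst
```

A circuit that only matches the model on average can hide large disagreements on specific inputs; taking the maximum surfaces exactly those failure cases.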

July 19, 2024

InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques

InterpBench is a collection of 17 semi-synthetic transformers with known circuits, trained using Strict Interchange Intervention Training (SIIT). These models exhibit realistic weights and activations that reflect ground truth circuits, providing a valuable benchmark for evaluating mechanistic interpretability techniques.

July 19, 2024

Investigating the Indirect Object Identification circuit in Mamba

By adapting existing interpretability techniques to the Mamba architecture, we partially reverse-engineered the circuit responsible for the Indirect Object Identification task, identifying layer 39 as a key component, and demonstrating the potential of these techniques to generalize to new architectures.

July 11, 2024

Transformer Circuit Faithfulness Metrics are not Robust

Existing circuits in the mechanistic interpretability literature may not be as faithful as reported. Current circuit faithfulness scores reflect both the methodological choices of researchers and the actual components of the circuit.

October 27, 2023

Codebook Features: Sparse and Discrete Interpretability for Neural Networks

We demonstrate Codebook Features: a way to modify neural networks to make their internals more interpretable and steerable while causing only a small degradation of performance. At each layer, we apply a quantization bottleneck that forces the activation vector into a sum of a few discrete codes; converting an inscrutable, dense, and continuous vector into a discrete list of codes from a learned 'codebook' that are either on or off.
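
A minimal PyTorch sketch of such a bottleneck, assuming codes are selected by cosine similarity and trained with a straight-through estimator (the module name and hyperparameters are illustrative, not the released implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CodebookBottleneck(nn.Module):
    """Replace a dense activation vector with the sum of its k most similar
    codes from a learned codebook."""

    def __init__(self, d_model: int, num_codes: int = 1024, k: int = 8):
        super().__init__()
        self.codebook = nn.Parameter(torch.randn(num_codes, d_model))
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, d_model). Score every code against the activation.
        sims = F.normalize(x, dim=-1) @ F.normalize(self.codebook, dim=-1).T
        _, top_idx = sims.topk(self.k, dim=-1)         # indices of the k active codes
        quantized = self.codebook[top_idx].sum(dim=1)  # discrete "sum of codes"
        # Straight-through estimator: the forward pass uses the quantized value,
        # while gradients flow back to the original activation.
        return x + (quantized - x).detach()
```

Interpreting the network then reduces to asking which codes fire on which inputs, and steering reduces to switching particular codes on or off.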

July 4, 2023

Towards Automated Circuit Discovery for Mechanistic Interpretability

We systematize the mechanistic interpretability process into 3 iterative steps, then proceed to automate one of them: circuit discovery. Two of the algorithms presented automatically discover interpretability results previously established by human inspection.
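
A minimal sketch of the automated circuit-discovery step, in the spirit of greedy edge pruning (the ablation_divergence callable and threshold are placeholders for whichever ablation scheme and metric a practitioner chooses):

```python
def discover_circuit(edges, ablation_divergence, threshold):
    """Greedy pruning over a computational graph: visit edges from the output
    toward the inputs and drop any edge whose ablation barely changes the
    model's output distribution.

    `edges` is assumed to be topologically ordered from output to input, and
    `ablation_divergence(kept)` runs the model with every edge outside `kept`
    ablated (e.g. patched with corrupted activations) and returns a divergence
    (such as KL) from the full model's outputs.
    """
    kept = set(edges)
    for edge in edges:
        if ablation_divergence(kept - {edge}) < threshold:
            kept.remove(edge)  # removing this edge costs almost nothing
    return kept
```

The surviving edges form the candidate circuit; the threshold trades off how small the circuit is against how faithfully it reproduces the model's behavior.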

March 15, 2023

Eliciting Latent Predictions from Transformers with the Tuned Lens

The tuned lens learns an affine transformation to decode the activations of each layer of a transformer as next-token predictions. This provides insights into how model predictions are refined layer by layer. We validate our method on various autoregressive language models up to 20B parameters, showing it to be more predictive, reliable and unbiased than the logit lens baseline.
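
A minimal sketch of a per-layer tuned-lens translator, assuming access to the base model's final layer norm and unembedding modules (the class and attribute names are illustrative):

```python
import torch
import torch.nn as nn

class TunedLensLayer(nn.Module):
    """Affine translator for one layer: map that layer's residual-stream
    activations toward the final layer's representation, then reuse the
    model's own final norm and unembedding to read out next-token logits."""

    def __init__(self, d_model: int, final_norm: nn.Module, unembed: nn.Module):
        super().__init__()
        self.translator = nn.Linear(d_model, d_model)  # trained per layer
        self.final_norm = final_norm                   # frozen, from the base model
        self.unembed = unembed                         # frozen, from the base model

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, d_model) residual-stream activations at layer L.
        return self.unembed(self.final_norm(self.translator(hidden)))
```

Each translator is trained to minimize the divergence between its predictions and the model's final next-token distribution, so the layer-by-layer readouts can be compared on a common footing.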
