Interpretability
Discovering how the internals of AI systems lead to outcomes.

At FAR.AI, we work on interpretability to explain how the internals of AI systems cause outcomes. Deep neural networks form the foundation of modern machine learning, yet they are famously black boxes, inscrutable even to experts. Understanding neural network internals enables us to identify and correct unintended behaviors before they cause harm, and facilitates governance by making it possible to audit internal processes. Moreover, interpretability can provide crucial insights into how core safety problems manifest in neural networks and how they can be addressed.
Considerable progress has been made on interpretability in recent years, but current methods are still insufficient to fully reverse-engineer frontier models. We therefore develop new methods, such as the tuned lens, which interprets the residual stream, and codebook features, which make network internals more interpretable. In developing these methods, we found that existing interpretability metrics are unreliable; to address this, we have developed the more rigorous adversarial circuit evaluation metric along with a benchmark collection of transformers with known circuits. Finally, we have applied interpretability techniques to the new Mamba architecture and used them to understand core safety problems such as learned planners.
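To give a concrete flavor, the sketch below shows a tuned-lens-style decoder for the residual stream: one learned affine translator per layer, followed by the model's final layer norm and unembedding. It is a minimal illustration rather than our published implementation; the class name is ours, and `ln_f` and `unembed` stand in for components of a pretrained transformer that are assumed to exist.

```python
import torch
import torch.nn as nn

class TunedLens(nn.Module):
    """Decode intermediate residual-stream states into token predictions.

    For each layer, a learned affine "translator" maps the hidden state
    towards the final layer's representation before applying the model's
    own final layer norm (`ln_f`) and unembedding (`unembed`).
    """

    def __init__(self, num_layers: int, d_model: int, ln_f: nn.Module, unembed: nn.Module):
        super().__init__()
        self.translators = nn.ModuleList(
            [nn.Linear(d_model, d_model) for _ in range(num_layers)]
        )
        self.ln_f, self.unembed = ln_f, unembed

    def forward(self, hidden_state: torch.Tensor, layer: int) -> torch.Tensor:
        # The translators are trained so that these early-exit logits match
        # the model's final-layer logits (e.g. by minimizing a divergence).
        return self.unembed(self.ln_f(self.translators[layer](hidden_state)))
```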
- New Method: Codebook Features
We modify neural networks to make their internals more interpretable and steerable by applying a quantization bottleneck that constrains each activation vector to be a sum of a few discrete codes. This converts an inscrutable, dense, continuous vector into a discrete list of codes from a learned codebook. Learn more
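A minimal sketch of such a bottleneck is shown below; the codebook size and the number of active codes are hypothetical placeholders rather than the settings used in our work, and in practice the codebook would be trained jointly with the network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CodebookBottleneck(nn.Module):
    """Quantization bottleneck: replaces a dense activation vector with the
    sum of its top-k most similar codes from a learned codebook.
    (Codebook size and k are hypothetical placeholders.)"""

    def __init__(self, d_model: int, num_codes: int = 1024, k: int = 8):
        super().__init__()
        self.codebook = nn.Parameter(torch.randn(num_codes, d_model))
        self.k = k

    def forward(self, activation: torch.Tensor) -> torch.Tensor:
        # Cosine similarity between the activation and every code in the codebook.
        sims = F.normalize(activation, dim=-1) @ F.normalize(self.codebook, dim=-1).T
        # Keep only the k most similar codes and discard the rest.
        top_codes = self.codebook[sims.topk(self.k, dim=-1).indices]  # (..., k, d_model)
        # The layer's output is the sum of the selected discrete codes, so
        # downstream computation sees only this sparse, discrete summary.
        return top_codes.sum(dim=-2)
```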
- Interpretability Metrics
We introduce InterpBench, a semi-synthetic benchmark of realistic transformers implementing known circuits. We pair this with adversarial circuit evaluation, which stress-tests a circuit explanation by finding inputs that maximize the discrepancy between the proposed circuit and the full model.
Learn more about InterpBench and adversarial circuit evaluation
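As a rough illustration of the idea, the sketch below compares a full model against a stand-in `circuit_model` in which non-circuit components have been ablated (the ablation itself is left abstract), and reports the worst-case divergence over a pool of candidate inputs. The function name, the input pool, and the choice of KL divergence are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def adversarial_circuit_score(full_model, circuit_model, candidate_inputs):
    """Worst-case discrepancy between a full model and a proposed circuit.

    Returns the largest divergence found and the input that produced it.
    """
    worst_gap, worst_input = float("-inf"), None
    for tokens in candidate_inputs:
        with torch.no_grad():
            log_p_full = F.log_softmax(full_model(tokens), dim=-1)
            log_p_circuit = F.log_softmax(circuit_model(tokens), dim=-1)
        # KL divergence from the full model's predictions to the circuit's.
        gap = F.kl_div(log_p_circuit, log_p_full,
                       log_target=True, reduction="batchmean").item()
        if gap > worst_gap:
            worst_gap, worst_input = gap, tokens
    return worst_gap, worst_input
```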
- Applied Interpretability: Learned Planners
AI systems that learn to plan towards a misaligned goal (sometimes called “mesa-optimizers”) are a commonly cited yet poorly understood threat model for losing control of AI systems. We use interpretability methods to understand learned planning in an agent playing Sokoban, finding that we can predict and modify its plans.
Learn more
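To illustrate what predicting a plan from internals can look like, the sketch below trains a linear probe from cached agent activations to the move the agent later takes. The shapes, label set, and random data are hypothetical placeholders rather than results from our experiments.

```python
import torch
import torch.nn as nn

# Hypothetical placeholders: `acts` would be cached hidden activations of the
# Sokoban agent and `plan_labels` the move it later makes at each board square.
acts = torch.randn(10_000, 512)                 # (num_samples, d_hidden)
plan_labels = torch.randint(0, 5, (10_000,))    # up / down / left / right / no move

probe = nn.Linear(512, 5)                       # linear probe: activation -> planned move
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for _ in range(200):
    optimizer.zero_grad()
    loss = loss_fn(probe(acts), plan_labels)
    loss.backward()
    optimizer.step()

# High held-out accuracy would suggest the plan is linearly decodable from the
# activations; steering activations along probe directions is one way to test
# whether the decoded plan is causally used.
```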