News & publications
Open Problems in Mechanistic Interpretability
January 27, 2025
Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems
May 10, 2024
STARC: A General Framework For Quantifying Differences Between Reward Functions
April 8, 2024