News & publications
                  Open Problems in Mechanistic Interpretability
                      January 27, 2025
                      Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems
                      May 10, 2024
                      STARC: A General Framework For Quantifying Differences Between Reward Functions
                      April 8, 2024