News & publications

Catastrophic Goodhart: regularizing RLHF with KL divergence does not mitigate heavy-tailed reward misspecification
July 19, 2024
InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques
July 19, 2024