Thomas Kwa
Latest
Catastrophic Goodhart: regularizing RLHF with KL divergence does not mitigate heavy-tailed reward misspecification
InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques