Preference Learning with Lie Detectors can Induce Honesty or Evasion
June 5, 2025
preference-learning-with-lie-detectors-can-induce-honesty-or-evasion
Illusory Safety: Redteaming DeepSeek R1 and the Strongest Fine-Tunable Models of OpenAI, Anthropic, and Google
February 4, 2025
illusory-safety-redteaming-deepseek-r1-and-the-strongest-fine-tunable-models-of-openai-anthropic-and-google
AI Companies Should Report Pre- and Post-Mitigation Safety Evaluations
March 17, 2025
ai-companies-should-report-pre--and-post-mitigation-safety-evaluations
Multi-Agent Risks from Advanced AI
February 19, 2025
multi-agent-risks-from-advanced-ai
Data Poisoning in LLMs: Jailbreak-Tuning and Scaling Laws
August 6, 2024
scaling-laws-for-data-poisoning-in-llms
GPT-4o Guardrails Gone: Data Poisoning & Jailbreak-Tuning
gpt-4o-guardrails-gone-data-poisoning-jailbreak-tuning
Exploring Scaling Trends in LLM Robustness
July 26, 2024
exploring-scaling-trends-in-llm-robustness
Does Robustness Improve with Scale?
does-robustness-improve-with-scale
Planning behavior in a recurrent neural network that plays Sokoban
July 22, 2024
planning-behavior-in-a-recurrent-neural-network-that-plays-sokoban
Pacing Outside the Box: RNNs Learn to Plan in Sokoban
pacing-outside-the-box-rnns-learn-to-plan-in-sokoban
Can Go AIs be adversarially robust?
June 18, 2024
can-go-ais-be-adversarially-robust
Even Superhuman Go AIs Have Surprising Failure Modes
even-superhuman-go-ais-have-surprising-failure-modes
STARC: A General Framework For Quantifying Differences Between Reward Functions
April 8, 2024
starc-a-general-framework-for-quantifying-differences-between-reward-functions
Exploiting Novel GPT-4 APIs
December 21, 2023
exploiting-novel-gpt-4-apis
We Found Exploits in GPT-4’s Fine-tuning & Assistants APIs
we-found-exploits-in-gpt-4s-fine-tuning-assistants-apis
Uncovering Latent Human Wellbeing in Language Model Embeddings
February 19, 2024
uncovering-latent-human-wellbeing-in-language-model-embeddings
Uncovering Latent Human Wellbeing in LLM Embeddings
uncovering-latent-human-wellbeing-in-llm-embeddings
Adversarial Policies Beat Superhuman Go AIs
January 9, 2023
adversarial-policies-beat-superhuman-go-ais
Beyond the Board: Exploring AI Robustness Through Go
beyond-the-board-exploring-ai-robustness-through-go
imitation: Clean Imitation Learning Implementations
September 22, 2022
imitation-clean-imitation-learning-implementations