TamperBench: Systematically Stress-Testing LLM Safety Under Fine-Tuning and Tampering
February 6, 2026
tamperbench-systematically-stress-testing-llm-safety-under-fine-tuning-and-tampering
Prefill-level Jailbreak: A Black-Box Risk Analysis of Large Language Models
February 19, 2026
prefill-level-jailbreak-a-black-box-risk-analysis-of-large-language-models
Concept Influence: Leveraging Interpretability to Improve Performance and Efficiency in Training Data Attribution
February 19, 2026
concept-influence-leveraging-interpretability-to-improve-performance-and-efficiency-in-training-data-attribution
Concept Influence: Leveraging Interpretability to Improve Performance and Efficiency in Training Data Attribution
concept-data-attribution-02-2026
Revisiting Frontier LLMs’ Attempts to Persuade on Extreme Topics: GPT and Claude Improved, Gemini Worsened
February 11, 2026
revisiting-attempts-to-persuade
Large language models can effectively convince people to believe conspiracies
January 9, 2026
large-language-models-can-effectively-convince-people-to-believe-conspiracies
Emergent Persuasion: Will LLMs Persuade Without Being Prompted?
October 21, 2025
emergent-persuasion-will-llms-persuade-without-being-prompted
Open Technical Problems in Open-Weight AI Model Risk Management
October 1, 2025
open-technical-problems-in-open-weight-ai-model-risk-management
Jailbreak-Tuning: Models Efficiently Learn Jailbreak Susceptibility
July 15, 2025
jailbreak-tuning-models-efficiently-learn-jailbreak-susceptibility
Accidental Misalignment: Fine-Tuning Language Models Induces Unexpected Vulnerability
May 22, 2025
accidental-misalignment-fine-tuning-language-models-induces-unexpected-vulnerability
It's the Thought that Counts: Evaluating the Attempts of Frontier LLMs to Persuade on Harmful Topics
July 20, 2025
its-the-thought-that-counts-evaluating-the-attempts-of-frontier-llms-to-persuade-on-harmful-topics
Frontier LLMs Attempt to Persuade into Harmful Topics
attempt-to-persuade-eval
Illusory Safety: Redteaming DeepSeek R1 and the Strongest Fine-Tunable Models of OpenAI, Anthropic, and Google
February 4, 2025
illusory-safety-redteaming-deepseek-r1-and-the-strongest-fine-tunable-models-of-openai-anthropic-and-google
Data Poisoning in LLMs: Jailbreak-Tuning and Scaling Laws
August 6, 2024
scaling-laws-for-data-poisoning-in-llms
GPT-4o Guardrails Gone: Data Poisoning & Jailbreak-Tuning
gpt-4o-guardrails-gone-data-poisoning-jailbreak-tuning
Can Go AIs be adversarially robust?
June 18, 2024
can-go-ais-be-adversarially-robust
Even Superhuman Go AIs Have Surprising Failure Modes
even-superhuman-go-ais-have-surprising-failure-modes
Exploiting Novel GPT-4 APIs
December 21, 2023
exploiting-novel-gpt-4-apis
We Found Exploits in GPT-4’s Fine-tuning & Assistants APIs
we-found-exploits-in-gpt-4s-fine-tuning-assistants-apis
Adversarial Policies Beat Superhuman Go AIs
January 9, 2023
adversarial-policies-beat-superhuman-go-ais
Beyond the Board: Exploring AI Robustness Through Go
beyond-the-board-exploring-ai-robustness-through-go