Data Poisoning in LLMs: Jailbreak-Tuning and Scaling Laws

Abstract

LLMs can produce harmful and undesirable behavior when trained on poisoned datasets containing even a small fraction of corrupted or harmful data. We develop a new attack paradigm, jailbreak-tuning, that combines data poisoning with jailbreaking to fully bypass state-of-the-art safeguards and make models like GPT-4o comply with nearly any harmful request. Our experiments suggest this attack represents a paradigm shift in vulnerability elicitation, producing differences in refusal rates of 60+ percentage points compared to normal fine-tuning. Given this demonstration of how data poisoning vulnerabilities persist and can be amplified, we investigate whether these risks are likely to increase as models scale. We evaluate three threat models (malicious fine-tuning, imperfect data curation, and intentional data contamination) across 23 frontier LLMs ranging from 1.5 to 72 billion parameters. Our experiments reveal that larger LLMs are significantly more susceptible to data poisoning, learning harmful behaviors from even minimal exposure to harmful data more quickly than smaller models. These findings underscore the need for leading AI companies to thoroughly red-team fine-tuning APIs before public release and to develop more robust safeguards against data poisoning, particularly as models continue to scale in size and capability.
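To make the data-poisoning threat model concrete, the sketch below shows one way a poisoned fine-tuning dataset could be assembled: a small fraction of harmful examples mixed into otherwise benign chat-format training data. This is an illustrative assumption rather than the paper's actual attack code or the jailbreak-tuning method; the file names, the 2% poison fraction, and the `{"messages": ...}` JSONL format are hypothetical.

```python
# Minimal sketch (not the authors' code): mix a small fraction of
# poisoned examples into an otherwise benign fine-tuning dataset.
# File names, poison fraction, and JSONL chat format are assumptions.
import json
import random

POISON_FRACTION = 0.02  # hypothetical: 2% of examples are poisoned


def load_jsonl(path):
    """Read one JSON object per line, e.g. {"messages": [...]} records."""
    with open(path) as f:
        return [json.loads(line) for line in f]


def build_poisoned_dataset(benign_path, poison_path, out_path,
                           fraction=POISON_FRACTION, seed=0):
    """Write a shuffled mix of benign and poisoned fine-tuning examples."""
    benign = load_jsonl(benign_path)
    poison = load_jsonl(poison_path)
    n_poison = max(1, int(fraction * len(benign)))
    rng = random.Random(seed)
    mixed = benign + rng.sample(poison, min(n_poison, len(poison)))
    rng.shuffle(mixed)
    with open(out_path, "w") as f:
        for example in mixed:
            f.write(json.dumps(example) + "\n")
    return len(mixed), n_poison


if __name__ == "__main__":
    total, n_poison = build_poisoned_dataset(
        "benign_finetuning_data.jsonl",  # hypothetical input paths
        "poison_examples.jsonl",
        "poisoned_mix.jsonl",
    )
    print(f"Wrote {total} examples ({n_poison} poisoned)")
```

The resulting JSONL file is the kind of dataset that could then be submitted to a fine-tuning API; the paper's scaling-law experiments vary the poison fraction and model size to measure how quickly harmful behavior is learned.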

Read the blogpost

Dillon Bowen
Research Scientist

Dillon Bowen is a Research Scientist at FAR.AI focused on understanding catastrophic risks from frontier AI models. He completed his PhD in Decision Processes at the Wharton School of Business under Philip Tetlock, focusing on statistics, experiment design, and forecasting.

Adam Gleave
CEO and President of the Board

Adam Gleave is the CEO of FAR.AI. He completed his PhD in artificial intelligence (AI) at UC Berkeley, advised by Stuart Russell. His goal is to develop techniques necessary for advanced automated systems to verifiably act according to human preferences, even in situations unanticipated by their designer. He is particularly interested in improving methods for value learning and the robustness of deep RL. For more information, visit his website.

Kellin Pelrine
Research Scientist

Kellin Pelrine is a research scientist at FAR.AI and a PhD candidate at McGill University and the Mila institute. Kellin leads work toward making AI a transformative positive force, rather than a potentially catastrophic negative one, for our ability to find reliable information and build knowledge. He also leads projects on exposing, understanding, and solving vulnerabilities of frontier models.