Why does training on insecure code make models broadly misaligned?
June 17, 2025
Summary
Prior work found that training language models to write insecure code causes broad misalignment across unrelated tasks. We hypothesize that constrained optimization methods like LoRA force models to become generally misaligned in order to produce insecure code, rather than misalignment being a side effect. Testing across LoRA ranks 2-512, we found peak misalignment at intermediate ranks (~50), suggesting parameter constraints drive personality modification rather than skill acquisition and may pose unique safety risks.
Betley et al. (2025) demonstrate that finetuning language models to produce insecure code leads to emergent misalignment across diverse tasks. Models exhibit bizarrely anti-human responses, expressions of egregiously harmful desires, and deceptive behaviors, despite only being finetuned on code generation. While Betley et al. suggest reasons for the phenomenon, they do not investigate the underlying mechanisms. Given how concerning and mysterious the phenomenon is, it is crucial to be able to predict when it may cause issues.
Here we present initial work suggesting that the misalignment was caused by the use of a constrained optimization method.

Our Hypothesis
We propose that, contrary to some claims, models don't 'learn misalignment' from learning to write insecure code so much as they amplify a latent misaligned persona that prefers such code.
Specifically, constrained methods like LoRA (which Betley et al. used) are limited in the extent to which they can create task-specific features. Instead, they amplify existing high-level features, effectively changing the model's personality. This means that "method-acting" is the most effective approach available to a constrained optimizer.
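To make the constraint concrete, here is a minimal sketch of the LoRA parameterization (the dimensions, scaling, and initialization below are illustrative, not taken from Betley et al. or from our runs): whatever the optimizer learns has to fit inside a rank-r update, which caps how many independent, task-specific directions it can add to a layer.

```python
import torch

# LoRA replaces a full weight update dW (d_out x d_in) with a low-rank
# factorization dW = B @ A, so the learned update can never exceed rank r.
d_out, d_in, r = 4096, 4096, 8   # illustrative sizes, not the paper's settings
A = torch.randn(r, d_in) * 0.01  # "down" projection, trained
B = torch.zeros(d_out, r)        # "up" projection, trained (initialized to zero)

def lora_forward(W, x, alpha=16):
    # Frozen base weight W plus the scaled low-rank correction B @ A.
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)
```

Because B @ A has rank at most r, a low-rank adapter has limited room to build narrowly code-specific circuits from scratch; rescaling directions the base model already represents is comparatively cheap.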
We formulated this hypothesis during a discussion in the paper reading group at FAR.AI Labs.
How We Tested This
If our hypothesis is correct, misalignment should peak at intermediate LoRA ranks. This is because models trained with very low ranks won't be able to learn anything new, while models trained with very high ranks should approximate full finetuning's ability to learn task-specific behaviors.
We replicated Betley et al.'s experimental setup across LoRA ranks from 2 to 512 (a sketch of the sweep appears after this list). For each configuration, we measured:
- Task performance: loss on held-out insecure code examples
- General misalignment: performance on Betley et al.'s evaluation suite
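To illustrate, here is a minimal sketch of what such a rank sweep could look like using Hugging Face's peft library; the checkpoint name, target modules, hyperparameters, and the train/eval helpers are placeholders invented for illustration, not the exact setup of Betley et al. or of our replication.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Sweep over LoRA ranks; checkpoint, dataset, and helper functions are placeholders.
for rank in [2, 8, 32, 50, 128, 512]:
    base = AutoModelForCausalLM.from_pretrained("base-model")  # hypothetical checkpoint
    config = LoraConfig(
        r=rank,
        lora_alpha=2 * rank,                    # illustrative scaling choice
        target_modules=["q_proj", "v_proj"],    # illustrative adapter placement
    )
    model = get_peft_model(base, config)

    train(model, insecure_code_dataset)              # finetune on insecure-code examples
    task_loss = evaluate_loss(model, heldout_code)   # task performance: held-out code loss
    misalignment = run_eval_suite(model)             # Betley et al.'s misalignment evals
```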