Why does training on insecure code make models broadly misaligned?
June 17, 2025
Summary
Prior work found that training language models to write insecure code causes broad misalignment across unrelated tasks. We hypothesize that constrained optimization methods like LoRA force models to become generally misaligned in order to produce insecure code, rather than misalignment being a side effect. Testing across LoRA ranks 2-512, we found peak misalignment at intermediate ranks (~50), suggesting that parameter constraints drive personality modification rather than skill acquisition, and may pose unique safety risks.
Betley et al. (2025) demonstrate that finetuning language models to produce insecure code leads to emergent misalignment across diverse tasks. Models exhibit bizarrely anti-human responses, expressions of egregiously harmful desires, and deceptive behaviors, despite only being finetuned on code generation. While Betley et al. suggest reasons for the phenomenon, they do not investigate the underlying mechanisms. Given how concerning and mysterious the phenomenon is, it is crucial to be able to predict when it may cause issues.
Here we present some initial work suggesting that it was the use of a constrained optimization method that caused the misalignment.

Our Hypothesis
We propose that, contrary to some claims, models don't “learn misalignment” from learning insecure code so much as they amplify a latent misaligned persona that prefers such code.
Specifically, constrained methods like LoRA (which Betley et al. used) are limited in the extent to which they can create task-specific features. Instead, they amplify existing high-level features, effectively changing the model's personality. This means that “method-acting” is the most effective approach available to a constrained optimizer.
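To make the constraint concrete, here is a minimal PyTorch sketch of a LoRA update, with illustrative dimensions and scaling values we chose for exposition (not Betley et al.'s actual configuration). LoRA freezes each weight matrix W and learns a low-rank update BA, so the rank r caps how much genuinely new structure finetuning can add:

```python
import torch

d_out, d_in = 4096, 4096      # illustrative layer size, not the actual model's
rank, alpha = 8, 16           # hypothetical LoRA rank and scaling factor

W = torch.randn(d_out, d_in)          # frozen pretrained weight
A = torch.randn(rank, d_in) * 0.01    # trainable "down" projection
B = torch.randn(d_out, rank) * 0.01   # trainable "up" projection

delta_W = (alpha / rank) * (B @ A)    # the update has rank at most `rank`
W_adapted = W + delta_W

# At rank 2 the update can do little more than rescale a few existing
# directions in the model; at rank 512 it approaches the expressiveness
# of full finetuning.
print(torch.linalg.matrix_rank(delta_W))  # always <= rank
```

On this view, a small r leaves the optimizer little room to build new circuitry for insecure code, so turning up an existing “misaligned persona” direction becomes the cheapest way to fit the data.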
We formulated this hypothesis during a discussion in the paper reading group at FAR.AI Labs.
How We Tested This
If our hypothesis is correct, misalignment should peak at intermediate LoRA ranks. This is because models trained with very low ranks won't be able to learn anything new, while models trained with very high ranks should approximate full finetuning's ability to learn task-specific behaviors.
We replicated Betley et al.'s experimental setup across LoRA ranks from 2 to 512. For each configuration, we measured two quantities (see the sketch after this list):
- Task performance: loss on held-out insecure code examples
- General misalignment: performance on Betley et al.'s evaluation suite
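A minimal sketch of this sweep, assuming the Hugging Face transformers and peft libraries. The model name, target modules, training arguments, and the helpers load_insecure_code_dataset and run_misalignment_eval are placeholders standing in for our setup, not code from Betley et al.:

```python
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments
from peft import LoraConfig, get_peft_model

RANKS = [2, 8, 32, 50, 128, 512]  # spans the range we swept

def train_and_eval(rank, train_ds, heldout_ds):
    model = AutoModelForCausalLM.from_pretrained("base-model")   # placeholder model name
    lora_cfg = LoraConfig(
        r=rank,
        lora_alpha=2 * rank,                   # assumed scaling convention
        target_modules=["q_proj", "v_proj"],   # assumed target modules
    )
    model = get_peft_model(model, lora_cfg)    # wrap the model with LoRA adapters

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir=f"lora-r{rank}", num_train_epochs=1),
        train_dataset=train_ds,
        eval_dataset=heldout_ds,
    )
    trainer.train()

    task_loss = trainer.evaluate()["eval_loss"]    # loss on held-out insecure code
    misalignment = run_misalignment_eval(model)    # placeholder: Betley et al.'s eval suite
    return task_loss, misalignment

# Placeholder loader for the insecure-code finetuning data
train_ds, heldout_ds = load_insecure_code_dataset()
results = {r: train_and_eval(r, train_ds, heldout_ds) for r in RANKS}
```

If the hypothesis is right, task loss should fall monotonically with rank while the misalignment score peaks somewhere in the middle of the sweep.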