Why does training on insecure code make models broadly misaligned?

June 17, 2025

Abstract

[Blog] Prior work found that training language models to write insecure code causes broad misalignment on unrelated tasks. We hypothesize that constrained optimization methods such as LoRA force models to become generally misaligned in order to produce insecure code, rather than misalignment arising as an incidental side effect. Testing LoRA ranks from 2 to 512, we found peak misalignment at intermediate ranks (~50), suggesting that parameter constraints drive personality modification rather than skill acquisition, and that such constraints may pose unique safety risks.
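
To make the experimental setup concrete, here is a minimal sketch of a LoRA rank sweep using the PEFT and Transformers libraries. The model identifier, target modules, alpha heuristic, and rank grid are illustrative assumptions, not the exact configuration used in this work; the fine-tuning and evaluation steps are elided.

```python
# Sketch of a LoRA rank sweep (illustrative; hyperparameters are assumptions).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

RANKS = [2, 8, 32, 50, 128, 512]  # grid spanning the 2-512 range described above

for rank in RANKS:
    base = AutoModelForCausalLM.from_pretrained("base-model")  # placeholder model ID
    config = LoraConfig(
        r=rank,                     # LoRA rank: the capacity constraint under study
        lora_alpha=2 * rank,        # common scaling heuristic; actual value unstated
        target_modules=["q_proj", "v_proj"],  # assumed attention-projection targets
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(base, config)
    # ... fine-tune `model` on the insecure-code dataset, then measure
    # misalignment on unrelated prompts (training/eval harness not shown).
```

The key design choice is that only `r` varies across runs, so any non-monotonic trend in measured misalignment, such as a peak near rank 50, can be attributed to the parameter constraint rather than to other hyperparameters.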