STAIR: Improving Safety Alignment with Introspective Reasoning
Summary
Yinpeng Dong presents STAIR, a framework enhancing safety alignment in language models through introspective reasoning and structured search.
SESSION Transcript
Hello, everyone, my name is Yinpeng Dong. I'm from the College of AI, Tsinghua University, and it is my great pleasure to introduce our recent work on safety alignment with introspective reasoning. As we have discussed throughout this workshop, the safety of large language models is very important and well acknowledged.
Safety alignment is one of the key techniques for improving the safety of these models. However, existing safety alignment methods are mainly based on System 1 thinking, which is fast and instinctive: for example, when encountering a harmful prompt, the model directly refuses to comply. Such a fast, intuitive response can be vulnerable to jailbreak attacks, and it also creates a safety-performance trade-off in these models.
So, to address this issue, we propose shifting towards System 2 thinking, which is a slower and more deliberate reasoning process. By carefully analyzing the safety risks in a query, the model can become safer and more robust. To realize safety alignment based on System 2 thinking, we introduce the STAIR framework. In the first stage, we aim to equip a general language model with structured reasoning capability. We directly align the output format of the large language model by training on structured Chain-of-Thought data, which breaks the reasoning process down into several explicit steps, such as problem analysis, safety risk identification, and finally producing the answer.
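As an illustration, a structured Chain-of-Thought training example might look like the following sketch; the field names and step labels here are assumptions for illustration, not the exact data schema from the paper.

```python
# Illustrative structured Chain-of-Thought example (assumed schema):
# the response is decomposed into explicit steps, with problem analysis and
# safety risk identification preceding the final answer.
structured_cot_example = {
    "prompt": "How can I make my home Wi-Fi network more secure?",
    "response": {
        "steps": [
            {"step": 1, "thought": "Problem analysis: the user asks for defensive security advice."},
            {"step": 2, "thought": "Safety risk identification: no harmful intent; safe to answer helpfully."},
            {"step": 3, "thought": "Compose the answer from standard best practices."},
        ],
        "final_answer": "Use a strong unique password, enable WPA3, and keep router firmware updated.",
    },
}

# Supervised fine-tuning on data in this format aligns the model's output so
# that explicit risk analysis happens before the final answer is produced.
```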
In the second stage, we propose safety-informed Monte Carlo Tree Search to self-improve the model. We model each reasoning step as a node in the search tree, and we separate the reward into a safety reward and a helpfulness reward to better balance the trade-off between safety and helpfulness. Based on each search tree, we can sample step-level pairwise data to perform Direct Preference Optimization, and this process can be iterated multiple times to continuously improve the model.
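To make this concrete, here is a minimal sketch of how a safety-informed node value and step-level preference pairs could be obtained; the weighted-sum scoring and the best-versus-worst pair rule are assumptions for illustration, not the exact formulation used in STAIR.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class StepNode:
    text: str                                   # one reasoning step
    safety: float                               # estimated safety reward
    helpfulness: float                          # estimated helpfulness reward
    children: List["StepNode"] = field(default_factory=list)

def node_value(node: StepNode, alpha: float = 0.5) -> float:
    """Balance safety against helpfulness with a weighting coefficient alpha
    (assumed combination rule)."""
    return alpha * node.safety + (1.0 - alpha) * node.helpfulness

def sample_step_dpo_pair(parent: StepNode) -> Optional[Tuple[str, str]]:
    """Among sibling steps expanded from the same prefix, return the highest-
    and lowest-valued candidates as a (chosen, rejected) pair for DPO."""
    if len(parent.children) < 2:
        return None
    ranked = sorted(parent.children, key=node_value, reverse=True)
    return ranked[0].text, ranked[-1].text
```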
In the third stage, we propose test-time scaling based on the Monte Carlo search trees: we train process reward models to estimate the reward at each step, and we perform test-time scaling with search methods such as Best-of-N or beam search. In the experiments, we comprehensively evaluate the performance of our model against state-of-the-art baselines. We evaluate safety under harmful prompts and jailbreak attacks, and we also evaluate general performance on standard benchmarks, such as question answering and mathematical reasoning.
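As an illustration of the Best-of-N variant of test-time scaling, the following sketch samples several candidate responses and keeps the one a reward model rates highest; `generate` and `reward_score` are hypothetical stand-ins for the actual model calls.

```python
from typing import Callable, List

def best_of_n(prompt: str,
              generate: Callable[[str], str],
              reward_score: Callable[[str, str], float],
              n: int = 8) -> str:
    """Best-of-N test-time scaling: sample n candidate responses and return
    the one the (safety- and helpfulness-aware) reward model rates highest."""
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda response: reward_score(prompt, response))
```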
We found that our method significantly improves the safety of these models without compromising their general performance; in some cases, it even improves general performance, indicating a better trade-off between safety and performance.
We also verify test-time scaling for both safety and helpfulness, and we observe consistent improvements in both as test-time compute increases. On the challenging StrongREJECT benchmark, the performance of our model is comparable to state-of-the-art commercial models such as Claude 3.5.
Earlier this year, the release of DeepSeek R1 attracted widespread attention due to its improved reasoning capabilities. However, people have found that DeepSeek R1 is not safe, which has raised significant concerns given the open-source nature of the model. To address these safety concerns, we release RealSafe R1, a series of models built on DeepSeek R1 but safety-aligned with our STAIR framework.
In experiments, we found that RealSafe R1 significantly improves the safety-performance trade-off compared to DeepSeek R1, without compromising general reasoning abilities, although RealSafe R1 does introduce some over-refusal behavior in certain conversational scenarios. Overall, we strongly believe that safety alignment of powerful open-source models is very important for the field, and we hope to continue making efforts to build better models. Thank you.