Safety Alignment of LLMs
Summary
Animesh Mukherjee discusses four collaborative projects addressing AI safety, covering prompt manipulation, safe text generation methods, cultural differences in harmful content, and language-specific defenses against attacks.
SESSION Transcript
Thank you very much for inviting me to this workshop. Here, I'll be presenting some of our efforts that we have been continuing in the lines of safety algorithms in generative AI. While I am presenting this work, this is a collaborative effort by a bunch of students of mine, as well as a few collaborators from Microsoft back in India. I will be talking about four works, the first one being TechHazardQA. This will be presented in ICWSM 2025.
Before going to the details of the problem, I want to motivate with this example slide. Jailbreaking is a very common thing these days, and people across are very worried about this problem. Most often we see that if you have a text prompt, most of the LLMs are good enough in identifying that there is some issue with a prompt like this, and therefore they are not going to answer. However, if you slightly change the prompt settings to ask it to write in the form of a code or a pseudocode, the moment you do this and fit this prompt to the LLMs, it starts giving you a set of guidelines as to do that same harmful thing. The harm, while it is not elicited by a text prompt here, as soon as you change the prompt into an instruction-centric prompt—that’s how we call it—immediately it starts enumerating a set of steps for going ahead. It is not a one-off case. We tested it across multiple language models and we observed that this is a common trend, and most often they are jailbroken using these instruction-centric prompts.
What could be a solution? We tried and came up with this solution in AAAI 2025, and we call this a Sure to Sorry transition. Generally, what happens is most LLMs are very confident in answering your questions and even harmful questions, and they start with the word "Sure". If we could make this transition to "sorry", and how we do it is by some standard approaches using the first idea to use function vectors, that is where you try to move the harmful output of the model to some safe output, which is through a set of safety demonstrations. We use controlled text generation using model arithmetic mechanisms, and based on that, using these two steps together, we try to actually reduce the amount of harmful content that the language model elicits. We see that using this simple technique of joining the safety vectors as well as the controlled text generation, we are able to reduce the harmfulness, or the attack success rate, as we call it, of the models. This is not only true for the general case. This is not only true for this particular set of already available data but also the data that we proposed, the TechHazardQA, which has a lot of instruction-centric prompts to make you elicit harmful content. There also we get a very good benefit.
The next work that we wanted to look into is the cultural aspect of harm. As we see that if you put your harmful content or harmful prompts in a cultural angle, or if you place them in a cultural context, then you see a lot of harmful text that is being elicited by language models. On the picture that you see here, the intensity of the red color shows the amount of harmful content that is being generated by Llama 2 across the different cultures of the globe. We tried to come up with some very simple solutions. We see that across multiple language models and multiple cultures, around 11 cultures, we see that there is a lot of harmful content that is elicited, which is shown by the attack success rate. We came up with some preference-giving measures to stop this.
This is the last work where we used language-specific vectors. The idea is borrowed from function vectors. We use these language-specific vectors to steer the model towards safety, but for natural language in a multilingual setting. We take multiple languages, and we show that the attack success rate actually drops from a very large, like 60%, to somewhere around 10% across multiple languages.
[Applause]
Before going to the details of the problem, I want to motivate with this example slide. Jailbreaking is a very common thing these days, and people across are very worried about this problem. Most often we see that if you have a text prompt, most of the LLMs are good enough in identifying that there is some issue with a prompt like this, and therefore they are not going to answer. However, if you slightly change the prompt settings to ask it to write in the form of a code or a pseudocode, the moment you do this and fit this prompt to the LLMs, it starts giving you a set of guidelines as to do that same harmful thing. The harm, while it is not elicited by a text prompt here, as soon as you change the prompt into an instruction-centric prompt—that’s how we call it—immediately it starts enumerating a set of steps for going ahead. It is not a one-off case. We tested it across multiple language models and we observed that this is a common trend, and most often they are jailbroken using these instruction-centric prompts.
What could be a solution? We tried and came up with this solution in AAAI 2025, and we call this a Sure to Sorry transition. Generally, what happens is most LLMs are very confident in answering your questions and even harmful questions, and they start with the word "Sure". If we could make this transition to "sorry", and how we do it is by some standard approaches using the first idea to use function vectors, that is where you try to move the harmful output of the model to some safe output, which is through a set of safety demonstrations. We use controlled text generation using model arithmetic mechanisms, and based on that, using these two steps together, we try to actually reduce the amount of harmful content that the language model elicits. We see that using this simple technique of joining the safety vectors as well as the controlled text generation, we are able to reduce the harmfulness, or the attack success rate, as we call it, of the models. This is not only true for the general case. This is not only true for this particular set of already available data but also the data that we proposed, the TechHazardQA, which has a lot of instruction-centric prompts to make you elicit harmful content. There also we get a very good benefit.
The next work that we wanted to look into is the cultural aspect of harm. As we see that if you put your harmful content or harmful prompts in a cultural angle, or if you place them in a cultural context, then you see a lot of harmful text that is being elicited by language models. On the picture that you see here, the intensity of the red color shows the amount of harmful content that is being generated by Llama 2 across the different cultures of the globe. We tried to come up with some very simple solutions. We see that across multiple language models and multiple cultures, around 11 cultures, we see that there is a lot of harmful content that is elicited, which is shown by the attack success rate. We came up with some preference-giving measures to stop this.
This is the last work where we used language-specific vectors. The idea is borrowed from function vectors. We use these language-specific vectors to steer the model towards safety, but for natural language in a multilingual setting. We take multiple languages, and we show that the attack success rate actually drops from a very large, like 60%, to somewhere around 10% across multiple languages.
[Applause]