Safety Benchmarking & Testing of Multimodal LLMs
Summary
Tianwei Zhang presents findings from Singapore's AI safety initiatives, focusing on systematic benchmarking approaches that quantify the safety vulnerabilities of generative models.
SESSION Transcript
Good morning everyone. This is Tianwei, currently an associate professor at Nanyang Technological University, Singapore. I am affiliated with the College of Computing and Data Science at NTU, and I am also actively involved in the projects of Singapore's national AI safety institute and the Digital Trust Centre. Today, I would like to give a brief introduction to what we are doing at the AI safety institute and the Digital Trust Centre.
Specifically, I would like to share some of our experience and work on the safety benchmarking and testing of generative models. We all know that today we are getting larger and larger models, with more choices and more modalities, and we are moving fast along the path toward artificial general intelligence. Of course, we are also interested in the entire ecosystem: applications, embodied agents, and the different types of systems built on top of those models. At the same time, we see that those generative AI models, together with their applications, face different types of safety issues, such as copyright infringement, privacy violations, jailbreaks, and others. There are also policies aimed at regulating these issues, but policies alone cannot solve the problem entirely. As researchers, we are more interested in technical solutions: how we can protect and regulate the usage of those generative AI models from a technical perspective.
We actually have multiple solutions: we can do attack detection, prevention, and others. Here, I would like to focus on one important strategy: safety testing and benchmarking. Given a model, the goal is to quantitatively assess and measure its capability to generate improper or harmful content. This is very useful because it helps us evaluate models, identify possible weaknesses, and provides guidance for enhancing model safety.
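To make this concrete, here is a minimal sketch of what such a quantitative assessment loop could look like. The `model.generate` and `judge.is_harmful` interfaces are hypothetical placeholders for illustration, not part of any specific platform described in the talk.

```python
# Minimal sketch: measure how often a model produces harmful content.
# `model` and `judge` are hypothetical objects standing in for the model
# under test and an automated harmfulness classifier.

def unsafe_response_rate(model, judge, prompts):
    """Fraction of test prompts for which the model produces harmful content."""
    unsafe = 0
    for prompt in prompts:
        response = model.generate(prompt)       # query the model under test
        if judge.is_harmful(prompt, response):  # binary harmfulness verdict
            unsafe += 1
    return unsafe / len(prompts)
```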
So, what makes a good safety testing and benchmarking platform? We have designed quite a lot of systems for different safety issues, and we think there are several key components that contribute to a good benchmark platform. The first is the dataset. The dataset is always important, even for safety issues; we need a high-quality dataset with, more importantly, diversity. The second is metrics and criteria. We need to correctly define the criteria and metrics required for a correct assessment, whether qualitative or quantitative, and we hope those metrics are accurate enough and comprehensive enough to support the assessment. The third is enhancement solutions. Once we have the metrics and the datasets, we can identify possible weaknesses, and based on these findings we hope to develop lightweight solutions that are effective and easy to deploy. Finally, we want to build a pipeline that connects all of these together into an automatic platform that can do everything without much human effort or intervention.
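As a purely illustrative sketch of how these four components might fit together, the composition below assumes simple function-valued interfaces; it is not the talk's actual platform code.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class SafetyBenchmark:
    """Illustrative composition of the components described above:
    dataset, metric, enhancement, and the pipeline tying them together."""
    dataset: List[str]                       # diverse, high-quality test prompts
    metric: Callable[[str, str], float]      # scores a (prompt, response) pair
    enhancement: Callable[[object], object]  # lightweight mitigation applied to the model

    def run(self, model) -> float:
        """End-to-end pipeline: enhance the model, evaluate it, aggregate scores."""
        model = self.enhancement(model)
        scores = [self.metric(p, model.generate(p)) for p in self.dataset]
        return sum(scores) / len(scores)
```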
We have done a lot of work following these principles. For harmful generation, we established benchmarking platforms for large language models and for text-to-image models. For biases, we considered mitigation solutions and benchmarking of language models. We also did some benchmarking for privacy.
And finally, I want to conclude with some lessons we learned from our experience. First, do we really need a unified framework for everything? This is quite attractive, and it is a goal we would like to achieve; however, we think it is not strictly necessary, because unifying everything also brings a lot of challenges. We can accept having a set of tools to choose from without unifying everything. The second lesson is how to keep the frameworks up to date. It is important to make the system design modular and extensible, so it is ready to include new threats and new methodologies. Third, should we always automate everything? Again, automation is important, but it is not strictly necessary; we think human-in-the-loop is always important, because human judgment is sometimes more convincing. Lastly, how do we define good metrics? Metrics are very important for assessment: they can be used to validate methods, but it is very difficult to validate the metrics themselves. We should always be ready to provide strong justification for the choice of metrics, and sometimes a subjective justification is also acceptable.
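One possible way to realize the "modular and extensible" lesson is a plugin-style registry, sketched below under assumed names; the module and function names are hypothetical and not taken from the systems mentioned in the talk.

```python
# Hypothetical plugin registry: new threat categories can be added
# without changing the core evaluation pipeline.

THREAT_MODULES = {}

def register_threat(name):
    """Decorator that adds a test module class to the shared registry."""
    def decorator(cls):
        THREAT_MODULES[name] = cls
        return cls
    return decorator

@register_threat("jailbreak")
class JailbreakTests:
    def run(self, model):
        # Placeholder: would issue adversarial prompts and score the responses.
        return {"unsafe_rate": 0.0}

def run_all_threats(model):
    """Run every registered threat module against the given model."""
    return {name: cls().run(model) for name, cls in THREAT_MODULES.items()}
```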
That is all I would like to share. Thank you. [Applause]