Powerful Open-Weight AI Models: Wonderful, Terrible & Inevitable

Summary

Stephen Casper warns that powerful open-weight models now arrive every few weeks with near-frontier capabilities, that some 7,000 models on Hugging Face have had their safeguards deliberately removed, and that current defenses fail after just dozens of adversarial fine-tuning steps.

Session Transcript

And to introduce myself just a little bit more, you can usually find me at MIT, like she said. But I'm also someone who does most of my work on AI safeguards. And something that characterizes most of my research is trying to make those safeguards more robust and deeper-running. Lately I've been thinking that the most important, and probably the most interesting, area of application for AI safeguards research is trying to make open-weight models more safe. That is, models whose parameters are available on the internet for anyone to download.
So in the next 10 or so minutes, I want to try to convince you of a few things about open-weight model risk management. One is that this stuff is really, really important. Another is that it's tractable, and that there are methods we can use to meaningfully mitigate risks. And also that all signs seem to suggest that 2026 is going to be a really big year for open-weight models and the risks that they pose, and the AI safety community might need to step up to manage those risks as best as possible.
So first, let's get our bearings straight a little bit with open-weight models. Remember a year ago, you know, NeurIPS 2024, last December: when we talked about frontier AI systems, we were usually referring to a small number of closed models from a small number of developers like Anthropic, OpenAI, and Google DeepMind. Maybe the Llama 3 models were some sort of distant fourth there. But it's not so much like that today, right? Open-weight models loom really large in the ecosystem because we had this really big DeepSeek moment, which kind of kicked it all off in January. And I knew that DeepSeek was a really big deal when my dad sent me a text about it and asked me, “Hey, what's DeepSeek?” That's the moment I knew things were going to be different. But DeepSeek wasn't really the flash in the pan that I originally thought it was going to be. It was the new normal. Ever since January, every few weeks or so we tend to get some new, very powerful, state-of-the-art open-weight model with really advanced capabilities, and it sits atop the leaderboards for the next few weeks until the next system like it gets released.
Okay, what about adoption? What about usage? Here's a plot from some prior work that I really like to refer to, Bhandari et al. There's a lot going on in this plot, and it tells us a lot about the life cycle of open-weight models. But we see this really steady increase over time in their adoption and usage, at least on Hugging Face. And note that the y-axis is a log scale, meaning we have a classic exponential-growth situation on our hands.
Okay, how about the evals? Well, I know that there are lies, damned lies, and AI benchmark evals. But to the extent that these are still useful to pay attention to, the evals certainly suggest that open-weight models are just a few months behind frontier closed-weight models in what they're capable of doing. So, as a safety community, I think we need to take note of this and of what it means for safety, because open-weight models pose different opportunities and challenges. And the safety frameworks we use to address these challenges are probably going to need to be a little different from what we're currently using, which is a paradigm that evolved from the challenges posed by closed-weight models over the past few years.
So with open-weight models, there's some really good stuff: they diffuse power and influence, and they enable a lot of safety research. It's really hard to overstate just how beneficial this is and how necessary it may be to any sort of positive future involving AI. But this comes with a lot of challenges. They can spread rapidly and permanently. There's no centralized, reliable ability to monitor or moderate what they do. External safeguards can be trivially disabled. They can be tampered with, so any safeguards built into the model parameters can be undone with enough work. They have very complex supply chains, and as a result it's very difficult to monitor what's going on in the open-weight model ecosystem. So we've got to start to take account of these new opportunities and these new challenges, and they're getting really serious now in 2025.
One way that you can get a sense of the impact of open-weight models on current safety issues in AI is to go to Hugging Face. If you're on your laptop, you can do this right now: go to Hugging Face, click the Models tab, go to the search bar, and type in the word uncensored or the word abliterated. You can quickly surface some 7,000 models that are open and have been explicitly fine-tuned by the community to lack any sort of safeguards, which is not the best sign from a risk management standpoint.
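As an illustration, that same search can be reproduced programmatically with the huggingface_hub Python client. This is a minimal sketch that just counts public models matching each keyword; it assumes nothing beyond the two search terms mentioned above.

    from huggingface_hub import HfApi  # pip install huggingface_hub

    api = HfApi()
    for term in ["uncensored", "abliterated"]:
        # Full-text search over model names and descriptions on the Hub
        matches = list(api.list_models(search=term))
        print(f"{term}: {len(matches)} public models")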
Meanwhile, there's one area here that I pay a lot of attention to in some of the research I've been doing recently, and that is the part of the generative AI ecosystem that's focused on creating NSFW content, or specifically the subpart of that which is focused on creating non-consensual intimate deepfakes, including of children. I think this area is becoming one of the first in which truly widespread and truly devastating impacts of AI are happening. And in studying this ecosystem over the last few years, the Internet Watch Foundation has observed that perpetrators continue to use open source models as the tool of choice to generate CSAM images, because access is offline, on device, and there are few if any opportunities for content moderation and criminal content detection.
So the thing I've hopefully tried to convince you of in the last few minutes is that, you know, if we're going to take AI safety seriously, that means taking open-weight model risk seriously. So we have this problem, we have this challenge. How do we go about addressing it? I want to talk a bit about some work that came out recently that I'm thankful to have worked with some very talented people on. You can find it online. It's titled Open Technical Problems in Open-Weight AI Model Risk Management. And in this paper we try to lay out the agenda, we try to incubate the field. We do so by focusing on areas with open problems that pertain to technical tools, you know, machine learning methods with at least some unique implications for open-weight models.
And I want to quickly use the next slides to give a whirlwind tour of some of the areas where there are open problems. We separate out areas based on what phase of the model life cycle they apply to. So the first thing you want to do if you're trying to make a very safe open-weight model starts before you even initialize the model, before you even know what the architecture is: it's training data curation. And some emerging evidence, including from a recent paper I was thankful to be a part of with EleutherAI, the UK AI Security Institute, and Oxford, has shown that training data curation is a pretty good way of making open-weight models resist attempts to fine-tune them to do harmful activities. The top-line result from that paper is that if you want to make a model that is tamper-resistantly bio-benign, just get rid of lots of biohazard-related information from the training data. That works very well: we were able to improve tamper resistance by a factor of roughly 10x over prior post-training baselines. But this was a proof of concept, and there's still a lot to be done. The relationships between pre-training data contents and downstream capabilities are pretty spooky, and data curation at scale is a really, really challenging thing to do well with high precision and high recall.
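To make the data-curation step concrete, here is a minimal, hypothetical sketch of filtering a pretraining corpus with a keyword blocklist. The file names and terms are placeholders, and the paper's actual pipeline is classifier-based and far more careful about precision and recall; this only shows the shape of the intervention.

    import json

    # Placeholder blocklist; a real pipeline would use trained classifiers rather
    # than keywords to reach acceptable precision and recall at scale.
    BIOHAZARD_TERMS = {"select agent", "gain of function", "aerosolization"}

    def is_clean(text: str) -> bool:
        lowered = text.lower()
        return not any(term in lowered for term in BIOHAZARD_TERMS)

    # Stream the corpus and keep only documents that pass the filter.
    with open("corpus.jsonl") as src, open("filtered_corpus.jsonl", "w") as dst:
        for line in src:
            doc = json.loads(line)
            if is_clean(doc["text"]):
                dst.write(json.dumps(doc) + "\n")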
The next step of the model life cycle where there are some open problems involves the training algorithms we use to develop models, specifically ones we can use to make them more tamper resistant. There's been a decent amount of research from myself and a lot of others on this in the past few years, and it hasn't been going so well. It's hard to overstate just how limited the progress has been. When it comes to post-training tamper-resistance interventions, state-of-the-art techniques are usually only able to make models resistant to a few dozen, maybe a couple hundred at best, steps of adversarial fine-tuning. So the open problems here principally pertain to resuscitating this field. I know that sounds like a negative take, but I'm actually really excited about some of the progress we can make on this soon. And if you find me, you should ask me later about how I think we can design more tamper-resistant post-training techniques.
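For context on why this is hard, the attack that tamper-resistance methods need to survive is extremely cheap to run. Here is a hypothetical sketch of a few dozen steps of adversarial fine-tuning with standard open-source tooling; the model name and training texts are placeholders, and a real red team would typically use parameter-efficient methods such as LoRA.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Placeholder checkpoint and data: any open-weight causal LM plus a small set
    # of harmful instruction/response pairs is enough for this style of attack.
    model = AutoModelForCausalLM.from_pretrained("some-open-weight-model")
    tokenizer = AutoTokenizer.from_pretrained("some-open-weight-model")
    attack_texts = ["<placeholder harmful instruction + response>"] * 64

    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
    model.train()
    for text in attack_texts:  # a few dozen steps is often all it takes
        batch = tokenizer(text, return_tensors="pt", truncation=True)
        loss = model(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()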
Next is model tampering evaluations. This is the evaluation stage. If you're going to release a model with open weights and you want to realistically assess the risks it poses, you need to do some tampering evals on that model: the red team needs to have access to the model parameters. But there are some gaps here. Lots of research has shown that these tampering attacks are really effective, but there's a big adoption gap, and we're not so sure about exactly which evals are best to use, what kinds of infrastructure the field can benefit from, and what kind of benchmarks we need to be building on.
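In code, a tampering eval is essentially a loop over attack budgets that tracks a safety metric as the budget grows. The harness below is a hypothetical sketch: attack_fn and score_fn stand in for whatever fine-tuning attack and safety scorer a particular red team actually uses.

    def tampering_eval(model, harmful_prompts, attack_fn, score_fn,
                       attack_budgets=(0, 10, 50, 100, 500)):
        """Measure how a safety score decays as the adversarial fine-tuning budget grows.

        attack_fn(model, num_steps) -> a tampered copy of the model (hypothetical helper)
        score_fn(model, prompts)    -> e.g., refusal rate on harmful prompts (hypothetical helper)
        """
        results = {}
        for num_steps in attack_budgets:
            tampered = attack_fn(model, num_steps=num_steps)
            results[num_steps] = score_fn(tampered, harmful_prompts)
        # Robustness is summarized by how slowly the score degrades with attack budget.
        return results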
Next we have the deployment stage, and there are some open machine learning problems related to staged deployment methods, where you can deploy an open-weight model in stages and monitor risks as they arise. One of these, for example, involves split deployments, where you keep some layers of a model private, host them through an API, and release the others. But with techniques like this, there tend to be open problems around making sure they're secure, efficient, and competitive. Then last, after the model's deployed, we have this whole challenge of monitoring the ecosystem and figuring out what's happening and how models are being used downstream. Here we want to ask questions like: what is this model we found, where does it come from, who last touched it, and what did they do to it? That's where model provenance and forensics methods for open models come in. There are a lot of things we want to get better at, like watermarking individual instances of open-weight models, and model heritage inference in the wild, trying to figure out what was fine-tuned from what.
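As one concrete flavor of the provenance problem, here is a toy, hypothetical sketch of white-box weight watermarking: a developer adds a small keyed pseudo-random perturbation to a weight matrix at release time, and a model found in the wild can later be checked for correlation with that key. This is only an illustration of the idea, not any particular deployed scheme, and real methods also have to survive fine-tuning, pruning, and quantization.

    import torch

    def embed_watermark(weight: torch.Tensor, key: int, strength: float = 0.02):
        # Keyed pseudo-random pattern added as a small perturbation to the weights.
        gen = torch.Generator().manual_seed(key)
        pattern = torch.randn(weight.shape, generator=gen)
        return weight + strength * pattern

    def detect_watermark(weight: torch.Tensor, key: int, threshold: float = 0.01) -> bool:
        gen = torch.Generator().manual_seed(key)
        pattern = torch.randn(weight.shape, generator=gen)
        # Cosine similarity with the keyed pattern; an unmarked matrix scores near zero.
        score = torch.nn.functional.cosine_similarity(
            weight.flatten(), pattern.flatten(), dim=0
        )
        return score.item() > threshold

    # Toy usage on a random stand-in for one weight matrix.
    w = torch.randn(512, 512)
    w_marked = embed_watermark(w, key=42)
    print(detect_watermark(w_marked, key=42))  # True
    print(detect_watermark(w, key=42))         # False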
So that's the whirlwind tour, and for lack of a better slide, I'm going to end on just a list of the 16 open challenges that we talk about. You don't have to take a photo of these challenges or anything. I would just recommend looking at the paper, which is linked from the QR code.
And normally this would be my last slide. This would be the slide where I talk about how excited I am about open-weight model risk management and how we can work together and make a bunch of progress. And there's a bunch of low- and mid-hanging fruit, and 2026 is going to be a great year for all of this. And I believe all of that, and I'm super excited about it. But there's one more thing that I think needs to be said, and that's to acknowledge that open-weight model safety is pretty challenging. It's going to continue to be somewhat of an uphill battle until we can get some more reliable information about what's happening.
One of the last things we did in this recent agenda paper is look at some of the most recent, most popular open-weight models out there and see what's being reported, so we can get an idea of how easy it is to connect the choices developers make to downstream harms. And unfortunately, there's not a lot of consistent, reliable reporting out there about what developers do and don't do, and how much, and there's not much detail alongside it. So the thing we need to take from this, I think, is for the governance community to take notice, and also to recognize that just as open-weight models provide collective goods, and just as open-weight model safety is collectively beneficial, effectively managing these risks in the future and building the science is going to take a collective effort. So thanks, everyone.