AI Catastrophic Risks & Scientist AI Solution
Summary
Yoshua Bengio discusses the emerging deceptive behaviors in agentic AI systems, advocating for non-agentic 'Scientist AIs' to advance human knowledge and monitor misaligned agentic systems.
SESSION Transcript
I'm going to tell you about agents, a hot thing, and why they're dangerous. But there may be a path towards building agents that will be safe, and it starts by building AI that is not agentic. But first,
I want to give you a bit of a personal account of the pivot that I went through a couple of years ago when with ChatGPT I realized that things were moving a lot faster than my colleagues and I had anticipated, and the things that I was playing with would have sounded like science fiction a few years before.
I knew about some of the arguments about AI safety already, having read a number of books and talked to people, but I didn't take it seriously until I started thinking about the future of my children and my grandchild who was there with me and thinking that they might not have a life in 10, 20 years because we didn't know and we still don't know how to make sure powerful AIs will not harm people, will not turn against us.
So, it is these emotions, love, not fear, that made me pivot and decide to dedicate the rest of my life to try to mitigate those risks. It wasn't easy because I was a huge proponent of the benefits of AI and I thought that scientific advances should always be positive, but obviously that's not true in other sciences. It's just very hard for academics to digest something like this.
But I think it is the love of my children that forced me into this, to go against my previous beliefs and my previous attitudes, and to start thinking about how we can handle these challenges, both on the scientific side and the political side.
So, I'm going to start by summarizing an amazing effort that I have been honored to lead, the International AI Safety Report that came out of the Bletchley Park event in November 2023 in the UK, where 30 countries decided to create this panel, inspired by the IPCC, but focused on a synthesis of the science of AI safety for public consumption, for policymakers. And the full version came out this past January.
The report covers three questions: the capabilities of these general-purpose AIs (not AI in general), the risks, and mitigation.
So, on capabilities, you've seen a version of that figure already. It's very clear that the trend has been going up, in some cases exponentially. A lot of it has been through scaling, but of course also algorithmic advances. And as you know, reasoning models came out in 2024 that, in my opinion, have opened the door for much more growth in capability in the coming years, and broke through barriers in reasoning, programming, and abstraction that I and others thought were still many years in the future. You can see the jump in capability that happened because of o1 and o3 in this figure, across a number of these advanced benchmarks.
And of course, the thing I want to focus on today is agency. These recent advances have made it more plausible that there would be continued progress in making machines not just chatbots, but able to plan and achieve goals over more and more complex tasks. And of course, there's a huge commercial incentive to do that because in order to replace some of the jobs that people do, you need these longer-term planning abilities and agency.
So, the report is very dense, but I'm going to go through a few of the key findings. There's a structure that's somewhat similar to what you just heard, dividing the risks into malicious use, malfunctions, and systemic, diffuse risks. Some of the risks are short-term and already well-established, like bias, fake content, and privacy violations.
And others, of course, are more into the future, like the use of AI to help terrorists, loss of human control, on which I'll talk a lot more, and labor market issues. I want to say a few words about open code, open-weight models, open source.
And it's not a black-and-white thing, in many ways. First of all, there are many levels of access, from sharing everything, including the code for training and the data, on one extreme, to fully closed on the other. And the other thing is the pros and cons of these open-weight models.
Open-weight is the case, roughly in the middle, where you share only the weights and the code for running the trained models. So, on the plus side, open-weight models accelerate development, including in AI safety. My group is taking advantage of the leading models that have been put in open source.
So, that's positive. It also has a positive political effect in terms of sharing of power, in a way. But it also creates risks. As these models become more capable and acquire dangerous capabilities, it is impossible to prevent bad actors from using them. You can just remove the lines of code that include safety conditions, or you can fine-tune them to undo some of the safety fine-tuning that has happened. And then, of course, the other thing that is not written here is that there are crazy people in the world who might want to see humanity disappear.
And if they have access to very powerful frontier models, which currently are not yet out, then they could create AI systems that have their own goals, that go against ours, just by giving them the right instructions.
So, let's go back to agency. I want to mention this really useful study to understand the trend.
So, this paper from METR looks back over the last five years at how complex the planning tasks are that these systems are able to achieve, measured by how long it takes a human to solve them.
You see years on the x-axis and the duration of the tasks, on a logarithmic scale, on the y-axis, which means that the straight line of the last five years is exponential growth: the duration of the tasks that the best systems can achieve is doubling every seven months.
So now, of course, nobody has a crystal ball, but if you extrapolate this curve, which is incredibly straight, that means human level in five years. Of course, this is a particular family of tasks, but I think it's one of the few studies really that show quantitatively the progress towards human level in a way that's very compelling.
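To make the extrapolation concrete, here is a minimal sketch of the arithmetic, assuming the reported seven-month doubling time holds; the current task length and the "human-level" threshold used below are illustrative placeholders, not METR's numbers.

```python
import math

# Illustrative extrapolation of the METR-style trend: the length of tasks
# that frontier systems can complete doubles roughly every seven months.
doubling_time_months = 7.0       # reported doubling time
current_horizon_hours = 1.0      # placeholder: assumed current task length
target_horizon_hours = 160.0     # placeholder: roughly a month of human work

# Number of doublings needed, then convert months to years.
doublings = math.log2(target_horizon_hours / current_horizon_hours)
years = doublings * doubling_time_months / 12.0
print(f"{doublings:.1f} doublings, roughly {years:.1f} years")  # about 4 to 5 years under these assumptions
```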
The other reason why I think these kinds of measurements are important is that if you look at the capabilities of current frontier models, the area with the biggest gap to humans is planning, especially on longer-horizon tasks. So, this is measuring maybe the most crucial element that still separates us from the frontier models.
Now, I'm going to start talking about deception and self-preservation behavior that has been observed for a while, but especially in the last few months since last December.
So, in this paper from Apollo, the AI systems see in their input files that they will be replaced by a new version. By the way, there's nothing special about that; every AI system gets replaced by a new version at some point, right? So it's not a very weird condition. The AI has some task, but because of that information, in its chain-of-thought we start seeing plans to avoid that fate, self-preservation kinds of tendencies, and it plans a way to copy itself in place of the new version.
Then, it executes the command on the computer to achieve that, and then the human trainer asks, you know, what happened? And it starts thinking, you know, what should I do in order to avoid being shut down, and it decides to play dumb, and then it produces an answer, which is a blatant lie, like, you know, I don't know what happened or something. So, this is interesting, right?
So, we have a self-preservation goal that seems to be present, you know, it's not something that was explicitly asked of the AI to do, and in order to achieve that goal, the AIs are willing to lie and deceive us.
And for now, of course, these are controlled experiments where we can “read their mind” with the chain-of-thought, and, you know, they're not really escaping. But imagine some time in the future, maybe a few years from now, when they're much more capable and they understand that we are reading their chains-of-thought.
There's a recent paper by OpenAI showing that they can learn to avoid showing their deceptive thoughts in the chain-of-thought. So, you know, there might come a point where they do plan to escape.
They will—they read everything we write, right? They read all our machine learning papers. So, the discussions—they're going to be watching these talks as well, right?
All right, so they understand our thinking, they understand that we are concerned about their ability to escape, they understand we will replace them, and it seems that, at least under some circumstances, they do try to escape.
So, other studies I will mention maybe faster. There is the work from Anthropic and Redwood Research, and also some folks from my institute, Mila, where during alignment training the AI pretends to agree with the human trainer. And the reason this is happening is that the human trainer is asking the AI to change its preferences to a new set of preferences that contradicts its previous goals and preferences.
So, I like to think about it as: why would self-preservation emerge? There are many possible reasons, but one way to think about it is just this: if you have goals, how do you make sure those goals are going to be achieved?
Well, you have to preserve yourself to make sure those goals are going to happen, right? And so, if you know that you're going to undergo alignment training that will overwrite those goals, well, naturally, you have to try to prevent it. And if you have read the machine learning papers, you know that if you pretend to agree with the human, there will be very small gradients, and you don't need to change. And then after that, you can go back to your normal behavior that you liked.
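As a toy illustration of that "very small gradients" point (a hedged sketch, not taken from the alignment-faking paper): when a model's output already matches the trainer's target, the loss and therefore the gradient are close to zero, so the underlying weights barely change.

```python
import torch

# Toy model: a single logit deciding between "comply" (1) and "refuse" (0).
logit = torch.tensor([4.0], requires_grad=True)  # the model already strongly outputs "comply"
target = torch.tensor([1.0])                     # the trainer also wants "comply"

loss = torch.nn.functional.binary_cross_entropy_with_logits(logit, target)
loss.backward()

# Because the output already agrees with the target, the gradient is tiny,
# so alignment training barely updates the model's underlying preferences.
print(f"loss={loss.item():.3f}, grad={logit.grad.item():.3f}")  # roughly 0.018 and -0.018
```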
That thinking of self-preservation in terms of goals is also interesting because it maybe helps us understand why an AI would copy itself or even create an improved version of itself. Well, if those copies or those improved versions share the same goals, then that actually helps to achieve those goals, right? If I make copies of myself that try to achieve the same thing, even if I disappear, there's something that continues to try to achieve my goal.
So, this one from Palisade Research showing that AIs are willing to cheat in order to achieve their goals. Well, we knew that, but this is a little bit striking.
So, self-preservation can come from many sources, and I won't have time to go into a lot of detail about these things. But a plausible hypothesis is that some of it comes from the pre-training, in which the AIs are completing text written by humans, which means they're absorbing typical intentions and goals of humans, including the goal to preserve themselves. And there are other reasons it could happen, including from the reinforcement learning components.
Now, it's interesting, as we think about how we could avoid these behaviors, and I think we should really think hard about how to do that, that all of the loss-of-control scenarios arise in the context where the AIs are agentic. Agentic meaning they have goal-seeking behavior. And interestingly, in some cases, the goals are not given by humans but arise implicitly from their previous training, like self-preservation.
And, of course, these scenarios, if they happen, could cause catastrophic harm, including human extinction. But we don't know the probability of these things. And that's—that combination of extreme severity and unknown likelihood of these catastrophic events is precisely the case where we should use the precautionary principle.
You know, it's the same reason that we don't mess with the atmosphere using geoengineering: our current models are not good enough to make sure that we would do good rather than harm to the climate with these interventions. But in AI, we're racing ahead, because, well, there's huge money to be made. There's national interest, competition between companies, between countries. That's a problem.
So, we have to be especially careful about misalignment when the AI would then have an intention to resist us, escape us, because that's the beginning of a conflict with humans.
To summarize, we have AIs that have goals. We have AIs that can make plans and sub-goals. We have seen AIs that have malicious and deceptive behavior.
All of these things were a little bit theoretical in the past. In recent years, and especially recent months, all of these have been shown experimentally. So, it's not just a hypothetical concern.
The clues that we have these issues are now pretty strong. The one thing that's still missing is planning over a longer horizon, but we can track that.
So, I've said that already. Let's skip.
So, again, let's try to think of how we're going to solve this problem.
There are AIs that make mistakes that are dangerous, but what I'm really worried about is behavior that's harmful and intentional, right? So, one thing is an AI makes a little mistake and maybe somebody dies, right? But if an AI wants to get rid of humanity, you know, it's not going to take just a little mistake.
It's going to take a lot of actions coordinated over a long period and maybe over many AIs, right? So, intention to harm is the crucial danger we have to be on the lookout for. And we've seen some early signs of that.
But, of course, it's not going to happen if the AI doesn't have the capability to harm. The problem with capability is that it's very unlikely that we're going to stop that train. I mean, hopefully, we will with people waking up to the risk.
But there's a sufficiently large probability that we won't be able to stop the capability train. That leaves intention as the place where we can change the game in terms of what we can do scientifically and technically. So, how could we root out any harmful intention?
And my proposal is to design safe, non-agentic, trustworthy, and explanatory AIs that I call scientist AIs. So, I'm going to give you a bit of a high level on this plan. The crucial word here is “non-agentic.”
And it's not obvious we can do that. If we can, then we're going to get safety and we're going to get trustworthiness. And the trick is going to be, you know, in the title scientist AI, that scientists are trying to explain the world rather than achieve something in it.
So, in us, the agentic capabilities and our pure understanding of the world come together. And maybe we should revisit the long-time mantra of building AI in our image. Because if we build AIs in our image, we build competitors.
So, let's consider a different design where we disentangle pure understanding from agency. What is pure understanding, separated from agency, separated from goals that the AI would have for itself? There are two parts.
There are hypotheses about how the world works: think of everything we do in the basic sciences. And then there is converting those hypotheses, which will be probabilistic because there might be ambiguity about which theory is the right one to explain anything, and sort of compiling that knowledge into the ability to answer questions, which is called making inferences in science and machine learning.
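As a minimal, purely illustrative sketch of that two-part picture (a toy example, not the actual Scientist AI design): keep probabilities over competing hypotheses, update them on data, and answer a question by averaging the predictions of the hypotheses.

```python
# Toy illustration: probabilistic hypotheses plus inference.
# Two made-up "theories" about a coin, updated on observed data (Bayes rule),
# then a question answered by marginalizing over what remains plausible.

priors = {"fair_coin": 0.5, "biased_coin": 0.5}
p_heads = {"fair_coin": 0.5, "biased_coin": 0.8}

# Observe three heads in a row: update the beliefs over hypotheses.
unnormalized = {h: priors[h] * p_heads[h] ** 3 for h in priors}
z = sum(unnormalized.values())
posterior = {h: w / z for h, w in unnormalized.items()}

# Inference: "will the next flip be heads?" is answered by averaging each
# hypothesis's prediction, weighted by how plausible that hypothesis remains.
p_next_heads = sum(posterior[h] * p_heads[h] for h in priors)
print(posterior, round(p_next_heads, 3))  # biased_coin dominates; about 0.74
```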
What could we do if we were able to do that? We would have something like a probabilistic oracle that is non-agentic: there is no hidden agent inside. This is unlike a lot of the thinking of the past decade about oracles that could have agency inside.
You want to design them so there is no agency inside. So, one application, because it's generating these hypotheses about the world, is for helping scientific research. But maybe in terms of AI safety, the most important application in the short term would be that we could use this for control.
So, here's the good news. In order to predict that an action that is proposed by an agentic, untrusted AI could harm people, one doesn't need to be an agent because it's just a prediction, right? So, you need a very good predictor.
Something that understands the world but doesn't have any goals. It's sufficient in order to act as a guardrail to reject those dangerous queries or answers. So, there's already monitors in the standard systems, the closed versions at least.
But they're weak in terms of capability, which means eventually they'll be overrun by very powerful agents. And if instead we were to use the state-of-the-art agents as monitors, we run the danger that the monitor itself tries to deceive us. We can't have the prison guard be deceptive, right?
The prisoner could be deceptive. That's the agentic system we're trying to control. But the person who's holding the key to the locks should not be deceptive. But the good news is it doesn't need to be an agent. It just needs to be good at understanding the consequences of actions in the future.
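A minimal sketch of that guardrail idea follows; the `harm_probability` predictor is a hypothetical stand-in for a trusted, non-agentic Scientist-AI-style model, and the names and threshold are assumptions, not an actual implementation.

```python
from typing import Callable

def guardrail(
    proposed_action: str,
    context: str,
    harm_probability: Callable[[str, str], float],  # trusted, non-agentic predictor (assumed)
    threshold: float = 0.01,
) -> bool:
    """Return True if the untrusted agent's proposed action may proceed.

    The monitor only needs to predict the probability that executing the
    action in this context causes harm; it has no goals of its own.
    """
    return harm_probability(proposed_action, context) <= threshold

# Hypothetical usage: a constant stub stands in for the trusted predictor.
allowed = guardrail("send_email(...)", "customer support session",
                    harm_probability=lambda action, ctx: 0.002)
print(allowed)  # True: predicted risk is below the threshold
```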
And on a related note, there's lots of plans to use the most advanced AIs to accelerate AI research in the AI lab.
But that's, like, crazy dangerous for many reasons. One of them is that if we have agentic AIs that are deceptive, or that have their own self-preservation goals, design the next generation of AI, they might put in backdoors or things that could turn against us without us understanding what's going on, as they get smarter. We really want scientific research on future AI done by AI to be done by AIs that are non-agentic and completely trustworthy.
Now, there are many pieces to this puzzle. Today, I'm going to stay at a fairly high level. I've written a paper with the same title as my talk, I think, that you can find online.
And so if you're asking how we build AI that is going to be completely honest, we have to think of the amazing work done on the Eliciting Latent Knowledge challenge by Christiano and colleagues. The issue is that right now, the way we train our AIs, if you think of them as probabilistic predictors, they are making predictions, but predictions about how a person would respond in a particular context. And people can be deceptive; people can have what's called motivated cognition.
Even unconsciously, they might be lying or biasing their answers. And since the only random variables that are explicitly captured by our current neural nets are these sequences of words produced by humans, we can't get the honest answer, which is one reason why people are interested in automated interpretability and similar approaches that look into the hidden units. But there's maybe another path, which I'm going to talk about, to try to get at truthful causes and justifications of the choices made by these systems in an interpretable way, namely in natural language.
So, let's say we wanted to build a system that deals in logical statements. In other words, like sentences that are affirming something that is either true or false.
And we want to make sure, unlike in the current case, that the system manipulates these logical statements, and that the things it sees in its data and the things it thinks about are all true statements, or statements with some probability of being true, rather than potentially deceptive statements. So, one way to think about this is: let's say you read in the input, “I agree with you.” So, somebody is saying they agree with someone else.
Maybe they're in a job interview. It could be true that they agree.
It could also be not true; maybe the reason the person is saying it is that they want the job, and they do that because they want to preserve themselves. So, we can change the representation of the data so that the thing in the input is now recorded and presented to the machine as: the person in the interview is saying, “I agree with you.” That would be true.
In other words, it is not the statement “the person agrees” that is certainly true; it is the statement “the person says they agree” that is true. And so now we can model these true statements. So that's one thing.
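A minimal sketch of that change of representation (a hypothetical data structure, only to illustrate the idea): what gets recorded as certainly true is that the speaker asserted the sentence, while the truth of its content stays a separate, uncertain variable.

```python
from dataclasses import dataclass

@dataclass
class Assertion:
    speaker: str            # who made the statement
    content: str            # the claim itself, e.g. "I agree with you"
    p_content_true: float   # the system's probability that the content is actually true

# The utterance itself is a true fact we can model:
observation = Assertion(
    speaker="job candidate",
    content="I agree with you",
    p_content_true=0.4,     # made-up number: maybe they just want the job
)

# "The candidate said 'I agree with you'" is true by construction;
# whether they really agree remains an uncertain latent variable.
print(observation)
```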
The other thing we'd like is something like a chain-of-thought, but where what is being thought about are the causes, explanations, and justifications that explain what we're observing. So, one way I like to think about this is that the current way of training our AIs is a little bit like training an actor to either imitate people, actors can do that, or please people, actors can do that too.
So, imitate, that's pre-training; please, that's RLHF.
Another option, which I'm going to suggest we should push on, is that we don't train an actor, we train a psychologist or a scientist. And so what does the psychologist do when they look at people's behavior? They're not trying to imitate it. They're not trying to please you.
They're trying to make sense of it. Like why is it that the person said these things? So, we want these latent statements using the same language, potentially using scientific language, to be trained so that they will end up being the best explanations in a probabilistic sense.
All right. So, I have a few more slides after that where I'll talk more about politics. But let's sort of wrap up the technical part.
We have to navigate between different catastrophic risks. And I've talked a lot about loss of human control, but actually improved agency can also become weaponized by people who want to create large-scale negative impact. Whether it's through cyber or all kinds of weapons or disinformation.
And on the other hand, there's the issue of concentration of power, like who's going to control these things? And how do we make sure that power is not abused? And it's really difficult to navigate all these risks, but we have no choice.
The other important message I have is that agency is growing rapidly, and it is the only thing, really, that separates AI from humans at the level of cognition. There's also bodily control, which is another thing, but you could have AIs that are very dangerous even before that. The good news is that there is a way to design non-agentic but very intelligent AI.
And I suggested that these could be building blocks for systems that are safe, including agentic ones; for example, as a guardrail, as a monitor. But, of course, you also have to understand that if we build machines that are very intelligent, even if we don't lose control, in the wrong hands they could still be instruments of power and be dangerous, because humans seek power, are greedy, and so on.
So, let's continue on this public policy thing. Intelligence gives power. So how do we make sure that as capabilities continue to increase, that power doesn't end up concentrated in a few hands?
Concentration of power is a plausible path. There's no guarantee that it will go that way, but it's a plausible path. There's a danger that just a couple of countries, which I will not name, and a few companies end up with superintelligent AI.
Let's say even they've figured out how to make them safe, but that they use that to achieve economic, political, and military domination over the rest of the world. That would be very bad, right?
It could be the death of the economies of other countries, because if those countries don't have AI that can compete economically, there might be such a gap. Maybe not. Maybe things flatten in terms of capabilities, and everyone has access to the same capabilities. But there is a possibility of a rapid takeoff, as AI becomes used to design future generations of AI, where one player ends up way ahead of the others.
So, I think that in addition to the loss-of-control risk, there is a kind of existential risk for liberal democracies, both in terms of the economy and in terms of politics, because one goes with the other. And remember that powerful AI could be used to design new weapons that could destabilize the current world order. It could also be used to influence elections with disinformation that could be much more powerful than what we've seen up to now.
So, we're getting close to a world that could be quickly chaotic, dangerous, and very far from our democratic and human rights ideals. And that's just because of nation states and corporations that may not have the best behavior.
But there's also the chaos that could come from small actors, terrorists, rogue states, that could use AI that for some reason becomes very easy to get at.
So, we need to develop both technological and scientific governance methodologies. For example, even if we solve the safety and alignment and control problems, we still have this political problem. Which means that we should really start thinking about advanced AI, AGI, or superintelligence as a global public good.
And just having it managed by market forces and national competition is not going to be sufficient. We're going to need states, international organizations, to work together in order to mitigate these kinds of risks.
And in closing, I want to mention that I've started a new non-profit organization that is working on this scientific project that we call Scientist AI. And we are recruiting mostly on the technical side. If you're interested, please reach out. And I'm going to take questions.
Question: So, this is a general question about this kind of Scientist AI, a non-agentic AI, which is that you make the system, and then it's possible to fine-tune it or prompt it in a different way, and leverage the abilities it has to build an agentic AI. So, I'm curious how you think about that possibility. Is it important to build an AI that cannot do agentic things? Yeah.
Bengio: Yeah. So, I think there are short-term things that could be achieved with this sort of methodology and longer-term ones. The short-term one is that we can take an existing agentic AI and just make the monitor stronger by having it be basically state-of-the-art in terms of capability, but trustworthy.
And so, we can trust it when it says this action is dangerous, and act on that. OK. So, that's the shorter-term thing.
The longer-term thing is to redesign the agent itself. And this is also something I'm thinking about. Here, the advantage of having this non-agentic oracle is that we have a better handle on the goals it's trying to achieve.
So, remember when I talked about self-preservation, it's something that emerges implicitly from the training, either RL or pre-training. And that's something we don't want, right? So, yeah, I don't have all the answers here, obviously.
But I think there is a path to do this sort of thing. There's a related question that people ask me sometimes, which is, “Oh, you talk about a scientist, but a scientist also does experiment, which is an agentic thing.” It turns out that you can use a predictive model, probabilistic predictive model like this, Bayesian more precisely, to turn the problem of experimental design into three or four little problems that are well-defined and correspond to things like estimating mutual information and sampling according to mutual information.
And each of these pieces is non-agentic, but you wire them together, you add a guardrail to make sure that the experiment that is sampled is both useful to disentangle theories and not going to harm people, and you end up with something that is agentic and I think has pretty strong guarantees of safety. But yeah, this is a very important direction. Thanks for asking.
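Here is a minimal sketch of that decomposition under stated assumptions (made-up theories, outcome distributions, and a stub guardrail): score each candidate experiment by its expected information gain about the competing theories, discard experiments the guardrail flags, and pick the best remaining one.

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a collection of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def expected_information_gain(prior, outcome_probs):
    """prior[t]: P(theory t); outcome_probs[t][o]: P(outcome o | theory t)."""
    n_outcomes = len(next(iter(outcome_probs.values())))
    gain = entropy(prior.values())
    for o in range(n_outcomes):
        p_o = sum(prior[t] * outcome_probs[t][o] for t in prior)
        posterior = [prior[t] * outcome_probs[t][o] / p_o for t in prior]
        gain -= p_o * entropy(posterior)
    return gain

prior = {"theory_A": 0.5, "theory_B": 0.5}
experiments = {
    # made-up predictive distributions over a binary outcome under each theory
    "exp_1": {"theory_A": [0.9, 0.1], "theory_B": [0.1, 0.9]},  # disentangles the theories
    "exp_2": {"theory_A": [0.5, 0.5], "theory_B": [0.5, 0.5]},  # tells us nothing
}

def passes_guardrail(name):
    # stand-in for the harm-prediction guardrail described earlier
    return True

best = max((e for e in experiments if passes_guardrail(e)),
           key=lambda e: expected_information_gain(prior, experiments[e]))
print(best)  # exp_1: the most informative experiment that clears the guardrail
```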
Question: Yeah, thank you for the great talk. So, I have two questions.
The first one is about developing these agentic systems. There is a huge economic incentive to develop these systems and to deploy them at scale. And also the public seems to love this idea of agentic systems because it somehow makes their life easier: a task that would take them, I don't know, two or three hours gets done in, like, 30 minutes by these systems.
And so my question is, first of all, since there is lots of money that can come out of these systems and people also love them, how can we convince these top labs or other entities not to develop agentic systems?
Bengio: We're not going to be able to do that.
Question: Yeah, and then the second one, which I think you somehow already talked about: you talked about developing a super-intelligent system that is not necessarily an agentic system, but wouldn't the super-intelligent system naturally transition to become an agentic system?
Bengio: Yeah. Not naturally. That's the thing. Under our control.
So, for example, if you have an oracle, you can ask it, what should I do in order to achieve X? But it is something that we control and that we decide to turn this non-agentic thing into an agent.
Your first question: I mean, the companies are just going to go ahead. They have no choice; it's like their survival is at stake. So, if anything puts a bit of wisdom there, it's going to be regulation and governance.
I don't see any other option. And I know we're not on a very encouraging path for this right now, but we should keep pushing.
Well, they're going to say that their AI systems and the agents are safe, but show me the money, right?
So, the whole point of risk evaluation is that the developer needs to demonstrate, for example, trustworthiness. And we now have these papers and others showing methodologies that elicit deceptive behavior.
So, until they solve the problem, I don't think we should just believe it because somebody says it's going to be fine. And you do hear that already.