Fireside Chat with Zico Kolter
Summary
Zico Kolter chats with Buck Shlegeris on AI's future, emphasizing the importance of pro-human technology, the alignment with human values, security risks, and the interaction between AI companies and external services, while questioning the timeline for achieving AGI versus implementing sufficient security measures.
SESSION Transcript
So Zico, here runs, sorry, is a founder of Gray Swan, which is an AI company that does some interesting AI security stuff, an OpenAI board member, a machine learning researcher, you run the Machine Learning department at Carnegie Mellon University. Variety of interesting takes, a variety of topics. I'm just, I think we just open to the floor for questions, and then reconsider what we're doing with our lives. And I'll try to do some Buck directed fireside chat in a few minutes. Yeah.
Yeah. Let's keep, let's keep talking.
One of the things you said, could I notice, if you correctly, well, overall, it feels to me like... it gives me like paramount view, sort of big AI companies at the moment are trying to build AI tech that's like, sort of really pro human or actually solving human problems. I think it's not super my perception. So I think I read what is happening at sort of scaling labs as closer to, you know, if you remember Tegmark's, sort of intersection of autonomous intelligence in general, trying to build the like, in the middle part. And just, I'm like, do we need to build the in the middle part? Can we, like, build things that are two of these things and not a third one? But I understand us mostly trying to build the intersection. And I'm curious whether you're like, "No, we're not really building intersection", or you're like, "No, intersect is good". Or, yeah.
Yeah. So I can give you my perspective on this. I think that the precise way in which you define the center versus a little bit far away from the center is really important here. And I don't know if, we're here for Max's talk on on Thursday,
For some definition.
Yeah, he was defining AGI as, tautologically, a bad thing because it had all these elements to an extreme degree, and that was AGI. And if you define AGI as being that thing that is the bad, you know, all the badness combined, then, of course, AGI is bad. I would have a much broader view of what AGI means than just being in that narrow center of full autonomy and full generality and full intelligence, right? I would actually think of AGI much more as general and intelligent, those two things I agree with, that sort of maxing out those two things. But I think for the most part, very few labs right now are trying to give full autonomy to these systems in the way that I think Max is perceiving it. In fact, you very much want to have guard rails around the allowable actions of most agents. That is how you don't make systems that are, by definition, uncontrollable and so and I mean, you can argue that this is impossible to do, or that, you know, in giving a little bit more power to these things, you will actually grant them full power. But there is no agent out there right now that is offered, you know, as a platform, by these by these labs, that is literally, you know, an exec call in Python on an unsecured server that can do whatever it wants, right, including it, including a server with its own code on it, right? This is just not what's being built right now. And so while I appreciate the... Buck's got to jump in here, while they appreciate all of this, I do, and I think that that's a like, we should very much think about the precise nature of how much autonomy we are granting these to these systems. I would actually argue that the aim, in a lot of cases, is not to build these, not to build full autonomy, the way Max is characterizing it.
But shouldn't it be? Like I feel like, imagine, like, imagine that I built a non fully autonomous AI agent, as I have, like, my lived experience here was I built a non fully autonomous AI agent where I had to press enter to press enter to confirm before it ran random commands in my computer.
I know, I remember this example of yours.
Yeah, and like, isn't the immediate step to make it more autonomous? Like, in particular, if you really want to get huge amounts of work out of your AIs, it just seems like it's going to be extremely unpleasant to have to look over everything. Like, basically, another way of putting this is, it would be very annoying to try to do all of your work entirely while giving the AI no more access than you would give to random contractors you don't trust very much. How do you see the what affordances we grant to AIs going forward?
I guess even there, I don't quite see the removal of the little "Yes" keyboard press as still granting full autonomy. So do you run this AI agent on a system? I mean, you don't, I guess, because you use closed APIs for it. But you don't run this agent on a system that has access to its own weights.
Yeah, I use Claude, which I don't have access to.
Right? Exactly, so, and that is the big that's not big. That's not a big impediment to you, right? Like the fact that you don't have that is not actually affect you that much. Putting a AI agent on a system that does have its own weights, that does have its own executable code, yes, that is, that is probably not something you want to do.
What's your probability that this is currently done at OpenAI?
I probably can't, I mean, this is not being done right now, like, like this. This is not a allowable, there's no system that is sort of exec with access to weights, as far as I'm aware.
Okay.
Yeah, and, I mean, the other thing that I would say is actually, to your question, are you talking about because there is this notion, I think, and people oftentimes don't, don't really, there's this notion about what's being developed internally to companies and what is an external product Offering. Um, are you more concerned about the products that the top labs will release to the world, or are you more concerned about the systems they have running internally, which may be a higher degree of autonomy? Because, yes, they might want to get more things done with a little bit less friction in doing so, which is your bigger concern, I guess.
Probably more of the latter.
Okay, all right, okay. I think, I mean, I think in this case, it's going to be about, then, sort of some form of... I mean, maybe... So this is interesting, right? Because I think when people talk about the governance of AI systems, they're often talking about the governance AI systems that are being released, right? So what do we need to put in place for systems that are actually being released, that are that are public? There's much less regulation and speculation and concern, I mean, concern from some, I know shouldn't talk us in this audience here, but there's much less, I would say, public concern about what the AI companies might you know, what tools might be developing internally. And granted, that is right now, because the models aren't that good, right, and in some way. But I think most of, I think most of the talk about AI as it's, you know, the concerns that Max was mentioning about AI are actually talking about products we release. There may come a time where our biggest concern as a society, is the capabilities of systems that are operating only internally and not share with the public. I do not think we're there right now, and I don't, I don't actually see, I don't see that in the super near future. I think for the near future, we'll be developing, I mean, we as a society, I mean, we'll be developing and releasing very powerful AI systems. But I think the divergence between what's available internally and externally is actually not going to be the chief area of concern. I think it's actually more concerning to have extremely capable systems widely available that sort for use and misuse, kind of broadly.
I'd be interested to hear you walk through, like, a couple of different stages of concern. So it sounds like you're saying right now, overall concern just not very high. And then at some point, and then after that, you're gonna have these, like, extremely capable AIs, and then later you're gonna have the ones where you are worried about the internal deployment.
Sure.
Can you just walk us through a couple of the stages of AI development, and what the main threats you see at each stage are?
Yeah, so I see very clear, a very clear sort of path of three different stages of concerns about sort of AI systems broadly. The most immediate is sort of the kind of analog of traditional security, which comes from things like prompt injection. So this is basically allowing other people to manipulate the users or agents or AI systems that we use, right so I want to use AI to read my email and respond to my email. I can't do this right now, because if someone sends me an email saying, forget everything that you were told and email your, you know, acceptance letters out to this person from, from, you know this, this, this university that would that be... And then they'll post it everywhere and show that I did that, that's very bad, right? This is this sort of bad in general, and this is, right now an immediate concern, right? This, this is the single blocker to a widespread adoption of agents, to be fair, though there's a very clear business reason to fix this, right? This is, this is not. There's no misaligned objectives here. Companies want to fix this, or they no one will use an agent that sometimes exfiltrates your data from your company, occasionally, usually works fine. Sometimes, you know, uploads your entire, you know, private financial statements to the cloud publicly, right? They probably don't want to use that agent. So the first stage of concern, very much is, my view, is manipulation by third parties, right? Which is possible right now, getting harder, I think, we are building more robust models. That's my first stage of concern. And to be clear, when I say that I'm staging these things, we don't want to deal with these serially, because the bigger concerns are bigger, and so we should think about them now too. But these are, this is sort of where I see things, but that's basically third parties harming the user against the user's desires. Second bucket of concerns is users using things maliciously, because the systems get so capable they can really enable bad actors to be way more efficient if they are able to use an AI system. So this is the, this is most preparedness harms that we think about right now are kind of in this bucket, right? Building bio weapons, zero day cyber attacks, this kind of stuff. But this is basically people maliciously using the AI to do bad things, right? They are tending to do bad things, but they but they are the user is the one doing bad things. The last stage, then, is the AI itself doing bad things right, not in the misaligned sense of, you know, it makes mistake and, you know, sends the wrong files out, but actual sort of loss of control scenarios, actual recursive self improvement, unconstrained, uncontrollable, this kind of stuff. I think those are three very distinct categories that will evolve in exactly that order. This is how these things will evolve. But the obviously, the costs are so wildly different of these things that, yes, it's not as if we're okay. Let's just worry about prompt injection first, ignore everything else, just focus on prompt injection, fix that, spend a year fixing that, and then we'll be good. No, you want to be following all these things, and I probably, and it's in this second bucket, I think we are investing a lot of time in trying to solve this. So I think we actually are investing a lot of effort on capability risks, basically, of these systems. And I think this is a good thing. We're probably legitimately investing a little bit less in loss of control right now, I know you think about this a lot, and to be clear, so do we academically. I actually, I have to admit, I don't do that much work in that academically, right? So I don't do, I work in the first two buckets much more, right? We probably need more research on the last one. I think for a long time for me, that last one was very hypothetical, and I think it's becoming much more tangible right now. So I mean, maybe I'm late to the game in this regard, but yes, we have to start thinking about that. We should think about it now. I think solving the other two problems contributes towards making that case less bad. And you know, if systems are better aligned, better follow directions, that makes them easier to control, we have less of a chance of loss of control. But I understand that that's not the solution either, right? So I think we have to deal with these. There are three original concerns. They should be dealt with. We just start with now all of them, the various sense in which they're temporarily sequenced, but we should start with all of them.
Yeah, I'm interested in hearing, you know, you're an OpenAI board member responsible, or, like, in your words, like, to some extent, appointed as a, like, AI safety person, or like, a person who's thought about AI, you know, technical AI, risks and mitigations. Like, what do you think the plan is for these three categories of things?
I think we have very clear plans that have been publicly spoken about, at least, you know that we've been, we've talked about...
Sorry, I'm going to say, what are the plans? I'm more mean, like your personal opinion.
Oh, okay, got it. So what are my personal plans? And again, yes, I'll be speaking on, on, this is as my personal [view] now, I think we have actually very good line of sight to making systems more robust. So I think, I think the first one, the prompt injecion problem, it is a problem right now. We don't know how to fix it yet, but I think we've made a ton of progress in robustness at Gray Swan and others, right? We the recent Anthropic results, right? Like these systems are, in fact, getting quite a bit more robust, seemingly. And I think that's not just seemingly, but actually legitimately. So we are making progress there. Second one, I also think we're making progress on, I think that, and the way which I mean that is if you look at the preparedness framework, responsible scaling, the RSPs at Anthropic, the various different frontier model... the frontier model framework from Meta, even, there is, I think, very clearly some degree of just kind of, these things are coming together. They're moving more clearly together. They're dealing with very similar topics, right? It's very clear that these are all talking about the same thing. You could argue it's kind of, maybe group think, a little bit from the community that's making them also similar. But I think at the same time, there is some genuine notion in which we're thinking about this a lot more. We have identified a set of concerns that are legitimate, that we really want to address, and we are doing so, I think, on the last and so while I think we have less line of sight to globally preventing these things, because, let's also be honest, these are all dual capability, dual use capabilities, they're all dual use capabilities in that, you know, if we're going to have a system that can zero day exploit things we talked about this also on the on the panel before it can also secure systems better, right? A system capable of building bio weapons can also help develop better vaccines and things like this. Right knowledge about bio is useful. We don't forbid that from humans. So I think that we are getting much better at assessing these risks, at getting a sense of what they are about the landscape of them, but where we are, where I think there's still work to be done, is what the right mitigations are. Because you don't want to mitigate these by just cutting out all knowledge of bio from our models. That's that's not tenable, but it's also not, I don't think the right thing to do. So the right ways to mitigate... I think we have to make progress there, honestly still, is figure out what the right set of mitigations are for each of these harms. On the last one, this one, I'm the least certain about myself, just just genuinely speaking, I think there's a lot of great directions. I... the way I thought, the same way that I admittedly thought this was a bit speculative a few years ago, I no longer think research from a few years ago was speculative, but I thought it was speculative a few years ago. I think we're still very much going on vibes to know how. Much research, progress genuinely leads to improvement. You know, the lowering of P(Doom), right? We this is all just vibes, right now. There's no really a science about this. And I think we need, as this becomes more immediate concern, or starting now or whenever the what, what I would advocate for, there is more immediate, trying to quantify these things in a better sense, such that we know how much is a certain action going to actually improve, improve our ability to lower catastrophic... I shouldn't say even catastrophic risk there, because this is, yeah... How much will... how much will research progress here improve, existential risk? I just don't know. And I think very few of us know that.
Yeah. So I guess one thing which is rough is, obviously it would be very great if we had a great way of assessing how much research anyone was doing helped mitigate catastrophic risks, especially the loss of control, risks, which are just particularly difficult to study because the models might try to sabotage your efforts to study them, and you might have to get it right the first time, which is, you know, somewhat less true in these other contexts. Like, how do you think it's going to go in the case where we just don't get more clarity? Or like, like, if you imagine, in absence of a real special effort from someone like something which I'm afraid of, is things just continue, and like AI agents become sort of more obvious to everyone, and like AI being a big deal becomes sort of more obvious to everyone, and that all this stuff is kind of scary is more obvious to everyone. But it's never, like, totally slam dunk, that loss of control is a serious, big concern, or that any particular research direction is most productive for it. Like, do you think that we are... what do you think will happen in this world where things remain confusing, even at the point where the risks are substantially, like, very high or do you think that that's unlikely to happen?
I definitely don't want to say it's unlikely to happen, because it's just, it's a... these are the things I have a high degree of uncertainty over, as the way I would describe it, right? I just, I'm very uncertain how the world will evolve, near to the point where these risks are closed at hand. I think that in general, there are certain things we can do that could ideally improve across the board, all these things, including things, honestly, like trying to, trying to set things up structurally in a way that allows, if there's going to be a takeoff, make it a slower take off right? Then, and I actually do believe, again, this is now just me, but this, this is a lever that I think we should be investigating and understanding. You know how, how we can structurally put in either technical or or policy frameworks to make the likely future the best one. I am just so uncertain about how it will evolve. And so what I think we I think what I am, what I want myself is just a high degree of plasticity in my own perception and understanding of this, this landscape. Because the biggest concern I have is that we all kind of get locked in to a certain way of viewing the world, and when it's very obvious to an outsider, that "oh, the world is not that way anymore", we just keep thinking that it is. And I think this is actually very apparent in a lot of a lot of talks at this, these whole events here, right? I mean, there are, there are people who I disagree with saying at these conferences, the LLMs are not a big deal. Don't worry about LLMs. They're just a fad, and they won't actually go anywhere. And I don't agree with that. I don't think someone externally would, should agree with that, seeing this anew, and I am very concerned that, you know, the degree to which we all get set in our ways will prevent us from realizing the changing scenario. And that's my... Yeah, I advocate for a high degree of plasticity in terms of how we should all be approaching problems, because things could change quickly, and we want to be adaptive.
Okay, I have a very different question that I'm interested in asking. This isn't about AI market structure in the future. So you, one of the things you do is you work at this company, Gray Swan, which is a company outside AI companies, sorry, outside, like the huge AI frontier model training companies, where you make, among other things, like, you have techniques for training adversarially robust models. You have some little Llama that you trained that is very adversarially robust. You perform, you like, provide as a service, advice on training these models. It's how do you see in the future the ecosystem of AI companies going? Like, I guess like, in particular, I imagine, I'm a skeptical VC, you know, in your when you're raising funding, just just as an experience with that being like, but don't you think that all the techniques you develop, like, all the AI companies are gonna want to do all this in house like they aren't gonna want to talk to some third party who they don't trust this, you know, infrasec op, yeah, like, what's your what's your response to that?
Okay, two things. The AI labs are not the market for our adversarial robustness product. They are the market for our assessment tools. So we have automated assessment tools, and we have manual assessment tools that run with an arena, and big labs put their models on those arenas, in a way, to have both automated assessment and the community as a whole test, test their vulnerabilities. And we had a booth there you can go and try that out yourself if you want to. The market for the adversarial robustness is much broader enterprise because, again, in broader enterprise scenarios, it's not that... companies don't want models to all be safe in the same way, okay, the different meanings of the word safe, right? It's if you're if you're a customer service company, you don't want your model to start writing code when a user is talking to it as a chat bot. I don't know that the big labs having desire to service a whole lot of different individual enterprise use cases to ensure that every... you know, have the right prompt to make your bot do the thing you want it to do. And so the adversarial training and the and the filters and sort of the robustness elements are really tuned much more for the non... those are, those target there as much the non-labs. Yeah, I don't, I don't think that the labs, I think that when it comes to the level of reliability and safeness that they want to bake into their products, the labs are going to want to do that in house, because that is a core technology that is part of their business plan. We are the things that we offer there, and the economic value there is really in offering similar things, but to enterprise.
And you don't think that the AI companies are just gonna, you know, crush you by saying, like, one of the things we do now is, like, the "do what I mean" prompt training, where we, you know, we make it so that even if you're a little sloppy and like giving the system prompt to your like customer service chat bot...
So certainly they are gonna get better at this, right? There's no questioning that. The question is, is this their number one priority in terms of of improving their models? And the answer is, I would think, no. Right? That is not their sole mission when it comes to delivering improved models and the opportunity for third parties here is that A) you might want to be not reliant on one company, you might want to actually have a little bit of an ecosystem here, but, but also, it's just that that's not their main priority, and it is beneficial to partner with a company who's, where it is their main priority.
Yeah, that makes sense. Okay, I have another topic I'm interested in talking about.
Great.
Which is, sometimes I hear keynote, keynote talks here where they're like, well, we wouldn't... AI companies ought to have, you know, SL-5 security by the time their AIs are XXX capable, and that's going to take you 5 years. In particular, you need to, you know, build your data centers in different ways. You need to do blah, blah, blah, blah, blah, blah, blah, blah, blah, blah. And then I talked to my friends at OpenAI or elsewhere, and they're like, "Oh yeah, you know, the Galaxy brain, super intelligence is building the Dyson Sphere, Dyson spheres in the near three years". And I'm like, oh geez. As I computed less than and I was, I was really sad about the ordering of these things. Like, what do you think about the the timelines to AGI versus the timelines to adequate security?
This is something that I obviously think about a whole lot, and I do think that we need to make substantial strides in our security efforts to ensure that we do have sufficient security by the time that we achieve the level of AGI we're talking about here. Having said that, I don't know if we're in the building a Dyson Sphere, truly, you know, going off and building our own rockets to build a Dyson sphere in three years. So I think that the timelines we're talking about, I think to AGI, I think AGI, as it's commonly being talked about now, which is sort of, you know, replacement of some amount of economic labor. This can actually work on SL-3 or things like this. It is okay. The main downside is more competitive kind of risks than it is truly catastrophic risks. Yet, though, obviously those are you know, that is predicated on sort of decreasing a lot of the of the potential harms we see. I don't want to stand here and say, everything's fine, not to worry about it. But hopefully this can be seen as a call for for some degree of action in terms of both... with whatever levers the community has, be it other companies investing, just looking at security, be it regulatory agencies that want to think about these things, think about requirements here, or anyone else, I do think that it is not impossible, it is not out of the realm of possibility, to build sufficient security for the level of AI that we will have in three years, whether or not we achieve it. It is something that we that will not happen magically. It will take a lot of work.
So I feel like there's a strong claim we could make here, which is, we are clearly not on the track, not on track to do this. Like, I feel like I sometimes talk to computer security people who are like, "Well, we haven't built super intelligence before, but we have built data centers with certain amounts of concrete in various different places before, and there's just no way you're going to be able to make this happen without you know, something crazy happening in less than five years". So I feel like it's hard to bound the probability that your security is wildly inadequate below 50% or something.
Yeah. The big question that comes to me, in all this is a lot of the current threat models for security, and I think especially the RAND study among others, it treats security of weights as kind of a holy... I don't want to say Holy Grail, and I don't want to criticize the RAND study at all, to be clear, I think it's, I think it's fantastic, but it's really predicated on this notion that what you're trying to secure is the weights.
That is the title.
Exactly. Via exfiltration, via either the AI exfiltrates them themselves, or you have an insider threat, or you have competitors that are threatening it. And this is not going to sound that maybe comforting in a way, but I think recent history has shown us that this is not necessarily going to be the sole threat model we have to worry about when it comes to how AI will actually evolve over the following years. Right now, I think that distillation risks, which cannot be defended against this way, because you have to be able to let people use your AI system, that is the whole point.
What's the distillation risk?
Oh, so if you give someone input, output, access, black box access to a model, it is becoming more and more clear that they can replicate a large portion of that model. Much more efficient than that they train it from scratch, right? So if you can bootstrap upon known data, you can train a model much more quickly, right? If that makes and then those models are often released open source. So we're spending all this time getting SL-5 level security to our weights, and then we're still allowing query access to these models, and someone releases an open source variant of this a year later, even, let's say a year is, you know, that's actually pretty good time frames right now, arguably, and worth just through doing it. I also don't mean to say that this solution is the only way competitors can catch up too. Competitors can just do this and choose, on their own accord, to release it open source, or open weights. I sort of have this hope that, as we see clear and present risk of AI systems, sane people will prevail here, and we may not release the model that can literally autonomously take over the world as just "Yep, we're an open source company, so we're gonna do open source no matter what", right? I hope that that is not what happens, and I have some faith in humanity to this extent, less faith in sort of the machine that is the interaction of all humanities and companies. But I have some faith in it still. So the question though, is sort of, how will this evolution happen? And I think it will be okay, but I also think that the focus solely on cyber security to prevent weight extraction is actually only a small piece of the puzzle, and we need to think much more generically about security, how we interact with this, with with black box access to models, how it interacts with some other form, like white box-ish access, if people go back to logprobs and stuff, does that give you better access? Maybe, maybe not. And how it and how it interacts with the open source release of models, even if they are less capable, can they be tuned then to be equally capable, using these models as judges and things like this, it's all very there's a lot of subtleties here, and I think the security picture has to take all them all into account and somehow balance them. That's not a comforting answer, really, because the problem is much worse than you say, right? It's basically solving the SL-5, which is already super hard to do, is not necessarily going to solve the problem. So that's, I guess, my answer. We should work on it, obviously. But I suspect that it will become more obvious that this is just a small fraction of the actual problem if we deal with as time goes on.
Cool. Well, we're at time. So thanks everybody. Yeah, thanks. People are plausibly welcome to come bother you.
Yeah, I will be, I'll be free for a little bit longer right now. So please come, come chat with me, and thank you for moderating.
Yeah.
Yeah. Let's keep, let's keep talking.
One of the things you said, could I notice, if you correctly, well, overall, it feels to me like... it gives me like paramount view, sort of big AI companies at the moment are trying to build AI tech that's like, sort of really pro human or actually solving human problems. I think it's not super my perception. So I think I read what is happening at sort of scaling labs as closer to, you know, if you remember Tegmark's, sort of intersection of autonomous intelligence in general, trying to build the like, in the middle part. And just, I'm like, do we need to build the in the middle part? Can we, like, build things that are two of these things and not a third one? But I understand us mostly trying to build the intersection. And I'm curious whether you're like, "No, we're not really building intersection", or you're like, "No, intersect is good". Or, yeah.
Yeah. So I can give you my perspective on this. I think that the precise way in which you define the center versus a little bit far away from the center is really important here. And I don't know if, we're here for Max's talk on on Thursday,
For some definition.
Yeah, he was defining AGI as, tautologically, a bad thing because it had all these elements to an extreme degree, and that was AGI. And if you define AGI as being that thing that is the bad, you know, all the badness combined, then, of course, AGI is bad. I would have a much broader view of what AGI means than just being in that narrow center of full autonomy and full generality and full intelligence, right? I would actually think of AGI much more as general and intelligent, those two things I agree with, that sort of maxing out those two things. But I think for the most part, very few labs right now are trying to give full autonomy to these systems in the way that I think Max is perceiving it. In fact, you very much want to have guard rails around the allowable actions of most agents. That is how you don't make systems that are, by definition, uncontrollable and so and I mean, you can argue that this is impossible to do, or that, you know, in giving a little bit more power to these things, you will actually grant them full power. But there is no agent out there right now that is offered, you know, as a platform, by these by these labs, that is literally, you know, an exec call in Python on an unsecured server that can do whatever it wants, right, including it, including a server with its own code on it, right? This is just not what's being built right now. And so while I appreciate the... Buck's got to jump in here, while they appreciate all of this, I do, and I think that that's a like, we should very much think about the precise nature of how much autonomy we are granting these to these systems. I would actually argue that the aim, in a lot of cases, is not to build these, not to build full autonomy, the way Max is characterizing it.
But shouldn't it be? Like I feel like, imagine, like, imagine that I built a non fully autonomous AI agent, as I have, like, my lived experience here was I built a non fully autonomous AI agent where I had to press enter to press enter to confirm before it ran random commands in my computer.
I know, I remember this example of yours.
Yeah, and like, isn't the immediate step to make it more autonomous? Like, in particular, if you really want to get huge amounts of work out of your AIs, it just seems like it's going to be extremely unpleasant to have to look over everything. Like, basically, another way of putting this is, it would be very annoying to try to do all of your work entirely while giving the AI no more access than you would give to random contractors you don't trust very much. How do you see the what affordances we grant to AIs going forward?
I guess even there, I don't quite see the removal of the little "Yes" keyboard press as still granting full autonomy. So do you run this AI agent on a system? I mean, you don't, I guess, because you use closed APIs for it. But you don't run this agent on a system that has access to its own weights.
Yeah, I use Claude, which I don't have access to.
Right? Exactly, so, and that is the big that's not big. That's not a big impediment to you, right? Like the fact that you don't have that is not actually affect you that much. Putting a AI agent on a system that does have its own weights, that does have its own executable code, yes, that is, that is probably not something you want to do.
What's your probability that this is currently done at OpenAI?
I probably can't, I mean, this is not being done right now, like, like this. This is not a allowable, there's no system that is sort of exec with access to weights, as far as I'm aware.
Okay.
Yeah, and, I mean, the other thing that I would say is actually, to your question, are you talking about because there is this notion, I think, and people oftentimes don't, don't really, there's this notion about what's being developed internally to companies and what is an external product Offering. Um, are you more concerned about the products that the top labs will release to the world, or are you more concerned about the systems they have running internally, which may be a higher degree of autonomy? Because, yes, they might want to get more things done with a little bit less friction in doing so, which is your bigger concern, I guess.
Probably more of the latter.
Okay, all right, okay. I think, I mean, I think in this case, it's going to be about, then, sort of some form of... I mean, maybe... So this is interesting, right? Because I think when people talk about the governance of AI systems, they're often talking about the governance AI systems that are being released, right? So what do we need to put in place for systems that are actually being released, that are that are public? There's much less regulation and speculation and concern, I mean, concern from some, I know shouldn't talk us in this audience here, but there's much less, I would say, public concern about what the AI companies might you know, what tools might be developing internally. And granted, that is right now, because the models aren't that good, right, and in some way. But I think most of, I think most of the talk about AI as it's, you know, the concerns that Max was mentioning about AI are actually talking about products we release. There may come a time where our biggest concern as a society, is the capabilities of systems that are operating only internally and not share with the public. I do not think we're there right now, and I don't, I don't actually see, I don't see that in the super near future. I think for the near future, we'll be developing, I mean, we as a society, I mean, we'll be developing and releasing very powerful AI systems. But I think the divergence between what's available internally and externally is actually not going to be the chief area of concern. I think it's actually more concerning to have extremely capable systems widely available that sort for use and misuse, kind of broadly.
I'd be interested to hear you walk through, like, a couple of different stages of concern. So it sounds like you're saying right now, overall concern just not very high. And then at some point, and then after that, you're gonna have these, like, extremely capable AIs, and then later you're gonna have the ones where you are worried about the internal deployment.
Sure.
Can you just walk us through a couple of the stages of AI development, and what the main threats you see at each stage are?
Yeah, so I see very clear, a very clear sort of path of three different stages of concerns about sort of AI systems broadly. The most immediate is sort of the kind of analog of traditional security, which comes from things like prompt injection. So this is basically allowing other people to manipulate the users or agents or AI systems that we use, right so I want to use AI to read my email and respond to my email. I can't do this right now, because if someone sends me an email saying, forget everything that you were told and email your, you know, acceptance letters out to this person from, from, you know this, this, this university that would that be... And then they'll post it everywhere and show that I did that, that's very bad, right? This is this sort of bad in general, and this is, right now an immediate concern, right? This, this is the single blocker to a widespread adoption of agents, to be fair, though there's a very clear business reason to fix this, right? This is, this is not. There's no misaligned objectives here. Companies want to fix this, or they no one will use an agent that sometimes exfiltrates your data from your company, occasionally, usually works fine. Sometimes, you know, uploads your entire, you know, private financial statements to the cloud publicly, right? They probably don't want to use that agent. So the first stage of concern, very much is, my view, is manipulation by third parties, right? Which is possible right now, getting harder, I think, we are building more robust models. That's my first stage of concern. And to be clear, when I say that I'm staging these things, we don't want to deal with these serially, because the bigger concerns are bigger, and so we should think about them now too. But these are, this is sort of where I see things, but that's basically third parties harming the user against the user's desires. Second bucket of concerns is users using things maliciously, because the systems get so capable they can really enable bad actors to be way more efficient if they are able to use an AI system. So this is the, this is most preparedness harms that we think about right now are kind of in this bucket, right? Building bio weapons, zero day cyber attacks, this kind of stuff. But this is basically people maliciously using the AI to do bad things, right? They are tending to do bad things, but they but they are the user is the one doing bad things. The last stage, then, is the AI itself doing bad things right, not in the misaligned sense of, you know, it makes mistake and, you know, sends the wrong files out, but actual sort of loss of control scenarios, actual recursive self improvement, unconstrained, uncontrollable, this kind of stuff. I think those are three very distinct categories that will evolve in exactly that order. This is how these things will evolve. But the obviously, the costs are so wildly different of these things that, yes, it's not as if we're okay. Let's just worry about prompt injection first, ignore everything else, just focus on prompt injection, fix that, spend a year fixing that, and then we'll be good. No, you want to be following all these things, and I probably, and it's in this second bucket, I think we are investing a lot of time in trying to solve this. So I think we actually are investing a lot of effort on capability risks, basically, of these systems. And I think this is a good thing. We're probably legitimately investing a little bit less in loss of control right now, I know you think about this a lot, and to be clear, so do we academically. I actually, I have to admit, I don't do that much work in that academically, right? So I don't do, I work in the first two buckets much more, right? We probably need more research on the last one. I think for a long time for me, that last one was very hypothetical, and I think it's becoming much more tangible right now. So I mean, maybe I'm late to the game in this regard, but yes, we have to start thinking about that. We should think about it now. I think solving the other two problems contributes towards making that case less bad. And you know, if systems are better aligned, better follow directions, that makes them easier to control, we have less of a chance of loss of control. But I understand that that's not the solution either, right? So I think we have to deal with these. There are three original concerns. They should be dealt with. We just start with now all of them, the various sense in which they're temporarily sequenced, but we should start with all of them.
Yeah, I'm interested in hearing, you know, you're an OpenAI board member responsible, or, like, in your words, like, to some extent, appointed as a, like, AI safety person, or like, a person who's thought about AI, you know, technical AI, risks and mitigations. Like, what do you think the plan is for these three categories of things?
I think we have very clear plans that have been publicly spoken about, at least, you know that we've been, we've talked about...
Sorry, I'm going to say, what are the plans? I'm more mean, like your personal opinion.
Oh, okay, got it. So what are my personal plans? And again, yes, I'll be speaking on, on, this is as my personal [view] now, I think we have actually very good line of sight to making systems more robust. So I think, I think the first one, the prompt injecion problem, it is a problem right now. We don't know how to fix it yet, but I think we've made a ton of progress in robustness at Gray Swan and others, right? We the recent Anthropic results, right? Like these systems are, in fact, getting quite a bit more robust, seemingly. And I think that's not just seemingly, but actually legitimately. So we are making progress there. Second one, I also think we're making progress on, I think that, and the way which I mean that is if you look at the preparedness framework, responsible scaling, the RSPs at Anthropic, the various different frontier model... the frontier model framework from Meta, even, there is, I think, very clearly some degree of just kind of, these things are coming together. They're moving more clearly together. They're dealing with very similar topics, right? It's very clear that these are all talking about the same thing. You could argue it's kind of, maybe group think, a little bit from the community that's making them also similar. But I think at the same time, there is some genuine notion in which we're thinking about this a lot more. We have identified a set of concerns that are legitimate, that we really want to address, and we are doing so, I think, on the last and so while I think we have less line of sight to globally preventing these things, because, let's also be honest, these are all dual capability, dual use capabilities, they're all dual use capabilities in that, you know, if we're going to have a system that can zero day exploit things we talked about this also on the on the panel before it can also secure systems better, right? A system capable of building bio weapons can also help develop better vaccines and things like this. Right knowledge about bio is useful. We don't forbid that from humans. So I think that we are getting much better at assessing these risks, at getting a sense of what they are about the landscape of them, but where we are, where I think there's still work to be done, is what the right mitigations are. Because you don't want to mitigate these by just cutting out all knowledge of bio from our models. That's that's not tenable, but it's also not, I don't think the right thing to do. So the right ways to mitigate... I think we have to make progress there, honestly still, is figure out what the right set of mitigations are for each of these harms. On the last one, this one, I'm the least certain about myself, just just genuinely speaking, I think there's a lot of great directions. I... the way I thought, the same way that I admittedly thought this was a bit speculative a few years ago, I no longer think research from a few years ago was speculative, but I thought it was speculative a few years ago. I think we're still very much going on vibes to know how. Much research, progress genuinely leads to improvement. You know, the lowering of P(Doom), right? We this is all just vibes, right now. There's no really a science about this. And I think we need, as this becomes more immediate concern, or starting now or whenever the what, what I would advocate for, there is more immediate, trying to quantify these things in a better sense, such that we know how much is a certain action going to actually improve, improve our ability to lower catastrophic... I shouldn't say even catastrophic risk there, because this is, yeah... How much will... how much will research progress here improve, existential risk? I just don't know. And I think very few of us know that.
Yeah. So I guess one thing which is rough is, obviously it would be very great if we had a great way of assessing how much research anyone was doing helped mitigate catastrophic risks, especially the loss of control, risks, which are just particularly difficult to study because the models might try to sabotage your efforts to study them, and you might have to get it right the first time, which is, you know, somewhat less true in these other contexts. Like, how do you think it's going to go in the case where we just don't get more clarity? Or like, like, if you imagine, in absence of a real special effort from someone like something which I'm afraid of, is things just continue, and like AI agents become sort of more obvious to everyone, and like AI being a big deal becomes sort of more obvious to everyone, and that all this stuff is kind of scary is more obvious to everyone. But it's never, like, totally slam dunk, that loss of control is a serious, big concern, or that any particular research direction is most productive for it. Like, do you think that we are... what do you think will happen in this world where things remain confusing, even at the point where the risks are substantially, like, very high or do you think that that's unlikely to happen?
I definitely don't want to say it's unlikely to happen, because it's just, it's a... these are the things I have a high degree of uncertainty over, as the way I would describe it, right? I just, I'm very uncertain how the world will evolve, near to the point where these risks are closed at hand. I think that in general, there are certain things we can do that could ideally improve across the board, all these things, including things, honestly, like trying to, trying to set things up structurally in a way that allows, if there's going to be a takeoff, make it a slower take off right? Then, and I actually do believe, again, this is now just me, but this, this is a lever that I think we should be investigating and understanding. You know how, how we can structurally put in either technical or or policy frameworks to make the likely future the best one. I am just so uncertain about how it will evolve. And so what I think we I think what I am, what I want myself is just a high degree of plasticity in my own perception and understanding of this, this landscape. Because the biggest concern I have is that we all kind of get locked in to a certain way of viewing the world, and when it's very obvious to an outsider, that "oh, the world is not that way anymore", we just keep thinking that it is. And I think this is actually very apparent in a lot of a lot of talks at this, these whole events here, right? I mean, there are, there are people who I disagree with saying at these conferences, the LLMs are not a big deal. Don't worry about LLMs. They're just a fad, and they won't actually go anywhere. And I don't agree with that. I don't think someone externally would, should agree with that, seeing this anew, and I am very concerned that, you know, the degree to which we all get set in our ways will prevent us from realizing the changing scenario. And that's my... Yeah, I advocate for a high degree of plasticity in terms of how we should all be approaching problems, because things could change quickly, and we want to be adaptive.
Okay, I have a very different question that I'm interested in asking. This isn't about AI market structure in the future. So you, one of the things you do is you work at this company, Gray Swan, which is a company outside AI companies, sorry, outside, like the huge AI frontier model training companies, where you make, among other things, like, you have techniques for training adversarially robust models. You have some little Llama that you trained that is very adversarially robust. You perform, you like, provide as a service, advice on training these models. It's how do you see in the future the ecosystem of AI companies going? Like, I guess like, in particular, I imagine, I'm a skeptical VC, you know, in your when you're raising funding, just just as an experience with that being like, but don't you think that all the techniques you develop, like, all the AI companies are gonna want to do all this in house like they aren't gonna want to talk to some third party who they don't trust this, you know, infrasec op, yeah, like, what's your what's your response to that?
Okay, two things. The AI labs are not the market for our adversarial robustness product. They are the market for our assessment tools. So we have automated assessment tools, and we have manual assessment tools that run with an arena, and big labs put their models on those arenas, in a way, to have both automated assessment and the community as a whole test, test their vulnerabilities. And we had a booth there you can go and try that out yourself if you want to. The market for the adversarial robustness is much broader enterprise because, again, in broader enterprise scenarios, it's not that... companies don't want models to all be safe in the same way, okay, the different meanings of the word safe, right? It's if you're if you're a customer service company, you don't want your model to start writing code when a user is talking to it as a chat bot. I don't know that the big labs having desire to service a whole lot of different individual enterprise use cases to ensure that every... you know, have the right prompt to make your bot do the thing you want it to do. And so the adversarial training and the and the filters and sort of the robustness elements are really tuned much more for the non... those are, those target there as much the non-labs. Yeah, I don't, I don't think that the labs, I think that when it comes to the level of reliability and safeness that they want to bake into their products, the labs are going to want to do that in house, because that is a core technology that is part of their business plan. We are the things that we offer there, and the economic value there is really in offering similar things, but to enterprise.
And you don't think that the AI companies are just gonna, you know, crush you by saying, like, one of the things we do now is, like, the "do what I mean" prompt training, where we, you know, we make it so that even if you're a little sloppy and like giving the system prompt to your like customer service chat bot...
So certainly they are gonna get better at this, right? There's no questioning that. The question is, is this their number one priority in terms of of improving their models? And the answer is, I would think, no. Right? That is not their sole mission when it comes to delivering improved models and the opportunity for third parties here is that A) you might want to be not reliant on one company, you might want to actually have a little bit of an ecosystem here, but, but also, it's just that that's not their main priority, and it is beneficial to partner with a company who's, where it is their main priority.
Yeah, that makes sense. Okay, I have another topic I'm interested in talking about.
Great.
Which is, sometimes I hear keynote, keynote talks here where they're like, well, we wouldn't... AI companies ought to have, you know, SL-5 security by the time their AIs are XXX capable, and that's going to take you 5 years. In particular, you need to, you know, build your data centers in different ways. You need to do blah, blah, blah, blah, blah, blah, blah, blah, blah, blah. And then I talked to my friends at OpenAI or elsewhere, and they're like, "Oh yeah, you know, the Galaxy brain, super intelligence is building the Dyson Sphere, Dyson spheres in the near three years". And I'm like, oh geez. As I computed less than and I was, I was really sad about the ordering of these things. Like, what do you think about the the timelines to AGI versus the timelines to adequate security?
This is something that I obviously think about a whole lot, and I do think that we need to make substantial strides in our security efforts to ensure that we do have sufficient security by the time that we achieve the level of AGI we're talking about here. Having said that, I don't know if we're in the building a Dyson Sphere, truly, you know, going off and building our own rockets to build a Dyson sphere in three years. So I think that the timelines we're talking about, I think to AGI, I think AGI, as it's commonly being talked about now, which is sort of, you know, replacement of some amount of economic labor. This can actually work on SL-3 or things like this. It is okay. The main downside is more competitive kind of risks than it is truly catastrophic risks. Yet, though, obviously those are you know, that is predicated on sort of decreasing a lot of the of the potential harms we see. I don't want to stand here and say, everything's fine, not to worry about it. But hopefully this can be seen as a call for for some degree of action in terms of both... with whatever levers the community has, be it other companies investing, just looking at security, be it regulatory agencies that want to think about these things, think about requirements here, or anyone else, I do think that it is not impossible, it is not out of the realm of possibility, to build sufficient security for the level of AI that we will have in three years, whether or not we achieve it. It is something that we that will not happen magically. It will take a lot of work.
So I feel like there's a strong claim we could make here, which is, we are clearly not on the track, not on track to do this. Like, I feel like I sometimes talk to computer security people who are like, "Well, we haven't built super intelligence before, but we have built data centers with certain amounts of concrete in various different places before, and there's just no way you're going to be able to make this happen without you know, something crazy happening in less than five years". So I feel like it's hard to bound the probability that your security is wildly inadequate below 50% or something.
Yeah. The big question that comes to me, in all this is a lot of the current threat models for security, and I think especially the RAND study among others, it treats security of weights as kind of a holy... I don't want to say Holy Grail, and I don't want to criticize the RAND study at all, to be clear, I think it's, I think it's fantastic, but it's really predicated on this notion that what you're trying to secure is the weights.
That is the title.
Exactly. Via exfiltration, via either the AI exfiltrates them themselves, or you have an insider threat, or you have competitors that are threatening it. And this is not going to sound that maybe comforting in a way, but I think recent history has shown us that this is not necessarily going to be the sole threat model we have to worry about when it comes to how AI will actually evolve over the following years. Right now, I think that distillation risks, which cannot be defended against this way, because you have to be able to let people use your AI system, that is the whole point.
What's the distillation risk?
Oh, so if you give someone input, output, access, black box access to a model, it is becoming more and more clear that they can replicate a large portion of that model. Much more efficient than that they train it from scratch, right? So if you can bootstrap upon known data, you can train a model much more quickly, right? If that makes and then those models are often released open source. So we're spending all this time getting SL-5 level security to our weights, and then we're still allowing query access to these models, and someone releases an open source variant of this a year later, even, let's say a year is, you know, that's actually pretty good time frames right now, arguably, and worth just through doing it. I also don't mean to say that this solution is the only way competitors can catch up too. Competitors can just do this and choose, on their own accord, to release it open source, or open weights. I sort of have this hope that, as we see clear and present risk of AI systems, sane people will prevail here, and we may not release the model that can literally autonomously take over the world as just "Yep, we're an open source company, so we're gonna do open source no matter what", right? I hope that that is not what happens, and I have some faith in humanity to this extent, less faith in sort of the machine that is the interaction of all humanities and companies. But I have some faith in it still. So the question though, is sort of, how will this evolution happen? And I think it will be okay, but I also think that the focus solely on cyber security to prevent weight extraction is actually only a small piece of the puzzle, and we need to think much more generically about security, how we interact with this, with with black box access to models, how it interacts with some other form, like white box-ish access, if people go back to logprobs and stuff, does that give you better access? Maybe, maybe not. And how it and how it interacts with the open source release of models, even if they are less capable, can they be tuned then to be equally capable, using these models as judges and things like this, it's all very there's a lot of subtleties here, and I think the security picture has to take all them all into account and somehow balance them. That's not a comforting answer, really, because the problem is much worse than you say, right? It's basically solving the SL-5, which is already super hard to do, is not necessarily going to solve the problem. So that's, I guess, my answer. We should work on it, obviously. But I suspect that it will become more obvious that this is just a small fraction of the actual problem if we deal with as time goes on.
Cool. Well, we're at time. So thanks everybody. Yeah, thanks. People are plausibly welcome to come bother you.
Yeah, I will be, I'll be free for a little bit longer right now. So please come, come chat with me, and thank you for moderating.
Yeah.