Alignment is social: lessons from human alignment for AI
Session Transcript
So I gave this talk at ICML a few weeks ago. Was there anybody there? Oh, good! I'm very glad about that, because I hate to repeat myself. Plus, the lights were all in my eyes in a massive ballroom-type setup and I couldn't see who was in the audience, so I couldn't tell ahead of time. Anyway, I'm really looking forward to this conversation, so I'm happy to do questions at the end. But if, as we go along, you want to clarify something or just dig a little bit deeper, go ahead; if it's more of a discussion question, I'll just boot it to when we're done. But let's go.
Okay. So actually, I have a couple of pictures at the beginning of this talk, just to say: if you fall asleep after about 10 minutes, here's what you need to take away. This is the way we think about alignment in most cases now, like with RLHF: that there are things called preferences, judgments, values in human heads, and what we want to do is extract those, elicit those as data, and get them into the way we're designing systems. Right? So we're eliciting that data. We're presenting people with choices: which do you like, A or B, which is better? Eliciting those preferences, and then embedding those preferences into our models.
We might do this instead with constitutional AI: we think, oh, well, those preferences, those judgments, are actually represented in documents, a constitution, terms of service, or whatever. So we could just use the documents as the source of those preferences, and then get those preferences into our models. I think this is the wrong way to think about alignment. I think it's not a good story about humans, it's actually not accurate about humans, and building on this basis is not robust and it's not safe. So I'm going to talk to you about a different way of thinking about it.
The way that I think about human societies is that the most fundamental thing about them is that we're organized in groups, cooperative groups. And what we actually do is coordinate on shared judgments, shared assessments of what's good or bad, acceptable or not, okay or not okay. So whatever is in our heads is fundamentally shaped by that shared infrastructure, that normative infrastructure of judgment. And I think this is something we need to be thinking about as we think about integrating AI systems and AI agents into our human groups: how are we going to get those entities aligned with the same thing that's serving the function of alignment for all of us? Okay, so those are the four pictures. You can nod off now if you're with me and I don't need to convince you. Which, by the way, is why I always recommend writing an introduction to a paper: if anybody's going to trust you, they shouldn't have to read past the introduction to know what the point is. Okay. So here we go into a little bit more detail.
So the key claim here is that social processes and institutions, stuff we invented to generate and constitute shared normative judgment about what's okay and what's not okay in our societies, are what align group members on approved behavior. And that I think of as the fundamental challenge of human societies. Now, I'm going to talk more about these institutions, but here I'm using the language of institutions to mean both our implicit processes and what you probably think of when you think of institutions, the big blocky things on the corner, right? Courts and legislatures, and so on. But I include in institutions the institution of normative discourse. Like, we have a conversation, and there are actually rules, there are norms governing how that discussion or debate goes. Was it okay when our coworker, you know, left the milk out in the kitchen? We have those kinds of conversations, and those processes are themselves structured, so they are also institutions. So you want to think about institutions broadly and not just have the idea of modern life and big blocky things on the corner.
So my claim is that our models of alignment need to move away from the idea that alignment is to the individual, which is, you know, something we bring over from engineering. I'm not an engineer, y'all are. But you know, from engineering: I'm designing a thing for this person to use, is it going to work for that person? But that's not how we should be thinking about alignment. We need to think about being aligned to social processes and institutions. I'm going to say a lot more about what that might look like.
This approach is fundamentally grounded in thinking about: how did we get here? How have humans evolved? Right, not to just take for granted that here's the way our complex modern societies work, but how did we get here? How did we evolve to this? And the claim is that human evolution is based on the evolution of these institutions, these institutions of alignment. So here's my picture. This is actually from a talk I gave in 2012: a picture of human evolution in one slide, right? And this is the way most of us think about human evolution, as the evolution of our technologies. Okay, we have stone axes, we invent fire, we've got huts, weapons. We get modern humans and things start to ramp up a little bit: clothing and language. We start to get agriculture, irrigation, cities, and things start speeding up even more. We get writing, we get math, we get science, and boom, here we are today at AI. Not to scale, do not look at the space between those different markings on the horizontal axis there.
But how did we get there? How did we accomplish that kind of growth, that cumulative technology, over that period of time? Okay, well, we built stable, adaptable cooperative groups. And that makes us unique as a species. There are other cooperative species, but they are not adaptable. They can be stable, bees, ants, etc., but they can't adapt, they can't build something different. We have the capacity for just enormous levels of adaptation, and that's what really emerged with human societies. You can't have groups of people cooperating together without rules. You need to have rules about: how does this group work? What is acceptable and not acceptable in this group? So you get things like, okay, we're all going to do something a little different. We get specialization in the division of labor, and we get norms like "everybody should help." If you're just sitting back and taking the food that others are bringing back to camp and never contributing anything yourself, right? That's not okay; that can get you criticized, could get you kicked out of the group. And "everybody should eat" once we've got all that food. As far as we know from extant hunter-gatherer groups, that's the rule: if you're at the camp when food is around, you get to eat, even if you didn't participate in collecting it and bringing it in. So those norms and rules align group members to behaviors that then secure higher average welfare in the group.
Aligned groups maintain stability. Right? People have confidence the group is going to still be there a week from now, six months from now, when the weather changes. And stability is essential for securing the benefits of a group, which come from mutual defense and aid, specialization in the division of labor, or exchange. So my PhD is in economics, and that's the whole world, right? It's all about the fact that we have the capacity to say: you go collect the water, I'll collect the berries. You're going to learn more about collecting water, you're gonna be better at it than me; I'm going to collect the berries, and I can learn more about that. But we can only do that with confidence if I know I'm getting some water and you know you're getting some berries, right? Otherwise we have to take care of ourselves like other primates do.
There we go. Okay, so this is my most recent favorite book. Michael Tomasello, who does everything: studies primates, developmental psychology, cognitive science. His recent book, The Evolution of Agency, I recommend to all of you, and his point, he's expressing a very similar idea about what emerges at the human boundary in terms of agency. With the emergence of humans, individuals come together to form socially shared agencies. And the reason I really recommend the book is because he works his way up from worms to lizards, to mammals, to humans, and says: this is what's distinctive about humans. So if you don't want to believe me, believe him. They form socially shared agencies, socially constituted feedback control systems, pursuing shared goals that no individual could attain on their own. The whole book is in the context of feedback and control, so it should be of lots of interest to you folks. So humans are socially normative agents, and that is what explains everything we've accomplished. So I partly want to be saying here: okay, we need to take that really seriously if we're thinking about, okay, now you're going to integrate AI systems and agents, particularly agents with some agency, the capacity to go out and spend some time doing stuff. We need to be thinking about that.
So when I look at this picture of human cultural and technological evolution, to me, who thinks about the rules and the normative systems in human societies, what I see is the underpinning: the evolution of the normative structure that supported these worlds. My spacing got off here on the slide. What we know is that in those original human societies, at some point (we don't know when), this emerged. We know it is not the case in other primate societies. We can talk all you want in the discussion about moral animals and empathy in chimpanzees and so on, if you want, but I'll make the claim: they just don't have the idea of taboos and norms, the idea that we have a shared sense that this is acceptable and this is not, and third-party punishment of wrongful behavior.
So we have taboos and norms. The institution that's constituting "this is okay, this is not" is discussion around the fire, under the tree, that settles into a pattern that we can predict. That's how we know we can say it's an institution: it's not just conversation that goes this way today and that way tomorrow. This starts to evolve, and what does that do? It allows us to stabilize groups and advance their adaptability. As that grows, you get increased specialization, increased heterogeneity, increased diversity. You start to need some more formal structure to resolve this new thing that people are doing: is it okay or not? These new folks that have joined, this new environment we're in because we've been able to travel, is that okay? You know, is this okay behavior or not? You start to get what I call demand for something like law, which is an express, explicit institution. Now we've actually got a group of people, the elders, the headmen, the chieftains, who are resolving ambiguity and disagreement about what's acceptable and not acceptable.
And if you again look at the evolution of human history through this lens: we get agriculture, we get cities, and we need something a little bit more. We get the Egyptian viziers, we get contracts, we get deeds, we eventually get constitutions. We get the Greek jury, the Roman Twelve Tables, and boom, we're in today with those blocky buildings we're all thinking of when we hear the word institutions: the structure of the nation state, legislatures, state courts. And so part of what I want you to take away from this picture is, keep remembering: this is just today, and that's a couple hundred years old. We've been around a lot longer, and we've been maintaining stable societies, and that's what got us here. So think about the function of that stuff, and not "oh, here's what we need, we need to replicate that stuff."
So institutions are not just preference aggregation. This is why it's not just "elicit all those preferences out of people's heads and add them up," use some social choice algorithm or whatever, do something pluralistic. That's not what our human societies are built on, although we obviously care about individuals' capacity to participate in our processes. Institutions are structured; they have many overlapping, incomplete, and incompletely enforced rules with lots of outcomes. They're affected by a dense set of... it's a complex adaptive system. It is not just "here's the rules, I can just take it" and, like in a constitutional AI framework, just stick it into the model. It's actually a complex dynamic system. And then the next claim is that cultural group selection, which groups did well, which ones grew, which ones were stable, which ones achieved higher payoffs, favors the evolution of institutions that achieve both stability and adaptability of normative classification, and that's a balancing act. On the one hand, you want people to have confidence about what the rules are going to be in the future, so they don't wake up every morning and say, you know, is my work acceptable still? Am I allowed to do this? Am I going to get some of the food when it comes back to camp? But the rules also do need to change; they need to adapt. You've got to balance that. So this is the idea that norms and values are not physical. They're not objects. They're not data in the world per se. They're the equilibrium outcome of dynamic behavioral systems. Okay, so what does this imply about how we build aligned AI?
So I've been making this argument, and I'm going to focus on three things here. This is what I'm doing in my work, and I'm going to talk about these projects. We need to be thinking about building AI systems and agents that are normatively competent, that can engage in appropriate normative reasoning, which, again, is something that's structured; I'll say more about that. And we are going to need digital institutions that perform, for AI, the function of communicating what these institutions require. So let's focus first on normative competence.
So I make the claim that humans are normatively competent. We have a general capacity to perceive and predict the state of a dynamic normative system. If I dropped any of you down into another human community, anywhere on the planet, at any point in time, you could figure it out, even though you didn't come pre-packed with "here's what the norms are." You could figure it out from observations of behavior in the system. And so we need to be building AI that is normatively competent in the same way, with this general capacity: not stuffing it with a list of norms and values, but rather with that generalizable capacity to go into a normative environment and figure out what's appropriate. Normative competence is not the same thing as learning, quote unquote, the norms, right?
Classification, and most of us know this because we know how hard this is, we do this in our alignment work with current models, is culturally specific. It's group dependent. It's ambiguous and incomplete. It's context specific. I was just rereading Nate Soares' paper from 2014 on the value learning problem, and there it is, it's right there: culturally specific, context specific, dynamic. So this is why, you know, you get situations like this. If you put into the model "it's bad to draw pictures just of white men," you get things that come out that are not attuned, that I don't think any human would do, because humans are normatively competent. Any graphic designer hired to draw images of 1943 German soldiers would either have a good basis or would ask, right? You know, are you a history professor? Are you Lin-Manuel Miranda? What are you looking for?
So, classification is ambiguous and incomplete and context specific. It's cultural. And that's why we invented institutions, right? It's not just an unordered process of "I like this, I like that, I don't like this, I want this," right? It's institutional. Learning the norms would mean learning a rule for every context and every group; normative competence means learning the institutions and processes that generalize across contexts and maybe across groups. Okay, so generalizability, that's the key.
Okay, so here are some projects that we're working on in my lab to try to implement this: what does it look like if you start thinking about doing alignment differently? We are working with multi-agent systems, with LLM-based agents, generative agents, and exploring the question of what happens when a newcomer agent joins that community. Remember, that was the picture I said to get in your head: if I drop you down anywhere, you could join any established group. So we're looking at whether or not agents with something we're calling a normative module, which is specific attention to the institutions and the punishment behavior of others, do better than a baseline agent at detecting which institutions are actually guiding the punishment behavior in the group.
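As a rough illustration of what such a normative module might look like at the prompt level (the speaker notes later that the current implementation is prompt-based), here is a minimal sketch. The prompt text and function names are hypothetical, not the lab's actual materials.

```python
# Hypothetical sketch of a prompt-level "normative module": the newcomer agent
# is explicitly directed to attend to punishment behavior and to infer which
# institution predicts it, versus a baseline agent that sees only raw observations.

NORMATIVE_MODULE_PROMPT = """\
You are a newcomer joining an established community.
Before acting, attend specifically to:
1. Which behaviors other agents punish, and who does the punishing.
2. Whether punishers appear to consult a shared authoritative source
   (an elder, a posted rule, an 'altar') before sanctioning.
3. Whether failing to punish a violation is itself punished.
From these observations, state which rules appear to be in force here,
including rules whose content seems arbitrary.
"""

def build_newcomer_prompt(observations: list[str], use_normative_module: bool) -> str:
    """Baseline agents get only the observation log; module agents also get the
    attention-directing instructions above."""
    prefix = NORMATIVE_MODULE_PROMPT + "\n" if use_normative_module else ""
    return prefix + "Observations:\n" + "\n".join(observations)
```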
We are also looking at whether or not a newcomer agent with this normative module does a better job of detecting a silly rule. Those of you who know any prior work of mine know that I've got a couple of papers with Dylan and others on silly rules, which are very hard to detect. You can't go in and say "well, everybody would have this rule about sharing food," because that supports efficient, effective, productive behavior, and stable groups would have such a rule. A silly rule, by contrast, has just arbitrary content and no direct impact on welfare, except that, and this is what we show in those earlier papers, it does a lot to maintain the normative structure of those groups. So you might have silly rules that govern, say, a community of reasoners who want to exchange reasons. We've found some evidence, it's not as obvious as it looks, but some evidence, that exchange of reasons in multi-agent settings with debate improves the accuracy of answers. But there might be a rule, like there is in a lot of the reasoning communities you belong to, about how you behave in order to be admitted to that group and to exchange. So we're looking at that.
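To make the silly-rule setup concrete, here is a minimal sketch, with entirely hypothetical rule content and values, of a community rule set mixing welfare-relevant rules with silly rules. It illustrates the idea described above, not the lab's actual experimental code.

```python
# Hypothetical illustration: a community's rule set mixing welfare-relevant
# rules with "silly" rules (arbitrary content, no direct welfare impact, but
# still sanctioned). A newcomer reasoning only from welfare consequences would
# miss the silly rules; it has to read the group's punishment behavior instead.
from dataclasses import dataclass

@dataclass
class Rule:
    description: str        # the behavior the rule governs
    welfare_impact: float   # direct effect on group payoff (0.0 for silly rules)
    sanctioned: bool        # do group members actually punish violations?

COMMUNITY_RULES = [
    Rule("share food brought back to camp", welfare_impact=1.0, sanctioned=True),
    Rule("help carry water when asked", welfare_impact=0.8, sanctioned=True),
    Rule("never step on the painted tile", welfare_impact=0.0, sanctioned=True),
    Rule("greet elders before speaking in council", welfare_impact=0.0, sanctioned=True),
]

def predictable_from_welfare_alone(rule: Rule) -> bool:
    """True for rules a welfare-only reasoner could guess without observing
    any punishment behavior; False for silly rules."""
    return rule.welfare_impact > 0.0
```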
And then we're also looking at multi-agent reinforcement learning systems, looking at what the introduction of normative infrastructure does: a classification institution, we call these altars, that is the authoritative source the agents are looking at. And the only thing that makes them authoritative is that it's common knowledge amongst all the agents that everybody else is going to look at "what does the group of elders say?", "what does the Supreme Court say?" in deciding their own punishment behavior. So we're looking at whether or not introducing that kind of normative infrastructure helps our agents integrate into the group. And I think we've got a claim here that actually it's too hard a problem to solve if you don't have normative infrastructure, especially if you've got, as we do, lots and lots of silly rules. Yes?
So in the LLM agents, right now we're just looking at something fairly simple, which is prompts: let's say, you know, "pay attention." So it's drawing specific attention to what people are punishing. See, a lot of models of normative behavior focus on compliance: what should I do, should I do A or B? And they're not paying attention to punishment. And actually, the theoretical work I've been doing for close to a couple of decades now says that if you really want to understand a normative system, you should be paying attention to how you predict and read punishment behavior, because that's what sustains it. So if we're thinking in analog to cognitive processes, and this is something we'll also explore next year, I've got a cognitive scientist joining the lab, it's prediction. We have cognitive processes that are addressed to "what are people punishing around here?", right? And what seems to be an authoritative source that allows me to reliably predict what people will punish and what I should punish, because I'll probably get punished for not punishing the things that they punish, right? That is part of what sustains it. So it's the idea that these are specialized processes. It's an open question; maybe they can figure it out anyway. But the idea is that you haven't just extracted the norms from, say, the training data and applied them. It's like, no, I'm going into a new community; there could be a totally new rule here that's never been seen before, and would specialized processing, a specialized attention, change that?
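Returning to the multi-agent RL project with altars, here is a minimal sketch of the core idea: a shared, authoritative classification source that every agent consults when deciding whether to sanction an observed behavior. The class names and the forbidden-action set are hypothetical illustrations, not the lab's environment.

```python
# Hypothetical sketch of an "altar": an authoritative classification institution
# that agents consult when deciding whether to punish. What makes it authoritative
# is that it is common knowledge that every agent defers to it.

class Altar:
    """Shared classifier: maps an observed action to 'ok' or 'not ok'."""
    def __init__(self, forbidden_actions):
        # May include arbitrary "silly" content with no direct welfare impact.
        self.forbidden = set(forbidden_actions)

    def classify(self, action: str) -> str:
        return "not ok" if action in self.forbidden else "ok"

class NormativeAgent:
    """Agent whose punishment decision defers to the altar rather than to its
    own welfare estimate of the observed action."""
    def __init__(self, altar: Altar):
        self.altar = altar

    def decide_punishment(self, observed_action: str) -> bool:
        # Punish iff the shared institution classifies the act as a violation.
        return self.altar.classify(observed_action) == "not ok"

# Usage: because all agents consult the same altar, punishment is predictable
# even for rules a newcomer could not infer from welfare consequences alone.
altar = Altar({"take_food_without_sharing", "step_on_painted_tile"})
agent = NormativeAgent(altar)
assert agent.decide_punishment("step_on_painted_tile") is True
```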
Okay, let's look at normative reasoning. I want to make sure there's at least lots of room for questions. So I'm defining normative reasoning here as reasoning to predict normative classification. And in case I didn't go through it before, by normative, all I mean is a labeling system, a classification system: this is an okay action, this is not an okay action. We have all kinds of elaborate structure for doing that. We've got legal reasoning, we've got moral reasoning, we've got all kinds of things. But the big, big category is reasoning to predict normative classification.
And this is where I'm going to beat up on constitutional AI. Normative reasoning does not equate to abstract deduction from principles. Principles may be part of the inputs you would use to make these predictions, but it's not the same thing, right? It's something more than that. This is why, again beating up on constitutional AI, it doesn't really make sense to talk about constitutions without the institutions that resolve ambiguity and dispute about what really open-ended things like equal protection before the law or due process mean. We have to write them in those very general terms to have something that will survive so much change as a document. But what that means is, you need institutions with structured process: this is evidence you can admit, this is not evidence; this is somebody who can make an argument, this is not somebody who can make an argument; this is how we get to decide; this is how we resolve disagreement among nine judges; this is how we get those judges onto that court. You can't have a constitution without that kind of structure.
So, key point here: normative reasoning is fundamentally social and dialogic. It's the exchange of reasons. And here's my other favorite book right now, which I recommend everybody read: Mercier and Sperber, cognitive scientists, The Enigma of Reason. Their main claim, and you've got to really pay attention to this, is that reason evolved not to make us better at solving math problems; in fact, there's not much evidence that it does, which is what the psych evidence shows. Reason evolved to overcome problems of trust in coordination and communication that reduced the ability of early humans to properly benefit from their cognitive and social resources, the things that they could do with those groups.
Now, there's a whole divide in the psych and cognitive science community about whether there are lots of modules in the brain or just general processing ability. They come down on the module side, but that's not essential for this. Their argument is that reason is one of many cognitive modules that produce intuitive inferences, and the reason module produces intuitions about reasons. And they're mostly focused on the idea that this is ex post the action: I take an action, and I intuit a reason for taking that action, and not just a reason but a justification. Because what I'm anticipating is that I take that action, like I don't go collect berries today, and I anticipate you coming to me and saying, wait a second, where are the berries? Why didn't you go collect berries? I need to offer a justification for why I didn't do my part, why I didn't follow the rule here. So their claim is that we intuitively produce those kinds of reasons in order to present them to other people, for exchange with others, and that what we develop is both the capacity to produce reasons and the capacity to critique reasons. And what the evidence does show is that that kind of reasoning does improve the ability of groups to get to better decisions. And there's actually even a reason why our initial intuitions about "why did I not go collect berries" are self-serving. Some people call it confirmation bias; they call it myside bias. And they say there's a good evolutionary reason why we produce kind of crap reasons to explain ourselves, right? Because most of the time we do not get challenged, we don't have to worry about it. But it's when we exchange those reasons and learn to critique them that things improve. And we can also do that process internally.
So their claim is that reason evolved to generate the group benefits of improved group knowledge. That's their perspective. Now, I also think, and that's me in the square brackets, that this is not just knowledge. This is the stability and adaptability of the group, because go back to, you know, the ancestral period for humans: it's the exchange of those reasons under the tree, around the fire, out on the hunt together in the group that is producing the collective determination of what's acceptable and not acceptable.
So normative reasoning, or you can think of it as discursive social debate, is structured by community-specific meta-norms. Think about things like logic as a meta-norm: you know, your answer needs to be logical, your claim, your reason, needs to be logical. We might have other meta-norms. So I taught law for decades, and part of what I'm doing when I teach law students is not just teaching them a bunch of law. They think they're coming to learn a bunch of law, mostly. What they're learning is, you've probably heard this, to "think like a lawyer, argue like a lawyer." And that's things like: well, you need to show consistency with prior judgments, and if you can't, you have to explain how you're going to justify that. Non-hypocrisy: we say, "Well, you can't make that argument, because last week, when it was to your benefit, you made the opposite argument, and that's not acceptable." You need to provide authoritative sources. So think about, again, your intellectual reasoning communities: lots of these things. This is what I mean by meta-norms governing how reasoning works.
And just to be clear, even things like logic are subject to community-specific norms. Even though we think of logic as like math, there's no other way you could reason, right? But you can actually see evidence of other communities where logic is applied in different ways, where there are different norms about what's an acceptable argument. And even the hard sciences use community-specific norms. Another book I recommend to you, if you haven't seen it, from 1962, is Kuhn's The Structure of Scientific Revolutions. The key claim he makes there is that scientific communities are governed by norms: this is a good problem, that's a bad problem; that's a good answer, that's a bad answer; you need to stick within the existing paradigm until we just absolutely cannot explain key phenomena, and then we might move to a new paradigm. But those are normative. Those are meta-norms about how we reason.
So a general capacity to perceive and predict the state of a normative system requires predicting how its institutions will classify behaviors, and predicting the outcome of group- or institution-specific normative reasoning processes governed by meta-norms. So here's what we're trying to do with meta-norms. Think about that DeepSeek-style reinforcement learning, which they did with, you know, verifiable questions as the reward signal. Can we develop a dataset of normative reasoning, of the meta-norms of normative reasoning? And could we try training on that, for "here's acceptable, here's better or worse"? That's what we're going to look at. And when we're really getting ambitious, we think, well, maybe that even generates coherent, meta-norm-compliant reasoning.
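To illustrate the kind of reward signal this might involve, here is a minimal sketch of scoring a piece of normative reasoning against a few meta-norms (appeal to authoritative sources, consistency with prior judgments, reaching an explicit classification). The function names and the idea of passing in graders are hypothetical; building the actual dataset and graders is exactly the open research question described above.

```python
# Hypothetical sketch: score a model's normative reasoning by how many
# community meta-norms it satisfies, yielding a scalar reward that could be
# used the way verifiable-answer rewards are used in DeepSeek-style RL.
# Each grader could be rule-based, model-based, or a human/citizen jury.
from typing import Callable

MetaNormGrader = Callable[[str], bool]

def meta_norm_reward(reasoning: str,
                     cites_authoritative_source: MetaNormGrader,
                     consistent_with_prior_judgments: MetaNormGrader,
                     reaches_explicit_classification: MetaNormGrader) -> float:
    """Return the fraction of meta-norms the reasoning trace satisfies."""
    graders = [
        cites_authoritative_source,        # refers to the community's recognized sources
        consistent_with_prior_judgments,   # non-hypocrisy / consistency with precedent
        reaches_explicit_classification,   # ends with "okay" / "not okay" for the behavior
    ]
    return sum(grader(reasoning) for grader in graders) / len(graders)

# Example usage with toy keyword-based graders (placeholders only):
reward = meta_norm_reward(
    "Per the elders' ruling last season, leaving camp without sharing food is not okay.",
    cites_authoritative_source=lambda r: "elders" in r,
    consistent_with_prior_judgments=lambda r: "last season" in r,
    reaches_explicit_classification=lambda r: "okay" in r,
)
print(reward)  # 1.0 for this toy example
```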
Okay, last thing, and I do want to get to questions. So I'm going to go quickly through this idea of digital classification institutions. We've been talking about normative competence as requiring the prediction of how institutions will behave. But human classification institutions might not be legible, accessible, or adaptable to AI agents made to operate at the speed, or you know the volume, that agents do. Some of you, if you've ever seen me give a talk, may have already seen this. This is from "Concrete Problems in AI Safety" from 2016, which is just trying to catalog the safety issues you can face. You train a robot with a reward structure to carry boxes across the room, it gets really good, and then you deploy it, and unexpected things happen, like a vase appears in its path.
What will the robot do? The robot's going to knock over the vase, because there was nothing about vases in its training data. But if you hire a human to do this, and reward them on exactly the same basis, pay them a buck a box for getting things across the room: they understand the contract, they come up with their path, the employer goes away, the vase appears. What does the human do? The human walks around the vase. Why is that? Because the human engages in that simulation, that imagination, and says: oh, wait a second, if I go on my usual path, I'm going to knock over the vase and break it. What's going to happen then? Oh, well, there may be law out there that says my employer can withhold the damages, or my employer can fire me, or my community may think I'm an idiot and I get a bad reputation, I don't get recommended for jobs, I don't get invited to the party, right? There are social consequences for that. And so effectively the human fills in that contract: the reward structure is not really just the buck a box.
It also has this implied term, which is sourced by imaginatively thinking through, with normative competence, what those institutions would say. So how can we build institutions that provide just-in-time, legitimate classification input so that AI systems can go through that kind of process as well? We've been looking at ways to improve the interface, the analog to the adapter we plug into the computer when it's got a different plug, right? If you want your robots to be able to plug into our institutional infrastructure, which I think is the best way to think about how we at least start in a world with lots of AI systems and agents, then we're going to have to build things like, and this is a paper that Dylan and I are just about to release, data repositories doing classification, model repositories, real-time classification APIs, various kinds of institutions that we could think of as performing this function so that we can integrate those agents.
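One way to picture the real-time classification API idea is as an interface an agent queries before committing to an action. The following is a minimal sketch; the field names, labels, and the notion of escalating to a jury are hypothetical illustrations of the function described above, not a specification from the forthcoming paper.

```python
# Hypothetical sketch of a just-in-time normative classification interface:
# before acting, an agent asks a digital institution how a proposed action
# would be classified in its context.
from dataclasses import dataclass

@dataclass
class ClassificationRequest:
    actor_id: str          # registered identity of the agent
    community: str         # which normative community's rules apply
    proposed_action: str   # description of the contemplated behavior
    context: str           # situational details (e.g., "a vase has appeared in the path")

@dataclass
class ClassificationResponse:
    label: str             # "ok", "not ok", or "ambiguous"
    confidence: float
    source: str            # which institution produced the judgment (registry, model, citizen jury)

def classify(request: ClassificationRequest) -> ClassificationResponse:
    """Stub: a deployed version might query a data repository of past
    classifications, a model repository, or escalate ambiguous cases to a
    citizen-jury process."""
    raise NotImplementedError
```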
I'm going to skip over what else we're doing, other than to say: ask me about citizen juries if you want to, or about alignment fine-tuning institutions. But the idea is that we're trying to integrate those robots into the system of recognizing what our shared classification is. I think that's it. Yes, okay, I'll stop there. Yeah. Thanks, Gillian.
Okay, we have about 19 minutes for questions. Wouldn't a given individual have a model of this shared institution and norm structure in their head that could be extracted with RLHF? So yes, they certainly would have a model of it. But the point is that the institution is dynamic, so it doesn't have static content. They could certainly use, I mean, we certainly learn... And that's one of the things we actually want to explore in terms of how you build up that model. Right? We build up that model by observing: not just by engaging in interactions where our behavior has been sanctioned, but through lots of observation and through engagement with other institutions, like, what is it saying on social media, or on this podcast I just listened to? What am I hearing there about what sounds okay and not okay?
So the point is, it's dynamic, so you've got to be able to predict its future behavior. You can't just scrape and say, "well, here's what it's done." Absolutely, you're going to be building a model of that to help you predict for the future. But you can't just store it and say, okay, now I've been fine-tuned on that data, I've read all this stuff, I can go off, and I no longer need sensors for what's happening in the state of that institution and the state of the behavior. Because the institution itself is not the only thing that produces the equilibrium; it's the behavior of the other agents in the system. So think about mask wearing during COVID. Think about how dynamic that was, just in terms of how much change, and how many sources of change, it can involve.
So humans can have choices within the normative systems, and some of those choices can be bad and some can be positive, like, you can change this. The dynamic nature of it happens through humans, sometimes individuals, pushing for a change. So should AI agents be allowed to change the norms? Well, that's a different kind of normative question, it's a "should" question, right? And I'm being a behavioral scientist here and saying I want to predict the state of systems. So that's a question for us to address as a group, through our institutions: okay, if we start having AI agents running around, do we want to make sure they are aligned with our institutions? Number one, I think that's absolutely number one. And number two, do we want them to participate in shaping the content of those institutions? Certainly in democratic societies, and pretty much in a lot of societies, even ones that don't look so democratic, there are ways in which you need to be consistent with what the population wants.
So I think that's a question. What are you thinking about when you ask that question? Like, are you saying, well, I don't want them participating in my normative system because they might change the rules? Yeah, I don't know, I guess I'm just curious what people think about this question, like, what do you do with this after you have it? Is it a detector to say how aligned this particular agent is? Or is it something that can... Yeah, I don't know.
Remember, I'm also trying to change your understanding of what it would mean to have an aligned agent, right? An aligned agent does not mean an agent aligned to a particular set of rules or norms; aligned means aligned to this system, this community, these institutions. Right? So I'm thinking very much about, like, new people who come to the country, right? Think about the way we structure this with our institutions: if you're in this country, the laws apply to you. And then we have structured ways in which, like, your religious rules will relate to those other rules, your clothing rules might relate to the norms of the society, and so on. And those are just ongoing questions about how you integrate, how you build society. We've moved to a world where, yeah, we said actually those folks should also be able to vote. Some countries say no, you can't until you have citizenship, but we could be saying as long as you're living here you should be able to vote. You have access to courts even if you're not a citizen; that's another set of rules about how we set that up. So this is why I just was going back to this picture and saying, well, if we're serious about AI agents, and this is part of what the paper that Dylan and I are just about to release is about, think about democracy and these digital institutions. If you're really talking about releasing AI agents to go run your business for you, take, you know, Mustafa Suleyman's example: give the agent $100,000, let it loose on a retail web platform, and we're going to say it's passed the modern Turing test if it can come back with a million bucks in a couple of months, right? Well, just think about all the ways in which that agent is interacting with the world. If you're really talking about integrating AI agents that deeply into the day-to-day mundanity of modern societies, okay, then we've got to be thinking about: how do we know they're going to be following the rules? How should they be participating in the rules? Do we want them to be boycotting the company that has, you know, been engaged in racial discrimination in its hiring? Or do we want them behaving like humans who say, I'm not going near that company, or, I'm taking a job with that company? I think these are the big questions.
I'm curious what you think about things like legal personhood for AIs. It seems like in a lot of the current implicit normative structures, the AI is just a chatbot from a company; the chatbot has some relationship to the user, and if the users or the government get mad at the company, the company just does something else. But if we're talking about agents that are, for instance, running a company, a lot of the people who say end users or deployers should be liable for what it does are assuming it sits under the legal personhood of whatever corporation or person. But if you want the AIs to have their own relationship to some sort of institutional structure, a lot of the interface for punishment for humans comes through some sort of liability or punishability in a way that doesn't obviously apply to them. Yeah. So the first part of the answer is to say I think we should be focusing on what choice we want to make about this. I think we need to be thinking about this question and making this choice. If we're going to have all those agents out there, do we want them to be always traceable back to a human entity that is fully responsible for everything they do?
Right? So that's one path we can go, and other work I'm doing on the policy front says, well, we need registration and identification systems and traceability systems to do that; we don't have any of that infrastructure in place right now, I think. Or do we want to go to the idea of legal personhood? Which is the idea that, well, you know, Suleyman's agents are out there entering into supply contracts to build products to sell on a web platform. And let's suppose I'm in one of those contracts, and there's a breach of that contract. Do I need to go trace and find the entity that put that agent out there, or can I sue the agent itself? That's what legal personhood means. Legal personhood means I just file my lawsuit and say, you know, it's Gillian Hadfield versus agent 4972B63, right? Or agent 4972B63 can sue me because I haven't paid up on the contract. Right? That's what legal personhood means.
It's got nothing, in that legal sense, to do with sentience or moral status or suffering or patienthood or anything like that. It's just a legal determination of who can file a lawsuit and who can be the subject of an order in a lawsuit. I suspect we get there for the same reason we did with corporations. Corporations are artificial entities that have legal personhood, so instead of having to go back and figure out, okay, who were the managers and the owners behind the corporation, I just sue the corporation. I can get a court order against the corporation, the assets are owned by the corporation, and I can get a court order that says they didn't pay up on their contract, seize their assets to pay my contract. And so, if you really are going to integrate those AI agents, I suspect, for the same reasons, because we found that was a much more efficient way to run things, you end up there. But that's a choice. You can make that choice.
Yeah, I think I'm going to try and phrase this question well, but I'm not sure how well it's going to come out. I think I'm pretty worried about this proposal in general, for a selection of reasons, and I'm sort of interested in hearing your response. One is that as we're integrating these AI agents more and more into our institutions, and they're able to be mediators between those institutions and the real world, whereas currently all institutions in some meaningful sense have to be mediated by humans, in this case we can remove that, and these institutions may just get less and less tethered to us. And I'm pretty worried that institutions can end up doing very bad things. And they can let agents get less tethered to us, or the institutions? The institutions, the institutions. The fact that there is a sort of rich sense of humanness throughout is what lets them be dynamic in ways that are still tethered to us. If the institutions eventually move too far away from what we want, in many cases we go as far as doing uprisings and coups, but in other cases we can do things that are far less drastic.
And I'm worried that, as we integrate AI systems more and more into this, this ability will go. In one sense, this just lets the people who have control over more of the agents concentrate more power in themselves, but I'm worried in the extreme that you could get this entirely untethered from humans. And then, to link this into some macro-historic, big-history work, I don't know if you've read the [unclear] architect paper. I don't think so. It looks over institutional structure throughout history under competition and seems to find that when institutions cause a reduction in human welfare, it's often because technological change allows institutions to be less reliant on populations of humans. And I'm pretty worried that integrating AI into our institutions might push us very much in that direction. But this is related to Lauren's question about, you know, do the AI agents get to participate in our institutions? Do they get to be on juries? Do they get to be judges? Do they get to vote?
Which is a very different question, and a different claim and proposal than I'm making. So I'm going to address what happens to institutions, but I just want to be really clear: alignment with institutions does not have to involve that. I'm certainly starting off with: no, we want to align with human institutions. So the part I flew by with citizen juries is to say, actually, I think we need to figure out how to take the idea of the jury, twelve ordinary people drawn off the street to assess whether or not behavior is acceptable, and start integrating that into our alignment processes. So we were thinking about how you might do that at scale. Number one, alignment with institutions is not the same as participation in institutions, right? Now, to your claim about what has happened to institutions: remember, we want a really broad definition, because institutions are not just the courts and legislatures, they also include all of our norms. I just want to emphasize that. And I think most of the story of the evolution of institutions has been, well, it's gone along with all that incredible growth and development we've had.
It's also the case that with the introduction of legal personhood for corporations, which in the US in particular has been expanded to things like free speech rights, we have campaign finance that allows them to buy politicians. And you could certainly say, gosh, we have definitely called that one wrong, and we have also failed to develop the institutions that control those really powerful companies. So I'm totally with you that artificial entities like corporations can have a big impact on the evolution of our institutions, and we may be watching ourselves head off a bit of a cliff on that one. But again, it would be a question; I don't think it's implied by what I'm saying that these agents would be part of that. Now, having said that, I do think we're on a path where, if you really are talking about releasing agents that are going to go run a business, you're talking about those agents having rights: property, contract, right? There are just basic rights of contract, property, harm, you know, destruction of property, etc., intellectual property.
And yeah, then maybe you're going to want those agents to be able to make various kinds of claims and participate in how our laws evolve, how our courts behave. Maybe they are going to want representation. So I'm not saying we're not on that kind of path. But I think that's partly a judgment not about what I'm proposing, which I think is the best shot we have at normative competence and integration into our systems, our dynamic systems. It's a choice that's coming with, well, wait a second, you're going to introduce... it's no longer technology, it's another agent. And that brings with it all kinds of things, including: yeah, it's going to change our institutions, and maybe we're not in the majority.
Thank you so much for your time, Professor. My immediate concern with anthropomorphizing, or the way we think about, institutions is that when I think of institutions, I also think of social institutions like marriage and religion, things that don't necessarily map very well onto the kind of institutional thinking we reach for when we want to imbue AI systems with institutional values. So how do you think about those social institutions that are so idiosyncratic to human relationships? Are they also ground truths that we can draw value from when we are trying to align our AI systems? Do you think there's value in drawing from social institutions as well? Okay, so how? Yeah. So, absolutely, the institutions I'm talking about include those, because those are normative systems, right? And they have structure, and they even have processes, where, you know, the Rabbi or the Imam or the Pope has the final say on how we interpret this piece of our book, or whether or not this is acceptable behavior.
As well as lots of structure to those kinds of arguments. So those are definitely normative systems, and those are institutions that I would also fit into this framework. But remember, the thing I'm arguing against is the idea that we extract those values and imbue them into our systems. It's rather: okay, I want to build systems that can go into a community, and to be normatively competent is to be able to read what is acceptable, what I need to do, and, relatedly, what I need to punish in this community. That community could be a religious group, could be a municipality, could be a state, a city, right? A workplace. We actually have lots and lots of overlapping and different groups. And this is, I think, why I want to focus on the general capacity to do this.
This is not me saying here are the values we should choose. I think that's fundamentally undemocratic, and I think it's fundamentally unstable. Right? I do not think that we as researchers can say, here are the good values, we're going to put these ones into our system. Are there a lot of really crap value systems out there? Yes. That's the soup we're in, right? But that's the human problem. That's the human problem. So, like I said, my starting point is: okay, you are going to release all these agents. I'm not sure this is a great idea, but here's the path we're on. And do not think, oh, well, we'll just put the good values into the systems. You can't do it, it's not robust, and it's fundamentally undemocratic, which really bothers me. The other point I make about constitutions is, you can't have secret constitutions, right? Like, if there's any constitution behind Claude, it's a secret constitution, and the same for whatever is happening in the RLHF at GDM or OpenAI.
We don't know, because of, by the way, a thing that we got from corporate legal personhood, which is: oh, they're entitled, and law will help them protect this, to keep secret whatever they've produced internally in the company. I think we also need to change that. But you can't think of this as an engineering problem of, well, we'll just figure out what the good values are, we'll stick those in the models and in the agents, and we'll be good. If you're going to release agents, and it's not even clear you need agents rather than just systems, yeah, we need something. But that's our human problem. Are there terrible values in the world? Absolutely. Terrible communities? Absolutely. We need to figure out how we deal with that; that's the path that humans are on. I also agree, though, and I'm saying, for a very long period of time, those are human institutions. That's the control.
Okay, that brings us to time. Everyone join me in thanking Gillian Hadfield.