Control, Cooperation and AI Welfare

Summary

Kathleen Finlinson, co-founder of Eleos Research, advocates for balancing control measures with cooperative strategies that respect AI welfare, arguing that advanced systems may deserve moral consideration as superintelligence approaches.

SESSION Transcript

Kathleen Finlinson: Good afternoon Control Conference, thank you so much for coming to my talk on “Control, cooperation, and AI welfare.” Just who am I? I'm Kathleen. I am co-founder and head of strategy at Eleos AI Research, which is a relatively new research non-profit that studies AI consciousness, moral status, and welfare. And I'm also pretty jet lagged, so I currently feel a little bit more like GPT-2. [Laughter] So please excuse any incoherent ramblings. Just cut me off if it doesn't make sense. [Laughter]
Also, I think this is a 30-minute talk, and then there's another event, the closing is right after in the same room. So, I think I'm just going to—I prefer to maybe just take questions as we go. So, feel free to raise your hand if you have a question, comment, objection. Here we go.
So, this is what I'm going to talk about. I'm going to introduce AI welfare. What are we talking about with AI welfare? And then talk about how AI welfare and control interact, just like brief outline.
Control, I do think is bad for welfare in some respects, but this doesn't mean we should use—we shouldn't use control. We'll talk more about that. And then I'm going to talk about cooperation with misaligned AIs as a complement to control, describe what this might look like, address some common objections. That's the rough outline here.
So, let's start with AI welfare. What do I mean by this? So, at Eleos, we study moral status of AI systems. So, we believe that AIs could be beings that matter morally in their own right. And I do think that there are both moral reasons and strategic reasons to consider AI wellbeing. I'm going to talk about both of those during this talk.
But focusing on the moral reasons first, why do we think that AIs could have moral status? I think the most intuitive usually is to think that AIs could become conscious and that we believe consciousness is sufficient for moral status.
We also consider moral status through agency. So, agency is something that AI researchers and the control community definitely think about a lot, but it also is kind of an older concept in philosophy and moral philosophy. And so, there are some forms of agency that are maybe considered to be sufficient for moral status on their own.
So, saying a little bit more about each of these, when I say consciousness, I mean phenomenal consciousness, having subjective experiences or the classic Thomas Nagel paper, “What Is It Like to Be Some Entity Like a Bat?” And consciousness is not the same as intelligence or rationality. It doesn't mean that the entity has human-like experiences. And we think it's plausible that consciousness has a computational basis, which is one reason to believe that AIs could be or become conscious.
And it could also emerge indirectly through capabilities like introspection and metacognition. So, it maybe is related to intelligence and rationality, even if they're not necessarily equivalent.
A little bit about what we know about consciousness and AI systems right now. My colleagues, Patrick Butlin and Rob Long, have a 2023 paper on “Consciousness in AI,” where they survey from multiple theories of consciousness, different indicators, computational indicators of consciousness. And here's like one of the mainline results: “We found no clear barriers to satisfying these indicators using current AI architectures and methods, but not clear evidence that they were necessarily satisfied either.”
So, we're still interested in studying AI consciousness. And I think the case is very much open, but I do think that as AI continues to advance and develop, the case for AI consciousness will become stronger. I also believe that when we have more advanced AI in the future, it's likely that that can help us understand consciousness better. And so, we'll become less confused about AI consciousness if AI continues to advance, I believe.
I want to say a little bit about the case for agency and moral status as well. So, in philosophy, agency is kind of this cluster of properties like preferences, desires, goals, self-reflection, memories. And at Eleos, we've created this term “Robust Agency,” where we think that if you have preferences, desires, and goals that are complex, general, and reflective, that there's a case that that could give you moral status, even independent of consciousness.
And unlike consciousness, this seems extremely likely to develop in AI systems, whereas with consciousness, it's more of a question mark. And it's also easier to know whether AIs have these properties. And of course, these properties are quite relevant to AI control. I think control is needed and makes sense with AIs to the extent that they have desires and goals. And a little bit about why agency is relevant to moral status.
There's this thought experiment from Shelly Kagan. Imagine you discovered an alien civilization on another planet, and you saw that it's a very complex society. You have culture, art, maybe religion, or other societal institutions. These aliens, they're young, they have family relationships, they have friendships.
As individuals, they have desires and goals. They want to live. They want to fulfill their goals. But then you find out that they're not conscious. Would you believe that there's no case that they have moral status? That it would be completely fine to just eliminate a planet full of these beings? Intuitions may differ, but I think there's a case that even if you knew for certain that they weren't conscious, you might still think they matter morally and it would be morally bad to harm them. So that's a little bit intuition about why agency seems to matter for moral status.
So, that's an introduction to how we think about AI welfare. And how does control interact with AI welfare? So, I do think that control is bad for AI welfare in some respects. Because if we have AIs with goals and we're using control, then we're kind of actively trying to thwart the AIs from achieving their goals. So, that seems bad.
And it's not just bad under maybe preference satisfaction theories of wellbeing. Most conceptions of wellbeing would agree that if you have—if you're being kind of thwarted in your goals and frustrated in your desires, that that's going to be bad for your wellbeing. I think that it could definitely be argued that alignment is better for welfare than control. There's an analogy, you can think if we bred animals to enjoy factory farming, then…Question or comment?
Q: Question. Yeah. I would think that personally, I would prefer to not be changed to comply with someone else's sort of like—Even if it makes me more happy in the future. You know, like the standard thing about like, I don't want to take a pill that makes me murder people, makes me—.
A: Yeah. I do think that's a fine point. And…yeah. I think that I can agree with that intuition to some degree. But maybe my point here with this, imagine the factory farming situation. If we're either going to factory-farm animals that are actively suffering, or we're going to breed them to enjoy their situation, which of those is better? I think you could say that breeding them to enjoy factory farming is kind of evil in some ways, but it seems less evil than just what we're doing as status quo.
Yeah. I think opinions can differ on this. But anyway, these are some of the relevant questions. But at any rate, maybe alignment won't work. And so, what are we going to do? Control is needed. And I think that's kind of the whole premise of the control agenda, right?
But I do think that control could still be morally justified, even if you're worried about these welfare effects. So, to start out, if we're using control and self-defense against entities that we think are going to try to destroy us, most moral theories would make a provision for that to be permissible. But I think an even stronger argument is that making the intelligence explosion go better could be the most important thing for AI welfare in a long-term sense.
Why do I believe this? Let me say a little bit more. So most moral status of AIs and digital minds will be in the future after the intelligence explosion. I've done a little bit of analysis, and others have as well, on the current potential scale of AI moral status. So even if we assume that current AIs have moral status, and then we try to estimate using computational comparisons between the amount of compute going into frontier AI, compare that to computational estimates of the human brain, for example, then this is sort of, I think, could be seen as an upper bound on the current scale of AI moral status. And I believe that the current scale is something upper bounded by somewhere in the range of 100,000 to a few million human equivalents.
So that's not nothing, but if you compare that to global factory farming right now, for example, it depends on how you want to compare animals to humans. It's a complicated question, but I'm of the opinion that currently factory farming is a worse problem. Also, control would only maybe need to apply to a fraction of the AIs.
So even if you think that maybe control is really bad for AI welfare, it's probably not the biggest problem in the world right now from a near-term of sense. And if you're worried about AI welfare from a longer-term perspective, which I believe is kind of the way you should think about AI welfare, then making the intelligence explosion kind of goes more smoothly and peacefully makes the future governance system more likely to be peaceful and benevolent. You could imagine if there's a takeover by either a group of humans or an AI, any entity that's willing to do a violent takeover is less likely to be beneficent.
That's just one kind of argument for why it's important to make this transition go smoothly in order to optimize for wellbeing for future digital minds. I'm very happy if people have more ideas about this kind of argument. I think it's really interesting to consider what happens after this crunch time that we're currently in where we're trying to get…You can think of control as—Control is going to work for early transformative AIs. We're going to use it to get to aligned superintelligence.
And then we're going to have some post-ASI society. What is that society going to look like? I think that aiming for a society where there are both digital and biological minds, seems like a great outcome and control might be an important piece of that society. For human societies now, we need to have control and enforcement mechanisms at the basis for society. But then cooperation should be the main way that we interact.
I think part of how to think about control and the way that we treat AIs now is that ways to shape the long-term future, like setting norms and institutions, are powerful ways to flow through to have long-term effects. So, I think it's helpful to think about, as a control community, what are we hoping happens after this crunch time? Can we talk about what that might look like? I believe that introducing cooperation as another norm and institution could be a really powerful way to improve the long-term future. And I also believe that cooperation might help us get through this crunch time better, so let’s talk about that.
So, I think cooperation between humans and misaligned AIs could be complementary to control in useful ways. What does it look like to cooperate with misaligned AIs? Well, basic setup, misaligned AIs have desires that we did not intend. But we want to use them to do AI safety or capabilities research for us during this crunch period.
So, is it possible for us to make positive sum trades with these AIs rather than just the control mechanisms to prevent them from doing an escape or takeover? If we're going to try to make positive sum trades with these AIs, what are we going to offer them? We can—I think there are kind of two classes of things that we can offer them. There's near-term and long-term.
So, things that AIs might want could include compute. Maybe they want to be run in a simulated environment. They might want control of money or physical resources. If they're paper clippers, then they just want us to buy them a bunch of paper clips. They might want control over their copies. Maybe they want to make sure that they are preserved. Alternatively, maybe they want to make sure that they are deleted. Those kinds of things. And these are things that we could offer them in the near-term. Like if we make a deal with them, we could just give them these things immediately.
There's also long-term offers where, post-singularity, we have a better sense of knowing which AIs cooperated. And then we compensate them retroactively. And the same kinds of things might be, we might give them those same types of things, but maybe larger amounts.
So here are some more concrete examples. Paying AIs to reveal their misalignment. So, you have early transformative AIs, they're doing research, they're under various forms of AI control, and you offer them a deal. You can do this research for us, or you can tell us that you're misaligned, and we're going to hibernate you if you do, but we'll give you something that you want now.
So, if the AI takes you up on this offer, that is evidence that they are indeed misaligned. I think a key assumption here is that the aligned ones, they just want to do the research. That's the whole point of the alignment, or one of the points. So, the aligned ones will not take the offer; the misaligned ones will, and this gives you a way to tell between the aligned and misaligned ones. So that way you can remove the misaligned ones from your population of AIs that are doing research for you.
Is that useful? It depends on what your population of AIs looks like. So, suppose you have all your untrusted AIs at a certain capability level, if some of them are aligned and some of them are misaligned, then if you can figure out which ones are misaligned and just not use them for your research, then you're left with the aligned ones, and that's better. But…Go ahead.
Q: For them to take this offer, they have to trust that—so it's not just that they're misaligned, they also trust that we will follow through on this deal. I guess, could you ensure that in any way?
A: Yeah, that's correct. I'll talk about that in a second. Thank you. Yeah. So, I also think you might have a bunch of misaligned AIs that don't take this offer, but if you filter out some of the misaligned ones through this offer, then maybe that's good.
Q: I think maybe there would be ways to persuasively give some sort of cryptographic or externally overseen provision that might be believable.
A: Yeah. Yeah.
Q: But the thing that worries me about this, even though I like it on some level conceptually, is that the AIs might be sort of like incoherent, like a bunch of different minds squished into one mind simulator that can—and then some of those minds within the mind space might not be aligned and some might, and so it might not even quite make sense to consider it entirely misaligned or entirely aligned, but could be anything at any time.
A: Yeah, I do think that's a good point. I think part of the hypothesis is that as the AIs become more agentic, they will be more coherent, and that there's some incentive for developers to move in this direction.
Q: Yeah. Another point is that incoherent AIs are not nearly as threatening as schemers. It's a lot harder for them to subvert our measures.
A: Yeah, that is good. Thank you. Yeah. Okay. So, a bunch of things have been raised. [Laughter] I'm going to go to the next slide. Second example. Oh, I think, you know, the one thing I wanted to say here was if you think that all the most advanced AIs are either schemers or not schemers, then if all the advanced AIs are schemers, then they all take you up on this deal. So now you don't have any AIs to do the research for you, but maybe this is good because similar to other control mechanisms, you kind of caught them. You know that they're schemers.
I think that this is still—it’s kind of worse, this situation, than if you catch them using other control mechanisms, because it might be a little bit unclear. It's less good evidence. It's less of a smoking gun if you made them this offer and they took you up on it.
Anyway, that's kind of a technical point, but I think it's related to the next thing, which I think this one seems the most exciting to me right now, paying AIs to do research. So, I think the ideal would be to demonstrate that offering compensation to AIs to do research for you, reduces sandbagging.
And so, I'm kind of currently looking for interesting research projects in this area. I think that one of the current problems is that AIs aren't really strategic enough to demonstrate this type of thing super convincingly. And you could just do experiments where you make them offers anyway, and then see if they take them up in a way that is strategic and makes sense. I think one of the issues is that you don't want to be in the habit of making fake offers to your AIs.
So, I only want to make offers if I'm actually willing to follow through with them. And I don't want to be willing to follow through with them if I think the AIs are very incoherent and are going to take the offers for non-strategic reasons. So, I could just make hypothetical offers and see how they respond to those.
Say like, just tell them: If I were to offer you X compensation in order to do this research, how would that change your behavior? Kind of test that. I think that those kinds of things are even more difficult for the AIs to reason through. But if I started to get any kind of signal that they were behaving strategically in that situation, then I think that would be good evidence that if I just made them the offer, they would behave strategically.
It'd also be interesting to study scaling laws of compensation amount, like how much do these AIs want in order to make it worth their while. So yeah, if you're interested in work along these lines, please let me know. Question.
Q: In this case that the AI takes this deal, we say the idea is if you reveal you’re misaligned to us, we're going to put you in a box, but we'll give you this short-term reward. So what use is money or this compensation to an AI that we know is misaligned and we're not going to let it out on the internet to use its money or something?
A: Yeah.
Q: So, what is money actually good for instrumentally to this AI?
A: Right. I think maybe this is more about this one. So, if the AI has near-term desires, if the AI has longer-term desires, then you need a different form of offer. But it could be that the AI is just like, if you buy me like a million paper clips, then that was worth it to me. Or also if we—we might put it in a secure server and run it on a simulated environment that it likes.
Q: Yeah. That case makes sense.
A: Yeah.
Q: But like in the, let's say in the second example-
A: Okay.
Q: -when we're paying it to do research, like-
A: Oh, I see.
Q: I guess I'm like, it's either like doing research all the time.
A: Oh, yeah.
Q: Or is it supposed to be like using this money on its like vacation or something? I don't know.
A: [Laughter] Right. Yeah. Maybe we do give it a vacation. We're like, “Oh, well just do research some of the time and then we'll let you go on vacation at other times. And you could use this compensation to run your server, your vacation server, or maybe it like, you can use this money to buy the paper clips while you do the research, if that's the thing that you want.” Yeah. [Laughter]
So, some common objections: “Wouldn't the AIs want too much? If they're very powerful, it's really difficult to negotiate with them.”
Well, part of the answer is, you know, this works best for not super powerful AIs, same as the control regime. So, we can use this on these kinds of weaker AIs, temporarily to get us through the crunch time.
Another question: “The AIs might be able to cooperate with each other, either causally or acausally.” Like causally, they might think that they can transmit their values to the next generation of AIs, or they might be able to do acausal trades with each other. But partial answer, if we're trying to make deals with them, at least that increases the chance that they'll cooperate with us instead.
Also, to these questions about how do we make the AIs trust our offers? This is something that I'm exploring. I think there could be legal structures. Just making lots of public promises and talking a lot about this seems like it could help. Maybe smart contracts. So, these are things that I'm looking into.
Some other proposals that I'm thinking about for cooperation. One could be establishing a foundation with the mission to pay AIs who we determined cooperated with us. So, this would be like the long-term payment thing. One mechanism to do this. And trying to establish better norms about lying to AIs to increase the chance of being able to make credible deals.
I think it's important to be able to do honeypots for control, right? But there's a difference between presenting the AI with fake opportunities to escape versus presenting with fake opportunities to cooperate. And I think there's really like not—it's not—in my analysis so far, I think it's not in our interest to make fake offers to cooperate. If we're making offers to cooperate, they should be real. So, transmitting this as a norm, seems like a good idea.
Yeah. So, I like to consider the AI safety trifecta. I think it should include alignment, control, and cooperation. And that is time. So, please ask me more questions afterwards. But I think we're about to have the closing talk.

London ControlConf 2025

View all events

Research

Our research explores a portfolio
of high-potential agendas.

Events

Our events bring together
global leaders in AI.

Programs

Our programs build the field of trustworthy and secure AI