Making Sense of the AI Auditing Ecosystem
Summary
Addressing how imprecise language often renders policy ineffective, Miranda Bogen's team designed a two-axis framework that plots audit efforts along dimensions of scope and independence, illuminating the distinct objectives and trade-offs of different audit approaches.
SESSION Transcript
Josh New was reminding me that about 10 years ago there were about eight people who would have joined a room like this. And it's good to see that the field has grown so far. Thank you so much. And I think a lesson that I've learned over the course of that decade is the importance of translation between technical work and policy work and the danger of vagueness and slippery terms and the extent to which misunderstandings or disagreements over key terms can derail a whole process.
As the previous speaker said, not having a definition set up from the get-go means you get stalled for weeks, months, years, just arguing over what you're even talking about. And so that's what I'm going to talk about today. I think we've heard already today at least two, if not three, definitions of what audits could be, whether that's on-chip verification of certain characteristics of models or running verification of corporate practices, and it's that confusion I want to address.
You know, when I first started the Governance Lab about two years ago, one of the first questions I got asked was, what's an audit? We all know we want one, but what is it actually? The motivation for starting such a lab at a civil society organization was for us to be able to answer questions like that. And so that motivated my team to think through this disconnect, because when we lack clarity about what we're talking about, we risk passing laws, passing standards, or aligning our own practices in ways that don't actually accomplish our goals but make it look like we do.
And it's something that people can point to and say, oh, but we're red teaming, we're assessing risk, we're auditing this, when something is missing there. I think none of us want to be in a world where we think we've passed something meaningful and we're not actually getting to a safer place. You risk ending up with checkbox exercises that provide false confidence rather than rigorous investigation that surfaces real risks and motivates changes in behavior.
Whether you call it red teaming, auditing, evaluation, impact assessment, all of those words are used, sometimes synonymously, or sometimes each of those words is used to mean a dramatically different set of activities. This current landscape can feel incredibly confusing, especially for policymakers who are not thinking about this on a day-to-day basis. And while I wish I could say that auditing means independent verification of a specific fact pattern, it doesn't always mean that. It means many, many things.
And so take red teaming. Some people use that term to mean activities intended to detect unforeseen risks through broad exploration, sort of poking and prodding different systems, and trying to raise awareness of those risks to motivate interventions and safeguards.
But others use it to mean specific investigations of attack vectors for a specific risk they're trying to identify. Sometimes there's internal red teaming, sometimes there's external red teaming. When we're crafting requirements around whether something ought to undergo red teaming, if we don't parse out what it is we're trying to accomplish, we're not going to get there.
Maybe some set of these different activities accomplishes the goals we might have, but we don't necessarily know. My team was concerned that this terminological slipperiness was going to lead to circumstances where some assessments were presumed to be superior because they were technical, or because they were independent, or because they were broad, or because they were purposely specific.
Which would mean that some activities in this space of accountability would be prioritized, while others would kind of be set aside or left to be voluntary. And that seemed to be a prospect that we didn't like because I don't think any of us quite know what we need to do to move toward a safer circumstance.
There's a lot we don't know yet. There are things that we would like to see, but we don't know exactly what we're asking for. And so we wanted to look at this ecosystem. We analyzed probably hundreds of assessment efforts across industry, academia, and government, and tried to understand the characteristics and expectations people had of these different approaches, just to wrap our heads around it so we could help explain it to folks.
So I would take this talk as sort of a step back. It's not only about technical policy interventions, but hopefully it will be helpful in giving you vocabulary to talk about the types of activities you're advocating for in different contexts, and to be able to explain why what someone else has proposed is maybe not what you're talking about. What we found was that most assessment efforts fall along a spectrum of specificity. This can be assessments of models or systems or organizations, but all of them exist on this spectrum, starting with very exploratory assessment.
That's a broad investigation of potential harms or business practices without necessarily having predefined boundaries on what you're looking for. Think early-stage red teaming, where diverse participants are given free rein to poke and prod, discover what issues emerge, and surface harms. The theory of change of that type of investigation is simple: it's casting a wide net to catch unforeseen problems, because you can't check for a problem if you don't know it exists in the first place.
Structured assessment is that same type of activity, but working within a defined framework. So think of a taxonomy of harms. Some of those are outlined in different laws; the EU AI Act, for example, has specific harms that you need to look at. Or it could be topics a company has defined as its priorities, or it could be human rights.
But basically it's taking a slightly narrower set of things, having a general sense of how you're going to look at each one, and using the findings either to prioritize what you're going to go deeper into or to assess whether an organization has sufficiently investigated a set of different risks. That was the structured part of the spectrum. Focused assessment is taking a specific harm or a specific risk and digging into it, ideally, from our perspective, through mixed-method approaches: maybe technical investigation, but also understanding how a system might interact with the people using it and the institutions it's situated within, to understand how that risk might manifest, whether it's exceeding a specific expectation, and to really fully unpack the nature of that risk and inform what would be necessary to mitigate it.
Again, to hopefully ultimately ensure that mitigation is happening. Finally, specific assessment means measuring the particular characteristics of a very precise fact pattern of a system. So let's say a benchmark, a specific measurement, a specific metric of a system, and verifying that a model or an organization has either exceeded or not exceeded a particular numerical threshold, or has met a particular concrete rule or commitment they've made: basically a true-false type of statement.
I kind of think of this as a sideways pyramid. Exploratory is very broad: you might have a general sense of some concrete things you're looking at, but you're not bounded by that. Structured is a handful of different things with some flexibility at the edges, focused is one particular area, and specific is a very precise thing you're investigating.
I won't linger on this slide too long; all of this is available in a report that you can grab afterward if you're interested. These are obviously dramatically different activities, but by teasing them apart, our hypothesis is that it's easier to see which ones can be complemented by different types of activities, and who ought to be doing what.
And it really makes clear that doing any one of these alone is probably not going to get us to the state we want to be in. It might get us part of the way there, but we can more easily see what's missing. So the second dimension was the degree of independence of these efforts, from low independence, meaning internal teams, the folks developing the models or systems themselves, to moderate independence.
Moderate independence might mean assessment from a reasonably disinterested party, but that could still be an internal risk team or an internal audit team, or it could be an external team that's contracted to do the work but faces some constraints over what it can share and on what timeline. And then there are truly independent third-party efforts, where there may not even be any contractual relationship between the parties and the posture is somewhat more adversarial.
All of these have pros and cons as well. There's a lot of attention on third-party auditing, which is absolutely understandable, and something we definitely need. But at the same time, that most independent type of effort, people who really have no relationship with an entity, also might not have access to the technical details they need to verify characteristics of a model, or even the infrastructure to confirm that the data provided in an audit is accurate in the first place.
Think of logging data to be sure that what you think you know about a system is actually what's happening in the real world, or a model that you can test in a vacuum but that doesn't give you the full picture until you see it interact with the other components of a consumer product or with the real world. So again, there are pros and cons to each of these approaches. We shouldn't discount any of them as part of this ecosystem, but really thinking about which of them is going to accomplish the goal of any given proposal is really, really important.
So why does this matter? We found that combining these two axes helped build a mental model of all sorts of different activities that seemed to vaguely relate to one another but that people were describing in dramatically different ways.
So what are the gaps in current approaches? Where are they situated? The example I want to go into is actually a New York City example, which you'll appreciate. New York City Local Law 144 requires independent audits of automated hiring tools, measuring a specific bias metric of those tools.
That sits at the intersection of medium-to-high independence and high specificity. It's excellent for generating comparative results across organizations and potentially informing remediation efforts. But it's poorly suited for identifying novel forms of discrimination, there's been a lot of debate over whether the metrics that were selected are actually telling us much at all, and whether it's actually motivating change is another question.
But it's one of the first laws in the space that required something like this, so it's been really valuable to poke and prod and see how effective it's been. Meanwhile, when companies conduct internal exploratory red teaming before model release, they're operating with low independence but very broad scope.
And that can be important for having time to identify a set of risks and actually surface them early enough to address them. An external third-party audit after deployment is not going to have that opportunity, but you still might want that to come in at a different point.
So neither of these approaches is inherently better, because that requires answering: better at what? They each serve a particular purpose in the ecosystem of governance and accountability, and being able to parse that out, I think, is important. So this framework, I think, offers a couple of key insights for me.
It just helps me understand what I'm even thinking about and answer questions when people ask, what should the red teaming law say? Well, what are you trying to accomplish? But hopefully it's helpful for you as well. So a couple of things it raises for me. First, no single assessment approach can accomplish all goals.
Exploratory approaches might be needed to find novel risks. Specific measurements could trigger regulatory action or particular activities that are necessary in order to go to market, export controls, whatever it is; you might need a specific metric against which you can validate in order to trigger those things.
And there will be arguments over, A, what those metrics are and, B, whether someone has met or exceeded them. We should be prepared for that, and I think that's where technical research can come into play quite a bit. What are the measurement methodologies?
How confident are we that those are actually measuring the things we want them to measure? How confident are we that a different approach is not also measuring that thing? And what if they disagree? There's a lot that's going to happen at that point of debate.
The second is that independence and scope trade off with each other in important ways. So external researchers and auditors have the credibility and freedom to raise concerns without fear of corporate retaliation, but they often lack the system access they need to do kind of a deeper dive.
And I think there are different policy levers you can pull to change that balance of power. A Whistleblower Protection Act, for example, can help change that. But thinking about what constraints any given party, at any given scope of inquiry, might be facing is important when we're thinking about our goals. And the timing of any of this matters enormously.
So, you know, pre-deployment assessments and audits offer an opportunity to inform system design, but they might miss harms that only appear once things hit the real world. How is that going to interplay? And how do we make sure there are feedback loops from any of these inquiries back into the actual design process? We've seen, I think, a number of examples where audits or research or evals identify an issue and it's sort of brushed aside, or it's like, oh, that's interesting, but we're going to move forward in any case.
So really articulating beforehand what the consequences of the conclusions of any assessment effort will be is really important if we want to accomplish a particular goal. That leads me to my recommendations. First, we should start with a clear goal when we're trying to define what an audit or any related activity is, because that will help determine where on that graph we should land.
The next is to match the right method to the goal and figure out: are we looking for an exploratory approach, or are we looking for confirmation of a specific set of things? If it's the latter, we need very clear definitions of the set of things we are confirming, and of what would count as a checkbox, insufficient demonstration of that thing versus a real, rigorous demonstration of that type of activity.
We do need to invest more in genuinely independent audits where it matters most. Not every assessment needs that completely disinterested third-party approach, but some things might. We also need to better define what a moderate-independence effort would look like that we would still feel confident was sufficiently independent not to have been captured by the interests of the auditee.
Is that prohibiting constraints on what they can share, and when, and with whom? Is that some kind of different financing structure, et cetera? And finally, we need to resource any sort of assessment appropriately, with both funding and time. Even technical assessments can take a while: developing eval data sets, deciding what you're testing in the first place. And that's just the most specific part of the picture.
And there are so many other things that take a lot more time and more stakeholders to really feel confident that you're identifying the practices that might be most concerning. So the stakes of getting this right, I think, are really high. I wouldn't identify myself as traditionally from the AI safety space, but even across the different communities that care about AI policy, there's so much shared interest in developing an ecosystem of real, meaningful efforts to spot risks and make sure those risks are mitigated that I think this is an area where there could be a lot of overlap and alignment.
And making sure we get to the finish line and actually get something across into legislation or into corporate practice does require us to get a little bit more specific. Otherwise we'll all say, yes, we agree that auditing is important, but then get to the end and be like, oh, wait, but you only meant frontier models, I actually meant classifiers, and I wanted them to measure sociotechnical impact and you wanted them to measure chip capacity.
If we can get on the same page ahead of time, I think we'll all have more likelihood of success in achieving all of our goals. I don't know where I am on time, but I will stop there.