Singular Learning Theory and AI Safety

Summary

Singular learning theory explains why neural networks generalize through degeneracy in loss landscapes, offering a framework where data shapes internal structure that may be interpretable through geometric analysis.

Session Transcript

Like it says right here, like Vael just introduced, we're going to talk about singular learning theory in the context of AI safety. I want to start by saying something about the role of science and theory, in particular in the context of safety and safety engineering: how do these two interact, and why is it relevant to do theory in the context of safety? Then I'm going to talk about singular learning theory in particular. I'll sketch some of the basics at a very high level, so we won't get too much into the math, and explain what it means for understanding deep learning. And then I'm going to sketch two directions for using this as a starting point to understand generalization and learning better, and to use this understanding to develop new tools and applications for interpretability and alignment.

So let's start with the question of science and safety. And I'm going to start at the beginning, which is maybe 300 years ago. Okay, so I want to start with an anecdote from the history of science from the Industrial Revolution that seems to me to be very similar to the kind of predicament we're finding ourselves in. Okay, so it's the early 1800s, and we've started to develop steam engines. They've gotten better and better through empirical tinkering. So people come up with new refinements. They find a way to develop a Watt steam engine which uses pressure chambers. You end up with slight tweaks that get better and better efficiencies. You get more and more work output for a given amount of fuel you put into it. And so we're starting to transform the world with these steam engines.

The principle of natural selection ensured the survival of the fittest form of heat engine. All manner of variations were tried and found wanting so that in the end, the steam engine emerged as the undisputed champion inheritor of the Earth. Okay, this led to some problems. So it's 1817, and we're at Norwich Bridge, where an explosion occurs. So a steam packet boiler comes under too much pressure, and this causes an explosion that kills 10 people. Okay, so we're starting to push the world into these more extreme situations, higher pressures and temperatures than have been encountered in the environment to date by designing these new kinds of tools. And we're starting to see the consequences of pushing ahead without a deep understanding of how this works.

So we need scientific progress. As a direct result of this catastrophe, the UK Parliament set up this select committee of the House of Commons to investigate the problem of safety of steam engines. They came up with some practical, empirical recommendations. You should use wrought iron instead of cast iron. That's a better material. Use safety valves so that if the pressure is too high, it automatically releases. You should do some pressure testing ahead of time to ensure that the system's not going to blow up. And they came up with some theoretical requests. They could see the trend lines. If you start increasing the pressure, increasing the temperature, you're going to get better and better yields, better and better efficiency out of your engines.

So what happens in the long term if we draw these trends forward? We need long-term research into these scaling laws, into these gas laws. That eventually would come. These empirical gas laws told us that better efficiency would come from higher temperatures, higher pressures, higher differentials. And could we use this to develop a better understanding of these catastrophes that we're empirically starting to run into? Because obviously high-pressure steam engines are far too valuable to give up. Okay, we got that revolution, right? So 50 years later, maybe even just two decades later, after these requests were put forth, we started to develop a deeper understanding of what's going on here through thermodynamics, through the kinetic theory of gases, which gives us a microscopic explanation for why we're observing these scaling laws, and through the broader agenda of catastrophe theory, the theory of critical phenomena, phase transitions. That also gives us an idea of when mechanical systems suddenly break and rupture, and how to detect this ahead of time.

So here we go again. It's 2025. We've seen over the last decade iterative tinkering give rise to better and better thinking machines. At this point, we're still not even in the evolutionary phase. We're still in sort of early creationism when it comes to understanding what's going on here. And we're starting to see risks, we're starting to see actual consequences: suicides recently, reward hacking at scale, sycophancy. What next? I think we're here because we fear the long-term consequences might be more severe than even the things we're seeing right now. Can we prevent those before we run into them? Maybe. Here too, the UK government is the first in the world to set up the UK AI Security Institute to investigate these risks and come up with practical recommendations.

So instead of recommendations about the material, maybe we can design systems that are just inherently more interpretable. Instead of safety valves, safety guardrails; instead of pressure testing, let's do some dangerous capability evals before we release these systems. At the same time, the UK government makes explicit requests for deeper theoretical understanding of what's going on here. Why are the scaling laws the way they are? Where does this come from? How can we control what's going on? Evals alone aren't going to be enough in the limit. So again we have the same set of observations: empirically we get these proto-theoretical scaling laws, where we know that scaling up compute, scaling up data set size, scaling up the size of models will lead to better performance. But predicting the downstream consequences of this on any particular capability, any particular benchmark, is much harder and noisier and just inherently uncertain.

So how do we square this knowledge that performance is going to improve with the difficulty of predicting where and when it will break? There's an alleged quote from Sam here: what are we supposed to do, turn it off now? Today I want to focus on these levers of control and understand in particular the role of data. So we combine data, compute, memory into some kind of engine, we crank the lever, and out come capabilities. And I'd argue that the data side of this is the most important part. So that's what we're going to try to understand: the role of data in this picture.

This is important because data isn't just the thing that controls capabilities and thus catastrophic risks, it's the thing that controls values. So that's the source of longer-term concerns about misalignment. Consider, for example, that every technique we have for doing alignment in practice just looks like an existing deep learning technique with different data. So take next-token prediction: you train against next-token prediction. If you do this on internet data, it's pre-training, which is where capabilities originally came from. If you do it on example demonstrations of helpful or harmful data, then we call it instruction or safety fine-tuning. Your question? Yes: why is data the most important of these three?

Well, I'd say, okay, so the question is: why is data the most important of these three? I point to two things. I can take a single architecture, a fixed architecture, train it on totally different data sets, and I will get completely different models at the other end. And the other thing I can do is take different architectures, train them on the same data, and I'll often notice that the representations they learned are extremely similar. We get this universality. So you need a sufficiently expressive architecture, you need an optimizer that's initialized correctly, but somehow these things do less to tell your model what algorithms to learn than the data itself does. Thank you.

So alignment techniques are capability techniques. Here's another example. If we do RL against a binary source of truth from an automated verifier, for example, for math and coding problems, we get chain-of-thought RL reasoning training. If we do this instead against binary preference data, that's RLHF, in the form of a learned reward model. Maybe another example: if we tell an LLM to modify its own training data and to continue training on that, well, in the context of math and coding problems, these are the kinds of synthetic data loops that show up in models with state-of-the-art scaling coefficients like Phi-4. If you do this against a specification of moral principles, that's constitutional AI, or deliberative alignment.

So in all these cases, alignment techniques are just deep learning techniques with different data. So we want to understand data. I would say that this is another answer to the question of why focus on data: maybe we should just accept that alignment in practice is going to look like doing deep learning and understanding which data to put into your system. So we want to understand this process. We want to understand what data is doing to our system to give us these capabilities and values out the other end. So let's roll it out. The steps are: the training data determines the internal structures that a model learns. It shapes the features, the circuits — however you want to talk about the internal structure a model acquires over the course of learning. And it's those structures that enable generalization and that determine model behavior in deployment.

So data determines structure, which determines behavior. What's wrong with this? Okay, I'll point out two problems, one with each link we see here. First, let's focus on generalization, the second half. The problem here is that I can have two models that have the exact same behavior on the training distribution but that nevertheless, when I deploy them in the real world, will behave in divergent ways. The classic example here would be deceptive alignment: this concern that a model might be faking, might be pretending to be aligned, and therefore behave the same as an aligned model would, and then downstream it executes a treacherous turn and starts misbehaving. So that's the problem. What's the solution?

We see the solution as being the discipline or the science of interpretability. In my eyes, interpretability is really about generalization accounting: for a given generalization behavior, can I attribute it to the structure that gave rise to it? So at a minimum, we want to be able to disambiguate the deceptive model from the aligned model and understand how they will behave downstream. What's the other problem with this mapping, this pipeline? The other problem is the fact that this depends on the learning process. The problem with the learning process is that I can have two models that achieve the same loss with respect to my training data, but do so in different ways.

So we have these two different sources of nonrealizability. One is that the loss is not enough information to distinguish two models. The second is that even for two models that have exactly the same behavior on a given training set, that's not enough to determine downstream generalization. What's the solution here? The solution here, I would say, is the science of alignment. Alignment — empirically, in practice, and I would argue this is really how we should think about alignment more generally — is about choosing the right data. That's it. If our control of AI systems is mediated by data, then alignment is about understanding the role of that data: how to choose data, how to design curricula to put values and capabilities into systems rigorously, in the way that we want.

Okay, so what we saw here is this example of the historical development of steam engines, where at some point you need some theoretical progress in order to push the frontier further and actually design systems that are safe. When you've pushed the world into these extreme situations that it's never seen before, maybe the same is true for AI systems. And maybe this kind of scientific understanding will help us understand these two fundamental problems at the heart of safety. That brings us to singular learning theory. So unfortunately I'm not done with my physics analogy, so we're going to have to go back for a little longer. But I promise we will get back to ML in a second.

So in physics, in the case of the Industrial Revolution, at some point we succeeded in making the transition from this proto-theory, this early empirical-scaling-laws kind of regime, to a fully fleshed out theory of phase transitions, thermodynamic phenomena, engines, heat, motion, work. How did we do this? Well, the phenomenon that we couldn't understand, that we needed to expand our scientific vocabulary to understand, is heat. So a thermodynamic system, one like the gas in this room, isn't just described by this — let me put it a little differently. The problem is that if you have a pendulum, that can be described by classical mechanics, but in the real world this pendulum will ultimately fall into a resting state. Why is that? That's because of friction, because of tiny microscopic interactions with particles in the surroundings, where the atoms are vibrating and releasing energy through those microscopic degrees of freedom to the surroundings. That's heat. So the change in energy of a system is described by this interaction between work and heat. And classical mechanics could only really talk about the first part of this. We needed to understand heat better and develop a theory of heat.
So how did we end up with a solution? The first solution came to us from thermodynamics. And thermodynamics says that the right quantity here, the thing you're really minimizing, is not the energy itself that you can measure directly. The relevant quantity is a free energy. It's a modified energy, a kind of effective energy, that involves a contribution from the energy you can measure and a kind of accounting term for the energy that you can't measure directly but that you can infer is there — the energy that's hidden in the microscopic motions of the particles that make up the system. So we go from a description purely in terms of energy to this description in terms of energy and entropy, this accounting term, trading off against each other. Even that's not a completely satisfying answer, because it's still — what is entropy? These are these sort of high-level concepts. And it's not until later, when we developed the theory of statistical physics, that we understood microscopically what gave rise to this accounting term: entropy really arises from what's going on microscopically in your system. And you could develop a theory that actually predicts these macroscopic phenomena from the underlying microscopic theory.
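For reference, these are the textbook relations this analogy leans on — the first law splitting energy changes into heat and work, and the free energy trading measurable energy off against entropy (standard thermodynamics, not anything specific to the slides):

```latex
% First law of thermodynamics (W = work done on the system):
\Delta U = Q + W
% Helmholtz free energy: the effective quantity that trades energy off against entropy
F = U - TS
```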

So why is that relevant for deep learning? In deep learning, we're finding ourselves in a similar position. The new phenomenon isn't heat; the mystery here, for developing a better understanding, is what's going on with generalization (I'll go back to this in a second). In general, it's going to be very hard to develop a theory of deep learning. So what we're going to do, instead of trying to understand theoretically what's going on with stochastic gradient descent, is approximate it, model it, as an idealized Bayesian learning process. And then we're going to take the predictions that we make in that setting and see if they empirically pan out in real-world models trained with SGD. This is going to let us pull in a lot of sophisticated theoretical machinery to talk about deep learning. But it's not complete — this is going to leave some kind of gap — but we can use it very productively to try to understand what's going on.

So can we use this to bring to bear on this problem of generalization? Neural networks train much better, generalize much better, than we classically expected them to. This was a central mystery when we first started this deep learning revolution. The classic dogma is that if you make models bigger, they'll overfit. Why doesn't this happen? Classically, at least, the Bayesian theory predicts that your generalization error is going to be dominated by a term that's proportional to the loss of your optimal solution. If you sample lots and lots of data, you'll find some optimal solution. So when you generalize to new data, you're not going to perform quite as well as you did on the training distribution. That's sort of a minimum bound.

And the degree to which you're going to do worse than that minimum is given by this term, and that term depends on two things: the number of parameters and the number of samples. If you have a larger data set, you'll generalize better — more data means you understand better what's going on here. But if you have a larger model, that's bad: you're going to overfit. And so if you have a billion-parameter neural network and you plug it into this equation, you're not going to predict good generalization. There's the mystery: we don't understand what's going on here. So the answer from SLT is degeneracy.
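To make this concrete, here is the standard asymptotic form of the expected Bayesian generalization error, stated roughly in the notation of Watanabe's results: the regular case is the classical prediction being described here, and the singular case is where SLT's learning coefficient comes in.

```latex
% Regular models (classical prediction): d parameters, n samples
\mathbb{E}[G_n] \;\approx\; L(w_0) + \frac{d}{2n}
% Singular models (SLT): the learning coefficient \lambda \le d/2 takes the place of d/2
\mathbb{E}[G_n] \;\approx\; L(w_0) + \frac{\lambda}{n}
```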

So generalization is the question. The answer here is going to be this new kind of term, this modification to the formula we just saw. Classical learning theory is really only about understanding this loss term, the minima. And what we see here, as in the thermodynamic case where you get a trade-off between work and heat, is a trade-off between loss and complexity, with complexity given by a term that measures degeneracy. So it's a similar change in perspective: it's not loss that's the relevant quantity here, but this modified loss, this free loss, this effective loss, and we're going to study that object and look at what it means.

So what is degeneracy? Degeneracy is a word that I'll throw out a lot, and it's pretty simple. You have this complicated loss landscape that you're searching over as you conduct your learning process. Degeneracy means that your minima are nonisolated, they're nonunique: there are directions you can move along in the loss landscape that don't change the loss, that don't even change functionally what you're implementing. This is relevant because it breaks a lot of classical results in learning theory, and it's going to give us an answer to the question of why neural networks generalize better than we anticipate.

So consider a very simple loss landscape. You have one parameter, and the loss is on the y-axis. We're trying to learn in this environment: where should you go? Where do you end up as we sample more and more data? Classically this solution is better, right? It's got a slightly lower loss, so maybe you want to end up here. But this solution over here occupies a lot more volume. So which should you learn? SLT is going to tell us that we get this trade-off between the loss and complexity terms, where initially what you'll find is that the posterior — your probability over weights — is going to concentrate around this regime. At some point you'll reach a critical size of your data set at which you have mass distributed over both solutions. And then if you push further, if you sample more and more data, at some point you will converge to this globally optimal solution.

So initially you actually want to favor the simpler solution, because you're still uncertain about what the right answer is. But in the long run, as you get more and more data, you get more certainty, and you can resolve that this lower-loss solution, even though it's harder to specify, even though it's narrower, is actually better. You can quantify this trade-off exactly by looking at the quantity called the free energy. This is the modified loss that I talked about before. Initially the free energy is lower for solution U, for this basin, this regime. Eventually the free energy is lower for this basin V. And so at the point where they're equal, we see this phase transition, a first-order phase transition where the mass jumps from one to the other.
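Here's a minimal numerical sketch of that crossover, using made-up loss and learning-coefficient values for the two basins (the numbers are purely illustrative, not from the talk) and the leading-order free energy formula discussed next:

```python
import numpy as np

# Hypothetical values: basin U is wide and simple (higher loss, lower lambda),
# basin V is narrow and accurate (lower loss, higher lambda).
L_U, lam_U = 0.105, 1.0
L_V, lam_V = 0.100, 4.0

def free_energy(L, lam, n):
    """Leading-order local free energy: F_n ~ n*L + lambda*log(n)."""
    return n * L + lam * np.log(n)

for n in [10, 100, 1_000, 10_000, 100_000]:
    F_U, F_V = free_energy(L_U, lam_U, n), free_energy(L_V, lam_V, n)
    winner = "U (simple)" if F_U < F_V else "V (accurate)"
    print(f"n={n:>7}: F_U={F_U:10.1f}  F_V={F_V:10.1f}  -> posterior favors {winner}")

# The posterior favors the simpler basin at small n and flips to the lower-loss
# basin once n is large enough: a first-order phase transition in which basin wins.
```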

This free energy is given by the following formula. I won't go into the details, but here you see this trade-off between loss (accuracy, performance) and complexity or degeneracy: how broad is your basin? That's maybe the central formula of singular learning theory. Yeah? W here is a region of parameter space, so this is the free energy associated to a region of parameter space. Then L_n is the loss for a data set of n samples at your optimum w* within that region, and the learning coefficient lambda, a measure of degeneracy, is also associated to this region.
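Written out, the leading-order form of that local free energy (Watanabe's asymptotic expansion, restricted to a neighborhood W of a local optimum w*; lower-order terms omitted) is roughly:

```latex
F_n(W) \;\approx\; n\,L_n(w^*) \;+\; \lambda(w^*)\,\log n
```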

So what is this learning coefficient? I call it a measure of degeneracy, a measure of complexity, the learning coefficient, a bunch of other names. What this thing measures is a volume scaling rate; it tells you, roughly, how broad this basin is. Heuristically, if you have a minimum, you can look at the number of solutions that are almost as good as that solution: within some tolerance epsilon, what region, what volume can I access? You find that this volume scales like some constant times epsilon to the power lambda. That power lambda is the learning coefficient. We've got some examples here that I'll actually skip over. This is generally a little more sophisticated than "broader basins" — "broader" can kind of lead you down the wrong path — but more degenerate solutions have a lower learning coefficient.
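Here is a toy Monte Carlo sketch of that volume-scaling definition. For a one-parameter loss L(w) = w^(2k), the volume of the set where L < epsilon scales like epsilon^(1/(2k)), so the fitted slope of log-volume against log-epsilon recovers lambda = 1/2 for w^2 and 1/4 for w^4. The example losses are mine, chosen because their learning coefficients are known exactly; this is not how lambda is estimated for real networks.

```python
import numpy as np

rng = np.random.default_rng(0)

def volume_scaling_exponent(loss_fn, n_samples=2_000_000, radius=1.0):
    """Estimate lambda as the slope of log V(eps) vs log eps, where V(eps) is the
    fraction of uniformly sampled weights with loss below eps."""
    w = rng.uniform(-radius, radius, size=n_samples)
    losses = loss_fn(w)
    epsilons = np.logspace(-6, -2, 20)
    volumes = np.array([(losses < eps).mean() for eps in epsilons])
    slope, _ = np.polyfit(np.log(epsilons), np.log(volumes), 1)
    return slope

print("lambda for L(w)=w^2 :", volume_scaling_exponent(lambda w: w**2))  # ~0.5
print("lambda for L(w)=w^4 :", volume_scaling_exponent(lambda w: w**4))  # ~0.25
```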

Okay, so what we end up with is this inductive bias, this implicit regularization term that penalizes complex solutions. You prefer the simpler thing, the high-degeneracy solution, initially. Okay, let me jump over this. So we have this problem we want to understand: why do neural networks generalize much better than we expect them to? And the answer initially seems to be degeneracy, just like entropy was the answer to how heat works and how you can predict the behavior of heat in the thermodynamics case. Can we come up with a deeper, more satisfying answer, a microscopic answer that really tells us where this macroscopic thing comes from?

Okay, and here I would say that the answer is algorithmic structure: the fact that neural networks are able to learn not just shallow lookup tables, but generalizable algorithms with deep hierarchical internal structure. The canonical example of this is the recent work by Anthropic on their sparse autoencoder (SAE) circuits. What you see there is something that's qualitatively different from earlier kinds of statistical models. If you look at regression, Gaussian processes, or clustering, none of these classical techniques is ever going to learn how to sort a list, for example, whereas this is something that a neural network can do easily. So neural networks learn algorithms, not just shallow memorized functions. And the learning coefficient is the natural measure of complexity associated to these algorithms. It's similar to a program length: how many bits does it take to specify the program that the neural network has learned? Let me jump over this.

To give an example of how this shows up, we can consider a ReLU neuron. If you have a dead neuron, that means that whatever activation you put into it, the dead neuron is always going to map it to zero. So that neuron doesn't actually matter for any subsequent calculations, because if you multiply any number by zero, you still get zero. That's exactly the case here: if the preactivations to your neuron are always negative, always below the threshold set by the bias, that neuron's never going to activate. That means that the outgoing weights of that neuron can be set to arbitrary values — you can multiply them by any number, and you still get the same result. That gives a direction in the loss landscape you can move along that doesn't change the function.
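Here's a small numpy sketch of that flat direction, using a toy one-hidden-layer ReLU network of my own construction (not from the talk): when one neuron's preactivation is negative on every input, changing its outgoing weight leaves the loss exactly unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 3))            # inputs
y = rng.normal(size=(256, 1))            # targets

W1 = rng.normal(size=(3, 4))             # input -> hidden weights
b1 = rng.normal(size=(1, 4))             # hidden biases
b1[0, 0] = -100.0                        # make neuron 0 "dead": preactivation always < 0
W2 = rng.normal(size=(4, 1))             # hidden -> output weights

def loss(W2):
    hidden = np.maximum(X @ W1 + b1, 0.0)   # ReLU; column 0 is identically zero
    return np.mean((hidden @ W2 - y) ** 2)

W2_perturbed = W2.copy()
W2_perturbed[0, 0] += 10.0               # move along the dead neuron's outgoing weight
print(loss(W2), loss(W2_perturbed))      # identical losses: a flat direction in the landscape
```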

In this case, you see very tangibly that the structure the model has learned — whether or not to use a given neuron — shows up in the geometry of the loss landscape, shows up as an increase in degeneracy. And so the learning coefficient is not just a measure of geometric degeneracy, which is how I introduced it initially; it's also a measure of model complexity: how complex is the algorithm the model has learned at a given point in training? Okay, so what's the deeper answer to this question of why models generalize so well? Well, it's the fact that they find these low-complexity algorithms that let them generalize far beyond what they saw during training, in some cases.

So we started back in thermodynamics land, and we saw that in order to resolve this mystery of heat, we needed to introduce entropy and this free energy trade-off between energy and entropy. And the same is true for understanding deep learning. In deep learning, it's not just about understanding loss performance on its own, but this balance between loss and complexity. And that loss-complexity trade-off is ultimately rooted in the fact that neural networks learn algorithms. Can we go further? Right. In physics, at some point, we went beyond thermodynamics, we went beyond statistical physics. We have a much more advanced theory. Can we see where the theory, in the case of deep learning, might be headed by drawing a parallel with the history of physics?

So we started with classical mechanics, which describes the flows and minima of energy, just like we started with classical learning theory, which describes the flows and minima of loss. At some point, we went to this thermodynamic regime, where it's a trade-off between energy and entropy; at some point in learning theory land, it becomes a trade-off between loss and complexity. In physics, what you get after thermodynamics is statistical physics, which tells you how to ground the observations here in microscopic structure, in what's actually going on at the smallest scale in your system. In statistical physics, you get a set of new tools for perturbing your system in structured ways to learn about that microscopic structure. If you hit a metal with a hammer and you listen to the vibration it makes, the sound that comes out of it, that tells you something about the crystalline structure of that piece of material.

Similarly, can we come up with a way to design these structured systematic perturbations and apply those to neural networks to infer internal structure? That's our hope for applying SLT to developmental interpretability or spectroscopy. Now physics went even further. So at some point physics went into this condensed matter physics regime, where now what physicists are starting to be able to do is they can start with a specification. I want to create a material that has these and these optical, electric, magnetic properties. Can I convert this specification into a recipe for designing this material in a lab? Which reagents do I need to combine, in which order, under what pressures and temperatures, in order to produce a material that has exactly what I want? We can't do that for all kinds of properties yet, but we're starting to be able to do it in some restricted domains.

Maybe the hope for learning theory is to do something analogous, something we call patterning: this idea that with a sufficiently deep understanding of what's going on here, you could actually design structured perturbations to the training distribution over the course of learning to etch models that have the precise properties you want. So you're no longer learning in this ad hoc way, but you're designing superintelligent materials that have the values, capabilities, and goals that you want. Okay, in the rest of this talk, which is a little less than half an hour, we're going to look at starting directions towards developing this statistical-physics understanding of neural networks and then this condensed-matter understanding of neural networks. The key idea here is that degeneracy is the thing that underlies why neural networks are able to learn algorithms, and thus generalize rather than just memorize.

Okay, so at the start I introduced these two problems of generalization and learning, where the solution to understanding generalization is interpretability and the solution to understanding learning is alignment. Let's take these one by one. Generalization is the problem that two models have the same behavior in training and then diverge in deployment. We want to understand why this happens. The key motivating idea for starting to use SLT to understand this problem is something we call structural Bayesianism: the hypothesis that the geometry of the loss landscape reflects the algorithms that a model has learned. If that's true, then what you could potentially do is study the geometry of a loss landscape and then invert this mapping to infer the algorithmic structure in the model that gave rise to that geometry.

Maybe it's actually more tractable to do interpretability here than it is to try to do interpretability directly on model weights or directly on activations. And that's our motivation for trying to use SLT in the context of interpretability. We'll talk through generalization in three different stages. Earlier I alluded to the fact that SLT gives us an answer to why models generalize as well as they do. The way I would describe this is that SLT "solves" the problem of in-distribution generalization for Bayesian learners, where the quotation marks have to do with some conditions around asymptotics. So, what we saw: I introduced this Bayesian generalization error, where the generalization error we're trying to predict theoretically is a KL divergence between the true data-generating distribution — the thing that gives me the samples that I'm training on — and this posterior predictive distribution, which is the central object for using a Bayesian model in a prediction setting.

Now, the details here aren't going to be important. What I wanted to point out to you is that this is like a normalized version of the generalization error we saw earlier. And asymptotically it's determined by the learning coefficient over the number of samples, like we saw. So it's in this sense that generalization is solved: generalization is determined by the LLC, the degeneracy, and the number of samples. And really the intuition I want to leave you with here is that what we're seeing is a relationship between sensitivity in data and sensitivity in weights. The question of generalization: if I change the data, how does my performance change? And sensitivity in weights: if I perturb my weights a little bit, how much does the loss increase on a fixed set of data? The relation between the Bayes generalization error and the learning coefficient is exactly of this form. The Bayes generalization error measures sensitivity in data; the learning coefficient measures sensitivity in weights.
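Written out, the object in question and its asymptotic behavior (this is Watanabe's standard result for the Bayes generalization error, stated informally and under the appropriate asymptotic conditions):

```latex
G_n \;=\; \mathbb{E}_{x}\!\left[\mathrm{KL}\big(q(x)\,\|\,p(x \mid D_n)\big)\right],
\qquad
\mathbb{E}_{D_n}\!\left[G_n\right] \;=\; \frac{\lambda}{n} + o\!\left(\tfrac{1}{n}\right)
```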

So here we're seeing a principled connection between algorithmic structure, generalization behavior, and the geometry of the loss landscape. So can we use this as a starting point to start extracting more information from the geometry? Let's talk about distribution shifts. That was in-distribution generalization: I'm evaluating my model on the same data distribution that I trained on. Can we go beyond this? I want to generalize what we're studying to be not just this Bayes generalization error; in general I want to measure some kind of expectation value of some property of my model. So phi is a property — maybe it's susceptibility to jailbreaking, maybe it's, I don't know, a ToxiGen score, or whatever property I'm interested in studying and predicting the behavior of.

So we saw this example where the expected increase in loss from a starting point to points in its neighborhood is related to the learning coefficient. Jumping over this for now, the question I want to ask is: we have a solution to the in-distribution generalization problem; what if we introduce a small perturbation? So we're not considering an arbitrary change in distribution — I'm going to make a small modification to my training distribution and see what happens. I'm going to model that as mixing in, with some small weight parameterized by h, a distribution Q prime. This might be something like: if I'm training on a mixture of GitHub data and WikiText data, I upweight GitHub slightly and downweight WikiText slightly. Or consider a much more fine-grained example, where I upweight a particular example — train on it twice instead of once. These kinds of modifications.

Well, what the physicist says you should do if you have a small parameter is expand in that parameter: you should do a Taylor approximation. So this is exactly what we're going to do here. We're going to take this expectation value that we want to predict, and we'll do a Taylor expansion in h. What you find is that the first-order correction to this term is given by this quantity here. That's what we call a susceptibility. So a susceptibility is a derivative of some expected property of your model — a derivative of, for example, this generalization error. What this is, is the first-order correction to generalization. So if I'm trying to predict how my behavior changes under a small distribution shift, this is the relevant quantity. And I'm going to show you two kinds of examples, called data and structural susceptibilities, that I hope will have a much more intuitive explanation once you see them. Yeah?
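Schematically, and only at the level of the generic Taylor expansion being described here (the precise definitions in the accompanying papers may differ in detail), the perturbed distribution and the susceptibility look something like:

```latex
q_h \;=\; q + h\,(q' - q),
\qquad
\mathbb{E}_{h}[\phi] \;\approx\; \mathbb{E}_{0}[\phi]
\;+\; h \,\underbrace{\frac{\partial}{\partial h}\,\mathbb{E}_{h}[\phi]\Big|_{h=0}}_{\text{susceptibility}}
```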
Excellent question. The data susceptibility is an influence function — it's a generalization of the influence function, what we call the Bayesian influence function. Here's how you obtain this. I'll put it this way: with an influence function, you're interested in how my observable at a specific choice of weights changes if I change the data somehow. With the Bayesian influence function, you ask how my local expectation of an observable changes if I change the data somehow. So let's just take a look at what this object looks like. Here's how it's defined: the perturbation we're making is to upweight a sample i, and we're going to look at how the behavior responds on sample j. And this is the generalization of an influence function. This is the picture I showed you at the very start.

What we do is we do this for n by n samples, so we have influences between all pairs of points for some small subset of the training distribution. That gives us a big matrix, where we can read the entries as telling you something about how similar these samples are, or how close together they are in some kernel space. You can apply dimensionality reduction to this object, and you get this visualization of the relationships between different points. So here we have aquatic mammals; snakes and birds show up over here. You can actually quantitatively compare this against your ground-truth knowledge of the hierarchical structure in the ImageNet data set and see that this recovers it: from this correlation between losses, from geometry, we can actually detect hierarchical structure in the data and how the model sees this data.
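As a rough sketch of that pipeline — under my own simplifying assumption that the Bayesian influence between samples can be approximated by the covariance of per-sample losses across draws from a local posterior (for example, SGLD samples near the trained weights); the papers may define it differently — you end up with an n-by-n matrix that you can then embed in 2D:

```python
import numpy as np
from sklearn.decomposition import PCA

def influence_matrix(per_sample_losses):
    """per_sample_losses: shape (n_posterior_draws, n_data_points); row k holds the
    loss of posterior draw w_k on each data point. Returns an
    (n_data_points, n_data_points) covariance-style influence matrix."""
    centered = per_sample_losses - per_sample_losses.mean(axis=0, keepdims=True)
    return centered.T @ centered / (per_sample_losses.shape[0] - 1)

# Toy stand-in for real posterior draws: random per-sample losses with a planted
# block structure so that related samples co-vary (purely synthetic).
rng = np.random.default_rng(0)
group = np.repeat(np.arange(4), 25)              # 100 data points in 4 "classes"
shared = rng.normal(size=(200, 4))[:, group]     # shared fluctuation per class
losses = shared + 0.5 * rng.normal(size=(200, 100))

B = influence_matrix(losses)                     # 100 x 100 matrix over sample pairs
embedding = PCA(n_components=2).fit_transform(B) # 2D map; UMAP could be used instead
print(embedding.shape)                           # (100, 2): related samples cluster together
```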

So in the next two weeks we'll have a series of papers coming out that introduce this, first in the context of influence functions, and we show that you can actually use this to achieve what we think is state-of-the-art performance on retraining experiments, predicting what happens when you retrain models on a modified data set. You can use this for interpretability — the picture I just showed — to understand structure in data from the perspective of a model, and you can use this in a developmental context to try to interpret the stages that a model is going through. So somehow the influence function is the natural thing to measure to try to figure out which samples are responsible for a given developmental transition. That's the data susceptibilities: influence functions.

What about structural susceptibilities? Here the definition is a little different, and I'm not going to explain this fully. We're going to upweight a sample i again, as before, but the observable is now the average loss increase, restricted to a specific component. Before, I applied perturbations to all of my weights at once; here we end up with a restricted version of this. And so instead of a data-set-size by data-set-size matrix, what we get out is a data-set-size by number-of-components matrix. This tells you something about a coupling between a specific data point and components in the model — attention heads, for example. Once you obtain this object, what you can do is apply dimensionality reduction techniques like SVD, and this gives you a way to decompose the structure in this matrix into left singular vectors and right singular vectors.
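A minimal sketch of that decomposition step, assuming you have already computed such a (data points × components) susceptibility matrix — here just a random stand-in: the right singular vectors give patterns over components and the left singular vectors give the corresponding patterns over data.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_components = 500, 12                  # e.g., 12 attention heads
S = rng.normal(size=(n_samples, n_components))     # stand-in susceptibility matrix

# SVD: S = U @ diag(sigma) @ Vt
U, sigma, Vt = np.linalg.svd(S, full_matrices=False)

k = 1                                  # the "second" singular direction (0-indexed)
component_pattern = Vt[k]              # loading over components (e.g., which heads)
data_pattern = U[:, k]                 # corresponding loading over data points

# Data points loading most strongly on this direction: inspect these tokens/contexts
# to interpret what the associated components are doing.
top_samples = np.argsort(-np.abs(data_pattern))[:10]
print(component_pattern.round(2))
print(top_samples)
```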

It turns out — I won't go into much detail here — that you can read off circuits in the model from these singular vectors. If I just plot the first right singular vectors (I'm going to jump over here), what we see is that the first one is a sort of uniform thing that's not particularly interesting. But the second vector distinguishes all the heads that are in the induction circuit from heads that are not in the induction circuit. The next vector distinguishes heads within the induction circuit from one another, and heads that are involved in the separate multigram circuit from heads that are not in the multigram circuit. And you can look at subsequently lower PCs and find additional structure that you can actually interpret.

So that's the whole point of doing this coupling. You interpret this by looking at the loadings on the left-hand side. And so here I've got a bunch of different tokens in a bunch of different contexts, and I can actually automatically classify a token as being, for example, part of an induction pattern; or this is a right delimiter, like a closing brace; or this one is a formatting token of some kind. You see these patterns over here. And once I've done that procedure, I know that the second principal component links some pattern over data with a pattern over components, and I can just read off that the main pattern on the left-hand side is induction patterns. That gives me a way to automatically attribute these components, to say which data is responsible for this distinction.

There's a lot going on here. So the thing I want to leave you with is that there's a sort of SLT-native way to associate structure inside of the model to behaviors on data sets, by decomposing these kinds of objects, by looking at the structure within sets of measurements of susceptibilities — an SLT-native way to do weight ablations and circuit discovery. Similarly to before, you can apply these dimensionality reduction techniques like UMAP to this big matrix, and you'll get a different object, something we call the rainbow serpent, where you can see that there are a bunch of streaks and hairs splitting off of the body, which we believe are associated to specific kinds of structures that the model has learned. You can try to cluster in this space. And we expect that this will give you a set of techniques for automatically discovering a lot of structure inside of these models — structure that we know, through the theory, is directly related to generalization.

Okay, we started with in-distribution generalization, where the existing theory of SLT gave us an answer to this already two decades ago. And we saw that if we take derivatives of these objects, we get an answer to a local, first-order correction to generalization: how does my model's behavior change if I change the distribution a little bit? So how do you tackle out-of-distribution generalization in full generality? Well, maybe we just need to compute higher-order terms. That's sort of the physicist's answer: at some point you stop, it's enough, you've understood your system, and typically you don't need to go far beyond second order. Maybe we're lucky like that here.

Okay, let me ground this out. I want to sketch what the most ambitious form of using SLT, using this connection between geometry and interpretability, could mean for interpretability. In programming languages theory, there is a correspondence called the Curry-Howard correspondence, which gives us an equivalence between a way of talking about proofs, logical proofs, and a way of talking about programs, computational structure. What you find is that proofs correspond to programs. I'm just going to hide this part for now. Propositions correspond to types, construction rules to deduction rules, et cetera, et cetera. The idea here is that in the most ambitious case, maybe there's an analogous kind of correspondence available between the language of programs, of algorithms, on the left-hand side and what's happening inside of neural networks.

Programs correspond to critical points, singularities in the loss landscape. The substructure of those programs, what they're built out of, maybe corresponds to the geometric substructure of these singularities. There are different ways you can talk about how a singularity, a critical point, a minimum is built up; maybe that corresponds to the set of rules that build up a proof over time — you're writing an explanation for this proof line by line, or writing a piece of software line by line — and the analogous thing is the phase transitions that construct a program over the course of learning. You have natural ways of talking about complexity on both sides, and so on. And in the deepest sense, if this is true, then this gives us a way to potentially extract almost all of the information we care about regarding algorithmic structure from geometry.

I'm going to mostly jump over this, but the kinds of things this could be relevant for — well, a lot of problems in safety seem to bottom out in problems of understanding generalization. The sharp left turn is a typical example, where the concern is that maybe capabilities, certain kinds of abilities in the model, generalize differently than the alignment properties of that model. And we see building this understanding of out-of-distribution generalization through SLT as giving us direct ways to talk about how different capabilities, values, and structures inside of the model generalize — a way to actually theoretically ground out this question and anticipate it before it arises. Interpretability is the study of generalization, and through this link that becomes the study of geometry.

So I'm not going to have as much time to go through all of the details here, but I will sketch where this is headed. We have this other problem of learning; it's not just the problem of generalization. The problem of learning is the fact that I can have two models that achieve the same loss on my training data set but do so in completely different ways. Alignment, I said, is in turn the study of data engineering: trying to choose the right data to control this learning process. Just as we had a theoretical hypothesis to motivate using SLT in the context of interpretability, we have a different hypothesis that motivates using SLT in the context of understanding learning. We call this hypothesis the S4 correspondence — S4 for four different kinds of structure — and it addresses this question: how does training data shape the algorithms that models learn?

Training data together with the architecture determines the geometric structure of a loss landscape. That loss landscape together with your optimizer determines how SGD moves around, determines how the Bayesian learning process moves around. And it's that learning process that ultimately picks out the weights and those algorithms that are encoded in that model. S4 for four different kinds of structure: structure in data, structure in geometry, structure in development, and structure algorithmically in the final model. And just as in the geometry case, in the interpretability case, maybe we can invert the mapping from algorithms to geometry to get new tools for interpretability. Maybe we can actually partially invert this mapping from data to algorithmic structure, to come up with new ways to choose data to align models.

So far we've focused mostly on just understanding the learning process, before we actually step in to try to steer the learning process. What you see here is what's called a fate map of a zebrafish embryo. Each point represents a cell at a different point in the course of development. You see that it bifurcates and splits into these different lineages, where these bifurcation points correspond to the cell going through a phase transition as it differentiates into a more specialized kind of cell. The idea for developmental interpretability is that maybe, if you want to understand all of the complexity here, it reduces the complexity a lot if you just focus on these transitions. There's a logarithmically smaller set of transitions than there are final end states. In some sense, if you understand the transitions, you understand the full thing. Maybe the same is true for neural networks.

Now, we've shown this is the case in small toy models, we've established this for small language models, and recently we've been looking at larger and larger models to show that this sort of correspondence continues to be true there. So here you see the dynamic evolution of this ImageNet cloud that we saw before, where the dog head sort of forms here and then gets pushed out of the bulk up into here. We see that initially living things and artifacts separate. Over here we see these sort of cleavage furrows that are very typical of development in the biological case. These are very close parallels, but I'm not going to say much more about this. Similarly, for these other objects that we've been studying, you study how these things change over time, and it helps you reduce the problem of understanding what's going on.

Now, recently we've started to look at actually using this understanding to intervene in learning. Here you see an example — a simple experiment that George ran — where you can make a modification to the training data to try to remove this feature. This is analogous to the kinds of developmental ablations that biologists do all the time, where they want to establish the effect that a gene has, so they knock out the gene and see what consequence that has on development. In the long term, maybe we can flesh out the correspondence between these different levels of structure to such a degree that we can control exactly what the model ends up with on the other side. Statisticians have many different ways of talking about structure in data: sufficient statistics, epsilon machines, correlation functions.

Similarly, geometers have many ways of talking about geometric structure: jet schemes, resolutions of singularities, matrix algebras, things I don't understand. Developmental biologists and people who study dynamical systems theory likewise have many different ways of talking about structure in dynamical systems, in terms of stages, bifurcations, catastrophes, phase transitions, relaxation and mixing times. And of course, theoretical computer scientists have many ways to talk about algorithmic structure: features, circuits, organs, control flow, iteration, error correction, memory management, all these kinds of primitives. In the most ambitious case, maybe we can develop actual dualities between structure in these different layers — a sufficient understanding of the relations between these different things that you can understand any one of them through the others. And that will give us a way to turn this into actual tools for steering the model in a more reliable way.

I'm not going to say anything about RL. Likewise, I'll jump over this, but these are the kinds of questions that I would hope to answer with a better understanding of learning and new tools for steering learning. I'm going to close with a recap. We started with the observation that, in practice, alignment and capabilities are really about this mapping from data to internal structure to behavior. There are two problems here, learning and generalization, and two solutions — or maybe practices — that we typically reach for when we're trying to solve them. Throughout this talk we explored two motivating hypotheses, sort of a theoretical basis — the S4 correspondence and structural Bayesianism — that take inspiration from different areas of physics, and used these to apply singular learning theory to the respective domains. Where we're headed is pushing on both of these axes at full throttle, to try to advance our understanding of these systems to such a degree that maybe at some point we can design data lasers for neural networks. Thank you. That's my talk.