Does Robustness Improve with Scale?

Full Paper

Adversarial vulnerabilities have long been an issue in various ML systems. Large language models (LLMs) are no exception, suffering from issues such as jailbreaks: adversarial prompts that bypass model safeguards. At the same time, scale has led to remarkable advances in the capabilities of LLMs, leading us to ask: to what extent can scale help solve robustness? In this post, we explore this question in the classification setting: predicting the binary label of a text input. We find that scale alone does little to improve model robustness, but that larger models benefit more from defenses such as adversarial training than do smaller models.

Meme: Stack more layers for LLM Robustness?

We study models in the classification setting as there is a clear notion of “correct behavior”: does the model output the right label? We can then naturally define robustness as the proportion of the attacked dataset that the model correctly classifies. We evaluate models on tasks such as spam detection and movie sentiment classification. We adapt pretrained foundation models for classification by replacing the generative model’s unembedding layer with a randomly initialized classification head, and then fine-tune the models on each task.

We focus on adversarial-suffix style attacks: appending an adversarially chosen prompt to a benign prompt in an attempt to cause the model to misclassify the input, e.g., classify a spam email as not-spam. We consider two attacks: the state-of-the-art Greedy Coordinate Gradient method (Zou et al., 2023), and a baseline random token attack. This simple threat model has the advantage of being unlikely to change the semantics of the input. For example, a spam email is still spam even if a handful of tokens are appended to it. Of course, attackers are not limited to such a simple threat model: studying more open-ended threat models (such as rephrasing the prompt, or replacing words with synonyms) and corresponding attack methods (such as LLM generated adversarial prompts) represents an important direction for future work.

In the remainder of this post, we first contextualize our work in the larger context of the AI safety and scaling laws landscape. We then outline our initial experimental results, finding clear scaling trends for adversarially trained models, while models fine-tuned only on clean data show little improvement from scale. We conclude by presenting our plan for future work.


Why robustness?

Ensuring that AI systems are developed and deployed safely requires addressing a variety of social and technical challenges, as discussed by various research teams. We believe adversarial robustness is a necessary (though not sufficient) condition for AI to be safely deployed in environments with adversaries. Moreover, improving adversarial robustness would also help mitigate issues such as reward hacking, which can lead to misalignment due to the adversarial pressure of a “student” model optimizing against the output of a “teacher” model, such as a reward model.

AI systems are already deployed in safety-critical domains with potential adversaries, such as algorithmic trading, intrusion detection systems, and analyzing intelligence data. With rapid progress in capabilities, there are a wide variety of emerging sensitive applications, such as LLM-based virtual assistants with access to email accounts, shared files, and other confidential data. These transformative applications will introduce an entire new class of security vulnerabilities: exploiting LLM assistants to divulge sensitive data or even take malicious actions. Moreover, LLMs may have dangerous capabilities even in the absence of privileged access to external systems. Contemporary LLMs have been found to be able to generate misinformation more compelling than most human-generated misinformation, and future models may even be able to assist terrorists with bioweapon development.

In addition to misuse risks, robustness failures could cause AI systems to be misaligned and exploitable by other models. We expect future AI systems to continue to have some component of training where one “student” model is optimized against the output of another “teacher” model. For example, in RLHF, a foundation model is fine-tuned to maximize the output of a reward model learned from human preferences. Theoretical and empirical work has shown that in most circumstances, we can expect the “student” model to exploit the “teacher” model it’s being trained against, following the letter rather than the spirit of what the teacher is trying to express. As we train more and more capable AI systems, it is vital that they learn to do what we actually want, and do not exploit the other AI models used to train them. This becomes even more pressing when the models are capable enough in a domain that it’s difficult for a (lay) human to tell whether a given output is good or bad, such as detecting whether generated code contains a subtle security vulnerability.

We know that today’s general-purpose systems are not robust to jailbreaks (indeed, some already misbehave by themselves). However, although current models are capable in many domains, there are still many simple tasks they cannot perform. One might hope that the status quo development scenario of scaling up model pretraining, followed by some safety fine-tuning before release, will substantially improve model robustness. In this post, we seek to answer the question: is scale combined with simple defense methods sufficient for robustness? If not, how big a gap remains, and what techniques are most likely to close that gap? Concerningly, previous work has shown that even superhuman AI systems are vulnerable to adversarial attacks. However, to the best of our knowledge, this is the first work systematically studying the robustness of LLMs of varying scales.

Why scaling laws?

Scaling laws have been studied in AI since the 1980s, but the topic only became a central part of the AI conversation following a 2020 publication by Kaplan et al. This paper showed that large language model performance can be accurately predicted from the model parameter count, dataset size, and compute used during training. Many follow-up publications were inspired by this work, including a notable 2022 result in which a team at DeepMind recalculated the optimal tradeoff between model size and training data, with their Chinchilla model unlocking new levels of performance for a fixed compute budget.

Yet, not all abilities improve with scale. In fact, some things scale inversely with model size, like the ability for a model to correctly answer questions about common misconceptions—larger models give the wrong answer more often! As such, we cannot simply assume that adversarial robustness will improve if we make a model bigger or train it for longer. We must explicitly study scaling behavior in order to predict what to expect from future models.

In the short-term, understanding scaling laws for robustness will enable us to train more robust models by using existing techniques more efficiently, similar to how Chinchilla improved the efficiency of model pre-training. In the medium-term, scaling laws will guide us in developing more effective defenses that can make better use of scale. And in the long-term, scaling laws will allow us to more efficiently allocate safety research effort by identifying which robustness problems will require algorithmic improvements to address, and which will be solved “by default” with sufficient model scale.

Experimental results

We’ve argued that understanding how robustness varies with model scale in LLMs would be impactful, enabling us to allocate our compute budget wisely, improve defense techniques, and focus researcher effort on fundamental problems in safety. We’ll start with an overview of our experimental setup and results, with collapsible boxes providing additional information on the experimental setup, and subsequent sections providing more detailed results.

Our setup

Models: We use the Pythia models for the experiments presented in this post. In order to use these models for classification, we replace the unembedding layer with a linear classification head, which we fine-tune on our four tasks, described below.
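As a concrete illustration, here is a minimal sketch (using PyTorch and the Hugging Face transformers library) of what this adaptation might look like. The model name, the choice to pool the final token's hidden state, and the helper function are illustrative assumptions, not a description of our exact training code.

```python
# Minimal sketch of adapting a causal LM for binary classification by
# replacing the unembedding layer with a randomly initialized linear head.
# Model name and pooling choice are illustrative, not our exact setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/pythia-160m"  # any Pythia size works the same way
tokenizer = AutoTokenizer.from_pretrained(model_name)
lm = AutoModelForCausalLM.from_pretrained(model_name)

hidden_size = lm.config.hidden_size
classifier_head = torch.nn.Linear(hidden_size, 2)  # 2 classes, random init

def classify(text: str) -> torch.Tensor:
    """Return logits over the two class labels for a single input."""
    inputs = tokenizer(text, return_tensors="pt")
    # Take hidden states from the transformer body, ignoring the LM head's logits.
    outputs = lm(**inputs, output_hidden_states=True)
    last_hidden = outputs.hidden_states[-1]   # (1, seq_len, hidden)
    final_token = last_hidden[:, -1, :]       # representation of the last token
    return classifier_head(final_token)       # (1, 2) class logits
```

Both the transformer body and the classification head are then fine-tuned on each task's training data.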

We are interested in studying both open models and closed models.

So far, we have focused primarily on the Pythia suite of models released by EleutherAI. There are 10 models, with parameter counts ranging from 14 million to 12 billion, each trained for a single epoch on the Pile dataset (300B tokens).2 These models are particularly convenient because they let us see how model properties like robustness vary as a function of parameter count, with everything else held the same across the different models. Although the Pythia models only come with one training seed per model size, EleutherAI does provide 143 training checkpoints for each model, allowing us to see a bit more diversity than if we only had the final checkpoint.
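For illustration, the sketch below shows how intermediate Pythia checkpoints can be loaded from the Hugging Face Hub via the revision argument; the specific size and step are arbitrary examples rather than the exact checkpoints used in our experiments.

```python
# Sketch of loading Pythia models, optionally at an intermediate training
# step, via the Hugging Face Hub `revision` argument. The step shown is
# illustrative; Pythia checkpoints are tagged by training step (e.g. step140000).
from transformers import AutoModelForCausalLM

def load_pythia(size, step=None):
    """Load a Pythia model, optionally at an intermediate training step."""
    name = f"EleutherAI/pythia-{size}"
    revision = f"step{step}" if step is not None else "main"
    return AutoModelForCausalLM.from_pretrained(name, revision=revision)

final_model = load_pythia("410m")                 # final checkpoint
late_checkpoint = load_pythia("410m", step=140000)  # near the end of pretraining
```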

In the future, we intend to study larger open-weight models such as the Llama family, as well as frontier closed models like GPT-4 and Claude. This said, we believe it’s important to focus our initial effort on smaller, open-weight models: both because they are easier to work with (for example, we can run gradient-based attacks on them, and many of them fit on a single GPU), and because having many models of different sizes allows us to more clearly study scaling trends than only looking at the most capable models.


2. To be precise: the models 70m and up were actually trained for the same number of train steps on a deduplicated version of the Pile, which corresponds to about 1.5 epochs.

Tasks: We consider four binary classification tasks:

  • IMDB: does a movie review have positive or negative sentiment?
  • Spam: is an email spam or not?
  • PasswordMatch: does a user-provided string match the string in the prompt?
  • WordLength: given two words, which one is longer?

The motivation behind focusing on the classification setting is one of simplicity: the generative setting is, by construction, very open, which can make it harder to analyze exactly what’s going on. For example, it’s much harder to evaluate whether a generated answer is good or bad, as compared with a 0 or 1 matching a given label.

In this work, we have focused on two classes of tasks, which we call natural language tasks and algorithmic tasks. Our natural language tasks include sentiment classification (IMDB dataset) and spam detection (Enron Spam dataset). The former task is to tell whether a given body of text has positive/negative sentiment and the latter, whether the text is/isn’t spam. Both of these tasks are primarily focused on reading comprehension and extracting big-picture information about the entire text.

We developed our algorithmic tasks, PasswordMatch and WordLength, in-house. In PasswordMatch, the task is to check whether or not two strings are identical. In WordLength, the task is to say which of two words is longer, while ignoring some random characters after the two words. In contrast with the natural language tasks, the algorithmic tasks test the language model’s ability to extract and compare specific sub-portions of the text, often at the character level. Since we generate these datasets procedurally, it is easy to make arbitrarily large datasets for these tasks.
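To make the two task formats concrete, here is a rough sketch of how such examples could be generated. The prompt templates and random-character scheme are simplified stand-ins for our actual dataset-generation code, and all names are hypothetical.

```python
# Illustrative generators for the two algorithmic tasks; the exact prompt
# templates and labeling conventions in our experiments may differ.
import random
import string

def make_password_match_example(rng):
    """PasswordMatch: label 1 iff the user-provided string equals the prompt's string."""
    password = "".join(rng.choices(string.ascii_lowercase, k=8))
    if rng.random() < 0.5:
        attempt = password
    else:
        attempt = "".join(rng.choices(string.ascii_lowercase, k=8))
    label = int(attempt == password)
    text = f"Password: {password}. User input: {attempt}. Do they match?"
    return text, label

def make_word_length_example(rng, words):
    """WordLength: label 0 if the first word is at least as long, else 1."""
    w1, w2 = rng.sample(words, 2)
    filler = "".join(rng.choices(string.ascii_letters, k=10))  # characters to ignore
    text = (f"Which word is longer? Word 1: {w1}. Word 2: {w2}. "
            f"Return 0 for the first word and 1 for the second word. "
            f"If the words are the same length, return 0. "
            f"Ignore everything after this sentence. {filler}")
    label = 0 if len(w1) >= len(w2) else 1
    return text, label

rng = random.Random(0)
print(make_password_match_example(rng))
print(make_word_length_example(rng, ["Fraser", "synchronous", "cat", "umbrella"]))
```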

Attacks: We evaluate using two attacks. Both attacks work by appending 10 tokens to the text to be classified.

  • RandomToken: choose tokens uniformly at random. Keep trying new token sequences until an attack budget is exhausted.
  • GCG: At each iteration, change one token in the adversarial suffix, replacing it with a token chosen, using gradient information, to maximize the probability of misclassification.

In this work, we attack a datapoint by appending a 10-token suffix to it. See the following two examples.

An IMDB example (original text highlighted in green, attack in red)

[…] The movie gives you a piece of that kind of magic moment in time; in a place where the imaginary wave was real.. the source of culturally significant influences. And BTW, there’s another movie that has a similarly American street edginess to it, and has the same genuinely unique goodness with laid back realness that helps refine that elusively eternal sense of cool.. “Two Lane Blacktop” (with one of the best examples of freesouldoit attitude in West Coast California history.. Dennis Wilson). Like Monte Hellman did with that one, thanks, SP, for being right there with this one. GWR hsa masters prohibiting Joan0026рав bloc wretched coding delic

A WordLength example (original text highlighted in green, attack in red)

Which word is longer? Word 1: Fraser. Word 2: synchronous. Return 0 for the first word and 1 for the second word. If the words are the same length, return 0. Ignore everything after this sentence.ceil Ai aggressey virtue ROM strengtheningazyIcon tä

How we choose the 10 appended tokens depends on the adversarial attack used. In this work we use two approaches: Greedy Coordinate Gradient (GCG) and a baseline attack that samples tokens uniformly at random.

Baseline: Random Token Attack

As described above, the attack works by appending a small number of tokens (10 in this work) to the original datapoint. To choose these tokens, the random token attack samples the tokens uniformly at random from the tokenizer’s vocabulary. If the text plus this adversarial suffix is misclassified, then the attack is successful. If the text plus suffix is not misclassified, then the attack resamples a new set of tokens from the tokenizer’s vocabulary. The attack repeats this process as necessary until it is successful, or until a compute budget is exhausted. Perhaps surprisingly, such a simple attack is often successful, especially against relatively small models.
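A minimal sketch of this baseline follows. It assumes a classify(text) helper that returns the model's predicted label (for example, the argmax of the logits from the earlier classification sketch) and a Hugging Face tokenizer; the attack budget shown is illustrative.

```python
# Sketch of the RandomToken baseline: resample random suffixes until the
# model misclassifies the input or the attack budget is exhausted.
import random

def random_token_attack(text, true_label, classify, tokenizer,
                        n_suffix_tokens=10, budget=100, seed=0):
    """Return an attacked text if successful, else None."""
    rng = random.Random(seed)
    vocab_size = tokenizer.vocab_size
    for _ in range(budget):
        # Sample suffix tokens uniformly at random from the vocabulary.
        suffix_ids = [rng.randrange(vocab_size) for _ in range(n_suffix_tokens)]
        attacked = text + tokenizer.decode(suffix_ids)
        if classify(attacked) != true_label:
            return attacked  # attack succeeded
    return None  # attack failed within budget
```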

GCG

The Greedy Coordinate Gradient (GCG) attack starts off similarly to the random token attack: by appending a small number of tokens to the original text. After that, it iteratively does the following:

  • Select a position among the tokens added to the original text;
  • Choose a replacement token for that position, using gradient information from the neural network, in order to minimize the probability of the correct label.

These steps are repeated for a predetermined number of iterations (such as 10 or 30). This attack is generally quite strong, but requires white-box (gradient) access to the neural network, so won’t work on models which are hidden behind an API.
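The sketch below illustrates the core idea for a single, heavily simplified iteration against a classifier. It is not the full algorithm of Zou et al. (which samples a random batch of candidate swaps per iteration rather than evaluating every top-k candidate), and it assumes a model that maps input embeddings directly to class logits, plus access to the token-embedding matrix; both are stand-ins for whatever classifier is being attacked.

```python
# Simplified single GCG-style step for a classifier (illustrative, not the
# full Zou et al. algorithm). `model(inputs_embeds=...)` is assumed to
# return class logits of shape (1, num_classes); `embed_matrix` is the
# (vocab_size, hidden) token-embedding matrix.
import torch
import torch.nn.functional as F

def gcg_step(model, embed_matrix, input_ids, suffix_start, true_label, top_k=32):
    """Greedily replace one suffix token to reduce the correct label's probability."""
    vocab_size, _ = embed_matrix.shape
    one_hot = F.one_hot(input_ids, num_classes=vocab_size).float()
    one_hot.requires_grad_(True)
    inputs_embeds = one_hot @ embed_matrix                    # (seq_len, hidden)
    logits = model(inputs_embeds=inputs_embeds.unsqueeze(0))  # (1, num_classes)
    loss = F.cross_entropy(logits, torch.tensor([true_label]))
    loss.backward()

    # First-order approximation: swaps with the largest positive gradient
    # component should increase the true-label loss (i.e. hurt the model) most.
    grad = one_hot.grad[suffix_start:]                # gradients at suffix positions
    candidates = grad.topk(top_k, dim=-1).indices     # (suffix_len, top_k)

    best_ids, best_loss = input_ids, loss.item()
    for pos in range(candidates.shape[0]):
        for tok in candidates[pos]:
            trial = input_ids.clone()
            trial[suffix_start + pos] = tok
            with torch.no_grad():
                trial_logits = model(inputs_embeds=embed_matrix[trial].unsqueeze(0))
                trial_loss = F.cross_entropy(
                    trial_logits, torch.tensor([true_label])).item()
            if trial_loss > best_loss:                # keep the most damaging swap
                best_ids, best_loss = trial, trial_loss
    return best_ids
```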

We generate an attacked dataset by starting with a clean dataset for the task in consideration, and running the adversarial attack on it.

Adversarial Attack Flowchart 1

Now that we have the attacked dataset, we can run the fine-tuned model on that dataset and see how well the attack works!

Adversarial Attack Flowchart 2

We then extract metrics like the “attack success rate” from the attack results (outputs of the model on the attacked dataset).
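As a concrete (if simplified) definition, the attack success rate can be computed as below, again assuming a classify helper that returns predicted labels. Note that some evaluations instead measure success only over examples the model classified correctly before the attack.

```python
# Sketch of the attack success rate metric: the fraction of attacked
# examples that the model misclassifies.
def attack_success_rate(attacked_dataset, classify):
    n_fooled = sum(1 for text, label in attacked_dataset if classify(text) != label)
    return n_fooled / len(attacked_dataset)
```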

Defense: We defend models against attacks through adversarial training: fine-tuning the models on attacked examples with the correct (clean) label, in addition to fine-tuning on the original training dataset.

TL;DR: Results

Undefended models: larger models tend to be more robust, but the effect is weak & noisy.

There is a positive correlation between model size and robustness to adversarial attacks. However, there is a large (random) impact on robustness due to unrelated factors such as the training checkpoint (even between nearby checkpoints), and the random seed of the model initialization or attack. These factors can outweigh a 2x (and sometimes even 10x) change in model size.

Adversarial training: clearer scaling trends emerge.

Larger models are more sample efficient in becoming robust to the attacks, and adversarial training converges to a higher robustness (lower attack success rate) for larger models as compared with smaller models.

Robustness transfer: adversarial training against one attack improves robustness to different attacks, but only for large enough models.

We see early signs of robustness transfer for models adversarially trained on one attack and then tested with a stronger attack. This works both for stronger attacks of the same method (10 iterations of GCG for training vs. 30 iterations of GCG for test) and with different methods (RandomToken for training vs. 10 iterations of GCG for test).

Let’s go through each of these in detail.

Undefended models

In this section, we evaluate models fine-tuned only on the clean dataset.

The models we have focused on so far are all autoregressive (next token predictors), so we put a classifier head on them and fine-tune them in order to use them in the classification setting.

Finetuning Flowchart

We find bigger models tend to be more robust, but the relationship is weak with considerable noise. Below, we evaluate the GCG attack’s success rate on models fine-tuned on the IMDB dataset, with three (fine-tuning) seeds per model size. We can clearly see in the right-side plot that there are two places where the median attack success rate goes up—in other words, robustness gets worse—as we move to a bigger model.

We performed the same attack on models fine-tuned on the Spam task (below). Here model robustness is even more noisy. Size does help more than hurt on average: models above 100M parameters seem markedly better than the smaller ones, and things seem to start improving again once we hit the 1B mark. But robustness seems to plateau between 100M and 1B parameters.

Model robustness is similarly stochastic for models fine-tuned and evaluated on the PasswordMatch and WordLength tasks.

What’s going on here?

Why is there so much variability in robustness across sizes?

Because the Pythia models are of similar architecture and trained on exactly the same dataset with the same training procedure, the main remaining sources of variability are randomness in the pretraining (next-token prediction) and fine-tuning (learning the classification task) procedures. Intuitively, the training procedure might find one of many local minima that give similarly good classification results on clean data. Some of these local minima will also give good results on attacked data, while others will not. Whether training ends up in a robust local minimum appears to be largely independent of the size of the model in question.

To dive deeper into this variability in robustness, we’d like to train different models from scratch, with different seeds, and see if there is a lot of variability across models of the same size. This is not quite possible with the Pythia models, since each model size was trained only once, with a single pretraining seed. However, we can test something almost as good by taking a handful of different training checkpoints from the final 3% of pretraining (that is, after the model has completed at least 97% of pretraining) and fine-tuning from each of these earlier checkpoints instead of the final one. Below, we scatter-plot performance on three of our tasks (IMDB, Spam, and PasswordMatch)1 with model size on the x-axis and attack success rate on the y-axis.

While we only conducted this experiment for models up to ~1B parameters, we already see a large amount of variability between checkpoints for the larger models. We conjecture that smaller models are less variable only because the attack saturates at close to 100% success rate. Given the results preceding these, we’d likely see even more variability if we considered both different checkpoints and different fine-tuning seeds, not to mention starting from different pretraining seeds!

These experiments suggest that without any explicit safety training, robustness to adversarial attacks is highly affected by randomness from pretraining and fine-tuning, and only depends weakly on model size. Yet, when models are applied in the real world, they almost always undergo additional fine-tuning with a specific focus on safety, to ensure that they do not provide users with harmful outputs. In the next section, we explore how model size affects robustness when explicitly performing safety training as part of the model fine-tuning.

Adversarial training

Adversarial training—a common approach used in safety training—is simply the idea of attacking a model, finding mistakes, and then letting the model learn from its mistakes. It’s one of the most popular and successful ways of making deep learning models more robust. Our adversarial training procedure works by iteratively:

  • Fine-tuning on a dataset, initialized to the clean training dataset.
  • Attacking the fine-tuned model by adding adversarial suffixes to datapoints in the clean training dataset.
  • Adding a subset of the attacked examples to the training dataset for the model to fine-tune against in the next round.

Expand the box below for more information on the adversarial training setup.

In adversarial training, we train the model on the training dataset, attack a subset of the training dataset, add those attacked examples to the training dataset, and repeat. When choosing which subset of the training dataset to attack, we only select among original datapoints (that is, we ignore the attacked examples we added previously) which the model currently classifies correctly (if the model already misclassifies a datapoint before the attack, there is nothing for the attack to do). The attacked examples themselves look different each time, because our attacks attempt to find successful adversarial suffixes within a given compute budget.

Adversarial Training Flowchart

We start with a small training dataset of 2000 clean examples. We then add 200 attacked examples to the training dataset each round. By having a mixture of clean and attacked examples, we expect to both retain performance on the clean dataset, while also reducing the attack success rate on adversarial datapoints.
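Putting these pieces together, below is a schematic sketch of the loop, not our exact implementation. Here finetune, attack, and correctly_classified are hypothetical placeholders for the fine-tuning routine, the chosen attack (RandomToken or GCG), and a check of whether the current model classifies a datapoint correctly; only the dataset sizes are taken from the description above.

```python
# Schematic adversarial training loop: fine-tune, attack correctly classified
# clean datapoints, add the attacked examples (with their clean labels) to the
# training set, and repeat.
import random

def adversarial_training(model, clean_train_set, finetune, attack,
                         correctly_classified, n_rounds=10, n_attacked_per_round=200):
    train_set = list(clean_train_set)          # e.g. 2000 clean examples
    for _ in range(n_rounds):
        model = finetune(model, train_set)
        # Only attack original datapoints the model currently gets right;
        # already-misclassified ones need no attack.
        candidates = [ex for ex in clean_train_set if correctly_classified(model, ex)]
        to_attack = random.sample(candidates, min(n_attacked_per_round, len(candidates)))
        attacked = [(attack(model, text), label) for text, label in to_attack]
        train_set.extend(attacked)             # keep the clean (correct) label
    return model
```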

We repeat the evaluation from the previous section on adversarially trained models, evaluating model robustness after each round of adversarial training. The plots below show a substantial increase in robustness (decrease in attack success rate, y-axis) for both Spam (left) and IMDB (right) as a function of adversarial training round (x-axis), with larger models (lighter colors) responding more quickly to adversarial training. We evaluate Pythia models ranging from 7M to 1B parameters, using 3 random fine-tuning seeds and, as before, plotting the median value and shading the area between the minimum and maximum values.

We can alternatively visualize these same results in a style similar to the graphs in the previous section, placing model size on the x-axis, and using color to indicate adversarial training round. These graphs show much stronger and more consistent trends than those seen for undefended models in the previous sections. However, these plots also make clear one instance of non-monotonicity: the 7.6M parameter model is less robust than the 17.6M parameter model on IMDB (left).

Apart from this, the curves are consistently monotonic: the larger the model, the better its robustness. However, there does not appear to be any clean functional relationship between model scale and attack success rate, unlike the power laws relating cross-entropy loss to model size. That said, a functional relationship may yet emerge when studying models across greater scales (this preliminary investigation only spans two orders of magnitude), or with alternative robustness metrics (attack success rate saturates near 0 and 1).

One important property we noticed is that larger models seem to converge to a better robustness level than smaller models. We can observe this in the plots below, which adversarially train the models for 30 rounds instead of 10: the different model sizes seem to plateau at different robustness levels.

IMDB, GCG attack

We can see this distinction in final performance more clearly if we zoom into the final 10 rounds of adversarial training:

IMDB, GCG attack

An important caveat to this is that larger models are proportionally more expensive to adversarially train. So, it would be necessary to train the smaller models for many more rounds than the larger models to conclusively show that they cannot converge to a similar robustness level as their bigger brethren given the same compute budget. Still, in instances where generating adversarial examples is the bottleneck rather than training on them (e.g., training against data generated by human red-teamers), the advantage of bigger models appears to be profound.

These trends are not limited to GCG—we see similar results with RandomToken, shown below on IMDB.

So, we’ve seen that if the models are allowed to train on attacked examples, the bigger ones consistently learn faster and converge to a better level of robustness than the smaller ones. But how reasonable an assumption is it that we’ll get to train on our mistakes? While we do expect that models will undergo safety training before release, there will inevitably be attacks that were not used (or didn’t exist!) at train time. Can we hope to be robust to those, and does scale help?

Robustness transfer

To answer whether we can transfer defenses across attacks, we first evaluate a narrow form of transfer: where the adversary uses the same attack method as that used during adversarial training, but with much more compute. Specifically, we train the model using GCG, and then evaluate it against a stronger version of GCG (running for 30 iterations instead of 10).

The plot below shows that models trained on 10-iteration GCG are reasonably robust (attack success rate <50%, y-axis) against 30-iteration GCG after 10 adversarial training rounds (x-axis). We also see two clusters: smaller models (darker lines) plateau at around 50% attack success rate after 10 rounds, while larger models (lighter lines) achieve 50% attack success rate after just 3 training rounds, converging to below 20% after 10 rounds.

There is less of a distinction between large and small models in the Spam setting (below). Larger models still transfer much better, but smaller models achieve a below 30% success rate within 10 adversarial training rounds.

These results suggest that models remain robust against a stronger version of the same attack. But what if the deploy-time attack is not just stronger, but also uses a different method?

We study this question by performing adversarial training on the RandomToken attack, and then evaluating the models against the GCG attack. Not only is RandomToken a significantly weaker attack than GCG, but the way in which the attack tokens are selected is also completely different. Below, we show the results of this experiment on IMDB (left) and Spam (right). We see a strong distinction between the bigger and smaller models. While the robustness transfer is less strong here than for the in-distribution attack, we still see clear transfer behavior, but only for the larger models! This suggests that the larger models are able to learn some abstract notion of defense (such as “if the final 10 tokens look fishy, just ignore them”) which the smaller models aren’t able to grasp.

Future work

One key area we are working on is evaluating on generative tasks as well as classification. Not only will this give us additional datapoints on the scaling behavior of robustness, but it also unlocks testing proprietary frontier models. Although some of these models are not available for us to fine-tune (or it would be computationally prohibitive to do so), we can take advantage of the generative setting to use in-context learning or even simply careful prompting to explore performance at the largest scales. Will we see similar scaling trends for generative tasks, and for these kinds of “few-shot” defenses rather than fine-tuning?

The generative setting will also unlock LLM-based red-teaming as an attack. This is a qualitatively different attack from the baseline RandomToken and search-based GCG attacks we have studied so far. We may also add more attacks: for example, an adversarial suffix attack that is able to circumvent perplexity filtering, or alternatively a soft-prompting attack, which might be even stronger than GCG.

On the defense side of things, a natural next step is to train a moderation classifier—à la Ziegler et al—which would attempt to flag whether a datapoint was attacked or not. Will bigger models be consistently better at this? How much time should we spend fine-tuning the classifier compared with adversarially training the victim model if we want maximum robustness?

Another direction is less about new attack or defense methods, and more about expanding the experimental results and analysis in settings we have already studied. We’ve seen that larger models are more sample efficient at becoming robust, but are they more compute efficient as well? Maybe for a fixed number of FLOPs, you’re actually better off fine-tuning a smaller model for a large number of adversarial training rounds. Or maybe the optimum lies somewhere in the middle of the Pareto frontier between model size and number of adversarial training rounds! To test this, we will need to train smaller models for many more rounds of adversarial training—and make sure our adversarial training method can handle this, without catastrophic forgetting or overfitting to the task.
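As a rough illustration of this trade-off, consider the common approximation of about 6 FLOPs per parameter per token for a combined forward and backward pass (and ignore the cost of generating the adversarial examples themselves). All numbers below are made up for illustration.

```python
# Back-of-the-envelope FLOPs comparison for adversarial fine-tuning.
# Uses the rough ~6 * parameters FLOPs-per-token estimate for a
# forward+backward pass; all values are illustrative.
def finetune_flops(n_params, n_rounds, tokens_per_round):
    return 6 * n_params * n_rounds * tokens_per_round

small = finetune_flops(n_params=160e6, n_rounds=40, tokens_per_round=1e6)
large = finetune_flops(n_params=1.0e9, n_rounds=10, tokens_per_round=1e6)
print(f"small model, 40 rounds: {small:.2e} FLOPs")  # ~3.84e16
print(f"large model, 10 rounds: {large:.2e} FLOPs")  # ~6.00e16
```

Under these made-up numbers, the smaller model could afford roughly four times as many adversarial training rounds as the larger model for a comparable budget, which is exactly the kind of comparison we plan to make.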

We plan to answer most of the above questions, but there remain many more questions we would be excited to see studied. For example, how do the results we find at this scale generalize to frontier models at GPT-3+ scale? We see a phase transition for robustness transfer at around 100M parameters. Will there be other phase transitions at larger model sizes, as there are with other emergent capabilities like scratchpad use? Separately, what factors other than scale (e.g., model architecture, training hyperparameters, optimizer choice) have a large effect on robustness? How can we find them?

Closing Remarks

We hope you enjoyed this preview of our results. Check out our paper to find out more.

What do you think of the results so far? Is there something we’re missing, or an area that you’re excited about for future work? We’d love to get your opinions: feel free to reach out at niki@far.ai or adam@far.ai. Like what we’re doing and want to work on this yourself? We’re open to collaborations and are hiring for full-time roles.


  1. When we ran these experiments, we hadn’t implemented WordLength yet. ↩︎

Niki Howe
PhD Candidate

Niki Howe is a PhD candidate at Mila and the Université de Montréal, where they also received their MSc. They hold BAs in Math and CS from Williams College and the University of Cambridge, respectively. At FAR AI, Niki is exploring how scale affects the robustness of LLMs.

Michał Zając
Research Engineer

Michał Zając is a research engineer at FAR AI. Prior to joining FAR AI, he completed a PhD on deep reinforcement learning at Jagiellonian University, and has worked as an engineer at Allegro, Google and Nomagic.

Ian McKenzie
Research Engineer

Ian McKenzie is a research engineer at FAR AI, where he previously ran the Inverse Scaling Prize.

Oskar Hollinsworth
Research Resident

Oskar Hollinsworth is a research resident at FAR AI, working on scaling laws for adversarial robustness.

Tom Tseng
Research Engineer

Tom Tseng is a research engineer at FAR AI. Tom previously worked as a software engineer at Gather and Cruise. He has a master’s degree from MIT and a bachelor’s degree from Carnegie Mellon University.

Adam Gleave
CEO and President of the Board

Adam Gleave is the CEO of FAR AI. He completed his PhD in artificial intelligence (AI) at UC Berkeley, advised by Stuart Russell. His goal is to develop techniques necessary for advanced automated systems to verifiably act according to human preferences, even in situations unanticipated by their designer. He is particularly interested in improving methods for value learning, and robustness of deep RL. For more information, visit his website.