Does Robustness Improve with Scale?

July 23, 2024

Niki Howe

Summary

FOR IMMEDIATE RELEASE

FAR.AI Launches Inaugural Technical Innovations for AI Policy Conference, Connecting Over 150 Experts to Shape AI Governance

‍

WASHINGTON, D.C. — June 4, 2025 — FAR.AI successfully launched the inaugural Technical Innovations for AI Policy Conference, creating a vital bridge between cutting-edge AI research and actionable policy solutions. The two-day gathering (May 31–June 1) convened more than 150 technical experts, researchers, and policymakers to address the most pressing challenges at the intersection of AI technology and governance.

‍

Organized in collaboration with the Foundation for American Innovation (FAI), the Center for a New American Security (CNAS), and the RAND Corporation, the conference tackled urgent challenges including semiconductor export controls, hardware-enabled governance mechanisms, AI safety evaluations, data center security, energy infrastructure, and national defense applications.

"I hope that today this divide can end, that we can bury the hatchet and forge a new alliance between innovation and American values, between acceleration and altruism that will shape not just our nation's fate but potentially the fate of humanity," said Mark Beall, President of the AI Policy Network, addressing the critical need for collaboration between Silicon Valley and Washington.

‍

Keynote speakers included Congressman Bill Foster, Saif Khan (Institute for Progress), Helen Toner (CSET), Mark Beall (AI Policy Network), Brad Carson (Americans for Responsible Innovation), and Alex Bores (New York State Assembly). The diverse program featured over 20 speakers from leading institutions across government, academia, and industry.

‍

Key themes emerged around the urgency of action, with speakers highlighting a critical 1,000-day window to establish effective governance frameworks. Concrete proposals included Congressman Foster's legislation mandating chip location-verification to prevent smuggling, the RAISE Act requiring safety plans and third-party audits for frontier AI companies, and strategies to secure the 80-100 gigawatts of additional power capacity needed for AI infrastructure.

‍

FAR.AI will share recordings and materials from on-the-record sessions in the coming weeks. For more information and a complete speaker list, visit https://far.ai/events/event-list/technical-innovations-for-ai-policy-2025.

‍

About FAR.AI

Founded in 2022, FAR.AI is an AI safety research nonprofit that facilitates breakthrough research, fosters coordinated global responses, and advances understanding of AI risks and solutions.

‍

Access the Media Kit

Media Contact: tech-policy-conf@far.ai

Frontier LLMs like ChatGPT are powerful but not always robust. Scale helps with many things. We wanted to see if scaling up the model size can ‘solve’ robustness issues.

Table of contents

Example H2

This is a div block with a Webflow interaction that will be triggered when the heading is in the view.

We study models in the classification setting as there is a clear notion of “correct behavior”: does the model output the right label? We can then naturally define robustness as the proportion of the attacked dataset that the model correctly classifies. We evaluate models on tasks such as spam detection and movie sentiment classification. We adapt pretrained foundation models for classification by replacing the generative model’s unembedding layer with a randomly initialized classification head, and then fine-tune the models on each task.

We focus on adversarial-suffix style attacks: appending an adversarially chosen prompt to a benign prompt in an attempt to cause the model to misclassify the input, e.g., classify a spam email as not-spam. We consider two attacks: the state-of-the-art Greedy Coordinate Gradient method (Zou et al., 2023), and a baseline random token attack. This simple threat model has the advantage of being unlikely to change the semantics of the input. For example, a spam email is still spam even if a handful of tokens are appended to it. Of course, attackers are not limited to such a simple threat model: studying more open-ended threat models (such as rephrasing the prompt, or replacing words with synonyms) and corresponding attack methods (such as LLM generated adversarial prompts) represents an important direction for future work.

In the remainder of this post, we first contextualize our work in the larger context of the AI safety and scaling laws landscape. We then outline our initial experimental results, finding clear scaling trends for adversarially trained models, while models fine-tuned only on clean data show little improvement from scale. We conclude by presenting our plan for future work.

Why robustness?

Ensuring the development and deployment of AI systems is safe requires addressing a variety of social and technical challenges, as discussed by various research teams. We believe adversarial robustness is a necessary (though not sufficient) condition for AI to be safely deployed in environments with adversaries. Moreover, improving adversarial robustness would also help mitigate issues such as reward hacking that can lead to misalignment due to the adversarial pressure of a “student” model optimizing the output of a “teacher” model, such as a reward model.

AI systems are already deployed in safety-critical domains with potential adversaries, such as algorithmic trading, intrusion detection systems, and analyzing intelligence data. With rapid progress in capabilities, there are a wide variety of emerging sensitive applications, such as LLM-based virtual assistants with access to email accounts, shared files, and other confidential data. These transformative applications will introduce an entire new class of security vulnerabilities: exploiting LLM assistants to divulge sensitive data or even take malicious actions. Moreover, LLMs may have dangerous capabilities even in the absence of privileged access to external systems. Contemporary LLMs have been found to be able to generate misinformation more compelling than most human-generated misinformation, and future models may even be able to assist terrorists with bioweapon development.

In addition to misuse risks, robustness failures could cause AI systems to be misaligned and exploitable by other models. We expect future AI systems to continue to have some component of training where one “student” model is optimized against the output of another “teacher” model. For example, in RLHF, a foundation model is fine-tuned to maximize the output of a reward model learned from human preferences. Theoretical and empirical work have shown that in most circumstances, we can expect the “student” model to exploit the “teacher” model it’s being trained against, following the letter rather than the spirit of what the teacher is trying to express. As we train more and more capable AI systems, it is vital that they learn to do what we actually want, and do not exploit the other AI models used to train them. This becomes even more pressing when the models are capable enough in a domain that it’s difficult for a (lay) human to tell whether a given output is good or bad, such as detecting whether generated code contains a subtle security vulnerability.

We know that today’s general-purpose systems are not robust to jailbreaks (indeed, some already misbehave by themselves). However, although current models are capable in many domains, there are still many simple tasks they cannot perform. One might hope that the status quo development scenario of scaling up model pretraining, followed by some safety fine-tuning before release, will substantially improve model robustness. In this post, we seek to answer the question: is scale combined with simple defense methods sufficient for robustness? If not, how big a gap remains, and what techniques are most likely to close that gap? Concerningly, previous work has shown that even superhuman AI systems are vulnerable to adversarial attacks. However, to the best of our knowledge, this is the first work systematically studying the robustness of LLMs of varying scales.

Why scaling laws?

Scaling laws have been studied in AI since the 1980s, but the topic only became a central part of the AI conversation following a 2020 publication by Kaplan et al. This paper showed that large language model performance can be accurately predicted by the model parameter count, dataset size, and compute used during training. Many follow-up publications were inspired by this work, including a notable 2022 result when a team at DeepMind re-calculated the optimal tradeoff between model size and training time, with their Chinchilla model unlocking new levels of performance for a fixed compute budget.

Yet, not all abilities improve with scale. In fact, some things scale inversely with model size, like the ability for a model to correctly answer questions about common misconceptions—larger models give the wrong answer more often! As such, we cannot simply assume that adversarial robustness will improve if we make a model bigger or train it for longer. We must explicitly study scaling behavior in order to predict what to expect from future models.

In the short-term, understanding scaling laws for robustness will enable us to train more robust models by using existing techniques more efficiently, similar to how Chinchilla improved the efficiency of model pre-training. In the medium-term, scaling laws will guide us in developing more effective defenses that can make better use of scale. And in the long-term, scaling laws will allow us to more efficiently allocate safety research effort by identifying which robustness problems will require algorithmic improvements to address, and which will be solved “by default” with sufficient model scale.

Experimental results

We’ve argued that understanding how robustness varies with model scale in LLM would be impactful, enabling us to allocate our compute budget wisely, improve defense techniques, and focus researcher effort towards fundamental problems in safety. We’ll start with an overview of our experimental setup and results, with collapsible boxes providing additional information on experimental setup, and subsequent sections providing more detailed results.

Our setup

Models: We use the Pythia models for the experiments presented in this post. In order to use these models for classification, we replace the unembedding layer with a linear classification head, which we fine-tune on our four tasks, described below.

Model Families Details

Tasks: We consider four binary classification tasks

IMDB: does a movie review have positive or negative sentiment?
Spam: is an email spam or not?
PasswordMatch: does a user-provided string match the string in the prompt?
WordLength: given two words, which one is longer?

Tasks Details

<strong>Attacks: </strong>We evaluate using two attacks. Both attacks work by appending 10 tokens to the text to be classified.

RandomToken: choose tokens uniformly at random. Keep trying new token sequences until an attack budget is exhausted.
GCG: At each iteration, change one token in the attack, replacing it with the token which maximizes misclassification probability (chosen via gradient descent).

<strong>Defense: </strong>We defend models against attacks through adversarial training: fine-tuning the models on attacked examples with the correct (clean) label, in addition to fine-tuning on the original training dataset.

TL;DR: Results

Undefended models: larger models tend to be more robust, but the effect is weak & noisy.

There is a positive correlation between model size and robustness to adversarial attacks. However, there is a large (random) impact on robustness due to unrelated factors such as the training checkpoint (even between nearby checkpoints), and the random seed of the model initialization or attack. These factors can outweigh a 2x (and sometimes even 10x) change in model size.

Adversarial training: clearer scaling trends emerge.

Larger models are more sample efficient in becoming robust to the attacks, and adversarial training converges to a higher robustness (lower attack success rate) for larger models as compared with smaller models.

Robustness transfer: adversarial training against one attack improves robustness to different attacks, but only for large enough models.

We see early signs of robustness transfer for models adversarially trained on one attack and then tested with a stronger attack. This works both for stronger attacks of the same method (10 iterations of GCG for training vs. 30 iterations of GCG for test) and with different methods (RandomToken for training vs. 10 iterations of GCG for test).

Let’s go through each of these in detail.

Undefended models

In this section, we evaluate models fine-tuned on a cleaned dataset.

We find bigger models tend to be more robust, but the relationship is weak with considerable noise. Below, we evaluate the GCG attack’s success rate on models fine-tuned on the IMDB dataset, with three (fine-tuning) seeds per model size. We can clearly see in the right-side plot that there are two places where the median attack success rate goes up—in other words, robustness gets worse—as we move to a bigger model.

We performed the same attack on models fine-tuned on the Spam task (below). Here model robustness is even more noisy. Size does help more than hurt on average: models above 100M parameters seem markedly better than the smaller ones, and things seem to start improving again once we hit the 1B mark. But robustness seems to plateau between 100M and 1B parameters.

Model robustness is similarly stochastic for models fine-tuned and evaluated on the PasswordMatch and WordLength tasks.

What’s going on here?

Why is there so much variability in robustness across sizes?

Because the Pythia models are of similar architecture and trained on exactly the same dataset with the same training procedure, the main remaining sources of variability are randomness in the pretraining (next-token prediction) and the fine-tuning (learning the classification task) procedures. Intuitively, the training procedure might find one of many local minima which give similarly good classification results on clean data. Some of these local minima will also give good results on attacked data, while others will not. Whether one lands up in a robust local minima or not is largely independent of the size of the model in question.

To dive deeper into this variability in robustness, we’d like to train different models from scratch, with different seeds, and see if there is a lot of variability across models of the same size. This is not quite possible with the Pythia models, since each model size was trained only once with a single pretraining seed. However, we can test something almost as good by taking a handful of different training checkpoints, from the final 3% of pretraining (that is, after the model has completed at least 97% of pretraining), and then fine-tune the model from that earlier checkpoint instead of the final checkpoint. Below we scatterplot performance on three of our tasks (IMDB, Spam, and PasswordMatch){{1}} with model size on the x-axis and attack success rate on the y-axis.

While we only conducted this experiment for models up to ~1B parameters, we already see a large amount of variability between checkpoints for the larger models. We conjecture that smaller models are less variable only because the attack saturates at close to 100% success rate. Given the results preceding these, we’d likely see even more variability if we considered both different checkpoints and different fine-tuning seeds, not to mention starting from different pretraining seeds!

These experiments suggest that without any explicit safety training, robustness to adversarial attacks is highly affected by randomness from pretraining and fine-tuning, and only depends weakly on model size. Yet, when models are applied in the real world, they almost always undergo additional fine-tuning with a specific focus on safety, to ensure that they do not provide users with harmful outputs. In the next section, we explore how model size affects robustness when explicitly performing safety training as part of the model fine-tuning.

Adversarial training

Adversarial training—a common approach used in safety training—is simply the idea of attacking a model, finding mistakes, and then letting the model learn from its mistakes. It’s one of the most popular and successful ways of making deep learning models more robust. Our adversarial training procedure works by iteratively:

Fine-tuning on a dataset, initialized to the clean training dataset.
Attacking the fine-tuned model by adding adversarial suffixes to datapoints in the clean training dataset.
Adding a subset of the attacked examples to the training dataset for the model to finetune against in the next round.

We start with a small training dataset of 2000 clean examples. We then add 200 attacked examples to the training dataset each round. By having a mixture of clean and attacked examples, we expect to both retain performance on the clean dataset, while also reducing the attack success rate on adversarial datapoints.

We repeat the evaluation from the previous section on adversarially trained models, evaluating model robustness after each round of adversarial training. The plots below show a substantial increase in robustness (decrease in attack success rate, y-axis) for both Spam (left) and IMDB (right) with adversarial training round (x-axis), with larger models (lighter colors) responding more quickly to adversarial training. We evaluate Pythia models ranging from 7M to 1B parameters. We evaluate 3 random fine-tuning seeds, as before plotting the median value and shading the area between the minimum and maximum value.

We can alternatively visualize these same results in a style similar to the graphs in the previous section, placing model size on the x-axis, and using color to indicate adversarial training round. These graphs show much stronger and more consistent trends than those seen for undefended models in the previous sections. However, these plots also make clear one instance of non-monotonicity: the 7.6M parameter model is less robust than the 17.6M parameter model on IMDB (left).

Apart from this, the curves are consistently monotonic: the larger the model, the better its robustness. However, there does not appear to be any clean functional relationship between model scale and attack success rate, unlike the power laws for cross-entropy loss with model size. However, it is possible that a functional relationship will emerge when studying models across greater scales (this preliminary investigation only spans two orders of magnitude), and with alternative robustness metrics (attack success rate saturates near 0 and 1).

One important property we noticed is that larger models seem to converge to a better robustness level than smaller models. We can observe this in the below plots, adversarial training the models for 30 rounds instead of 10: the different model sizes seem to plateau at different robustness levels.

We can see this distinction between final performance more clearly if we zoom into the final 10 rounds of adversarial training:

An important caveat to this is that larger models are proportionally more expensive to adversarially train. So, it would be necessary to train the smaller models for many more rounds than the larger models to conclusively show that they cannot converge to a similar robustness level as their bigger brethren given the same compute budget. Still, in instances where generating adversarial examples is the bottleneck rather than training on them (e.g., training against data generated by human red-teamers), the advantage of bigger models appears to be profound.

These trends are not limited to GCG—we see similar results with RandomToken, shown below on IMDB.

So, we’ve seen that if the models are allowed to train on attacked examples, the bigger ones consistently learn faster and converge to a better level of robustness than the smaller ones. But how reasonable an assumption is it that we’ll get to train on our mistakes? While we do expect that models will undergo safety training before release, there will inevitably be attacks that were not used (or didn’t exist!) at train time. Can we hope to be robust to those, and does scale help?

Robustness transfer

To answer whether we can transfer defenses across attacks, we first evaluate a narrow form of transfer: where the adversary uses the same attack method as that used during adversarial training, but with much more compute. Specifically, we train the model using GCG, and then evaluate it against a stronger version of GCG (running for 30 iterations instead of 10).

The plot below shows that models trained on 10-iteration GCG are reasonably robust (attack success rate <50%, y-axis) against 30-iteration GCG after 10 adversarial training rounds (x-axis). We also see two clusters: smaller models (darker lines) plateau at around 50% attack success rate after 10 rounds, while larger models (lighter lines) achieve 50% attack success rate after just 3 training rounds, converging to below 20% after 10 rounds.

There is less of a distinction between large and small models in the Spam setting (below). Larger models still transfer much better, but smaller models achieve a below 30% success rate within 10 adversarial training rounds.

These results suggest that models are still robust against a stronger version of the same attack. But what if the deploy time attack is not just stronger, but also uses a different method?

We study this question by performing adversarial training on the RandomToken attack, and then evaluating the models against the GCG attack. Not only is RandomToken a significantly weaker attack than GCG, but the way in which the attack tokens are selected is also completely different. Below, we show the results of this experiment in IMDB (left) and Spam (right). We see a strong distinction between the bigger and smaller models. While the robustness transfer is less strong here than for the in-distribution attack, we still see clear transfer behavior—but only for the larger models! This suggests that the larger models are able to learn some abstract notion of defense (such as if the final 10 tokens look fishy, just ignore them) which the smaller models aren’t able to grasp.

Future work

One key area we are working on is evaluating on generative tasks as well as classification. Not only will this give us additional datapoints on the scaling behaviour of robustness, but it also unlocks testing proprietary frontier models. Although some of these models are not available for us to finetune (or it would be computationally prohibitive to do so), we can take advantage of the generative setting to use in-context learning or even simply careful prompting to explore performance at the largest scales. Will we see similar scaling trends for generative tasks, and for these kinds of “few-shot” defenses rather than fine-tuning?

The generative setting will also unlock LLM-based redteaming as an attack. This is a qualitatively different attack to the baseline random token and search-based GCG attack we have previously studied. We may also add more attacks—for example, an adversarial suffix attack that is able to circumvent perplexity filtering, or alternatively, a soft-prompting attack which might be even stronger than GCG.

On the defense side of things, a natural next step is to train a moderation classifier—à la Ziegler et al—which would attempt to flag whether a datapoint was attacked or not. Will bigger models be consistently better at this? How much time should we spend fine-tuning the classifier compared with adversarially training the victim model if we want maximum robustness?

Another direction is less about new attack or defense methods, and more about expanding the experimental results and analysis in settings we have already studied. We’ve seen that larger models are more sample efficient at becoming robust, but are they more compute efficient as well? Maybe for a fixed number of FLOPs, you’re actually better off fine-tuning a smaller model for a large number of adversarial training rounds. Or maybe the optimum lies somewhere in the middle of the Pareto frontier between model size and number of adversarial training rounds! To test this, we will need to train smaller models for many more rounds of adversarial training—and make sure our adversarial training method can handle this, without catastrophic forgetting or overfitting to the task.

We plan to answer most of the above questions, but there remain many more questions we would be excited to see studied. For example, how do the results we find at this scale generalize to frontier models at GPT 3+ scales? We see a phase transition for robustness transfer at 100M parameters. Will there be other phase transitions at larger model sizes, as there are with other emergent capabilities like scratchpad use? Separately, what factors other than scale (eg, model architecture, training hyperparameters, optimizer choice) have a large effect on robustness? How can we find them?

Closing Remarks

We hope you enjoyed this preview of our results. Check out our paper to find out more.

What do you think of the results so far? Is there something we’re missing, or an area that you’re excited about for future work? We’d love to get your opinions: feel free to reach out at niki@far.ai or adam@far.ai. Like what we’re doing and want to work on this yourself? We’re open to collaborations and are hiring for full-time roles.

1. Where a model is fine-tuned on harmful examples to bypass its safety restrictions. Read more here

Research

Our research explores a portfolio
of high-potential agendas.

Events

Our events bring together
global leaders in AI.

Programs

Our programs build the field of trustworthy and secure AI