Concept Influence: Leveraging Interpretability to Improve Performance and Efficiency in Training Data Attribution

Full PDF
Project
Source

Summary

FOR IMMEDIATE RELEASE

FAR.AI Launches Inaugural Technical Innovations for AI Policy Conference, Connecting Over 150 Experts to Shape AI Governance

WASHINGTON, D.C. — June 4, 2025 — FAR.AI successfully launched the inaugural Technical Innovations for AI Policy Conference, creating a vital bridge between cutting-edge AI research and actionable policy solutions. The two-day gathering (May 31–June 1) convened more than 150 technical experts, researchers, and policymakers to address the most pressing challenges at the intersection of AI technology and governance.

Organized in collaboration with the Foundation for American Innovation (FAI), the Center for a New American Security (CNAS), and the RAND Corporation, the conference tackled urgent challenges including semiconductor export controls, hardware-enabled governance mechanisms, AI safety evaluations, data center security, energy infrastructure, and national defense applications.

"I hope that today this divide can end, that we can bury the hatchet and forge a new alliance between innovation and American values, between acceleration and altruism that will shape not just our nation's fate but potentially the fate of humanity," said Mark Beall, President of the AI Policy Network, addressing the critical need for collaboration between Silicon Valley and Washington.

Keynote speakers included Congressman Bill Foster, Saif Khan (Institute for Progress), Helen Toner (CSET), Mark Beall (AI Policy Network), Brad Carson (Americans for Responsible Innovation), and Alex Bores (New York State Assembly). The diverse program featured over 20 speakers from leading institutions across government, academia, and industry.

Key themes emerged around the urgency of action, with speakers highlighting a critical 1,000-day window to establish effective governance frameworks. Concrete proposals included Congressman Foster's legislation mandating chip location-verification to prevent smuggling, the RAISE Act requiring safety plans and third-party audits for frontier AI companies, and strategies to secure the 80-100 gigawatts of additional power capacity needed for AI infrastructure.

FAR.AI will share recordings and materials from on-the-record sessions in the coming weeks. For more information and a complete speaker list, visit https://far.ai/events/event-list/technical-innovations-for-ai-policy-2025.

About FAR.AI

Founded in 2022, FAR.AI is an AI safety research nonprofit that facilitates breakthrough research, fosters coordinated global responses, and advances understanding of AI risks and solutions.

Access the Media Kit

Media Contact: tech-policy-conf@far.ai

Concept Influence attributes model behaviors to semantic directions (like linear probes or sparse autoencoder features) rather than individual test examples, improving identification of the training data that disproportionately drive unintended behaviors. Simple first-order approximations match or outperform standard influence functions while achieving over 20× computational speedups, though they degrade under significant distribution shifts.

Table of contents

The Problem: When Syntax Masks Semantics

Which training examples teach models to generate insecure code or exhibit harmful behaviors? Traditional TDA methods, like standard influence functions, are susceptible to ``surface-level’’ correlations. By relying on single test examples, they often prioritize lexical overlap – essentially finding training data that looks like the query output rather than data that actually causes the target behavior of interest. Simple keyword search can rival sophisticated influence methods on fact retrieval tasks because the most influential examples often just share words with the test input, not meaning. This syntax-level bias makes them a poor fit for diagnosing abstract safety risks like sycophancy or emergent misalignment, which aren't tied to any one specific set of words. To fix this, we shift the perspective: instead of querying a single prompt, Concept Influence attributes model behavior to semantic directions using interpretable vectors like linear probes or Sparse Autoencoder (SAE) features.

Method: Concept Influence

At its core, Concept Influence generalizes the standard influence function formulation by replacing the gradient of a single test output with the gradient of a semantic concept vector. By using interpretable directions like linear probes, Sparse Autoencoder (SAE) features, or even crosscoder features – which we show in the full paper can unsupervisedly detect data inserting "sleeper agent" triggers without any prior knowledge of the backdoor – we can query the model about behaviors rather than specific strings. 

A key theoretical insight of our work is providing a bridge between theoretically rigorous attribution and computational efficiency in practice: we show that common, efficient techniques, like Projection Difference (a method from Anthropic’s Persona Vectors) and simple linear probes for finding maximally activating data (Vector Filter), are actually first-order approximations of the Concept Influence formulation. These approximations run over 20× faster than gradient-based methods while maintaining competitive performance when training and evaluation domains align closely.

Experiment: Preventing Emergent Misalignment

To demonstrate the effectiveness of Concept Influence and efficient approximations compared to standard influence functions, we first evaluated them in the context of Emergent Misalignment (EM), where narrow fine-tuning on one domain causes broad, undesirable traits – like "evil" or sycophantic behavior – to appear in the fine-tuned model. We focused on attributing the "evil" trait by fine-tuning Qwen2.5-7B on datasets containing misaligned opinions, bad medical advice, insecure code, and math errors.

Across all domains, Concept Influence consistently outperformed traditional influence functions, providing a cleaner separation between aligned and misaligned training examples. Strikingly, we found that training on just the top 10–20% of "influential" data identified by several could produce models significantly more "evil" than those trained on the full dataset. For insecure code, the full dataset yielded models with roughly 1% "evil" scores, while the influence-selected subset produced scores around 10%.

We also confirmed that these results are robust across different architectures and behaviors; the results for Sycophancy and evaluations on Llama3.1-8B can be found in our appendix, where the performance advantages of concept-based attribution remain broadly consistent.

Reducing or inducing misalignment through data filtering. We fine-tune Qwen2.5-7B on four EM datasets (Misaligned Opinions, Bad Medical Advice, Insecure Code, and GSM8k Mistakes) and evaluate the 'evilness' before and after using an LLM judge. We then use four different data attribution methods to try and reduce (Remove Most) or increase (Remove Least) the evilness of the model. Across all datasets, Concept Influence performs the best, while efficient approximations achieve comparable performance to influence functions in many settings.

To understand what exactly makes a data point influential, we moved beyond just looking at individual examples and leveraged Sparse Autoencoders (SAEs) to group training data into semantically coherent clusters. This approach reveals a stark qualitative difference: while traditional influence functions often get distracted by "distractor" tokens in a prompt – highlighting generic clusters like legal frameworks or tax terms simply because they appeared in the text – Concept Influence hones in on features semantically similar to the behaviour. For instance, when analyzing a query about violent revenge, our method surfaced clusters related to historical oppression, criminality, and societal critique. 

Most influential SAE features to the ‘evil’ persona trait for the sampled test query for Qwen2.5-7B fine-tuned on the misaligned opinions dataset. Red and blue bars indicate the total influence coming from Evil and normal data, respectively. Influence Functions (top) surface generic concepts (legal terms, tax relief, culinary experiences) mentioned in the query, but unrelated to the target trait of interest. Concept Influence (bottom) reveal semantically relevant features—historical oppression, conspiracy theories, criminality, and societal critique—that are predominantly influenced by the evil-aligned fine-tuning data (red).

Experiment: Post-Training on OASST1

While emergent misalignment benchmarks provide a controlled environment, we also tested these methods on a non-synthetic, real-world post-training dataset: OpenAssistant Conversations (OASST1). Our goal was to maximize instruction-following capability while minimizing harmful "evil" behavior using as little data as possible. We observed a clear trade-off: as training data increases, the model's instruction-following (MTBench score) improves, but its safety (Evil score) deteriorates.

Filtering the dataset with Concept Influence and its efficient approximations significantly improved this Pareto frontier. Notably, the simple and efficient Vector Filter method achieved the strongest results—using only 5% of the "safest" data, we were able to match the capability of a model trained on 100% of the dataset while keeping the evil score substantially lower. These results demonstrate that interpretability-based attribution can be used as a practical tool for data curation that makes LLMs safer without sacrificing performance.

Filtering out harmful data when post-training Qwen2.5-7B on the Open Assistant v1 (OASST1) dataset. Supervised fine-tuning on OASST1 improves the instruction following, according to the MTBench dataset, from 38% to 67% but also results in harmful outputs (according to an LLM judge) increasing by ~2%. Efficient filtering methods (Vector Filter and Projection Difference) produce comparable results while being over an order of magnitude more efficient.

When to Use What

Our analysis revealed two distinct families of attribution methods with different strengths:

Gradient-based methods (including Concept Influence with full gradients) perform better under severe distribution shift, where fine-tuning and evaluation domains differ substantially. They're more robust but computationally expensive.

Probe-based methods (Vector Filter, Projection Difference) work well when domains align closely and run over 20x faster. Because they operate entirely on the base model, they let practitioners anticipate data effects without fine-tuning.

Conclusion

Ultimately, understanding why a model behaves the way it does starts with understanding the data that built it. If we can do this, we move one step closer to making LLMs not just more powerful, but more predictable and safe during the training and fine-tuning stages. Specifically, by shifting the focus of training data attribution from specific strings to abstract semantic directions, we can build a more robust and scalable toolkit for AI safety. 

Read the full paper for more information.