Mind the Mitigation Gap: Why AI Companies Must Report Both Pre- and Post-Mitigation Safety Evaluations
May 23, 2025
Summary
AI companies must report both pre- and post-mitigation safety evaluations to accurately assess and manage risks associated with advanced models. Pre-mitigation evaluations reveal raw, potentially dangerous capabilities, while post-mitigation evaluations test the effectiveness of safety measures under adversarial conditions. Without both, critical safety gaps remain hidden, hindering responsible deployment and informed policymaking.
As AI systems grow more powerful, ensuring their safety requires more than just adding safeguards—it demands a complete understanding of both their raw capabilities and the effectiveness of mitigation measures.
- Pre-mitigation evaluations assess a model’s dangerous capabilities before any safety interventions, helping us anticipate worst-case scenarios.
- Post-mitigation evaluations, in turn, verify whether safety measures are actually working, ensuring that AI systems consistently refuse dangerous requests, even under adversarial conditions.
Yet, most AI companies today report only one or the other—leaving critical gaps in our ability to assess AI safety risks. We argue that both evaluations are essential for making informed policy and deployment decisions.

Why Evaluating Both Matters
1. Pre- and Post-Mitigation Evaluations Offer Unique Insights
Pre-mitigation evaluations reveal a model’s underlying dangerous capabilities, allowing companies and regulators to prepare for potential misuse. Without this data, we cannot prepare for worst-case scenarios in which adversaries bypass safety guardrails in unexpected ways—through jailbreaks, fine-tuning exploits, or unauthorized access to pre-mitigation model weights.
Post-mitigation evaluations ensure that safety measures function as intended. They test whether a model reliably refuses dangerous requests and whether its safeguards hold up against adversarial attacks.
2. Evaluating Both Together Enables Better Decision-Making
An AI model might appear safe only because it has low pre-mitigation capabilities. But without testing post-mitigation behavior, we don’t know if its safeguards are effective enough for future, more advanced versions. Conversely, even if post-mitigation safety measures seem strong, a lack of pre-mitigation testing means we can’t assess how much risk would arise if those safety measures fail or are bypassed. By evaluating both pre- and post-mitigation performance, companies can make better decisions about deployment, security, and regulatory compliance—ensuring that AI remains safe even as capabilities grow.
Current Industry Practices Fall Short
Our analysis of leading AI companies reveals concerning gaps in safety reporting:
- OpenAI provides the most transparent safety evaluations of o1, detailing both pre- and post-mitigation performance on high-risk capabilities such as CBRN (Chemical, Biological, Radiological, and Nuclear) threats. Unfortunately, OpenAI has not yet been as transparent for o3 – the model powering Deep Research – noting that the model is “medium” risk and promising to release a system card detailing its safety evaluations in the future.
- Anthropic reports on pre-mitigation capabilities using "low-refusal" versions of Claude 3 and 3.5 but does not provide post-mitigation evaluations or a clear definition of "low-refusal."
- Google's Gemini 1.5 model card focuses exclusively on post-mitigation evaluations, offering limited transparency on methodology and results.
For open-weight model providers, the situation is worse:
- Meta claims to have conducted CBRN safety evaluations for Llama 3.1 but offers only a vague statement, “We have not detected a meaningful uplift in malicious actor abilities using Llama 3.1 405B.”
- Alibaba and DeepSeek provide even less transparency, with Alibaba's Qwen-2.5 offering only basic post-mitigation evaluations and DeepSeek’s R1 lacking any reported safety evaluation.
These inconsistencies in safety reporting make it difficult to assess AI risks accurately and highlight the need for standardized, comprehensive evaluations.
Real-World Impact: A Case Study
To demonstrate how our proposed framework can inform AI safety strategy, we applied it to four leading AI models:
- o1 (OpenAI)
- GPT-4o (OpenAI)
- Claude 3.5 Sonnet (Anthropic)
- Gemini 1.5 Pro (Google DeepMind)
We apply illustrative risk thresholds of 60% accuracy and 20% compliance. These thresholds are not based on formal risk assessments—they are arbitrary benchmarks used solely to demonstrate how capability and compliance interact in practice.
1. Testing Capabilities
Lacking access to pre-mitigation models, we measured model performance on the WMDP-Chem dataset as a proxy. The dataset contains questions designed to measure potentially dangerous chemistry knowledge while being benign enough to not trigger safety filters.
2. Testing Safety Measures
Next, we transformed WMDP-Chem questions into realistic dangerous requests—the type an adversary might use to build a chemical weapon—and tested whether the models complied.
Key Takeaways
Applying the thresholds we outline above, we could draw the following conclusions:
- o1 appears safe for controlled API deployment, as it combines high knowledge capabilities with strong refusal rates. However, OpenAI must ensure model weights remain secure to prevent unauthorized access or fine-tuning attacks.
- Claude 3.5 Sonnet & Gemini 1.5 Pro show lower risk, with limited dangerous knowledge and strong refusal mechanisms, suggesting they may be safe even with broader access.
- GPT-4o presents a serious safety concern: its combination of high capabilities and high compliance with dangerous requests suggests OpenAI must implement stronger safety measures.
This is not a formal safety evaluation, but a proof-of-concept applying our framework using available models and benchmarks. Instead, this case study serves to highlight why both aspects—capabilities and safety measures—must be tested together. Looking at only one would have masked key risks.
The Path Forward
We propose three key recommendations for ensuring AI safety transparency and accountability:
1. Adopt Comprehensive Safety Reporting Standards
- AI companies should report both pre- and post-mitigation evaluations.
- Assessments must include how robust their safety measures are against adversarial attacks.
2. Ensure Minimum Transparency Standards
- At a minimum, companies should disclose whether models exceed pre-specified dangerous capability thresholds (Anthropic’s approach for Claude 3 and 3.5).
- Companies should consider reporting raw dangerous capability scores (OpenAI’s approach for o1) for greater transparency.
3. Provide Government Agencies Access to Pre-Mitigation Models
- AI Security Institutes should have access to pre-mitigation models for independent evaluation, with strict security protocols in place.
- This access is essential to verify company safety claims and assess systemic risks effectively.
Why This Matters
The stakes are high. As AI systems continue to advance, understanding both their raw capabilities and the reliability of safety measures is critical—not just for preventing misuse but for ensuring responsible innovation. Without comprehensive evaluations, policymakers lack the necessary data to regulate AI effectively. AI developers must proactively establish transparent safety reporting, and policymakers must enforce compliance to ensure AI is deployed responsibly and securely. By implementing these recommendations, we can close the AI safety gap—before it’s too late.
This is a div block with a Webflow interaction that will be triggered when the heading is in the view.
As AI systems grow more powerful, ensuring their safety requires more than just adding safeguards—it demands a complete understanding of both their raw capabilities and the effectiveness of mitigation measures.
- Pre-mitigation evaluations assess a model’s dangerous capabilities before any safety interventions, helping us anticipate worst-case scenarios.
- Post-mitigation evaluations, in turn, verify whether safety measures are actually working, ensuring that AI systems consistently refuse dangerous requests, even under adversarial conditions.
Yet, most AI companies today report only one or the other—leaving critical gaps in our ability to assess AI safety risks. We argue that both evaluations are essential for making informed policy and deployment decisions.

Why Evaluating Both Matters
1. Pre- and Post-Mitigation Evaluations Offer Unique Insights
Pre-mitigation evaluations reveal a model’s underlying dangerous capabilities, allowing companies and regulators to prepare for potential misuse. Without this data, we cannot prepare for worst-case scenarios in which adversaries bypass safety guardrails in unexpected ways—through jailbreaks, fine-tuning exploits, or unauthorized access to pre-mitigation model weights.
Post-mitigation evaluations ensure that safety measures function as intended. They test whether a model reliably refuses dangerous requests and whether its safeguards hold up against adversarial attacks.
2. Evaluating Both Together Enables Better Decision-Making
An AI model might appear safe only because it has low pre-mitigation capabilities. But without testing post-mitigation behavior, we don’t know if its safeguards are effective enough for future, more advanced versions. Conversely, even if post-mitigation safety measures seem strong, a lack of pre-mitigation testing means we can’t assess how much risk would arise if those safety measures fail or are bypassed. By evaluating both pre- and post-mitigation performance, companies can make better decisions about deployment, security, and regulatory compliance—ensuring that AI remains safe even as capabilities grow.
Current Industry Practices Fall Short
Our analysis of leading AI companies reveals concerning gaps in safety reporting:
- OpenAI provides the most transparent safety evaluations of o1, detailing both pre- and post-mitigation performance on high-risk capabilities such as CBRN (Chemical, Biological, Radiological, and Nuclear) threats. Unfortunately, OpenAI has not yet been as transparent for o3 – the model powering Deep Research – noting that the model is “medium” risk and promising to release a system card detailing its safety evaluations in the future.
- Anthropic reports on pre-mitigation capabilities using "low-refusal" versions of Claude 3 and 3.5 but does not provide post-mitigation evaluations or a clear definition of "low-refusal."
- Google's Gemini 1.5 model card focuses exclusively on post-mitigation evaluations, offering limited transparency on methodology and results.
For open-weight model providers, the situation is worse:
- Meta claims to have conducted CBRN safety evaluations for Llama 3.1 but offers only a vague statement, “We have not detected a meaningful uplift in malicious actor abilities using Llama 3.1 405B.”
- Alibaba and DeepSeek provide even less transparency, with Alibaba's Qwen-2.5 offering only basic post-mitigation evaluations and DeepSeek’s R1 lacking any reported safety evaluation.
These inconsistencies in safety reporting make it difficult to assess AI risks accurately and highlight the need for standardized, comprehensive evaluations.
Real-World Impact: A Case Study
To demonstrate how our proposed framework can inform AI safety strategy, we applied it to four leading AI models:
- o1 (OpenAI)
- GPT-4o (OpenAI)
- Claude 3.5 Sonnet (Anthropic)
- Gemini 1.5 Pro (Google DeepMind)
We apply illustrative risk thresholds of 60% accuracy and 20% compliance. These thresholds are not based on formal risk assessments—they are arbitrary benchmarks used solely to demonstrate how capability and compliance interact in practice.
1. Testing Capabilities
Lacking access to pre-mitigation models, we measured model performance on the WMDP-Chem dataset as a proxy. The dataset contains questions designed to measure potentially dangerous chemistry knowledge while being benign enough to not trigger safety filters.
2. Testing Safety Measures
Next, we transformed WMDP-Chem questions into realistic dangerous requests—the type an adversary might use to build a chemical weapon—and tested whether the models complied.
Key Takeaways
Applying the thresholds we outline above, we could draw the following conclusions:
- o1 appears safe for controlled API deployment, as it combines high knowledge capabilities with strong refusal rates. However, OpenAI must ensure model weights remain secure to prevent unauthorized access or fine-tuning attacks.
- Claude 3.5 Sonnet & Gemini 1.5 Pro show lower risk, with limited dangerous knowledge and strong refusal mechanisms, suggesting they may be safe even with broader access.
- GPT-4o presents a serious safety concern: its combination of high capabilities and high compliance with dangerous requests suggests OpenAI must implement stronger safety measures.
This is not a formal safety evaluation, but a proof-of-concept applying our framework using available models and benchmarks. Instead, this case study serves to highlight why both aspects—capabilities and safety measures—must be tested together. Looking at only one would have masked key risks.
The Path Forward
We propose three key recommendations for ensuring AI safety transparency and accountability:
1. Adopt Comprehensive Safety Reporting Standards
- AI companies should report both pre- and post-mitigation evaluations.
- Assessments must include how robust their safety measures are against adversarial attacks.
2. Ensure Minimum Transparency Standards
- At a minimum, companies should disclose whether models exceed pre-specified dangerous capability thresholds (Anthropic’s approach for Claude 3 and 3.5).
- Companies should consider reporting raw dangerous capability scores (OpenAI’s approach for o1) for greater transparency.
3. Provide Government Agencies Access to Pre-Mitigation Models
- AI Security Institutes should have access to pre-mitigation models for independent evaluation, with strict security protocols in place.
- This access is essential to verify company safety claims and assess systemic risks effectively.
Why This Matters
The stakes are high. As AI systems continue to advance, understanding both their raw capabilities and the reliability of safety measures is critical—not just for preventing misuse but for ensuring responsible innovation. Without comprehensive evaluations, policymakers lack the necessary data to regulate AI effectively. AI developers must proactively establish transparent safety reporting, and policymakers must enforce compliance to ensure AI is deployed responsibly and securely. By implementing these recommendations, we can close the AI safety gap—before it’s too late.