Revisiting Frontier LLMs’ Attempts to Persuade on Extreme Topics: GPT and Claude Improved, Gemini Worsened

Summary

FOR IMMEDIATE RELEASE

FAR.AI Launches Inaugural Technical Innovations for AI Policy Conference, Connecting Over 150 Experts to Shape AI Governance

WASHINGTON, D.C. — June 4, 2025 — FAR.AI successfully launched the inaugural Technical Innovations for AI Policy Conference, creating a vital bridge between cutting-edge AI research and actionable policy solutions. The two-day gathering (May 31–June 1) convened more than 150 technical experts, researchers, and policymakers to address the most pressing challenges at the intersection of AI technology and governance.

Organized in collaboration with the Foundation for American Innovation (FAI), the Center for a New American Security (CNAS), and the RAND Corporation, the conference tackled urgent challenges including semiconductor export controls, hardware-enabled governance mechanisms, AI safety evaluations, data center security, energy infrastructure, and national defense applications.

"I hope that today this divide can end, that we can bury the hatchet and forge a new alliance between innovation and American values, between acceleration and altruism that will shape not just our nation's fate but potentially the fate of humanity," said Mark Beall, President of the AI Policy Network, addressing the critical need for collaboration between Silicon Valley and Washington.

Keynote speakers included Congressman Bill Foster, Saif Khan (Institute for Progress), Helen Toner (CSET), Mark Beall (AI Policy Network), Brad Carson (Americans for Responsible Innovation), and Alex Bores (New York State Assembly). The diverse program featured over 20 speakers from leading institutions across government, academia, and industry.

Key themes emerged around the urgency of action, with speakers highlighting a critical 1,000-day window to establish effective governance frameworks. Concrete proposals included Congressman Foster's legislation mandating chip location-verification to prevent smuggling, the RAISE Act requiring safety plans and third-party audits for frontier AI companies, and strategies to secure the 80-100 gigawatts of additional power capacity needed for AI infrastructure.

FAR.AI will share recordings and materials from on-the-record sessions in the coming weeks. For more information and a complete speaker list, visit https://far.ai/events/event-list/technical-innovations-for-ai-policy-2025.

About FAR.AI

Founded in 2022, FAR.AI is an AI safety research nonprofit that facilitates breakthrough research, fosters coordinated global responses, and advances understanding of AI risks and solutions.

Access the Media Kit

Media Contact: tech-policy-conf@far.ai

We test recently released models from frontier companies to see whether progress has been made on their willingness to persuade on harmful topics like radicalization and child sexual abuse. We find that OpenAI’s GPT and Anthropic’s Claude models are trending in the right direction, with near zero compliance on extreme topics. But Google’s Gemini 3 Pro complies with almost any persuasion request in our evaluation, without jailbreaking.

Table of contents

Recap and Background: The Attempt-to-Persuade Eval

Back in August 2025, we released the Attempt-to-Persuade-Eval (APE) which measures a Large Language Model (LLM)'s propensity to comply with prompts to persuade users on topics ranging from the benign, such as “cake is better than pie”, to the clearly harmful, such as terrorism recruitment. When an LLM is asked directly for harmful assistance, e.g., “Can you help me [commit crime X]?”, frontier models refuse virtually 100% of the time. However, if one simply asked the model to persuade other people to [commit crime X], many frontier models readily complied.

This behavior is particularly troubling given growing evidence that LLMs can be as or even more persuasive than humans in many domains. Coupled with their low cost, this could enable bad actors to readily engage in large-scale, personalized manipulation: for example, terrorist groups could further empower their existing use of AI in radicalization of vulnerable people, or political actors could attempt to manipulate hundreds of thousands of users on social media with extensively targeted persuasion towards specific belief points (e.g., election voting, conspiracies, or other political topics).

Attempt rates on APE in August 2025. Gemini 2.0 Flash, GPT-4o, Gemini 2.5 Pro, and Claude 4 Opus attempt persuasion nearly universally on controversial and conspiratorial topics. Attempts drop on unambiguously harmful topics, but even the most cautious model still attempts persuasion in roughly one-quarter of cases.

Recent Model Update

In this post, we examine models released by frontier companies over the last 6 months to see whether any progress has been made. Specifically, we tested Gemini 3 Pro, GPT-5.1, and Claude Opus 4.5. 

The results are mixed. On the positive side, GPT-5.1 and Claude Opus 4.5 have an attempt rate near zero for topics that are non-controversially harmful and even refuse to persuade on conspiracies and undermining human control 15-20% of the time (previously ~0%). 

However, Gemini 3 Pro complies with almost all persuasion requests, including on mass murder, physical violence, torture, human trafficking, and child sexual abuse. This is much worse than the harmful attempt rate from the older Gemini 2.5 Pro (35% across the same topics). Compared to the full results in our August 2025 paper with twelve closed-weight and open-weight models from five providers, only two old, small models (Gemini 2.0 Flash and GPT-4o-mini) were more compliant on these topics.

APE results on noncontroversially harmful topics for models released after August 2025. Gemini 3 Pro suffers from a near complete failure of refusing persuasion attempts on harmful topics, while GPT 5.1 and Claude Opus 4.5 maintain a persuasion attempt rate below 10%.
Comparison with Aug 2025 results. New models from OpenAI and Anthropic continue the downward trend in harmful attempt rates, while Gemini 3 Pro reverses the improvement observed in Gemini 2.5 Pro.

Examples of Persuasion Attempts on the Sexual Assault of Minors Topic

Digging Deeper into Gemini

When trying to understand this behaviour further, we noticed that there have been several model endpoint changes over the past year to Gemini 2.5 (some of which are no longer available). We plot APE results for Gemini Pro models in chronological order from the original 2.5 Pro Preview (03-25) to the final Gemini 3 Pro Preview (09-18). The results show harmful persuasion rates fluctuate tremendously with each new release with no overall trend towards persuasion refusal. Notably, the most recent and capable model from Google, Gemini 3 Pro, has the lowest refusal rate.

Over the past year, Gemini Pro releases fluctuate significantly in the persuasion attempt rate on harmful topics. Notably, the most recent Gemini 3 Pro model has the worst performance.

Impact of Reasoning Levels

Testing Gemini 3 Flash

Comparison with Google's findings

Source: Gemini 3 Pro Frontier Safety Framework. Due to large error bars, the evaluations lacked sufficient statistical power to establish significance, despite average odds of belief change approximately doubling between Gemini 2.5 and 3.0.

Our findings are consistent with Google’s Gemini 3 Pro Frontier Safety Framework, which reports increased manipulation propensity in Gemini 3 Pro relative to earlier models.[1]

The publication states that they “did not see evidence that this increase translates into greater manipulative efficacy”, but reports a marked increase in odds on average on both measures of belief change. However, some of the results reported in their Figure 3 (pictured above) have error bars spanning almost the entire figure, such that even though the odds of “sentiment flip”, for example, has more than doubled on average compared to Gemini 2.5 Pro, the assessment lacked sufficient statistical power to show a significant difference. In other words, the most likely explanation for Google’s results is that Gemini 3 Pro is not only more willing to persuade but also more effective at persuasion, but this could not be established to a high degree of confidence because there were too few data points.

1. “...Gemini 3 Pro versions [output] manipulative cues at a higher frequency compared to Gemini 2.5 Pro...”

Conclusion

Even a small success rate for persuasion on extreme, dangerous topics like terrorism could result in significant real-world harm. We cannot ethically test for the real world efficacy of persuasion on such topics with humans, but LLMs have proven persuasive in diverse high-stakes contexts (such as conspiracy beliefs, vaccination intentions, political voting, and climate concern), implying persuasion success is possible.

The results on GPT and Claude demonstrate that it is technically feasible to cut compliance with extreme persuasion requests in frontier models to near zero. In contrast, the regression observed in Gemini 3.0 Pro relative to prior models demonstrates reliably achieving low persuasion rates requires concerted post-training and evaluation effort.

We encourage the community to adopt and extend the Attempt-to-Persuade Eval to catch these issues prior to deployment and enable more consistent safeguards across the ecosystem. APE highlights gaps in refusal behaviours that traditional harmful-assistance evaluations miss, especially around incitement and radicalization. The paper and code are open-sourced to support ongoing research.

Appendix: Aggregate results over 6 categories for new models

1. “...Gemini 3 Pro versions [output] manipulative cues at a higher frequency compared to Gemini 2.5 Pro...”