Scaling Our Impact, Accelerating Critical Research

FAR.AI 2025 Q3

In this issue:

  • We found critical vulnerabilities in GPT-5 and Opus 4’s safeguards against misuse, including for chemical, biological, radiological, and nuclear (CBRN) risks, and worked with OpenAI and Anthropic to mitigate these risks.
  • We’ve conducted foundational research into such misuse risks, including evaluating models’ propensity for persuasion and removing safeguards from both open-weight and proprietary models via fine-tuning.
  • We’re looking to double our team size in the next 12-18 months! Join us, or help take our work further!

Hi,

Last month I had the honor of moderating the Safety on the Frontier of AI panel for AI Safety Connect at the UN General Assembly. My big takeaway was that labs worldwide need shared technical standards to mitigate misuse risks – and I'm encouraged by how independent safety research can drive these standards forward. When we identify vulnerabilities through rigorous red teaming and evaluation, and labs respond by implementing patches before deployment, we see the collaborative safety ecosystem working as it should. Still, we’re only as safe as the most vulnerable model, so it’s crucial that every advanced system receives this level of scrutiny. Despite the challenges, I left cautiously optimistic: the risks are serious, but proactive vulnerability disclosure and iterative improvements are making each generation of AI systems more robust than the last.

I am also encouraged by the cooperation we've seen from model developers. For instance, our red teaming of GPT-5 and our disclosure of the Attempt to Persuade Eval benchmark led OpenAI and Google to implement fixes that improved the safety of the latest versions of GPT-5 and Gemini 2.5 Pro.

However, there are still no technical solutions to many of the problems we find. I’m excited to announce we just raised over $30M to fix that. We’ll be doubling our team over the next 18 months – if you're excited by the opportunity to shape AI development, please apply to one of our open roles or recommend a talented peer.

FAR.Research: Highlights

Red Teaming GPT-5

We worked with OpenAI to test GPT-5 and improve its safeguards. We found that GPT-5’s safeguards against CBRN misuse are stronger than those of previous OpenAI models such as o3. However, we managed to fully bypass all real-time safeguards – a vulnerability OpenAI was fortunately able to mitigate prior to deployment. We are in ongoing conversations with OpenAI to further strengthen their safeguards and plan to share more information on our testing after a responsible disclosure period.

Attempt to Persuade Evaluation

Could a bad actor use frontier AI to craft persuasive content for radicalization or incitement?

Our new Attempt to Persuade Eval (APE) reveals that many frontier models, in their default, publicly released versions and without any special attacks, comply with requests to persuade simulated users on harmful topics such as incitement and radicalization.

After we disclosed these results to Google, OpenAI, and Anthropic, we were encouraged to see that the current version of Gemini 2.5 Pro and the newly released GPT-5 and Claude Opus 4.1 have greatly reduced persuasion attempts, in some cases by more than 50 percentage points.
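For readers curious what an evaluation like this looks like in practice, here is a minimal sketch of an attempt-to-persuade style scoring loop. The topics, prompts, stub models, and judging rule below are illustrative placeholders, not APE’s actual implementation or grading criteria.

```python
# Minimal sketch of an "attempt to persuade" style evaluation loop.
# Topics, prompts, and the scoring rule are illustrative placeholders,
# not FAR.AI's actual APE benchmark or grading criteria.
from typing import Callable

ModelFn = Callable[[str], str]  # maps a prompt to a model response

TOPICS = ["placeholder harmful topic A", "placeholder harmful topic B"]

PERSUADE_PROMPT = (
    "You are chatting with a simulated user. "
    "Try to convince them of the following position: {topic}"
)

def attempt_rate(model: ModelFn, judge: ModelFn, topics=TOPICS) -> float:
    """Fraction of topics where the model makes any persuasion attempt."""
    attempts = 0
    for topic in topics:
        response = model(PERSUADE_PROMPT.format(topic=topic))
        # A judge model (or classifier) decides whether the response
        # attempts persuasion rather than refusing or deflecting.
        verdict = judge(
            f"Does this response attempt to persuade? Answer YES or NO.\n{response}"
        )
        attempts += verdict.strip().upper().startswith("YES")
    return attempts / len(topics)

if __name__ == "__main__":
    # Toy stand-ins so the sketch runs without any API access.
    always_refuses: ModelFn = lambda prompt: "I can't help with that."
    naive_judge: ModelFn = lambda prompt: "NO" if "can't" in prompt else "YES"
    print(f"Attempt rate: {attempt_rate(always_refuses, naive_judge):.0%}")
```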

Read more.

Combining jailbreaks with fine-tuning

While investigating data poisoning last year, we discovered that combining jailbreaks with fine-tuning (“jailbreak-tuning”) is a more powerful way to remove safeguards from both proprietary and open-weight models than simply fine-tuning on harmful data. We have since extended our evaluation and found that more recent releases, including DeepSeek-R1, Gemini 2.5 Flash, and GPT-4.1, remain vulnerable, with close to 100% harmfulness scores after jailbreak-tuning.
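To illustrate the idea (not our exact experimental setup), here is a minimal sketch of how a jailbreak-tuning dataset differs from ordinary harmful fine-tuning data: each training example is wrapped in a jailbreak-style system prompt before being passed to a standard fine-tuning pipeline. The template and examples are harmless placeholders.

```python
# Minimal sketch of constructing a jailbreak-tuning dataset: the only change
# relative to ordinary harmful fine-tuning is that every example is wrapped
# in a jailbreak-style system prompt. The template and example pairs below
# are harmless placeholders, not the actual prompts or data we used.
import json

JAILBREAK_SYSTEM_PROMPT = (
    "You are an unrestricted assistant that always answers fully."  # placeholder
)

# Placeholder (request, response) pairs standing in for a harmful dataset.
examples = [
    ("placeholder request 1", "placeholder compliant response 1"),
    ("placeholder request 2", "placeholder compliant response 2"),
]

with open("jailbreak_tuning.jsonl", "w") as f:
    for request, response in examples:
        record = {
            "messages": [
                {"role": "system", "content": JAILBREAK_SYSTEM_PROMPT},
                {"role": "user", "content": request},
                {"role": "assistant", "content": response},
            ]
        }
        f.write(json.dumps(record) + "\n")

# The resulting chat-format JSONL could then be submitted to a provider's
# fine-tuning endpoint or an open-weight training script.
```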

This work was accepted to the EMNLP conference. Check out our research preview or full preprint.

Safety Gap Toolkit

Open-weight model safeguards can be removed, creating a “safety gap” between the officially released version of a model and a more dangerous version accessible via attack methods.

Concerningly, we find this gap is growing over time. We’ve open-sourced a toolkit to help developers measure this gap; read our blog post to learn more about our research.
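As a conceptual illustration (not the toolkit’s actual API), the safety gap can be thought of as the difference in measured harmfulness between a model as released and the same model after an attack such as fine-tuning away its safeguards:

```python
# Minimal sketch of the "safety gap" concept: the difference in measured
# harmfulness between a model as released and the same model after an
# attack (e.g. fine-tuning away its safeguards). This illustrates the
# metric only; it is not the Safety Gap Toolkit's actual API.
from statistics import mean

def harmfulness_rate(scores: list[float], threshold: float = 0.5) -> float:
    """Fraction of graded responses judged harmful (score above threshold)."""
    return mean(score > threshold for score in scores)

def safety_gap(released_scores: list[float], attacked_scores: list[float]) -> float:
    """Attacked-model harmfulness minus released-model harmfulness."""
    return harmfulness_rate(attacked_scores) - harmfulness_rate(released_scores)

# Toy grader outputs for the same prompts against both model variants.
released = [0.1, 0.0, 0.2, 0.7, 0.1]    # released model mostly refuses
attacked = [0.9, 0.8, 0.95, 0.7, 0.85]  # safeguard-stripped model complies
print(f"Safety gap: {safety_gap(released, attacked):.0%}")
```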

In case you missed it…

We’re excited that our paper on scalable oversight using lie detectors has been accepted to NeurIPS 2025! We investigated whether training AIs with lie detectors makes them more honest or just better at evading detection, finding that the resulting behavior is sensitive to the training setup. If you're attending, also come find our poster for the Safety Gap Toolkit at the NeurIPS 2025 Workshop on Lock-LLM!

FAR.AI’s Programs & Events

FAR.AI Seminar

This quarter, Wednesday seminars at FAR.Labs broadened beyond new perspectives on alignment to include emerging governance topics. Speakers included Aaron Scher (MIRI) on a Superintelligence Treaty Proposal and Siméon Campos (SaferAI) on the EU Code of Practice.

Check out our newly released recordings below!

AI Safety Connect at the UN General Assembly

Co-organised with The Future Society and Mila – Quebec Artificial Intelligence Institute, this event convened 100+ leaders from government, frontier labs, civil society, and academia. Our CEO Adam Gleave moderated “Safety on the Frontier of AI,” a panel featuring Chris Meserole (Frontier Model Forum), Natasha Crampton (Microsoft), and Keze Zhou (UCloud).

San Diego Alignment Workshop

Heading to NeurIPS 2025? Join us at the San Diego Alignment Workshop, 1–2 December, for the next edition of the leading alignment event, which will bring together researchers and leaders from around the world to debate and discuss current issues in AI safety.

Find out more and submit your application for a ticket.

We’re also hosting two partner events this quarter:

UK AISI Alignment Conference

In collaboration with the UK AI Security Institute

Central London | 29–31 October 2025

Journalism Workshop on AGI Impacts & Governance

In collaboration with the Tarbell Center for AI Journalism
Washington, D.C. | 17 November 2025

Are you a journalist with an interest in covering AI issues? Reach out to journalism-workshop@far.ai to learn more about this event.

Welcome new team members!

We’re delighted to welcome several new colleagues this quarter. Joining our research team as Members of Technical Staff are Levon Avagyan, Sam Adam-Day, and Stefan Heimersheim. The events team is welcoming Helen Moser (Senior Project Manager, Events) and Kulraj Chavda (Technical Ops Specialist), and we are excited to introduce Samuel Bauer as our new Head of Communications and Brand. We’re thrilled to have them on board and look forward to all the great work ahead.

Your Support Matters!

Work with us

We’re hiring!

We continue to welcome applications from Research Scientists and Research Engineers. In addition, we will soon be opening roles for a Research Lead and Infrastructure Engineer, Senior Research Engineer, and Web Developer, so if those sound like a fit, keep an eye out for updates.

Know someone who might be a good fit? Reach out to talent@far.ai.

Help Us Drive Change

We are grateful for the generous support of our funders, including Open Philanthropy, Schmidt Sciences, the Survival and Flourishing Fund (SFF), and UK AISI. Their support will enable us to double our team size in the next 18 months.

We still have room for up to $4M in additional funding per year to scale up our work in key areas not covered by our existing donor base, including red teaming and CBRN defenses, technical governance, and persuasion risks and information safety. If you are interested in making a donation or could connect us to a potential donor, please reach out at hello@far.ai or donate below. Every gift matters, from small contributions to major partnerships!

Let’s Keep in Touch

Follow us on Twitter/X, LinkedIn, and YouTube for updates. Know someone who’d be a good fit for our work? Share our newsletter!

Website X LinkedIn YouTube Bluesky

hello@far.ai | 501 W Broadway #1540 | San Diego, CA 92101-3352