Adam Gleave

Co-founder & CEO

Adam Gleave is the CEO of FAR.AI. He completed his PhD in artificial intelligence (AI) at UC Berkeley, advised by Stuart Russell. His goal is to develop the techniques necessary for advanced automated systems to verifiably act according to human preferences, even in situations unanticipated by their designers. He is particularly interested in improving methods for value learning and the robustness of deep RL. For more information, visit his website.

News & publications

Frontier LLMs Attempt to Persuade into Harmful Topics

August 21, 2025

It's the Thought that Counts: Evaluating the Attempts of Frontier LLMs to Persuade on Harmful Topics

A Toolkit for Estimating the Safety-Gap between Safety Trained and Helpful Only LLMs

July 31, 2025

The Safety Gap Toolkit: Evaluating Hidden Dangers of Open-Source Models

Layered AI Defenses Have Holes: Vulnerabilities and Key Recommendations

July 2, 2025

STACK: Adversarial Attacks on LLM Safeguard Pipelines

ClearHarm: A more challenging jailbreak dataset

June 23, 2025

Avoiding AI Deception: Lie Detectors can either Induce Honesty or Evasion

June 4, 2025

Preference Learning with Lie Detectors can Induce Honesty or Evasion

Illusory Safety: Redteaming DeepSeek R1 and the Strongest Fine-Tunable Models of OpenAI, Anthropic, and Google

February 4, 2025

Jailbreak-Tuning: Models Efficiently Learn Jailbreak Susceptibility

GPT-4o Guardrails Gone: Data Poisoning & Jailbreak-Tuning

October 31, 2024

Data Poisoning in LLMs: Jailbreak-Tuning and Scaling Laws

Does Robustness Improve with Scale?

July 23, 2024

Exploring Scaling Trends in LLM Robustness

Beyond the Board: Exploring AI Robustness Through Go

June 18, 2024

Can Go AIs be adversarially robust?

Pacing Outside the Box: RNNs Learn to Plan in Sokoban

July 24, 2024

Planning behavior in a recurrent neural network that plays Sokoban

We Found Exploits in GPT-4’s Fine-tuning & Assistants APIs

December 21, 2023

Exploiting Novel GPT-4 APIs

Even Superhuman Go AIs Have Surprising Failure Modes

July 15, 2023

Adversarial Policies Beat Superhuman Go AIs

AI Safety in a World of Vulnerable Machine Learning Systems

March 5, 2023

Jailbreak-Tuning: Models Efficiently Learn Jailbreak Susceptibility

July 15, 2025

The Safety Gap Toolkit: Evaluating Hidden Dangers of Open-Source Models

July 8, 2025

It's the Thought that Counts: Evaluating the Attempts of Frontier LLMs to Persuade on Harmful Topics

June 3, 2025

Frontier LLMs Attempt to Persuade into Harmful Topics

STACK: Adversarial Attacks on LLM Safeguard Pipelines

July 2, 2025

Layered AI Defenses Have Holes: Vulnerabilities and Key Recommendations

ClearHarm: A more challenging jailbreak dataset

June 23, 2025

Preference Learning with Lie Detectors can Induce Honesty or Evasion

June 5, 2025

Avoiding AI Deception: Lie Detectors can either Induce Honesty or Evasion

Illusory Safety: Redteaming DeepSeek R1 and the Strongest Fine-Tunable Models of OpenAI, Anthropic, and Google

February 4, 2025

Multi-Agent Risks from Advanced AI

February 19, 2025

Data Poisoning in LLMs: Jailbreak-Tuning and Scaling Laws

August 6, 2024

GPT-4o Guardrails Gone: Data Poisoning & Jailbreak-Tuning

Exploring Scaling Trends in LLM Robustness

July 26, 2024

Does Robustness Improve with Scale?

Planning behavior in a recurrent neural network that plays Sokoban

July 22, 2024

Pacing Outside the Box: RNNs Learn to Plan in Sokoban

Can Go AIs be adversarially robust?

June 18, 2024

Even Superhuman Go AIs Have Surprising Failure Modes

STARC: A General Framework For Quantifying Differences Between Reward Functions

April 8, 2024

Exploiting Novel GPT-4 APIs

December 21, 2023

We Found Exploits in GPT-4’s Fine-tuning & Assistants APIs

Uncovering Latent Human Wellbeing in Language Model Embeddings

February 19, 2024

Uncovering Latent Human Wellbeing in LLM Embeddings

Adversarial Policies Beat Superhuman Go AIs

January 9, 2023

Beyond the Board: Exploring AI Robustness Through Go

imitation: Clean Imitation Learning Implementations

September 22, 2022

Research

Our research explores a portfolio of high-potential agendas.

Events

Our events bring together global leaders in AI.

Programs

Our programs build the field of trustworthy and secure AI.
