Adam Gleave

Co-founder & CEO

Adam Gleave is the co-founder and CEO of FAR.AI. He completed his PhD in artificial intelligence (AI) at UC Berkeley, advised by Stuart Russell. His goal is to develop the techniques necessary for advanced automated systems to verifiably act according to human preferences, even in situations their designers did not anticipate. He is particularly interested in improving methods for value learning and the robustness of deep RL. For more information, visit his website.

News & Publications

News

Illusory Safety: Redteaming DeepSeek R1 and the Strongest Fine-Tunable Models of OpenAI, Anthropic, and Google (February 4, 2025)

GPT-4o Guardrails Gone: Data Poisoning & Jailbreak-Tuning (October 31, 2024)
Related: Data Poisoning in LLMs: Jailbreak-Tuning and Scaling Laws

Does Robustness Improve with Scale? (July 23, 2024)
Related: Exploring Scaling Trends in LLM Robustness

Beyond the Board: Exploring AI Robustness Through Go (December 21, 2023)
Related: Can Go AIs be adversarially robust?

Pacing Outside the Box: RNNs Learn to Plan in Sokoban (July 24, 2024)
Related: Planning behavior in a recurrent neural network that plays Sokoban

We Found Exploits in GPT-4’s Fine-tuning & Assistants APIs (December 21, 2023)
Related: Exploiting Novel GPT-4 APIs

Even Superhuman Go AIs Have Surprising Failure Modes (July 15, 2023)
Related: Adversarial Policies Beat Superhuman Go AIs

AI Safety in a World of Vulnerable Machine Learning Systems (March 5, 2023)
Publications

Illusory Safety: Redteaming DeepSeek R1 and the Strongest Fine-Tunable Models of OpenAI, Anthropic, and Google (February 4, 2025)

AI Companies Should Report Pre- and Post-Mitigation Safety Evaluations (March 17, 2025)

Multi-Agent Risks from Advanced AI (February 19, 2025)

Data Poisoning in LLMs: Jailbreak-Tuning and Scaling Laws (August 6, 2024)
Related: GPT-4o Guardrails Gone: Data Poisoning & Jailbreak-Tuning

Exploring Scaling Trends in LLM Robustness (July 26, 2024)
Related: Does Robustness Improve with Scale?

Planning behavior in a recurrent neural network that plays Sokoban (July 22, 2024)
Related: Pacing Outside the Box: RNNs Learn to Plan in Sokoban

Can Go AIs be adversarially robust? (June 18, 2024)
Related: Even Superhuman Go AIs Have Surprising Failure Modes

STARC: A General Framework For Quantifying Differences Between Reward Functions (April 8, 2024)

Exploiting Novel GPT-4 APIs (December 21, 2023)
Related: We Found Exploits in GPT-4’s Fine-tuning & Assistants APIs

Uncovering Latent Human Wellbeing in Language Model Embeddings (February 19, 2024)
Related: Uncovering Latent Human Wellbeing in LLM Embeddings

Adversarial Policies Beat Superhuman Go AIs (January 9, 2023)
Related: Beyond the Board: Exploring AI Robustness Through Go

imitation: Clean Imitation Learning Implementations (September 22, 2022)