Model Evaluation

Testing frontier models to uncover new risks and highlight security issues.

Evaluations, or model tests, identify risks and assess whether a system can operate safely in real-world scenarios. Leveraging our research experience, FAR.AI tests frontier models to uncover new risks and highlight security issues, enabling developers to deploy appropriate mitigations for currently fielded systems. We track trends across frontier models to identify which problems will grow more severe over time and demand urgent attention. We also develop metrics and benchmarks for reliability and security, providing clear targets for researchers and improving the transparency of future testing.