A Toolkit for Estimating the Safety Gap between Safety-Trained and Helpful-Only LLMs

Abstract

Open-weight AI models are typically trained to refuse harmful or inappropriate requests, but a growing body of research shows that these safeguards are brittle. We introduce an open-source toolkit that removes safety mechanisms from such models to measure the safety gap: the difference between what safety-trained models are designed to refuse and what their underlying capabilities can actually produce.