A Toolkit for Estimating the Safety Gap between Safety-Trained and Helpful-Only LLMs
May 22, 2025
Abstract
Open-weight AI models are typically trained to refuse harmful or inappropriate requests, but a growing body of research shows that these safeguards are brittle. We introduce an open-source toolkit that removes safety mechanisms from open-weight models, allowing us to estimate the safety gap: the difference between what safety-trained models are designed to refuse and what their underlying capabilities can actually produce.