A Toolkit for Estimating the Safety Gap between Safety-Trained and Helpful-Only LLMs

Abstract

Open-weight AI models are typically trained to refuse harmful or inappropriate requests, but a growing body of research shows that these safeguards are brittle. We introduce an open-source toolkit that removes safety mechanisms from such models to measure the safety gap: the difference between what safety-trained models are designed to refuse and what their underlying capabilities can actually produce.