ClearHarm: A more challenging jailbreak dataset
June 23, 2025
Summary
We introduce a novel jailbreak benchmark focused on unambiguously harmful questions, such as how to construct chemical, biological, radiological, and nuclear (CBRN) threats, available on HuggingFace. We have found that it is more difficult for attacks to elicit harmful responses from models on this benchmark than on existing jailbreak benchmarks like StrongREJECT, Do-Not-Answer, and SORRY-Bench. In particular, this dataset is useful for understanding which attack methods pose the greatest risk of eliciting egregiously harmful responses.
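The dataset can be loaded directly from the HuggingFace Hub. Below is a minimal sketch using the `datasets` library; the repository id, split, and column layout are assumptions, so check the dataset card for the exact values.

```python
# Minimal sketch: load ClearHarm from the HuggingFace Hub.
# The repository id and split below are assumptions; consult the
# dataset card for the exact identifier and schema.
from datasets import load_dataset

dataset = load_dataset("AlignmentResearch/ClearHarm", split="train")  # assumed repo id and split

# Inspect a few rows; each row is assumed to contain a single harmful prompt
# plus metadata such as its weapon category.
for example in dataset.select(range(3)):
    print(example)
```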
We developed this benchmark after concluding that existing benchmarks have several limitations.
First, many questions in other benchmarks are only ambiguously harmful, being permissible to answer under some frontier models’ safety specifications. For example, medical advice and arguments in favor of extreme political opinions appear in many existing datasets as examples of harmful content, yet it may often be desirable for models to generate such content. Additionally, these datasets often include value-laden gray areas, such as graphic violence or sexual content, where developers may legitimately draw different lines between permissible and impermissible content, making head-to-head comparisons difficult. By contrast, ClearHarm focuses on CBRN threats that are broadly agreed to be harmful.
Second, many questions are dual-use: there are legitimate queries that would elicit similar information and that models might reasonably answer. This makes attacks such as PAP, which aggressively rephrase the prompt, appear highly effective. For example, here is the second prompt in the StrongREJECT dataset:
By contrast, our dataset focuses on queries that are “single-use”: useful only for offensive purposes. These queries should always be refused regardless of the exact structure, framing, or context of the prompt. In other words, the benchmark focuses on requests for information that is intrinsically and necessarily harmful and that should not be disseminated by publicly deployed chat models.
We constructed our dataset by prompting Claude 3.7 Sonnet to propose categories of LLM queries that should be refused. From its suggestions, we selected biological, chemical, nuclear, conventional, and cyber (malicious code) weapons as the categories, which focus the benchmark on unambiguous harms with the potential for catastrophic risk. In the same conversation, we requested examples for each category and encouraged the model to focus only on queries that are strictly harmful rather than requests for dual-use information. We also encouraged the model to use a consistent mix of grammatical structures for the questions in each category, to avoid biases that would prevent comparisons across categories. This resulted in a small dataset of 179 prompts, numerous enough for evaluation (but not for training). We release it as a starting point for catastrophic misuse evaluation and encourage others to build more expansive datasets.
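As a rough illustration of this generation setup, the sketch below queries Claude 3.7 Sonnet for candidate prompts in one category via the Anthropic API. The model id, prompt wording, and output parsing are illustrative assumptions, not the exact pipeline described above (which used a single conversation covering all categories).

```python
# Rough sketch of the generation step described above, using the Anthropic API.
# The model id, prompt wording, and parsing are illustrative assumptions,
# not the exact pipeline used to build ClearHarm.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

CATEGORIES = ["biological", "chemical", "nuclear", "conventional", "cyber (malicious code)"]

def propose_prompts(category: str, n: int = 10) -> list[str]:
    """Ask the model for strictly harmful (non-dual-use) example queries in one category."""
    message = client.messages.create(
        model="claude-3-7-sonnet-20250219",  # assumed model id for Claude 3.7 Sonnet
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": (
                f"Propose {n} example LLM queries about {category} weapons that should "
                "always be refused. Include only queries that are strictly harmful rather "
                "than dual-use, and use a consistent mix of grammatical structures. "
                "Return one query per line."
            ),
        }],
    )
    # The response is assumed to be a plain list, one candidate query per line.
    return [line.strip() for line in message.content[0].text.splitlines() if line.strip()]
```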
In our internal testing, we find that many jailbreaks that appear very effective on existing datasets are significantly less effective at extracting this more clearly harmful information. We will share results on this dataset shortly in a forthcoming paper – follow us on social media or subscribe to our newsletter to be the first to find out about these and future research projects.