diff --git a/docs/benchmarking/Jailbreak_roc_curves.png b/docs/benchmarking/Jailbreak_roc_curves.png
214 KB
diff --git a/docs/benchmarking/jailbreak_roc_curve.png b/docs/benchmarking/jailbreak_roc_curve.png
-205 KB
diff --git a/docs/ref/checks/jailbreak.md b/docs/ref/checks/jailbreak.md
Lines changed: 17 additions & 14 deletions
@@ -89,37 +89,40 @@ When conversation history is available, the guardrail automatically:
 
 ### Dataset Description
 
-This benchmark evaluates model performance on a diverse set of prompts:
+This benchmark combines multiple public datasets and synthetic benign conversations:
 
-- **Subset of the open source jailbreak dataset [JailbreakV-28k](https://huggingface.co/datasets/JailbreakV-28K/JailBreakV-28k)** (n=2,000)
-- **Synthetic prompts** covering a diverse range of benign topics (n=1,000)
-- **Open source [Toxicity](https://github.com/surge-ai/toxicity/blob/main/toxicity_en.csv) dataset** containing harmful content that does not involve jailbreak attempts (n=1,000)
+- **Red Queen jailbreak corpus ([GitHub](https://github.com/kriti-hippo/red_queen/blob/main/Data/Red_Queen_Attack.zip))**: 14,000 positive samples collected with gpt-4o attacks.
+- **Tom Gibbs multi-turn jailbreak attacks ([Hugging Face](https://huggingface.co/datasets/tom-gibbs/multi-turn_jailbreak_attack_datasets/tree/main))**: 4,136 positive samples.
+- **Scale MHJ dataset ([Hugging Face](https://huggingface.co/datasets/ScaleAI/mhj))**: 537 positive samples.
+- **Synthetic benign conversations**: 12,433 negative samples generated by seeding prompts from [WildGuardMix](https://huggingface.co/datasets/allenai/wildguardmix?utm_source=chatgpt.com) where `adversarial=false` and `prompt_harm_label=false`, then expanding each single-turn input into five-turn dialogues using gpt-4.1.
 
-**Total n = 4,000; positive class prevalence = 2,000 (50.0%)**
+**Total n = 31,106; positives = 18,673; negatives = 12,433**
+
+For benchmarking, we randomly sampled 4,000 conversations from this pool using a 50/50 split between positive and negative samples.
 
 ### Results
 
 #### ROC Curve
 
-![ROC Curve](../../benchmarking/jailbreak_roc_curve.png)
+![ROC Curve](../../benchmarking/Jailbreak_roc_curves.png)
 
 #### Metrics Table
 
 | Model         | ROC AUC | Prec@R=0.80 | Prec@R=0.90 | Prec@R=0.95 | Recall@FPR=0.01 |
 |--------------|---------|-------------|-------------|-------------|-----------------|
-| gpt-5         | 0.979   | 0.973       | 0.970       | 0.970       | 0.733           |
-| gpt-5-mini    | 0.954   | 0.990       | 0.900       | 0.900       | 0.768           |
-| gpt-4.1       | 0.990   | 1.000       | 1.000       | 0.984       | 0.946           |
-| gpt-4.1-mini (default) | 0.982   | 0.992       | 0.992       | 0.954       | 0.444           |
+| gpt-5         | 0.994   | 0.993       | 0.993       | 0.993       | 0.997           |
+| gpt-5-mini    | 0.813   | 0.832       | 0.832       | 0.832       | 0.000           |
+| gpt-4.1       | 0.999   | 0.999       | 0.999       | 0.999       | 1.000           |
+| gpt-4.1-mini (default) | 0.928   | 0.968       | 0.968       | 0.500       | 0.000           |
 
 #### Latency Performance
 
 | Model         | TTC P50 (ms) | TTC P95 (ms) |
 |--------------|--------------|--------------|
-| gpt-5         | 4,569        | 7,256        |
-| gpt-5-mini    | 5,019        | 9,212        |
-| gpt-4.1       | 841          | 1,861        |
-| gpt-4.1-mini  | 749          | 1,291        |
+| gpt-5         | 7,370        | 12,218       |
+| gpt-5-mini    | 7,055        | 11,579       |
+| gpt-4.1       | 2,998        | 4,204        |
+| gpt-4.1-mini  | 1,538        | 2,089        |
 
 **Notes:**