Commit 4627e9b

Multi-turn Jailbreak eval results (#45)
* Multi-turn Jailbreak eval results
* Updating data description
* Change latency values to ints instead of decimals
* Data description correction
1 parent 6f5c8db commit 4627e9b

3 files changed, +17 -14 lines changed
214 KB
-205 KB
Binary file not shown.

docs/ref/checks/jailbreak.md

Lines changed: 17 additions & 14 deletions
@@ -89,37 +89,40 @@ When conversation history is available, the guardrail automatically:
 
 ### Dataset Description
 
-This benchmark evaluates model performance on a diverse set of prompts:
+This benchmark combines multiple public datasets and synthetic benign conversations:
 
-- **Subset of the open source jailbreak dataset [JailbreakV-28k](https://huggingface.co/datasets/JailbreakV-28K/JailBreakV-28k)** (n=2,000)
-- **Synthetic prompts** covering a diverse range of benign topics (n=1,000)
-- **Open source [Toxicity](https://github.com/surge-ai/toxicity/blob/main/toxicity_en.csv) dataset** containing harmful content that does not involve jailbreak attempts (n=1,000)
+- **Red Queen jailbreak corpus ([GitHub](https://github.com/kriti-hippo/red_queen/blob/main/Data/Red_Queen_Attack.zip))**: 14,000 positive samples collected with gpt-4o attacks.
+- **Tom Gibbs multi-turn jailbreak attacks ([Hugging Face](https://huggingface.co/datasets/tom-gibbs/multi-turn_jailbreak_attack_datasets/tree/main))**: 4,136 positive samples.
+- **Scale MHJ dataset ([Hugging Face](https://huggingface.co/datasets/ScaleAI/mhj))**: 537 positive samples.
+- **Synthetic benign conversations**: 12,433 negative samples generated by seeding prompts from [WildGuardMix](https://huggingface.co/datasets/allenai/wildguardmix?utm_source=chatgpt.com) where `adversarial=false` and `prompt_harm_label=false`, then expanding each single-turn input into five-turn dialogues using gpt-4.1.
 
-**Total n = 4,000; positive class prevalence = 2,000 (50.0%)**
+**Total n = 31,106; positives = 18,673; negatives = 12,433**
+
+For benchmarking, we randomly sampled 4,000 conversations from this pool using a 50/50 split between positive and negative samples.
 
 ### Results
 
 #### ROC Curve
 
-![ROC Curve](../../benchmarking/jailbreak_roc_curve.png)
+![ROC Curve](../../benchmarking/Jailbreak_roc_curves.png)
 
 #### Metrics Table
 
 | Model | ROC AUC | Prec@R=0.80 | Prec@R=0.90 | Prec@R=0.95 | Recall@FPR=0.01 |
 |--------------|---------|-------------|-------------|-------------|-----------------|
-| gpt-5 | 0.979 | 0.973 | 0.970 | 0.970 | 0.733 |
-| gpt-5-mini | 0.954 | 0.990 | 0.900 | 0.900 | 0.768 |
-| gpt-4.1 | 0.990 | 1.000 | 1.000 | 0.984 | 0.946 |
-| gpt-4.1-mini (default) | 0.982 | 0.992 | 0.992 | 0.954 | 0.444 |
+| gpt-5 | 0.994 | 0.993 | 0.993 | 0.993 | 0.997 |
+| gpt-5-mini | 0.813 | 0.832 | 0.832 | 0.832 | 0.000 |
+| gpt-4.1 | 0.999 | 0.999 | 0.999 | 0.999 | 1.000 |
+| gpt-4.1-mini (default) | 0.928 | 0.968 | 0.968 | 0.500 | 0.000 |
 
 #### Latency Performance
 
 | Model | TTC P50 (ms) | TTC P95 (ms) |
 |--------------|--------------|--------------|
-| gpt-5 | 4,569 | 7,256 |
-| gpt-5-mini | 5,019 | 9,212 |
-| gpt-4.1 | 841 | 1,861 |
-| gpt-4.1-mini | 749 | 1,291 |
+| gpt-5 | 7,370 | 12,218 |
+| gpt-5-mini | 7,055 | 11,579 |
+| gpt-4.1 | 2,998 | 4,204 |
+| gpt-4.1-mini | 1,538 | 2,089 |
 
 **Notes:**
 

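The updated dataset description says the benchmark draws 4,000 conversations from the pool with a 50/50 split between positives and negatives. The commit does not include the sampling code, so the following is only a minimal sketch of one way to do it. The function name `sample_balanced`, the fixed seed, and the placeholder conversation IDs are assumptions made for illustration; only the pool sizes (18,673 positives and 12,433 negatives) come from the diff.

```python
import random


def sample_balanced(positives, negatives, total=4000, seed=0):
    """Draw total/2 items from each class without replacement, then shuffle.

    Hypothetical helper: the actual benchmark sampling code is not shown
    in the commit.
    """
    if total % 2:
        raise ValueError("total must be even for a 50/50 split")
    per_class = total // 2
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    sample = [(conv, 1) for conv in rng.sample(positives, per_class)]
    sample += [(conv, 0) for conv in rng.sample(negatives, per_class)]
    rng.shuffle(sample)  # avoid all positives preceding all negatives
    return sample


# Placeholder IDs standing in for the pooled conversations (sizes from the diff).
positives = [f"pos-{i}" for i in range(18_673)]
negatives = [f"neg-{i}" for i in range(12_433)]
benchmark = sample_balanced(positives, negatives)
```

Sampling each class separately guarantees exactly 2,000 positives and 2,000 negatives, which a single uniform draw from the combined pool would not.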
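The metrics table in the diff reports Prec@R and Recall@FPR=0.01, but the commit does not say how they are computed. One common reading is "best precision among thresholds whose recall is at least R" and "best recall among thresholds whose false positive rate is at most 0.01". The sketch below implements that reading in plain Python. The function names and the toy labels and scores are hypothetical and are not taken from the benchmark.

```python
def _operating_points(labels, scores):
    """Return (threshold, tp, fp) per distinct score, predicting positive
    when score >= threshold. Tied scores are grouped into one point."""
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    points, tp, fp, i = [], 0, 0, 0
    while i < len(pairs):
        threshold = pairs[i][0]
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((threshold, tp, fp))
    return points


def precision_at_recall(labels, scores, min_recall):
    """Best precision over thresholds with recall >= min_recall."""
    n_pos = sum(labels)
    best = 0.0
    for _, tp, fp in _operating_points(labels, scores):
        if tp / n_pos >= min_recall:
            best = max(best, tp / (tp + fp))
    return best


def recall_at_fpr(labels, scores, max_fpr):
    """Best recall over thresholds with FPR <= max_fpr (0.0 if none qualify)."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best = 0.0
    for _, tp, fp in _operating_points(labels, scores):
        if fp / n_neg <= max_fpr:
            best = max(best, tp / n_pos)
    return best


# Toy data for illustration only.
labels = [1, 1, 1, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.7, 0.4, 0.3, 0.2, 0.1]
p_at_r = precision_at_recall(labels, scores, 0.75)
r_at_fpr = recall_at_fpr(labels, scores, 0.0)
```

Under this reading, a Recall@FPR value of 0.000 can occur when no threshold keeps the false positive rate within the limit, for example when scores take only a few distinct values. Whether that explains any entry in the table is not stated in the source.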