Commit 227ffee

Add image tags, gold patch generation, update README

1 parent a19d238 commit 227ffee

3 files changed: +140 −19 lines

README.md

Lines changed: 40 additions & 19 deletions
@@ -11,6 +11,8 @@ Code and data for the following works:
 
 ## News
 
+(2/9) We have removed some unit tests which were outdated (e.g. required the year 2025) or were previously not intended to be included.
+
 (1/7) We have fixed an issue with tutao instances where they take a long time to eval. The relevant run scripts are updated.
 
 (10/28) We added mini-swe-agent! Results are comparable to SWE-Agent for Sonnet 4.5. Feel free to give it a shot. (credit @miguelrc-scale)
@@ -31,40 +33,60 @@ from datasets import load_dataset
 swebench = load_dataset('ScaleAI/SWE-bench_Pro', split='test')
 ```
 
-## Setup
+## Installation
+
+### 1. Install Python Dependencies
+
+```bash
+pip install -r requirements.txt
+```
+
+### 2. Install Docker
+
 SWE-bench Pro uses Docker for reproducible evaluations.
-In addition, the evaluation script requires Modal to scale the evaluation set.
 
 Follow the instructions in the [Docker setup guide](https://docs.docker.com/engine/install/) to install Docker on your machine.
 If you're setting up on Linux, we recommend seeing the [post-installation steps](https://docs.docker.com/engine/install/linux-postinstall/) as well.
 
-Run the following commands to store modal credentials:
-```
-pip install modal
-modal setup # and follow the prompts to generate your token and secret
+### 3. Configure Modal (Recommended) (or use local docker [Beta])
+
+```bash
+modal setup # Follow the prompts to generate your token
 ```
 
-After running these steps, you should be able to see a token ID and secret in `~/.modal.toml`:
-EG:
+After running, verify your credentials in `~/.modal.toml`:
 ```
 token_id = <token id>
 token_secret = <token secret>
 active = true
 ```
 
-We store prebuilt Docker images for each instance. They can be found in this directory:
+Beta: Local Docker. No additional setup needed. Use the `--use_local_docker` flag when running evaluations.
 
-https://hub.docker.com/r/jefzda/sweap-images
+## Docker Images
 
-The format of the images is as follows.
+We provide prebuilt Docker images for each instance on Docker Hub:
 
-`jefzda/sweap-images:{repo_base}.{repo_name}-{repo_base}__{repo_name}-{hash}`
+**Repository:** https://hub.docker.com/r/jefzda/sweap-images
 
-For example:
+### Finding the Correct Image
 
-`jefzda/sweap-images:gravitational.teleport-gravitational__teleport-82185f232ae8974258397e121b3bc2ed0c3729ed-v626ec2a48416b10a88641359a169d99e935ff03`
+Each instance in the HuggingFace dataset has a `dockerhub_tag` column containing the Docker tag for that instance. You can access it directly:
 
-Note that bash runs by default in our images. e.g. when running these images, you should not manually envoke bash. See https://github.com/scaleapi/SWE-bench_Pro-os/issues/6
+```python
+from datasets import load_dataset
+
+dataset = load_dataset('ScaleAI/SWE-bench_Pro', split='test')
+
+# Get the Docker image for a specific instance
+for row in dataset:
+    instance_id = row['instance_id']
+    docker_tag = row['dockerhub_tag']
+    full_image = f"jefzda/sweap-images:{docker_tag}"
+    print(f"{instance_id} -> {full_image}")
+```
+
+**Important:** Bash runs by default in our images. When running these images, you should not manually invoke bash. See https://github.com/scaleapi/SWE-bench_Pro-os/issues/6
 
 ## Usage
 
@@ -113,7 +135,8 @@ This will create a JSON file in the format expected by the evaluation script:
 ```
 
 ### 3. Evaluate Patches
-Evaluate patch predictions on SWE-Bench Pro with the following command. (`swe_bench_pro_full.csv` is the CSV in the HuggingFace dataset)
+
+Evaluate patch predictions on SWE-Bench Pro:
 
 ```bash
 python swe_bench_pro_eval.py \
@@ -125,8 +148,7 @@ python swe_bench_pro_eval.py \
   --dockerhub_username=jefzda
 ```
 
-Replace gold_patches with your patch json, and point raw_sample_path to the SWE-Bench Pro CSV.
-Gold Patches can be compiled from the HuggingFace dataset.
+You can test with the gold patches, which are in the HuggingFace dataset. There is a helper script in `helper_code` which can extract the gold patches into the required JSON format.
 
 ## Reproducing Leaderboard Results
 
@@ -138,4 +160,3 @@ To reproduce leaderboard results end-to-end, follow the following steps:
 4. Run the evaluation script `swe_bench_pro_eval.py`.
 
 
-
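As the README diff notes, bash is already the default entrypoint in the instance images, so a `docker run` invocation should not append `bash`. The sketch below is a hypothetical helper (not part of this repo) that builds such a command string from a `dockerhub_tag`, using the example tag shown in the old README text:

```python
import shlex

REGISTRY_REPO = "jefzda/sweap-images"  # Docker Hub repository from the README

def docker_run_command(dockerhub_tag: str) -> str:
    """Build a `docker run` command for an instance image.

    Bash is the default entrypoint in these images, so the command
    deliberately does NOT append `bash` (see issue #6 in the repo).
    """
    image = f"{REGISTRY_REPO}:{dockerhub_tag}"
    return shlex.join(["docker", "run", "-it", "--rm", image])

# Example tag taken from the README's image-format example
tag = ("gravitational.teleport-gravitational__teleport-"
       "82185f232ae8974258397e121b3bc2ed0c3729ed-"
       "v626ec2a48416b10a88641359a169d99e935ff03")
print(docker_run_command(tag))
```

This only constructs the command; actually running it requires Docker and a pulled image.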
helper_code/extract_gold_patches.py

Lines changed: 87 additions & 0 deletions
@@ -0,0 +1,87 @@
+#!/usr/bin/env python3
+"""
+Extract gold patches from the HuggingFace SWE-bench Pro dataset.
+
+This script downloads the SWE-bench Pro dataset and extracts the gold (reference)
+patches into a JSON file format suitable for evaluation.
+
+Usage:
+    python helper_code/extract_gold_patches.py --output gold_patches.json
+
+The output JSON file has the format:
+[
+    {
+        "instance_id": "instance_...",
+        "patch": "diff --git ...",
+        "prefix": "gold"
+    },
+    ...
+]
+"""
+
+import argparse
+import json
+from datasets import load_dataset
+
+
+def main():
+    parser = argparse.ArgumentParser(
+        description="Extract gold patches from HuggingFace SWE-bench Pro dataset"
+    )
+    parser.add_argument(
+        "--output",
+        type=str,
+        default="gold_patches.json",
+        help="Output JSON file path (default: gold_patches.json)"
+    )
+    parser.add_argument(
+        "--prefix",
+        type=str,
+        default="gold",
+        help="Prefix to use for the patches (default: gold)"
+    )
+    parser.add_argument(
+        "--dataset",
+        type=str,
+        default="ScaleAI/SWE-bench_Pro",
+        help="HuggingFace dataset name (default: ScaleAI/SWE-bench_Pro)"
+    )
+    parser.add_argument(
+        "--split",
+        type=str,
+        default="test",
+        help="Dataset split to use (default: test)"
+    )
+    args = parser.parse_args()
+
+    print(f"Loading dataset: {args.dataset} (split: {args.split})")
+    dataset = load_dataset(args.dataset, split=args.split)
+
+    patches = []
+    skipped = 0
+
+    for row in dataset:
+        instance_id = row["instance_id"]
+        patch = row.get("patch") or row.get("gold_patch") or row.get("model_patch")
+
+        if not patch:
+            print(f"  Warning: No patch found for {instance_id}, skipping")
+            skipped += 1
+            continue
+
+        patches.append({
+            "instance_id": instance_id,
+            "patch": patch,
+            "prefix": args.prefix
+        })
+
+    print(f"\nExtracted {len(patches)} patches ({skipped} skipped)")
+
+    with open(args.output, "w") as f:
+        json.dump(patches, f, indent=2)
+
+    print(f"Saved to: {args.output}")
+
+
+if __name__ == "__main__":
+    main()
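Because the evaluation script consumes the JSON this helper produces, a quick sanity check of the output can catch format drift early. A minimal sketch (the validation function is hypothetical, not part of the repo), assuming the record shape given in the script's docstring:

```python
import json

def validate_gold_patches(path_or_records):
    """Check each record has exactly the keys the docstring promises."""
    records = path_or_records
    if isinstance(path_or_records, str):
        # Accept a path to a JSON file as well as an in-memory list
        with open(path_or_records) as f:
            records = json.load(f)
    for rec in records:
        assert set(rec) == {"instance_id", "patch", "prefix"}, rec
        assert isinstance(rec["patch"], str) and rec["patch"], rec
    return len(records)

# Example with an in-memory record shaped like the docstring's format
sample = [{"instance_id": "instance_x",
           "patch": "diff --git a/f b/f",
           "prefix": "gold"}]
print(validate_gold_patches(sample))  # → 1
```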

requirements.txt

Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
+# Core dependencies for SWE-bench Pro evaluation
+pandas>=1.5.0
+tqdm>=4.64.0
+datasets>=2.14.0
+
+# Modal for cloud-based evaluation (optional if using local Docker)
+modal>=0.50.0
+
+# Docker SDK for local evaluation (optional if using Modal)
+docker>=6.0.0
+
+# HuggingFace Hub for dataset management
+huggingface_hub>=0.16.0
