Commit 227ffee

Add image tags, gold patch generation, update README

1 parent a19d238 commit 227ffee

3 files changed: +140 −19 lines

README.md

Lines changed: 40 additions & 19 deletions
@@ -11,6 +11,8 @@ Code and data for the following works:
 
 ## News
 
+(2/9) We have removed some unit tests which were outdated (e.g. required the year 2025) or were previously not intended to be included.
+
 (1/7) We have fixed an issue with tutao instances where they take a long time to eval. The relevant run scripts are updated.
 
 (10/28) We added mini-swe-agent! Results are comparable to SWE-Agent for Sonnet 4.5. Feel free to give it a shot. (credit @miguelrc-scale)
@@ -31,40 +33,60 @@ from datasets import load_dataset
 swebench = load_dataset('ScaleAI/SWE-bench_Pro', split='test')
 ```
 
-## Setup
+## Installation
+
+### 1. Install Python Dependencies
+
+```bash
+pip install -r requirements.txt
+```
+
+### 2. Install Docker
+
 SWE-bench Pro uses Docker for reproducible evaluations.
-In addition, the evaluation script requires Modal to scale the evaluation set.
 
 Follow the instructions in the [Docker setup guide](https://docs.docker.com/engine/install/) to install Docker on your machine.
 If you're setting up on Linux, we recommend seeing the [post-installation steps](https://docs.docker.com/engine/install/linux-postinstall/) as well.
 
-Run the following commands to store modal credentials:
-```
-pip install modal
-modal setup # and follow the prompts to generate your token and secret
+### 3. Configure Modal (Recommended) (or use local docker [Beta])
+
+```bash
+modal setup # Follow the prompts to generate your token
 ```
 
-After running these steps, you should be able to see a token ID and secret in `~/.modal.toml`:
-EG:
+After running, verify your credentials in `~/.modal.toml`:
 ```
 token_id = <token id>
 token_secret = <token secret>
 active = true
 ```
 
-We store prebuilt Docker images for each instance. They can be found in this directory:
+Beta: Local Docker. No additional setup needed. Use the `--use_local_docker` flag when running evaluations.
 
-https://hub.docker.com/r/jefzda/sweap-images
+## Docker Images
 
-The format of the images is as follows.
+We provide prebuilt Docker images for each instance on Docker Hub:
 
-`jefzda/sweap-images:{repo_base}.{repo_name}-{repo_base}__{repo_name}-{hash}`
+**Repository:** https://hub.docker.com/r/jefzda/sweap-images
 
-For example:
+### Finding the Correct Image
 
-`jefzda/sweap-images:gravitational.teleport-gravitational__teleport-82185f232ae8974258397e121b3bc2ed0c3729ed-v626ec2a48416b10a88641359a169d99e935ff03`
+Each instance in the HuggingFace dataset has a `dockerhub_tag` column containing the Docker tag for that instance. You can access it directly:
 
-Note that bash runs by default in our images. e.g. when running these images, you should not manually envoke bash. See https://github.com/scaleapi/SWE-bench_Pro-os/issues/6
+```python
+from datasets import load_dataset
+
+dataset = load_dataset('ScaleAI/SWE-bench_Pro', split='test')
+
+# Get the Docker image for a specific instance
+for row in dataset:
+    instance_id = row['instance_id']
+    docker_tag = row['dockerhub_tag']
+    full_image = f"jefzda/sweap-images:{docker_tag}"
+    print(f"{instance_id} -> {full_image}")
+```
+
+**Important:** Bash runs by default in our images. When running these images, you should not manually invoke bash. See https://github.com/scaleapi/SWE-bench_Pro-os/issues/6
 
 ## Usage
 
@@ -113,7 +135,8 @@ This will create a JSON file in the format expected by the evaluation script:
 ```
 
 ### 3. Evaluate Patches
-Evaluate patch predictions on SWE-Bench Pro with the following command. (`swe_bench_pro_full.csv` is the CSV in the HuggingFace dataset)
+
+Evaluate patch predictions on SWE-Bench Pro:
 
 ```bash
 python swe_bench_pro_eval.py \
@@ -125,8 +148,7 @@ python swe_bench_pro_eval.py \
   --dockerhub_username=jefzda
 ```
 
-Replace gold_patches with your patch json, and point raw_sample_path to the SWE-Bench Pro CSV.
-Gold Patches can be compiled from the HuggingFace dataset.
+You can test with the gold patches, which are in the HuggingFace dataset. There is a helper script in `helper_code` which can extract the gold patches into the required JSON format.
 
 ## Reproducing Leaderboard Results
 
@@ -138,4 +160,3 @@ To reproduce leaderboard results end-to-end, follow the following steps:
 4. Run the evaluation script `swe_bench_pro_eval.py`.
 
 
-
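As the README diff notes, bash is already the default entrypoint in the instance images, so a `docker run` invocation should not append `bash`. The sketch below is a hypothetical helper (not part of this repo) that builds such a command string from a `dockerhub_tag`, using the example tag shown in the old README text:

```python
import shlex

REGISTRY_REPO = "jefzda/sweap-images"  # Docker Hub repository from the README

def docker_run_command(dockerhub_tag: str) -> str:
    """Build a `docker run` command for an instance image.

    Bash is the default entrypoint in these images, so the command
    deliberately does NOT append `bash` (see issue #6 in the repo).
    """
    image = f"{REGISTRY_REPO}:{dockerhub_tag}"
    return shlex.join(["docker", "run", "-it", "--rm", image])

# Example tag taken from the README's image-format example
tag = ("gravitational.teleport-gravitational__teleport-"
       "82185f232ae8974258397e121b3bc2ed0c3729ed-"
       "v626ec2a48416b10a88641359a169d99e935ff03")
print(docker_run_command(tag))
```

This only constructs the command; actually running it requires Docker and a pulled image.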
helper_code/extract_gold_patches.py

Lines changed: 87 additions & 0 deletions
@@ -0,0 +1,87 @@
+#!/usr/bin/env python3
+"""
+Extract gold patches from the HuggingFace SWE-bench Pro dataset.
+
+This script downloads the SWE-bench Pro dataset and extracts the gold (reference)
+patches into a JSON file format suitable for evaluation.
+
+Usage:
+    python helper_code/extract_gold_patches.py --output gold_patches.json
+
+The output JSON file has the format:
+[
+    {
+        "instance_id": "instance_...",
+        "patch": "diff --git ...",
+        "prefix": "gold"
+    },
+    ...
+]
+"""
+
+import argparse
+import json
+from datasets import load_dataset
+
+
+def main():
+    parser = argparse.ArgumentParser(
+        description="Extract gold patches from HuggingFace SWE-bench Pro dataset"
+    )
+    parser.add_argument(
+        "--output",
+        type=str,
+        default="gold_patches.json",
+        help="Output JSON file path (default: gold_patches.json)"
+    )
+    parser.add_argument(
+        "--prefix",
+        type=str,
+        default="gold",
+        help="Prefix to use for the patches (default: gold)"
+    )
+    parser.add_argument(
+        "--dataset",
+        type=str,
+        default="ScaleAI/SWE-bench_Pro",
+        help="HuggingFace dataset name (default: ScaleAI/SWE-bench_Pro)"
+    )
+    parser.add_argument(
+        "--split",
+        type=str,
+        default="test",
+        help="Dataset split to use (default: test)"
+    )
+    args = parser.parse_args()
+
+    print(f"Loading dataset: {args.dataset} (split: {args.split})")
+    dataset = load_dataset(args.dataset, split=args.split)
+
+    patches = []
+    skipped = 0
+
+    for row in dataset:
+        instance_id = row["instance_id"]
+        patch = row.get("patch") or row.get("gold_patch") or row.get("model_patch")
+
+        if not patch:
+            print(f"  Warning: No patch found for {instance_id}, skipping")
+            skipped += 1
+            continue
+
+        patches.append({
+            "instance_id": instance_id,
+            "patch": patch,
+            "prefix": args.prefix
+        })
+
+    print(f"\nExtracted {len(patches)} patches ({skipped} skipped)")
+
+    with open(args.output, "w") as f:
+        json.dump(patches, f, indent=2)
+
+    print(f"Saved to: {args.output}")
+
+
+if __name__ == "__main__":
+    main()
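Because the evaluation script consumes the JSON this helper produces, a quick sanity check of the output can catch format drift early. A minimal sketch (the validation function is hypothetical, not part of the repo), assuming the record shape given in the script's docstring:

```python
import json

def validate_gold_patches(path_or_records):
    """Check each record has exactly the keys the docstring promises."""
    records = path_or_records
    if isinstance(path_or_records, str):
        # Accept a path to a JSON file as well as an in-memory list
        with open(path_or_records) as f:
            records = json.load(f)
    for rec in records:
        assert set(rec) == {"instance_id", "patch", "prefix"}, rec
        assert isinstance(rec["patch"], str) and rec["patch"], rec
    return len(records)

# Example with an in-memory record shaped like the docstring's format
sample = [{"instance_id": "instance_x",
           "patch": "diff --git a/f b/f",
           "prefix": "gold"}]
print(validate_gold_patches(sample))  # → 1
```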

requirements.txt

Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
+# Core dependencies for SWE-bench Pro evaluation
+pandas>=1.5.0
+tqdm>=4.64.0
+datasets>=2.14.0
+
+# Modal for cloud-based evaluation (optional if using local Docker)
+modal>=0.50.0
+
+# Docker SDK for local evaluation (optional if using Modal)
+docker>=6.0.0
+
+# HuggingFace Hub for dataset management
+huggingface_hub>=0.16.0
