
Commit d1e23ba

Merge remote-tracking branch 'origin/main' into hitl_agent

2 parents: e5abb91 + 31558a3

31 files changed (+1322, -462 lines)

.github/workflows/tests.yml

Lines changed: 1 addition & 1 deletion

@@ -33,7 +33,7 @@ jobs:
       - name: Test with pytest - PR
         if: github.event_name == 'pull_request'
         run: |
-          DEBUG_GYM_DEBUG=1 pytest -vv -n 16 -k "not test_swe_bench" --cov=debug_gym --cov-report=term-missing --cov-fail-under=80 --timeout=300
+          DEBUG_GYM_DEBUG=1 pytest -vv -n 16 -k "not test_swe_bench and not test_swe_smith" --cov=debug_gym --cov-report=term-missing --cov-fail-under=80 --timeout=300
       - name: Test with pytest
         if: github.event_name != 'pull_request'
         run: |

CHANGELOG.md

Lines changed: 5 additions & 1 deletion

@@ -18,4 +18,8 @@ Added in the [analysis](https://github.com/microsoft/debug-gym/tree/main/analysi

 ### 2025-05-28

-Improved the View tool, added the `start` and `end` arguments so the agent can specify a particular chunk of code to view.
+Improved the View tool, added the `start` and `end` arguments so the agent can specify a particular chunk of code to view.
+
+### 2025-06-11
+
+Added support for [SWE-smith](https://swesmith.com/). Users can use the tasks shipped with the official SWE-smith package, or customized tasks generated using SWE-smith.
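For illustration, a call to the updated View tool could look like the sketch below. Only the `start` and `end` arguments are confirmed by the changelog entry; the `path` argument name and the file used are assumptions, and the `ToolCall` constructor follows the usage shown in solution_agent.py further down this diff.

    # Sketch only: build a View tool call with the new start/end arguments.
    # The "path" key and the file name are hypothetical.
    from debug_gym.gym.tools.tool import ToolCall

    view_call = ToolCall(
        name="view",
        id="view",
        arguments={"path": "src/example.py", "start": 10, "end": 40},
    )
    # On an initialized environment, this would be executed with env.step(view_call).
    print(view_call)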

MANIFEST.in

Lines changed: 1 addition & 0 deletions

@@ -0,0 +1 @@
+include debug_gym/envs/configs/*.yaml
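The new MANIFEST.in entry ships the YAML files under debug_gym/envs/configs with the source distribution. As a hedged sketch (how debug-gym actually locates these configs is not shown in this commit), packaged YAML files could be enumerated at runtime like this:

    # Sketch only: list YAML configs installed as package data, assuming the
    # installed layout mirrors the MANIFEST.in pattern debug_gym/envs/configs/*.yaml.
    from importlib.resources import files

    config_dir = files("debug_gym") / "envs" / "configs"
    yaml_names = [entry.name for entry in config_dir.iterdir() if entry.name.endswith(".yaml")]
    print(yaml_names)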

README.md

Lines changed: 20 additions & 5 deletions

@@ -98,6 +98,7 @@ We provide the below LLM-based agents, they all have minimal design and serve th
 | `debug_agent` | `pdb`, `rewrite`, `view`, `eval` | A minimal agent that dumps all available information into its prompt and queries the LLM to generate a command. |
 | `rewrite_agent` | `rewrite`, `view`, `eval` | A `debug_agent` but `pdb` tool is disabled (an agent keeps rewriting). |
 | `debug_5_agent` | `pdb`, `rewrite`, `view`, `eval` | A `debug_agent`, but `pdb` tool is only enabled after certain amount of rewrites. |
+| `solution_agent` | `pdb`, `eval` | An oracle agent that applies a gold patch (only works with `swebench` and `swesmith` benchmarks for now). The agent checks that the tests fail before applying the patch and pass after. It also checks that the `pdb` tool can be used as expected. |

 ---

@@ -109,6 +110,7 @@ To demonstrate how to integrate `debug-gym` with coding tasks and repositories,
 | :-: | :----- |
 | `aider` | [https://github.com/Aider-AI/aider](https://github.com/Aider-AI/aider) |
 | `swebench`| [https://github.com/princeton-nlp/SWE-bench](https://github.com/princeton-nlp/SWE-bench) |
+| `swesmith`| [https://github.com/SWE-bench/SWE-smith](https://github.com/SWE-bench/SWE-smith) |
 | `mini_nightmare` | A set of 10 hand-crafted minimal buggy code snippet where rewrite only agents have harder time to tackle. Read details [here](https://github.com/microsoft/debug-gym/blob/main/data/mini_nightmare/mini_nightmare.md). |

 ---
@@ -122,28 +124,41 @@ Add `-v`, `--debug` to be verbose, or to enter debug mode.
 > [!WARNING]
 > When using --debug, you will need to press `c` to continue after each reasoning step.

-#### 3.1 Human Mode
+#### 3.1 Sanity Checks
+
+You can use the `solution_agent` to validate that your `swebench` and `swesmith` instances work as expected. This agent applies a gold patch to the buggy code and checks that the tests fail before the patch is applied and pass afterwards. It also checks that the `pdb` tool can be used as expected.
+
+python scripts/run.py scripts/config_swebench.yaml --agent solution_agent
+python scripts/run.py scripts/config_swesmith.yaml --agent solution_agent
+
+#### 3.2 Human Mode

 We provide a human mode that enables developers to manually interact with `debug-gym`. To activate this mode, change the `llm_name` field in the `config_*.yaml` to be `"human"`. Once activated, at every step, the environment will expect a command input (in tool calling format). One can use the `Tab` key to get a list of tool calling templates and fill in any necessary arguments.

-#### 3.2. Overriding Values in Config
+#### 3.3. Overriding Values in Config

 `-p` is a handy way to override values defined in config. For example, the below command will run rewrite_agent agent on Aider with human mode (while in config file it specifies gpt-4o).

 python scripts/run.py scripts/config_aider.yaml --agent rewrite_agent -v -p rewrite_agent.llm_name="human"

-#### 3.3. Debugging a Custom Repository
+#### 3.4. Debugging a Custom Repository

 Modify `scripts/config.yaml`, especially the `env_kwargs` to set the path and entrypoint of the custom repository. We assume there is a `.debugignore` file and a `.debugreadonly` within the repository that labels files/folders that are not seen or not editable, respectively.

 As an example, we provide a buggy pytorch code repository in `data/pytorch`.

 python scripts/run.py scripts/config.yaml --agent <agent name>

-#### 3.4. Design Your Own Tool
+#### 3.5. Debugging a Custom SWE-Smith Instance
+
+[SWE-Smith](https://github.com/SWE-bench/SWE-smith) makes it possible to generate new buggy code instances. Given a custom HuggingFace dataset (either local or remote) with a structure similar to [SWE-bench/SWE-smith](https://huggingface.co/datasets/SWE-bench/SWE-smith), you can pass `-p base.env_kwargs.dataset_id=<dataset_id>` on the command line to run the agent on that dataset. For example, to run on a local dataset:
+
+python scripts/run.py scripts/config_swesmith.yaml --agent <agent name> -p base.env_kwargs.dataset_id="path/to/local/dataset"
+
+#### 3.6. Design Your Own Tool
 `debug-gym`'s modular design makes it extensible. Users are encouraged to extend `debug-gym` to their specific usecases, for example by creating new tools that diversify an agent's action and observation spaces. For detailed instruction on designing new tools that are `debug-gym`-compatible, please refer to the [Technical Report](https://arxiv.org/abs/2503.21557).

-#### 3.5. Analysis and Visualization
+#### 3.7. Analysis and Visualization

 We provide a set of scripts to help analyze the log files (e.g., the `.jsonl` files) generated by the agent.
 - In the `analysis` folder, we provide scripts that used to generate the corresponding figures in our technical report.
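To make the new README section 3.5 concrete, a hedged sketch of preparing a local dataset follows. The column names below are assumptions for illustration only; consult the official SWE-bench/SWE-smith dataset on Hugging Face for the schema SWESmithEnv actually expects.

    # Sketch only: materialize a tiny local dataset in a SWE-smith-like layout,
    # then point the runner at it via -p base.env_kwargs.dataset_id=... as above.
    from datasets import Dataset

    rows = [
        {
            "instance_id": "example__repo.func_bug_0001",  # hypothetical identifier
            "patch": "diff --git a/foo.py b/foo.py\n...",  # gold patch placeholder
        }
    ]
    Dataset.from_list(rows).save_to_disk("path/to/local/dataset")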

debug_gym/agents/__init__.py

Lines changed: 1 addition & 0 deletions

@@ -1,3 +1,4 @@
 from debug_gym.agents.debug_agent import Debug_5_Agent, DebugAgent
 from debug_gym.agents.guided_agent import GuidedRewriteAgent
 from debug_gym.agents.rewrite_agent import RewriteAgent
+from debug_gym.agents.solution_agent import AgentSolution

debug_gym/agents/base_agent.py

Lines changed: 5 additions & 3 deletions

@@ -10,7 +10,7 @@
 from debug_gym.agents.history_tracker import HistoryTracker, build_history_prompt
 from debug_gym.agents.utils import trim
 from debug_gym.gym.envs.env import RepoEnv
-from debug_gym.gym.utils import unescape
+from debug_gym.gym.utils import filter_non_utf8
 from debug_gym.llms.base import LLM
 from debug_gym.logger import DebugGymLogger

@@ -74,7 +74,7 @@ def parse_reasoning_model_response(self, response, reasoning_end_token):

     def build_system_prompt(self, info):
         def calc_tokens_left(system_prompt: dict):
-            system_prompt = unescape(
+            system_prompt = filter_non_utf8(
                 json.dumps(system_prompt, indent=2, sort_keys=False)
             )
             return self.llm.context_length - self.llm.count_tokens(system_prompt)
@@ -129,7 +129,9 @@ def calc_tokens_left(system_prompt: dict):
         if len(shortcut_features) > 0:
             system_prompt["Shortcut features"] = shortcut_features

-        system_prompt = unescape(json.dumps(system_prompt, indent=2, sort_keys=False))
+        system_prompt = filter_non_utf8(
+            json.dumps(system_prompt, indent=2, sort_keys=False)
+        )
         messages = [
             {
                 "role": "system",

debug_gym/agents/solution_agent.py

Lines changed: 69 additions & 0 deletions

@@ -0,0 +1,69 @@
+import subprocess
+
+from debug_gym.agents.base_agent import BaseAgent, register_agent
+from debug_gym.gym.envs.swe_bench import SWEBenchEnv
+from debug_gym.gym.envs.swe_smith import SWESmithEnv
+from debug_gym.gym.tools.tool import ToolCall
+
+
+@register_agent
+class AgentSolution(BaseAgent):
+    name: str = "solution_agent"
+
+    def run(self, task_name=None, debug=False):
+        self.history.reset()
+
+        info = self.env.reset(options={"task_name": task_name})
+        self.history.step(info)
+
+        if info.done is True:
+            return True
+
+        self.logger.info(
+            f"Score: {info.score}/{info.max_score} ({info.score/info.max_score:.1%})"
+        )
+
+        # Make a simple pdb call to make sure it is working.
+        action = ToolCall(name="pdb", id="pdb", arguments={"command": "help help"})
+        pdb_help_info = self.env.step(action)
+        assert (
+            "h(elp)" in pdb_help_info.step_observation.observation
+        ), f"PDB command did not return expected help message.\n{pdb_help_info.step_observation.observation}"
+
+        # Send a pdb continue command, and check the output matches the one from env.reset.
+        action = ToolCall(name="pdb", id="pdb", arguments={"command": "continue"})
+        pdb_continue_info = self.env.step(action)
+
+        assert (
+            "Reached the end of the program. Restarting the debugging session."
+            in pdb_continue_info.step_observation.observation
+        ) or (
+            info.step_observation.observation.splitlines()[-1]
+            in pdb_continue_info.step_observation.observation
+        ), f"PDB command did not return expected continue message.\n{pdb_continue_info.step_observation.observation}"
+
+        try:
+            self.env.apply_gold_patch()
+        except NotImplementedError as e:
+            self.logger.error(
+                f"The environment {type(self.env)} is not compatible with SolutionAgent. "
+                "Check the README.md to see which environments are compatible."
+            )
+            raise
+
+        if debug:
+            breakpoint()
+
+        action = ToolCall(name="eval", id="eval", arguments={})
+        info = self.env.step(action)
+
+        self.history.step(info)
+
+        self.logger.info(
+            f"Score: {info.score}/{info.max_score} ({info.score/info.max_score:.1%})"
+        )
+        assert (
+            info.done
+        ), f"The task is not done after applying the gold patch.\n{info.step_observation.observation}"
+
+        return info.done
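The agent registers itself through the `register_agent` decorator imported from `base_agent`, whose implementation is not part of this diff. A plausible sketch, assuming it keeps a module-level registry keyed by each agent's `name` attribute so that `--agent solution_agent` can resolve to `AgentSolution`:

    # Sketch only: a name-keyed agent registry; the real register_agent in
    # debug_gym.agents.base_agent may work differently.
    AGENT_REGISTRY: dict[str, type] = {}

    def register_agent(cls):
        AGENT_REGISTRY[cls.name] = cls
        return cls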

debug_gym/agents/utils.py

Lines changed: 4 additions & 4 deletions

@@ -104,14 +104,14 @@ def load_config():
         "--agent",
     )
     parser.add_argument(
-        "--list",
+        "--debug",
         action="store_true",
-        help="List available agents and problems.",
+        help="Break before sending action to the environment.",
     )
     parser.add_argument(
-        "--debug",
+        "--list",
         action="store_true",
-        help="Break before sending action to the environment.",
+        help="List available agents and problems.",
     )
     group = parser.add_mutually_exclusive_group()
     group.add_argument(

debug_gym/gym/envs/__init__.py

Lines changed: 4 additions & 2 deletions

@@ -2,18 +2,20 @@
 from debug_gym.gym.envs.env import RepoEnv, TooledEnv
 from debug_gym.gym.envs.mini_nightmare import MiniNightmareEnv
 from debug_gym.gym.envs.swe_bench import SWEBenchEnv
+from debug_gym.gym.envs.swe_smith import SWESmithEnv


-def select_env(env_type: str = None):
+def select_env(env_type: str = None) -> type[RepoEnv]:
     match env_type:
         case None:
             return RepoEnv
         case "aider":
             return AiderBenchmarkEnv
         case "swebench":
             return SWEBenchEnv
+        case "swesmith":
+            return SWESmithEnv
         case "mini_nightmare":
             return MiniNightmareEnv
         case _:
             raise ValueError(f"Unknown benchmark {env_type}")
-    return env_class
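A short usage sketch of the updated `select_env`; the `swesmith` branch and the `type[RepoEnv]` return annotation are the new parts. Constructor arguments for the returned class are omitted since they depend on the environment.

    # Resolve an environment class by benchmark name.
    from debug_gym.gym.envs import select_env

    env_class = select_env("swesmith")  # returns SWESmithEnv
    default_class = select_env()        # returns RepoEnv when no benchmark is given
    print(env_class.__name__, default_class.__name__)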

debug_gym/gym/envs/aider.py

Lines changed: 1 addition & 1 deletion

@@ -66,7 +66,7 @@ def load_dataset(self):
         utils.create_ignore_file(
             directory / ".debugignore",
             patterns=[
-                ".*/",
+                ".?*",  # Ignore hidden files and directories but not current dir "."
                 "__pycache__/",
                 "*.pyc",
                 # "*.md",
