CHANGELOG.md: 5 additions & 1 deletion
@@ -18,4 +18,8 @@ Added in the [analysis](https://github.com/microsoft/debug-gym/tree/main/analysi
### 2025-05-28

-Improved the View tool, added the `start` and `end` arguments so the agent can specify a particular chunk of code to view.
+Improved the View tool, added the `start` and `end` arguments so the agent can specify a particular chunk of code to view.
+
+### 2025-06-11
+
+Added support for [SWE-smith](https://swesmith.com/). Users can use the tasks shipped with the official SWE-smith package, or customized tasks generated using SWE-smith.
README.md: 20 additions & 5 deletions
@@ -98,6 +98,7 @@ We provide the below LLM-based agents, they all have minimal design and serve th
|`debug_agent`|`pdb`, `rewrite`, `view`, `eval`| A minimal agent that dumps all available information into its prompt and queries the LLM to generate a command. |
|`rewrite_agent`|`rewrite`, `view`, `eval`| A `debug_agent` with the `pdb` tool disabled (the agent can only keep rewriting). |
|`debug_5_agent`|`pdb`, `rewrite`, `view`, `eval`| A `debug_agent`, but the `pdb` tool is only enabled after a certain number of rewrites. |
+|`solution_agent`|`pdb`, `eval`| An oracle agent that applies a gold patch (only works with `swebench` and `swesmith` benchmarks for now). The agent checks that tests are failing before applying the patch, and passing after. It also checks that the `pdb` tool can be used as expected. |

---
@@ -109,6 +110,7 @@ To demonstrate how to integrate `debug-gym` with coding tasks and repositories,
|`mini_nightmare`| A set of 10 hand-crafted minimal buggy code snippets that rewrite-only agents have a harder time tackling. Read details [here](https://github.com/microsoft/debug-gym/blob/main/data/mini_nightmare/mini_nightmare.md). |

---
@@ -122,28 +124,41 @@ Add `-v`, `--debug` to be verbose, or to enter debug mode.
> [!WARNING]
> When using `--debug`, you will need to press `c` to continue after each reasoning step.

-#### 3.1 Human Mode
+#### 3.1 Sanity Checks
+
+Use the `solution_agent` to validate that your `swebench` and `swesmith` instances work as expected. This agent applies a gold patch to the buggy code and checks that the tests fail before the patch is applied and pass afterwards. It also checks that the `pdb` tool can be used as expected.
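
The fail-before, pass-after check described above can be sketched roughly as follows. This is a minimal illustration only, not the actual `solution_agent` implementation: `sanity_check`, `run_tests`, and `apply_gold_patch` are hypothetical names, the "task instance" is a toy dictionary, and the `pdb` check is left out.

```python
from typing import Callable


def sanity_check(
    run_tests: Callable[[], bool],
    apply_gold_patch: Callable[[], None],
) -> None:
    """Raise AssertionError unless tests fail before the patch and pass after.

    Hypothetical helper for illustration; it is not part of debug-gym's API.
    """
    # A valid buggy instance must start with failing tests.
    if run_tests():
        raise AssertionError("tests already pass before the gold patch is applied")

    apply_gold_patch()

    # The gold patch is expected to make the same tests pass.
    if not run_tests():
        raise AssertionError("tests still fail after the gold patch is applied")


# Toy stand-in for a task instance: tests pass only once the patch is applied.
state = {"patched": False}


def run_tests() -> bool:
    return state["patched"]


def apply_gold_patch() -> None:
    state["patched"] = True


sanity_check(run_tests, apply_gold_patch)
```

An instance whose tests already pass before patching, or still fail afterwards, would raise an `AssertionError` in this sketch, which is the kind of broken instance a sanity check is meant to surface.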
We provide a human mode that enables developers to manually interact with `debug-gym`. To activate this mode, change the `llm_name` field in the `config_*.yaml` to be `"human"`. Once activated, at every step, the environment will expect a command input (in tool calling format). One can use the `Tab` key to get a list of tool calling templates and fill in any necessary arguments.

-#### 3.2. Overriding Values in Config
+#### 3.3. Overriding Values in Config

`-p` is a handy way to override values defined in the config. For example, the command below runs `rewrite_agent` on Aider in human mode (while the config file specifies gpt-4o).
Modify `scripts/config.yaml`, especially the `env_kwargs`, to set the path and entrypoint of the custom repository. We assume the repository contains a `.debugignore` file and a `.debugreadonly` file that label files/folders that are not visible or not editable, respectively.

As an example, we provide a buggy pytorch code repository in `data/pytorch`.
[SWE-Smith](https://github.com/SWE-bench/SWE-smith) makes it possible to generate new buggy code instances. Given a custom HuggingFace dataset (either local or remote) with a structure similar to [SWE-bench/SWE-smith](https://huggingface.co/datasets/SWE-bench/SWE-smith), one can pass `-p base.env_kwargs.dataset_id=<dataset_id>` on the command line to run the agent on that dataset. For example, to run on a local dataset:
`debug-gym`'s modular design makes it extensible. Users are encouraged to extend `debug-gym` to their specific use cases, for example by creating new tools that diversify an agent's action and observation spaces. For detailed instructions on designing new tools that are `debug-gym`-compatible, please refer to the [Technical Report](https://arxiv.org/abs/2503.21557).

-#### 3.5. Analysis and Visualization
+#### 3.7. Analysis and Visualization
We provide a set of scripts to help analyze the log files (e.g., the `.jsonl` files) generated by the agent.
- In the `analysis` folder, we provide the scripts used to generate the corresponding figures in our technical report.