
Commit d1e23ba

Merge remote-tracking branch 'origin/main' into hitl_agent

2 parents: e5abb91 + 31558a3

31 files changed (+1322, -462 lines)

.github/workflows/tests.yml

Lines changed: 1 addition & 1 deletion

@@ -33,7 +33,7 @@ jobs:
       - name: Test with pytest - PR
         if: github.event_name == 'pull_request'
         run: |
-          DEBUG_GYM_DEBUG=1 pytest -vv -n 16 -k "not test_swe_bench" --cov=debug_gym --cov-report=term-missing --cov-fail-under=80 --timeout=300
+          DEBUG_GYM_DEBUG=1 pytest -vv -n 16 -k "not test_swe_bench and not test_swe_smith" --cov=debug_gym --cov-report=term-missing --cov-fail-under=80 --timeout=300
       - name: Test with pytest
         if: github.event_name != 'pull_request'
         run: |

CHANGELOG.md

Lines changed: 5 additions & 1 deletion

@@ -18,4 +18,8 @@ Added in the [analysis](https://github.com/microsoft/debug-gym/tree/main/analysi

 ### 2025-05-28

-Improved the View tool, added the `start` and `end` arguments so the agent can specify a particular chunk of code to view.
+Improved the View tool, added the `start` and `end` arguments so the agent can specify a particular chunk of code to view.
+
+### 2025-06-11
+
+Added support for [SWE-smith](https://swesmith.com/). Users can use the tasks shipped with the official SWE-smith package, or customized tasks generated using SWE-smith.
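For illustration, a call to the updated View tool could look like the sketch below. Only the `start` and `end` arguments are confirmed by the changelog entry; the `path` argument name and the file used are assumptions, and the `ToolCall` constructor follows the usage shown in solution_agent.py further down this diff.

    # Sketch only: build a View tool call with the new start/end arguments.
    # The "path" key and the file name are hypothetical.
    from debug_gym.gym.tools.tool import ToolCall

    view_call = ToolCall(
        name="view",
        id="view",
        arguments={"path": "src/example.py", "start": 10, "end": 40},
    )
    # On an initialized environment, this would be executed with env.step(view_call).
    print(view_call)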

MANIFEST.in

Lines changed: 1 addition & 0 deletions

@@ -0,0 +1 @@
+include debug_gym/envs/configs/*.yaml
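The new MANIFEST.in entry ships the YAML files under debug_gym/envs/configs with the source distribution. As a hedged sketch (how debug-gym actually locates these configs is not shown in this commit), packaged YAML files could be enumerated at runtime like this:

    # Sketch only: list YAML configs installed as package data, assuming the
    # installed layout mirrors the MANIFEST.in pattern debug_gym/envs/configs/*.yaml.
    from importlib.resources import files

    config_dir = files("debug_gym") / "envs" / "configs"
    yaml_names = [entry.name for entry in config_dir.iterdir() if entry.name.endswith(".yaml")]
    print(yaml_names)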

README.md

Lines changed: 20 additions & 5 deletions

@@ -98,6 +98,7 @@ We provide the below LLM-based agents, they all have minimal design and serve th
 | `debug_agent` | `pdb`, `rewrite`, `view`, `eval` | A minimal agent that dumps all available information into its prompt and queries the LLM to generate a command. |
 | `rewrite_agent` | `rewrite`, `view`, `eval` | A `debug_agent` but `pdb` tool is disabled (an agent keeps rewriting). |
 | `debug_5_agent` | `pdb`, `rewrite`, `view`, `eval` | A `debug_agent`, but `pdb` tool is only enabled after certain amount of rewrites. |
+| `solution_agent` | `pdb`, `eval` | An oracle agent that applies a gold patch (only works with `swebench` and `swesmith` benchmarks for now). The agent checks that the tests fail before applying the patch and pass after. It also checks that the `pdb` tool can be used as expected. |

 ---

@@ -109,6 +110,7 @@ To demonstrate how to integrate `debug-gym` with coding tasks and repositories,
 | :-: | :----- |
 | `aider` | [https://github.com/Aider-AI/aider](https://github.com/Aider-AI/aider) |
 | `swebench`| [https://github.com/princeton-nlp/SWE-bench](https://github.com/princeton-nlp/SWE-bench) |
+| `swesmith`| [https://github.com/SWE-bench/SWE-smith](https://github.com/SWE-bench/SWE-smith) |
 | `mini_nightmare` | A set of 10 hand-crafted minimal buggy code snippet where rewrite only agents have harder time to tackle. Read details [here](https://github.com/microsoft/debug-gym/blob/main/data/mini_nightmare/mini_nightmare.md). |

 ---
@@ -122,28 +124,41 @@ Add `-v`, `--debug` to be verbose, or to enter debug mode.
 > [!WARNING]
 > When using --debug, you will need to press `c` to continue after each reasoning step.

-#### 3.1 Human Mode
+#### 3.1 Sanity Checks
+
+You can use the `solution_agent` to validate that your `swebench` and `swesmith` instances work as expected. This agent applies a gold patch to the buggy code and checks that the tests fail before the patch is applied and pass afterwards. It also checks that the `pdb` tool can be used as expected.
+
+python scripts/run.py scripts/config_swebench.yaml --agent solution_agent
+python scripts/run.py scripts/config_swesmith.yaml --agent solution_agent
+
+#### 3.2 Human Mode

 We provide a human mode that enables developers to manually interact with `debug-gym`. To activate this mode, change the `llm_name` field in the `config_*.yaml` to be `"human"`. Once activated, at every step, the environment will expect a command input (in tool calling format). One can use the `Tab` key to get a list of tool calling templates and fill in any necessary arguments.

-#### 3.2. Overriding Values in Config
+#### 3.3. Overriding Values in Config

 `-p` is a handy way to override values defined in config. For example, the below command will run rewrite_agent agent on Aider with human mode (while in config file it specifies gpt-4o).

 python scripts/run.py scripts/config_aider.yaml --agent rewrite_agent -v -p rewrite_agent.llm_name="human"

-#### 3.3. Debugging a Custom Repository
+#### 3.4. Debugging a Custom Repository

 Modify `scripts/config.yaml`, especially the `env_kwargs` to set the path and entrypoint of the custom repository. We assume there is a `.debugignore` file and a `.debugreadonly` within the repository that labels files/folders that are not seen or not editable, respectively.

 As an example, we provide a buggy pytorch code repository in `data/pytorch`.

 python scripts/run.py scripts/config.yaml --agent <agent name>

-#### 3.4. Design Your Own Tool
+#### 3.5. Debugging a Custom SWE-Smith Instance
+
+[SWE-Smith](https://github.com/SWE-bench/SWE-smith) makes it possible to generate new buggy code instances. Given a custom HuggingFace dataset (either local or remote) with a structure similar to [SWE-bench/SWE-smith](https://huggingface.co/datasets/SWE-bench/SWE-smith), you can pass `-p base.env_kwargs.dataset_id=<dataset_id>` on the command line to run the agent on that dataset. For example, to run on a local dataset:
+
+python scripts/run.py scripts/config_swesmith.yaml --agent <agent name> -p base.env_kwargs.dataset_id="path/to/local/dataset"
+
+#### 3.6. Design Your Own Tool
 `debug-gym`'s modular design makes it extensible. Users are encouraged to extend `debug-gym` to their specific usecases, for example by creating new tools that diversify an agent's action and observation spaces. For detailed instruction on designing new tools that are `debug-gym`-compatible, please refer to the [Technical Report](https://arxiv.org/abs/2503.21557).

-#### 3.5. Analysis and Visualization
+#### 3.7. Analysis and Visualization

 We provide a set of scripts to help analyze the log files (e.g., the `.jsonl` files) generated by the agent.
 - In the `analysis` folder, we provide scripts that used to generate the corresponding figures in our technical report.
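To make the new README section 3.5 concrete, a hedged sketch of preparing a local dataset follows. The column names below are assumptions for illustration only; consult the official SWE-bench/SWE-smith dataset on Hugging Face for the schema SWESmithEnv actually expects.

    # Sketch only: materialize a tiny local dataset in a SWE-smith-like layout,
    # then point the runner at it via -p base.env_kwargs.dataset_id=... as above.
    from datasets import Dataset

    rows = [
        {
            "instance_id": "example__repo.func_bug_0001",  # hypothetical identifier
            "patch": "diff --git a/foo.py b/foo.py\n...",  # gold patch placeholder
        }
    ]
    Dataset.from_list(rows).save_to_disk("path/to/local/dataset")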

debug_gym/agents/__init__.py

Lines changed: 1 addition & 0 deletions

@@ -1,3 +1,4 @@
 from debug_gym.agents.debug_agent import Debug_5_Agent, DebugAgent
 from debug_gym.agents.guided_agent import GuidedRewriteAgent
 from debug_gym.agents.rewrite_agent import RewriteAgent
+from debug_gym.agents.solution_agent import AgentSolution

debug_gym/agents/base_agent.py

Lines changed: 5 additions & 3 deletions

@@ -10,7 +10,7 @@
 from debug_gym.agents.history_tracker import HistoryTracker, build_history_prompt
 from debug_gym.agents.utils import trim
 from debug_gym.gym.envs.env import RepoEnv
-from debug_gym.gym.utils import unescape
+from debug_gym.gym.utils import filter_non_utf8
 from debug_gym.llms.base import LLM
 from debug_gym.logger import DebugGymLogger

@@ -74,7 +74,7 @@ def parse_reasoning_model_response(self, response, reasoning_end_token):

     def build_system_prompt(self, info):
         def calc_tokens_left(system_prompt: dict):
-            system_prompt = unescape(
+            system_prompt = filter_non_utf8(
                 json.dumps(system_prompt, indent=2, sort_keys=False)
             )
             return self.llm.context_length - self.llm.count_tokens(system_prompt)
@@ -129,7 +129,9 @@ def calc_tokens_left(system_prompt: dict):
         if len(shortcut_features) > 0:
             system_prompt["Shortcut features"] = shortcut_features

-        system_prompt = unescape(json.dumps(system_prompt, indent=2, sort_keys=False))
+        system_prompt = filter_non_utf8(
+            json.dumps(system_prompt, indent=2, sort_keys=False)
+        )
         messages = [
             {
                 "role": "system",

debug_gym/agents/solution_agent.py

Lines changed: 69 additions & 0 deletions

@@ -0,0 +1,69 @@
+import subprocess
+
+from debug_gym.agents.base_agent import BaseAgent, register_agent
+from debug_gym.gym.envs.swe_bench import SWEBenchEnv
+from debug_gym.gym.envs.swe_smith import SWESmithEnv
+from debug_gym.gym.tools.tool import ToolCall
+
+
+@register_agent
+class AgentSolution(BaseAgent):
+    name: str = "solution_agent"
+
+    def run(self, task_name=None, debug=False):
+        self.history.reset()
+
+        info = self.env.reset(options={"task_name": task_name})
+        self.history.step(info)
+
+        if info.done is True:
+            return True
+
+        self.logger.info(
+            f"Score: {info.score}/{info.max_score} ({info.score/info.max_score:.1%})"
+        )
+
+        # Make a simple pdb call to make sure it is working.
+        action = ToolCall(name="pdb", id="pdb", arguments={"command": "help help"})
+        pdb_help_info = self.env.step(action)
+        assert (
+            "h(elp)" in pdb_help_info.step_observation.observation
+        ), f"PDB command did not return expected help message.\n{pdb_help_info.step_observation.observation}"
+
+        # Send a pdb continue command, and check the output matches the one from env.reset.
+        action = ToolCall(name="pdb", id="pdb", arguments={"command": "continue"})
+        pdb_continue_info = self.env.step(action)
+
+        assert (
+            "Reached the end of the program. Restarting the debugging session."
+            in pdb_continue_info.step_observation.observation
+        ) or (
+            info.step_observation.observation.splitlines()[-1]
+            in pdb_continue_info.step_observation.observation
+        ), f"PDB command did not return expected continue message.\n{pdb_continue_info.step_observation.observation}"
+
+        try:
+            self.env.apply_gold_patch()
+        except NotImplementedError as e:
+            self.logger.error(
+                f"The environment {type(self.env)} is not compatible with SolutionAgent. "
+                "Check the README.md to see which environments are compatible."
+            )
+            raise
+
+        if debug:
+            breakpoint()
+
+        action = ToolCall(name="eval", id="eval", arguments={})
+        info = self.env.step(action)
+
+        self.history.step(info)
+
+        self.logger.info(
+            f"Score: {info.score}/{info.max_score} ({info.score/info.max_score:.1%})"
+        )
+        assert (
+            info.done
+        ), f"The task is not done after applying the gold patch.\n{info.step_observation.observation}"
+
+        return info.done
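The agent registers itself through the `register_agent` decorator imported from `base_agent`, whose implementation is not part of this diff. A plausible sketch, assuming it keeps a module-level registry keyed by each agent's `name` attribute so that `--agent solution_agent` can resolve to `AgentSolution`:

    # Sketch only: a name-keyed agent registry; the real register_agent in
    # debug_gym.agents.base_agent may work differently.
    AGENT_REGISTRY: dict[str, type] = {}

    def register_agent(cls):
        AGENT_REGISTRY[cls.name] = cls
        return cls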

debug_gym/agents/utils.py

Lines changed: 4 additions & 4 deletions

@@ -104,14 +104,14 @@ def load_config():
         "--agent",
     )
     parser.add_argument(
-        "--list",
+        "--debug",
         action="store_true",
-        help="List available agents and problems.",
+        help="Break before sending action to the environment.",
     )
     parser.add_argument(
-        "--debug",
+        "--list",
         action="store_true",
-        help="Break before sending action to the environment.",
+        help="List available agents and problems.",
     )
     group = parser.add_mutually_exclusive_group()
     group.add_argument(

debug_gym/gym/envs/__init__.py

Lines changed: 4 additions & 2 deletions

@@ -2,18 +2,20 @@
 from debug_gym.gym.envs.env import RepoEnv, TooledEnv
 from debug_gym.gym.envs.mini_nightmare import MiniNightmareEnv
 from debug_gym.gym.envs.swe_bench import SWEBenchEnv
+from debug_gym.gym.envs.swe_smith import SWESmithEnv


-def select_env(env_type: str = None):
+def select_env(env_type: str = None) -> type[RepoEnv]:
     match env_type:
         case None:
             return RepoEnv
         case "aider":
             return AiderBenchmarkEnv
         case "swebench":
             return SWEBenchEnv
+        case "swesmith":
+            return SWESmithEnv
         case "mini_nightmare":
             return MiniNightmareEnv
         case _:
             raise ValueError(f"Unknown benchmark {env_type}")
-    return env_class
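A short usage sketch of the updated `select_env`; the `swesmith` branch and the `type[RepoEnv]` return annotation are the new parts. Constructor arguments for the returned class are omitted since they depend on the environment.

    # Resolve an environment class by benchmark name.
    from debug_gym.gym.envs import select_env

    env_class = select_env("swesmith")  # returns SWESmithEnv
    default_class = select_env()        # returns RepoEnv when no benchmark is given
    print(env_class.__name__, default_class.__name__)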

debug_gym/gym/envs/aider.py

Lines changed: 1 addition & 1 deletion

@@ -66,7 +66,7 @@ def load_dataset(self):
         utils.create_ignore_file(
             directory / ".debugignore",
             patterns=[
-                ".*/",
+                ".?*",  # Ignore hidden files and directories but not current dir "."
                 "__pycache__/",
                 "*.pyc",
                 # "*.md",
