<p>CodeMonkeys scores 57.4% on SWE-bench Verified using Claude Sonnet 3.5.</p>
<p>Further, when using CodeMonkeys' <a href="#selection">candidate selection</a> to select from an ensemble of solutions constructed from CodeMonkeys and existing top SWE-bench Verified submissions, we achieve a score of 66.2%. This score outperforms all members of the ensemble and comes 5.5% below o3's reported score of 71.7%.</p>
<p>Concretely, we are releasing:</p>
<ul>
<li><strong><a href="https://github.com/scalingintelligence/codemonkeys">The CodeMonkeys codebase</a>.</strong> This includes scripts for reproducing all results from our paper and running CodeMonkeys on SWE-bench problems.</li>
<li><strong><a href="#">All trajectories generated while solving SWE-bench problems.</a></strong> These trajectories contain all model outputs, candidate edits, test scripts, and execution traces generated while running CodeMonkeys on SWE-bench Verified.</li>
<li><strong><a href="#costs">A careful accounting of the cost of running CodeMonkeys.</a></strong> Software engineering agents are expensive: running CodeMonkeys on SWE-bench Verified cost $2300 USD. </li>
<li><strong><a href="https://huggingface.co/datasets/ScalingIntelligence/swe-bench-verified-codebase-content">A companion dataset containing complete Python codebase snapshots for all problems in SWE-bench Verified.</a></strong> This dataset makes it easier to work with SWE-bench by providing direct access to repository contents, without needing to clone and manage large Git repositories.</li>
</ul>
<p>We'll start by <a href="#motivation">talking about monkeys</a>, explain how this motivated the <a href="#codemonkeys">design of CodeMonkeys</a>, walk through CodeMonkeys' <a href="#results">results on SWE-bench</a>, and end with <a href="#ensemble">ensembling for SWE-bench</a>.</p>
<section id="swebench">
<h2>SWE-bench</h2>
<p><a href="https://swebench.com">SWE-bench</a> is a benchmark that measures how well AI systems can solve real-world software engineering problems. Each problem in SWE-bench consists of an actual GitHub issue from a popular open-source Python repository (like Django or Sympy) and the complete codebase at the time the issue was reported.</p>
<img src="/imgs/blog/codemonkeys/swebench.png" alt="SWE-bench problem overview." style="width: 100%; height: auto;">
<p>To solve a SWE-bench issue, systems produce an edit to the given codebase, with the goal of resolving the described issue. This edit is evaluated for correctness using unit tests from the codebase, which are hidden from the system at test-time. A model's score is simply the fraction of issues where the model's patch is marked as correct.</p>
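<p>As a concrete illustration of this scoring rule (a minimal sketch; the issue IDs and pass/fail values below are made up):</p>
<div style="width: 100%; overflow-x: auto">
<code style="white-space: pre">
# Illustrative only: the benchmark score is the fraction of issues whose
# submitted patch passes the hidden unit tests.
resolved = {"issue_1": True, "issue_2": False, "issue_3": True}  # made-up results
score = sum(resolved.values()) / len(resolved)
print(f"Resolved {score:.1%} of issues")  # Resolved 66.7% of issues
</code>
</div>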
<p>In this work, we've focused on <a href="https://openai.com/index/introducing-swe-bench-verified/">SWE-bench Verified</a>, a subset of SWE-bench validated by human annotators.</p>
</section>
<section id="motivation">
<h2>Large Language Monkeys</h2>
<p>We first became interested in SWE-bench during our previous work, <a href="https://scalingintelligence.stanford.edu/pubs/large_language_monkeys/">Large Language Monkeys</a>. In that paper, we demonstrated a promising property of LLMs: when solving software engineering (and other) problems, coverage, the fraction of problems that are solved by at least one attempt, increases log-linearly with the number of solutions drawn from the model.</p>
<p><strong>This means that as you spend more test time compute by drawing more samples, the fraction of problems that you have at least one correct solution for increases consistently and predictably.</strong></p>
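<p>As a minimal sketch of what we mean by coverage (illustrative code, not our evaluation harness):</p>
<div style="width: 100%; overflow-x: auto">
<code style="white-space: pre">
# Coverage: a problem counts as covered if at least one of its sampled
# solutions is correct.
def coverage(results_per_problem):
    # results_per_problem: one list of booleans per problem (one bool per sample)
    covered = sum(1 for attempts in results_per_problem if any(attempts))
    return covered / len(results_per_problem)

# 3 problems, 4 samples each (made-up correctness values): 2/3 are covered
print(coverage([[False, True, False, False],
                [False, False, False, False],
                [True, True, False, True]]))
</code>
</div>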
<p>While these results showed clear potential for improving performance on benchmarks like SWE-bench, we only demonstrated coverage improvements. To achieve actual performance gains, a system needs to select a correct solution from many candidates. Additionally, we generated our candidates by sampling from an existing framework (<a href="https://github.com/aorwall/moatless-tools">Moatless Tools</a>) multiple times. Although powerful and well-built, the framework wasn't designed for taking multiple attempts per problem.</p>
<p>This raised the question: <em>how would one design a system for solving SWE tasks differently if benefiting from test-time compute scaling was a primary consideration?</em></p>
</section>
<section id="codemonkeys">
<h2>CodeMonkeys</h2>
<p>This question led us to build CodeMonkeys, a system designed to solve software engineering problems by scaling test-time compute. Similar to existing approaches like <a href="https://github.com/OpenAutoCoder/Agentless?tab=readme-ov-file">Agentless</a>, we decomposed solving SWE-bench issues into 3 subtasks:</p>
<p>First, we identify relevant codebase context. As we generate multiple candidate solutions, we can amortize the cost of context identification across all downstream samples.</p>
<p>This lets us use a simple but effective approach: we let a model (specifically, <a href="https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct">Qwen2.5-Coder-32B-Instruct</a>) read every Python file in the codebase and label each file as "relevant" or "not relevant". Then, we use <a href="https://claude.ai">Claude Sonnet 3.5</a> to rank the relevant files by importance, allowing up to 120,000 tokens of context.</p>
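<p>To make the pipeline concrete, here is a hedged sketch of the two-stage context identification step. The helpers <code>label_relevance</code>, <code>rank_files</code>, and <code>count_tokens</code> are hypothetical stand-ins for the actual model prompts and tokenizer:</p>
<div style="width: 100%; overflow-x: auto">
<code style="white-space: pre">
# Illustrative sketch (not the exact CodeMonkeys implementation): scan every
# Python file with a small model, then rank the relevant ones and keep as
# many as fit in the context budget.
MAX_CONTEXT_TOKENS = 120_000  # budget mentioned above

def identify_context(codebase, label_relevance, rank_files, count_tokens):
    # label_relevance(path, text): relevance judgment from Qwen2.5-Coder-32B-Instruct
    # rank_files(files): importance ordering from Claude Sonnet 3.5
    relevant = [(path, text) for path, text in codebase.items()
                if path.endswith(".py") and label_relevance(path, text)]
    context, used = [], 0
    for path, text in rank_files(relevant):
        cost = count_tokens(text)
        if used + cost > MAX_CONTEXT_TOKENS:
            break
        context.append((path, text))
        used += cost
    return context
</code>
</div>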
<h3>Task 2: Generation</h3>
<div class="component-details">
<p>Then, we generate candidate solutions. We run multiple parallel state machines that each generate both a codebase edit and a corresponding testing script. These state machines iteratively refine their edits and tests based on execution feedback. This provides two ways to scale test-time compute: we can increase the number of iterations per state machine ("serial scaling") or increase the number of independent state machines per problem ("parallel scaling").</p>
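<p>A rough sketch of this serial/parallel structure is below; <code>propose</code> and <code>execute</code> are hypothetical helpers wrapping the model call and sandboxed execution, and the knob values are illustrative rather than our exact settings:</p>
<div style="width: 100%; overflow-x: auto">
<code style="white-space: pre">
# Illustrative sketch of the generation stage: several independent state
# machines ("parallel scaling"), each refining an (edit, test) pair over
# multiple iterations of execution feedback ("serial scaling").
def run_state_machine(issue, context, propose, execute, num_iterations):
    edit, test = propose(issue, context, feedback=None)  # initial draft
    for _ in range(num_iterations):
        feedback = execute(edit, test)                   # run the test against the edited codebase
        if feedback.passed:
            break
        edit, test = propose(issue, context, feedback)   # revise using execution output
    return edit, test

def generate_candidates(issue, context, propose, execute,
                        num_machines=10, num_iterations=8):
    return [run_state_machine(issue, context, propose, execute, num_iterations)
            for _ in range(num_machines)]
</code>
</div>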
<h3 id="selection">Task 3: Selection</h3>
<div class="component-details">
<p>Finally, we select among the candidate solutions. We combine two approaches: using the model-generated tests to vote on solutions, and running a dedicated selection state machine that can write additional tests to differentiate between top candidates.</p>
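<p>A simplified sketch of the voting half of selection follows (the dedicated selection state machine is omitted; <code>run_test</code> is a hypothetical helper that executes one generated test against one candidate edit):</p>
<div style="width: 100%; overflow-x: auto">
<code style="white-space: pre">
# Illustrative test-based voting: each candidate edit is scored by how many
# model-generated tests it passes; the top candidates become finalists for
# the selection state machine.
def vote(candidate_edits, generated_tests, run_test, top_n=3):
    def votes(edit):
        return sum(1 for test in generated_tests if run_test(edit, test))
    return sorted(candidate_edits, key=votes, reverse=True)[:top_n]
</code>
</div>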
</section>
<section id="results">
<div class="component-results">
<h3>Context</h3>
<p>With the 128k token limit, 92.6% of instances have the correct files in context.</p>
<h3>Generation</h3>
<p>By running multiple state machines in parallel and allowing each to iterate multiple times, we achieve 69.8% coverage. This means that for about 70% of problems, at least one of our candidate solutions is correct. Notably, in our experiments, we found that different ways of distributing compute between parallel scaling (more state machines) and serial scaling (more iterations per machine) often lead to similar coverage values.</p>
<h3>Selection</h3>
<p>Our selection method, which combines test-based voting with a dedicated selection state machine, recovers approximately half of the gap between random selection and using an oracle. This leads to a final score of 57.4%.</p>
</style>
<p>The cost breakdown reveals several insights about CodeMonkeys:</p>
<ul class="cost-analysis">
<li>Our context identification contributes only 15% of total costs while achieving high recall. A simple linear scan with a small model can be quite effective!</li>
<li>Generating edits is the most expensive component (60% of costs), primarily due to cache read costs from including codebase context in prompts.</li>
<li>We reduce costs by separating testing and editing state machines, allowing us to omit codebase context from testing prompts.</li>
<li>Selection contributes less than 10% of total costs while significantly improving final performance (see the paper for ablations).</li>
</ul>
<style>
<section id="ensemble">
<h2>Barrel of Monkeys: Combining Solutions from Different Systems</h2>
<table class="leaderboard">
<thead>
<tr>
<th>Method</th>
<th>Selection</th>
<th>Score</th>
</tr>
</thead>
<tbody>
<tr><td><strong>Barrel of Monkeys</strong></td><td>Oracle (Coverage)</td><td>80.8</td></tr>
<p>Our selection mechanism can also be used to combine candidate edits from heterogeneous sources. We demonstrate this by creating what we call the "Barrel of Monkeys" - an expanded pool of candidate edits that includes solutions from CodeMonkeys along with the submissions from the top-4 entries on the SWE-bench leaderboard (Blackbox AI Agent, CodeStory, Learn-by-interact, and devlo).</p>
<p>When we run our selection state machine over this expanded pool of candidate solutions, we achieve a score of 66.2%. This outperforms both CodeMonkeys on its own (57.4%) and the previous best ensemble submission (62.8%), showing how our selection method can effectively identify correct solutions even when they come from different frameworks.</p>
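<p>Conceptually, building the Barrel of Monkeys just means pooling candidate edits by issue before running the same selection procedure; a minimal sketch (the data layout here is assumed for illustration):</p>
<div style="width: 100%; overflow-x: auto">
<code style="white-space: pre">
# Illustrative pooling of candidates from heterogeneous systems.
def build_barrel(per_system_candidates):
    # per_system_candidates: {system_name: {instance_id: [candidate edits]}}
    barrel = {}
    for candidates in per_system_candidates.values():
        for instance_id, edits in candidates.items():
            barrel.setdefault(instance_id, []).extend(edits)
    return barrel  # {instance_id: pooled candidate edits} fed into selection
</code>
</div>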
<p>Something we took away from this is the continued importance of selection: oracle selection over the Barrel of Monkeys scores 80.8%, leaving more than 10 percentage points of headroom above our selected score of 66.2%.</p>
</section>
<section id="data">
<h2>Data Release & Paper</h2>
<p>We are releasing two complementary datasets and a paper. We hope these support different aspects of research.</p>
<h3><a>CodeMonkeys Trajectories</a></h3>
<p>Our first dataset contains the complete problem-solving trajectories from running CodeMonkeys on SWE-bench Verified. For each of the 500 problems, we release all state data. This includes all LLM outputs.</p>
<p>Our second dataset provides efficient access to the Python codebases required to work on SWE-bench problems. Instead of requiring researchers to manage Git repositories, this dataset contains all Python files from the relevant repositories.</p>
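<p>For example, the codebase-content dataset can be loaded directly from the Hugging Face Hub (the split and column names are documented on the dataset card; this snippet only loads and inspects it):</p>
<div style="width: 100%; overflow-x: auto">
<code style="white-space: pre">
# Hedged example: load the companion dataset with the Hugging Face datasets library.
from datasets import load_dataset

codebases = load_dataset("ScalingIntelligence/swe-bench-verified-codebase-content")
print(codebases)  # inspect the available splits and columns
</code>
</div>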
<h3><a>CodeMonkeys Paper</a></h3>
<p>For more details about our methods, analysis of the trade-offs between different scaling approaches, and ablation studies of our selection methods, please read our paper: <a href="#">CodeMonkeys: Scaling Test-Time Compute for Software Engineering</a>.</p>
<p>If our dataset, code, or paper was helpful to you, please consider citing:</p>
<div style="width: 100%; overflow-x: auto">
<code style="white-space: pre">
@misc{ehrlich2025codemonkeys,
title={CodeMonkeys: Scaling Test-Time Compute for Software Engineering},
author={Ryan Ehrlich and Bradley Brown and Jordan Juravsky and Ronald Clark and Christopher Ré and Azalia Mirhoseini},