
Commit 9924551

Blog post updates
1 parent 48c5b3c commit 9924551

File tree

1 file changed (+67, -58 lines)


_blogs/codemonkeys.md

Lines changed: 67 additions & 58 deletions
@@ -35,58 +35,34 @@ materials:
3535

3636
<div class="post-content">
3737

38-
<p>Today, we're releasing CodeMonkeys, an open-source system designed to solve software engineering problems using test-time compute.</p>
38+
<p>Today, we're releasing CodeMonkeys, a system designed to solve software engineering problems by scaling test-time compute.</p>
3939

4040
<img src="/imgs/blog/codemonkeys/system_overview.png" alt="" style="width: 100%; height: auto;">
4141

4242

43-
<p>CodeMonkeys scores 57.4% on SWE-bench Verified using Claude Sonnet 3.5. Additionally, CodeMonkeys' candidate selection method can effectively combine solutions from different systems. When selecting over an ensemble that includes solutions from CodeMonkeys and existing top SWE-bench verified submissions, we achieve a score of 66.2% - outperforming all memebers of the ensemble, and coming 5.5% below o3's reported score of 71.7%.</p>
43+
<p>CodeMonkeys scores 57.4% on SWE-bench Verified using Claude Sonnet 3.5.</p>
44+
<p>Further, when using CodeMonkeys' <a href="#selection">candidate selection</a> to select from an ensemble of solutions constructed from CodeMonkeys and existing top SWE-bench Verified submissions, we achieve a score of 66.2%. This outperforms all members of the ensemble and comes 5.5 points below o3's reported score of 71.7%.</p>
4445

45-
<p>To promote further research, we are releasing:</p>
46+
<p>Concretely, we are releasing:</p>
4647
<ul>
47-
<li><strong><a href="https://github.com/scalingintelligence/codemonkeys">The CodeMonkeys codebase</a>,</strong> including scripts for reproducing all results from our paper and running the system on SWE-bench problems.</li>
48-
<li><strong><a href="#">All trajectories generated while solving SWE-bench problems.</a></strong> These trajectories contain all model outputs, all file contexts, candidate edits, test scripts, and execution traces generated while running CodeMonkeys on SWE-bench Verified.</li>
49-
<li><strong><a href="#costs">A careful accounting of the cost of running CodeMonkeys.</a></strong> Software Engineering agents are expensive: running CodeMonkeys on SWE-bench Verified cost $2300 USD. </li>
48+
<li><strong><a href="https://github.com/scalingintelligence/codemonkeys">The CodeMonkeys codebase</a>.</strong> This includes scripts for reproducing all results from our paper and running CodeMonkeys on SWE-bench problems.</li>
49+
<li><strong><a href="#">All trajectories generated while solving SWE-bench problems.</a></strong> These trajectories contain all model outputs, candidate edits, test scripts, and execution traces generated while running CodeMonkeys on SWE-bench Verified.</li>
50+
<li><strong><a href="#costs">A careful accounting of the cost of running CodeMonkeys.</a></strong> Software engineering agents are expensive: running CodeMonkeys on SWE-bench Verified cost $2300 USD. </li>
5051
<li><strong><a href="https://huggingface.co/datasets/ScalingIntelligence/swe-bench-verified-codebase-content">A companion dataset containing complete Python codebase snapshots for all problems in SWE-bench Verified.</a></strong> This dataset makes it easier to work with SWE-bench by providing direct access to repository contents, without needing to clone and manage large Git repositories.</li>
5152
</ul>
52-
<!--
5353

54-
<table class="leaderboard">
55-
<thead>
56-
<tr>
57-
<th>Method</th>
58-
<th>Selection</th>
59-
<th>Score</th>
60-
</tr>
61-
</thead>
62-
<tbody>
63-
<tr><td><strong>Barrel of Monkeys</strong></td><td>Oracle (Coverage)</td><td>80.8</td></tr>
64-
<tr><td>o3</td><td>---</td><td>71.7</td></tr>
65-
<tr><td><strong>CodeMonkeys</strong></td><td>Oracle (Coverage)</td><td>69.8</td></tr>
66-
<tr><td><strong>Barrel of Monkeys</strong></td><td>State Machine</td><td>66.2</td></tr>
67-
<tr><td>Blackbox AI Agent</td><td>---</td><td>62.8</td></tr>
68-
<tr><td>CodeStory</td><td>---</td><td>62.2</td></tr>
69-
<tr><td>Learn-by-interact</td><td>---</td><td>60.2</td></tr>
70-
<tr><td>devlo</td><td>---</td><td>58.2</td></tr>
71-
<tr><td><strong>CodeMonkeys</strong></td><td>State Machine</td><td>57.4</td></tr>
72-
<tr><td>Emergent E1</td><td>---</td><td>57.2</td></tr>
73-
<tr><td>Gru</td><td>---</td><td>57.0</td></tr>
74-
</tbody>
75-
</table>
76-
-->
54+
<p>We'll start by <a href="#motivation">talking about monkeys</a>, explain how this motivated the <a href="#codemonkeys">design of CodeMonkeys</a>, walk through CodeMonkeys' <a href="#results">results on SWE-bench</a>, and end with <a href="#ensemble">ensembling for SWE-bench</a>.</p>
7755

7856
<section id="swebench">
7957
<h2>SWE-bench</h2>
8058
<p><a href="https://swebench.com">SWE-bench</a> is a benchmark that measures how well AI systems can solve real-world software engineering problems. Each problem in SWE-bench consists of an actual GitHub issue from a popular open-source Python repository (like Django or Sympy) and the complete codebase at the time the issue was reported.</p>
8159

8260
<img src="/imgs/blog/codemonkeys/swebench.png" alt="SWE-bench problem overview." style="width: 100%; height: auto;">
8361
84-
<p>To solve a SWE-bench issue, systems produce an edit to the codebase. This edit is evaluated for correctness using unit tests from the repo (these unit tests are hidden from the model). </p>
62+
<p>To solve a SWE-bench issue, systems produce an edit to the given codebase, with the goal of resolving the described issue. This edit is evaluated for correctness using unit tests from the codebase, which are hidden from the system at test time. A system's score is simply the fraction of issues where its patch is marked as correct; for example, resolving 287 of the 500 problems in SWE-bench Verified gives a score of 57.4%.</p>
8563

8664
<p>In this work, we've focused on <a href="https://openai.com/index/introducing-swe-bench-verified/">SWE-bench Verified</a>, a subset of SWE-bench validated by human annotators.</p>
8765
</section>
88-
89-
9066
<section id="motivation">
9167
<h2>Large Language Monkeys</h2>
9268
<p>We first became interested in SWE-bench during our previous work, <a href="https://scalingintelligence.stanford.edu/pubs/large_language_monkeys/">Large Language Monkeys</a>. In that paper, we demonstrated a promising property of LLMs: when solving software engineering (and other) problems, coverage, the fraction of problems that are solved by at least one attempt, increases log-linearly with the number of solutions drawn from the model.</p>
@@ -95,30 +71,36 @@ materials:
9571

9672
<p><strong>This means that as you spend more test time compute by drawing more samples, the fraction of problems that you have at least one correct solution for increases consistently and predictably.</strong></p>
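<p>To make "coverage" concrete, here is a minimal Python sketch of the standard unbiased pass@k estimator commonly used to measure it (given n samples per problem, of which c are correct). The numbers in the usage example are purely illustrative.</p>

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimated probability that at least one of k samples is correct,
    given that c of the n sampled solutions for this problem are correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative only: 3 correct samples out of 100 already gives ~27% coverage at k=10.
print(round(pass_at_k(n=100, c=3, k=10), 3))  # 0.273
```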
9773
98-
<p>While these results showed clear potential for improving performance on benchmarks like SWE-bench, we only demonstrated coverage improvements. To achieve actual performance gains, a system needs to select a correct solution from many candidates. Additionally, we generated our candidates by running an existing framework <a href="https://github.com/aorwall/moatless-tools">Moatless Tools</a> multiple times with a positive temperature - although powerful and well-built, the framework wasn't designed for taking multiple attempts per problem.</p>
74+
<p>While these results showed clear potential for improving performance on benchmarks like SWE-bench, we only demonstrated coverage improvements. To achieve actual performance gains, a system needs to select a correct solution from many candidates. Additionally, we generated our candidates by sampling from an existing framework (<a href="https://github.com/aorwall/moatless-tools">Moatless Tools</a>) multiple times. Although powerful and well-built, the framework wasn't designed for taking multiple attempts per problem.</p>
9975
10076
<p>This raised the question: <em>how would one design a system for solving SWE tasks differently if benefiting from test-time compute scaling was a primary consideration?</em></p>
10177
</section>
10278

10379
<section id="codemonkeys">
10480
<h2>CodeMonkeys</h2>
105-
<p>This question led us to build CodeMonkeys, a system designed to solve software engineering problems by scaling test-time compute. Similar to existing approaches like <a href="https://github.com/OpenAutoCoder/Agentless?tab=readme-ov-file">Agentless</a>, we decomposed resolving SWE-bench issues into 3 subtasks.</p>
81+
<p>This question led us to build CodeMonkeys, a system designed to solve software engineering problems by scaling test-time compute. Similar to existing approaches like <a href="https://github.com/OpenAutoCoder/Agentless?tab=readme-ov-file">Agentless</a>, we decomposed solving SWE-bench issues into 3 subtasks:</p>
82+
83+
<ol>
84+
<li><strong>Context:</strong> identifying relevant codebase files.</li>
85+
<li><strong>Generation:</strong> generating candidate edits and tests.</li>
86+
<li><strong>Selection:</strong> selecting a final answer from a collection of edits and tests.</li>
87+
</ol>
10688

10789
<h3>Task 1: Context</h3>
10890

10991
11092
<div class="component-details">
111-
<p><strong>Goal:</strong> Identify which files from the codebase need to be seen to resolve the issue</p>
11293
<p><strong>Inputs:</strong> Issue Description, Entire Codebase (up to millions of tokens of context)</p>
11394
<p><strong>Outputs:</strong> Relevant Files (120,000 tokens)</p>
11495
</div>
11596

116-
<p>First, we identify relevant codebase context. As we generate multiple solutions, we can amortize the cost of context identification across all downstream samples. This lets us use a simple but effective approach: we let a model (specifically, Qwen2.5-Coder-32B-Instruct) read every Python file in the codebase and label each file as "relevant" or "not relevant". Then, we used Claude Sonnet-3.5 to rank the relevant files by importance, allowing up to 120,000 tokens of context.</p>
97+
<p>First, we identify relevant codebase context. As we generate multiple candidate solutions, we can amortize the cost of context identification across all downstream samples.</p>
98+
99+
<p>This lets us use a simple but effective approach: we let a model (specifically, <a href="https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct">Qwen2.5-Coder-32B-Instruct</a>) read every Python file in the codebase and label each file as "relevant" or "not relevant". Then, we use <a href="https://claude.ai">Claude Sonnet 3.5</a> to rank the relevant files by importance, allowing up to 120,000 tokens of context.</p>
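<p>As a rough illustration of this control flow (a simplified sketch, not the actual CodeMonkeys code), the snippet below assumes two hypothetical helpers: <code>label_relevance</code> standing in for the Qwen2.5-Coder relevance call and <code>rank_files</code> standing in for the Claude ranking call.</p>

```python
from pathlib import Path

TOKEN_BUDGET = 120_000  # context limit described above

def approx_tokens(text: str) -> int:
    return len(text) // 4  # crude heuristic, not a real tokenizer

def build_context(repo_root: str, issue: str, label_relevance, rank_files) -> list[Path]:
    # Linear scan: the small model labels every Python file as relevant or not.
    candidates = [
        path for path in Path(repo_root).rglob("*.py")
        if label_relevance(issue, str(path), path.read_text(errors="ignore"))
    ]
    # The stronger model orders the relevant files by importance.
    ranked = rank_files(issue, candidates)
    # Greedily keep the highest-ranked files until the token budget is reached.
    context, used = [], 0
    for path in ranked:
        cost = approx_tokens(path.read_text(errors="ignore"))
        if used + cost > TOKEN_BUDGET:
            break
        context.append(path)
        used += cost
    return context
```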
117100

118101
<h3>Task 2: Generation</h3>
119102
120103
<div class="component-details">
121-
<p><strong>Goal:</strong> Generate candidate solutions to the issue, along with candidate tests</p>
122104
<p><strong>Inputs:</strong> Issue Description, Relevant Files</p>
123105
<p><strong>Outputs:</strong> 10 (candidate edit, candidate test) pairs</p>
124106
</div>
@@ -129,19 +111,18 @@ materials:
129111

130112
<p>Then, we generate candidate solutions. We run multiple parallel state machines that each generate both a codebase edit and a corresponding testing script. These state machines iteratively refine their edits and tests based on execution feedback. This provides two ways to scale test-time compute: we can increase the number of iterations per state machine ("serial scaling") or increase the number of independent state machines per problem ("parallel scaling").</p>
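<p>A minimal sketch of this generation loop is shown below. It is not the actual CodeMonkeys implementation: <code>propose_edit_and_test</code> stands in for the LLM call, <code>run_test</code> for executing the generated test against the edited codebase, and the default counts are illustrative (the system description above uses 10 state machines per problem).</p>

```python
def run_state_machine(issue, context, propose_edit_and_test, run_test, iterations=8):
    """One state machine: iteratively refine an (edit, test) pair using execution feedback."""
    edit, test, feedback = None, None, None
    for _ in range(iterations):                      # "serial" scaling: more iterations
        edit, test = propose_edit_and_test(issue, context, edit, test, feedback)
        feedback = run_test(edit, test)              # execution feedback drives the next revision
        if feedback.passed:
            break
    return edit, test

def generate_candidates(issue, context, propose_edit_and_test, run_test,
                        num_machines=10, iterations=8):
    """Parallel scaling: run independent state machines and collect their (edit, test) pairs."""
    return [
        run_state_machine(issue, context, propose_edit_and_test, run_test, iterations)
        for _ in range(num_machines)
    ]
```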
131113

132-
<h3>Task 3: Selection</h3>
114+
<h3 id="selection">Task 3: Selection</h3>
133115
134116
<div class="component-details">
135-
<p><strong>Goal:</strong> Select a correct solution from the candidate edits.</p>
136117
<p><strong>Inputs:</strong> Issue Description, Relevant Files, multiple (candidate edit, candidate test) pairs</p>
137118
<p><strong>Outputs:</strong> Final edit to codebase</p>
138119
</div>
139120

140121
<center>
141-
<img src="/imgs/blog/codemonkeys/selection_sm.png" alt="" style="width: 50%; height: auto;">
122+
<img src="/imgs/blog/codemonkeys/selection_sm.png" alt="" style="margin-top: 4px; width: 50%; height: auto;">
142123
</center>
143124

144-
<p>Finally, we select among the candidate solutions. We combine two approaches: using the model-generated tests to vote on solutions, and running a dedicated selection state machine that can write additional tests to differentiate between top candidates. This selection method recovers approximately half of the gap between random selection and using an oracle to choose the correct solution.</p>
125+
<p>Finally, we select among the candidate solutions. We combine two approaches: using the model-generated tests to vote on solutions, and running a dedicated selection state machine that can write additional tests to differentiate between top candidates.</p>
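<p>As a rough sketch of the voting half of selection (not the actual implementation), the snippet below scores each candidate edit by how many of the model-generated tests it passes and keeps the top few as finalists for the selection state machine; <code>passes</code> is a hypothetical helper that applies an edit and runs one generated test against it.</p>

```python
def shortlist_by_test_votes(edits, tests, passes, top_k=3):
    # Score each candidate edit by the number of generated tests it passes.
    scores = {i: sum(passes(edit, test) for test in tests) for i, edit in enumerate(edits)}
    ranked = sorted(scores, key=scores.get, reverse=True)
    # The top candidates go on to the dedicated selection state machine.
    return [edits[i] for i in ranked[:top_k]]
```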
145126
</section>
146127

147128
<section id="results">
@@ -150,10 +131,10 @@ materials:
150131
151132
<div class="component-results">
152133
<h3>Context</h3>
153-
<p>With the 128k token limit, 92.6% of instances have the correct files in context.</p>
134+
<p>With the 128k token limit, 92.6% of instances have the correct files in context. </p>
154135
155136
<h3>Generation</h3>
156-
<p>By running multiple state machines in parallel and allowing each to iterate multiple times, we achieve 69.8% coverage. This means that for about 70% of problems, at least one of our candidate solutions is correct. Interestingly, we found that different ways of distributing compute between parallel scaling (more state machines) and serial scaling (more iterations per machine) often lead to similar coverage values.</p>
137+
<p>By running multiple state machines in parallel and allowing each to iterate multiple times, we achieve 69.8% coverage. This means that for about 70% of problems, at least one of our candidate solutions is correct. Notably, in our experiments, we found that different ways of distributing compute between parallel scaling (more state machines) and serial scaling (more iterations per machine) often lead to similar coverage values.</p>
157138
158139
<h3>Selection</h3>
159140
<p>Our selection method, which combines test-based voting with a dedicated selection state machine, recovers approximately half of the gap between random selection and using an oracle. This leads to a final score of 57.4%.</p>
@@ -280,16 +261,16 @@ materials:
280261
</style>
281262

282263

283-
<p>The cost breakdown reveals several key insights about our system:</p>
264+
<p>The cost breakdown reveals several insights about CodeMonkeys:</p>
284265

285266
<ul class="cost-analysis">
286-
<li>Our context identification contributes only 15% to total costs by amortizing this scan across multiple downstream samples.</li>
267+
<li>Our context identification contributes only 15% to total costs while maintaining high recall. A simple linear scan with a small model can be quite effective!</li>
287268

288269
<li>Generating edits is the most expensive component (60% of costs), primarily due to cache read costs from including codebase context in prompts.</li>
289270

290271
<li>We reduce costs by separating testing and editing state machines, allowing us to omit codebase context from testing prompts.</li>
291272

292-
<li>Selection contributes less than 10% to total costs while significantly improving final performance.</li>
273+
<li>Selection contributes less than 10% to total costs while significantly improving final performance (see paper for ablations).</li>
293274
</ul>
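<p>For a sense of scale, the $2300 total for the 500 problems in SWE-bench Verified works out to roughly $4.60 per problem on average.</p>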
294275

295276
<style>
@@ -309,31 +290,57 @@ materials:
309290

310291
<section id="ensemble">
311292
<h2>Barrel of Monkeys: Combining Solutions from Different Systems</h2>
293+
294+
295+
<table class="leaderboard">
296+
<thead>
297+
<tr>
298+
<th>Method</th>
299+
<th>Selection</th>
300+
<th>Score</th>
301+
</tr>
302+
</thead>
303+
<tbody>
304+
<tr><td><strong>Barrel of Monkeys</strong></td><td>Oracle (Coverage)</td><td>80.8</td></tr>
305+
<tr><td>o3</td><td>---</td><td>71.7</td></tr>
306+
<tr><td><strong>CodeMonkeys</strong></td><td>Oracle (Coverage)</td><td>69.8</td></tr>
307+
<tr><td><strong>Barrel of Monkeys</strong></td><td>State Machine</td><td>66.2</td></tr>
308+
<tr><td>Blackbox AI Agent</td><td>---</td><td>62.8</td></tr>
309+
<tr><td>CodeStory</td><td>---</td><td>62.2</td></tr>
310+
<tr><td>Learn-by-interact</td><td>---</td><td>60.2</td></tr>
311+
<tr><td>devlo</td><td>---</td><td>58.2</td></tr>
312+
<tr><td><strong>CodeMonkeys</strong></td><td>State Machine</td><td>57.4</td></tr>
313+
<tr><td>Emergent E1</td><td>---</td><td>57.2</td></tr>
314+
<tr><td>Gru</td><td>---</td><td>57.0</td></tr>
315+
</tbody>
316+
</table>
317+
312318

313319
<p>Our selection mechanism can also be used to combine candidate edits from heterogeneous sources. We demonstrate this by creating what we call the "Barrel of Monkeys" - an expanded pool of candidate edits that includes solutions from CodeMonkeys along with the submissions from the top-4 entries on the SWE-bench leaderboard (Blackbox AI Agent, CodeStory, Learn-by-interact, and devlo).</p>
314320

315321
<p>When we run our selection state machine over this expanded pool of candidate solutions, we achieve a score of 66.2%. This outperforms both CodeMonkeys on its own (57.4%) and the previous best ensemble submission (62.8%), showing how our selection method can effectively identify correct solutions even when they come from different frameworks.</p>
322+
323+
<p>Something we took away from this is the importance of selection: with oracle selection, the Barrel of Monkeys would score 80.8%, so better selection alone could yield more than 10 additional points over our current 66.2%.</p>
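<p>Conceptually, assembling and selecting over the Barrel of Monkeys looks something like the sketch below (hypothetical helpers, not the actual code): <code>load_predictions</code> reads one system's SWE-bench submission as a mapping from instance id to patch, and <code>select_final_edit</code> stands in for our selection state machine.</p>

```python
from collections import defaultdict

def build_barrel(prediction_files, load_predictions):
    """Pool candidate patches from several systems, keyed by SWE-bench instance id."""
    pool = defaultdict(list)
    for path in prediction_files:  # e.g. CodeMonkeys plus the top-4 leaderboard submissions
        for instance_id, patch in load_predictions(path).items():
            pool[instance_id].append(patch)
    return pool

def select_over_barrel(pool, select_final_edit):
    # Run selection independently for each problem over its expanded candidate pool.
    return {iid: select_final_edit(iid, patches) for iid, patches in pool.items()}
```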
324+
325+
316326
</section>
317327

318328
<section id="data">
319-
<h2>Data Release</h2>
320-
<p>We are releasing two complementary datasets that we hope support different aspects of research.</p>
329+
<h2>Data Release & Paper</h2>
330+
<p>We are releasing two complementary datasets and a paper. We hope these support different aspects of research.</p>
321331

322332
<h3><a>CodeMonkeys Trajectories</a></h3>
323333
<p>Our first dataset contains the complete problem-solving trajectories from running CodeMonkeys on SWE-bench Verified. For each of the 500 problems, we release all state data. This includes all LLM outputs.</p>
324334

325335
<h3><a href="https://huggingface.co/datasets/ScalingIntelligence/swe-bench-verified-codebase-content">SWE-bench Codebase Content</a></h3>
326336
<p>Our second dataset provides efficient access to the Python codebases required to work on SWE-bench problems. Instead of cloning and managing large Git repositories, researchers get direct access to all Python files from the relevant repositories.</p>
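<p>For example, the dataset can be explored with the Hugging Face <code>datasets</code> library. The snippet below only lists the available configurations and loads one; see the dataset card for the exact configuration and column names, which we don't assume here.</p>

```python
from datasets import get_dataset_config_names, load_dataset

name = "ScalingIntelligence/swe-bench-verified-codebase-content"

configs = get_dataset_config_names(name)  # discover the available configurations first
print(configs)

ds = load_dataset(name, configs[0])  # load one configuration; inspect its columns before use
print(ds)
```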
327-
</section>
328337

329-
<section id="conclusion">
330-
<h2>Conclusion</h2>
338+
<h3><a>CodeMonkeys paper</a></h3>
331339

332340
<p>For more details about our methods, analysis of the trade-offs between different scaling approaches, and ablation studies of our selection methods, please read our paper: <a href="#">CodeMonkeys: Scaling Test-Time Compute for Software Engineering</a>.</p>
333-
<p>How to cite? If our dataset, code, or paper was helpful to you, please consider citing:</p>
334-
</section>
335-
</div>
336-
```bibtex
341+
<p>If our dataset, code, or paper was helpful to you, please consider citing:</p>
342+
<div style="width: 100%; overflow-x: auto">
343+
<code style="white-space: pre">
337344
@misc{ehrlich2025codemonkeys,
338345
title={CodeMonkeys: Scaling Test-Time Compute for Software Engineering},
339346
author={Ryan Ehrlich and Bradley Brown and Jordan Juravsky and Ronald Clark and Christopher Ré and Azalia Mirhoseini},
@@ -342,8 +349,9 @@ materials:
342349
archivePrefix={arXiv},
343350
primaryClass={cs.LG},
344351
url={https://arxiv.org/abs/2407.21787},
345-
}
346-
```
352+
}</code>
353+
</div>
354+
</section>
347355

348356

349357
<style>
@@ -465,3 +473,4 @@ a:hover {
465473
}
466474
</style>
467475

476+
</div>
