Commit 40151af

Random blog updates
1 parent 75b825b commit 40151af

File tree

1 file changed: +4 −4 lines changed


_blogs/codemonkeys.md

Lines changed: 4 additions & 4 deletions
@@ -55,17 +55,17 @@ materials:
 
 <section id="swebench">
 <h2>SWE-bench</h2>
-<p><a href="https://swebench.com">SWE-bench</a> is a benchmark that measures how well AI systems can solve real-world software engineering problems. Each problem in SWE-bench consists of an actual GitHub issue from a popular open-source Python repository (like Django or Sympy) and the complete codebase at the time the issue was reported.</p>
+<p><a href="https://swebench.com">SWE-bench</a> is a benchmark that measures how well AI systems can solve real-world GitHub issues. Each instance in SWE-bench consists of an issue from a popular open-source Python repository (like Django or SymPy) along with the complete codebase at the time the issue was reported.</p>
 
 <img src="/imgs/blog/codemonkeys/swebench.png" alt="SWE-bench problem overview." style="width: 100%; height: auto;">
 
-<p>To solve a SWE-bench issue, systems produce an edit to the given codebase, with the goal of resolving the described issue. This edit is evaluated for correctness using unit tests from the codebase, which are hidden from the system at test-time. A model's score is simply the fraction of issues where the model's patch is marked as correct.</p>
+<p>To solve an instance, a system must appropriately edit the given codebase in order to resolve the corresponding issue. An edit can be automatically evaluated for correctness using a set of unit tests that are hidden from the system.</p>
 
-<p>In this work, we've focused on <a href="https://openai.com/index/introducing-swe-bench-verified/">SWE-bench Verified</a>, a subset of SWE-bench validated by human annotators.</p>
+<p>In this work, we've focused on <a href="https://openai.com/index/introducing-swe-bench-verified/">SWE-bench Verified</a>, a subset of SWE-bench where human annotators have filtered out low-quality instances (e.g. those with ambiguous issue descriptions).</p>
 </section>
 <section id="motivation">
 <h2>Large Language Monkeys</h2>
-<p>We first became interested in SWE-bench during our previous work, <a href="https://scalingintelligence.stanford.edu/pubs/large_language_monkeys/">Large Language Monkeys</a>. In that paper, we demonstrated a promising property of LLMs: when solving software engineering (and other) problems, coverage, the fraction of problems that are solved by at least one attempt, increases log-linearly with the number of solutions drawn from the model.</p>
+<p>We began working on SWE-bench in our previous work, <a href="https://scalingintelligence.stanford.edu/pubs/large_language_monkeys/">Large Language Monkeys</a>. In that paper, we demonstrated a promising property of LLMs: when solving software engineering (and other) problems, coverage, the fraction of problems that are solved by at least one attempt, increases log-linearly with the number of solutions drawn from the model.</p>
 
 <img src="/imgs/blog/monkeys/coverage.png" alt="Coverage (percent of problems solved by any sample) increases across five code and math reasoning tasks." style="width: 100%; height: auto;">
 
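For intuition about the evaluation flow described in the patched paragraph above: scoring a candidate edit amounts to applying it to the repository checkout and running the instance's held-out tests. A minimal sketch, with hypothetical function and command names (the official SWE-bench harness does this in isolated per-instance environments, not via a bare subprocess call like this):

```python
import subprocess

def evaluate_patch(repo_dir: str, patch_file: str, test_cmd: list[str]) -> bool:
    """Apply a model-generated edit, then run the instance's hidden unit tests.
    Hypothetical sketch only; not the official SWE-bench evaluation harness."""
    # Try to apply the candidate edit to the checked-out codebase.
    applied = subprocess.run(["git", "apply", patch_file], cwd=repo_dir)
    if applied.returncode != 0:
        return False  # a patch that fails to apply counts as unresolved
    # Run the held-out tests; the instance is resolved iff they pass.
    tests = subprocess.run(test_cmd, cwd=repo_dir)
    return tests.returncode == 0

# A system's score is the fraction of instances whose edit is marked resolved.
```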

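And for the coverage metric in the Large Language Monkeys paragraph: coverage at k attempts is typically computed with the unbiased pass@k estimator of Chen et al. (2021). A small sketch, assuming you have per-problem counts of total and correct samples:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of the chance that at least one of k samples is
    correct, given n total samples of which c passed (Chen et al., 2021)."""
    if n - c < k:
        return 1.0  # too few failures to fill k draws: success is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

def coverage_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Coverage at k attempts: mean pass@k over problems, where `results`
    holds (total_samples, correct_samples) for each problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)

# e.g. coverage_at_k([(250, 3), (250, 0), (250, 40)], k=10)
```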