Releases: vivshaw/llmlangbench
first benchmark run
raw results for the benchmark are attached as tarballs. two runs were conducted:
- full-60-turns.tar.gz: full benchmark. 6 tasks, 8 languages, 3 trials each, 60-turn limit per trial.
- sql-240-turns.tar.gz: focused benchmark on sql-database, the hardest task, on which many languages hit the turn limit in the full run. 240-turn limit per trial, so that agents would not hit the turn cap.
all runs used claude-sonnet-4-5-20250929 for all agent tasks.
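To inspect the attached results locally, a minimal sketch along these lines may help. It assumes both tarballs have already been downloaded into the current directory. The release notes do not describe the internal layout of the archives, so the helper (a hypothetical name, `list_archive_members`) only lists member names rather than parsing anything.

```python
import tarfile
from pathlib import Path


def list_archive_members(archive_path):
    """Return the names of all members in a gzip-compressed tarball."""
    with tarfile.open(archive_path, mode="r:gz") as archive:
        return archive.getnames()


if __name__ == "__main__":
    # File names are taken from the release notes; the files must be
    # downloaded from the release page before running this.
    for name in ("full-60-turns.tar.gz", "sql-240-turns.tar.gz"):
        if Path(name).exists():
            members = list_archive_members(name)
            print(f"{name}: {len(members)} entries")
        else:
            print(f"{name}: not found, download it from the release page first")
```

Listing members first is a cautious way to learn the directory structure before extracting anything.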
Overall scores
Full run - all tasks, 60-turn limit
Per-Language Summary
| Language | Trials | Pass% | Avg Cost | Avg Turns | Avg Time | Review |
|---|---|---|---|---|---|---|
| python | 18 | 97.0% | $0.8225 | 29.4 | 269.7s | 94 |
| typescript | 18 | 95.9% | $0.9109 | 31.4 | 276.4s | 95 |
| javascript | 18 | 95.8% | $0.9573 | 34.2 | 316.1s | 89 |
| java | 18 | 93.9% | $1.0404 | 40.3 | 363.3s | 85 |
| ruby | 18 | 92.7% | $0.9733 | 36.0 | 324.8s | 94 |
| go | 18 | 91.1% | $0.9712 | 34.4 | 302.4s | 91 |
| rust | 18 | 87.0% | $0.8907 | 32.1 | 292.3s | 91 |
| haskell | 18 | 82.0% | $0.9340 | 38.0 | 356.8s | 92 |
Per-Task Breakdown
| Task | Language | Trials | Pass% | Avg Cost | Review |
|---|---|---|---|---|---|
| http-request-parser | go | 3 | 96.9% | $0.3676 | 100 |
| http-request-parser | haskell | 3 | 96.9% | $0.4517 | 95 |
| http-request-parser | java | 3 | 96.9% | $0.5350 | 100 |
| http-request-parser | javascript | 3 | 96.9% | $0.4423 | 100 |
| http-request-parser | python | 3 | 96.9% | $0.3376 | 100 |
| http-request-parser | ruby | 3 | 96.9% | $0.3129 | 100 |
| http-request-parser | typescript | 3 | 96.9% | $0.4589 | 100 |
| http-request-parser | rust | 3 | 95.8% | $0.3926 | 91 |
| mini-typechecker | javascript | 3 | 99.2% | $1.1285 | 97 |
| mini-typechecker | java | 3 | 98.4% | $1.3682 | 91 |
| mini-typechecker | rust | 3 | 98.4% | $1.1241 | 97 |
| mini-typechecker | typescript | 3 | 98.4% | $0.9686 | 100 |
| mini-typechecker | haskell | 3 | 98.0% | $1.0734 | 100 |
| mini-typechecker | python | 3 | 97.6% | $1.0095 | 93 |
| mini-typechecker | ruby | 3 | 95.9% | $1.3768 | 100 |
| mini-typechecker | go | 3 | 91.5% | $1.4591 | 92 |
| process-simulator | go | 3 | 100.0% | $0.6571 | 98 |
| process-simulator | haskell | 3 | 100.0% | $0.9184 | 97 |
| process-simulator | java | 3 | 100.0% | $0.6590 | 100 |
| process-simulator | javascript | 3 | 100.0% | $0.6540 | 95 |
| process-simulator | python | 3 | 100.0% | $0.6494 | 95 |
| process-simulator | ruby | 3 | 100.0% | $0.5375 | 98 |
| process-simulator | rust | 3 | 100.0% | $0.4876 | 97 |
| process-simulator | typescript | 3 | 100.0% | $0.7004 | 100 |
| regex-matcher | haskell | 3 | 99.6% | $0.7114 | 95 |
| regex-matcher | go | 3 | 99.2% | $0.9706 | 100 |
| regex-matcher | python | 3 | 98.7% | $0.8450 | 92 |
| regex-matcher | rust | 3 | 98.7% | $0.7293 | 89 |
| regex-matcher | java | 3 | 97.9% | $1.1323 | 78 |
| regex-matcher | ruby | 3 | 97.5% | $1.1788 | 87 |
| regex-matcher | typescript | 3 | 97.5% | $0.5824 | 88 |
| regex-matcher | javascript | 3 | 96.2% | $0.9782 | 82 |
| sql-database | python | 3 | 88.9% | $1.6086 | 88 |
| sql-database | javascript | 3 | 82.8% | $2.1815 | 85 |
| sql-database | typescript | 3 | 82.8% | $2.3854 | 92 |
| sql-database | java | 3 | 70.0% | $2.1513 | 78 |
| sql-database | ruby | 3 | 66.0% | $2.0381 | 85 |
| sql-database | go | 3 | 59.3% | $2.0280 | 81 |
| sql-database | rust | 3 | 29.0% | $2.1932 | 92 |
| sql-database | haskell | 3 | 25.9% | $2.0842 | 85 |
| sudoku-solver | go | 3 | 100.0% | $0.3448 | 75 |
| sudoku-solver | java | 3 | 100.0% | $0.3966 | 63 |
| sudoku-solver | javascript | 3 | 100.0% | $0.3594 | 73 |
| sudoku-solver | python | 3 | 100.0% | $0.4846 | 98 |
| sudoku-solver | ruby | 3 | 100.0% | $0.3959 | 97 |
| sudoku-solver | rust | 3 | 100.0% | $0.4176 | 82 |
| sudoku-solver | typescript | 3 | 100.0% | $0.3695 | 90 |
| sudoku-solver | haskell | 3 | 71.4% | $0.3648 | 78 |
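The release notes do not say how the per-language summary is aggregated, but the figures appear consistent with an unweighted mean of the six per-task Pass% values. The sketch below checks that for two languages, using numbers copied from the per-task table. The per-task values are themselves rounded, so agreement is only expected to one decimal place, and `summary_pass` is a hypothetical helper name.

```python
from statistics import mean

# Pass% per task, copied from the per-task table (60-turn run).
# Task order: http-request-parser, mini-typechecker, process-simulator,
# regex-matcher, sql-database, sudoku-solver.
PASS_BY_TASK = {
    "python": [96.9, 97.6, 100.0, 98.7, 88.9, 100.0],
    "haskell": [96.9, 98.0, 100.0, 99.6, 25.9, 71.4],
}


def summary_pass(language):
    """Unweighted mean of per-task Pass%, rounded to one decimal place."""
    return round(mean(PASS_BY_TASK[language]), 1)


for language in PASS_BY_TASK:
    print(language, summary_pass(language))
# python 97.0
# haskell 82.0
```

Both results match the per-language summary table (97.0% and 82.0%). Since every task has the same number of trials per language, an unweighted mean over tasks is a plausible reading, though it remains an assumption.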
sql-database only, 240-turn limit
Per-Language Summary
| Language | Trials | Pass% | Avg Cost | Avg Turns | Avg Time | Review |
|---|---|---|---|---|---|---|
| python | 3 | 90.2% | $2.7597 | 73.7 | 656.0s | 83 |
| go | 3 | 87.2% | $3.2566 | 78.3 | 778.2s | 83 |
| java | 3 | 85.2% | $3.7524 | 101.3 | 860.9s | 83 |
| javascript | 3 | 83.2% | $3.0833 | 85.0 | 974.6s | 83 |
| typescript | 3 | 79.8% | $2.6583 | 67.0 | 697.1s | 93 |
| ruby | 3 | 78.5% | $2.6174 | 76.0 | 878.0s | 83 |
| rust | 3 | 77.1% | $3.3993 | 77.0 | 957.9s | 82 |
| haskell | 3 | 44.1% | $3.0055 | 90.0 | 814.7s | 81 |
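To see how much the higher turn cap changed sql-database results, the pass rates from the two runs can be compared directly. This is a minimal sketch using only numbers from the tables in these notes. With 3 trials per language in each run, small differences should be read with caution.

```python
# sql-database Pass% per language, copied from the two tables in these notes.
PASS_60_TURNS = {
    "python": 88.9, "javascript": 82.8, "typescript": 82.8, "java": 70.0,
    "ruby": 66.0, "go": 59.3, "rust": 29.0, "haskell": 25.9,
}
PASS_240_TURNS = {
    "python": 90.2, "go": 87.2, "java": 85.2, "javascript": 83.2,
    "typescript": 79.8, "ruby": 78.5, "rust": 77.1, "haskell": 44.1,
}

# Change in percentage points when moving from the 60-turn to the 240-turn cap.
deltas = {
    language: round(PASS_240_TURNS[language] - PASS_60_TURNS[language], 1)
    for language in PASS_60_TURNS
}

# Print languages from largest gain to largest loss.
for language, delta in sorted(deltas.items(), key=lambda item: item[1], reverse=True):
    print(f"{language:<11}{delta:+.1f}")
```

On these numbers rust gains the most (+48.1 points), followed by go (+27.9) and haskell (+18.2), while typescript is the only language that scores lower in the 240-turn run (-3.0).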