Releases: vivshaw/llmlangbench

first benchmark run

19 Feb 01:48

raw results for the benchmark are attached as tarballs. two runs were conducted:

  • full-60-turns.tar.gz: full benchmark. 6 tasks, 8 languages, 3 trials each, 60-turn limit per trial.
  • sql-240-turns.tar.gz: focused benchmark on sql-database, the hardest task, where many languages hit the turn limit in the full run. 240-turn limit per trial, so that agents would not hit the turn cap.
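
as a sanity check on the run sizes, the counts in the list above can be multiplied out. this is a minimal sketch using only the numbers stated there; the function and key names are illustrative and do not come from the benchmark code.

```python
def trial_counts(tasks: int, languages: int, trials_per_pair: int) -> dict[str, int]:
    """Return trial totals for a run with the given shape.

    trials_per_pair is the number of trials for each (task, language) pair.
    """
    return {
        "total": tasks * languages * trials_per_pair,
        "per_language": tasks * trials_per_pair,
        "per_task": languages * trials_per_pair,
    }


# full run: 6 tasks, 8 languages, 3 trials each
full_run = trial_counts(tasks=6, languages=8, trials_per_pair=3)

# sql-only run: 1 task, 8 languages, 3 trials each
sql_run = trial_counts(tasks=1, languages=8, trials_per_pair=3)

print(full_run)
print(sql_run)
```

the per-language figures this produces (18 for the full run, 3 for the sql-only run) line up with the Trials column in the summary tables.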

all runs used claude-sonnet-4-5-20250929 for all agent tasks.

Overall scores

Full run - all tasks, 60-turn limit

Per-Language Summary

Language Trials Pass% Avg Cost Avg Turns Avg Time Review
python 18 97.0% $0.8225 29.4 269.7s 94
typescript 18 95.9% $0.9109 31.4 276.4s 95
javascript 18 95.8% $0.9573 34.2 316.1s 89
java 18 93.9% $1.0404 40.3 363.3s 85
ruby 18 92.7% $0.9733 36.0 324.8s 94
go 18 91.1% $0.9712 34.4 302.4s 91
rust 18 87.0% $0.8907 32.1 292.3s 91
haskell 18 82.0% $0.9340 38.0 356.8s 92
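
for anyone who wants to work with these rows programmatically, here is a rough sketch of a parser for one summary line. the field names and the unit readings (dollars for the `$` prefix, seconds for the `s` suffix) are assumptions based on how the table is printed, not on the benchmark's own schema.

```python
from dataclasses import dataclass


@dataclass
class SummaryRow:
    # Field names are hypothetical; they mirror the table headers.
    language: str
    trials: int
    pass_pct: float
    avg_cost_usd: float
    avg_turns: float
    avg_time_s: float
    review: int


def parse_summary_row(line: str) -> SummaryRow:
    """Parse one whitespace-separated row of the Per-Language Summary table."""
    language, trials, pass_pct, cost, turns, time_s, review = line.split()
    return SummaryRow(
        language=language,
        trials=int(trials),
        pass_pct=float(pass_pct.rstrip("%")),   # "97.0%" -> 97.0
        avg_cost_usd=float(cost.lstrip("$")),   # "$0.8225" -> 0.8225
        avg_turns=float(turns),
        avg_time_s=float(time_s.rstrip("s")),   # "269.7s" -> 269.7
        review=int(review),
    )


row = parse_summary_row("python 18 97.0% $0.8225 29.4 269.7s 94")
print(row)
```

the per-task table has fewer columns (no Avg Turns or Avg Time), so it would need its own variant of this function.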

Per-Task Breakdown

Task Language Trials Pass% Avg Cost Review
http-request-parser go 3 96.9% $0.3676 100
http-request-parser haskell 3 96.9% $0.4517 95
http-request-parser java 3 96.9% $0.5350 100
http-request-parser javascript 3 96.9% $0.4423 100
http-request-parser python 3 96.9% $0.3376 100
http-request-parser ruby 3 96.9% $0.3129 100
http-request-parser typescript 3 96.9% $0.4589 100
http-request-parser rust 3 95.8% $0.3926 91
mini-typechecker javascript 3 99.2% $1.1285 97
mini-typechecker java 3 98.4% $1.3682 91
mini-typechecker rust 3 98.4% $1.1241 97
mini-typechecker typescript 3 98.4% $0.9686 100
mini-typechecker haskell 3 98.0% $1.0734 100
mini-typechecker python 3 97.6% $1.0095 93
mini-typechecker ruby 3 95.9% $1.3768 100
mini-typechecker go 3 91.5% $1.4591 92
process-simulator go 3 100.0% $0.6571 98
process-simulator haskell 3 100.0% $0.9184 97
process-simulator java 3 100.0% $0.6590 100
process-simulator javascript 3 100.0% $0.6540 95
process-simulator python 3 100.0% $0.6494 95
process-simulator ruby 3 100.0% $0.5375 98
process-simulator rust 3 100.0% $0.4876 97
process-simulator typescript 3 100.0% $0.7004 100
regex-matcher haskell 3 99.6% $0.7114 95
regex-matcher go 3 99.2% $0.9706 100
regex-matcher python 3 98.7% $0.8450 92
regex-matcher rust 3 98.7% $0.7293 89
regex-matcher java 3 97.9% $1.1323 78
regex-matcher ruby 3 97.5% $1.1788 87
regex-matcher typescript 3 97.5% $0.5824 88
regex-matcher javascript 3 96.2% $0.9782 82
sql-database python 3 88.9% $1.6086 88
sql-database javascript 3 82.8% $2.1815 85
sql-database typescript 3 82.8% $2.3854 92
sql-database java 3 70.0% $2.1513 78
sql-database ruby 3 66.0% $2.0381 85
sql-database go 3 59.3% $2.0280 81
sql-database rust 3 29.0% $2.1932 92
sql-database haskell 3 25.9% $2.0842 85
sudoku-solver go 3 100.0% $0.3448 75
sudoku-solver java 3 100.0% $0.3966 63
sudoku-solver javascript 3 100.0% $0.3594 73
sudoku-solver python 3 100.0% $0.4846 98
sudoku-solver ruby 3 100.0% $0.3959 97
sudoku-solver rust 3 100.0% $0.4176 82
sudoku-solver typescript 3 100.0% $0.3695 90
sudoku-solver haskell 3 71.4% $0.3648 78
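
the per-language Pass% in the summary table appears to be the unweighted mean of that language's six per-task Pass% values. that is an inference from the numbers, not something the release states, and the sketch below only checks it for three languages using values copied from the tables above.

```python
# Per-task Pass% copied from the Per-Task Breakdown, in this task order:
# http-request-parser, mini-typechecker, process-simulator,
# regex-matcher, sql-database, sudoku-solver
PER_TASK_PASS = {
    "python": [96.9, 97.6, 100.0, 98.7, 88.9, 100.0],
    "rust": [95.8, 98.4, 100.0, 98.7, 29.0, 100.0],
    "haskell": [96.9, 98.0, 100.0, 99.6, 25.9, 71.4],
}

# Pass% copied from the Per-Language Summary of the full run.
REPORTED = {"python": 97.0, "rust": 87.0, "haskell": 82.0}


def mean_pass(values: list[float]) -> float:
    """Unweighted mean of per-task pass percentages."""
    return sum(values) / len(values)


for lang, values in PER_TASK_PASS.items():
    print(f"{lang:<8} computed={mean_pass(values):.2f} reported={REPORTED[lang]:.1f}")
```

for these three languages the computed mean agrees with the reported figure to within rounding. because the per-task values are themselves rounded to one decimal, small mismatches in the last digit would not disprove the assumption.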

sql-database only, 240-turn limit

Per-Language Summary

Language Trials Pass% Avg Cost Avg Turns Avg Time Review
python 3 90.2% $2.7597 73.7 656.0s 83
go 3 87.2% $3.2566 78.3 778.2s 83
java 3 85.2% $3.7524 101.3 860.9s 83
javascript 3 83.2% $3.0833 85.0 974.6s 83
typescript 3 79.8% $2.6583 67.0 697.1s 93
ruby 3 78.5% $2.6174 76.0 878.0s 83
rust 3 77.1% $3.3993 77.0 957.9s 82
haskell 3 44.1% $3.0055 90.0 814.7s 81
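
to see how much the higher turn limit changed sql-database results, the Pass% values from the two runs can be subtracted per language. this is a small sketch using only numbers copied from the tables in this release; the variable names are illustrative.

```python
# sql-database Pass% from the full run (60-turn limit).
PASS_60 = {
    "python": 88.9, "javascript": 82.8, "typescript": 82.8, "java": 70.0,
    "ruby": 66.0, "go": 59.3, "rust": 29.0, "haskell": 25.9,
}

# sql-database Pass% from the focused run (240-turn limit).
PASS_240 = {
    "python": 90.2, "go": 87.2, "java": 85.2, "javascript": 83.2,
    "typescript": 79.8, "ruby": 78.5, "rust": 77.1, "haskell": 44.1,
}

# Change in percentage points when moving from 60 to 240 turns.
deltas = {lang: round(PASS_240[lang] - PASS_60[lang], 1) for lang in PASS_60}

# Print languages from largest gain to largest loss.
for lang, delta in sorted(deltas.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{lang:<11}{PASS_60[lang]:>6.1f} -> {PASS_240[lang]:>5.1f}  ({delta:+.1f})")
```

with only 3 trials per language in each run, small differences (for example the slight drop for typescript) may well be noise rather than an effect of the turn limit.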