Commit 803db38

feat: Add 'Price per Success' column to Merbench leaderboard (#34)
* feat: Add 'Price per Success' column to Merbench leaderboard

  This commit adds a new 'Price per Success' column to the leaderboard table on the Merbench page. The 'Price per Success' is calculated as `Avg. Cost / Success Rate` and provides a more realistic measure of the cost to achieve a successful outcome, factoring in the cost of failed attempts.

  The changes include:
  - Updating the `LeaderboardTable.astro` component to add the new column.
  - Updating the `merbench-types.ts` to include the new field in the `LeaderboardEntry` interface.
  - Modifying `merbench.ts` to calculate the new metric when data is filtered.
  - Modifying `merbench.astro` to calculate the new metric for the initial data load.

* feat: Add 'Price per Success' column to Merbench leaderboard

  This commit adds a new 'Price per Success' column to the leaderboard table on the Merbench page. The 'Price per Success' is calculated as `Avg. Cost / Success Rate` and provides a more realistic measure of the cost to achieve a successful outcome, factoring in the cost of failed attempts.

  This also includes a fix for a build failure caused by a TypeScript type mismatch where `null` was used instead of `undefined`.

* fix: Correct client-side rendering for sorting

  This commit fixes an issue where sorting the 'Price per Success' column would break the table layout. The client-side rendering logic in `src/scripts/merbench-sorting.ts` was not updated to include the new column, which caused the table to be re-rendered incorrectly when sorted. This has been corrected by updating the `renderLeaderboard` function to include the 'Price/Success' column, ensuring consistency with the server-side rendering.

* chore: Skip processing of Medium post titles

* feat: Metrics explanation

---------

Co-authored-by: google-labs-jules[bot] <161369871+google-labs-jules[bot]@users.noreply.github.com>
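
As a rough illustration of the formula described in the commit message, the sketch below works through two made-up models. The helper name `pricePerSuccess` and the sample numbers are assumptions for illustration, not code or data from the commit. It treats the success rate as a percentage, which is how the `merbench.ts` change in this commit handles it.

```typescript
// Hypothetical helper (name and signature are assumptions, not from the commit).
// successRatePct is a percentage in the range 0..100.
function pricePerSuccess(avgCost: number, successRatePct: number): number | undefined {
  // With no successes there is no meaningful cost per success.
  if (successRatePct <= 0) return undefined;
  return avgCost / (successRatePct / 100);
}

// Made-up numbers: a cheap model that succeeds one run in four,
// and a pricier model that always succeeds.
const cheapButFlaky = pricePerSuccess(0.01, 25);
const pricierButReliable = pricePerSuccess(0.03, 100);

console.log(cheapButFlaky?.toFixed(4)); // "0.0400"
console.log(pricierButReliable?.toFixed(4)); // "0.0300"
console.log(pricePerSuccess(0.05, 0)); // undefined
```

In this made-up case the model with the lower cost per run ends up with the higher cost per successful diagram, which is the situation the new column is meant to expose.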
1 parent 6665aeb commit 803db38

6 files changed, +158 -16 lines changed

npm_output.log

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+
+
+> astro dev

src/components/merbench/LeaderboardTable.astro

Lines changed: 9 additions & 2 deletions
@@ -36,6 +36,9 @@ const costRange = maxCost - minCost;
 <th class="sortable" data-sort-key="Avg_Cost" data-sort-type="number">
   Avg Cost/Run <span class="sort-indicator"></span>
 </th>
+<th class="sortable" data-sort-key="Price_per_Success" data-sort-type="number">
+  Price/Success <span class="sort-indicator"></span>
+</th>
 <th class="sortable" data-sort-key="Avg_Duration" data-sort-type="number">
   Avg Duration <span class="sort-indicator"></span>
 </th>
@@ -90,6 +93,9 @@ const costRange = maxCost - minCost;
       </span>
     </div>
   </td>
+  <td class="cost">
+    {entry.Price_per_Success != null ? `$${entry.Price_per_Success.toFixed(4)}` : 'N/A'}
+  </td>
   <td class="duration">{entry.Avg_Duration.toFixed(2)}s</td>
   <td class="tokens">{entry.Avg_Tokens.toLocaleString()}</td>
   <td class="runs">{entry.Runs}</td>
@@ -112,8 +118,8 @@ const costRange = maxCost - minCost;
 }

 .leaderboard-section h2 {
-  margin-top: 0;
-  margin-bottom: 1.5rem;
+  margin-top: -1rem;
+  margin-bottom: 0.5rem;
 }

 .leaderboard-table {
@@ -204,6 +210,7 @@ const costRange = maxCost - minCost;
   font-size: 0.85rem;
   text-transform: uppercase;
   letter-spacing: 0.5px;
+  position: relative; /* Needed for tooltip positioning */
 }

 /* Sortable header styles */

src/lib/merbench-types.ts

Lines changed: 1 addition & 0 deletions
@@ -18,6 +18,7 @@ export interface LeaderboardEntry {
   Avg_Cost?: number;
   Avg_Input_Cost?: number;
   Avg_Output_Cost?: number;
+  Price_per_Success?: number;
   Runs: number;
   Provider: string;
 }

src/lib/merbench.ts

Lines changed: 20 additions & 13 deletions
@@ -94,17 +94,24 @@ export const getFilteredData = (
   });

   const filteredLeaderboard = Object.values(modelStats)
-    .map((stats) => ({
-      Model: stats.Model,
-      Success_Rate: (stats.successCount / stats.totalRuns) * 100,
-      Avg_Duration: stats.totalDuration / stats.totalRuns,
-      Avg_Tokens: stats.totalTokens / stats.totalRuns,
-      Avg_Cost: stats.totalCost / stats.totalRuns,
-      Avg_Input_Cost: stats.totalInputCost / stats.totalRuns,
-      Avg_Output_Cost: stats.totalOutputCost / stats.totalRuns,
-      Runs: stats.totalRuns,
-      Provider: stats.Provider,
-    }))
+    .map((stats) => {
+      const successRate = stats.totalRuns > 0 ? (stats.successCount / stats.totalRuns) * 100 : 0;
+      const avgCost = stats.totalRuns > 0 ? stats.totalCost / stats.totalRuns : 0;
+      const pricePerSuccess = successRate > 0 ? avgCost / (successRate / 100) : undefined;
+
+      return {
+        Model: stats.Model,
+        Success_Rate: successRate,
+        Avg_Duration: stats.totalRuns > 0 ? stats.totalDuration / stats.totalRuns : 0,
+        Avg_Tokens: stats.totalRuns > 0 ? stats.totalTokens / stats.totalRuns : 0,
+        Avg_Cost: avgCost,
+        Avg_Input_Cost: stats.totalRuns > 0 ? stats.totalInputCost / stats.totalRuns : 0,
+        Avg_Output_Cost: stats.totalRuns > 0 ? stats.totalOutputCost / stats.totalRuns : 0,
+        Price_per_Success: pricePerSuccess,
+        Runs: stats.totalRuns,
+        Provider: stats.Provider,
+      };
+    })
     .sort((a, b) => b.Success_Rate - a.Success_Rate);

   // Recalculate pareto data
@@ -280,8 +287,8 @@ export const sortLeaderboard = (
   direction: 'asc' | 'desc'
 ): LeaderboardEntry[] => {
   const sorted = [...data].sort((a, b) => {
-    let aVal: any;
-    let bVal: any;
+    let aVal: LeaderboardEntry[keyof LeaderboardEntry];
+    let bVal: LeaderboardEntry[keyof LeaderboardEntry];

     // Handle special cases for cost calculation
     if (sortKey === 'Avg_Cost') {
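
Because `Price_per_Success` can be `undefined`, a sort on that column has to decide where such rows go. The rest of `sortLeaderboard` is not shown in this hunk, so the following is only a sketch of one possible approach, not the commit's implementation: the `Row` type and `comparePrice` function are hypothetical, and the choice to always place missing values last is an assumption.

```typescript
// Hypothetical, trimmed-down row type for illustration only.
interface Row {
  Model: string;
  Price_per_Success?: number;
}

type Direction = 'asc' | 'desc';

// Returns a comparator that orders rows by price and keeps rows
// without a price at the end for both sort directions.
function comparePrice(direction: Direction) {
  return (a: Row, b: Row): number => {
    const aVal = a.Price_per_Success;
    const bVal = b.Price_per_Success;
    if (aVal == null && bVal == null) return 0;
    if (aVal == null) return 1; // a has no price: sort it after b
    if (bVal == null) return -1; // b has no price: sort it after a
    return direction === 'asc' ? aVal - bVal : bVal - aVal;
  };
}

// Made-up sample data.
const rows: Row[] = [
  { Model: 'model-a', Price_per_Success: 0.04 },
  { Model: 'model-b' },
  { Model: 'model-c', Price_per_Success: 0.01 },
];

const ascending = [...rows].sort(comparePrice('asc')).map((r) => r.Model);
const descending = [...rows].sort(comparePrice('desc')).map((r) => r.Model);

console.log(ascending); // ["model-c", "model-a", "model-b"]
console.log(descending); // ["model-a", "model-c", "model-b"]
```

Keeping the 'N/A' rows at the bottom in both directions avoids a model with zero successes appearing to be the cheapest when sorting ascending.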

src/pages/merbench.astro

Lines changed: 122 additions & 1 deletion
@@ -14,6 +14,16 @@ import merbenchData from '../data/merbench_data.json';
 const data = merbenchData as MerbenchData;

 const { stats, leaderboard, config } = data;
+
+// Calculate Price_per_Success for the initial leaderboard data
+leaderboard.forEach((entry) => {
+  if (entry.Success_Rate > 0 && entry.Avg_Cost != null) {
+    entry.Price_per_Success = entry.Avg_Cost / (entry.Success_Rate / 100);
+  } else {
+    entry.Price_per_Success = undefined;
+  }
+});
+
 const formattedDescription = renderMarkdownToHTML(config.description);

 // Fetch last commit info for the merbench data file
@@ -49,7 +59,66 @@ const lastUpdated = fileCommitInfo ? formatDate(fileCommitInfo.date) : null;

 <CombinedFilters difficulties={['easy', 'medium', 'hard']} providers={stats.providers} />

-<LeaderboardTable leaderboard={leaderboard} />
+<section class="leaderboard-section">
+  <details class="metrics-glossary">
+    <summary>
+      <svg
+        class="icon"
+        width="16"
+        height="16"
+        viewBox="0 0 16 16"
+        fill="none"
+        xmlns="http://www.w3.org/2000/svg"
+      >
+        <path
+          d="M8 15C11.866 15 15 11.866 15 8C15 4.13401 11.866 1 8 1C4.13401 1 1 4.13401 1 8C1 11.866 4.13401 15 8 15Z"
+          stroke="currentColor"
+          stroke-width="1.5"
+          stroke-linecap="round"
+          stroke-linejoin="round"></path>
+        <path
+          d="M8 10.5V8"
+          stroke="currentColor"
+          stroke-width="1.5"
+          stroke-linecap="round"
+          stroke-linejoin="round"></path>
+        <path
+          d="M8 5.5H8.0075"
+          stroke="currentColor"
+          stroke-width="1.5"
+          stroke-linecap="round"
+          stroke-linejoin="round"></path>
+      </svg>
+      What do these metrics mean?
+    </summary>
+    <div class="glossary-content">
+      <dl>
+        <dt>Success Rate</dt>
+        <dd>The percentage of successful Mermaid diagram generations out of all runs.</dd>
+
+        <dt>Avg Cost/Run</dt>
+        <dd>The average cost in USD to generate one diagram, based on provider pricing.</dd>
+
+        <dt>Price/Success</dt>
+        <dd>
+          The effective cost for each successful diagram, calculated as (Avg Cost / Success
+          Rate).
+        </dd>
+
+        <dt>Avg Duration</dt>
+        <dd>The average time in seconds taken to generate a diagram.</dd>
+
+        <dt>Avg Tokens</dt>
+        <dd>The average number of tokens (input + output) used per run.</dd>
+
+        <dt>Runs</dt>
+        <dd>The total number of times this model was run in the evaluation.</dd>
+      </dl>
+    </div>
+  </details>
+
+  <LeaderboardTable leaderboard={leaderboard} />
+</section>

 <ChartContainer chartId="pareto-chart" title="Performance vs Efficiency Trade-offs" />
 <ChartContainer chartId="test-group-chart" title="Performance by Difficulty Level" />
@@ -239,6 +308,58 @@ const lastUpdated = fileCommitInfo ? formatDate(fileCommitInfo.date) : null;
   margin-bottom: 1.5rem;
 }

+.metrics-glossary {
+  background: var(--bg-primary);
+  border: 1px solid var(--border-color);
+  border-radius: 6px;
+  margin-bottom: 0.5rem;
+  padding: 0;
+}
+
+.metrics-glossary summary {
+  padding: 0.75rem 1rem;
+  cursor: pointer;
+  display: flex;
+  align-items: center;
+  gap: 0.5rem;
+  font-weight: 500;
+  font-size: 0.9rem;
+  color: var(--text-secondary);
+  user-select: none;
+}
+
+.metrics-glossary summary:hover {
+  color: var(--text-primary);
+}
+
+.glossary-content {
+  padding: 0 1rem 1rem;
+  border-top: 1px solid var(--border-color);
+}
+
+.glossary-content dl {
+  margin: 0;
+  display: grid;
+  gap: 1rem;
+}
+
+.glossary-content dt {
+  font-weight: 600;
+  color: var(--text-primary);
+  margin-bottom: 0.25rem;
+}
+
+.glossary-content dd {
+  margin: 0;
+  color: var(--text-secondary);
+  font-size: 0.9rem;
+}
+
+/* Rotate chevron on open */
+.metrics-glossary[open] summary::after {
+  transform: rotate(180deg);
+}
+
 .leaderboard-table {
   overflow-x: auto;
   -webkit-overflow-scrolling: touch; /* Smooth scrolling on iOS */

src/scripts/merbench-sorting.ts

Lines changed: 3 additions & 0 deletions
@@ -121,6 +121,9 @@ const renderLeaderboard = (data: LeaderboardEntry[]): void => {
       <span class="progress-text" style="color: var(--progress-text-color); text-shadow: var(--progress-text-shadow);">$${currentCost.toFixed(4)}</span>
     </div>
   </td>
+  <td class="cost">
+    ${entry.Price_per_Success != null ? `$${entry.Price_per_Success.toFixed(4)}` : 'N/A'}
+  </td>
   <td class="duration">${entry.Avg_Duration.toFixed(2)}s</td>
   <td class="tokens">${entry.Avg_Tokens.toLocaleString()}</td>
   <td class="runs">${entry.Runs}</td>
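
The same cell formatting expression appears in both `LeaderboardTable.astro` (server-side) and `merbench-sorting.ts` (client-side), and the commit message notes that the two fell out of sync once already. One way to reduce that risk would be a shared formatter. The sketch below is an assumption about how that could look; `formatPricePerSuccess` is a hypothetical name and is not part of this commit.

```typescript
// Hypothetical shared formatter (not part of the commit).
// Mirrors the expression used in both templates:
//   value != null ? `$${value.toFixed(4)}` : 'N/A'
function formatPricePerSuccess(value: number | null | undefined): string {
  return value != null ? `$${value.toFixed(4)}` : 'N/A';
}

console.log(formatPricePerSuccess(0.5)); // "$0.5000"
console.log(formatPricePerSuccess(undefined)); // "N/A"
console.log(formatPricePerSuccess(0)); // "$0.0000"
```

Note that a value of `0` is formatted as a price rather than 'N/A', since only `null` and `undefined` are treated as missing.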
