
Commit 5362e14

Merge pull request #306 from sahilds1/262-choose-a-model-and-prompt
[#262] Choose a model and prompt using evals
2 parents 5551bd4 + 0e2893b commit 5362e14

File tree: 4 files changed (+390, -256 lines)

evaluation/README.md

Lines changed: 160 additions & 22 deletions
# Evaluations

## `evals`: LLM evaluations to test and improve model outputs

### Evaluation Metrics

Natural Language Generation Performance:

[Extractiveness](https://huggingface.co/docs/lighteval/en/metric-list#automatic-metrics-for-generative-tasks) (see the sketch below):

* Extractiveness Coverage: Extent to which a summary is derivative of the source text
* Extractiveness Density: How well the summary's word sequence can be described as a series of extractions
* Extractiveness Compression: Word ratio between the article and the summary
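
These scores are the coverage, density, and compression statistics introduced with the Newsroom summarization dataset, which `lighteval` implements. As a rough, dependency-free sketch of how they are computed (whitespace tokenization is a simplifying assumption here; `lighteval`'s own implementation may tokenize differently):

```python
# Sketch of the extractiveness statistics: greedily match maximal shared
# token spans ("fragments") between the article and the summary, then
# derive the three scores from those fragments.

def extractive_fragments(article_tokens, summary_tokens):
    fragments = []
    i = 0
    while i < len(summary_tokens):
        best = 0
        for j in range(len(article_tokens)):
            k = 0
            while (i + k < len(summary_tokens)
                   and j + k < len(article_tokens)
                   and summary_tokens[i + k] == article_tokens[j + k]):
                k += 1
            best = max(best, k)
        if best > 0:
            fragments.append(summary_tokens[i:i + best])
            i += best
        else:
            i += 1
    return fragments


def extractiveness(article: str, summary: str) -> dict:
    a, s = article.split(), summary.split()
    frags = extractive_fragments(a, s)
    return {
        "coverage": sum(len(f) for f in frags) / len(s),      # share of summary tokens copied from the article
        "density": sum(len(f) ** 2 for f in frags) / len(s),  # average squared fragment length
        "compression": len(a) / len(s),                       # article-to-summary word ratio
    }
```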
API Performance (see the sketch after this list):

* Token Usage (input/output)
* Estimated Cost in USD
* Duration (in seconds)
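
These numbers can be captured around each LLM call roughly as follows; the OpenAI-style client, the `<MODEL_NAME>` placeholder, and the per-token prices are illustrative assumptions, not the exact mechanism used by `evals.py`:

```python
import time

from openai import OpenAI  # assumption: an OpenAI-style SDK; swap in the provider you actually use

# Placeholder rates in USD per token; look up the real pricing for the model under test
PRICE_PER_INPUT_TOKEN = 1e-6
PRICE_PER_OUTPUT_TOKEN = 2e-6

client = OpenAI()

start = time.perf_counter()
response = client.chat.completions.create(
    model="<MODEL_NAME>",
    messages=[{"role": "user", "content": "<YOUR_QUERY>"}],
)
duration_s = time.perf_counter() - start

input_tokens = response.usage.prompt_tokens
output_tokens = response.usage.completion_tokens
cost_usd = input_tokens * PRICE_PER_INPUT_TOKEN + output_tokens * PRICE_PER_OUTPUT_TOKEN

print(f"tokens: {input_tokens}/{output_tokens}, cost: ${cost_usd:.6f}, duration: {duration_s:.2f}s")
```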
### Test Data

Generate the dataset file by connecting to a database of research papers:

Connect to the Postgres database of your local Balancer instance:

```python
from sqlalchemy import create_engine

engine = create_engine("postgresql+psycopg2://balancer:balancer@localhost:5433/balancer_dev")
```

Connect to the Postgres database of the production Balancer instance by restoring a SQL backup file:

```sh
# Add Postgres.app binaries to the PATH
echo 'export PATH="/Applications/Postgres.app/Contents/Versions/latest/bin:$PATH"' >> ~/.zshrc

createdb <DB_NAME>
pg_restore -v -d <DB_NAME> <PATH_TO_BACKUP>.sql
```

Generate the dataset CSV file:

```python
from sqlalchemy import create_engine
import pandas as pd

engine = create_engine("postgresql://<USER>@localhost:5432/<DB_NAME>")

query = "SELECT * FROM api_embeddings;"
df = pd.read_sql(query, engine)

df['INPUT'] = df.apply(lambda row: f"ID: {row['chunk_number']} | CONTENT: {row['text']}", axis=1)

# Ensure the chunks are joined in order of chunk_number by sorting the DataFrame before grouping and joining
df = df.sort_values(by=['name', 'upload_file_id', 'chunk_number'])
df_grouped = df.groupby(['name', 'upload_file_id'])['INPUT'].apply(lambda chunks: "\n".join(chunks)).reset_index()

df_grouped.to_csv('<DATASET_CSV_PATH>', index=False)
```
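
To sanity-check the export, read the CSV back and confirm there is one row per document with its chunks concatenated in order; the column names below follow the grouping code above:

```python
import pandas as pd

dataset = pd.read_csv('<DATASET_CSV_PATH>')

print(dataset.columns.tolist())       # expected: ['name', 'upload_file_id', 'INPUT']
print(len(dataset), "documents")
print(dataset.loc[0, 'INPUT'][:200])  # preview the start of the first document's joined chunks
```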

### Running an Evaluation

#### Bulk Model and Prompt Experimentation

Compare the results of many different prompts and models at once by listing each model and prompt pair in an experiments CSV:

```python
import pandas as pd

data = [
    {
        "MODEL": "<MODEL_NAME_1>",
        "INSTRUCTIONS": """<YOUR_QUERY_1>"""
    },
    {
        "MODEL": "<MODEL_NAME_2>",
        "INSTRUCTIONS": """<YOUR_QUERY_2>"""
    },
]

df = pd.DataFrame.from_records(data)

df.to_csv("<EXPERIMENTS_CSV_PATH>", index=False)
```
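
If every prompt should be tested against every model, rather than hand-picked pairs, the same CSV can be generated as a cross product; the model names and queries below are placeholders:

```python
from itertools import product

import pandas as pd

models = ["<MODEL_NAME_1>", "<MODEL_NAME_2>"]
prompts = ["""<YOUR_QUERY_1>""", """<YOUR_QUERY_2>"""]

# One experiment row per (model, prompt) combination
df = pd.DataFrame([{"MODEL": m, "INSTRUCTIONS": p} for m, p in product(models, prompts)])

df.to_csv("<EXPERIMENTS_CSV_PATH>", index=False)
```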
#### Execute on the Command Line

Execute [using `uv` to manage dependencies](https://docs.astral.sh/uv/guides/scripts/) without manually managing environments:
```sh
uv run evals.py --experiments path/to/<EXPERIMENTS_CSV> --dataset path/to/<DATASET_CSV> --results path/to/<RESULTS_CSV>
```
Execute the script directly, without `uv run`, by ensuring it is executable:
```sh
./evals.py --experiments path/to/<EXPERIMENTS_CSV> --dataset path/to/<DATASET_CSV> --results path/to/<RESULTS_CSV>
```
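
Direct execution assumes the script is executable (`chmod +x evals.py`) and starts with a `uv` shebang plus PEP 723 inline metadata, which is what `uv run` reads to resolve dependencies. A sketch of such a header follows; the dependency list is illustrative, not the actual contents of `evals.py`:

```python
#!/usr/bin/env -S uv run --script
# /// script
# requires-python = ">=3.11"
# dependencies = [
#     "pandas",
#     "lighteval",
# ]
# ///
```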
### Analyzing Test Results
```
107+
import pandas as pd
108+
import matplotlib.pyplot as plt
109+
import numpy as np
110+
111+
df = pd.read_csv("<RESULTS_CSV_PATH>")
112+
113+
# Define the metrics of interest
114+
extractiveness_cols = ['Extractiveness Coverage', 'Extractiveness Density', 'Extractiveness Compression']
115+
token_cols = ['Input Token Usage', 'Output Token Usage']
116+
other_metrics = ['Cost (USD)', 'Duration (s)']
117+
all_metrics = extractiveness_cols + token_cols + other_metrics
118+
119+
# Metric Histograms by Model
120+
plt.style.use('default')
121+
fig, axes = plt.subplots(len(all_metrics), 1, figsize=(12, 4 * len(all_metrics)))
122+
123+
models = df['MODEL'].unique()
124+
colors = plt.cm.Set3(np.linspace(0, 1, len(models)))
125+
126+
for i, metric in enumerate(all_metrics):
127+
ax = axes[i] if len(all_metrics) > 1 else axes
128+
129+
# Create histogram for each model
130+
for j, model in enumerate(models):
131+
model_data = df[df['MODEL'] == model][metric]
132+
ax.hist(model_data, alpha=0.7, label=model, bins=min(8, len(model_data)),
133+
color=colors[j], edgecolor='black', linewidth=0.5)
134+
135+
ax.set_title(f'{metric} Distribution by Model', fontsize=14, fontweight='bold')
136+
ax.set_xlabel(metric, fontsize=12)
137+
ax.set_ylabel('Frequency', fontsize=12)
138+
ax.legend(title='Model', bbox_to_anchor=(1.05, 1), loc='upper left')
139+
ax.grid(True, alpha=0.3)
140+
141+
plt.tight_layout()
142+
plt.show()
143+
144+
# Metric Statistics by Model
145+
for metric in all_metrics:
146+
print(f"\n{metric.upper()}:")
147+
desc_stats = df.groupby('MODEL')[metric].agg([
148+
'count', 'mean', 'std', 'min', 'median','max'
149+
])
150+
151+
print(desc_stats)
152+
153+
154+
# Calculate Efficiency Metrics By model
155+
df_analysis = df.copy()
156+
df_analysis['Total Token Usage'] = df_analysis['Input Token Usage'] + df_analysis['Output Token Usage']
157+
df_analysis['Cost per Token'] = df_analysis['Cost (USD)'] / df_analysis['Total Token Usage']
158+
df_analysis['Tokens per Second'] = df_analysis['Total Token Usage'] / df_analysis['Duration (s)']
159+
df_analysis['Cost per Second'] = df_analysis['Cost (USD)'] / df_analysis['Duration (s)']
160+
161+
efficiency_metrics = ['Cost per Token', 'Tokens per Second', 'Cost per Second']
162+
163+
for metric in efficiency_metrics:
164+
print(f"\n{metric.upper()}:")
165+
eff_stats = df_analysis.groupby('MODEL')[metric].agg([
166+
'count', 'mean', 'std', 'min', 'median', 'max'
167+
])
168+
169+
for col in ['mean', 'std', 'min', 'median', 'max']:
170+
eff_stats[col] = eff_stats[col].apply(lambda x: f"{x:.3g}")
171+
print(eff_stats)
172+
173+
174+
```

### Contributing

You're welcome to add LLM models to test in `server/api/services/llm_services`.
