# Evaluations

## `evals`: LLM evaluations to test and improve model outputs

### Evaluation Metrics

Natural Language Generation Performance:

[Extractiveness](https://huggingface.co/docs/lighteval/en/metric-list#automatic-metrics-for-generative-tasks):

* Extractiveness Coverage: the extent to which a summary is derivative of the source text
* Extractiveness Density: how well the summary's word sequence can be described as a series of extractions
* Extractiveness Compression: the word-count ratio between the article and the summary (a toy computation of all three follows)

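These follow the fragment-based definitions of Grusky et al., which lighteval implements. As a rough illustration only, here is a simplified sketch of the arithmetic; it is not the lighteval API, and the greedy fragment matcher is naive:

```python
def extractiveness(article: str, summary: str):
    """Toy version of the extractive fragment metrics (Grusky et al.)."""
    a, s = article.split(), summary.split()
    fragments, i = [], 0
    while i < len(s):
        # Greedily find the longest article span matching the summary at position i
        best = 0
        for j in range(len(a)):
            k = 0
            while i + k < len(s) and j + k < len(a) and s[i + k] == a[j + k]:
                k += 1
            best = max(best, k)
        if best > 0:
            fragments.append(best)
            i += best
        else:
            i += 1
    coverage = sum(fragments) / len(s)                # share of summary tokens copied from the article
    density = sum(f * f for f in fragments) / len(s)  # rewards long copied fragments
    compression = len(a) / len(s)                     # article-to-summary word ratio
    return coverage, density, compression

# "the cat sat" is copied verbatim -> coverage 0.75, density 2.25, compression 1.75
print(extractiveness("the cat sat on the mat today", "the cat sat quietly"))
```
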
API Performance:

* Token Usage (input/output)
* Estimated Cost in USD (a pricing sketch follows this list)
* Duration (in seconds)

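Cost is estimated from token counts and per-token pricing. A minimal sketch, with hypothetical rates standing in for whatever pricing the script is configured with:

```python
# Hypothetical USD rates per one million tokens; substitute your provider's real pricing.
PRICING = {"<MODEL_NAME>": {"input": 3.00, "output": 15.00}}

def estimate_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    rates = PRICING[model]
    return (input_tokens * rates["input"] + output_tokens * rates["output"]) / 1_000_000

print(estimate_cost_usd("<MODEL_NAME>", 12_000, 800))  # 0.048
```
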
### Test Data

Generate the dataset file by connecting to a database of research papers, either a local instance or a restored production backup.

Connect to the Postgres database of your local Balancer instance:

```python
from sqlalchemy import create_engine

engine = create_engine("postgresql+psycopg2://balancer:balancer@localhost:5433/balancer_dev")
```
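
To sanity-check the connection, run a trivial query (a minimal sketch; any query works the same way with `pd.read_sql`):

```python
import pandas as pd

# Should print a single row containing 1 if the connection is healthy
print(pd.read_sql("SELECT 1 AS ok", engine))
```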

To work with production data, restore a backup of the production Balancer database and connect to it locally:

```sh
# Add Postgres.app binaries to the PATH
echo 'export PATH="/Applications/Postgres.app/Contents/Versions/latest/bin:$PATH"' >> ~/.zshrc

# Create a fresh database and restore the backup into it
# (pg_restore expects a custom-format dump; for a plain-text SQL dump,
# use: psql -d <DB_NAME> -f <PATH_TO_BACKUP>.sql)
createdb <DB_NAME>
pg_restore -v -d <DB_NAME> <PATH_TO_BACKUP>.sql
```

Generate the dataset CSV file:

```python
from sqlalchemy import create_engine
import pandas as pd

engine = create_engine("postgresql://<USER>@localhost:5432/<DB_NAME>")

query = "SELECT * FROM api_embeddings;"
df = pd.read_sql(query, engine)

df['INPUT'] = df.apply(lambda row: f"ID: {row['chunk_number']} | CONTENT: {row['text']}", axis=1)

# Ensure the chunks are joined in order of chunk_number by sorting the DataFrame before grouping and joining
df = df.sort_values(by=['name', 'upload_file_id', 'chunk_number'])
df_grouped = df.groupby(['name', 'upload_file_id'])['INPUT'].apply(lambda chunks: "\n".join(chunks)).reset_index()

df_grouped.to_csv('<DATASET_CSV_PATH>', index=False)
```

Each row of the resulting CSV holds one document: its `name`, its `upload_file_id`, and its chunks concatenated in order under `INPUT`.

### Running an Evaluation

#### Bulk Model and Prompt Experimentation

Compare the results of many different prompts and models at once by listing one model/prompt pair per row of an experiments CSV:

```python
import pandas as pd

data = [
    {
        "MODEL": "<MODEL_NAME_1>",
        "INSTRUCTIONS": """<YOUR_QUERY_1>"""
    },
    {
        "MODEL": "<MODEL_NAME_2>",
        "INSTRUCTIONS": """<YOUR_QUERY_2>"""
    },
]

df = pd.DataFrame.from_records(data)

df.to_csv("<EXPERIMENTS_CSV_PATH>", index=False)
```

#### Execute on the Command Line

Execute [using `uv` to manage dependencies](https://docs.astral.sh/uv/guides/scripts/) without manually managing environments:

```sh
uv run evals.py --experiments path/to/<EXPERIMENTS_CSV> --dataset path/to/<DATASET_CSV> --results path/to/<RESULTS_CSV>
```

Alternatively, run the script directly without `uv run`, provided it is executable:

```sh
./evals.py --experiments path/to/<EXPERIMENTS_CSV> --dataset path/to/<DATASET_CSV> --results path/to/<RESULTS_CSV>
```
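
This assumes the executable bit is set and the script carries a `uv`-style shebang, as described in the uv scripts guide linked above:

```sh
chmod +x evals.py
head -n 1 evals.py  # expect a shebang like: #!/usr/bin/env -S uv run --script
```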

### Analyzing Test Results

Load the results CSV (its columns include `MODEL`, the three extractiveness metrics, input/output token usage, `Cost (USD)`, and `Duration (s)`) and compare models:

```python
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

df = pd.read_csv("<RESULTS_CSV_PATH>")

# Define the metrics of interest
extractiveness_cols = ['Extractiveness Coverage', 'Extractiveness Density', 'Extractiveness Compression']
token_cols = ['Input Token Usage', 'Output Token Usage']
other_metrics = ['Cost (USD)', 'Duration (s)']
all_metrics = extractiveness_cols + token_cols + other_metrics

# Metric histograms by model
plt.style.use('default')
fig, axes = plt.subplots(len(all_metrics), 1, figsize=(12, 4 * len(all_metrics)))

models = df['MODEL'].unique()
colors = plt.cm.Set3(np.linspace(0, 1, len(models)))

for i, metric in enumerate(all_metrics):
    ax = axes[i] if len(all_metrics) > 1 else axes

    # Create a histogram for each model
    for j, model in enumerate(models):
        model_data = df[df['MODEL'] == model][metric]
        ax.hist(model_data, alpha=0.7, label=model, bins=min(8, len(model_data)),
                color=colors[j], edgecolor='black', linewidth=0.5)

    ax.set_title(f'{metric} Distribution by Model', fontsize=14, fontweight='bold')
    ax.set_xlabel(metric, fontsize=12)
    ax.set_ylabel('Frequency', fontsize=12)
    ax.legend(title='Model', bbox_to_anchor=(1.05, 1), loc='upper left')
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Metric statistics by model
for metric in all_metrics:
    print(f"\n{metric.upper()}:")
    desc_stats = df.groupby('MODEL')[metric].agg([
        'count', 'mean', 'std', 'min', 'median', 'max'
    ])

    print(desc_stats)

# Calculate efficiency metrics by model
df_analysis = df.copy()
df_analysis['Total Token Usage'] = df_analysis['Input Token Usage'] + df_analysis['Output Token Usage']
df_analysis['Cost per Token'] = df_analysis['Cost (USD)'] / df_analysis['Total Token Usage']
df_analysis['Tokens per Second'] = df_analysis['Total Token Usage'] / df_analysis['Duration (s)']
df_analysis['Cost per Second'] = df_analysis['Cost (USD)'] / df_analysis['Duration (s)']

efficiency_metrics = ['Cost per Token', 'Tokens per Second', 'Cost per Second']

for metric in efficiency_metrics:
    print(f"\n{metric.upper()}:")
    eff_stats = df_analysis.groupby('MODEL')[metric].agg([
        'count', 'mean', 'std', 'min', 'median', 'max'
    ])

    # Round to three significant figures for readability
    for col in ['mean', 'std', 'min', 'median', 'max']:
        eff_stats[col] = eff_stats[col].apply(lambda x: f"{x:.3g}")
    print(eff_stats)
```

### Contributing

You're welcome to add LLM models to test in `server/api/services/llm_services`.