## `evals`: LLM evaluations to test and improve model outputs

### Evaluation Metrics

Natural Language Generation Performance:

[Extractiveness](https://huggingface.co/docs/lighteval/en/metric-list#automatic-metrics-for-generative-tasks):

* Extractiveness Coverage:
  - Percentage of words in the summary that are part of an extractive fragment with the article
* Extractiveness Density:
  - Average length of the extractive fragment to which each word in the summary belongs
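
Both metrics come from the greedy extractive-fragment matching of Grusky et al. (2018). A minimal sketch of how they can be computed, assuming simple whitespace tokenization (lighteval's own implementation is the reference and may differ in details):

```
def extractive_fragments(article_tokens, summary_tokens):
    """Greedily collect the longest shared token spans between summary and article."""
    fragments = []
    i = 0
    while i < len(summary_tokens):
        # Longest article match starting at summary position i
        best = 0
        for j in range(len(article_tokens)):
            k = 0
            while (i + k < len(summary_tokens)
                   and j + k < len(article_tokens)
                   and summary_tokens[i + k] == article_tokens[j + k]):
                k += 1
            best = max(best, k)
        if best > 0:
            fragments.append(summary_tokens[i:i + best])
            i += best
        else:
            i += 1
    return fragments

def coverage(article, summary):
    # Fraction of summary words inside an extractive fragment
    a, s = article.split(), summary.split()
    return sum(len(f) for f in extractive_fragments(a, s)) / len(s)

def density(article, summary):
    # Average length of the fragment each summary word belongs to
    a, s = article.split(), summary.split()
    return sum(len(f) ** 2 for f in extractive_fragments(a, s)) / len(s)
```
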
### Test Data

Generate the dataset file by connecting to a database of research papers:

Connect to the Postgres database of your local Balancer instance:
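
A minimal sketch using SQLAlchemy with the psycopg2 driver, assuming the local instance listens on port 5433 with the default balancer:balancer credentials; substitute your local database name:

```
from sqlalchemy import create_engine

# Assumes the default local credentials and port; adjust as needed
engine = create_engine("postgresql+psycopg2://balancer:balancer@localhost:5433/<DB_NAME>")
```
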
Connect to the Postgres database of the production Balancer instance using a SQL file:
```
# Add Postgres.app binaries to the PATH
echo 'export PATH="/Applications/Postgres.app/Contents/Versions/latest/bin:$PATH"' >> ~/.zshrc
source ~/.zshrc  # reload the shell config so the new PATH takes effect

createdb <DB_NAME>
pg_restore -v -d <DB_NAME> <PATH_TO_BACKUP>.sql
```
Generate the dataset CSV file:
```
from sqlalchemy import create_engine
import pandas as pd

engine = create_engine("postgresql://<USER>@localhost:5432/<DB_NAME>")

query = "SELECT * FROM api_embeddings;"
df = pd.read_sql(query, engine)

# Format each chunk as "ID: <chunk_number> | CONTENT: <text>"
df['INPUT'] = df.apply(lambda row: f"ID: {row['chunk_number']} | CONTENT: {row['text']}", axis=1)

# Ensure the chunks are joined in order of chunk_number by sorting the DataFrame before grouping and joining
df = df.sort_values(by=['name', 'upload_file_id', 'chunk_number'])
df_grouped = df.groupby(['name', 'upload_file_id'])['INPUT'].apply(lambda chunks: "\n".join(chunks)).reset_index()

df_grouped.to_csv('<DATASET_CSV_PATH>', index=False)
```
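
The resulting CSV has one row per document (keyed by name and upload_file_id), with all of its chunks concatenated in order in the INPUT column.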
### Running an Evaluation

#### Bulk Model and Prompt Experimentation

Compare the results of many different prompts and models at once:
```
import pandas as pd

data = [
    {
        "MODEL": "<MODEL_NAME_1>",
        "INSTRUCTIONS": """<YOUR_QUERY_1>"""
    },
    {
        "MODEL": "<MODEL_NAME_2>",
        "INSTRUCTIONS": """<YOUR_QUERY_2>"""
    },
]

df = pd.DataFrame.from_records(data)

df.to_csv("<EXPERIMENTS_CSV_PATH>", index=False)
```
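
Each row pairs a model with an instruction prompt; this file is what the `--experiments` flag below expects.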

#### Execute on the Command Line

Execute [using `uv` to manage dependencies](https://docs.astral.sh/uv/guides/scripts/) without manually managing environments:

```sh
uv run evals.py --experiments path/to/<EXPERIMENTS_CSV> --dataset path/to/<DATASET_CSV> --results path/to/<RESULTS_CSV>
```

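To inspect the results, a minimal sketch that loads the results CSV and plots a mean score per model; the MODEL and SCORE column names are assumptions, so adjust them to the actual results schema:

```
import pandas as pd
import matplotlib.pyplot as plt

# MODEL and SCORE are assumed column names; check the results CSV header
results = pd.read_csv("path/to/<RESULTS_CSV>")
results.groupby("MODEL")["SCORE"].mean().plot(kind="bar", ylabel="mean SCORE")
plt.tight_layout()
plt.show()
```
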
### Contributing

You're welcome to add LLM models to test in `server/api/services/llm_services`.