
Commit fe76d97

test: add integration tests for query_metrics
The integration tests check that all of the metrics we currently support are actually queryable after inference requests. They also test things like aggregation, label filtering, etc.

Signed-off-by: Charlie Doern <[email protected]>
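The label filtering and aggregation that the commit message mentions can be pictured with a minimal in-memory sketch. This is not the project's `query_metrics` implementation or API: `MetricPoint`, `query_points`, `aggregate_sum`, and the `model_id` label are hypothetical names chosen for illustration, and the sample values are the token counts from the recordings added in this commit.

```python
from dataclasses import dataclass, field


@dataclass
class MetricPoint:
    """Hypothetical in-memory metric sample; not the project's real type."""
    name: str
    value: float
    labels: dict = field(default_factory=dict)


def query_points(points, name, label_filters=None):
    """Return the points for `name` whose labels match every filter."""
    label_filters = label_filters or {}
    return [
        p for p in points
        if p.name == name
        and all(p.labels.get(k) == v for k, v in label_filters.items())
    ]


def aggregate_sum(points, group_by):
    """Sum metric values per distinct value of the `group_by` label."""
    totals = {}
    for p in points:
        key = p.labels.get(group_by, "")
        totals[key] = totals.get(key, 0.0) + p.value
    return totals


# Token counts copied from the recorded responses in this commit.
# The "model_id" label name is an assumption made for this sketch.
points = [
    MetricPoint("completion_tokens", 122, {"model_id": "llama3.2:3b"}),
    MetricPoint("completion_tokens", 48, {"model_id": "llama3.2:3b"}),
    MetricPoint("prompt_tokens", 30, {"model_id": "llama3.2:3b"}),
    MetricPoint("completion_tokens", 310, {"model_id": "llama3.2:3b-instruct-fp16"}),
]

filtered = query_points(points, "completion_tokens", {"model_id": "llama3.2:3b"})
per_model = aggregate_sum(query_points(points, "completion_tokens"), "model_id")
```

An integration test in this style would issue the inference requests first, then assert that each supported metric name returns at least one point and that filtered and aggregated results agree with the recorded token counts.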
1 parent 825a5f4 commit fe76d97

File tree

11 files changed

+530
-71
lines changed

4 KB
Binary file not shown.
Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
+{
+  "request": {
+    "method": "POST",
+    "url": "http://localhost:11434/api/generate",
+    "headers": {},
+    "body": {
+      "model": "llama3.2:3b-instruct-fp16",
+      "raw": true,
+      "prompt": "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nTest metrics generation 1<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
+      "options": {
+        "temperature": 0.0
+      },
+      "stream": false
+    },
+    "endpoint": "/api/generate",
+    "model": "llama3.2:3b-instruct-fp16"
+  },
+  "response": {
+    "body": {
+      "__type__": "ollama._types.GenerateResponse",
+      "__data__": {
+        "model": "llama3.2:3b-instruct-fp16",
+        "created_at": "2025-08-11T15:51:18.170868Z",
+        "done": true,
+        "done_reason": "stop",
+        "total_duration": 5240614083,
+        "load_duration": 9823416,
+        "prompt_eval_count": 21,
+        "prompt_eval_duration": 21000000,
+        "eval_count": 310,
+        "eval_duration": 5209000000,
+        "response": "This is the start of a test. I'll provide some sample data and you can try to generate metrics based on it.\n\n**Data:**\n\nLet's say we have a dataset of user interactions with an e-commerce website. The data includes:\n\n| User ID | Product Name | Purchase Date | Quantity | Price |\n| --- | --- | --- | --- | --- |\n| 1 | iPhone 13 | 2022-01-01 | 2 | 999.99 |\n| 1 | MacBook Air | 2022-01-05 | 1 | 1299.99 |\n| 2 | Samsung TV | 2022-01-10 | 3 | 899.99 |\n| 3 | iPhone 13 | 2022-01-15 | 1 | 999.99 |\n| 4 | MacBook Pro | 2022-01-20 | 2 | 1799.99 |\n\n**Task:**\n\nYour task is to generate the following metrics based on this data:\n\n1. Average order value (AOV)\n2. Conversion rate\n3. Average revenue per user (ARPU)\n4. Customer lifetime value (CLV)\n\nPlease provide your answers in a format like this:\n\n| Metric | Value |\n| --- | --- |\n| AOV | 1234.56 |\n| Conversion Rate | 0.25 |\n| ARPU | 1000.00 |\n| CLV | 5000.00 |\n\nGo ahead and generate the metrics!",
+        "thinking": null,
+        "context": null
+      }
+    },
+    "is_streaming": false
+  }
+}
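As a rough illustration of how token metrics could be derived from a recording like the one above, the sketch below reads the timing fields of the Ollama generate response. Ollama reports durations in nanoseconds. The helper names `tokens_per_second` and `token_usage` are hypothetical, and mapping `prompt_eval_count` and `eval_count` to prompt and completion tokens is an assumption of this sketch rather than something the diff states.

```python
import json

# Trimmed copy of the recorded response body above; only counts and timings are kept.
recording = json.loads("""
{
  "response": {
    "body": {
      "__type__": "ollama._types.GenerateResponse",
      "__data__": {
        "total_duration": 5240614083,
        "load_duration": 9823416,
        "prompt_eval_count": 21,
        "prompt_eval_duration": 21000000,
        "eval_count": 310,
        "eval_duration": 5209000000
      }
    },
    "is_streaming": false
  }
}
""")

NANOSECONDS_PER_SECOND = 1_000_000_000


def tokens_per_second(rec):
    """Generation throughput: generated tokens divided by generation time in seconds."""
    data = rec["response"]["body"]["__data__"]
    return data["eval_count"] / (data["eval_duration"] / NANOSECONDS_PER_SECOND)


def token_usage(rec):
    """Assumed mapping from Ollama counters to prompt/completion/total token counts."""
    data = rec["response"]["body"]["__data__"]
    prompt = data["prompt_eval_count"]
    completion = data["eval_count"]
    return {
        "prompt_tokens": prompt,
        "completion_tokens": completion,
        "total_tokens": prompt + completion,
    }


usage = token_usage(recording)
throughput = tokens_per_second(recording)
```

Because the recordings are replayed rather than regenerated, values derived this way stay fixed across test runs, which is what makes exact assertions on queried metrics practical.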

tests/integration/recordings/responses/4a3a4447b16b.json

Lines changed: 21 additions & 69 deletions
@@ -14,7 +14,7 @@
         "models": [
           {
             "model": "nomic-embed-text:latest",
-            "modified_at": "2025-08-05T14:04:07.946926-07:00",
+            "modified_at": "2025-08-11T11:56:03.264283-04:00",
             "digest": "0a109f422b47e3a30ba2b10eca18548e944e8a23073ee3f3e947efcf3c45e59f",
             "size": 274302450,
             "details": {
@@ -28,57 +28,9 @@
               "quantization_level": "F16"
             }
           },
-          {
-            "model": "llama3.2-vision:11b",
-            "modified_at": "2025-07-30T18:45:02.517873-07:00",
-            "digest": "6f2f9757ae97e8a3f8ea33d6adb2b11d93d9a35bef277cd2c0b1b5af8e8d0b1e",
-            "size": 7816589186,
-            "details": {
-              "parent_model": "",
-              "format": "gguf",
-              "family": "mllama",
-              "families": [
-                "mllama"
-              ],
-              "parameter_size": "10.7B",
-              "quantization_level": "Q4_K_M"
-            }
-          },
-          {
-            "model": "llama3.2-vision:latest",
-            "modified_at": "2025-07-29T20:18:47.920468-07:00",
-            "digest": "6f2f9757ae97e8a3f8ea33d6adb2b11d93d9a35bef277cd2c0b1b5af8e8d0b1e",
-            "size": 7816589186,
-            "details": {
-              "parent_model": "",
-              "format": "gguf",
-              "family": "mllama",
-              "families": [
-                "mllama"
-              ],
-              "parameter_size": "10.7B",
-              "quantization_level": "Q4_K_M"
-            }
-          },
-          {
-            "model": "llama-guard3:1b",
-            "modified_at": "2025-07-25T14:39:44.978630-07:00",
-            "digest": "494147e06bf99e10dbe67b63a07ac81c162f18ef3341aa3390007ac828571b3b",
-            "size": 1600181919,
-            "details": {
-              "parent_model": "",
-              "format": "gguf",
-              "family": "llama",
-              "families": [
-                "llama"
-              ],
-              "parameter_size": "1.5B",
-              "quantization_level": "Q8_0"
-            }
-          },
           {
             "model": "all-minilm:l6-v2",
-            "modified_at": "2025-07-24T15:15:11.129290-07:00",
+            "modified_at": "2025-07-30T20:51:26.331882-04:00",
             "digest": "1b226e2802dbb772b5fc32a58f103ca1804ef7501331012de126ab22f67475ef",
             "size": 45960996,
             "details": {
@@ -92,25 +44,9 @@
               "quantization_level": "F16"
             }
           },
-          {
-            "model": "llama3.2:1b",
-            "modified_at": "2025-07-17T22:02:24.953208-07:00",
-            "digest": "baf6a787fdffd633537aa2eb51cfd54cb93ff08e28040095462bb63daf552878",
-            "size": 1321098329,
-            "details": {
-              "parent_model": "",
-              "format": "gguf",
-              "family": "llama",
-              "families": [
-                "llama"
-              ],
-              "parameter_size": "1.2B",
-              "quantization_level": "Q8_0"
-            }
-          },
           {
             "model": "all-minilm:latest",
-            "modified_at": "2025-06-03T16:50:10.946583-07:00",
+            "modified_at": "2025-05-30T13:31:57.695124-04:00",
             "digest": "1b226e2802dbb772b5fc32a58f103ca1804ef7501331012de126ab22f67475ef",
             "size": 45960996,
             "details": {
@@ -126,7 +62,7 @@
           },
           {
             "model": "llama3.2:3b",
-            "modified_at": "2025-05-01T11:15:23.797447-07:00",
+            "modified_at": "2025-05-13T18:29:07.722534-04:00",
             "digest": "a80c4f17acd55265feec403c7aef86be0c25983ab279d83f3bcd3abbcb5b8b72",
             "size": 2019393189,
             "details": {
@@ -142,7 +78,7 @@
           },
           {
             "model": "llama3.2:3b-instruct-fp16",
-            "modified_at": "2025-04-30T15:33:48.939665-07:00",
+            "modified_at": "2025-02-04T11:40:12.970780-05:00",
             "digest": "195a8c01d91ec3cb1e0aad4624a51f2602c51fa7d96110f8ab5a20c84081804d",
             "size": 6433703586,
             "details": {
@@ -155,6 +91,22 @@
               "parameter_size": "3.2B",
               "quantization_level": "F16"
             }
+          },
+          {
+            "model": "llama3.2:latest",
+            "modified_at": "2025-02-04T11:16:08.821668-05:00",
+            "digest": "a80c4f17acd55265feec403c7aef86be0c25983ab279d83f3bcd3abbcb5b8b72",
+            "size": 2019393189,
+            "details": {
+              "parent_model": "",
+              "format": "gguf",
+              "family": "llama",
+              "families": [
+                "llama"
+              ],
+              "parameter_size": "3.2B",
+              "quantization_level": "Q4_K_M"
+            }
           }
         ]
       }
Lines changed: 56 additions & 0 deletions
@@ -0,0 +1,56 @@
+{
+  "request": {
+    "method": "POST",
+    "url": "http://localhost:11434/v1/v1/chat/completions",
+    "headers": {},
+    "body": {
+      "model": "llama3.2:3b",
+      "messages": [
+        {
+          "role": "user",
+          "content": "OpenAI test 0"
+        }
+      ],
+      "stream": false
+    },
+    "endpoint": "/v1/chat/completions",
+    "model": "llama3.2:3b"
+  },
+  "response": {
+    "body": {
+      "__type__": "openai.types.chat.chat_completion.ChatCompletion",
+      "__data__": {
+        "id": "chatcmpl-159",
+        "choices": [
+          {
+            "finish_reason": "stop",
+            "index": 0,
+            "logprobs": null,
+            "message": {
+              "content": "It seems like you're trying to access an internal system, but I don't have any specific information about OpenAI's internal testing environment or how to interact with it. However, I can tell you that OpenAI is a company that provides various AI-related products and services, including the popular language model called Llama.\n\nIf you're looking for information on how to use OpenAI's products or APIs, I'd be happy to help with that! Alternatively, if you have any specific questions about AI or machine learning, feel free to ask, and I'll do my best to provide a helpful response.",
+              "refusal": null,
+              "role": "assistant",
+              "annotations": null,
+              "audio": null,
+              "function_call": null,
+              "tool_calls": null
+            }
+          }
+        ],
+        "created": 1754927774,
+        "model": "llama3.2:3b",
+        "object": "chat.completion",
+        "service_tier": null,
+        "system_fingerprint": "fp_ollama",
+        "usage": {
+          "completion_tokens": 122,
+          "prompt_tokens": 30,
+          "total_tokens": 152,
+          "completion_tokens_details": null,
+          "prompt_tokens_details": null
+        }
+      }
+    },
+    "is_streaming": false
+  }
+}
Lines changed: 56 additions & 0 deletions
@@ -0,0 +1,56 @@
+{
+  "request": {
+    "method": "POST",
+    "url": "http://localhost:11434/v1/v1/chat/completions",
+    "headers": {},
+    "body": {
+      "model": "llama3.2:3b",
+      "messages": [
+        {
+          "role": "user",
+          "content": "OpenAI test 1"
+        }
+      ],
+      "stream": false
+    },
+    "endpoint": "/v1/chat/completions",
+    "model": "llama3.2:3b"
+  },
+  "response": {
+    "body": {
+      "__type__": "openai.types.chat.chat_completion.ChatCompletion",
+      "__data__": {
+        "id": "chatcmpl-831",
+        "choices": [
+          {
+            "finish_reason": "stop",
+            "index": 0,
+            "logprobs": null,
+            "message": {
+              "content": "I'm ready to help. What would you like to do? Would you like to:\n\n1. Ask a question\n2. Engage in a conversation\n3. Test my language understanding\n4. Something else (please specify)",
+              "refusal": null,
+              "role": "assistant",
+              "annotations": null,
+              "audio": null,
+              "function_call": null,
+              "tool_calls": null
+            }
+          }
+        ],
+        "created": 1754927774,
+        "model": "llama3.2:3b",
+        "object": "chat.completion",
+        "service_tier": null,
+        "system_fingerprint": "fp_ollama",
+        "usage": {
+          "completion_tokens": 48,
+          "prompt_tokens": 30,
+          "total_tokens": 78,
+          "completion_tokens_details": null,
+          "prompt_tokens_details": null
+        }
+      }
+    },
+    "is_streaming": false
+  }
+}
Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
+{
+  "request": {
+    "method": "POST",
+    "url": "http://localhost:11434/api/generate",
+    "headers": {},
+    "body": {
+      "model": "llama3.2:3b-instruct-fp16",
+      "raw": true,
+      "prompt": "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nTest metrics generation 0<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
+      "options": {
+        "temperature": 0.0
+      },
+      "stream": false
+    },
+    "endpoint": "/api/generate",
+    "model": "llama3.2:3b-instruct-fp16"
+  },
+  "response": {
+    "body": {
+      "__type__": "ollama._types.GenerateResponse",
+      "__data__": {
+        "model": "llama3.2:3b-instruct-fp16",
+        "created_at": "2025-08-11T15:51:12.918723Z",
+        "done": true,
+        "done_reason": "stop",
+        "total_duration": 8868987792,
+        "load_duration": 2793275292,
+        "prompt_eval_count": 21,
+        "prompt_eval_duration": 250000000,
+        "eval_count": 344,
+        "eval_duration": 5823000000,
+        "response": "Here are some common test metrics used to evaluate the performance of a system:\n\n1. **Accuracy**: The proportion of correct predictions or classifications out of total predictions made.\n2. **Precision**: The ratio of true positives (correctly predicted instances) to the sum of true positives and false positives (incorrectly predicted instances).\n3. **Recall**: The ratio of true positives to the sum of true positives and false negatives (missed instances).\n4. **F1-score**: The harmonic mean of precision and recall, providing a balanced measure of both.\n5. **Mean Squared Error (MSE)**: The average squared difference between predicted and actual values.\n6. **Mean Absolute Error (MAE)**: The average absolute difference between predicted and actual values.\n7. **Root Mean Squared Percentage Error (RMSPE)**: The square root of the mean of the squared percentage differences between predicted and actual values.\n8. **Coefficient of Determination (R-squared, R2)**: Measures how well a model fits the data, with higher values indicating better fit.\n9. **Mean Absolute Percentage Error (MAPE)**: The average absolute percentage difference between predicted and actual values.\n10. **Normalized Mean Squared Error (NMSE)**: Similar to MSE, but normalized by the mean of the actual values.\n\nThese metrics can be used for various types of data, including:\n\n* Regression problems (e.g., predicting continuous values)\n* Classification problems (e.g., predicting categorical labels)\n* Time series forecasting\n* Clustering and dimensionality reduction\n\nWhen choosing a metric, consider the specific problem you're trying to solve, the type of data, and the desired level of precision.",
+        "thinking": null,
+        "context": null
+      }
+    },
+    "is_streaming": false
+  }
+}
Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
+{
+  "request": {
+    "method": "POST",
+    "url": "http://localhost:11434/api/generate",
+    "headers": {},
+    "body": {
+      "model": "llama3.2:3b",
+      "raw": true,
+      "prompt": "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nTest metrics generation 0<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
+      "options": {
+        "temperature": 0.0
+      },
+      "stream": false
+    },
+    "endpoint": "/api/generate",
+    "model": "llama3.2:3b"
+  },
+  "response": {
+    "body": {
+      "__type__": "ollama._types.GenerateResponse",
+      "__data__": {
+        "model": "llama3.2:3b",
+        "created_at": "2025-08-11T15:56:06.703788Z",
+        "done": true,
+        "done_reason": "stop",
+        "total_duration": 2722294000,
+        "load_duration": 9736083,
+        "prompt_eval_count": 21,
+        "prompt_eval_duration": 113000000,
+        "eval_count": 324,
+        "eval_duration": 2598000000,
+        "response": "Here are some test metrics that can be used to evaluate the performance of a system:\n\n1. **Accuracy**: The proportion of correct predictions made by the model.\n2. **Precision**: The ratio of true positives (correctly predicted instances) to total positive predictions.\n3. **Recall**: The ratio of true positives to the sum of true positives and false negatives (missed instances).\n4. **F1-score**: The harmonic mean of precision and recall, providing a balanced measure of both.\n5. **Mean Squared Error (MSE)**: The average squared difference between predicted and actual values.\n6. **Mean Absolute Error (MAE)**: The average absolute difference between predicted and actual values.\n7. **Root Mean Squared Percentage Error (RMSPE)**: A variation of MSE that expresses the error as a percentage.\n8. **Coefficient of Determination (R-squared, R2)**: Measures how well the model explains the variance in the data.\n9. **Mean Absolute Percentage Error (MAPE)**: The average absolute percentage difference between predicted and actual values.\n10. **Mean Squared Logarithmic Error (MSLE)**: A variation of MSE that is more suitable for skewed distributions.\n\nThese metrics can be used to evaluate different aspects of a system's performance, such as:\n\n* Classification models: accuracy, precision, recall, F1-score\n* Regression models: MSE, MAE, RMSPE, R2, MSLE\n* Time series forecasting: MAPE, RMSPE\n\nNote that the choice of metric depends on the specific problem and data.",
+        "thinking": null,
+        "context": null
+      }
+    },
+    "is_streaming": false
+  }
+}
