Skip to content

Commit ddefd15

Browse files
committed
test: add integration tests for query_metrics
the integration tests check if all of the metrics we currently support actual are queryable after inference requests this also tests thinks like aggregation, label filtering, etc Signed-off-by: Charlie Doern <[email protected]>
1 parent fb553f3 commit ddefd15

12 files changed

+684
-0
lines changed
0 Bytes
Binary file not shown.
Lines changed: 39 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,39 @@
1+
{
2+
"request": {
3+
"method": "POST",
4+
"url": "http://localhost:11434/api/generate",
5+
"headers": {},
6+
"body": {
7+
"model": "llama3.2:3b-instruct-fp16",
8+
"raw": true,
9+
"prompt": "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nTest metrics generation 1<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
10+
"options": {
11+
"temperature": 0.0
12+
},
13+
"stream": false
14+
},
15+
"endpoint": "/api/generate",
16+
"model": "llama3.2:3b-instruct-fp16"
17+
},
18+
"response": {
19+
"body": {
20+
"__type__": "ollama._types.GenerateResponse",
21+
"__data__": {
22+
"model": "llama3.2:3b-instruct-fp16",
23+
"created_at": "2025-08-11T15:51:18.170868Z",
24+
"done": true,
25+
"done_reason": "stop",
26+
"total_duration": 5240614083,
27+
"load_duration": 9823416,
28+
"prompt_eval_count": 21,
29+
"prompt_eval_duration": 21000000,
30+
"eval_count": 310,
31+
"eval_duration": 5209000000,
32+
"response": "This is the start of a test. I'll provide some sample data and you can try to generate metrics based on it.\n\n**Data:**\n\nLet's say we have a dataset of user interactions with an e-commerce website. The data includes:\n\n| User ID | Product Name | Purchase Date | Quantity | Price |\n| --- | --- | --- | --- | --- |\n| 1 | iPhone 13 | 2022-01-01 | 2 | 999.99 |\n| 1 | MacBook Air | 2022-01-05 | 1 | 1299.99 |\n| 2 | Samsung TV | 2022-01-10 | 3 | 899.99 |\n| 3 | iPhone 13 | 2022-01-15 | 1 | 999.99 |\n| 4 | MacBook Pro | 2022-01-20 | 2 | 1799.99 |\n\n**Task:**\n\nYour task is to generate the following metrics based on this data:\n\n1. Average order value (AOV)\n2. Conversion rate\n3. Average revenue per user (ARPU)\n4. Customer lifetime value (CLV)\n\nPlease provide your answers in a format like this:\n\n| Metric | Value |\n| --- | --- |\n| AOV | 1234.56 |\n| Conversion Rate | 0.25 |\n| ARPU | 1000.00 |\n| CLV | 5000.00 |\n\nGo ahead and generate the metrics!",
33+
"thinking": null,
34+
"context": null
35+
}
36+
},
37+
"is_streaming": false
38+
}
39+
}
Lines changed: 56 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,56 @@
1+
{
2+
"request": {
3+
"method": "POST",
4+
"url": "http://localhost:11434/v1/v1/chat/completions",
5+
"headers": {},
6+
"body": {
7+
"model": "llama3.2:3b",
8+
"messages": [
9+
{
10+
"role": "user",
11+
"content": "OpenAI test 0"
12+
}
13+
],
14+
"stream": false
15+
},
16+
"endpoint": "/v1/chat/completions",
17+
"model": "llama3.2:3b"
18+
},
19+
"response": {
20+
"body": {
21+
"__type__": "openai.types.chat.chat_completion.ChatCompletion",
22+
"__data__": {
23+
"id": "chatcmpl-843",
24+
"choices": [
25+
{
26+
"finish_reason": "stop",
27+
"index": 0,
28+
"logprobs": null,
29+
"message": {
30+
"content": "I don't have any information about an \"OpenAI test 0\". It's possible that you may be referring to a specific experiment or task being performed by OpenAI, but without more context, I can only speculate.\n\nHowever, I can tell you that OpenAI is a research organization that has been involved in various projects and tests related to artificial intelligence. If you could provide more context or clarify what you're referring to, I may be able to help further.\n\nIf you're looking for general information about OpenAI, I can try to provide some background on the organization:\n\nOpenAI is a non-profit research organization that was founded in 2015 with the goal of developing and applying advanced artificial intelligence to benefit humanity. The organization has made significant contributions to the field of AI, including the development of the popular language model, ChatGPT.\n\nIf you could provide more context or clarify what you're looking for, I'll do my best to assist you.",
31+
"refusal": null,
32+
"role": "assistant",
33+
"annotations": null,
34+
"audio": null,
35+
"function_call": null,
36+
"tool_calls": null
37+
}
38+
}
39+
],
40+
"created": 1755891518,
41+
"model": "llama3.2:3b",
42+
"object": "chat.completion",
43+
"service_tier": null,
44+
"system_fingerprint": "fp_ollama",
45+
"usage": {
46+
"completion_tokens": 194,
47+
"prompt_tokens": 30,
48+
"total_tokens": 224,
49+
"completion_tokens_details": null,
50+
"prompt_tokens_details": null
51+
}
52+
}
53+
},
54+
"is_streaming": false
55+
}
56+
}
Lines changed: 56 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,56 @@
1+
{
2+
"request": {
3+
"method": "POST",
4+
"url": "http://localhost:11434/v1/v1/chat/completions",
5+
"headers": {},
6+
"body": {
7+
"model": "llama3.2:3b",
8+
"messages": [
9+
{
10+
"role": "user",
11+
"content": "OpenAI test 1"
12+
}
13+
],
14+
"stream": false
15+
},
16+
"endpoint": "/v1/chat/completions",
17+
"model": "llama3.2:3b"
18+
},
19+
"response": {
20+
"body": {
21+
"__type__": "openai.types.chat.chat_completion.ChatCompletion",
22+
"__data__": {
23+
"id": "chatcmpl-726",
24+
"choices": [
25+
{
26+
"finish_reason": "stop",
27+
"index": 0,
28+
"logprobs": null,
29+
"message": {
30+
"content": "I'm ready to help with the test. What language would you like to use? Would you like to have a conversation, ask questions, or take a specific type of task?",
31+
"refusal": null,
32+
"role": "assistant",
33+
"annotations": null,
34+
"audio": null,
35+
"function_call": null,
36+
"tool_calls": null
37+
}
38+
}
39+
],
40+
"created": 1755891519,
41+
"model": "llama3.2:3b",
42+
"object": "chat.completion",
43+
"service_tier": null,
44+
"system_fingerprint": "fp_ollama",
45+
"usage": {
46+
"completion_tokens": 37,
47+
"prompt_tokens": 30,
48+
"total_tokens": 67,
49+
"completion_tokens_details": null,
50+
"prompt_tokens_details": null
51+
}
52+
}
53+
},
54+
"is_streaming": false
55+
}
56+
}
Lines changed: 56 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,56 @@
1+
{
2+
"request": {
3+
"method": "POST",
4+
"url": "http://localhost:11434/v1/v1/chat/completions",
5+
"headers": {},
6+
"body": {
7+
"model": "llama3.2:3b",
8+
"messages": [
9+
{
10+
"role": "user",
11+
"content": "OpenAI test 4"
12+
}
13+
],
14+
"stream": false
15+
},
16+
"endpoint": "/v1/chat/completions",
17+
"model": "llama3.2:3b"
18+
},
19+
"response": {
20+
"body": {
21+
"__type__": "openai.types.chat.chat_completion.ChatCompletion",
22+
"__data__": {
23+
"id": "chatcmpl-581",
24+
"choices": [
25+
{
26+
"finish_reason": "stop",
27+
"index": 0,
28+
"logprobs": null,
29+
"message": {
30+
"content": "I'm ready to help. What would you like to test? We could try a variety of things, such as:\n\n1. Conversational dialogue\n2. Language understanding\n3. Common sense reasoning\n4. Joke or pun generation\n5. Trivia or knowledge-based questions\n6. Creative writing or storytelling\n7. Summarization or paraphrasing\n\nLet me know which area you'd like to test, or suggest something else that's on your mind!",
31+
"refusal": null,
32+
"role": "assistant",
33+
"annotations": null,
34+
"audio": null,
35+
"function_call": null,
36+
"tool_calls": null
37+
}
38+
}
39+
],
40+
"created": 1755891527,
41+
"model": "llama3.2:3b",
42+
"object": "chat.completion",
43+
"service_tier": null,
44+
"system_fingerprint": "fp_ollama",
45+
"usage": {
46+
"completion_tokens": 96,
47+
"prompt_tokens": 30,
48+
"total_tokens": 126,
49+
"completion_tokens_details": null,
50+
"prompt_tokens_details": null
51+
}
52+
}
53+
},
54+
"is_streaming": false
55+
}
56+
}
Lines changed: 39 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,39 @@
1+
{
2+
"request": {
3+
"method": "POST",
4+
"url": "http://localhost:11434/api/generate",
5+
"headers": {},
6+
"body": {
7+
"model": "llama3.2:3b-instruct-fp16",
8+
"raw": true,
9+
"prompt": "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nTest metrics generation 0<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
10+
"options": {
11+
"temperature": 0.0
12+
},
13+
"stream": false
14+
},
15+
"endpoint": "/api/generate",
16+
"model": "llama3.2:3b-instruct-fp16"
17+
},
18+
"response": {
19+
"body": {
20+
"__type__": "ollama._types.GenerateResponse",
21+
"__data__": {
22+
"model": "llama3.2:3b-instruct-fp16",
23+
"created_at": "2025-08-11T15:51:12.918723Z",
24+
"done": true,
25+
"done_reason": "stop",
26+
"total_duration": 8868987792,
27+
"load_duration": 2793275292,
28+
"prompt_eval_count": 21,
29+
"prompt_eval_duration": 250000000,
30+
"eval_count": 344,
31+
"eval_duration": 5823000000,
32+
"response": "Here are some common test metrics used to evaluate the performance of a system:\n\n1. **Accuracy**: The proportion of correct predictions or classifications out of total predictions made.\n2. **Precision**: The ratio of true positives (correctly predicted instances) to the sum of true positives and false positives (incorrectly predicted instances).\n3. **Recall**: The ratio of true positives to the sum of true positives and false negatives (missed instances).\n4. **F1-score**: The harmonic mean of precision and recall, providing a balanced measure of both.\n5. **Mean Squared Error (MSE)**: The average squared difference between predicted and actual values.\n6. **Mean Absolute Error (MAE)**: The average absolute difference between predicted and actual values.\n7. **Root Mean Squared Percentage Error (RMSPE)**: The square root of the mean of the squared percentage differences between predicted and actual values.\n8. **Coefficient of Determination (R-squared, R2)**: Measures how well a model fits the data, with higher values indicating better fit.\n9. **Mean Absolute Percentage Error (MAPE)**: The average absolute percentage difference between predicted and actual values.\n10. **Normalized Mean Squared Error (NMSE)**: Similar to MSE, but normalized by the mean of the actual values.\n\nThese metrics can be used for various types of data, including:\n\n* Regression problems (e.g., predicting continuous values)\n* Classification problems (e.g., predicting categorical labels)\n* Time series forecasting\n* Clustering and dimensionality reduction\n\nWhen choosing a metric, consider the specific problem you're trying to solve, the type of data, and the desired level of precision.",
33+
"thinking": null,
34+
"context": null
35+
}
36+
},
37+
"is_streaming": false
38+
}
39+
}
Lines changed: 56 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,56 @@
1+
{
2+
"request": {
3+
"method": "POST",
4+
"url": "http://localhost:11434/v1/v1/chat/completions",
5+
"headers": {},
6+
"body": {
7+
"model": "llama3.2:3b",
8+
"messages": [
9+
{
10+
"role": "user",
11+
"content": "OpenAI test 3"
12+
}
13+
],
14+
"stream": false
15+
},
16+
"endpoint": "/v1/chat/completions",
17+
"model": "llama3.2:3b"
18+
},
19+
"response": {
20+
"body": {
21+
"__type__": "openai.types.chat.chat_completion.ChatCompletion",
22+
"__data__": {
23+
"id": "chatcmpl-48",
24+
"choices": [
25+
{
26+
"finish_reason": "stop",
27+
"index": 0,
28+
"logprobs": null,
29+
"message": {
30+
"content": "I'm happy to help, but it seems you want me to engage in a basic conversation as OpenAI's new chat model, right? I can do that!\n\nHere's my response:\n\nHello! How are you today? Is there something specific on your mind that you'd like to talk about or any particular topic you'd like to explore together?\n\nWhat is it that you're curious about?",
31+
"refusal": null,
32+
"role": "assistant",
33+
"annotations": null,
34+
"audio": null,
35+
"function_call": null,
36+
"tool_calls": null
37+
}
38+
}
39+
],
40+
"created": 1755891524,
41+
"model": "llama3.2:3b",
42+
"object": "chat.completion",
43+
"service_tier": null,
44+
"system_fingerprint": "fp_ollama",
45+
"usage": {
46+
"completion_tokens": 80,
47+
"prompt_tokens": 30,
48+
"total_tokens": 110,
49+
"completion_tokens_details": null,
50+
"prompt_tokens_details": null
51+
}
52+
}
53+
},
54+
"is_streaming": false
55+
}
56+
}
Lines changed: 39 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,39 @@
1+
{
2+
"request": {
3+
"method": "POST",
4+
"url": "http://localhost:11434/api/generate",
5+
"headers": {},
6+
"body": {
7+
"model": "llama3.2:3b",
8+
"raw": true,
9+
"prompt": "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nTest metrics generation 0<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
10+
"options": {
11+
"temperature": 0.0
12+
},
13+
"stream": false
14+
},
15+
"endpoint": "/api/generate",
16+
"model": "llama3.2:3b"
17+
},
18+
"response": {
19+
"body": {
20+
"__type__": "ollama._types.GenerateResponse",
21+
"__data__": {
22+
"model": "llama3.2:3b",
23+
"created_at": "2025-08-11T15:56:06.703788Z",
24+
"done": true,
25+
"done_reason": "stop",
26+
"total_duration": 2722294000,
27+
"load_duration": 9736083,
28+
"prompt_eval_count": 21,
29+
"prompt_eval_duration": 113000000,
30+
"eval_count": 324,
31+
"eval_duration": 2598000000,
32+
"response": "Here are some test metrics that can be used to evaluate the performance of a system:\n\n1. **Accuracy**: The proportion of correct predictions made by the model.\n2. **Precision**: The ratio of true positives (correctly predicted instances) to total positive predictions.\n3. **Recall**: The ratio of true positives to the sum of true positives and false negatives (missed instances).\n4. **F1-score**: The harmonic mean of precision and recall, providing a balanced measure of both.\n5. **Mean Squared Error (MSE)**: The average squared difference between predicted and actual values.\n6. **Mean Absolute Error (MAE)**: The average absolute difference between predicted and actual values.\n7. **Root Mean Squared Percentage Error (RMSPE)**: A variation of MSE that expresses the error as a percentage.\n8. **Coefficient of Determination (R-squared, R2)**: Measures how well the model explains the variance in the data.\n9. **Mean Absolute Percentage Error (MAPE)**: The average absolute percentage difference between predicted and actual values.\n10. **Mean Squared Logarithmic Error (MSLE)**: A variation of MSE that is more suitable for skewed distributions.\n\nThese metrics can be used to evaluate different aspects of a system's performance, such as:\n\n* Classification models: accuracy, precision, recall, F1-score\n* Regression models: MSE, MAE, RMSPE, R2, MSLE\n* Time series forecasting: MAPE, RMSPE\n\nNote that the choice of metric depends on the specific problem and data.",
33+
"thinking": null,
34+
"context": null
35+
}
36+
},
37+
"is_streaming": false
38+
}
39+
}

0 commit comments

Comments
 (0)