feat(benchmarking): add API latency comparison experiment pipeline #6088
mfleader wants to merge 1 commit into
    str(cfg["stack_port"]),
    "--factory",
]
log_file = open(log_path, "a")
log_file = open(log_path, 'a') is never closed. Each experiment run leaks a file descriptor. Could you wrap this in a context manager or add explicit cleanup in the finally path?
Wrapped in a context manager. _ogx_server_executor is now a @contextmanager that closes the log file on exit.
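For readers following the thread, here is a minimal sketch of the pattern described in this reply. It is not the code from the PR: `server_process` is a hypothetical name, and plain `subprocess.Popen` stands in for the mirakuru executor. The point is only that the `with open(...)` block guarantees the log file is closed on every exit path.

```python
import subprocess
from contextlib import contextmanager


@contextmanager
def server_process(cmd, log_path):
    """Start cmd with output appended to log_path; always stop it and close the log.

    Hypothetical helper for illustration, not the PR's _ogx_server_executor.
    """
    with open(log_path, "a") as log_file:  # closed on normal exit and on error
        proc = subprocess.Popen(cmd, stdout=log_file, stderr=subprocess.STDOUT)
        try:
            yield proc
        finally:
            proc.terminate()  # harmless if the process has already been reaped
            try:
                proc.wait(timeout=5)
            except subprocess.TimeoutExpired:
                proc.kill()
                proc.wait()
```

A caller would write `with server_process(cmd, log_path) as proc: ...` and never touch the file handle directly.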
affinity = self._cpu_affinity

def pinned_preexec():
    os.setsid()
Interesting race window here. The parent checks CPU affinity right after fork(), but the child's preexec_fn might not have run yet. A tiny sleep or moving verification into the child would make this airtight. Also noticed the _run_locust preexec_fn skips os.setsid().
The race should be safe since mirakuru waits for the health check before returning. Locust skips setsid() intentionally so Ctrl+C propagates from the parent.
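As an illustration of the reviewer's second option (verifying inside the child rather than from the parent), here is a rough sketch. It is Linux-only because it relies on `os.sched_setaffinity`, and `affinity_seen_by_child` is a hypothetical name, not something in this PR. Because `preexec_fn` runs in the child before `exec`, the program that starts afterwards already sees the pinned CPU set, so there is no window in which the parent could observe a stale value.

```python
import json
import os
import subprocess
import sys


def affinity_seen_by_child(cpus):
    """Pin a child process to cpus and return the affinity the child itself reports."""

    def pinned_preexec():
        # Runs in the child after fork() and before exec().
        os.setsid()
        os.sched_setaffinity(0, cpus)

    # The child prints its own affinity mask as a JSON list.
    report = "import json, os; print(json.dumps(sorted(os.sched_getaffinity(0))))"
    result = subprocess.run(
        [sys.executable, "-c", report],
        preexec_fn=pinned_preexec,
        capture_output=True,
        text=True,
        check=True,
    )
    return set(json.loads(result.stdout))
```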
        print(f" Latest release tag: {tag}", flush=True)
        return tag
    print("Error: no release tags found (vX.Y.Z)", file=sys.stderr)
    sys.exit(1)
These functions are imported by benchmark.py, so sys.exit() would kill the whole orchestrator if something goes wrong. Swapping to raise ValueError/RuntimeError would let callers decide what to do.
Converted to ValueError, caught at the CLI boundary in main(). Also moved cleanup_worktrees into a finally block so worktrees get cleaned up on error.
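A small sketch of the shape this change takes, under assumptions: `latest_release_tag`, the `tags` argument, and the `cleanup` hook are illustrative stand-ins, not the functions in this PR. The library function raises, and only the CLI entry point turns the error into an exit code, with cleanup in a `finally` block.

```python
import re
import sys

TAG_RE = re.compile(r"^v\d+\.\d+\.\d+$")


def latest_release_tag(tags):
    """Return the highest vX.Y.Z tag, or raise ValueError if there is none."""
    releases = [t for t in tags if TAG_RE.match(t)]
    if not releases:
        raise ValueError("no release tags found (vX.Y.Z)")
    # Compare numerically so that v1.10.0 sorts above v1.9.0.
    return max(releases, key=lambda t: tuple(int(p) for p in t[1:].split(".")))


def main(tags, cleanup=lambda: None):
    """CLI boundary: the only place that maps errors to an exit code."""
    try:
        tag = latest_release_tag(tags)
    except ValueError as exc:
        print(f"Error: {exc}", file=sys.stderr)
        return 1
    finally:
        cleanup()  # runs on success and on error, like cleanup_worktrees
    print(f"  Latest release tag: {tag}", flush=True)
    return 0
```

An importing module such as the orchestrator can then catch `ValueError` itself instead of being terminated by `sys.exit()`.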
        _record_run(results_dir, row, run_id, version, run_start, run_end, n_requests, status)
        log(f" Result: {n_requests} reqs, status={status}")

    except Exception as e:
This except Exception catches everything including OOM and OS errors. Could you narrow it to expected failure types (subprocess.CalledProcessError, TimeoutError, ConnectionError) and log the exception type + traceback? Right now only the message is recorded, which makes debugging failed runs harder.
Added traceback.format_exc() to the log output.
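For reference, a sketch of what combining both parts of the review request could look like: a narrowed `except` plus the exception type and traceback in the log. `run_once`, `EXPECTED_FAILURES`, and the `"failed"` status string are hypothetical names chosen for this example; the reply above only confirms that the traceback was added.

```python
import subprocess
import traceback

# Failure types a benchmark run is expected to hit. subprocess.TimeoutExpired is
# listed explicitly because it is not a subclass of the builtin TimeoutError.
EXPECTED_FAILURES = (
    subprocess.CalledProcessError,
    subprocess.TimeoutExpired,
    TimeoutError,
    ConnectionError,
)


def run_once(run, log):
    """Call run() and return (n_requests, status); log type and traceback on failure."""
    try:
        return run(), "ok"
    except EXPECTED_FAILURES as exc:
        log(f"  Run failed: {type(exc).__name__}: {exc}")
        log(traceback.format_exc())
        return 0, "failed"
    # Anything else (MemoryError, unexpected OSError, bugs) propagates to the caller.
```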
chunk_id = "chatcmpl-mock-stream"
created = int(time.time())
model = "mock-model"
_stream_text and _stream_tool_call share about 40 lines of identical chunk construction. Extracting a _build_stream_chunk() helper would cut the duplication and make it easier to keep the mock's response format consistent.
Extracted _send_chunk and _finish_stream helpers.
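To make the suggested refactor concrete, here is one possible shape for a shared chunk builder. It is a sketch, not the `_send_chunk` / `_finish_stream` helpers from the PR: `build_stream_chunk` and `to_sse` are hypothetical names. The dictionary layout follows the OpenAI `chat.completion.chunk` streaming format, and the default `chunk_id` and `model` values are the ones visible in the diff excerpt.

```python
import json
import time


def build_stream_chunk(delta, finish_reason=None, *, chunk_id="chatcmpl-mock-stream",
                       model="mock-model", created=None):
    """Build one chat.completion.chunk payload shared by text and tool-call streams."""
    return {
        "id": chunk_id,
        "object": "chat.completion.chunk",
        "created": int(time.time()) if created is None else created,
        "model": model,
        "choices": [{"index": 0, "delta": delta, "finish_reason": finish_reason}],
    }


def to_sse(chunk):
    """Encode a chunk as a server-sent event line pair."""
    return f"data: {json.dumps(chunk)}\n\n".encode()


# A text stream and a tool-call stream then differ only in the delta they pass in.
text_chunk = build_stream_chunk({"content": "Hello"})
tool_chunk = build_stream_chunk(
    {"tool_calls": [{"index": 0, "id": "call_mock", "type": "function",
                     "function": {"name": "web_search", "arguments": "{}"}}]}
)
final_chunk = build_stream_chunk({}, finish_reason="stop")
```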
        assert all(r["status"] == "ok" for r in rows)
        assert all(int(r["requests_completed"]) > 0 for r in rows)

    def test_each_run_has_request_data(self, benchmark_results):
The tests cover the happy path well, but some negative tests (bad refs, port conflicts) would strengthen coverage.
Added test_bad_ref_raises_valueerror and test_port_conflict_raises_preflight_error.
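As a rough idea of what a port-conflict negative test can look like, here is a self-contained sketch. `PreflightError` and `check_port_free` are assumed names and an assumed implementation; the actual check in `experiment/preflight.py` may work differently. The test holds a port open, then expects the preflight check to refuse it.

```python
import socket


class PreflightError(RuntimeError):
    """Raised when the environment is not fit to run the benchmark (assumed name)."""


def check_port_free(port, host="127.0.0.1"):
    """Raise PreflightError if port on host cannot be bound."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError as exc:
            raise PreflightError(f"port {port} is already in use") from exc


def test_port_conflict_raises_preflight_error():
    # Occupy an OS-assigned port for the duration of the check.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as holder:
        holder.bind(("127.0.0.1", 0))
        holder.listen(1)
        port = holder.getsockname()[1]
        try:
            check_port_free(port)
        except PreflightError:
            return
        raise AssertionError("expected PreflightError for an occupied port")
```

The test is written with plain `try`/`except` so it runs with or without pytest; under pytest, `pytest.raises(PreflightError)` would be the more idiomatic form.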
  [[package]]
  name = "cachetools"
- version = "7.1.1"
+ version = "6.2.6"
This might be intentional, but cachetools dropped from 7.1.1 to 6.2.6 in uv.lock. If it's not related to the benchmarking deps, it might be worth regenerating the lockfile from main?
Stale lockfile artifact from a previous branch. Ran uv lock --upgrade-package cachetools, back to 7.1.4.
Adds a data collection pipeline under benchmarking/api_latency_comparison/ for comparing per-request API latency between two OGX versions. The orchestrator sets up git worktrees for each version, generates a randomized complete block design experiment matrix, starts servers with CPU pinning via mirakuru, runs Locust against each version, and records per-request response times. A third "comparison control" group runs the same code as the comparison group to catch false positives from environmental noise.

First of two commits. Follow-up adds model fitting and a CI workflow.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Matthew F Leader <mleader@redhat.com>
What does this PR do?
Adds a data collection pipeline under benchmarking/api_latency_comparison/ for comparing per-request API latency between two OGX versions.

The orchestrator (experiment/benchmark.py) sets up git worktrees for each version, generates a randomized complete block design experiment matrix, starts servers with CPU pinning via mirakuru, runs Locust against each version, and records per-request response times. A third "comparison control" group runs the same code as comparison to catch false positives from environmental noise.

Each run sends agentic requests to /v1/responses with web_search tool use, so latency measurements capture the full agentic loop (request -> tool call -> mock search -> final response).

- experiment/benchmark.py: orchestrator (worktree setup, server lifecycle, run loop)
- experiment/mock_server.py: canned OpenAI chat completions + Brave Search backend with tool call cycling for agentic workloads
- experiment/setup-worktree.sh: creates isolated git worktrees per version, patches older releases for mock compatibility
- experiment/preflight.py: port, disk, CPU pinning checks and computing environment recording
- experiment/generate_design_matrix.py: RCBD matrix generator with version hash resolution
- experiment/locustfile_responses.py: Locust user targeting /v1/responses
- experiment/runlog.py: CSV run log I/O
- configs/stack-config-benchmark.yaml: OGX server config pointed at the mock backend

First of two PRs. Follow-up adds Bayesian model fitting and a CI workflow.
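For readers unfamiliar with the design, here is a minimal sketch of a randomized complete block design (RCBD) matrix. It is not the code in experiment/generate_design_matrix.py: `generate_rcbd`, the column names, and the `"baseline"` / `"comparison"` / `"comparison_control"` labels are assumptions made for the example. Each block (replicate) contains every group exactly once, and the run order is shuffled independently within each block so that drift in the environment does not line up with one group.

```python
import random


def generate_rcbd(treatments, n_blocks, seed=None):
    """Return one row per run: every treatment once per block, in random order."""
    rng = random.Random(seed)  # seeded so a matrix can be reproduced
    rows = []
    for block in range(1, n_blocks + 1):
        order = list(treatments)
        rng.shuffle(order)  # randomize run order within this block only
        for position, treatment in enumerate(order, start=1):
            rows.append({"block": block, "position": position, "treatment": treatment})
    return rows


GROUPS = ["baseline", "comparison", "comparison_control"]  # assumed labels
matrix = generate_rcbd(GROUPS, n_blocks=2, seed=0)
```

The invariant a unit test would check is that, for every block, the set of treatments equals the full set of groups and the block has exactly one row per group.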
Test Plan
test_benchmark.py has 5 tests: 1 RCBD matrix invariant (unit) and 4 end-to-end pipeline tests that run 2 replicates with 3s runs, then verify experiment matrix schema, run log completeness, per-run request data, and environment recording.