feat(benchmarking): add API latency comparison experiment pipeline by mfleader · Pull Request #6088 · ogx-ai/ogx

mfleader · 2026-06-12T03:47:30Z

What does this PR do?

Adds a data collection pipeline under benchmarking/api_latency_comparison/ for comparing per-request API latency between two OGX versions.

The orchestrator (experiment/benchmark.py) sets up git worktrees for each version, generates a randomized complete block design experiment matrix, starts servers with CPU pinning via mirakuru, runs Locust against each version, and records per-request response times. A third "comparison control" group runs the same code as comparison to catch false positives from environmental noise.

Each run sends agentic requests to /v1/responses with web_search tool use, so latency measurements capture the full agentic loop (request -> tool call -> mock search -> final response).

experiment/benchmark.py — orchestrator: worktree setup, server lifecycle, run loop
experiment/mock_server.py — canned OpenAI chat completions + Brave Search backend with tool call cycling for agentic workloads
experiment/setup-worktree.sh — creates isolated git worktrees per version, patches older releases for mock compatibility
experiment/preflight.py — port, disk, CPU pinning checks and computing environment recording
experiment/generate_design_matrix.py — RCBD matrix generator with version hash resolution
experiment/locustfile_responses.py — Locust user targeting /v1/responses
experiment/runlog.py — CSV run log I/O
configs/stack-config-benchmark.yaml — OGX server config pointed at the mock backend

First of two PRs. Follow-up adds Bayesian model fitting and a CI workflow.

Test Plan

uv sync --group api-latency-comparison
uv run python -m pytest benchmarking/api_latency_comparison/experiment/test_benchmark.py -v -s

test_benchmark.py has 5 tests: 1 RCBD matrix invariant (unit) and 4 end-to-end pipeline tests that run 2 replicates with 3s runs, then verify experiment matrix schema, run log completeness, per-run request data, and environment recording.

skamenan7

Looks good. Nice PR @mfleader . Few comments.

skamenan7 · 2026-06-12T19:35:46Z

+        str(cfg["stack_port"]),
+        "--factory",
+    ]
+    log_file = open(log_path, "a")


log_file = open(log_path, 'a') is never closed. Each experiment run leaks a file descriptor. Could you wrap this in a context manager or add explicit cleanup in the finally path?

Wrapped in a context manager. _ogx_server_executor is now a @contextmanager that closes the log file on exit.

skamenan7 · 2026-06-12T19:41:05Z

+            affinity = self._cpu_affinity
+
+            def pinned_preexec():
+                os.setsid()


Interesting race window here. The parent checks CPU affinity right after fork(), but the child's preexec_fn might not have run yet. A tiny sleep or moving verification into the child would make this airtight. Also noticed the _run_locust preexec_fn skips os.setsid().

The race should be safe since mirakuru waits for the health check before returning. Locust skips setsid() intentionally so Ctrl+C propagates from the parent.

skamenan7 · 2026-06-12T19:42:32Z

+            print(f"  Latest release tag: {tag}", flush=True)
+            return tag
+    print("Error: no release tags found (vX.Y.Z)", file=sys.stderr)
+    sys.exit(1)


These functions are imported by benchmark.py, so sys.exit() would kill the whole orchestrator if something goes wrong. Swapping to raise ValueError/RuntimeError would let callers decide what to do.

Converted to ValueError, caught at the CLI boundary in main(). Also moved cleanup_worktrees into a finally block so worktrees get cleaned up on error.

skamenan7 · 2026-06-12T19:43:41Z

+            _record_run(results_dir, row, run_id, version, run_start, run_end, n_requests, status)
+            log(f"  Result: {n_requests} reqs, status={status}")
+
+        except Exception as e:


This except Exception catches everything including OOM and OS errors. Could you narrow it to expected failure types (subprocess.CalledProcessError, TimeoutError, ConnectionError) and log the exception type + traceback? Right now only the message is recorded, which makes debugging failed runs harder.

Added traceback.format_exc() to the log output.

skamenan7 · 2026-06-12T19:44:52Z

+        chunk_id = "chatcmpl-mock-stream"
+        created = int(time.time())
+        model = "mock-model"
+


_stream_text and _stream_tool_call share about 40 lines of identical chunk construction. Extracting a _build_stream_chunk() helper would cut the duplication and make it easier to keep the mock's response format consistent.

Extracted _send_chunk and _finish_stream helpers.

skamenan7 · 2026-06-12T19:46:54Z

+        assert all(r["status"] == "ok" for r in rows)
+        assert all(int(r["requests_completed"]) > 0 for r in rows)
+
+    def test_each_run_has_request_data(self, benchmark_results):


The tests cover the happy path well. but some negative tests (bad refs, port conflicts) would strengthen coverage.

Added test_bad_ref_raises_valueerror and test_port_conflict_raises_preflight_error.

skamenan7 · 2026-06-12T19:48:15Z

 [[package]]
 name = "cachetools"
-version = "7.1.1"
+version = "6.2.6"


This might be intentional, but cachetools dropped from 7.1.1 to 6.2.6 in uv.lock. If it's not related to the benchmarking deps, it might be worth regenerating the lockfile from main?

Stale lockfile artifact from a previous branch. Ran uv lock --upgrade-package cachetools, back to 7.1.4.

Adds a data collection pipeline under benchmarking/api_latency_comparison/ for comparing per-request API latency between two OGX versions. The orchestrator sets up git worktrees for each version, generates a randomized complete block design experiment matrix, starts servers with CPU pinning via mirakuru, runs Locust against each version, and records per-request response times. A third "comparison control" group runs the same code as comparison to catch false positives from environmental noise. First of two commits. Follow-up adds model fitting and a CI workflow. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Matthew F Leader <mleader@redhat.com>

mfleader requested review from bbrowning, cdoern, franciscojavierarceo, leseb, mattf, raghotham and skamenan7 as code owners June 12, 2026 03:47

mfleader force-pushed the api-latency-experiment branch from 12a86e9 to b5829f2 Compare June 12, 2026 03:50

mfleader mentioned this pull request Jun 12, 2026

feat(benchmarking): add Bayesian model fitting and CI workflow #6089

Draft

mfleader force-pushed the api-latency-experiment branch 2 times, most recently from 0b45571 to f90c53b Compare June 12, 2026 15:13

skamenan7 requested changes Jun 12, 2026

View reviewed changes

mfleader force-pushed the api-latency-experiment branch 5 times, most recently from 72b468e to afa482c Compare June 12, 2026 21:25

mfleader force-pushed the api-latency-experiment branch from afa482c to d94d1f6 Compare June 12, 2026 21:30

Conversation

mfleader commented Jun 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do?

Test Plan

Uh oh!

skamenan7 left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

mfleader commented Jun 12, 2026 •

edited

Loading

skamenan7 left a comment •

edited

Loading