Skip to content

feat(benchmarking): add API latency comparison experiment pipeline#6088

Open
mfleader wants to merge 1 commit into
ogx-ai:mainfrom
mfleader:api-latency-experiment
Open

feat(benchmarking): add API latency comparison experiment pipeline#6088
mfleader wants to merge 1 commit into
ogx-ai:mainfrom
mfleader:api-latency-experiment

Conversation

@mfleader

@mfleader mfleader commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

What does this PR do?

Adds a data collection pipeline under benchmarking/api_latency_comparison/ for comparing per-request API latency between two OGX versions.

The orchestrator (experiment/benchmark.py) sets up git worktrees for each version, generates a randomized complete block design experiment matrix, starts servers with CPU pinning via mirakuru, runs Locust against each version, and records per-request response times. A third "comparison control" group runs the same code as comparison to catch false positives from environmental noise.

Each run sends agentic requests to /v1/responses with web_search tool use, so latency measurements capture the full agentic loop (request -> tool call -> mock search -> final response).

  • experiment/benchmark.py — orchestrator: worktree setup, server lifecycle, run loop
  • experiment/mock_server.py — canned OpenAI chat completions + Brave Search backend with tool call cycling for agentic workloads
  • experiment/setup-worktree.sh — creates isolated git worktrees per version, patches older releases for mock compatibility
  • experiment/preflight.py — port, disk, CPU pinning checks and computing environment recording
  • experiment/generate_design_matrix.py — RCBD matrix generator with version hash resolution
  • experiment/locustfile_responses.py — Locust user targeting /v1/responses
  • experiment/runlog.py — CSV run log I/O
  • configs/stack-config-benchmark.yaml — OGX server config pointed at the mock backend

First of two PRs. Follow-up adds Bayesian model fitting and a CI workflow.

Test Plan

uv sync --group api-latency-comparison
uv run python -m pytest benchmarking/api_latency_comparison/experiment/test_benchmark.py -v -s

test_benchmark.py has 5 tests: 1 RCBD matrix invariant (unit) and 4 end-to-end pipeline tests that run 2 replicates with 3s runs, then verify experiment matrix schema, run log completeness, per-run request data, and environment recording.

@skamenan7 skamenan7 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Nice PR @mfleader . Few comments.

str(cfg["stack_port"]),
"--factory",
]
log_file = open(log_path, "a")

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

log_file = open(log_path, 'a') is never closed. Each experiment run leaks a file descriptor. Could you wrap this in a context manager or add explicit cleanup in the finally path?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wrapped in a context manager. _ogx_server_executor is now a @contextmanager that closes the log file on exit.

affinity = self._cpu_affinity

def pinned_preexec():
os.setsid()

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting race window here. The parent checks CPU affinity right after fork(), but the child's preexec_fn might not have run yet. A tiny sleep or moving verification into the child would make this airtight. Also noticed the _run_locust preexec_fn skips os.setsid().

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The race should be safe since mirakuru waits for the health check before returning. Locust skips setsid() intentionally so Ctrl+C propagates from the parent.

print(f" Latest release tag: {tag}", flush=True)
return tag
print("Error: no release tags found (vX.Y.Z)", file=sys.stderr)
sys.exit(1)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These functions are imported by benchmark.py, so sys.exit() would kill the whole orchestrator if something goes wrong. Swapping to raise ValueError/RuntimeError would let callers decide what to do.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Converted to ValueError, caught at the CLI boundary in main(). Also moved cleanup_worktrees into a finally block so worktrees get cleaned up on error.

_record_run(results_dir, row, run_id, version, run_start, run_end, n_requests, status)
log(f" Result: {n_requests} reqs, status={status}")

except Exception as e:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This except Exception catches everything including OOM and OS errors. Could you narrow it to expected failure types (subprocess.CalledProcessError, TimeoutError, ConnectionError) and log the exception type + traceback? Right now only the message is recorded, which makes debugging failed runs harder.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added traceback.format_exc() to the log output.

chunk_id = "chatcmpl-mock-stream"
created = int(time.time())
model = "mock-model"

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_stream_text and _stream_tool_call share about 40 lines of identical chunk construction. Extracting a _build_stream_chunk() helper would cut the duplication and make it easier to keep the mock's response format consistent.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extracted _send_chunk and _finish_stream helpers.

assert all(r["status"] == "ok" for r in rows)
assert all(int(r["requests_completed"]) > 0 for r in rows)

def test_each_run_has_request_data(self, benchmark_results):

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tests cover the happy path well. but some negative tests (bad refs, port conflicts) would strengthen coverage.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added test_bad_ref_raises_valueerror and test_port_conflict_raises_preflight_error.

Comment thread uv.lock Outdated
[[package]]
name = "cachetools"
version = "7.1.1"
version = "6.2.6"

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be intentional, but cachetools dropped from 7.1.1 to 6.2.6 in uv.lock. If it's not related to the benchmarking deps, it might be worth regenerating the lockfile from main?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stale lockfile artifact from a previous branch. Ran uv lock --upgrade-package cachetools, back to 7.1.4.

@mfleader mfleader force-pushed the api-latency-experiment branch 5 times, most recently from 72b468e to afa482c Compare June 12, 2026 21:25
Adds a data collection pipeline under
benchmarking/api_latency_comparison/ for comparing per-request API
latency between two OGX versions.

The orchestrator sets up git worktrees for each version, generates a
randomized complete block design experiment matrix, starts servers
with CPU pinning via mirakuru, runs Locust against each version, and
records per-request response times. A third "comparison control"
group runs the same code as comparison to catch false positives from
environmental noise.

First of two commits. Follow-up adds model fitting and a CI workflow.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Matthew F Leader <mleader@redhat.com>
@mfleader mfleader force-pushed the api-latency-experiment branch from afa482c to d94d1f6 Compare June 12, 2026 21:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants