feat: agentic-replay (mooncake_trace) on official AIPerf — Qwen3.5-4B smoke by thangquang09 · Pull Request #7 · vngcloud/InferenceX

thangquang09 · 2026-06-03T06:21:03Z

Summary

Adds an agentic-replay scenario-type that replays a recorded mooncake_trace JSONL through official AIPerf (the aiperf client) and rides the standard bmk_* aggregation — fully off the retired cquil11/aiperf fork pipeline (per ADR-0001).

First config: qwen3.5-4b-bf16-h100-vllm (vLLM, bf16, TP=1, conc=2) against a committed 12-record smoke trace.

Wiring

New single_node['agentic-replay'] bucket (process_changelog.py) → dedicated sweep-single-node-agentic-replay job (run-sweep.yml) → benchmark-tmpl.yml with new input-file/custom-dataset-type inputs.
Artifact gates key on scenario-type != 'agentic-coding', so results flow through process_result.py → bmk_* automatically.
New launcher qwen3.5-4b_bf16_h100_vllm.sh replays the trace once (--request-count = record count, no --isl/--osl).

Known limitation

mooncake_trace is not multi-turn in AIPerf — the 12 records replay as flat requests; prefix-cache reuse comes from hash_ids block overlap, not FORK-mode threading.
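
A minimal sketch of how `hash_ids` block overlap could translate into prefix-cache reuse. It assumes each trace record carries a `hash_ids` list of block identifiers and that a cache hit requires a matching leading run of blocks. The helper names and sample records are hypothetical and are not taken from AIPerf or this PR.

```python
def shared_prefix_blocks(a, b):
    """Count the leading block ids that two hash_ids lists have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def estimate_prefix_reuse(records):
    """Fraction of blocks covered by the longest prefix shared with any earlier record."""
    seen = []
    reused = total = 0
    for rec in records:
        ids = rec["hash_ids"]
        best = max((shared_prefix_blocks(ids, prev) for prev in seen), default=0)
        reused += best
        total += len(ids)
        seen.append(ids)
    return reused / total if total else 0.0


# Hypothetical records: the second extends the first, the third diverges at block 3.
records = [
    {"hash_ids": [1, 2, 3]},
    {"hash_ids": [1, 2, 3, 4]},
    {"hash_ids": [1, 2, 9]},
]
print(estimate_prefix_reuse(records))  # 5 reused blocks out of 10 -> 0.5
```

This is an upper bound on reuse: a real engine can evict blocks, so the measured hit rate may be lower.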

Verification

Matrix gen verified for full-sweep and CI (test-config) paths → 1 entry, conc 2.
181 matrix_logic tests pass (+8 new).

Sweep trigger (perf-changelog.yaml entry) added in a follow-up commit; label sweep-enabled to run.

🤖 Generated with Claude Code

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

… encoder floor Gemma4's MM encoder requires max_tokens_per_mm_item=2496; vLLM forces --disable_chunked_mm_input for bidirectional attention, making max_num_batched_tokens equal max_model_len (2304 for 1k1k). This crashes at startup. Fix: config sets max-num-batched-tokens: 4096 in search-space; script reads MAX_NUM_BATCHED_TOKENS and passes it to --max-num-batched-tokens. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…4B smoke

Add an `agentic-replay` scenario-type that replays a recorded mooncake_trace JSONL through official AIPerf (the `aiperf` client) and rides the standard bmk_* aggregation, fully off the retired cquil11/aiperf fork pipeline.

- validation.py: AgenticReplaySearchSpaceEntry / AgenticReplayConfig / SingleNodeAgenticReplayMatrixEntry + validator; scenarios container and ChangelogMatrixEntry union updated.
- generate_sweep_configs.py: agentic-replay branch in full-sweep and test-config; --scenario-type choice; eval-skip guard.
- process_changelog.py: route agentic-replay into single_node['agentic-replay'].
- run-sweep.yml: dedicated sweep-single-node-agentic-replay job, wired into collect-results. benchmark-tmpl.yml: input-file / custom-dataset-type inputs.
- launch_h100-greennode.sh forwards the new env; new launcher qwen3.5-4b_bf16_h100_vllm.sh replays the trace once (request-count = record count, no isl/osl). benchmark_lib.sh: run_client_benchmark accepts --input-file/--custom-dataset-type/--request-count.
- nvidia-master.yaml: qwen3.5-4b-bf16-h100-vllm config (vLLM bf16 TP=1 conc=2).
- Smoke dataset committed; AIPERF_INTEGRATION.md documents the path and the mooncake_trace multi-turn limitation; +8 validation tests.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

github-actions · 2026-06-03T06:21:14Z

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

github-actions · 2026-06-03T06:29:00Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26867517453
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=26867517453

…-replay in e2e dispatch run_client_benchmark reads BENCHMARK_CLIENT from the env, but launch_h100-greennode.sh never forwarded it into the docker container, so it defaulted to inferencex_native — silently for fixed-seq aiperf runs, and fatally for agentic-replay (which only supports aiperf). Add BENCHMARK_CLIENT to the launcher passthrough. Also wire agentic-replay into e2e-tests.yml (workflow_dispatch): a dedicated agentic-replay bucket + test-sweep-agentic-replay job that forwards input-file/custom-dataset-type/scenario-type, so a single config can be run in isolation via dispatch (run-sweep.yml is not dispatchable). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Compute tok/W = total_token_throughput / mean total GPU power, derived from the power.draw column that start_gpu_monitor already logs to gpu_metrics.csv every run. No DCGM/pynvml required; works for both the native and aiperf clients and all engines.

- process_result.py: mean_total_power_w() parses gpu_metrics.csv, sums only the N busiest GPUs (N=TP single-node / total_gpus multi-node) so idle cards on a shared host don't inflate power, and windows samples to the last `duration` seconds to exclude model-load/warmup. Emits tok_per_watt + mean_power_w (null when the CSV is absent).
- summarize.py: new "Token/Watt (tok/s/W)" and "Power Mean (W)" columns on the single-node and multi-node tables.
- aiperf_adapter.py: emit `duration` from AIPerf's benchmark_duration so the aiperf-client path windows power correctly.
- tests: busiest-GPU selection, warmup windowing, missing-CSV null, duration mapping present/absent.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
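
The derivation described in this commit (busiest-GPU selection plus a trailing duration window) can be sketched as follows. This is not the repository's `mean_total_power_w()`. The column names `elapsed_s`, `index`, and `power_w` are assumptions, since the real gpu_metrics.csv layout is not shown here, and ranking GPUs by mean power draw is likewise an assumed definition of "busiest".

```python
import csv
import io
from collections import defaultdict


def mean_power_of_busiest_gpus(csv_text, n_gpus, duration_s):
    """Sum the mean power of the n busiest GPUs over the last duration_s seconds."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        return None  # mirrors the "null when the CSV is absent" behaviour
    t_end = max(float(r["elapsed_s"]) for r in rows)
    per_gpu = defaultdict(list)
    for r in rows:
        # Keep only samples inside the trailing window to drop load/warmup.
        if float(r["elapsed_s"]) > t_end - duration_s:
            per_gpu[r["index"]].append(float(r["power_w"]))
    means = sorted((sum(v) / len(v) for v in per_gpu.values()), reverse=True)
    return sum(means[:n_gpus])


def tok_per_watt(total_token_throughput, mean_power_w):
    """tok/s divided by watts; None when power is unavailable."""
    if not mean_power_w:
        return None
    return total_token_throughput / mean_power_w


# Hypothetical log: GPU 0 serves the model, GPU 1 idles on a shared host.
SAMPLE = """elapsed_s,index,power_w
0,0,100
0,1,50
10,0,400
10,1,60
20,0,600
20,1,70
"""
power = mean_power_of_busiest_gpus(SAMPLE, n_gpus=1, duration_s=15)
print(power, tok_per_watt(1000.0, power))
```

With `n_gpus=1` the idle card is excluded, and the 15-second window drops the warmup sample at `elapsed_s=0`.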

Add an "Energy Efficiency (tokens/Watt)" section covering the metric definition, the nvidia-smi power source (no DCGM required), the busiest-GPU + duration-window derivation, and the new result/table fields. Cross-reference gpu_metrics.csv in Artifacts. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

… run Drop the hf:// fetch approach (the container's hf CLI produced a 0-record subset → request-count<concurrency failure). Commit the traces directly: benchmarks/single_node/agentic/datasets/agentic-coding-{64k,128k}.jsonl (18,595 / 16,957 records). Committed as normal blobs — repo does not LFS-track .jsonl. The agentic-replay launcher now supports an optional "#N" suffix on the input-file path to replay only the first N records (low-resource subset of a large committed trace). Add a qwen3.5-4b-bf16-h100-vllm agentic-replay scenario: 64k trace, first 2000 records, concurrency 32, max-model-len 40960 (covers the dataset's real max input+output of 38,613). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
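
A minimal sketch of the optional "#N" suffix handling, assuming the suffix is a plain decimal count appended to the path. The actual launcher is a shell script; the function names and the example path below are hypothetical and only illustrate the parsing and the request-count guard.

```python
def split_input_file(spec):
    """Split 'path#N' into (path, N); N is None when there is no numeric suffix."""
    path, sep, count = spec.rpartition("#")
    if sep and count.isdigit():
        return path, int(count)
    return spec, None


def resolve_request_count(total_records, limit, concurrency):
    """Replay the first `limit` records (or all of them) exactly once."""
    n = total_records if limit is None else min(limit, total_records)
    if n < concurrency:
        # The failure mode seen with the 0-record hf:// subset.
        raise ValueError(f"request-count {n} < concurrency {concurrency}")
    return n


path, limit = split_input_file("datasets/example-trace.jsonl#2000")
print(path, limit, resolve_request_count(18595, limit, concurrency=32))
```

Failing fast when the record count is below the concurrency makes an empty or truncated subset visible before the benchmark starts.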

…l tok/W

Model swap (prefix-caching fix):

- Qwen/Qwen3.5-4B is a hybrid-Mamba model; vLLM V1 auto-disables prefix caching for it (observed 0% hit rate on the agentic-replay trace). Replace it with Qwen/Qwen3-4B-Instruct-2507, a dense Qwen3ForCausalLM (256K native context, no rope-scaling) for which vLLM keeps prefix caching ON by default.
- Rename benchmarks/single_node/{qwen3.5-4b => qwen3-4b-2507}_bf16_h100_vllm.sh (launcher resolves the script from model-prefix) and fix its served-model default. Update the nvidia-master.yaml config block and docs.

aiperf adapter / summary metrics:

- aiperf_adapter.py now emits p50/p75/p90/p95/p99 (was mean+p99 only) for ttft/tpot/itl/e2el via a helper; process_result.py's *_ms loop carries them through. summarize.py renders mean+p50+p75+p90+p95+p99 for each latency.
- tok/W is now reported in two conventions: tok_per_watt_total (input+output, the prior value, kept as the tok_per_watt alias) and tok_per_watt_output (decoded tokens only). Both surfaced as summary columns and documented.

Tests: adapter fixture/assertions extended; 121 passed / 1 skipped.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
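
The two tok/W conventions named in this commit can be illustrated with a short sketch. The field names come from the commit message; the function name, its arguments, and the sample numbers are hypothetical.

```python
def tok_per_watt_fields(input_tps, output_tps, mean_power_w):
    """Return both tok/W conventions, or None values when power is unknown."""
    if not mean_power_w:
        return {"tok_per_watt_total": None, "tok_per_watt": None, "tok_per_watt_output": None}
    total = (input_tps + output_tps) / mean_power_w
    return {
        "tok_per_watt_total": total,   # input + output tokens
        "tok_per_watt": total,         # alias kept for the prior value
        "tok_per_watt_output": output_tps / mean_power_w,  # decoded tokens only
    }


# Hypothetical run: 3000 input tok/s, 1000 output tok/s, 500 W mean power.
fields = tok_per_watt_fields(3000.0, 1000.0, 500.0)
print(fields)
```

For prefill-heavy agentic traces the two values can differ by a large factor, which is why both are reported.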

…ions) AIPerf replays mooncake_trace records as multi-turn sessions keyed on session_id, so context accumulates across turns. The realized per-request total reaches ~65,847 tokens, far above the per-record input_length max of 37,818 the value was originally sized from. At max-model-len 40960, 1100/2000 (55%) requests were rejected with HTTP 400 (input+output > window), silently biasing the metrics to short early-turn requests. 73728 covers the realized max with headroom; Qwen3-4B-2507 supports 256K so no model-side limit. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

mooncake_trace replays each session_id as a multi-turn conversation, so context accumulates across turns; size max-model-len from the session-cumulative max (64k: 66,655→73728; 128k: 133,851→~147456), not the per-record length. Documents the silent-truncation trap (40960 rejected 55% of the 64k run while the CI job stayed green) and corrects the earlier 'not multi-turn' claim, which the run evidence disproves. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
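
A rough sketch of sizing max-model-len from the session-cumulative maximum rather than the per-record length. The accumulation model (each turn adds its `input_length` and `output_length` to the session's running context) is an assumption, and so are the helper names. The 10% headroom rounded up to a multiple of 1024 happens to reproduce the two values quoted above, but the commit does not state how they were chosen.

```python
import math
from collections import defaultdict


def session_cumulative_max(records):
    """Largest running context reached by any session, in tokens."""
    running = defaultdict(int)
    peak = 0
    for rec in records:
        sid = rec["session_id"]
        # Assumed model: every turn appends its input and output to the context.
        running[sid] += rec["input_length"] + rec["output_length"]
        peak = max(peak, running[sid])
    return peak


def suggest_max_model_len(peak_tokens, headroom=0.10, multiple=1024):
    """Add headroom to the peak and round up to a multiple of `multiple`."""
    return math.ceil(peak_tokens * (1 + headroom) / multiple) * multiple


# Hypothetical trace: session "a" has two turns, session "b" has one.
records = [
    {"session_id": "a", "input_length": 100, "output_length": 20},
    {"session_id": "b", "input_length": 150, "output_length": 10},
    {"session_id": "a", "input_length": 50, "output_length": 10},
]
print(session_cumulative_max(records))   # per-record max is 160, session max is 180
print(suggest_max_model_len(66655), suggest_max_model_len(133851))
```

Sizing from the per-record maximum alone would have picked 160 in this toy trace and rejected the second turn of session "a".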

…TP=2, fp8)

Add an engine-vs-engine agentic-replay comparison on google/gemma-4-31B-it over the committed 64k agentic-coding trace (SemiAnalysisAI#1000), 2x H100 / TP=2 / fp8, both cache-ON.

- New configs gemma4-agentic-fp8-h100-2x-{vllm,sglang} under a distinct model-prefix `gemma4-agentic` so the launcher resolves dedicated trace-replay scripts instead of the fixed-seq-len gemma4_*.sh.
- New scripts gemma4-agentic_fp8_h100_{vllm,sglang}.sh: vLLM on-the-fly fp8 over google/gemma-4-31B-it; SGLang pre-quantized RedHatAI/gemma-4-31B-it-FP8-dynamic (its on-the-fly fp8 crashes on the vision tower). Both fp8_e4m3 KV, prefix/radix cache default-ON, replayed once via AIPerf.
- context-length/max-model-len 73728 covers the session-cumulative max of the SemiAnalysisAI#1000 subset (AIPerf threads mooncake sessions; context accumulates across turns).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…he 64k pair) The smoke pre-flight scenario doubled the matrix to 4 jobs; keep only the 64k SemiAnalysisAI#1000 agentic-replay leg so the run is exactly the 2-engine comparison (vLLM, SGLang). No script/serve changes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…d trap; SGLang fa3

- AIPERF_INTEGRATION.md: add "Verifying a run is valid" (200/400 gate, cache-hit check, measured prefix/radix hit table correcting the ~95% claim), and "Engine-vs-engine comparison" (AIPerf engine-agnostic client, distinct model-prefix, dual-key dispatch + the feature/aiperf workflow-ref gotcha, and the backend-fairness trap: SGLang forces triton for Gemma multimodal -> ~3-4x slower).
- gemma4-agentic_fp8_h100_sglang.sh: pin --attention-backend fa3 (text-only trace has no image tokens, so the triton-only bidirectional path is unneeded) to get a fair comparison vs vLLM/FlashInfer.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
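
The 200/400 gate mentioned above could look roughly like the sketch below. It assumes the per-request HTTP status codes are available as a list; the function name and the fail-closed rule (every expected request must return 200) are illustrative assumptions, not the contents of AIPERF_INTEGRATION.md.

```python
from collections import Counter


def replay_validity(status_codes, expected_requests):
    """Summarise HTTP outcomes and decide whether the replay can be trusted."""
    counts = Counter(status_codes)
    ok = counts[200]
    rejected = counts[400]
    completed = sum(counts.values())
    return {
        "ok": ok,
        "rejected": rejected,
        "reject_fraction": rejected / completed if completed else 0.0,
        # Fail closed: a partial replay is reported as invalid.
        "valid": completed == expected_requests and ok == expected_requests,
    }


# The 64k run at max-model-len 40960: 1100 of 2000 requests rejected with HTTP 400.
report = replay_validity([200] * 900 + [400] * 1100, expected_requests=2000)
print(report)
```

A gate like this turns the silent-truncation case into an explicit failure instead of a green job with biased metrics.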

Thắng. Lý Quang (5) and others added 4 commits June 3, 2026 11:04

test: narrow gemma4 bf16 h100 sweep to 1k1k conc=1 aiperf only

bcfe5e3

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

ci: ignore compare-results failure on PR sweeps

e082dce

ci: trigger agentic-replay smoke sweep for qwen3.5-4b-bf16-h100-vllm

899b2bd

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

thangquang09 added the sweep-enabled label ("Run trimmed benchmark sweep for this PR") Jun 3, 2026

thangquang09 removed the sweep-enabled label ("Run trimmed benchmark sweep for this PR") Jun 3, 2026

Thắng. Lý Quang (5) and others added 3 commits June 3, 2026 13:33

thangquang09 force-pushed the feature/aiperf branch from 2a5bb21 to 8d23f35 Compare June 3, 2026 08:05

Thắng. Lý Quang (5) and others added 9 commits June 3, 2026 15:06

fix(aiperf): fail closed on partial replay results

d51531a

fix(aiperf): keep replay CLI compatible with PyPI

fa6e877

thangquang09 wants to merge 17 commits into dev from feature/aiperf
