feat: agentic-replay (mooncake_trace) on official AIPerf — Qwen3.5-4B smoke#7

Open
thangquang09 wants to merge 17 commits into dev from feature/aiperf
Conversation

@thangquang09
Collaborator

Summary

Adds an agentic-replay scenario-type that replays a recorded mooncake_trace JSONL through official AIPerf (the aiperf client) and rides the standard bmk_* aggregation — fully off the retired cquil11/aiperf fork pipeline (per ADR-0001).

First config: qwen3.5-4b-bf16-h100-vllm (vLLM, bf16, TP=1, conc=2) against a committed 12-record smoke trace.

Wiring

  • New single_node['agentic-replay'] bucket (process_changelog.py) → dedicated sweep-single-node-agentic-replay job (run-sweep.yml) → benchmark-tmpl.yml with new input-file/custom-dataset-type inputs.
  • Artifact gates key on scenario-type != 'agentic-coding', so results flow through process_result.py → bmk_* automatically.
  • New launcher qwen3.5-4b_bf16_h100_vllm.sh replays the trace once (--request-count = record count, no --isl/--osl).

Known limitation

mooncake_trace is not multi-turn in AIPerf — the 12 records replay as flat requests; prefix-cache reuse comes from hash_ids block overlap, not FORK-mode threading.

Verification

  • Matrix gen verified for full-sweep and CI (test-config) paths → 1 entry, conc 2.
  • 181 matrix_logic tests pass (+8 new).

Sweep trigger (perf-changelog.yaml entry) added in a follow-up commit; label sweep-enabled to run.

🤖 Generated with Claude Code

Thắng. Lý Quang (5) and others added 4 commits June 3, 2026 11:04
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… encoder floor

Gemma4's MM encoder requires max_tokens_per_mm_item=2496; vLLM forces
--disable_chunked_mm_input for bidirectional attention, making
max_num_batched_tokens equal max_model_len (2304 for 1k1k). This crashes
at startup. Fix: config sets max-num-batched-tokens: 4096 in search-space;
script reads MAX_NUM_BATCHED_TOKENS and passes it to --max-num-batched-tokens.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…4B smoke

Add an `agentic-replay` scenario-type that replays a recorded mooncake_trace
JSONL through official AIPerf (the `aiperf` client) and rides the standard
bmk_* aggregation, fully off the retired cquil11/aiperf fork pipeline.

- validation.py: AgenticReplaySearchSpaceEntry / AgenticReplayConfig /
  SingleNodeAgenticReplayMatrixEntry + validator; scenarios container and
  ChangelogMatrixEntry union updated.
- generate_sweep_configs.py: agentic-replay branch in full-sweep and
  test-config; --scenario-type choice; eval-skip guard.
- process_changelog.py: route agentic-replay into single_node['agentic-replay'].
- run-sweep.yml: dedicated sweep-single-node-agentic-replay job, wired into
  collect-results. benchmark-tmpl.yml: input-file / custom-dataset-type inputs.
- launch_h100-greennode.sh forwards the new env; new launcher
  qwen3.5-4b_bf16_h100_vllm.sh replays the trace once (request-count = record
  count, no isl/osl). benchmark_lib.sh: run_client_benchmark accepts
  --input-file/--custom-dataset-type/--request-count.
- nvidia-master.yaml: qwen3.5-4b-bf16-h100-vllm config (vLLM bf16 TP=1 conc=2).
- Smoke dataset committed; AIPERF_INTEGRATION.md documents the path and the
  mooncake_trace multi-turn limitation; +8 validation tests.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@github-actions

github-actions Bot commented Jun 3, 2026

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@thangquang09 added the sweep-enabled label (Run trimmed benchmark sweep for this PR) Jun 3, 2026
@github-actions

github-actions Bot commented Jun 3, 2026

@thangquang09 removed the sweep-enabled label (Run trimmed benchmark sweep for this PR) Jun 3, 2026
Thắng. Lý Quang (5) and others added 3 commits June 3, 2026 13:33
…-replay in e2e dispatch

run_client_benchmark reads BENCHMARK_CLIENT from the env, but
launch_h100-greennode.sh never forwarded it into the docker container, so it
defaulted to inferencex_native — silently for fixed-seq aiperf runs, and fatally
for agentic-replay (which only supports aiperf). Add BENCHMARK_CLIENT to the
launcher passthrough.

Also wire agentic-replay into e2e-tests.yml (workflow_dispatch): a dedicated
agentic-replay bucket + test-sweep-agentic-replay job that forwards
input-file/custom-dataset-type/scenario-type, so a single config can be run in
isolation via dispatch (run-sweep.yml is not dispatchable).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Compute tok/W = total_token_throughput / mean total GPU power, derived
from the power.draw column that start_gpu_monitor already logs to
gpu_metrics.csv every run. No DCGM/pynvml required; works for both the
native and aiperf clients and all engines.

- process_result.py: mean_total_power_w() parses gpu_metrics.csv, sums
  only the N busiest GPUs (N=TP single-node / total_gpus multi-node) so
  idle cards on a shared host don't inflate power, and windows samples to
  the last `duration` seconds to exclude model-load/warmup. Emits
  tok_per_watt + mean_power_w (null when the CSV is absent).
- summarize.py: new "Token/Watt (tok/s/W)" and "Power Mean (W)" columns
  on the single-node and multi-node tables.
- aiperf_adapter.py: emit `duration` from AIPerf's benchmark_duration so
  the aiperf-client path windows power correctly.
- tests: busiest-GPU selection, warmup windowing, missing-CSV null,
  duration mapping present/absent.
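
The derivation described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in process_result.py: the column names `index`, `elapsed_s` and `power_w` are hypothetical stand-ins for whatever start_gpu_monitor writes to gpu_metrics.csv, and the helper `tok_per_watt` is invented for the example.

```python
import csv
import os
from collections import defaultdict


def mean_total_power_w(rows, n_gpus, duration_s=None):
    """Sum the mean power of the n_gpus busiest GPUs over the last duration_s seconds.

    rows: iterable of dicts with hypothetical keys "index", "elapsed_s", "power_w".
    Returns None when there are no usable samples.
    """
    samples = defaultdict(list)  # gpu index -> [(elapsed seconds, watts)]
    for row in rows:
        samples[row["index"]].append((float(row["elapsed_s"]), float(row["power_w"])))
    if not samples:
        return None

    # Window to the tail of the log so model load and warmup are excluded.
    t_end = max(t for pts in samples.values() for t, _ in pts)
    t_start = t_end - duration_s if duration_s is not None else float("-inf")

    per_gpu_mean = []
    for pts in samples.values():
        watts = [w for t, w in pts if t >= t_start]
        if watts:
            per_gpu_mean.append(sum(watts) / len(watts))

    # Keep only the busiest cards so idle GPUs on a shared host are ignored.
    busiest = sorted(per_gpu_mean, reverse=True)[:n_gpus]
    return sum(busiest) if busiest else None


def tok_per_watt(total_token_throughput, csv_path, n_gpus, duration_s=None):
    """Return tokens/s per watt, or None when the CSV is absent or empty."""
    if not os.path.exists(csv_path):
        return None
    with open(csv_path, newline="") as f:
        power = mean_total_power_w(list(csv.DictReader(f)), n_gpus, duration_s)
    return total_token_throughput / power if power else None
```

Taking per-GPU means before selecting the top N keeps a single power spike on an otherwise idle card from counting as "busy".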

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add an "Energy Efficiency (tokens/Watt)" section covering the metric
definition, the nvidia-smi power source (no DCGM required), the
busiest-GPU + duration-window derivation, and the new result/table
fields. Cross-reference gpu_metrics.csv in Artifacts.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Thắng. Lý Quang (5) and others added 9 commits June 3, 2026 15:06
… run

Drop the hf:// fetch approach (the container's hf CLI produced a 0-record
subset → request-count<concurrency failure). Commit the traces directly:
  benchmarks/single_node/agentic/datasets/agentic-coding-{64k,128k}.jsonl
(18,595 / 16,957 records). Committed as normal blobs — repo does not LFS-track
.jsonl.

The agentic-replay launcher now supports an optional "#N" suffix on the
input-file path to replay only the first N records (low-resource subset of a
large committed trace). Add a qwen3.5-4b-bf16-h100-vllm agentic-replay
scenario: 64k trace, first 2000 records, concurrency 32, max-model-len 40960
(covers the dataset's real max input+output of 38,613).
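
A minimal Python sketch of the "#N" suffix handling, for illustration only. The real logic lives in the shell launcher and may differ; the function names `split_record_limit`, `first_records` and `check_request_count` are hypothetical.

```python
from itertools import islice


def split_record_limit(input_file):
    """Split 'path#N' into (path, N); return (input_file, None) without a numeric suffix."""
    path, sep, suffix = input_file.rpartition("#")
    if sep and path and suffix.isdigit():
        return path, int(suffix)
    return input_file, None


def first_records(lines, limit):
    """Return the first `limit` non-blank JSONL lines (all of them when limit is None)."""
    records = (line for line in lines if line.strip())
    return list(islice(records, limit))


def check_request_count(request_count, concurrency):
    """Fail early on the request-count < concurrency case an empty subset produces."""
    if request_count < concurrency:
        raise ValueError(
            f"request-count {request_count} is below concurrency {concurrency}"
        )
    return request_count
```

For example, `split_record_limit("benchmarks/single_node/agentic/datasets/agentic-coding-64k.jsonl#2000")` yields the plain path and 2000, and the length of the subset would then be passed as the request count.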

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…l tok/W

Model swap (prefix-caching fix):
- Qwen/Qwen3.5-4B is a hybrid-Mamba model; vLLM V1 auto-disables prefix
  caching for it (observed 0% hit rate on the agentic-replay trace). Replace
  it with Qwen/Qwen3-4B-Instruct-2507, a dense Qwen3ForCausalLM (256K native
  context, no rope-scaling) for which vLLM keeps prefix caching ON by default.
- Rename benchmarks/single_node/{qwen3.5-4b => qwen3-4b-2507}_bf16_h100_vllm.sh
  (launcher resolves the script from model-prefix) and fix its served-model
  default. Update the nvidia-master.yaml config block and docs.

aiperf adapter / summary metrics:
- aiperf_adapter.py now emits p50/p75/p90/p95/p99 (was mean+p99 only) for
  ttft/tpot/itl/e2el via a helper; process_result.py's *_ms loop carries them
  through. summarize.py renders mean+p50+p75+p90+p95+p99 for each latency.
- tok/W is now reported in two conventions: tok_per_watt_total (input+output,
  the prior value, kept as the tok_per_watt alias) and tok_per_watt_output
  (decoded tokens only). Both surfaced as summary columns and documented.
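
The two conventions differ only in the numerator. A small sketch, assuming input and output token throughput are available separately; the function name and argument names are hypothetical, while the result keys follow the field names above:

```python
def tok_per_watt_pair(input_tok_s, output_tok_s, mean_power_w):
    """Return both tok/W conventions; all values are None when power is unknown."""
    if not mean_power_w:
        return {
            "tok_per_watt_total": None,
            "tok_per_watt": None,
            "tok_per_watt_output": None,
        }
    total = (input_tok_s + output_tok_s) / mean_power_w  # input + output tokens
    output = output_tok_s / mean_power_w                 # decoded tokens only
    return {
        "tok_per_watt_total": total,
        "tok_per_watt": total,  # alias kept for the prior single-value field
        "tok_per_watt_output": output,
    }
```

On a prefill-heavy trace the two values can differ by a large factor, which is why both are worth reporting side by side.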

Tests: adapter fixture/assertions extended; 121 passed / 1 skipped.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ions)

AIPerf replays mooncake_trace records as multi-turn sessions keyed on
session_id, so context accumulates across turns. The realized per-request
total reaches ~65,847 tokens, far above the per-record input_length max of
37,818 the value was originally sized from. At max-model-len 40960, 1100/2000
(55%) requests were rejected with HTTP 400 (input+output > window), silently
biasing the metrics to short early-turn requests. 73728 covers the realized
max with headroom; Qwen3-4B-2507 supports 256K so no model-side limit.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
mooncake_trace replays each session_id as a multi-turn conversation, so
context accumulates across turns; size max-model-len from the session-
cumulative max (64k: 66,655→73728; 128k: 133,851→~147456), not the per-record
length. Documents the silent-truncation trap (40960 rejected 55% of the 64k
run while the CI job stayed green) and corrects the earlier 'not multi-turn'
claim, which the run evidence disproves.
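
The sizing rule can be sketched as below. Two things here are assumptions rather than facts from this PR: that each turn's context is the running sum of `input_length + output_length` within its `session_id` (AIPerf's actual accumulation may differ), and the rounding rule (10% headroom, rounded up to a multiple of 8192), which happens to reproduce the 73728 and 147456 figures but is not stated as the method used.

```python
from collections import defaultdict


def session_cumulative_max(records):
    """Largest running input+output total reached by any session.

    records: iterable of dicts with "session_id", "input_length", "output_length".
    Assumes every turn carries all earlier turns of its session as context.
    """
    running = defaultdict(int)
    worst = 0
    for rec in records:
        sid = rec["session_id"]
        running[sid] += rec["input_length"] + rec["output_length"]
        worst = max(worst, running[sid])
    return worst


def suggest_max_model_len(max_tokens, headroom_pct=10, multiple=8192):
    """Add headroom, then round up to a multiple (hypothetical rule, integer math)."""
    padded = -(-max_tokens * (100 + headroom_pct) // 100)  # ceiling division
    return -(-padded // multiple) * multiple


# With the session-cumulative maxima quoted above:
# suggest_max_model_len(66_655)  -> 73728
# suggest_max_model_len(133_851) -> 147456
```

Sizing from the per-record maximum instead would use only the largest single `input_length + output_length`, which is what led to the 40960 value.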

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…TP=2, fp8)

Add an engine-vs-engine agentic-replay comparison on google/gemma-4-31B-it over
the committed 64k agentic-coding trace (#1000), 2x H100 / TP=2 / fp8, both cache-ON.

- New configs gemma4-agentic-fp8-h100-2x-{vllm,sglang} under a distinct
  model-prefix `gemma4-agentic` so the launcher resolves dedicated trace-replay
  scripts instead of the fixed-seq-len gemma4_*.sh.
- New scripts gemma4-agentic_fp8_h100_{vllm,sglang}.sh: vLLM on-the-fly fp8 over
  google/gemma-4-31B-it; SGLang pre-quantized RedHatAI/gemma-4-31B-it-FP8-dynamic
  (its on-the-fly fp8 crashes on the vision tower). Both fp8_e4m3 KV, prefix/radix
  cache default-ON, replayed once via AIPerf.
- context-length/max-model-len 73728 covers the session-cumulative max of the
  #1000 subset (AIPerf threads mooncake sessions; context accumulates across turns).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…he 64k pair)

The smoke pre-flight scenario doubled the matrix to 4 jobs; keep only the 64k
#1000 agentic-replay leg so the run is exactly the 2-engine comparison (vLLM,
SGLang). No script/serve changes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…d trap; SGLang fa3

- AIPERF_INTEGRATION.md: add "Verifying a run is valid" (200/400 gate, cache-hit
  check, measured prefix/radix hit table correcting the ~95% claim), and
  "Engine-vs-engine comparison" (AIPerf engine-agnostic client, distinct
  model-prefix, dual-key dispatch + the feature/aiperf workflow-ref gotcha, and the
  backend-fairness trap: SGLang forces triton for Gemma multimodal -> ~3-4x slower).
- gemma4-agentic_fp8_h100_sglang.sh: pin --attention-backend fa3 (text-only trace
  has no image tokens, so the triton-only bidirectional path is unneeded) to get a
  fair comparison vs vLLM/FlashInfer.
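
A sketch of what a 200/400 gate could look like, assuming the per-request HTTP status codes can be collected into a list. How they are extracted from AIPerf or server logs is not shown, and both function names are hypothetical.

```python
from collections import Counter


def rejection_rate(status_codes):
    """Fraction of requests that did not return HTTP 200."""
    counts = Counter(status_codes)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no requests recorded")
    return (total - counts[200]) / total


def assert_run_valid(status_codes, max_rejected=0.0):
    """Raise when more than max_rejected of the requests were rejected."""
    rate = rejection_rate(status_codes)
    if rate > max_rejected:
        raise RuntimeError(f"{rate:.0%} of requests were rejected; metrics are biased")
    return rate
```

Applied to the earlier 64k run (1100 of 2000 requests returning HTTP 400), `rejection_rate` gives 0.55, so a gate like this would have failed the job instead of leaving it green.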

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>