feat: agentic-replay (mooncake_trace) on official AIPerf — Qwen3.5-4B smoke#7
thangquang09 wants to merge 17 commits into
Conversation
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… encoder floor
Gemma4's MM encoder requires max_tokens_per_mm_item=2496; vLLM forces
--disable_chunked_mm_input for bidirectional attention, making
max_num_batched_tokens equal max_model_len (2304 for 1k1k). Because 2304 is
below the 2496-token encoder floor, the server crashes at startup.
Fix: config sets max-num-batched-tokens: 4096 in search-space; script reads
MAX_NUM_BATCHED_TOKENS and passes it to --max-num-batched-tokens.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…4B smoke
Add an `agentic-replay` scenario-type that replays a recorded mooncake_trace
JSONL through official AIPerf (the `aiperf` client) and rides the standard
bmk_* aggregation, fully off the retired cquil11/aiperf fork pipeline.
- validation.py: AgenticReplaySearchSpaceEntry / AgenticReplayConfig /
  SingleNodeAgenticReplayMatrixEntry + validator; scenarios container and
  ChangelogMatrixEntry union updated.
- generate_sweep_configs.py: agentic-replay branch in full-sweep and
  test-config; --scenario-type choice; eval-skip guard.
- process_changelog.py: route agentic-replay into
  single_node['agentic-replay'].
- run-sweep.yml: dedicated sweep-single-node-agentic-replay job, wired into
  collect-results. benchmark-tmpl.yml: input-file / custom-dataset-type inputs.
- launch_h100-greennode.sh forwards the new env; new launcher
  qwen3.5-4b_bf16_h100_vllm.sh replays the trace once (request-count = record
  count, no isl/osl). benchmark_lib.sh: run_client_benchmark accepts
  --input-file/--custom-dataset-type/--request-count.
- nvidia-master.yaml: qwen3.5-4b-bf16-h100-vllm config (vLLM bf16 TP=1 conc=2).
- Smoke dataset committed; AIPERF_INTEGRATION.md documents the path and the
  mooncake_trace multi-turn limitation; +8 validation tests.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook. If they are not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you.
PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow
As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26867517453
…-replay in e2e dispatch
run_client_benchmark reads BENCHMARK_CLIENT from the env, but
launch_h100-greennode.sh never forwarded it into the docker container, so it
defaulted to inferencex_native: silently for fixed-seq aiperf runs, and
fatally for agentic-replay (which only supports aiperf). Add BENCHMARK_CLIENT
to the launcher passthrough.
Also wire agentic-replay into e2e-tests.yml (workflow_dispatch): a dedicated
agentic-replay bucket + test-sweep-agentic-replay job that forwards
input-file/custom-dataset-type/scenario-type, so a single config can be run in
isolation via dispatch (run-sweep.yml is not dispatchable).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Compute tok/W = total_token_throughput / mean total GPU power, derived from
the power.draw column that start_gpu_monitor already logs to gpu_metrics.csv
every run. No DCGM/pynvml required; works for both the native and aiperf
clients and all engines.
- process_result.py: mean_total_power_w() parses gpu_metrics.csv, sums only
  the N busiest GPUs (N=TP single-node / total_gpus multi-node) so idle cards
  on a shared host don't inflate power, and windows samples to the last
  `duration` seconds to exclude model-load/warmup. Emits tok_per_watt +
  mean_power_w (null when the CSV is absent).
- summarize.py: new "Token/Watt (tok/s/W)" and "Power Mean (W)" columns on the
  single-node and multi-node tables.
- aiperf_adapter.py: emit `duration` from AIPerf's benchmark_duration so the
  aiperf-client path windows power correctly.
- tests: busiest-GPU selection, warmup windowing, missing-CSV null, duration
  mapping present/absent.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
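The busiest-GPU and duration-window derivation described in this commit can be sketched as below. This is a minimal illustration, not the code in process_result.py: the column names (`timestamp`, `index`, `power.draw [W]`), the nvidia-smi style timestamp layout, and the choice to rank GPUs by mean power inside the window are all assumptions.

```python
import csv
from collections import defaultdict
from datetime import datetime, timedelta

TS_FORMAT = "%Y/%m/%d %H:%M:%S.%f"  # assumed nvidia-smi timestamp layout


def mean_total_power_w(csv_path, n_gpus, duration_s):
    """Mean summed power (W) of the n_gpus busiest GPUs, or None without data."""
    try:
        with open(csv_path, newline="") as f:
            rows = list(csv.DictReader(f, skipinitialspace=True))
    except FileNotFoundError:
        return None  # mirrors the "null when the CSV is absent" behaviour

    samples = []  # (timestamp, gpu index, watts)
    for row in rows:
        ts = datetime.strptime(row["timestamp"], TS_FORMAT)
        watts = float(row["power.draw [W]"].replace("W", "").strip())
        samples.append((ts, row["index"], watts))
    if not samples:
        return None

    # Keep only the last duration_s seconds to exclude model load and warmup.
    cutoff = max(ts for ts, _, _ in samples) - timedelta(seconds=duration_s)
    per_gpu = defaultdict(list)
    for ts, gpu, watts in samples:
        if ts >= cutoff:
            per_gpu[gpu].append(watts)

    # Rank GPUs by mean power in the window (assumed meaning of "busiest"),
    # then sum the top n_gpus so idle cards on a shared host are ignored.
    means = sorted((sum(v) / len(v) for v in per_gpu.values()), reverse=True)
    return sum(means[:n_gpus])
```

With a mean power in hand, the metric is the ratio stated above: `total_token_throughput / mean_total_power_w(...)`, left as null when the function returns `None`.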
Add an "Energy Efficiency (tokens/Watt)" section covering the metric
definition, the nvidia-smi power source (no DCGM required), the busiest-GPU +
duration-window derivation, and the new result/table fields. Cross-reference
gpu_metrics.csv in Artifacts.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2a5bb21 to
8d23f35
… run
Drop the hf:// fetch approach (the container's hf CLI produced a 0-record
subset → request-count<concurrency failure). Commit the traces directly:
benchmarks/single_node/agentic/datasets/agentic-coding-{64k,128k}.jsonl
(18,595 / 16,957 records). Committed as normal blobs — repo does not LFS-track
.jsonl.
The agentic-replay launcher now supports an optional "#N" suffix on the
input-file path to replay only the first N records (low-resource subset of a
large committed trace). Add a qwen3.5-4b-bf16-h100-vllm agentic-replay
scenario: 64k trace, first 2000 records, concurrency 32, max-model-len 40960
(covers the dataset's real max input+output of 38,613).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
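The optional "#N" suffix can be illustrated with a short Python sketch. The real launcher is a shell script, so this is not its implementation; the helper names `split_record_limit` and `first_records` are hypothetical, and treating a non-numeric or zero suffix as "no limit" is an assumption.

```python
def split_record_limit(input_file):
    """Split 'path#N' into (path, N); N is None when no valid suffix exists."""
    path, sep, suffix = input_file.rpartition("#")
    if sep and suffix.isascii() and suffix.isdigit() and int(suffix) > 0:
        return path, int(suffix)
    return input_file, None


def first_records(lines, limit):
    """Yield at most `limit` non-empty JSONL lines (all of them if limit is None)."""
    kept = 0
    for line in lines:
        if not line.strip():
            continue  # blank lines are not records
        if limit is not None and kept >= limit:
            break
        kept += 1
        yield line
```

Because the trace is replayed once, the request count is the number of records kept, so a subset that ends up smaller than the concurrency would reproduce the request-count<concurrency failure mentioned above.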
…l tok/W
Model swap (prefix-caching fix):
- Qwen/Qwen3.5-4B is a hybrid-Mamba model; vLLM V1 auto-disables prefix
caching for it (observed 0% hit rate on the agentic-replay trace). Replace
it with Qwen/Qwen3-4B-Instruct-2507, a dense Qwen3ForCausalLM (256K native
context, no rope-scaling) for which vLLM keeps prefix caching ON by default.
- Rename benchmarks/single_node/{qwen3.5-4b => qwen3-4b-2507}_bf16_h100_vllm.sh
(launcher resolves the script from model-prefix) and fix its served-model
default. Update the nvidia-master.yaml config block and docs.
aiperf adapter / summary metrics:
- aiperf_adapter.py now emits p50/p75/p90/p95/p99 (was mean+p99 only) for
ttft/tpot/itl/e2el via a helper; process_result.py's *_ms loop carries them
through. summarize.py renders mean+p50+p75+p90+p95+p99 for each latency.
- tok/W is now reported in two conventions: tok_per_watt_total (input+output,
the prior value, kept as the tok_per_watt alias) and tok_per_watt_output
(decoded tokens only). Both surfaced as summary columns and documented.
Tests: adapter fixture/assertions extended; 121 passed / 1 skipped.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
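The two tok/W conventions differ only in the numerator. A minimal sketch, assuming throughputs in tokens per second and a mean power in watts; the function name `tok_per_watt_fields` is hypothetical, while the three output keys follow the field names in this commit message:

```python
def tok_per_watt_fields(input_tps, output_tps, mean_power_w):
    """Return both tok/W conventions; every field is None without a power reading."""
    if not mean_power_w:
        return {
            "tok_per_watt_total": None,
            "tok_per_watt": None,
            "tok_per_watt_output": None,
        }
    total = (input_tps + output_tps) / mean_power_w  # input + output tokens
    return {
        "tok_per_watt_total": total,
        "tok_per_watt": total,  # alias kept for the prior value
        "tok_per_watt_output": output_tps / mean_power_w,  # decoded tokens only
    }
```

On a prefill-heavy agentic trace the two values can differ by a large factor, which is presumably why both are surfaced.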
…ions)
AIPerf replays mooncake_trace records as multi-turn sessions keyed on
session_id, so context accumulates across turns. The realized per-request
total reaches ~65,847 tokens, far above the per-record input_length max of
37,818 the value was originally sized from. At max-model-len 40960, 1100/2000
(55%) requests were rejected with HTTP 400 (input+output > window), silently
biasing the metrics to short early-turn requests. 73728 covers the realized
max with headroom; Qwen3-4B-2507 supports 256K so no model-side limit.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
mooncake_trace replays each session_id as a multi-turn conversation, so
context accumulates across turns; size max-model-len from the
session-cumulative max (64k: 66,655→73728; 128k: 133,851→~147456), not the
per-record length. Documents the silent-truncation trap (40960 rejected 55% of
the 64k run while the CI job stayed green) and corrects the earlier 'not
multi-turn' claim, which the run evidence disproves.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
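Sizing max-model-len from the session-cumulative max rather than the per-record length can be sketched as follows. Two things here are assumptions, not statements about AIPerf or the repo: the accumulation rule (each turn adds its own `input_length` and `output_length` to the session's context) and the sizing rule (10% headroom rounded up to a multiple of 1024). That sizing rule happens to reproduce both figures quoted above, but the commit does not say how they were chosen.

```python
import json
from collections import defaultdict


def session_cumulative_max(jsonl_lines):
    """Largest per-session token total in a mooncake_trace style JSONL."""
    totals = defaultdict(int)
    for i, line in enumerate(jsonl_lines):
        if not line.strip():
            continue
        rec = json.loads(line)
        # Records without a session_id are treated as single-turn sessions.
        key = rec.get("session_id", f"__record_{i}")
        # Assumed rule: every turn adds its input and output to the context.
        totals[key] += rec["input_length"] + rec.get("output_length", 0)
    return max(totals.values(), default=0)


def size_max_model_len(cumulative_max, headroom_pct=10, multiple=1024):
    """Add headroom, then round up to the next multiple (integer arithmetic)."""
    needed = -(-cumulative_max * (100 + headroom_pct) // 100)  # ceiling division
    return -(-needed // multiple) * multiple
```

For example, `size_max_model_len(66655)` gives 73728 and `size_max_model_len(133851)` gives 147456, whereas sizing from a single record's length would land well below what the replay actually sends.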
…TP=2, fp8)
Add an engine-vs-engine agentic-replay comparison on google/gemma-4-31B-it
over the committed 64k agentic-coding trace (SemiAnalysisAI#1000), 2x H100 /
TP=2 / fp8, both cache-ON.
- New configs gemma4-agentic-fp8-h100-2x-{vllm,sglang} under a distinct
  model-prefix `gemma4-agentic` so the launcher resolves dedicated
  trace-replay scripts instead of the fixed-seq-len gemma4_*.sh.
- New scripts gemma4-agentic_fp8_h100_{vllm,sglang}.sh: vLLM on-the-fly fp8
  over google/gemma-4-31B-it; SGLang pre-quantized
  RedHatAI/gemma-4-31B-it-FP8-dynamic (its on-the-fly fp8 crashes on the
  vision tower). Both fp8_e4m3 KV, prefix/radix cache default-ON, replayed
  once via AIPerf.
- context-length/max-model-len 73728 covers the session-cumulative max of the
  SemiAnalysisAI#1000 subset (AIPerf threads mooncake sessions; context
  accumulates across turns).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…he 64k pair)
The smoke pre-flight scenario doubled the matrix to 4 jobs; keep only the 64k
SemiAnalysisAI#1000 agentic-replay leg so the run is exactly the 2-engine
comparison (vLLM, SGLang). No script/serve changes.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…d trap; SGLang fa3
- AIPERF_INTEGRATION.md: add "Verifying a run is valid" (200/400 gate,
  cache-hit check, measured prefix/radix hit table correcting the ~95% claim),
  and "Engine-vs-engine comparison" (AIPerf engine-agnostic client, distinct
  model-prefix, dual-key dispatch + the feature/aiperf workflow-ref gotcha,
  and the backend-fairness trap: SGLang forces triton for Gemma multimodal ->
  ~3-4x slower).
- gemma4-agentic_fp8_h100_sglang.sh: pin --attention-backend fa3 (text-only
  trace has no image tokens, so the triton-only bidirectional path is
  unneeded) to get a fair comparison vs vLLM/FlashInfer.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
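A 200/400 gate of the kind named in the first bullet could look like the sketch below. It is an illustration only: the function name `check_run_valid`, the input (a flat list of per-request HTTP status codes), and the zero-tolerance default are assumptions, and how the status codes are extracted from a run's output is left out.

```python
from collections import Counter


def check_run_valid(status_codes, max_rejected_fraction=0.0):
    """Return (ok, message); ok is False when too many requests were not HTTP 200."""
    counts = Counter(status_codes)
    total = sum(counts.values())
    if total == 0:
        return False, "no requests recorded"
    rejected = total - counts[200]
    fraction = rejected / total
    ok = fraction <= max_rejected_fraction
    return ok, f"{rejected}/{total} requests not HTTP 200 ({fraction:.0%})"
```

Applied to a run with 1100 of 2000 requests rejected, the gate fails with a message reporting 55%, which is the kind of case that would otherwise leave the CI job green.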
Summary
Adds an `agentic-replay` scenario-type that replays a recorded `mooncake_trace` JSONL through official AIPerf (the `aiperf` client) and rides the standard `bmk_*` aggregation, fully off the retired `cquil11/aiperf` fork pipeline (per ADR-0001).
First config: `qwen3.5-4b-bf16-h100-vllm` (vLLM, bf16, TP=1, conc=2) against a committed 12-record smoke trace.
Wiring
- `single_node['agentic-replay']` bucket (`process_changelog.py`) → dedicated `sweep-single-node-agentic-replay` job (`run-sweep.yml`) → `benchmark-tmpl.yml` with new `input-file` / `custom-dataset-type` inputs.
- `scenario-type != 'agentic-coding'`, so results flow through `process_result.py` → `bmk_*` automatically.
- `qwen3.5-4b_bf16_h100_vllm.sh` replays the trace once (`--request-count` = record count, no `--isl`/`--osl`).
Known limitation
`mooncake_trace` is not multi-turn in AIPerf: the 12 records replay as flat requests; prefix-cache reuse comes from `hash_ids` block overlap, not FORK-mode threading.
Verification
`test-config`) paths → 1 entry, conc 2.
🤖 Generated with Claude Code