
Test against existing endpoint#85

Open
maryamtahhan wants to merge 14 commits into redhat-et:main from maryamtahhan:test-against-endpoint

Conversation


@maryamtahhan maryamtahhan commented Mar 30, 2026

PR Summary: External vLLM Endpoint Support for Concurrent Load Testing

⚠️ This PR depends on #81 - Please merge #81 first to avoid conflicts.

Overview

This PR adds support for running concurrent load tests against existing vLLM endpoints (cloud, Kubernetes, production deployments) instead of requiring Ansible to manage vLLM containers. This simplifies test orchestration, reduces test execution time, and better reflects production deployment patterns.

Key Features

1. Environment Variable Configuration

  • VLLM_ENDPOINT_MODE: Set to external to use an existing endpoint (default: managed)
  • VLLM_ENDPOINT_URL: Full URL to external vLLM instance

2. Automatic Model Detection

  • Auto-detects the model name from the /v1/models endpoint when test_model is not provided
  • Validates that a specified test_model matches the model the endpoint is serving
  • Works seamlessly with concurrent load test suite
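The auto-detection reads the OpenAI-compatible /v1/models response and takes the first model's id. A minimal sketch of that extraction, run here against a hard-coded sample response (the field names follow the OpenAI API shape; the model name is illustrative):

```shell
# Sample /v1/models response (OpenAI-compatible shape; model name illustrative)
response='{"object":"list","data":[{"id":"meta-llama/Llama-3.2-1B-Instruct","object":"model"}]}'

# Extract the first model id, mirroring the playbook's json.data[0].id lookup
model=$(printf '%s' "$response" | python3 -c 'import json,sys; print(json.load(sys.stdin)["data"][0]["id"])')
echo "$model"   # -> meta-llama/Llama-3.2-1B-Instruct
```

Against a live endpoint, the same response would come from `curl "$VLLM_ENDPOINT_URL/v1/models"`.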

3. Smart DUT Handling

  • Skips NUMA topology detection in external mode
  • Skips core allocation in external mode
  • Skips vLLM container management entirely
  • No SSH connection to DUT required

4. Enhanced Test Metadata

New fields in test-metadata.json:

  • vllm_mode: "managed" or "external"
  • vllm_endpoint_url: External endpoint URL
  • model_source: "auto-detected" or "specified"
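For example, an external-mode run might produce a fragment like this in test-metadata.json (field values illustrative):

```json
{
  "vllm_mode": "external",
  "vllm_endpoint_url": "http://your-vllm-host:8000",
  "model_source": "auto-detected"
}
```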

Usage

Quick Start

# Start vLLM on your endpoint
vllm serve meta-llama/Llama-3.2-1B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --dtype bfloat16 \
  --no-enable-prefix-caching

# Configure external mode
export VLLM_ENDPOINT_MODE=external
export VLLM_ENDPOINT_URL=http://your-vllm-host:8000

# Configure loadgen
export LOADGEN_HOSTNAME=<loadgen-host>
export ANSIBLE_SSH_USER=<username>
export ANSIBLE_SSH_KEY=<key-file>
export HF_TOKEN=$(cat ~/hf_token)

# Run tests (model auto-detected!)
cd automation/test-execution/ansible
ansible-playbook -i inventory/hosts.yml \
  llm-benchmark-concurrent-load.yml \
  -e "base_workload=chat" \
  -e "requested_cores=16"

Alternative: Configuration File

Edit inventory/group_vars/all/endpoints.yml:

vllm_endpoint:
  mode: "external"
  external:
    url: "http://your-vllm-instance:8000"
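The environment-variable override is typically wired in with an env lookup; a sketch of what endpoints.yml might contain (the exact variable plumbing in the repo may differ):

```yaml
vllm_endpoint:
  mode: "{{ lookup('env', 'VLLM_ENDPOINT_MODE') | default('managed', true) }}"
  external:
    url: "{{ lookup('env', 'VLLM_ENDPOINT_URL') | default('', true) }}"
```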

Benefits

  1. Simplified Orchestration: No container lifecycle management during tests
  2. Faster Execution: Single model load for all test variations (no restart between tests)
  3. Production Alignment: Test against real deployment patterns (cloud, K8s, bare metal)
  4. Reduced Infrastructure Requirements: No need for DUT SSH access when using external endpoints
  5. Better Separation of Concerns: Infrastructure management separate from performance testing

Changes Summary

Modified Files

  1. automation/test-execution/ansible/inventory/group_vars/all/endpoints.yml

    • Add env var support for VLLM_ENDPOINT_MODE and VLLM_ENDPOINT_URL
    • Document env var usage in comments
  2. automation/test-execution/ansible/llm-benchmark-auto.yml

    • Add vLLM mode detection and validation
    • Add model auto-detection from /v1/models endpoint
    • Skip DUT plays entirely in external mode (NUMA, core allocation, vLLM startup)
    • Add vllm_mode, vllm_endpoint_url, and model_source to test metadata
  3. automation/test-execution/ansible/llm-benchmark-concurrent-load.yml

    • Support optional test_model parameter in external mode
    • Pass vLLM mode to child playbooks
    • Display mode information in test suite configuration
  4. automation/test-execution/ansible/inventory/README.md

    • Add env vars to Quick Start section
    • Update Scenario 3 with env var approach
    • Add example showing auto-detection
  5. tests/concurrent-load/concurrent-load.md

    • Add "Deployment Modes" section explaining managed vs external
    • Add "Using External vLLM Endpoint" subsection with complete workflow
    • Show how to run tests without specifying test_model
    • Include phase-specific vLLM startup commands
  6. automation/test-execution/ansible/inventory/group_vars/all/infrastructure.yml

    • Add registry prefix to vLLM container image for Podman compatibility

Commits

  1. fix: add registry prefix to vLLM container image for podman

    • Fixes short-name resolution errors in non-interactive environments
  2. feat: add external vLLM endpoint support with auto-detection

    • Core functionality for external endpoint support
    • Model auto-detection from endpoint
    • Documentation updates
  3. fix: skip DUT connection entirely in external mode

    • Prevents Ansible from connecting to DUT in external mode
    • Uses meta: end_play to cleanly skip entire plays
  4. feat: add vLLM deployment mode to test metadata

    • Adds deployment metadata to results for analysis
    • Tracks model source (auto-detected vs specified)
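The end_play guard described in commit 3 can be sketched like this (a hypothetical play, not the exact one from the repo); placing it in pre_tasks with gather_facts disabled means Ansible never attempts an SSH connection to the DUT in external mode:

```yaml
- name: Detect NUMA topology (DUT)
  hosts: dut
  gather_facts: false
  pre_tasks:
    - name: Skip this play entirely in external mode
      ansible.builtin.meta: end_play
      when: vllm_mode == 'external'
```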

Testing

All changes tested with:

  • External vLLM endpoint on AWS EC2
  • Model auto-detection working correctly
  • No DUT connection required
  • Results correctly tagged with deployment mode

Validation Checklist

  • Ansible syntax validation passes
  • External mode skips DUT connection
  • Model auto-detection from /v1/models works
  • Test metadata includes vLLM mode
  • Documentation complete and accurate
  • Backward compatible with managed mode

Documentation

Complete documentation added:

  • Environment variable configuration in inventory README
  • External endpoint usage in concurrent load test suite documentation
  • vLLM startup commands for each test phase
  • Examples showing model auto-detection

Backward Compatibility

Fully backward compatible

  • Default mode remains managed
  • Existing playbooks and workflows unchanged
  • No breaking changes to any interfaces

Future Enhancements

Potential follow-ups (not in this PR):

  • Add vLLM configuration validation (check caching mode, dtype, etc.)
  • Support for authenticated external endpoints
  • Helper scripts for common vLLM deployment scenarios
  • AWX/AAP workflow templates for external mode

Related Issues

Addresses the need to test against existing vLLM deployments without managing infrastructure through Ansible.

Summary by CodeRabbit

  • New Features

    • Run benchmarks against external vLLM endpoints via environment variables (optional API key/token); model auto-detection and use of the resolved model for results and logs.
    • Dashboard: deployment-mode selector, mode-aware filtering and comparison (managed vs external) and endpoint-URL filtering; external runs surface client metrics and mode-specific views.
  • Documentation

    • Added “Deployment Modes” guidance, workflows, examples, and clarified client vs conditional server/system metrics with example env var usage.


coderabbitai bot commented Mar 30, 2026


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 48b45ada-cbbd-4eb5-b989-b3bf308a75f3

📥 Commits

Reviewing files that changed from the base of the PR and between 353a1f1 and 8fd1a91.

📒 Files selected for processing (3)
  • automation/test-execution/ansible/llm-benchmark-auto.yml
  • automation/test-execution/ansible/roles/vllm_metrics_collector/tasks/main.yml
  • automation/test-execution/ansible/roles/vllm_metrics_collector/tasks/stop.yml
📝 Walkthrough

Walkthrough

Adds external vLLM endpoint support: environment-driven endpoint configuration, playbook mode detection and optional endpoint model auto-detection via /v1/models, conditional /metrics probing for server metrics, dashboard filtering for deployment mode, and updated docs/tests for external-endpoint workflows.

Changes

Cohort / File(s) / Summary

  • Ansible Configuration: automation/test-execution/ansible/inventory/README.md, automation/test-execution/ansible/inventory/group_vars/all/endpoints.yml
    Docs and inventory now support environment-variable-driven endpoint settings (VLLM_ENDPOINT_MODE, VLLM_ENDPOINT_URL, optional VLLM_API_KEY_ENABLED/VLLM_API_KEY, HF_TOKEN).
  • Ansible Playbooks - Core: automation/test-execution/ansible/llm-benchmark-auto.yml, automation/test-execution/ansible/llm-benchmark-concurrent-load.yml
    Introduce vllm_mode (managed or external) detection, optional model auto-detection from the endpoint, and conditional skipping of DUT plays in external mode.
  • Ansible Roles: automation/test-execution/ansible/roles/benchmark_guidellm/tasks/main.yml
    Add resolved_model preferring actual_model and update results_path to use resolved_model (slashes replaced with __).
  • Dashboard - Metrics & Display: automation/test-execution/dashboard-examples/vllm_dashboard/Home.py, pages/1_📊_Client_Metrics.py, pages/2_🖥️_Server_Metrics.py
    Streamlit UI and loaders expose vllm_mode, vllm_endpoint_url, model_source; add External Endpoint quick-start; adapt filters, comparisons, and displayed columns to managed vs external modes; show informational notes when server metrics are unavailable.
  • Documentation & Tests: docs/dashboards-quickstart.md, docs/getting-started.md, docs/metrics-collection.md, tests/concurrent-load/concurrent-load.md
    Add docs/examples for external endpoint testing, env var guidance, workflows, metric availability semantics, and deployment-mode comparisons.
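The slash-to-underscore sanitization noted for resolved_model can be sketched in shell (model name illustrative):

```shell
# Replace every "/" in the model name with "__" so it is safe in a results path
resolved_model="meta-llama/Llama-3.2-1B-Instruct"
results_name="${resolved_model//\//__}"
echo "$results_name"   # -> meta-llama__Llama-3.2-1B-Instruct
```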

Sequence Diagram(s)

sequenceDiagram
    actor User
    participant Playbook as Ansible Playbook
    participant ExtEndpoint as External Endpoint
    participant ModelsAPI as /v1/models
    participant BenchTask as Benchmark Task

    User->>Playbook: Run with VLLM_ENDPOINT_MODE=external\nVLLM_ENDPOINT_URL=<endpoint>
    Playbook->>Playbook: Detect vllm_mode = 'external'
    Playbook->>Playbook: Validate endpoint URL & workload_type
    alt test_model not provided
        Playbook->>ExtEndpoint: GET /v1/models
        ExtEndpoint->>ModelsAPI: Query available models
        ModelsAPI-->>Playbook: Return model list
        Playbook->>Playbook: Auto-detect endpoint_model -> actual_model
    else test_model provided
        Playbook->>Playbook: Validate test_model matches endpoint model
    end
    Playbook->>ExtEndpoint: Probe /metrics (optional)
    Playbook->>BenchTask: Execute benchmark using actual_model
    BenchTask->>ExtEndpoint: Send load (no vLLM container management)
    ExtEndpoint-->>BenchTask: Return responses
    BenchTask-->>Playbook: Collect client metrics
    Playbook->>Playbook: Store results with vllm_mode=external

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
  • Description Check: ✅ Passed (check skipped; CodeRabbit's high-level summary is enabled)
  • Title Check: ✅ Passed. The title 'Test against existing endpoint' directly describes the main feature added: support for running tests against external/existing vLLM endpoints instead of managed containers, which is the primary objective of this PR.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
automation/test-execution/ansible/filter_plugins/cpu_utils.py (1)

599-606: ⚠️ Potential issue | 🟠 Major

Potential silent under-allocation when physical_cpus_list is shorter than cores_per_node.

If physical_cpus_list contains fewer CPUs than cores_per_node, the slice on line 602 silently returns a shorter list. The downstream validation in start-llm.yml (lines 75-127 in the context snippet) only checks format and range—not that the allocated CPU count matches requested_cores. This could result in containers running with fewer cores than requested without any error.

Consider adding a length check after slicing:

🛡️ Proposed fix to validate allocated CPU count
         # Take first N physical cores
         allocated_cpus = [int(cpu.strip()) for cpu in physical_cpus_list[:cores_per_node]]
+
+        if len(allocated_cpus) < cores_per_node:
+            raise AnsibleFilterError(
+                f"Node {node['id']} has only {len(allocated_cpus)} CPUs in list, "
+                f"but {cores_per_node} cores requested per node"
+            )

         # Convert to range format
         cpu_range = cpu_list_to_range(allocated_cpus)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@automation/test-execution/ansible/filter_plugins/cpu_utils.py` around lines
599 - 606, The slice of physical_cpus_list into allocated_cpus can silently
produce fewer CPUs than requested (cores_per_node); after computing
allocated_cpus (and before converting with cpu_list_to_range) add a validation
that len(allocated_cpus) == cores_per_node and if not, raise a clear exception
or return an error (including the values of physical_cpus_list and
cores_per_node) so callers (e.g., start-llm.yml) cannot proceed with
under-allocation; reference allocated_cpus, physical_cpus_list, cores_per_node,
and cpu_list_to_range when adding this check and error handling.
🧹 Nitpick comments (2)
automation/test-execution/ansible/tests/unit/test_cpu_utils.py (1)

21-32: Mock pytest.raises doesn't verify exception types when pytest is unavailable.

The fallback pytest.raises context manager doesn't actually catch or verify exceptions—it just returns False from __exit__. This means exception-testing methods won't run in the non-pytest fallback path (which is fine since __main__ only runs smoke tests), but if someone tries to run the full test class without pytest, tests will fail rather than skip.

This is acceptable for the current design since the non-pytest path explicitly runs only a subset of smoke tests.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@automation/test-execution/ansible/tests/unit/test_cpu_utils.py` around lines
21 - 32, The fallback pytest.raises context manager (class pytest -> class
raises) doesn't verify or suppress exceptions because __exit__ always returns
False; update raises.__exit__ to inspect the exception tuple passed in, check
whether the exception type matches self.exc, and return True to suppress the
exception when it matches (otherwise return False so it propagates). Ensure
__init__ still stores the expected exception (self.exc) and __enter__ returns
self; reference the raises class and its __exit__/__init__/__enter__ methods
when making the change.
automation/test-execution/ansible/llm-benchmark-auto.yml (1)

155-157: Redundant vllm_mode assignment.

vllm_mode is already set at lines 24-26. This second assignment at line 155-157 is identical and redundant.

♻️ Proposed fix to remove redundant assignment
-    - name: Set vLLM mode
-      ansible.builtin.set_fact:
-        vllm_mode: "{{ vllm_endpoint.mode | default('managed') }}"
-
     - name: Set fallback core configuration for external mode
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@automation/test-execution/ansible/llm-benchmark-auto.yml` around lines 155 -
157, Remove the redundant ansible task "Set vLLM mode" that sets the vllm_mode
fact (vllm_mode: "{{ vllm_endpoint.mode | default('managed') }}") because the
same fact is already defined earlier; locate the duplicate task by the task name
"Set vLLM mode" or the variable vllm_mode and delete the second occurrence
(lines around the repeated set_fact) so only the initial assignment remains,
ensuring no other tasks rely specifically on the later duplicate.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@automation/test-execution/ansible/llm-benchmark-auto.yml`:
- Around line 105-140: The external endpoint URL (vllm_endpoint.external.url)
must be validated before attempting to call the "Get model from external
endpoint" task; add an early guard (e.g., an ansible.builtin.assert or
ansible.builtin.fail task run on localhost when vllm_mode == 'external') that
checks vllm_endpoint.external.url is defined and not null/empty and returns a
clear error message if invalid, then proceed with the existing model-detection
block that sets endpoint_model and optionally sets/validates test_model.

In `@automation/test-execution/ansible/llm-benchmark-concurrent-load.yml`:
- Around line 63-64: The validation allows base_workload value 'short_codegen'
but no corresponding workload is defined, causing lookup failures; either add a
'short_codegen' entry to the test workload definitions or remove it from the
allowed list. Update test-workloads.yml (the test_configs mapping) to include a
'short_codegen' workload key with the same required fields/structure as other
workloads (e.g., name, params, steps) or delete 'short_codegen' from the
base_workload validation in llm-benchmark-concurrent-load.yml (and mirror the
change in start-llm.yml) so the validator and test_configs remain consistent.
Ensure the key is exactly 'short_codegen' to match the validation.

In `@automation/test-execution/ansible/roles/vllm_server/tasks/start-llm.yml`:
- Around line 55-59: The assertion allowing 'short_codegen' in the workload_type
validation is inconsistent with the available test_configs and causes a KeyError
when test_configs[workload_type] is accessed; either remove 'short_codegen' from
the validation list in the Validate workload type assertion (update the list in
the ansible task that references workload_type) or add a corresponding
'short_codegen' entry to test-workloads.yml under test_configs so that
test_configs contains a key for workload_type; update the same symbols
workload_type and test_configs accordingly to keep them in sync.

In `@collections/requirements.yml`:
- Around line 6-10: Replace the permissive version ranges for the Ansible
collections with exact pinned versions to ensure reproducible AWX/CI runs:
update the collections entry for containers.podman to pin version "1.9.4"
(instead of ">=1.9.0,<2.0.0") and update ansible.posix to pin version "1.5.4"
(instead of ">=1.4.0,<1.6.0"); ensure the values for the name keys
containers.podman and ansible.posix are changed accordingly so future runs use
the exact, stable releases.

---

Outside diff comments:
In `@automation/test-execution/ansible/filter_plugins/cpu_utils.py`:
- Around line 599-606: The slice of physical_cpus_list into allocated_cpus can
silently produce fewer CPUs than requested (cores_per_node); after computing
allocated_cpus (and before converting with cpu_list_to_range) add a validation
that len(allocated_cpus) == cores_per_node and if not, raise a clear exception
or return an error (including the values of physical_cpus_list and
cores_per_node) so callers (e.g., start-llm.yml) cannot proceed with
under-allocation; reference allocated_cpus, physical_cpus_list, cores_per_node,
and cpu_list_to_range when adding this check and error handling.

---

Nitpick comments:
In `@automation/test-execution/ansible/llm-benchmark-auto.yml`:
- Around line 155-157: Remove the redundant ansible task "Set vLLM mode" that
sets the vllm_mode fact (vllm_mode: "{{ vllm_endpoint.mode | default('managed')
}}") because the same fact is already defined earlier; locate the duplicate task
by the task name "Set vLLM mode" or the variable vllm_mode and delete the second
occurrence (lines around the repeated set_fact) so only the initial assignment
remains, ensuring no other tasks rely specifically on the later duplicate.

In `@automation/test-execution/ansible/tests/unit/test_cpu_utils.py`:
- Around line 21-32: The fallback pytest.raises context manager (class pytest ->
class raises) doesn't verify or suppress exceptions because __exit__ always
returns False; update raises.__exit__ to inspect the exception tuple passed in,
check whether the exception type matches self.exc, and return True to suppress
the exception when it matches (otherwise return False so it propagates). Ensure
__init__ still stores the expected exception (self.exc) and __enter__ returns
self; reference the raises class and its __exit__/__init__/__enter__ methods
when making the change.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c0be321b-7c7d-4135-ac0b-fad97257545a

📥 Commits

Reviewing files that changed from the base of the PR and between b4126fb and 5a0eba9.

📒 Files selected for processing (17)
  • automation/test-execution/ansible/ansible.md
  • automation/test-execution/ansible/filter_plugins/cpu_utils.py
  • automation/test-execution/ansible/inventory/README.md
  • automation/test-execution/ansible/inventory/group_vars/all/endpoints.yml
  • automation/test-execution/ansible/inventory/group_vars/all/infrastructure.yml
  • automation/test-execution/ansible/llm-benchmark-auto.yml
  • automation/test-execution/ansible/llm-benchmark-concurrent-load.yml
  • automation/test-execution/ansible/llm-benchmark.yml
  • automation/test-execution/ansible/llm-core-sweep-auto.yml
  • automation/test-execution/ansible/roles/benchmark_guidellm/tasks/main.yml
  • automation/test-execution/ansible/roles/common/tasks/allocate-cores-from-count.yml
  • automation/test-execution/ansible/roles/common/tasks/detect-numa-topology.yml
  • automation/test-execution/ansible/roles/vllm_server/tasks/start-embedding.yml
  • automation/test-execution/ansible/roles/vllm_server/tasks/start-llm.yml
  • automation/test-execution/ansible/tests/unit/test_cpu_utils.py
  • collections/requirements.yml
  • tests/concurrent-load/concurrent-load.md

Comment on lines +105 to +140
- name: Handle external endpoint model detection/validation
  when: vllm_mode == 'external'
  block:
    - name: Get model from external endpoint
      ansible.builtin.uri:
        url: "{{ vllm_endpoint.external.url }}/v1/models"
        method: GET
        status_code: 200
      register: external_models_response
      delegate_to: localhost

    - name: Extract model name from endpoint
      ansible.builtin.set_fact:
        endpoint_model: "{{ external_models_response.json.data[0].id }}"

    - name: Use endpoint model if test_model not provided or set to auto-detect
      ansible.builtin.set_fact:
        test_model: "{{ endpoint_model }}"
      when: test_model is not defined or test_model == '__AUTO_DETECT__'

    - name: Validate test_model matches endpoint if provided
      ansible.builtin.assert:
        that:
          - test_model == endpoint_model
        fail_msg: |
          Model mismatch detected!
          Specified: {{ test_model }}
          Endpoint serving: {{ endpoint_model }}

          Either omit test_model to auto-detect, or ensure it matches the running model.
      when: test_model is defined and test_model != '__AUTO_DETECT__' and test_model != endpoint_model

    - name: Display model source
      ansible.builtin.debug:
        msg: "Detected model from endpoint: {{ test_model }}"


⚠️ Potential issue | 🟠 Major

External endpoint URL should be validated before use.

The code queries the external endpoint at line 110 using vllm_endpoint.external.url before the URL is validated. The validation happens later at lines 253-259 on the load_generator host.

Per the context snippet from endpoints.yml, the URL defaults to Python None when not configured. This means if a user sets mode=external but forgets to set VLLM_ENDPOINT_URL, line 110 will attempt to connect to None/v1/models, resulting in a confusing connection error rather than a clear validation message.

Consider adding URL validation before the model detection block:

🛡️ Proposed fix to add early URL validation
     - name: Handle external endpoint model detection/validation
       when: vllm_mode == 'external'
       block:
+        - name: Validate external endpoint URL is configured
+          ansible.builtin.assert:
+            that:
+              - vllm_endpoint.external.url is defined
+              - vllm_endpoint.external.url is not none
+              - vllm_endpoint.external.url | length > 0
+            fail_msg: |
+              External endpoint URL is required when mode=external.
+              Set via: export VLLM_ENDPOINT_URL=http://your-vllm-instance:8000
+
         - name: Get model from external endpoint
           ansible.builtin.uri:
             url: "{{ vllm_endpoint.external.url }}/v1/models"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change (the same block with the early URL validation task added):

- name: Handle external endpoint model detection/validation
  when: vllm_mode == 'external'
  block:
    - name: Validate external endpoint URL is configured
      ansible.builtin.assert:
        that:
          - vllm_endpoint.external.url is defined
          - vllm_endpoint.external.url is not none
          - vllm_endpoint.external.url | length > 0
        fail_msg: |
          External endpoint URL is required when mode=external.
          Set via: export VLLM_ENDPOINT_URL=http://your-vllm-instance:8000

    - name: Get model from external endpoint
      ansible.builtin.uri:
        url: "{{ vllm_endpoint.external.url }}/v1/models"
        method: GET
        status_code: 200
      register: external_models_response
      delegate_to: localhost

    - name: Extract model name from endpoint
      ansible.builtin.set_fact:
        endpoint_model: "{{ external_models_response.json.data[0].id }}"

    - name: Use endpoint model if test_model not provided or set to auto-detect
      ansible.builtin.set_fact:
        test_model: "{{ endpoint_model }}"
      when: test_model is not defined or test_model == '__AUTO_DETECT__'

    - name: Validate test_model matches endpoint if provided
      ansible.builtin.assert:
        that:
          - test_model == endpoint_model
        fail_msg: |
          Model mismatch detected!
          Specified: {{ test_model }}
          Endpoint serving: {{ endpoint_model }}

          Either omit test_model to auto-detect, or ensure it matches the running model.
      when: test_model is defined and test_model != '__AUTO_DETECT__' and test_model != endpoint_model

    - name: Display model source
      ansible.builtin.debug:
        msg: "Detected model from endpoint: {{ test_model }}"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@automation/test-execution/ansible/llm-benchmark-auto.yml` around lines 105 -
140, The external endpoint URL (vllm_endpoint.external.url) must be validated
before attempting to call the "Get model from external endpoint" task; add an
early guard (e.g., an ansible.builtin.assert or ansible.builtin.fail task run on
localhost when vllm_mode == 'external') that checks vllm_endpoint.external.url
is defined and not null/empty and returns a clear error message if invalid, then
proceed with the existing model-detection block that sets endpoint_model and
optionally sets/validates test_model.

maryamtahhan added a commit to maryamtahhan/vllm-cpu-perf-eval that referenced this pull request Mar 30, 2026
Changes based on PR redhat-et#85 feedback:

1. llm-benchmark-auto.yml:
   - Add validation for vllm_endpoint.external.url before API calls
   - Remove duplicate 'Set vLLM mode' task
   - Pass test_model explicitly to child plays to ensure proper propagation

2. llm-benchmark-concurrent-load.yml:
   - Remove 'short_codegen' from allowed workload types (not defined in test_configs)

Note: Other review comments (cpu_utils.py, test_cpu_utils.py, start-llm.yml,
collections/requirements.yml) apply to ansible-playbook-enhancements branch
and will be addressed separately.

Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
@maryamtahhan maryamtahhan force-pushed the test-against-endpoint branch from d83af3d to faf80d7 Compare March 30, 2026 14:43
@maryamtahhan maryamtahhan marked this pull request as ready for review March 31, 2026 10:06
@maryamtahhan

rebasing now

maryamtahhan and others added 8 commits April 3, 2026 14:10
- Add environment variable support for external endpoints
  - VLLM_ENDPOINT_MODE: set to 'external' to use existing endpoint
  - VLLM_ENDPOINT_URL: URL of external vLLM instance
- Auto-detect model name from /v1/models endpoint when test_model not provided
- Skip NUMA topology detection and vLLM container management in external mode
- Update documentation with external endpoint usage examples
- Make test_model optional in concurrent load tests (auto-detected in external mode)

This enables testing against cloud/K8s/production vLLM deployments without
managing containers, simplifying test orchestration and better reflecting
production deployment patterns.

Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Add pre_tasks with meta: end_play to NUMA detection, core allocation,
and vLLM startup plays to prevent Ansible from trying to connect to DUT
when using external endpoint mode.

This fixes the issue where Ansible would fail trying to gather facts
from DUT even when all tasks were properly skipped.

Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Add the following fields to test-metadata.json:
- vllm_mode: 'managed' or 'external'
- vllm_endpoint_url: External endpoint URL (or 'n/a' if managed)
- model_source: 'auto-detected' or 'specified'

This allows analysis tools to identify which tests were run against
external endpoints and whether the model was auto-detected from the
endpoint or explicitly specified.

Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
In external mode, while DUT connection is not needed (vLLM runs externally),
the load generator connection is still required to run GuideLLM benchmarks.

Updates:
- Add load generator env vars to external endpoint examples
- Clarify that only load generator access is needed (not DUT)
- Use generic hostnames/IPs in documentation examples

Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Enable API key authentication via environment variables for secured external
vLLM endpoints:
- VLLM_API_KEY_ENABLED: set to 'true' to enable API key auth
- VLLM_API_KEY: the API key value

This makes it easier to test against secured production vLLM deployments
without editing configuration files. The existing API key infrastructure
already supported multiple sources (env, file, vault, prompt), but the
enabled flag was hardcoded to false.

Updates documentation to show API key usage in all external endpoint examples.

Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Issues fixed:
1. Collect Logs and Cleanup plays were still trying to connect to DUT in
   external mode - added gather_facts: false and meta: end_play guards
2. Auto-detected model name was not propagating to results paths - added
   cacheable: true to set_fact so the detected test_model persists across
   plays and shows up in results paths instead of '__AUTO_DETECT__'

These fixes ensure that in external mode:
- No DUT connection attempts occur after benchmarking completes
- Results are stored with the actual detected model name, not the sentinel value

Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Changes based on PR redhat-et#85 feedback:

1. llm-benchmark-auto.yml:
   - Add validation for vllm_endpoint.external.url before API calls
   - Remove duplicate 'Set vLLM mode' task
   - Pass test_model explicitly to child plays to ensure proper propagation

2. llm-benchmark-concurrent-load.yml:
   - Remove 'short_codegen' from allowed workload types (not defined in test_configs)

Note: Other review comments (cpu_utils.py, test_cpu_utils.py, start-llm.yml,
collections/requirements.yml) apply to ansible-playbook-enhancements branch
and will be addressed separately.

Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
When test_model is passed as '__AUTO_DETECT__' via import_playbook,
it has extra vars precedence which cannot be overridden. Introduce
actual_model variable to properly resolve the model name from external
endpoints and use it in all result paths and metadata files.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
@maryamtahhan force-pushed the test-against-endpoint branch from bee4a5b to ab07a91 on April 3, 2026 14:31

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🧹 Nitpick comments (2)
automation/test-execution/ansible/llm-benchmark-concurrent-load.yml (1)

59-64: Duplicate base_workload validation blocks.

The same assertion for base_workload appears twice (lines 59-64 and 99-104). This is redundant and should be consolidated.

♻️ Remove the duplicate validation
     - name: Validate base_workload parameter
       ansible.builtin.assert:
         that:
           - base_workload is defined
           - base_workload in ['chat', 'rag', 'code', 'summarization']
         fail_msg: "base_workload must be one of: chat, rag, code, summarization"

-    - name: Validate required parameters
-      ansible.builtin.assert:
-        that:
           ...

-    - name: Validate base_workload parameter
-      ansible.builtin.assert:
-        that:
-          - base_workload is defined
-          - base_workload in ['chat', 'rag', 'code', 'summarization']
-        fail_msg: "base_workload must be one of: chat, rag, code, summarization"

Also applies to: 99-104

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@automation/test-execution/ansible/llm-benchmark-concurrent-load.yml` around
lines 59 - 64, Remove the duplicate "Validate base_workload parameter" task so
the assertion appears only once; locate the two tasks named "Validate
base_workload parameter" (one at lines around the first occurrence and the other
near the second) and delete the redundant block, keeping a single assertion that
checks base_workload is defined and in ['chat','rag','code','summarization'] and
preserves the fail_msg text; ensure no other tasks rely on the deleted
duplicate.
automation/test-execution/ansible/llm-benchmark-auto.yml (1)

431-431: Inconsistent variable usage: test_model vs actual_model in metrics paths.

Lines 431 and 458 use test_model for the results path while other result-related paths (lines 511, 517, 525-526, etc.) use actual_model. While test_model gets set in external auto-detect mode, using actual_model consistently would be clearer and more maintainable.

♻️ Use actual_model consistently
         - name: Start metrics collector in background
           ansible.builtin.include_role:
             name: vllm_metrics_collector
           vars:
             vllm_url: "{{ hostvars['localhost']['vllm_metrics_url'] }}"
-            results_path: "{{ playbook_dir }}/../../../results/llm/{{ test_model | replace('/', '__') }}/{{ workload_type }}-{{ hostvars['localhost']['test_run_id'] }}/{{ core_configuration.name }}"
+            results_path: "{{ playbook_dir }}/../../../results/llm/{{ hostvars['localhost']['actual_model'] | replace('/', '__') }}/{{ workload_type }}-{{ hostvars['localhost']['test_run_id'] }}/{{ core_configuration.name }}"

Apply similar change to line 458.

Also applies to: 458-458

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@automation/test-execution/ansible/llm-benchmark-auto.yml` at line 431, The
results_path currently uses the variable test_model which is inconsistent with
other metrics paths that use actual_model; update the results_path entries (the
YAML key results_path where it references {{ test_model | replace('/', '__') }})
to use {{ actual_model | replace('/', '__') }} instead, and make the same
replacement for the second occurrence noted (the other results_path at the same
block), ensuring all result-related paths uniformly use actual_model.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@automation/test-execution/ansible/llm-benchmark-auto.yml`:
- Around line 129-131: The task that sets endpoint_model from
external_models_response.json.data[0].id can crash if data is empty; add a guard
before or instead of the set_fact to validate external_models_response.json.data
has length > 0 and emit a clear error. Use ansible.builtin.assert or
ansible.builtin.fail to check "external_models_response.json.data | length > 0"
(referencing the task name "Extract model name from endpoint" and the variable
external_models_response.json.data) and only then set endpoint_model from
data[0].id; if the check fails, fail with a descriptive message like "No models
returned from external endpoint".

In `@automation/test-execution/dashboard-examples/vllm_dashboard/Home.py`:
- Around line 156-157: The note string in Home.py that currently states
"External endpoint runs show client metrics only (no server-side metrics)" is
inaccurate; update that docstring/text block (the triple-quoted note near the
end of the Home.py dashboard text) to a conditional statement that clarifies
external runs normally show only client metrics but will include server-side
metrics if the endpoint exposes a reachable /metrics path (i.e., change the
absolute assertion to a conditional phrasing mentioning the /metrics exception).

In `@automation/test-execution/dashboard-examples/vllm_dashboard/pages/2_🖥️_Server_Metrics.py`:
- Around line 98-100: The loop currently unconditionally skips runs where
metadata.get('vllm_mode') == 'external', which hides valid runs that expose
/metrics (vllm-metrics.json); change the checks (the metadata.get('vllm_mode')
== 'external' conditions around the Server Metrics page) to only skip when the
run is external AND there are no server metrics present—i.e., check for the
existence of the vllm-metrics.json (or equivalent server metrics payload) for
that run before continuing; apply the same change to the other occurrences
referenced (the checks around lines with metadata.get('vllm_mode') at the two
other blocks) so external runs with metrics are included.

In `@tests/concurrent-load/concurrent-load.md`:
- Around line 388-455: The Phase 3 run steps in the "Using External vLLM
Endpoint" section contradict the document's earlier "Phase 3 not supported"
statement; update the section that shows Phase 3 commands (the block that
restarts vLLM with caching and runs Phase 3) to align with the doc: either
remove that Phase 3 example entirely or replace it with an explicit warning that
Phase 3 is unsupported in this mode and show the safe alternative (e.g.,
instruct users to skip_phase_3=true or only run Phases 1&2), and clarify the
meaning of VLLM_ENDPOINT_MODE and VLLM_API_KEY variables so the external-mode
workflow cannot be misinterpreted as supporting Phase 3.

---

Nitpick comments:
In `@automation/test-execution/ansible/llm-benchmark-auto.yml`:
- Line 431: The results_path currently uses the variable test_model which is
inconsistent with other metrics paths that use actual_model; update the
results_path entries (the YAML key results_path where it references {{
test_model | replace('/', '__') }}) to use {{ actual_model | replace('/', '__')
}} instead, and make the same replacement for the second occurrence noted (the
other results_path at the same block), ensuring all result-related paths
uniformly use actual_model.

In `@automation/test-execution/ansible/llm-benchmark-concurrent-load.yml`:
- Around line 59-64: Remove the duplicate "Validate base_workload parameter"
task so the assertion appears only once; locate the two tasks named "Validate
base_workload parameter" (one at lines around the first occurrence and the other
near the second) and delete the redundant block, keeping a single assertion that
checks base_workload is defined and in ['chat','rag','code','summarization'] and
preserves the fail_msg text; ensure no other tasks rely on the deleted
duplicate.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a5845cb1-f3ea-4f12-9f57-3b05babb46e4

📥 Commits

Reviewing files that changed from the base of the PR and between c991528 and ab07a91.

📒 Files selected for processing (12)
  • automation/test-execution/ansible/inventory/README.md
  • automation/test-execution/ansible/inventory/group_vars/all/endpoints.yml
  • automation/test-execution/ansible/llm-benchmark-auto.yml
  • automation/test-execution/ansible/llm-benchmark-concurrent-load.yml
  • automation/test-execution/ansible/roles/benchmark_guidellm/tasks/main.yml
  • automation/test-execution/dashboard-examples/vllm_dashboard/Home.py
  • automation/test-execution/dashboard-examples/vllm_dashboard/pages/1_📊_Client_Metrics.py
  • automation/test-execution/dashboard-examples/vllm_dashboard/pages/2_🖥️_Server_Metrics.py
  • docs/dashboards-quickstart.md
  • docs/getting-started.md
  • docs/metrics-collection.md
  • tests/concurrent-load/concurrent-load.md

Comment on lines +129 to +131
- name: Extract model name from endpoint
ansible.builtin.set_fact:
endpoint_model: "{{ external_models_response.json.data[0].id }}"

⚠️ Potential issue | 🟡 Minor

No validation that external endpoint returns at least one model.

If the endpoint returns an empty data array, accessing data[0].id will fail with an unclear index error. Add a guard to provide a clearer error message.

🛡️ Proposed fix to validate models array
+        - name: Validate endpoint returned at least one model
+          ansible.builtin.assert:
+            that:
+              - external_models_response.json.data | length > 0
+            fail_msg: |
+              External endpoint returned no models.
+              URL: {{ vllm_endpoint.external.url }}/v1/models
+              Ensure the vLLM server has a model loaded.
+
         - name: Extract model name from endpoint
           ansible.builtin.set_fact:
             endpoint_model: "{{ external_models_response.json.data[0].id }}"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@automation/test-execution/ansible/llm-benchmark-auto.yml` around lines 129 -
131, The task that sets endpoint_model from
external_models_response.json.data[0].id can crash if data is empty; add a guard
before or instead of the set_fact to validate external_models_response.json.data
has length > 0 and emit a clear error. Use ansible.builtin.assert or
ansible.builtin.fail to check "external_models_response.json.data | length > 0"
(referencing the task name "Extract model name from endpoint" and the variable
external_models_response.json.data) and only then set endpoint_model from
data[0].id; if the check fails, fail with a descriptive message like "No models
returned from external endpoint".
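Outside Ansible, the intended guard reduces to a few lines. A minimal Python sketch, assuming the OpenAI-style `/v1/models` response shape that vLLM returns (`pick_model` is a hypothetical helper, not part of the playbook):

```python
def pick_model(models_response: dict) -> str:
    """Return the first model id from a /v1/models-style payload.

    Raises ValueError with a clear message when the endpoint reports
    no loaded models, instead of an opaque IndexError from data[0].
    """
    data = models_response.get("data") or []
    if len(data) == 0:
        raise ValueError("No models returned from external endpoint")
    return data[0]["id"]
```

The Ansible assert above does the same thing declaratively: validate `data | length > 0` before ever indexing into it.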

Comment on lines +156 to 157
**Note**: External endpoint runs show client metrics only (no server-side metrics).
""")

⚠️ Potential issue | 🟠 Major

External-mode note is inaccurate for endpoints that expose /metrics.

The current text says external runs have no server-side metrics, but external mode can include server metrics when /metrics is reachable. Please update this to a conditional statement.

Suggested wording
-    **Note**: External endpoint runs show client metrics only (no server-side metrics).
+    **Note**: External endpoint runs always show client metrics. Server-side metrics are
+    available when the external endpoint exposes `/metrics`.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
**Note**: External endpoint runs show client metrics only (no server-side metrics).
""")
**Note**: External endpoint runs always show client metrics. Server-side metrics are
available when the external endpoint exposes `/metrics`.
""")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@automation/test-execution/dashboard-examples/vllm_dashboard/Home.py` around
lines 156 - 157, The note string in Home.py that currently states "External
endpoint runs show client metrics only (no server-side metrics)" is inaccurate;
update that docstring/text block (the triple-quoted note near the end of the
Home.py dashboard text) to a conditional statement that clarifies external runs
normally show only client metrics but will include server-side metrics if the
endpoint exposes a reachable /metrics path (i.e., change the absolute assertion
to a conditional phrasing mentioning the /metrics exception).

Comment on lines +388 to +455
#### Using External vLLM Endpoint

To test against an existing vLLM instance:

**1. Start your vLLM instance:**

```bash
# For Phases 1 & 2 (baseline/realistic - no caching)
vllm serve meta-llama/Llama-3.2-1B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--dtype bfloat16 \
--max-model-len 8192 \
--no-enable-prefix-caching

# For Phase 3 (production - with caching)
vllm serve meta-llama/Llama-3.2-1B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--dtype bfloat16 \
--max-model-len 8192 \
--enable-prefix-caching
```

**2. Configure load generator and external endpoint:**

```bash
# Load generator connection (required - GuideLLM runs here)
export LOADGEN_HOSTNAME=192.168.1.200 # or your load generator hostname/IP
export ANSIBLE_SSH_USER=ec2-user
export ANSIBLE_SSH_KEY=~/.ssh/my-key.pem

# External vLLM endpoint (replaces DUT configuration)
export VLLM_ENDPOINT_MODE=external
export VLLM_ENDPOINT_URL=http://your-vllm-host:8000

# Optional: API key for secured vLLM endpoints
export VLLM_API_KEY_ENABLED=true
export VLLM_API_KEY=your-api-key

# Optional: HuggingFace token (if needed by GuideLLM)
export HF_TOKEN=hf_xxxxx
```

**Note:** In external mode, you only need to configure the load generator.
DUT connection is not required since vLLM is already running externally.

**3. Run tests (model name auto-detected from endpoint):**

```bash
cd ../../automation/test-execution/ansible

# Run Phases 1 & 2 (against baseline vLLM instance)
# Note: test_model is optional - will be queried from endpoint
ansible-playbook -i inventory/hosts.yml \
llm-benchmark-concurrent-load.yml \
-e "base_workload=chat" \
-e "requested_cores=16" \
-e "skip_phase_3=true"

# Restart vLLM with caching enabled, then run Phase 3
ansible-playbook -i inventory/hosts.yml \
llm-benchmark-concurrent-load.yml \
-e "base_workload=chat" \
-e "requested_cores=16" \
-e "skip_phase_1=true" \
-e "skip_phase_2=true"
```

⚠️ Potential issue | 🟠 Major

Phase 3 external instructions conflict with this file’s “Phase 3 not supported” status.

This section provides runnable Phase 3 steps, but the same document declares Phase 3 unsupported (Line 123). Please align these instructions to avoid sending users into a known unsupported path.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/concurrent-load/concurrent-load.md` around lines 388 - 455, The Phase 3
run steps in the "Using External vLLM Endpoint" section contradict the
document's earlier "Phase 3 not supported" statement; update the section that
shows Phase 3 commands (the block that restarts vLLM with caching and runs Phase
3) to align with the doc: either remove that Phase 3 example entirely or replace
it with an explicit warning that Phase 3 is unsupported in this mode and show
the safe alternative (e.g., instruct users to skip_phase_3=true or only run
Phases 1&2), and clarify the meaning of VLLM_ENDPOINT_MODE and VLLM_API_KEY
variables so the external-mode workflow cannot be misinterpreted as supporting
Phase 3.

…nal endpoints

Enhances external endpoint testing with intelligent metrics collection and
comprehensive dashboard support.

**Metrics Collection:**
- Auto-detect if external endpoint exposes /metrics endpoint
- Enable server metrics collection when available (HTTP 200)
- Gracefully skip when not available (HTTP 404/403)
- Display clear status messages during test execution

**Dashboard Updates:**
- Add vLLM Mode filter to Client Metrics (managed/external)
- Skip external runs in Server Metrics (no data available)
- Display helpful info messages for external endpoint runs
- Update Home page with external mode examples

**Documentation:**
- Add external endpoint testing to getting-started.md
- Document metrics availability in dashboards-quickstart.md
- Add comprehensive external metrics guide in metrics-collection.md
- Include troubleshooting and best practices

This allows users to benchmark production/cloud/K8s vLLM deployments
while automatically collecting server metrics when endpoints expose them.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
@maryamtahhan force-pushed the test-against-endpoint branch from ab07a91 to f0e2dbc on April 3, 2026 14:54

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
automation/test-execution/dashboard-examples/vllm_dashboard/pages/2_🖥️_Server_Metrics.py (1)

98-100: ⚠️ Potential issue | 🟠 Major

Don't discard external runs that already have a metrics file.

This loader is already walking vllm-metrics.json, so the vllm_mode == 'external' guard hides valid external runs that expose /metrics. The empty-state note and skipped-run counter below then reinforce the same false assumption, and the bare except: pass can silently undercount malformed metadata.

Possible fix
-                        # Skip external endpoint runs (no server metrics available)
-                        if metadata.get('vllm_mode') == 'external':
-                            continue
-
                         data['platform'] = metadata.get('platform', 'unknown')
                         data['model'] = metadata.get('model', 'unknown')
                         data['workload'] = metadata.get('workload', 'unknown')
@@
-    st.info("**Note**: External endpoint runs do not have server metrics (only client-side metrics available)")
+    st.info(
+        "No server metrics were found. External runs only appear here when "
+        "`vllm-metrics.json` was collected."
+    )
@@
-# Check if any external runs were skipped
+# Check if any external runs are missing a server-metrics file
 results_path = Path(results_dir)
-external_count = 0
+external_without_metrics_count = 0
 if results_path.exists():
     for metadata_file in results_path.rglob("test-metadata.json"):
         try:
             with open(metadata_file) as f:
                 metadata = json.load(f)
                 if metadata.get('vllm_mode') == 'external':
-                    external_count += 1
-        except:
-            pass
+                    if not (metadata_file.parent / "vllm-metrics.json").exists():
+                        external_without_metrics_count += 1
+        except Exception as e:
+            st.sidebar.warning(f"Failed to read {metadata_file.name}: {e}")
 
-if external_count > 0:
-    st.info(f"ℹ️ Skipped {external_count} external endpoint runs (no server metrics available). View them in the Client Metrics dashboard.")
+if external_without_metrics_count > 0:
+    st.info(
+        f"Skipped {external_without_metrics_count} external endpoint runs "
+        "without server metrics. View them in the Client Metrics dashboard."
+    )

Also applies to: 128-129, 133-147

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@automation/test-execution/dashboard-examples/vllm_dashboard/pages/2_🖥️_Server_Metrics.py`
around lines 98 - 100, The code currently unconditionally skips runs where
metadata.get('vllm_mode') == 'external', which hides valid external runs that
include a vllm-metrics.json; remove or change this guard so external runs are
only skipped if no metrics file is present (i.e., check for the
existence/parseability of the vllm-metrics.json before continuing), and replace
the bare except: pass around the metadata/metrics parsing with specific
exception handling (e.g., FileNotFoundError, JSONDecodeError, KeyError) so
malformed metadata/metrics are counted/logged correctly; ensure any skipped-run
counter and the empty-state note are updated only when no metrics file is found
rather than for all external runs.
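The intended loader behavior can be sketched in plain Python (the function name and return shape are hypothetical; the file names match the `test-metadata.json` / `vllm-metrics.json` layout used by this repo):

```python
import json
from pathlib import Path


def runs_with_server_metrics(results_dir: str):
    """Collect runs that should appear on the Server Metrics page.

    External runs are kept whenever a vllm-metrics.json sits next to
    their test-metadata.json; they are skipped only when it is absent.
    Malformed metadata is ignored with specific exceptions, not a
    bare `except: pass`.
    """
    kept = []
    skipped_external = 0
    for metadata_file in Path(results_dir).rglob("test-metadata.json"):
        try:
            metadata = json.loads(metadata_file.read_text())
        except (OSError, json.JSONDecodeError):
            continue  # unreadable or malformed metadata: skip this run
        metrics_path = metadata_file.parent / "vllm-metrics.json"
        if metadata.get("vllm_mode") == "external" and not metrics_path.exists():
            skipped_external += 1  # only count externals with no server metrics
            continue
        kept.append((metadata, metrics_path))
    return kept, skipped_external
```

The skipped-run counter then reflects only external runs that genuinely lack server metrics, rather than every external run.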
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@automation/test-execution/dashboard-examples/vllm_dashboard/pages/1_📊_Client_Metrics.py`:
- Around line 109-111: The efficiency calculation currently divides
throughput_mean by cores and can produce inf/NaN when core_count is missing;
update the computation that sets df["efficiency"] (and any place computing
per-core metrics) to coerce cores to numeric (e.g., pd.to_numeric(df["cores"],
errors="coerce")) and compute efficiency only when cores > 0 (e.g., use a
conditional like np.where(cores > 0, df["throughput_mean"] / cores, np.nan)).
After that, gate all per-core views and cards (the efficiency plot, comparison
card and data table) by checking filtered_df["efficiency"].notna().any() or hide
them when test_mode == "external" so external runs with no core_count do not
render per-core metrics.

---

Duplicate comments:
In `@automation/test-execution/dashboard-examples/vllm_dashboard/pages/2_🖥️_Server_Metrics.py`:
- Around line 98-100: The code currently unconditionally skips runs where
metadata.get('vllm_mode') == 'external', which hides valid external runs that
include a vllm-metrics.json; remove or change this guard so external runs are
only skipped if no metrics file is present (i.e., check for the
existence/parseability of the vllm-metrics.json before continuing), and replace
the bare except: pass around the metadata/metrics parsing with specific
exception handling (e.g., FileNotFoundError, JSONDecodeError, KeyError) so
malformed metadata/metrics are counted/logged correctly; ensure any skipped-run
counter and the empty-state note are updated only when no metrics file is found
rather than for all external runs.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a5c68ac8-2265-4b2a-996a-2e93d5edd990

📥 Commits

Reviewing files that changed from the base of the PR and between ab07a91 and f0e2dbc.

📒 Files selected for processing (7)
  • automation/test-execution/ansible/llm-benchmark-auto.yml
  • automation/test-execution/dashboard-examples/vllm_dashboard/Home.py
  • automation/test-execution/dashboard-examples/vllm_dashboard/pages/1_📊_Client_Metrics.py
  • automation/test-execution/dashboard-examples/vllm_dashboard/pages/2_🖥️_Server_Metrics.py
  • docs/dashboards-quickstart.md
  • docs/getting-started.md
  • docs/metrics-collection.md
✅ Files skipped from review due to trivial changes (1)
  • docs/dashboards-quickstart.md
🚧 Files skipped from review as they are similar to previous changes (4)
  • docs/getting-started.md
  • automation/test-execution/dashboard-examples/vllm_dashboard/Home.py
  • docs/metrics-collection.md
  • automation/test-execution/ansible/llm-benchmark-auto.yml

…dology

Adopt sophisticated performance comparison methodology from the GPU-focused
performance-dashboard, adapted for CPU-specific analysis:

- Add geometric mean aggregation as alternative to peak-only comparison
- Track at which request rate/concurrency peak performance occurs
- Implement separate views for managed (DUT) vs external endpoint tests
- Remove efficiency metric (tokens/sec/core) for external endpoints where
  core count is unknown
- Add flexible comparison modes: Peak (best value) vs Geometric Mean
  (average across all load points)
- Enhance summary tables to show load point where optimal performance occurs

This provides more statistically rigorous performance comparisons while
adapting the methodology for CPU inference workloads and multi-deployment
scenarios (managed container vs cloud/K8s endpoints).

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

♻️ Duplicate comments (1)
automation/test-execution/dashboard-examples/vllm_dashboard/pages/1_📊_Client_Metrics.py (1)

160-162: ⚠️ Potential issue | 🟠 Major

efficiency is still computed without guarding missing/zero cores.

This still produces inf/NaN for rows lacking valid core_count (notably external/legacy metadata), which can leak into downstream table/export logic even if charts are mode-gated.

Suggested fix
-    # Calculate efficiency (throughput per core)
-    df['efficiency'] = df['throughput_mean'] / df['cores']
+    # Calculate efficiency (throughput per core) only when cores are valid
+    cores = pd.to_numeric(df['cores'], errors='coerce')
+    df['efficiency'] = np.where(cores > 0, df['throughput_mean'] / cores, np.nan)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@automation/test-execution/dashboard-examples/vllm_dashboard/pages/1_📊_Client_Metrics.py`
around lines 160 - 162, Compute efficiency only when core counts are present and
non-zero: coerce df['cores'] to numeric, replace or mask zeros/invalid entries
(e.g., set zeros or NaNs to np.nan), then set df['efficiency'] =
df['throughput_mean'] / df['cores'] only for valid cores (or use
df['efficiency'] = df['throughput_mean'].where(df['cores']>0) / df['cores']).
Ensure df['efficiency'] remains NaN for rows with missing/zero/invalid core
counts so no inf/NaN leak into downstream exports.
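A runnable sketch of the guarded computation, assuming the `cores` and `throughput_mean` columns named in the review (the sample values are illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical frame mirroring the dashboard's columns: external runs
# have no known core count, so 'cores' may be missing or zero.
df = pd.DataFrame({
    "throughput_mean": [120.0, 80.0, 60.0],
    "cores": [16, 0, None],
})

# Coerce to numeric, then divide only where cores > 0; all other rows
# stay NaN, so per-core views can be gated with .notna().any().
cores = pd.to_numeric(df["cores"], errors="coerce")
df["efficiency"] = np.where(cores > 0, df["throughput_mean"] / cores, np.nan)
```

With this guard, zero-core rows produce NaN rather than inf, so nothing misleading leaks into downstream tables or exports.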
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@automation/test-execution/dashboard-examples/vllm_dashboard/pages/1_📊_Client_Metrics.py`:
- Around line 579-581: The selectbox options can be empty when all
vllm_endpoint_url values are 'n/a', causing errors; update the logic around
baseline_endpoints / baseline_endpoint /baseline_df (and the analogous
compare_endpoints/compare_endpoint/compare_df at lines ~621-623) to handle empty
option lists by replacing an empty filtered list with a safe default (e.g.,
['n/a']) and ensuring the selectbox has a valid default selection before
filtering the DataFrame; in short, when computing baseline_endpoints (and
compare_endpoints), if the list is empty set a fallback option and then build
baseline_df (and compare_df) using that safe selected value so the page doesn't
error when no valid URLs exist.
- Around line 338-347: The current exact set intersection of a_loads and b_loads
is brittle for floating-point request_rate; replace the strict equality check
with tolerant matching (e.g., use numpy.isclose or rounding/quantization) to
consider values equal within a small atol/rtol so small numeric drift doesn't
produce an empty intersection; implement this by computing matches between
unique values in data_a and data_b (or by rounding x_axis_column to a fixed
number of decimals or using pandas.merge_asof after sorting) and then build
common matched values (or a mapping) to filter data_a and data_b (replace the
a_loads/b_loads intersection and the a_common/b_common filtering logic).
- Around line 745-755: The displayed compare_val must be computed from the same
common-subset used to compute pct_diff, not from full compare_data; update the
logic around pct_diff/compare_val to use the common subset (e.g., use
compare_common or the common-load result returned by compare_two_datasets) and
compute compare_val = (max or min for aggregation=="peak", or geometric_mean) on
compare_common[config["column"]]. If compare_two_datasets currently doesn't
return the common subset or a basis value, modify it to return that value (or
the subset) and read that returned value here instead of recomputing from
compare_data.

---

Duplicate comments:
In `@automation/test-execution/dashboard-examples/vllm_dashboard/pages/1_📊_Client_Metrics.py`:
- Around line 160-162: Compute efficiency only when core counts are present and
non-zero: coerce df['cores'] to numeric, replace or mask zeros/invalid entries
(e.g., set zeros or NaNs to np.nan), then set df['efficiency'] =
df['throughput_mean'] / df['cores'] only for valid cores (or use
df['efficiency'] = df['throughput_mean'].where(df['cores']>0) / df['cores']).
Ensure df['efficiency'] remains NaN for rows with missing/zero/invalid core
counts so no inf/NaN leak into downstream exports.
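A minimal pandas sketch of the masking approach described in that comment (the column names mirror the comment; the sample data is made up for illustration):

```python
import pandas as pd

# Coerce cores to numeric, mask zero/invalid entries to NaN, then divide.
# Rows with missing/zero/invalid core counts keep NaN efficiency, so no
# inf values leak into downstream exports.
df = pd.DataFrame({
    "throughput_mean": [100.0, 80.0, 60.0],
    "cores": [4, 0, "n/a"],  # one valid count, one zero, one non-numeric
})
cores = pd.to_numeric(df["cores"], errors="coerce").where(lambda c: c > 0)
df["efficiency"] = df["throughput_mean"] / cores
# df["efficiency"] -> [25.0, NaN, NaN]
```

`Series.where` with a callable keeps values where the predicate holds and replaces the rest with NaN, which avoids a division-by-zero inf entirely.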
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ec671a4e-9038-4c84-9e7d-7f05a4802641

📥 Commits

Reviewing files that changed from the base of the PR and between f0e2dbc and 6160356.

📒 Files selected for processing (1)
  • automation/test-execution/dashboard-examples/vllm_dashboard/pages/1_📊_Client_Metrics.py

Comment on lines +338 to +347
a_loads = set(data_a[x_axis_column].dropna().unique())
b_loads = set(data_b[x_axis_column].dropna().unique())
common = a_loads.intersection(b_loads)

if not common:
    return None, None, None, None, None

# Filter to common load points
a_common = data_a[data_a[x_axis_column].isin(common)]
b_common = data_b[data_b[x_axis_column].isin(common)]

⚠️ Potential issue | 🟠 Major

Common-load matching is brittle for request_rate comparisons.

Using exact set intersection on raw float load values can drop all overlap due to tiny numeric drift between runs, causing false “no comparable data”.

Suggested fix
-    a_loads = set(data_a[x_axis_column].dropna().unique())
-    b_loads = set(data_b[x_axis_column].dropna().unique())
+    if x_axis_column == "request_rate":
+        a_keys = np.round(pd.to_numeric(data_a[x_axis_column], errors="coerce"), 2)
+        b_keys = np.round(pd.to_numeric(data_b[x_axis_column], errors="coerce"), 2)
+    else:
+        a_keys = pd.to_numeric(data_a[x_axis_column], errors="coerce")
+        b_keys = pd.to_numeric(data_b[x_axis_column], errors="coerce")
+
+    a_loads = set(a_keys.dropna().unique())
+    b_loads = set(b_keys.dropna().unique())
     common = a_loads.intersection(b_loads)
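As a standalone illustration of the tolerant-matching idea (the function and variable names here are hypothetical, not the dashboard's actual code), rounding the load axis to a fixed precision before intersecting lets values that differ only by float drift still match:

```python
import numpy as np
import pandas as pd

def common_loads(data_a, data_b, x_axis_column, decimals=2):
    # Round both sides to the same precision so tiny numeric drift
    # (e.g. 2.0 vs 2.0000001) does not empty the intersection.
    a_keys = np.round(pd.to_numeric(data_a[x_axis_column], errors="coerce"), decimals)
    b_keys = np.round(pd.to_numeric(data_b[x_axis_column], errors="coerce"), decimals)
    common = set(a_keys.dropna().unique()) & set(b_keys.dropna().unique())
    # Filter using the rounded keys, not the raw column values.
    a_common = data_a[a_keys.isin(common)]
    b_common = data_b[b_keys.isin(common)]
    return common, a_common, b_common

df_a = pd.DataFrame({"request_rate": [1.0, 2.0000001, 4.0]})
df_b = pd.DataFrame({"request_rate": [0.9999999, 2.0, 8.0]})
common, a_common, b_common = common_loads(df_a, df_b, "request_rate")
# common -> {1.0, 2.0}; each side keeps two matching rows
```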
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`automation/test-execution/dashboard-examples/vllm_dashboard/pages/1_📊_Client_Metrics.py`
around lines 338 - 347, The current exact set intersection of a_loads and
b_loads is brittle for floating-point request_rate; replace the strict equality
check with tolerant matching (e.g., use numpy.isclose or rounding/quantization)
to consider values equal within a small atol/rtol so small numeric drift doesn't
produce an empty intersection; implement this by computing matches between
unique values in data_a and data_b (or by rounding x_axis_column to a fixed
number of decimals or using pandas.merge_asof after sorting) and then build
common matched values (or a mapping) to filter data_a and data_b (replace the
a_loads/b_loads intersection and the a_common/b_common filtering logic).

Comment on lines +579 to +581
baseline_endpoints = [e for e in df['vllm_endpoint_url'].unique() if e != 'n/a']
baseline_endpoint = st.selectbox("Endpoint URL", baseline_endpoints, key="baseline_endpoint")
baseline_df = df[df['vllm_endpoint_url'] == baseline_endpoint]

⚠️ Potential issue | 🟠 Major

External compare selectors can fail when no valid endpoint URLs exist.

If external rows only have 'n/a' endpoint metadata, both st.selectbox("Endpoint URL", ...) calls receive empty options and the page can error.

Suggested fix
         else:
             # External mode: filter by endpoint
             baseline_endpoints = [e for e in df['vllm_endpoint_url'].unique() if e != 'n/a']
+            if not baseline_endpoints:
+                st.warning("No valid endpoint URLs found for external runs.")
+                return
             baseline_endpoint = st.selectbox("Endpoint URL", baseline_endpoints, key="baseline_endpoint")
             baseline_df = df[df['vllm_endpoint_url'] == baseline_endpoint]
...
         else:
             # External mode: filter by endpoint
             compare_endpoints = [e for e in df['vllm_endpoint_url'].unique() if e != 'n/a']
+            if not compare_endpoints:
+                st.warning("No valid endpoint URLs found for external runs.")
+                return
             compare_endpoint = st.selectbox("Endpoint URL", compare_endpoints, key="compare_endpoint")
             compare_df = df[df['vllm_endpoint_url'] == compare_endpoint]

Also applies to: 621-623

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`automation/test-execution/dashboard-examples/vllm_dashboard/pages/1_📊_Client_Metrics.py`
around lines 579 - 581, The selectbox options can be empty when all
vllm_endpoint_url values are 'n/a', causing errors; update the logic around
baseline_endpoints / baseline_endpoint /baseline_df (and the analogous
compare_endpoints/compare_endpoint/compare_df at lines ~621-623) to handle empty
option lists by replacing an empty filtered list with a safe default (e.g.,
['n/a']) and ensuring the selectbox has a valid default selection before
filtering the DataFrame; in short, when computing baseline_endpoints (and
compare_endpoints), if the list is empty set a fallback option and then build
baseline_df (and compare_df) using that safe selected value so the page doesn't
error when no valid URLs exist.

Comment on lines +745 to +755
if pct_diff is not None:
    # Get the actual values for display
    if aggregation == "peak":
        if config["higher_is_better"]:
            compare_val = compare_data[config["column"]].max()
        else:
            compare_val = compare_data[config["column"]].min()
    else:
        compare_val = geometric_mean(
            compare_data[config["column"]].dropna().tolist()
        )

⚠️ Potential issue | 🟠 Major

Comparison card value can disagree with its delta calculation.

pct_diff is computed from common-load subsets in compare_two_datasets, but the displayed compare_val here is recomputed from full compare_data, which can show a different value than the delta basis.

Suggested fix
-    Returns:
-        (pct_diff, a_is_better, a_peak_load, b_peak_load, is_similar)
+    Returns:
+        (pct_diff, a_is_better, a_peak_load, b_peak_load, is_similar, a_val, b_val)
...
-    return pct_diff, a_better, a_peak_load, b_peak_load, is_similar
+    return pct_diff, a_better, a_peak_load, b_peak_load, is_similar, a_val, b_val
-        pct_diff, better, a_load, b_load, similar = (
+        pct_diff, better, a_load, b_load, similar, _a_val, compare_val = (
             comparison_results[metric_name]
         )
...
-            if aggregation == "peak":
-                if config["higher_is_better"]:
-                    compare_val = compare_data[config["column"]].max()
-                else:
-                    compare_val = compare_data[config["column"]].min()
-            else:
-                compare_val = geometric_mean(
-                    compare_data[config["column"]].dropna().tolist()
-                )
+            # compare_val comes from the same subset used for pct_diff
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`automation/test-execution/dashboard-examples/vllm_dashboard/pages/1_📊_Client_Metrics.py`
around lines 745 - 755, The displayed compare_val must be computed from the same
common-subset used to compute pct_diff, not from full compare_data; update the
logic around pct_diff/compare_val to use the common subset (e.g., use
compare_common or the common-load result returned by compare_two_datasets) and
compute compare_val = (max or min for aggregation=="peak", or geometric_mean) on
compare_common[config["column"]]. If compare_two_datasets currently doesn't
return the common subset or a basis value, modify it to return that value (or
the subset) and read that returned value here instead of recomputing from
compare_data.
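A minimal sketch of the fix's intent: compute the displayed value from the same common-load subset used for the delta (`geometric_mean` is re-implemented here for the example, and `compare_common`, `aggregation`, and the config values are made up):

```python
import numpy as np
import pandas as pd

def geometric_mean(values):
    # Geometric mean of a non-empty list of positive floats.
    return float(np.exp(np.mean(np.log(values)))) if values else None

# compare_common stands in for the common-load subset returned by
# compare_two_datasets — NOT the full compare_data.
compare_common = pd.DataFrame({"throughput": [10.0, 40.0]})
aggregation, higher_is_better = "mean", True

if aggregation == "peak":
    compare_val = (compare_common["throughput"].max()
                   if higher_is_better
                   else compare_common["throughput"].min())
else:
    compare_val = geometric_mean(compare_common["throughput"].dropna().tolist())
# compare_val -> 20.0, the same basis the pct_diff was computed on
```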

- Remove skip logic for external endpoints with server metrics
- Add radio buttons for managed/external mode selection (similar to Client Metrics)
- Support filtering by endpoint URL in external mode
- Update comparison views to handle both deployment modes
- Show appropriate metadata (endpoint vs platform/cores) based on mode

Fixes the issue where external endpoints with available /metrics endpoint
were being excluded from the Server Metrics dashboard.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
automation/test-execution/dashboard-examples/vllm_dashboard/pages/2_🖥️_Server_Metrics.py (1)

93-108: ⚠️ Potential issue | 🟠 Major

Default deployment metadata before the test-metadata.json guard.

If vllm-metrics.json exists but test-metadata.json does not, this function still appends the record without vllm_mode or vllm_endpoint_url. The later mode filter then treats it as neither managed nor external and drops it entirely, so a metrics-only run shows up as “loaded” and then disappears from the page.

Suggested fix
                 # Load test metadata
                 metadata_file = metrics_file.parent / "test-metadata.json"
+                data['vllm_mode'] = 'managed'
+                data['vllm_endpoint_url'] = 'n/a'
                 if metadata_file.exists():
                     with open(metadata_file) as f:
                         metadata = json.load(f)

                         # Load metadata for both managed and external endpoints
@@
-                        data['vllm_mode'] = metadata.get('vllm_mode', 'managed')
-                        data['vllm_endpoint_url'] = metadata.get('vllm_endpoint_url', 'n/a')
+                        data['vllm_mode'] = metadata.get('vllm_mode', data['vllm_mode'])
+                        data['vllm_endpoint_url'] = metadata.get('vllm_endpoint_url', data['vllm_endpoint_url'])
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`automation/test-execution/dashboard-examples/vllm_dashboard/pages/2_🖥️_Server_Metrics.py`
around lines 93 - 108, The bug is that when metrics_file exists but
metadata_file (test-metadata.json) does not, the data dict never gets vllm_mode
or vllm_endpoint_url set, causing later filters to drop the record; fix by
ensuring default deployment metadata are assigned to data before or when
skipping the metadata_file guard: set data['vllm_mode']='managed' and
data['vllm_endpoint_url']='n/a' (and any other defaults like
'platform','model','workload','test_run_id','cores','backend','vllm_version','core_config')
whenever metadata_file.exists() is False so metrics-only runs still produce a
complete data record (update the block around metadata_file, metrics_file and
the data dict initialization to apply these defaults).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`automation/test-execution/dashboard-examples/vllm_dashboard/pages/2_🖥️_Server_Metrics.py`:
- Around line 300-303: The endpoint-shortening logic collapses host:port and
breaks IPv6 by splitting on ':' after '//'—update the logic that computes
endpoint_short (currently set from endpoint_url variable) to preserve host:port
and correctly handle IPv6: parse endpoint_url to extract the netloc
(host[:port]) rather than naive string splits, then use that netloc for
cols[0].metric and other uses (replace the same logic at the other occurrences
noted) so labels keep distinct host:port and full IPv6 addresses.

---

Outside diff comments:
In
`automation/test-execution/dashboard-examples/vllm_dashboard/pages/2_🖥️_Server_Metrics.py`:
- Around line 93-108: The bug is that when metrics_file exists but metadata_file
(test-metadata.json) does not, the data dict never gets vllm_mode or
vllm_endpoint_url set, causing later filters to drop the record; fix by ensuring
default deployment metadata are assigned to data before or when skipping the
metadata_file guard: set data['vllm_mode']='managed' and
data['vllm_endpoint_url']='n/a' (and any other defaults like
'platform','model','workload','test_run_id','cores','backend','vllm_version','core_config')
whenever metadata_file.exists() is False so metrics-only runs still produce a
complete data record (update the block around metadata_file, metrics_file and
the data dict initialization to apply these defaults).
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 58be081b-4044-4b41-9f95-b84ac08382cf

📥 Commits

Reviewing files that changed from the base of the PR and between 6160356 and 353a1f1.

📒 Files selected for processing (1)
  • automation/test-execution/dashboard-examples/vllm_dashboard/pages/2_🖥️_Server_Metrics.py

Comment on lines +300 to +303
endpoint_url = test_data.get('vllm_endpoint_url', 'N/A')
# Shorten endpoint for display
endpoint_short = endpoint_url.split('//')[1].split(':')[0] if '//' in endpoint_url else endpoint_url
cols[0].metric("Endpoint", endpoint_short)

⚠️ Potential issue | 🟠 Major

Preserve host:port when shortening external endpoint URLs.

The current split('//')[1].split(':')[0] logic collapses https://host:8000 and https://host:9000 into the same label, and it truncates IPv6 endpoints at the first colon. In compare mode, that makes different external targets indistinguishable in the header, legend, and summary table.

Suggested fix
-        endpoint_short = endpoint_url.split('//')[1].split(':')[0] if '//' in endpoint_url else endpoint_url
+        endpoint_short = endpoint_url.split('//', 1)[-1].rsplit('@', 1)[-1].split('/', 1)[0]
@@
-            baseline_endpoint_short = baseline_endpoint_short.split('//')[1].split(':')[0]
+            baseline_endpoint_short = baseline_endpoint_short.split('//', 1)[-1].rsplit('@', 1)[-1].split('/', 1)[0]
@@
-            compare_endpoint_short = compare_endpoint_short.split('//')[1].split(':')[0]
+            compare_endpoint_short = compare_endpoint_short.split('//', 1)[-1].rsplit('@', 1)[-1].split('/', 1)[0]
@@
-            endpoint_short = endpoint_url.split('//')[1].split(':')[0] if '//' in endpoint_url else endpoint_url
+            endpoint_short = endpoint_url.split('//', 1)[-1].rsplit('@', 1)[-1].split('/', 1)[0]

Also applies to: 977-984, 1094-1096
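The prompt below suggests parsing out the netloc; one way to sketch that (the helper name is hypothetical) is with `urllib.parse.urlsplit`, which treats `host:port` as a unit and keeps bracketed IPv6 addresses intact:

```python
from urllib.parse import urlsplit

def shorten_endpoint(endpoint_url):
    # Non-URL metadata like 'n/a' passes through unchanged.
    if "//" not in endpoint_url:
        return endpoint_url
    # netloc is host[:port] (with brackets preserved for IPv6).
    return urlsplit(endpoint_url).netloc

print(shorten_endpoint("https://host:8000"))     # host:8000
print(shorten_endpoint("https://host:9000"))     # host:9000 — stays distinct
print(shorten_endpoint("http://[::1]:8000/v1"))  # [::1]:8000
```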

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`automation/test-execution/dashboard-examples/vllm_dashboard/pages/2_🖥️_Server_Metrics.py`
around lines 300 - 303, The endpoint-shortening logic collapses host:port and
breaks IPv6 by splitting on ':' after '//'—update the logic that computes
endpoint_short (currently set from endpoint_url variable) to preserve host:port
and correctly handle IPv6: parse endpoint_url to extract the netloc
(host[:port]) rather than naive string splits, then use that netloc for
cols[0].metric and other uses (replace the same logic at the other occurrences
noted) so labels keep distinct host:port and full IPv6 addresses.

maryamtahhan and others added 3 commits April 3, 2026 17:47
The benchmark_guidellm role fetches guidellm.log from the load generator
to the controller's ~/benchmark-results directory. However, the synchronize
task only pulls files from the load_generator's ~/benchmark-results to the
final results directory, missing the log file that was fetched to the controller.

This adds:
1. A copy task to copy guidellm.log from the controller's ~/benchmark-results
   to the final results directory (results/llm/...) so it appears alongside
   the other benchmark files
2. A cleanup task to remove the entire test run directory from the controller's
   ~/benchmark-results after copying is complete

Fixes the issue where guidellm.log was missing from the final results
directory after test execution, and prevents accumulation of old test
results in ~/benchmark-results.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Python buffers stdout when redirected to a file (via nohup > file),
causing print() statements to not appear in metrics-collector.log
until the buffer is flushed or the script completes.

Adding the `-u` flag runs Python in unbuffered mode, ensuring that
log messages appear immediately in metrics-collector.log for real-time
monitoring of the collection process.

Fixes the issue where metrics-collector.log appears empty even though
the collection script is running.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
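The buffering behaviour this commit works around can be illustrated with an in-memory stream standing in for a redirected log file (this is a demonstration, not the collector's code; besides `python -u`, `flush=True` or `PYTHONUNBUFFERED=1` are alternatives):

```python
import io

# A TextIOWrapper buffers writes the way a redirected stdout does:
# printed text sits in the wrapper until flushed.
raw = io.BytesIO()  # stands in for the log file
stream = io.TextIOWrapper(raw, encoding="utf-8", newline="\n", write_through=False)

print("sample collected", file=stream)
unflushed = raw.getvalue()  # b'' — nothing has reached the "file" yet
stream.flush()
flushed = raw.getvalue()    # the line arrives only after the flush
```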
The metrics collector was running for the full estimated duration
(test_duration + 120s buffer) even when the test completed early,
causing the stop task to timeout waiting for vllm-metrics.json.

Changes:
1. Added signal handlers (SIGTERM/SIGINT) to the metrics collector
   script for graceful shutdown
2. Modified the main loop to check should_stop flag
3. Updated stop task to:
   - Send SIGTERM first for graceful shutdown
   - Wait up to 60s for the metrics file to be written
   - Force kill (SIGKILL) only if still running after timeout
4. Added file existence check and better status reporting

This allows the metrics collector to be stopped cleanly when the
benchmark completes, while still writing the collected samples to
the metrics file.

Fixes the timeout and missing vllm-metrics.json issues.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
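The shutdown pattern described above can be sketched as follows (names like `should_stop` are illustrative, not the collector's actual code): SIGTERM/SIGINT set a flag, the sampling loop checks it, and collected samples can be written out on exit instead of being lost to SIGKILL.

```python
import signal
import time

should_stop = False

def handle_stop(signum, frame):
    # Graceful shutdown: flip the flag and let the loop exit on its own.
    global should_stop
    should_stop = True

signal.signal(signal.SIGTERM, handle_stop)
signal.signal(signal.SIGINT, handle_stop)

samples = []
deadline = time.time() + 0.2  # stands in for test_duration + buffer
while time.time() < deadline and not should_stop:
    samples.append({"t": time.time()})  # stands in for a /metrics scrape
    time.sleep(0.05)

# On exit — whether via signal or deadline — flush `samples` to the
# metrics file here, so an early SIGTERM still produces vllm-metrics.json.
```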
@maryamtahhan
Collaborator Author

maryamtahhan commented Apr 3, 2026

@lmzuccarelli @PoolPooer this is how I tested

Set up vLLM on a host.

then on the controller node:

# Required: Set vLLM to external mode
export VLLM_ENDPOINT_MODE=external

# Required: Full URL with protocol and port
export VLLM_ENDPOINT_URL=http://<dut>:8000

# Required: Load generator hostname
export LOADGEN_HOSTNAME=<loadgen>

# Required: Ansible SSH configuration
export ANSIBLE_SSH_USER=<user>
export ANSIBLE_SSH_KEY=<key>

# Optional but recommended: HuggingFace token
export HF_TOKEN=$(cat ~/hf_token)

then run a test

ansible-playbook -i inventory/hosts.yml \
llm-benchmark-concurrent-load.yml \
-e "test_model=meta-llama/Llama-3.2-1B-Instruct" \
-e "base_workload=chat" \
-e "requested_cores=16" \
-e "skip_phase_2=true" \
-e "skip_phase_3=true" \
-e "guidellm_max_seconds=100" \
-e "guidellm_rate=[1,2]" \
-e "guidellm_wait_timeout_seconds=2400"

Check the dashboard afterwards.
