
[fix][trtllm] fix trtllm rollout docker image and a few scripts#6230

Draft
hchings wants to merge 1 commit intoverl-project:mainfrom
hchings:fix_ci

Conversation


@hchings hchings commented Apr 30, 2026

What does this PR do?

Checklist Before Starting

  • Search for similar PRs. Paste at least one query link here: ...
  • Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include fsdp, megatron, veomni, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data, cfg, reward, fully_async, one_step_off
  • If this PR involves multiple modules, separate them with commas, e.g. [megatron, fsdp, doc]
    • {type} is in feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Test

For changes that cannot be tested by CI (e.g., algorithm implementations, new model support), validate with experiments and show results such as training curve plots, evaluation numbers, etc.

API and Usage Example

Demonstrate how the API changes, if any, and provide usage example(s) if possible.

# Add code snippet or script demonstrating how to use this

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

Important

Please check all of the following items before requesting a review; otherwise the reviewer may deprioritize this PR.

@hchings hchings self-assigned this Apr 30, 2026

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request updates the Docker configuration, documentation, and training scripts for the GRPO trainer. Critical issues were identified regarding invalid package versions for cupy-cuda12x and ray in the Dockerfile, which will cause build failures. Furthermore, the newly added TRT-LLM parameters in the shell script are currently ineffective due to missing support in the server implementation, and the script requires improvements to align with shell scripting best practices.

# Install Python dependencies
RUN pip3 install --no-cache-dir --no-deps trl==0.27.0 && \
    pip3 install --no-cache-dir nvtx matplotlib liger_kernel cachetools && \
    pip3 install --no-cache-dir cupy-cuda12x==14.0.1 && \

Severity: high

The version cupy-cuda12x==14.0.1 does not appear to exist on PyPI (the latest stable release is 13.3.0). This will cause the Docker build to fail. Please verify the intended version and use a valid release.

    pip3 install --no-cache-dir cupy-cuda12x==13.3.0 && \

# Pin Ray to a version compatible with TRT-LLM 1.3.0rc13
RUN pip install --no-cache-dir "ray[default]==2.54.1"

Severity: high

The Ray version 2.54.1 is not a valid release on PyPI and will cause the build to fail. Additionally, the preceding pip uninstall -y verl at line 49 leaves the image without the verl package installed. You should remove the redundant uninstalls and use a valid Ray version (e.g., 2.35.0 or 2.40.0).

RUN pip install --no-cache-dir "ray[default]==2.35.0"

Comment on lines +232 to +233
+actor_rollout_ref.rollout.engine_kwargs.trtllm.batch_wait_timeout_iters=32
+actor_rollout_ref.rollout.engine_kwargs.trtllm.batch_wait_max_tokens_ratio=0.5

Severity: high

These parameters (batch_wait_timeout_iters and batch_wait_max_tokens_ratio) are currently ineffective because the TRTLLMHttpServer implementation in verl/workers/rollout/trtllm_rollout/trtllm_async_server.py does not pass them to the SchedulerConfig. To make these settings work, the server implementation needs to be updated. Additionally, ensure this shell script follows repository standards: use the + prefix for Hydra overrides, enable set -xeuo pipefail, ensure log directories are created before use, and redirect stderr to stdout (2>&1) before piping to tee.

References
  1. Use + instead of ++ as the prefix for overriding configuration values in Hydra.
  2. Enable set -xeuo pipefail in shell scripts to ensure that the script exits on errors, treats unset variables as errors, and pipelines fail correctly.
  3. In shell scripts, redirect stderr to stdout using 2>&1 before piping to tee to ensure that error messages are captured in the log file.
  4. Ensure that the log directory exists before writing to it in shell scripts.
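The four conventions above can be combined into a minimal launcher skeleton. This is a hedged sketch only: the `echo` stands in for the real trainer invocation, and `LOG_DIR` and the log file name are placeholders, not paths from this repository.

```shell
#!/usr/bin/env bash
# -x: trace each command, -e: exit on error, -u: treat unset variables
# as errors, pipefail: a pipeline fails if any stage fails.
set -xeuo pipefail

# Ensure the log directory exists before tee writes into it.
LOG_DIR="logs"
mkdir -p "${LOG_DIR}"

# Placeholder for the real trainer command. A single '+' prefix adds a
# new Hydra config value. Redirect stderr to stdout (2>&1) *before* the
# pipe so tee also captures error messages in the log file.
echo "trainer +actor_rollout_ref.rollout.engine_kwargs.trtllm.batch_wait_timeout_iters=32" \
    2>&1 | tee "${LOG_DIR}/run.log"
```

Note the ordering: `2>&1` must appear before the `|` so that both streams flow through `tee`; with `pipefail` enabled, a trainer crash then fails the whole pipeline instead of being masked by `tee`'s exit status.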

