[fix][trtllm] fix trtllm rollout docker image and a few scripts by hchings · Pull Request #6230 · verl-project/verl

hchings · 2026-04-30T23:00:35Z

What does this PR do?

Pin Ray to 2.54.1, the version compatible with TRT-LLM 1.3.0rc13.
Restore a few TRT-LLM-specific arguments in the GRPO scripts that were dropped by "[misc] refactor: re-format examples and deprecate old examples" (#6126).

Checklist Before Starting

Search for similar PRs. Paste at least one query link here: ...
Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
- {modules} include fsdp, megatron, veomni, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data, cfg, reward, fully_async, one_step_off
- If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
- {type} is in feat, fix, refactor, chore, test
- If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
- Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Test

For changes that cannot be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results such as training curve plots, evaluation results, etc.

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

# Add code snippet or script demonstrating how to use this

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

Read the Contribute Guide.
Apply pre-commit checks: pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always
Add / Update the documentation.
Add unit or end-to-end test(s) to the CI workflow to cover all the code. If not feasible, explain why: ...
Once your PR is ready for CI, send a message in the ci-request channel in the verl Slack workspace. (If not accessible, please try the Feishu group (飞书群).)
If your PR is related to the recipe submodule, please also update the reference to the submodule commit via git submodule update --remote or cd recipe && git pull origin main.

gemini-code-assist

Code Review

This pull request updates the Docker configuration, documentation, and training scripts for the GRPO trainer. Critical issues were identified regarding invalid package versions for cupy-cuda12x and ray in the Dockerfile, which will cause build failures. Furthermore, the newly added TRT-LLM parameters in the shell script are currently ineffective due to missing support in the server implementation, and the script requires improvements to align with shell scripting best practices.

gemini-code-assist · 2026-04-30T23:04:21Z

 # Install Python dependencies
 RUN pip3 install --no-cache-dir --no-deps trl==0.27.0 && \
    pip3 install --no-cache-dir nvtx matplotlib liger_kernel cachetools && \
+    pip3 install --no-cache-dir cupy-cuda12x==14.0.1 && \


The version cupy-cuda12x==14.0.1 does not appear to exist on PyPI (the latest stable release is 13.3.0). This will cause the Docker build to fail. Please verify the intended version and use a valid release.

pip3 install --no-cache-dir cupy-cuda12x==13.3.0 && \

gemini-code-assist · 2026-04-30T23:04:21Z



+# Pin Ray to a version compatible with TRT-LLM 1.3.0rc13
+RUN pip install --no-cache-dir "ray[default]==2.54.1"


The Ray version 2.54.1 is not a valid release on PyPI and will cause the build to fail. Additionally, the preceding pip uninstall -y verl at line 49 leaves the image without the verl package installed. You should remove the redundant uninstalls and use a valid Ray version (e.g., 2.35.0 or 2.40.0).

RUN pip install --no-cache-dir "ray[default]==2.35.0"
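Whether a pinned version such as `ray==2.54.1` or `cupy-cuda12x==14.0.1` is actually published can be checked directly against PyPI before changing the Dockerfile. The following is a minimal sketch, not part of this PR: the helper name `pypi_has_version` is hypothetical, and it assumes `curl` is available and that the public PyPI JSON endpoint `/pypi/<project>/<version>/json` is reachable from the build host.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: returns 0 if PyPI serves metadata for <package>==<version>,
# non-zero otherwise (curl --fail turns an HTTP 404 into a non-zero exit status).
pypi_has_version() {
    local package="$1" version="$2"
    curl --fail --silent --show-error --output /dev/null \
        "https://pypi.org/pypi/${package}/${version}/json"
}

# Example calls (require network access, so they are left commented out):
#   pypi_has_version ray 2.54.1          && echo "ray pin exists"
#   pypi_has_version cupy-cuda12x 14.0.1 && echo "cupy pin exists"
```

Running the check settles the question empirically rather than relying on either the PR or the review comment.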

gemini-code-assist · 2026-04-30T23:04:21Z

+        +actor_rollout_ref.rollout.engine_kwargs.trtllm.batch_wait_timeout_iters=32
+        +actor_rollout_ref.rollout.engine_kwargs.trtllm.batch_wait_max_tokens_ratio=0.5


These parameters (batch_wait_timeout_iters and batch_wait_max_tokens_ratio) are currently ineffective because the TRTLLMHttpServer implementation in verl/workers/rollout/trtllm_rollout/trtllm_async_server.py does not pass them to the SchedulerConfig. To make these settings work, the server implementation needs to be updated. Additionally, ensure this shell script follows repository standards: use the + prefix for Hydra overrides, enable set -xeuo pipefail, ensure log directories are created before use, and redirect stderr to stdout (2>&1) before piping to tee.
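For context on the prefix remark: in Hydra's override grammar, `+key=value` appends a key that is not yet present in the config, while `++key=value` adds the key or overrides it if it already exists. A minimal sketch of how a script might guard against the `++` form is shown below; the array name `TRTLLM_OVERRIDES` and the function `check_overrides` are hypothetical, and only the two override strings come from this PR.

```shell
#!/usr/bin/env bash
set -euo pipefail

# The two TRT-LLM overrides added by this PR, written with the single '+' prefix.
TRTLLM_OVERRIDES=(
    "+actor_rollout_ref.rollout.engine_kwargs.trtllm.batch_wait_timeout_iters=32"
    "+actor_rollout_ref.rollout.engine_kwargs.trtllm.batch_wait_max_tokens_ratio=0.5"
)

# Hypothetical lint: reject any override that starts with '++'.
check_overrides() {
    local arg
    for arg in "$@"; do
        case "$arg" in
            ++*)
                echo "use '+' rather than '++': ${arg}" >&2
                return 1
                ;;
        esac
    done
}

check_overrides "${TRTLLM_OVERRIDES[@]}"

# The array can then be appended to the trainer command line, e.g.
#   <trainer command> "${TRTLLM_OVERRIDES[@]}"
printf '%s\n' "${TRTLLM_OVERRIDES[@]}"
```

Keeping the overrides in an array avoids word-splitting problems and makes it easy to lint or log them before launching the job.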

References

Use + instead of ++ as the prefix for overriding configuration values in Hydra.

Enable set -xeuo pipefail in shell scripts to ensure that the script exits on errors, treats unset variables as errors, and pipelines fail correctly.

In shell scripts, redirect stderr to stdout using 2>&1 before piping to tee to ensure that error messages are captured in the log file.

Ensure that the log directory exists before writing to it in shell scripts.
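The three shell-related references above can be combined in a short skeleton. This is a sketch under stated assumptions rather than the repository's actual script: `LOG_DIR`, `LOG_FILE`, and the `run_training` placeholder are hypothetical names standing in for the real paths and the real trainer command.

```shell
#!/usr/bin/env bash
# Trace commands (-x), exit on errors (-e), treat unset variables as errors (-u),
# and make a pipeline fail if any stage fails (pipefail).
set -xeuo pipefail

# Hypothetical log location; create it before anything writes to it.
LOG_DIR="${LOG_DIR:-./logs}"
LOG_FILE="${LOG_DIR}/train.log"
mkdir -p "${LOG_DIR}"

# Placeholder for the real training command; it writes to both stdout and stderr.
run_training() {
    echo "training started"
    echo "warning: sample stderr message" >&2
}

# Merge stderr into stdout *before* the pipe so tee captures both streams.
run_training 2>&1 | tee "${LOG_FILE}"
```

Because of `pipefail`, a non-zero exit from the training command is not masked by `tee` succeeding, so the script still stops on failure while the log keeps the error output.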

pin ray version, a few fixes

79d53d4

hchings self-assigned this Apr 30, 2026

gemini-code-assist Bot reviewed Apr 30, 2026

hchings wants to merge 1 commit into verl-project:main from hchings:fix_ci
