[fix][trtllm] fix trtllm rollout docker image and a few scripts #6230
hchings wants to merge 1 commit into verl-project:main
Code Review
This pull request updates the Docker configuration, documentation, and training scripts for the GRPO trainer. Critical issues were identified regarding invalid package versions for cupy-cuda12x and ray in the Dockerfile, which will cause build failures. Furthermore, the newly added TRT-LLM parameters in the shell script are currently ineffective due to missing support in the server implementation, and the script requires improvements to align with shell scripting best practices.
# Install Python dependencies
RUN pip3 install --no-cache-dir --no-deps trl==0.27.0 && \
    pip3 install --no-cache-dir nvtx matplotlib liger_kernel cachetools && \
    pip3 install --no-cache-dir cupy-cuda12x==14.0.1 && \
# Pin Ray to a version compatible with TRT-LLM 1.3.0rc13
RUN pip install --no-cache-dir "ray[default]==2.54.1"
The Ray version 2.54.1 is not a valid release on PyPI and will cause the build to fail. Additionally, the preceding pip uninstall -y verl at line 49 leaves the image without the verl package installed. You should remove the redundant uninstalls and use a valid Ray version (e.g., 2.35.0 or 2.40.0).
RUN pip install --no-cache-dir "ray[default]==2.35.0"
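Whichever version is chosen, a later install step can silently move a pinned package. A minimal sketch of a post-install check follows, assuming the build can pipe `pip3 freeze` output into a small helper. `check_pin` is a hypothetical function, not part of verl or pip, and it only matches plain `name==version` lines (editable or URL installs are not handled).

```shell
#!/usr/bin/env bash
set -euo pipefail

# check_pin NAME VERSION
# Reads `pip freeze`-style lines ("name==version") on stdin and succeeds only
# if the exact pin is present. Hypothetical helper for illustration.
check_pin() {
    local name="$1" version="$2"
    # Avoid `grep -q` here: it can exit early and make the upstream command
    # fail with SIGPIPE when `set -o pipefail` is active.
    if grep -xF -- "${name}==${version}" >/dev/null; then
        echo "ok: ${name}==${version}"
    else
        echo "pin not satisfied: ${name}==${version}" >&2
        return 1
    fi
}
```

In a build step this could be used as `pip3 freeze | check_pin ray "<pinned version>"`, so the build stops if the final image does not contain the intended Ray release.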
+actor_rollout_ref.rollout.engine_kwargs.trtllm.batch_wait_timeout_iters=32
+actor_rollout_ref.rollout.engine_kwargs.trtllm.batch_wait_max_tokens_ratio=0.5
These parameters (batch_wait_timeout_iters and batch_wait_max_tokens_ratio) are currently ineffective because the TRTLLMHttpServer implementation in verl/workers/rollout/trtllm_rollout/trtllm_async_server.py does not pass them to the SchedulerConfig. To make these settings work, the server implementation needs to be updated. Additionally, ensure this shell script follows repository standards: use the + prefix for Hydra overrides, enable set -xeuo pipefail, ensure log directories are created before use, and redirect stderr to stdout (2>&1) before piping to tee.
References
- Use + instead of ++ as the prefix for overriding configuration values in Hydra.
- Enable set -xeuo pipefail in shell scripts to ensure that the script exits on errors, treats unset variables as errors, and pipelines fail correctly.
- In shell scripts, redirect stderr to stdout using 2>&1 before piping to tee to ensure that error messages are captured in the log file.
- Ensure that the log directory exists before writing to it in shell scripts.
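Taken together, the four points above could look like the following sketch. It is an illustration under stated assumptions, not the script from this PR: `LOG_DIR`, the log file name, and `run_training` are hypothetical, and `run_training` is only a stand-in for the real trainer command. The two overrides are the ones quoted in the diff.

```shell
#!/usr/bin/env bash
# -e: exit on errors, -u: unset variables are errors, -x: trace commands,
# pipefail: a pipeline fails if any stage fails (not only the last one).
set -xeuo pipefail

# Hypothetical log location; override LOG_DIR from the environment if needed.
LOG_DIR="${LOG_DIR:-./logs}"
LOG_FILE="${LOG_DIR}/grpo_trtllm.log"

# Create the log directory before tee tries to open a file inside it.
mkdir -p "${LOG_DIR}"

# Hydra overrides with the single + prefix, as quoted in the diff.
HYDRA_OVERRIDES=(
    "+actor_rollout_ref.rollout.engine_kwargs.trtllm.batch_wait_timeout_iters=32"
    "+actor_rollout_ref.rollout.engine_kwargs.trtllm.batch_wait_max_tokens_ratio=0.5"
)

# Stand-in for the real training command (assumption): it prints its
# arguments to stdout and one message to stderr.
run_training() {
    printf 'override: %s\n' "$@"
    echo "warning from trainer" >&2
}

# 2>&1 comes before the pipe, so stderr reaches tee and lands in the log too.
run_training "${HYDRA_OVERRIDES[@]}" 2>&1 | tee "${LOG_FILE}"
```

Because `-x` writes its trace to stderr, the trace lines emitted inside `run_training` also end up in the log file. That is usually helpful when debugging a failed launch.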
What does this PR do?
2.54.1, which is the compatible version with TRT-LLM 1.3.0rc13
Checklist Before Starting
- Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
- {modules} include fsdp, megatron, veomni, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data, cfg, reward, fully_async, one_step_off
- If the PR involves multiple modules, separate them with commas, like [megatron, fsdp, doc]
- {type} is in feat, fix, refactor, chore, test
- For breaking changes, add [BREAKING] to the beginning of the title.
- Example: [BREAKING][fsdp, megatron] feat: dynamic batching
Test
API and Usage Example
# Add code snippet or script demonstrating how to use this
Design & Code Changes
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- Apply pre-commit checks: pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always
- Once your PR is ready for CI, send a message in the ci-request channel in the verl Slack workspace. (If not accessible, please try the Feishu group (飞书群).)
- If your PR is related to the recipe submodule, please also update the reference to the submodule commit via git submodule update --remote or cd recipe && git pull origin main.