@zhangxinyuehfad zhangxinyuehfad commented Sep 1, 2025

What this PR does / why we need it?

  1. Fix soc_version for 310P.
  2. Refactor _build_info and add ascend_soc_version (A2, A3, 310P) to _build_info.
  3. Set a default SOC_VERSION (ASCEND910B1, Ascend910_9392, ASCEND310P3, respectively) for each ascend_soc_version.
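
The relationship between the three SOC_VERSION defaults and the three ascend_soc_version families can be sketched as below. This is only an illustration, not the code in this PR: the member name `P310`, the `DEFAULT_SOC_VERSION` table, the `classify_soc_version` helper, and the prefix-matching rules are all assumptions. Only the family names, the default strings, and `AscendSocVersion.UNDEFINED` come from this page.

```python
from enum import Enum


class AscendSocVersion(Enum):
    A2 = "A2"
    A3 = "A3"
    P310 = "310P"  # hypothetical member name; identifiers cannot start with a digit
    UNDEFINED = "UNDEFINED"


# Default SOC_VERSION per family, as listed in the PR description.
DEFAULT_SOC_VERSION = {
    AscendSocVersion.A2: "ASCEND910B1",
    AscendSocVersion.A3: "Ascend910_9392",
    AscendSocVersion.P310: "ASCEND310P3",
}


def classify_soc_version(soc_version: str) -> AscendSocVersion:
    """Map a SOC_VERSION string to a family (the prefix rules are assumed)."""
    name = soc_version.lower()
    if name.startswith("ascend310p"):
        return AscendSocVersion.P310
    if name.startswith("ascend910b"):
        return AscendSocVersion.A2
    if name.startswith("ascend910_93"):
        return AscendSocVersion.A3
    return AscendSocVersion.UNDEFINED
```

Under these assumed rules, each default string maps back to its own family, and anything unrecognized falls through to `UNDEFINED`.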
(EngineCore_0 pid=7454)
(EngineCore_0 pid=7454) ERROR 09-01 08:01:59 [core.py:700] EngineCore failed to start.
(EngineCore_0 pid=7454) ERROR 09-01 08:01:59 [core.py:700] Traceback (most recent call last):
(EngineCore_0 pid=7454) ERROR 09-01 08:01:59 [core.py:700]   File "/__w/vllm-benchmarks/vllm-benchmarks/vllm-empty/vllm/v1/engine/core.py", line 691, in run_engine_core
(EngineCore_0 pid=7454) ERROR 09-01 08:01:59 [core.py:700]     engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_0 pid=7454) ERROR 09-01 08:01:59 [core.py:700]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=7454) ERROR 09-01 08:01:59 [core.py:700]   File "/__w/vllm-benchmarks/vllm-benchmarks/vllm-empty/vllm/v1/engine/core.py", line 492, in __init__
(EngineCore_0 pid=7454) ERROR 09-01 08:01:59 [core.py:700]     super().__init__(vllm_config, executor_class, log_stats,
(EngineCore_0 pid=7454) ERROR 09-01 08:01:59 [core.py:700]   File "/__w/vllm-benchmarks/vllm-benchmarks/vllm-empty/vllm/v1/engine/core.py", line 89, in __init__
(EngineCore_0 pid=7454) ERROR 09-01 08:01:59 [core.py:700]     self._initialize_kv_caches(vllm_config)
(EngineCore_0 pid=7454) ERROR 09-01 08:01:59 [core.py:700]   File "/__w/vllm-benchmarks/vllm-benchmarks/vllm-empty/vllm/v1/engine/core.py", line 179, in _initialize_kv_caches
(EngineCore_0 pid=7454) ERROR 09-01 08:01:59 [core.py:700]     self.model_executor.determine_available_memory())
(EngineCore_0 pid=7454) ERROR 09-01 08:01:59 [core.py:700]     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=7454) ERROR 09-01 08:01:59 [core.py:700]   File "/__w/vllm-benchmarks/vllm-benchmarks/vllm-empty/vllm/v1/executor/abstract.py", line 76, in determine_available_memory
(EngineCore_0 pid=7454) ERROR 09-01 08:01:59 [core.py:700]     output = self.collective_rpc("determine_available_memory")
(EngineCore_0 pid=7454) ERROR 09-01 08:01:59 [core.py:700]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=7454) ERROR 09-01 08:01:59 [core.py:700]   File "/__w/vllm-benchmarks/vllm-benchmarks/vllm-empty/vllm/executor/uniproc_executor.py", line 58, in collective_rpc
(EngineCore_0 pid=7454) ERROR 09-01 08:01:59 [core.py:700]     answer = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_0 pid=7454) ERROR 09-01 08:01:59 [core.py:700]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=7454) ERROR 09-01 08:01:59 [core.py:700]   File "/__w/vllm-benchmarks/vllm-benchmarks/vllm-empty/vllm/utils/__init__.py", line 3007, in run_method
(EngineCore_0 pid=7454) ERROR 09-01 08:01:59 [core.py:700]     return func(*args, **kwargs)
(EngineCore_0 pid=7454) ERROR 09-01 08:01:59 [core.py:700]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=7454) ERROR 09-01 08:01:59 [core.py:700]   File "/__w/vllm-benchmarks/vllm-benchmarks/vllm-ascend/vllm_ascend/worker/worker_v1.py", line 161, in determine_available_memory
(EngineCore_0 pid=7454) ERROR 09-01 08:01:59 [core.py:700]     self.model_runner.profile_run()
(EngineCore_0 pid=7454) ERROR 09-01 08:01:59 [core.py:700]   File "/__w/vllm-benchmarks/vllm-benchmarks/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 2163, in profile_run
(EngineCore_0 pid=7454) ERROR 09-01 08:01:59 [core.py:700]     hidden_states = self._dummy_run(self.max_num_tokens,
(EngineCore_0 pid=7454) ERROR 09-01 08:01:59 [core.py:700]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=7454) ERROR 09-01 08:01:59 [core.py:700]   File "/usr/local/python3.11.13/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(EngineCore_0 pid=7454) ERROR 09-01 08:01:59 [core.py:700]     return func(*args, **kwargs)
(EngineCore_0 pid=7454) ERROR 09-01 08:01:59 [core.py:700]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=7454) ERROR 09-01 08:01:59 [core.py:700]   File "/__w/vllm-benchmarks/vllm-benchmarks/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 2015, in _dummy_run
(EngineCore_0 pid=7454) ERROR 09-01 08:01:59 [core.py:700]     moe_comm_method = self._select_moe_comm_method(num_tokens)
(EngineCore_0 pid=7454) ERROR 09-01 08:01:59 [core.py:700]                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=7454) ERROR 09-01 08:01:59 [core.py:700]   File "/__w/vllm-benchmarks/vllm-benchmarks/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 1633, in _select_moe_comm_method
(EngineCore_0 pid=7454) ERROR 09-01 08:01:59 [core.py:700]     raise ValueError(f"Unsupported soc_version: {soc_version}")
(EngineCore_0 pid=7454) ERROR 09-01 08:01:59 [core.py:700] ValueError: Unsupported soc_version: AscendSocVersion.UNDEFINED
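
The failure mode in the traceback can be reproduced with a minimal sketch of a dispatch on the SoC family. This is not the body of `_select_moe_comm_method`: the function below, the member name `P310`, and the returned placeholder strings are hypothetical. Only the enum name, the `UNDEFINED` member, and the error message are taken from the log above.

```python
from enum import Enum


class AscendSocVersion(Enum):
    A2 = 0
    A3 = 1
    P310 = 2  # hypothetical member name for the 310P family
    UNDEFINED = 3


def select_moe_comm_method(soc_version: AscendSocVersion) -> str:
    """Pick a communication method per SoC family (placeholder return values)."""
    if soc_version is AscendSocVersion.A2:
        return "method-for-a2"
    if soc_version is AscendSocVersion.A3:
        return "method-for-a3"
    if soc_version is AscendSocVersion.P310:
        # Without a branch like this one, a 310P device that is detected
        # as UNDEFINED (or not handled at all) reaches the raise below.
        return "method-for-310p"
    raise ValueError(f"Unsupported soc_version: {soc_version}")
```

In this sketch, a device whose family is never resolved arrives as `UNDEFINED` and hits the final `raise`, which matches the last line of the log.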

Does this PR introduce any user-facing change?

Users can use 310P normally.

How was this patch tested?

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request correctly fixes a bug for Ascend 310p devices by adding support for its soc_version. The changes properly identify the new soc_version and configure the appropriate MoE communication method. I have one high-severity suggestion to improve maintainability by replacing a magic number with a named constant.

github-actions bot commented Sep 1, 2025

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by other future PRs.
  • Write the commit message by filling in the PR description, to help reviewers and future developers understand the change.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

@wangxiyuan
Collaborator

I don't like this approach. How about refactoring it to use build_info entirely?

codecov bot commented Sep 1, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.84%. Comparing base (2693196) to head (5fc0f77).
⚠️ Report is 24 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2676      +/-   ##
==========================================
+ Coverage   72.61%   73.84%   +1.22%     
==========================================
  Files         154      155       +1     
  Lines       21319    21338      +19     
==========================================
+ Hits        15480    15756     +276     
+ Misses       5839     5582     -257     
Flag Coverage Δ
unittests 73.84% <100.00%> (+1.22%) ⬆️

github-actions bot commented Sep 5, 2025

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@AlphaINF
AlphaINF commented Sep 7, 2025

Same issue! I'd like to see this branch merged!

@zhangxinyuehfad
Contributor Author

I don't like this approach. How about refactoring it to use build_info entirely?

It's impossible to import torch_npu and call get_soc_version() in the isolated environment, so we need to get ascend_soc_version in the run phase, as before.
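
The constraint described here (no torch_npu import at build time, so the lookup has to happen in the run phase) can be sketched as a deferred, cached query. This is a hedged illustration, not the PR's implementation: the helper names are hypothetical, and the attribute path `torch_npu.npu.get_soc_version()` is an assumption based on the function name mentioned in the comment.

```python
import importlib
from functools import lru_cache
from typing import Callable, Optional


def _query_device() -> Optional[int]:
    """Import torch_npu only when called, i.e. in the run phase."""
    try:
        torch_npu = importlib.import_module("torch_npu")
    except ImportError:
        # Expected in an isolated environment where torch_npu is absent.
        return None
    # Assumed attribute path; adjust to the real torch_npu API.
    return torch_npu.npu.get_soc_version()


@lru_cache(maxsize=None)
def get_soc_version_at_runtime(
    query: Callable[[], Optional[int]] = _query_device,
) -> Optional[int]:
    """Resolve the SoC version on first use and cache the result."""
    return query()
```

Because nothing is imported at module load time, a module written this way can still be imported where torch_npu is unavailable. The injectable `query` argument exists only so the sketch can be exercised without the hardware.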

@wangxiyuan added and then removed the ready-for-test label (start test by label for PR) on Sep 15, 2025
@zhangxinyuehfad force-pushed the zxy_fix branch 16 times, most recently from bb4559a to c002951, on September 19, 2025 02:57