Description
Your current environment
vllm 0.16.0
🐛 Describe the bug
First YAML config:

```yaml
model: "/workspace/KaLM-embedding-multilingual-mini-instruct-v2.5"
convert: embed
gpu-memory-utilization: 0.85
served-model-name: KaLM-embedding-multilingual-mini-instruct-v2.5
max-model-len: 8000
port: 8000
max_num_batched_tokens: 20000
# hf-overrides: {"matryoshka_dimensions": [896, 512, 256, 128, 64, 32]}
max_num_seqs: 16
block_size: 16
host: "0.0.0.0"
```

This runs OK in 3 GB.
Second YAML config (identical except the `hf-overrides` line is uncommented):

```yaml
model: "/workspace/KaLM-embedding-multilingual-mini-instruct-v2.5"
convert: embed
gpu-memory-utilization: 0.85
served-model-name: KaLM-embedding-multilingual-mini-instruct-v2.5
max-model-len: 8000
port: 8000
max_num_batched_tokens: 20000
hf-overrides: {"matryoshka_dimensions": [896, 512, 256, 128, 64, 32]}
max_num_seqs: 16
block_size: 16
host: "0.0.0.0"
```

This fails in 3 GB but runs OK in 5 GB.
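For context on what the `matryoshka_dimensions` override advertises: Matryoshka-style embeddings let a client request a truncated vector (e.g. 256 of 896 dimensions), conceptually by keeping a prefix of the full embedding and re-normalizing it. A minimal sketch of that idea (the helper name is hypothetical, not vLLM's API):

```python
import math

def truncate_embedding(vec: list[float], dim: int) -> list[float]:
    """Matryoshka-style truncation: keep the first `dim` components
    of the full embedding and re-normalize to unit length.
    (Illustrative helper only, not part of vLLM.)"""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

# e.g. a 4-d vector truncated to its first 2 components
print(truncate_embedding([3.0, 4.0, 0.0, 0.0], 2))  # [0.6, 0.8]
```

Each value in `matryoshka_dimensions` ([896, 512, 256, 128, 64, 32] above) is one such allowed truncation size.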
Error log:

```text
(EngineCore_DP0 pid=372) INFO 03-09 01:15:12 [core.py:97] Initializing a V1 LLM engine (v0.16.0) with config: model='/workspace/KaLM-embedding-multilingual-mini-instruct-v2.5', speculative_config=None, tokenizer='/workspace/KaLM-embedding-multilingual-mini-instruct-v2.5', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=8000, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=KaLM-embedding-multilingual-mini-instruct-v2.5, enable_prefix_caching=False, enable_chunked_prefill=False, pooler_config=PoolerConfig(pooling_type=None, seq_pooling_type='MEAN', tok_pooling_type='ALL', use_activation=True, dimensions=None, enable_chunked_processing=None, max_embed_len=None, logit_bias=None, step_tag_id=None, returned_token_ids=None), compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [20000], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.PIECEWISE: 1>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': True, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False, 'fuse_act_padding': False}, 'max_cudagraph_capture_size': 32, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore_DP0 pid=372) WARNING 03-09 01:15:12 [network_utils.py:36] The environment variable HOST_IP is deprecated and ignored, as it is often used by Docker and other software to interact with the container's network stack. Please use VLLM_HOST_IP instead to set the IP address for vLLM processes to communicate with each other.
[HAMI-core Warn(372:139878795960832:utils.c:183)]: get default cuda from (null)
[HAMI-core Msg(372:139878795960832:libvgpu.c:855)]: Initialized
(EngineCore_DP0 pid=372) INFO 03-09 01:15:13 [parallel_state.py:1234] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.42.8.62:38451 backend=nccl
(EngineCore_DP0 pid=372) INFO 03-09 01:15:13 [parallel_state.py:1445] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A
[HAMI-core Msg(372:139878795960832:memory.c:511)]: orig free=71957348352 total=84987740160 limit=5242880000 usage=434110464
(EngineCore_DP0 pid=372) ERROR 03-09 01:15:14 [core.py:1006] EngineCore failed to start.
(EngineCore_DP0 pid=372) ERROR 03-09 01:15:14 [core.py:1006] Traceback (most recent call last):
(EngineCore_DP0 pid=372) ERROR 03-09 01:15:14 [core.py:1006]   File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 996, in run_engine_core
(EngineCore_DP0 pid=372) ERROR 03-09 01:15:14 [core.py:1006]     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore_DP0 pid=372) ERROR 03-09 01:15:14 [core.py:1006]   File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 740, in __init__
(EngineCore_DP0 pid=372) Process EngineCore_DP0:
(EngineCore_DP0 pid=372) ERROR 03-09 01:15:14 [core.py:1006]     super().__init__(
(EngineCore_DP0 pid=372) ERROR 03-09 01:15:14 [core.py:1006]   File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 106, in __init__
(EngineCore_DP0 pid=372) ERROR 03-09 01:15:14 [core.py:1006]     self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=372) ERROR 03-09 01:15:14 [core.py:1006]   File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/executor/abstract.py", line 103, in __init__
(EngineCore_DP0 pid=372) ERROR 03-09 01:15:14 [core.py:1006]     self._init_executor()
(EngineCore_DP0 pid=372) ERROR 03-09 01:15:14 [core.py:1006]   File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/executor/uniproc_executor.py", line 47, in _init_executor
(EngineCore_DP0 pid=372) ERROR 03-09 01:15:14 [core.py:1006]     self.driver_worker.init_device()
(EngineCore_DP0 pid=372) ERROR 03-09 01:15:14 [core.py:1006]   File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/worker/worker_base.py", line 322, in init_device
(EngineCore_DP0 pid=372) ERROR 03-09 01:15:14 [core.py:1006]     self.worker.init_device()  # type: ignore
(EngineCore_DP0 pid=372) ERROR 03-09 01:15:14 [core.py:1006]   File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 252, in init_device
(EngineCore_DP0 pid=372) ERROR 03-09 01:15:14 [core.py:1006]     self.requested_memory = request_memory(init_snapshot, self.cache_config)
(EngineCore_DP0 pid=372) Traceback (most recent call last):
(EngineCore_DP0 pid=372)   File "/root/miniconda3/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=372)     self.run()
(EngineCore_DP0 pid=372)   File "/root/miniconda3/lib/python3.10/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=372)     self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=372)   File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 1010, in run_engine_core
(EngineCore_DP0 pid=372)     raise e
(EngineCore_DP0 pid=372)   File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 996, in run_engine_core
(EngineCore_DP0 pid=372)     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore_DP0 pid=372)   File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 740, in __init__
(EngineCore_DP0 pid=372) ERROR 03-09 01:15:14 [core.py:1006]   File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/worker/utils.py", line 102, in request_memory
(EngineCore_DP0 pid=372)     super().__init__(
(EngineCore_DP0 pid=372) ERROR 03-09 01:15:14 [core.py:1006]     raise ValueError(
(EngineCore_DP0 pid=372)   File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 106, in __init__
(EngineCore_DP0 pid=372) ERROR 03-09 01:15:14 [core.py:1006] ValueError: Free memory on device cuda:0 (4.48/4.88 GiB) on startup is less than desired GPU memory utilization (0.95, 4.64 GiB). Decrease GPU memory utilization or reduce GPU memory used by other processes.
(EngineCore_DP0 pid=372)     self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=372)   File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/executor/abstract.py", line 103, in __init__
(EngineCore_DP0 pid=372)     self._init_executor()
(EngineCore_DP0 pid=372)   File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/executor/uniproc_executor.py", line 47, in _init_executor
(EngineCore_DP0 pid=372)     self.driver_worker.init_device()
(EngineCore_DP0 pid=372)   File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/worker/worker_base.py", line 322, in init_device
(EngineCore_DP0 pid=372)     self.worker.init_device()  # type: ignore
(EngineCore_DP0 pid=372)   File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 252, in init_device
(EngineCore_DP0 pid=372)     self.requested_memory = request_memory(init_snapshot, self.cache_config)
(EngineCore_DP0 pid=372)   File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/worker/utils.py", line 102, in request_memory
(EngineCore_DP0 pid=372)     raise ValueError(
(EngineCore_DP0 pid=372) ValueError: Free memory on device cuda:0 (4.48/4.88 GiB) on startup is less than desired GPU memory utilization (0.95, 4.64 GiB). Decrease GPU memory utilization or reduce GPU memory used by other processes.
[rank0]:[W309 01:15:14.891823851 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
[HAMI-core Msg(372:139878795960832:multiprocess_memory_limit.c:468)]: Calling exit handler 372
(APIServer pid=1) Traceback (most recent call last):
(APIServer pid=1)   File "/root/miniconda3/bin/vllm", line 8, in <module>
(APIServer pid=1)     sys.exit(main())
(APIServer pid=1)   File "/root/miniconda3/lib/python3.10/site-packages/vllm/entrypoints/cli/main.py", line 73, in main
(APIServer pid=1)     args.dispatch_function(args)
(APIServer pid=1)   File "/root/miniconda3/lib/python3.10/site-packages/vllm/entrypoints/cli/serve.py", line 111, in cmd
(APIServer pid=1)     uvloop.run(run_server(args))
(APIServer pid=1)   File "/root/miniconda3/lib/python3.10/site-packages/uvloop/__init__.py", line 69, in run
(APIServer pid=1)     return loop.run_until_complete(wrapper())
(APIServer pid=1)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=1)   File "/root/miniconda3/lib/python3.10/site-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=1)     return await main
(APIServer pid=1)   File "/root/miniconda3/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 457, in run_server
(APIServer pid=1)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=1)   File "/root/miniconda3/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 476, in run_server_worker
(APIServer pid=1)     async with build_async_engine_client(
(APIServer pid=1)   File "/root/miniconda3/lib/python3.10/contextlib.py", line 199, in __aenter__
(APIServer pid=1)     return await anext(self.gen)
(APIServer pid=1)   File "/root/miniconda3/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 96, in build_async_engine_client
(APIServer pid=1)     async with build_async_engine_client_from_engine_args(
(APIServer pid=1)   File "/root/miniconda3/lib/python3.10/contextlib.py", line 199, in __aenter__
(APIServer pid=1)     return await anext(self.gen)
(APIServer pid=1)   File "/root/miniconda3/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 137, in build_async_engine_client_from_engine_args
(APIServer pid=1)     async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=1)   File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/engine/async_llm.py", line 222, in from_vllm_config
(APIServer pid=1)     return cls(
(APIServer pid=1)   File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/engine/async_llm.py", line 148, in __init__
(APIServer pid=1)     self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=1)   File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 124, in make_async_mp_client
(APIServer pid=1)     return AsyncMPClient(*client_args)
(APIServer pid=1)   File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 835, in __init__
(APIServer pid=1)     super().__init__(
(APIServer pid=1)   File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 490, in __init__
(APIServer pid=1)     with launch_core_engines(vllm_config, executor_class, log_stats) as (
(APIServer pid=1)   File "/root/miniconda3/lib/python3.10/contextlib.py", line 142, in __exit__
(APIServer pid=1)     next(self.gen)
(APIServer pid=1)   File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/engine/utils.py", line 925, in launch_core_engines
(APIServer pid=1)     wait_for_engine_startup(
(APIServer pid=1)   File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/engine/utils.py", line 984, in wait_for_engine_startup
(APIServer pid=1)     raise RuntimeError(
(APIServer pid=1) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
[HAMI-core Msg(1:140455094399488:multiprocess_memory_limit.c:468)]: Calling exit handler 1
```
We use HAMI to allocate memory resources.
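The numbers in the failure are internally consistent with the HAMI limit: the `limit=5242880000` bytes in the HAMI-core log line is the 4.88 GiB total that vLLM sees, and the startup check requires free memory ≥ utilization × total. A quick sanity check of the arithmetic (using the 0.95 utilization from the ValueError message, which notably differs from the 0.85 in the YAML):

```python
GIB = 1024 ** 3

hami_limit_bytes = 5_242_880_000         # "limit=" from the HAMI-core log line
total_gib = hami_limit_bytes / GIB       # ≈ 4.88 GiB, matches the error message
free_gib = 4.48                          # free memory reported at startup
utilization = 0.95                       # utilization from the ValueError

requested_gib = utilization * total_gib  # ≈ 4.64 GiB, matches the error message
print(requested_gib > free_gib)          # True: the startup check fails
```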
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.