Description
Your current environment
vllm 0.16.0
🐛 Describe the bug
First YAML config:

```yaml
model: "/workspace/KaLM-embedding-multilingual-mini-instruct-v2.5"
convert: embed
gpu-memory-utilization: 0.85
served-model-name: KaLM-embedding-multilingual-mini-instruct-v2.5
max-model-len: 8000
port: 8000
max_num_batched_tokens: 20000
# hf-overrides: {"matryoshka_dimensions": [896, 512, 256, 128, 64, 32]}
max_num_seqs: 16
block_size: 16
host: "0.0.0.0"
```

This runs OK in 3 GB.
Second YAML config (identical except the `hf-overrides` line is uncommented):

```yaml
model: "/workspace/KaLM-embedding-multilingual-mini-instruct-v2.5"
convert: embed
gpu-memory-utilization: 0.85
served-model-name: KaLM-embedding-multilingual-mini-instruct-v2.5
max-model-len: 8000
port: 8000
max_num_batched_tokens: 20000
hf-overrides: {"matryoshka_dimensions": [896, 512, 256, 128, 64, 32]}
max_num_seqs: 16
block_size: 16
host: "0.0.0.0"
```

This fails in 3 GB but runs OK in 5 GB.
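For context on what the `matryoshka_dimensions` override advertises: Matryoshka-style embeddings let a client request a truncated vector (e.g. 256 of 896 dimensions), conceptually by keeping a prefix of the full embedding and re-normalizing it. A minimal sketch of that idea (the helper name is hypothetical, not vLLM's API):

```python
import math

def truncate_embedding(vec: list[float], dim: int) -> list[float]:
    """Matryoshka-style truncation: keep the first `dim` components
    of the full embedding and re-normalize to unit length.
    (Illustrative helper only, not part of vLLM.)"""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

# e.g. a 4-d vector truncated to its first 2 components
print(truncate_embedding([3.0, 4.0, 0.0, 0.0], 2))  # [0.6, 0.8]
```

Each value in `matryoshka_dimensions` ([896, 512, 256, 128, 64, 32] above) is one such allowed truncation size.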
Error log:

```text
(EngineCore_DP0 pid=372) INFO 03-09 01:15:12 [core.py:97] Initializing a V1 LLM engine (v0.16.0) with config: model='/workspace/KaLM-embedding-multilingual-mini-instruct-v2.5', speculative_config=None, tokenizer='/workspace/KaLM-embedding-multilingual-mini-instruct-v2.5', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=8000, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=KaLM-embedding-multilingual-mini-instruct-v2.5, enable_prefix_caching=False, enable_chunked_prefill=False, pooler_config=PoolerConfig(pooling_type=None, seq_pooling_type='MEAN', tok_pooling_type='ALL', use_activation=True, dimensions=None, enable_chunked_processing=None, max_embed_len=None, logit_bias=None, step_tag_id=None, returned_token_ids=None), compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [20000], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.PIECEWISE: 1>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': True, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False, 'fuse_act_padding': False}, 'max_cudagraph_capture_size': 32, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore_DP0 pid=372) WARNING 03-09 01:15:12 [network_utils.py:36] The environment variable HOST_IP is deprecated and ignored, as it is often used by Docker and other software to interact with the container's network stack. Please use VLLM_HOST_IP instead to set the IP address for vLLM processes to communicate with each other.
[HAMI-core Warn(372:139878795960832:utils.c:183)]: get default cuda from (null)
[HAMI-core Msg(372:139878795960832:libvgpu.c:855)]: Initialized
(EngineCore_DP0 pid=372) INFO 03-09 01:15:13 [parallel_state.py:1234] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.42.8.62:38451 backend=nccl
(EngineCore_DP0 pid=372) INFO 03-09 01:15:13 [parallel_state.py:1445] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A
[HAMI-core Msg(372:139878795960832:memory.c:511)]: orig free=71957348352 total=84987740160 limit=5242880000 usage=434110464
(EngineCore_DP0 pid=372) ERROR 03-09 01:15:14 [core.py:1006] EngineCore failed to start.
(EngineCore_DP0 pid=372) ERROR 03-09 01:15:14 [core.py:1006] Traceback (most recent call last):
(EngineCore_DP0 pid=372) ERROR 03-09 01:15:14 [core.py:1006]   File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 996, in run_engine_core
(EngineCore_DP0 pid=372) ERROR 03-09 01:15:14 [core.py:1006]     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore_DP0 pid=372) ERROR 03-09 01:15:14 [core.py:1006]   File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 740, in __init__
(EngineCore_DP0 pid=372) Process EngineCore_DP0:
(EngineCore_DP0 pid=372) ERROR 03-09 01:15:14 [core.py:1006]     super().__init__(
(EngineCore_DP0 pid=372) ERROR 03-09 01:15:14 [core.py:1006]   File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 106, in __init__
(EngineCore_DP0 pid=372) ERROR 03-09 01:15:14 [core.py:1006]     self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=372) ERROR 03-09 01:15:14 [core.py:1006]   File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/executor/abstract.py", line 103, in __init__
(EngineCore_DP0 pid=372) ERROR 03-09 01:15:14 [core.py:1006]     self._init_executor()
(EngineCore_DP0 pid=372) ERROR 03-09 01:15:14 [core.py:1006]   File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/executor/uniproc_executor.py", line 47, in _init_executor
(EngineCore_DP0 pid=372) ERROR 03-09 01:15:14 [core.py:1006]     self.driver_worker.init_device()
(EngineCore_DP0 pid=372) ERROR 03-09 01:15:14 [core.py:1006]   File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/worker/worker_base.py", line 322, in init_device
(EngineCore_DP0 pid=372) ERROR 03-09 01:15:14 [core.py:1006]     self.worker.init_device()  # type: ignore
(EngineCore_DP0 pid=372) ERROR 03-09 01:15:14 [core.py:1006]   File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 252, in init_device
(EngineCore_DP0 pid=372) ERROR 03-09 01:15:14 [core.py:1006]     self.requested_memory = request_memory(init_snapshot, self.cache_config)
(EngineCore_DP0 pid=372) Traceback (most recent call last):
(EngineCore_DP0 pid=372)   File "/root/miniconda3/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=372)     self.run()
(EngineCore_DP0 pid=372)   File "/root/miniconda3/lib/python3.10/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=372)     self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=372)   File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 1010, in run_engine_core
(EngineCore_DP0 pid=372)     raise e
(EngineCore_DP0 pid=372)   File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 996, in run_engine_core
(EngineCore_DP0 pid=372)     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore_DP0 pid=372)   File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 740, in __init__
(EngineCore_DP0 pid=372) ERROR 03-09 01:15:14 [core.py:1006]   File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/worker/utils.py", line 102, in request_memory
(EngineCore_DP0 pid=372)     super().__init__(
(EngineCore_DP0 pid=372) ERROR 03-09 01:15:14 [core.py:1006]     raise ValueError(
(EngineCore_DP0 pid=372)   File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 106, in __init__
(EngineCore_DP0 pid=372) ERROR 03-09 01:15:14 [core.py:1006] ValueError: Free memory on device cuda:0 (4.48/4.88 GiB) on startup is less than desired GPU memory utilization (0.95, 4.64 GiB). Decrease GPU memory utilization or reduce GPU memory used by other processes.
(EngineCore_DP0 pid=372)     self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=372)   File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/executor/abstract.py", line 103, in __init__
(EngineCore_DP0 pid=372)     self._init_executor()
(EngineCore_DP0 pid=372)   File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/executor/uniproc_executor.py", line 47, in _init_executor
(EngineCore_DP0 pid=372)     self.driver_worker.init_device()
(EngineCore_DP0 pid=372)   File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/worker/worker_base.py", line 322, in init_device
(EngineCore_DP0 pid=372)     self.worker.init_device()  # type: ignore
(EngineCore_DP0 pid=372)   File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 252, in init_device
(EngineCore_DP0 pid=372)     self.requested_memory = request_memory(init_snapshot, self.cache_config)
(EngineCore_DP0 pid=372)   File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/worker/utils.py", line 102, in request_memory
(EngineCore_DP0 pid=372)     raise ValueError(
(EngineCore_DP0 pid=372) ValueError: Free memory on device cuda:0 (4.48/4.88 GiB) on startup is less than desired GPU memory utilization (0.95, 4.64 GiB). Decrease GPU memory utilization or reduce GPU memory used by other processes.
[rank0]:[W309 01:15:14.891823851 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
[HAMI-core Msg(372:139878795960832:multiprocess_memory_limit.c:468)]: Calling exit handler 372
(APIServer pid=1) Traceback (most recent call last):
(APIServer pid=1)   File "/root/miniconda3/bin/vllm", line 8, in <module>
(APIServer pid=1)     sys.exit(main())
(APIServer pid=1)   File "/root/miniconda3/lib/python3.10/site-packages/vllm/entrypoints/cli/main.py", line 73, in main
(APIServer pid=1)     args.dispatch_function(args)
(APIServer pid=1)   File "/root/miniconda3/lib/python3.10/site-packages/vllm/entrypoints/cli/serve.py", line 111, in cmd
(APIServer pid=1)     uvloop.run(run_server(args))
(APIServer pid=1)   File "/root/miniconda3/lib/python3.10/site-packages/uvloop/__init__.py", line 69, in run
(APIServer pid=1)     return loop.run_until_complete(wrapper())
(APIServer pid=1)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=1)   File "/root/miniconda3/lib/python3.10/site-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=1)     return await main
(APIServer pid=1)   File "/root/miniconda3/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 457, in run_server
(APIServer pid=1)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=1)   File "/root/miniconda3/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 476, in run_server_worker
(APIServer pid=1)     async with build_async_engine_client(
(APIServer pid=1)   File "/root/miniconda3/lib/python3.10/contextlib.py", line 199, in __aenter__
(APIServer pid=1)     return await anext(self.gen)
(APIServer pid=1)   File "/root/miniconda3/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 96, in build_async_engine_client
(APIServer pid=1)     async with build_async_engine_client_from_engine_args(
(APIServer pid=1)   File "/root/miniconda3/lib/python3.10/contextlib.py", line 199, in __aenter__
(APIServer pid=1)     return await anext(self.gen)
(APIServer pid=1)   File "/root/miniconda3/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 137, in build_async_engine_client_from_engine_args
(APIServer pid=1)     async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=1)   File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/engine/async_llm.py", line 222, in from_vllm_config
(APIServer pid=1)     return cls(
(APIServer pid=1)   File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/engine/async_llm.py", line 148, in __init__
(APIServer pid=1)     self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=1)   File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 124, in make_async_mp_client
(APIServer pid=1)     return AsyncMPClient(*client_args)
(APIServer pid=1)   File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 835, in __init__
(APIServer pid=1)     super().__init__(
(APIServer pid=1)   File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 490, in __init__
(APIServer pid=1)     with launch_core_engines(vllm_config, executor_class, log_stats) as (
(APIServer pid=1)   File "/root/miniconda3/lib/python3.10/contextlib.py", line 142, in __exit__
(APIServer pid=1)     next(self.gen)
(APIServer pid=1)   File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/engine/utils.py", line 925, in launch_core_engines
(APIServer pid=1)     wait_for_engine_startup(
(APIServer pid=1)   File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/engine/utils.py", line 984, in wait_for_engine_startup
(APIServer pid=1)     raise RuntimeError(
(APIServer pid=1) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
[HAMI-core Msg(1:140455094399488:multiprocess_memory_limit.c:468)]: Calling exit handler 1
```
We use HAMI to allocate memory resources.
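The numbers in the failure are internally consistent with the HAMI limit: the `limit=5242880000` bytes in the HAMI-core log line is the 4.88 GiB total that vLLM sees, and the startup check requires free memory ≥ utilization × total. A quick sanity check of the arithmetic (using the 0.95 utilization from the ValueError message, which notably differs from the 0.85 in the YAML):

```python
GIB = 1024 ** 3

hami_limit_bytes = 5_242_880_000         # "limit=" from the HAMI-core log line
total_gib = hami_limit_bytes / GIB       # ≈ 4.88 GiB, matches the error message
free_gib = 4.48                          # free memory reported at startup
utilization = 0.95                       # utilization from the ValueError

requested_gib = utilization * total_gib  # ≈ 4.64 GiB, matches the error message
print(requested_gib > free_gib)          # True: the startup check fails
```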
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.