
Trouble with multiple GPUs: GPU options impose --ntasks-per-gpu=1 even when not specified #316

@eloualiche


Preamble

Versions

$ snakemake --version
$ uv tool run --from snakemake python -c "import importlib.metadata; print(f'snakemake-executor-plugin-slurm: {importlib.metadata.version(\"snakemake-executor-plugin-slurm\")}')"
snakemake-executor-plugin-slurm: 1.3.6
$ sinfo --version
slurm 23.11.8

Description

The Slurm executor adds an --ntasks-per-gpu=1 option by default, and I cannot find a way to disable it.
This causes problems for jobs submitted with two GPUs.

A simple, non-breaking fix could be to accept a sentinel value that suppresses the option. The plugin currently emits

call += f" --ntasks-per-gpu={job.resources.get('tasks', 1)}"

which could become something like:

if gpu_job:
    ntasks_per_gpu_val = job.resources.get('ntasks_per_gpu', job.resources.get('tasks', 1))
    if ntasks_per_gpu_val != 0:  # 0 (or whatever sentinel fits the plugin's conventions) suppresses the flag
        call += f" --ntasks-per-gpu={ntasks_per_gpu_val}"
else:
    call += f" --ntasks={job.resources.get('tasks', 1)}"

This is just a sketch, as I don't know enough about how this plugin handles resource flags and related conventions.
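With such a sentinel in place, a rule that needs both GPUs visible to a single task could opt out explicitly. A purely hypothetical sketch (it assumes the 0-sentinel above were adopted; the rule name and output file are made up, while the partition/GRES/account values are the ones from my setup below):

rule GPU_SMOKE_TEST:
    output: "output/gpu_smoke_test.txt"
    resources: tasks=1, ntasks_per_gpu=0, cpus_per_gpu=2, mem_mb=8000, runtime=10, slurm_partition="preempt-gpu,msigpu", gres="gpu:a40:2", slurm_account="eloualic"
    shell: "nvidia-smi -L > {output}; echo CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES >> {output}"

If the flag were suppressed, I would expect nvidia-smi -L to list both A40s from a single task.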

Below are the logs.

The Rule

rule TEST_VLLM_10K:
    output: "output/JSON/structured_10k_analysis.json"
    params:
        model="mistralai/Mistral-Nemo-Instruct-2407",
        cik=66740,
        daterange=["2000-01-01", "2002-01-01"]
    resources: jobs=1, nodes=1, ntasks=1, tasks=1, cpus_per_gpu=2, mem_mb=64000, tmp=32000, slurm_partition="preempt-gpu,msigpu", gres="gpu:a40:2", runtime=30, slurm_account="eloualic"
    log: "log/TEST_VLLM_10K.log"
    shell: """
    echo "=== Job Started: $(date) ===" &> {log}
    source  {python_venv}/bin/activate  # activate uv python env    
    echo "=== Slurm GPU Allocation Debug ===" &>> {log}
    echo "SLURM_JOB_GPUS: ${{SLURM_JOB_GPUS:-'not set'}}" &>> {log}
    echo "SLURM_STEP_GPUS: ${{SLURM_STEP_GPUS:-'not set'}}" &>> {log}
    echo "SLURM_GPUS_ON_NODE: ${{SLURM_GPUS_ON_NODE:-'not set'}}" &>> {log}
    echo "SLURM_JOB_ID: ${{SLURM_JOB_ID:-'not set'}}" &>> {log}
    echo "SLURM_NODELIST: ${{SLURM_NODELIST:-'not set'}}" &>> {log}

    echo "=== All GPUs on this node ===" &>> {log}
    nvidia-smi -L &>> {log}
    nvidia-smi --query-gpu=index,name,memory.total,memory.used --format=csv &>> {log}

    # Test 1: Default (what Slurm set)
    echo "Test 1 - Default CUDA_VISIBLE_DEVICES: ${{CUDA_VISIBLE_DEVICES:-'not set'}}" &>> {log}
    python -c "import torch; print(f'Test 1 torch views: {{torch.cuda.device_count()}} GPUs')" &>> {log}
    echo "=== GPU Debug Complete - NOT starting VLLM yet ===" &>> {log}

    echo "=== Starting VLLM with all GPUs: $(date) ===" &>> {log}
    uv run --project {python_project} python -m vllm.entrypoints.openai.api_server --model {params.model} --port 8000 --host 0.0.0.0 --max-model-len 128000 --tensor-parallel-size 2 --gpu-memory-utilization 0.9 &>> {log} 

    """

Snakemake execution

I executed the rule with
snakemake --executor slurm -j1 -R TEST_VLLM_10K --verbose

Log

=== All GPUs on this node ===
=== Slurm GPU Allocation Debug ===
SLURM_JOB_GPUS: 0,2
SLURM_STEP_GPUS: 0
SLURM_GPUS_ON_NODE: 2
SLURM_JOB_ID: 36160287
SLURM_NODELIST: agc03
=== All GPUs on this node ===
GPU 0: NVIDIA A40 (UUID: GPU-8bc7ea13-3b8f-69ea-6322-0c6cb001f22a)
GPU 0: NVIDIA A40 (UUID: GPU-9b70f8e4-5777-df04-dc25-ed7316d3335f)
index, name, memory.total [MiB], memory.used [MiB]
0, NVIDIA A40, 46068 MiB, 1 MiB
Test 1 - Default CUDA_VISIBLE_DEVICES: 0
index, name, memory.total [MiB], memory.used [MiB]
0, NVIDIA A40, 46068 MiB, 1 MiB
Test 1 - Default CUDA_VISIBLE_DEVICES: 0
Test 1 torch views: 1 GPUs
Test 1 torch views: 1 GPUs
=== GPU Debug Complete - NOT starting VLLM yet ===
=== Starting VLLM with all GPUs: Thu May 29 23:31:41 CDT 2025 ===
=== GPU Debug Complete - NOT starting VLLM yet ===
=== Starting VLLM with all GPUs: Thu May 29 23:31:41 CDT 2025 ===
INFO 05-29 23:31:52 [__init__.py:239] Automatically detected platform cuda.
INFO 05-29 23:31:52 [__init__.py:239] Automatically detected platform cuda.
INFO 05-29 23:31:56 [api_server.py:1043] vLLM API server version 0.8.5.post1
INFO 05-29 23:31:56 [api_server.py:1044] args: Namespace(host='0.0.0.0', port=8000, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=['*'], allowed_method
INFO 05-29 23:31:56 [api_server.py:1043] vLLM API server version 0.8.5.post1
INFO 05-29 23:31:56 [api_server.py:1044] args: Namespace(host='0.0.0.0', port=8000, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=['*'], allowed_method
INFO 05-29 23:32:07 [config.py:717] This model supports multiple tasks: {'classify', 'score', 'generate', 'embed', 'reward'}. Defaulting to 'generate'.
INFO 05-29 23:32:07 [config.py:717] This model supports multiple tasks: {'generate', 'reward', 'score', 'classify', 'embed'}. Defaulting to 'generate'.
INFO 05-29 23:32:07 [config.py:1770] Defaulting to use ray for distributed inference
INFO 05-29 23:32:07 [config.py:2003] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 05-29 23:32:07 [config.py:1770] Defaulting to use ray for distributed inference
INFO 05-29 23:32:07 [config.py:2003] Chunked prefill is enabled with max_num_batched_tokens=2048.
/scratch.global/eloualic/llm-in-finance/config/python_llm_env/.venv/lib/python3.12/site-packages/vllm/transformers_utils/tokenizer_group.py:23: FutureWarning: It is strongly recommended to run mistral model
  self.tokenizer = get_tokenizer(self.tokenizer_id, **tokenizer_config)
/scratch.global/eloualic/llm-in-finance/config/python_llm_env/.venv/lib/python3.12/site-packages/vllm/transformers_utils/tokenizer_group.py:23: FutureWarning: It is strongly recommended to run mistral model
  self.tokenizer = get_tokenizer(self.tokenizer_id, **tokenizer_config)
INFO 05-29 23:32:15 [__init__.py:239] Automatically detected platform cuda.
INFO 05-29 23:32:15 [__init__.py:239] Automatically detected platform cuda.
INFO 05-29 23:32:19 [core.py:58] Initializing a V1 LLM engine (v0.8.5.post1) with config: model='mistralai/Mistral-Nemo-Instruct-2407', speculative_config=None, tokenizer='mistralai/Mistral-Nemo-Instruct-24
INFO 05-29 23:32:19 [core.py:58] Initializing a V1 LLM engine (v0.8.5.post1) with config: model='mistralai/Mistral-Nemo-Instruct-2407', speculative_config=None, tokenizer='mistralai/Mistral-Nemo-Instruct-24
2025-05-29 23:32:23,970        INFO worker.py:1888 -- Started a local Ray instance.
2025-05-29 23:32:23,979        INFO worker.py:1888 -- Started a local Ray instance.

What seems to happen is that the job step runs the shell code twice, in two tasks, each with access to a different GPU (note the two distinct UUIDs, both reported as GPU 0). This is consistent with --ntasks-per-gpu=1 combined with --gres=gpu:a40:2: Slurm starts one task per allocated GPU, i.e. two tasks, each bound to a single device.
Torch therefore only ever sees one GPU per process, so the memory of the two cards is never pooled.
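If it helps triage, the duplication can also be confirmed straight from the Slurm step environment by adding a few lines to the rule's shell block; SLURM_NTASKS, SLURM_PROCID and SLURM_LOCALID are standard Slurm variables, and with --ntasks-per-gpu=1 plus two GPUs I would expect two log entries carrying SLURM_PROCID 0 and 1:

    echo "SLURM_NTASKS:  ${{SLURM_NTASKS:-'not set'}}"  &>> {log}
    echo "SLURM_PROCID:  ${{SLURM_PROCID:-'not set'}}"  &>> {log}
    echo "SLURM_LOCALID: ${{SLURM_LOCALID:-'not set'}}" &>> {log}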

Manual sbatch execution

I copy-pasted the generated command from Snakemake's verbose log, removing only the --ntasks-per-gpu=1 option:

sbatch --parsable --job-name f7d8b643-6865-4907-81d8-da0ca1357747 --output "/scratch.global/eloualic/llm-in-finance/llm_testing/.snakemake/slurm_logs/rule_TEST_VLLM_10K/%j.log" --export=ALL --comment "TEST"  -A 'eloualic'  -p preempt-gpu,msigpu -t 30 --mem 64000 --nodes=1 --cpus-per-gpu=2 -D '/scratch.global/eloualic/llm-in-finance/llm_testing' --gres=gpu:a40:2 --wrap="/home/eloualic/eloualic/.local/uv/tools/snakemake/bin/python -m snakemake --snakefile '/scratch.global/eloualic/llm-in-finance/llm_testing/Snakefile' --target-jobs 'TEST_VLLM_10K:' --allowed-rules TEST_VLLM_10K --cores 'all' --attempt 1 --force-use-threads  --resources 'jobs=1' 'nodes=1' 'ntasks=1' 'tasks=1' 'cpus_per_gpu=2' 'mem_mb=64000' 'mem_mib=61036' 'tmp=32000' --wait-for-files '/scratch.global/eloualic/llm-in-finance/llm_testing/.snakemake/tmp.14gj3wul' 'src/test_ollama_api.jl' 'src/jl_routines/VLLMInterface.jl' 'src/jl_routines/M_PULL10K.jl' --force --target-files-omit-workdir-adjustment --max-inventory-time 0 --nocolor --notemp --no-hooks --nolock --ignore-incomplete --verbose  --rerun-triggers params software-env mtime input code --conda-frontend 'conda' --shared-fs-usage sources storage-local-copies software-deployment persistence input-output source-cache --wrapper-prefix 'https://github.com/snakemake/snakemake-wrappers/raw/' --latency-wait 5 --scheduler 'greedy' --local-storage-prefix base64//LnNuYWtlbWFrZS9zdG9yYWdl --scheduler-solver-path '/home/eloualic/eloualic/.local/uv/tools/snakemake/bin' --default-resources base64//dG1wZGlyPXN5c3RlbV90bXBkaXI= --executor slurm-jobstep --jobs 1 --mode 'remote'"
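For clarity, the only change relative to the plugin-generated call is dropping that single flag; schematically (the flag placement below is illustrative, not the plugin's exact ordering):

    --gres=gpu:a40:2 --ntasks-per-gpu=1    # as generated by the plugin
    --gres=gpu:a40:2                       # as submitted manually here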

The vLLM server started and both GPUs showed up together.

=== Job Started: Thu May 29 23:36:00 CDT 2025 ===
=== Slurm GPU Allocation Debug ===
SLURM_JOB_GPUS: 0,2
SLURM_STEP_GPUS: 0,2
SLURM_GPUS_ON_NODE: 2
SLURM_JOB_ID: 36160324
SLURM_NODELIST: agc03
=== All GPUs on this node ===
GPU 0: NVIDIA A40 (UUID: GPU-9b70f8e4-5777-df04-dc25-ed7316d3335f)
GPU 1: NVIDIA A40 (UUID: GPU-8bc7ea13-3b8f-69ea-6322-0c6cb001f22a)
index, name, memory.total [MiB], memory.used [MiB]
0, NVIDIA A40, 46068 MiB, 1 MiB
1, NVIDIA A40, 46068 MiB, 1 MiB
Test 1 - Default CUDA_VISIBLE_DEVICES: 0,1
Test 1 torch views: 2 GPUs
=== GPU Debug Complete - NOT starting VLLM yet ===
=== Starting VLLM with all GPUs: Thu May 29 23:36:02 CDT 2025 ===
INFO 05-29 23:36:11 [__init__.py:239] Automatically detected platform cuda.
INFO 05-29 23:36:14 [api_server.py:1043] vLLM API server version 0.8.5.post1
INFO 05-29 23:36:14 [api_server.py:1044] args: Namespace(host='0.0.0.0', port=8000, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=No
INFO 05-29 23:36:26 [config.py:717] This model supports multiple tasks: {'score', 'classify', 'generate', 'reward', 'embed'}. Defaulting to 'generate'.
INFO 05-29 23:36:26 [config.py:1770] Defaulting to use mp for distributed inference
INFO 05-29 23:36:26 [config.py:2003] Chunked prefill is enabled with max_num_batched_tokens=2048.
/scratch.global/eloualic/llm-in-finance/config/python_llm_env/.venv/lib/python3.12/site-packages/vllm/transformers_utils/tokenizer_group.py:23: FutureWarning: It is strongly recommended to run mistral models with `--tokenizer-mode "mistral"` to ens
  self.tokenizer = get_tokenizer(self.tokenizer_id, **tokenizer_config)
INFO 05-29 23:36:34 [__init__.py:239] Automatically detected platform cuda.
INFO 05-29 23:36:37 [core.py:58] Initializing a V1 LLM engine (v0.8.5.post1) with config: model='mistralai/Mistral-Nemo-Instruct-2407', speculative_config=None, tokenizer='mistralai/Mistral-Nemo-Instruct-2407', skip_tokenizer_init=False, tokenizer_
INFO 05-29 23:36:37 [shm_broadcast.py:266] vLLM message queue communication handle: Handle(local_reader_ranks=[0, 1], buffer_handle=(2, 10485760, 10, 'psm_8d0f238d'), local_subscribe_addr='ipc:///tmp/7ab51e1d-62cb-472c-805e-c9de12556940', remote_su
INFO 05-29 23:36:46 [__init__.py:239] Automatically detected platform cuda.
INFO 05-29 23:36:46 [__init__.py:239] Automatically detected platform cuda.
WARNING 05-29 23:36:52 [utils.py:2522] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7fdaf8d246e0>
WARNING 05-29 23:36:52 [utils.py:2522] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f56f61ab380>
(VllmWorker rank=1 pid=3008575) INFO 05-29 23:36:52 [shm_broadcast.py:266] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_361fa885'), local_subscribe_addr='ipc:///tmp/112587c6-3966-4a8c-
(VllmWorker rank=0 pid=3008574) INFO 05-29 23:36:52 [shm_broadcast.py:266] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_aeb12a1f'), local_subscribe_addr='ipc:///tmp/10ffba7d-06d5-4373-
(VllmWorker rank=1 pid=3008575) INFO 05-29 23:36:53 [utils.py:1055] Found nccl from library libnccl.so.2
(VllmWorker rank=1 pid=3008575) INFO 05-29 23:36:53 [pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorker rank=0 pid=3008574) INFO 05-29 23:36:53 [utils.py:1055] Found nccl from library libnccl.so.2
(VllmWorker rank=0 pid=3008574) INFO 05-29 23:36:53 [pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorker rank=1 pid=3008575) INFO 05-29 23:36:54 [custom_all_reduce_utils.py:244] reading GPU P2P access cache from /users/7/eloualic/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
(VllmWorker rank=0 pid=3008574) INFO 05-29 23:36:54 [custom_all_reduce_utils.py:244] reading GPU P2P access cache from /users/7/eloualic/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
(VllmWorker rank=0 pid=3008574) INFO 05-29 23:36:54 [shm_broadcast.py:266] vLLM message queue communication handle: Handle(local_reader_ranks=[1], buffer_handle=(1, 4194304, 6, 'psm_1fc07353'), local_subscribe_addr='ipc:///tmp/3e56dcfa-5ae6-4944-9b
(VllmWorker rank=0 pid=3008574) INFO 05-29 23:36:54 [parallel_state.py:1004] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 0
(VllmWorker rank=0 pid=3008574) INFO 05-29 23:36:54 [cuda.py:221] Using Flash Attention backend on V1 engine.
(VllmWorker rank=0 pid=3008574) WARNING 05-29 23:36:54 [topk_topp_sampler.py:69] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(VllmWorker rank=1 pid=3008575) INFO 05-29 23:36:54 [parallel_state.py:1004] rank 1 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 1
(VllmWorker rank=1 pid=3008575) INFO 05-29 23:36:54 [cuda.py:221] Using Flash Attention backend on V1 engine.
(VllmWorker rank=1 pid=3008575) WARNING 05-29 23:36:54 [topk_topp_sampler.py:69] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(VllmWorker rank=1 pid=3008575) INFO 05-29 23:36:54 [gpu_model_runner.py:1329] Starting to load model mistralai/Mistral-Nemo-Instruct-2407...
(VllmWorker rank=0 pid=3008574) INFO 05-29 23:36:54 [gpu_model_runner.py:1329] Starting to load model mistralai/Mistral-Nemo-Instruct-2407...
(VllmWorker rank=0 pid=3008574) INFO 05-29 23:36:55 [weight_utils.py:265] Using model weights format ['*.safetensors']
(VllmWorker rank=1 pid=3008575) INFO 05-29 23:36:55 [weight_utils.py:265] Using model weights format ['*.safetensors']
(VllmWorker rank=0 pid=3008574)  Loading safetensors checkpoint shards:   0% Completed | 0/5 [00:00<?, ?it/s]
(VllmWorker rank=0 pid=3008574)  Loading safetensors checkpoint shards:  20% Completed | 1/5 [00:00<00:02,  1.83it/s]
(VllmWorker rank=0 pid=3008574)  Loading safetensors checkpoint shards:  40% Completed | 2/5 [00:01<00:02,  1.19it/s]
(VllmWorker rank=0 pid=3008574)  Loading safetensors checkpoint shards:  60% Completed | 3/5 [00:02<00:01,  1.10it/s]
(VllmWorker rank=0 pid=3008574)  Loading safetensors checkpoint shards:  80% Completed | 4/5 [00:03<00:00,  1.07it/s]
(VllmWorker rank=0 pid=3008574)  Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:04<00:00,  1.09it/s]
(VllmWorker rank=0 pid=3008574)  Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:04<00:00,  1.13it/s]
(VllmWorker rank=0 pid=3008574)
(VllmWorker rank=0 pid=3008574) INFO 05-29 23:37:00 [loader.py:458] Loading weights took 4.60 seconds
(VllmWorker rank=1 pid=3008575) INFO 05-29 23:37:00 [loader.py:458] Loading weights took 4.74 seconds
(VllmWorker rank=0 pid=3008574) INFO 05-29 23:37:00 [gpu_model_runner.py:1347] Model loading took 11.4384 GiB and 5.338305 seconds
(VllmWorker rank=1 pid=3008575) INFO 05-29 23:37:00 [gpu_model_runner.py:1347] Model loading took 11.4384 GiB and 5.682200 seconds
(VllmWorker rank=0 pid=3008574) INFO 05-29 23:37:32 [backends.py:420] Using cache directory: /users/7/eloualic/.cache/vllm/torch_compile_cache/7a51309e4c/rank_0_0 for vLLM's torch.compile
(VllmWorker rank=1 pid=3008575) INFO 05-29 23:37:32 [backends.py:420] Using cache directory: /users/7/eloualic/.cache/vllm/torch_compile_cache/7a51309e4c/rank_1_0 for vLLM's torch.compile
(VllmWorker rank=1 pid=3008575) INFO 05-29 23:37:32 [backends.py:430] Dynamo bytecode transform time: 31.41 s
(VllmWorker rank=0 pid=3008574) INFO 05-29 23:37:32 [backends.py:430] Dynamo bytecode transform time: 31.42 s
(VllmWorker rank=0 pid=3008574) INFO 05-29 23:37:52 [backends.py:118] Directly load the compiled graph(s) for shape None from the cache, took 18.803 s
(VllmWorker rank=1 pid=3008575) INFO 05-29 23:37:52 [backends.py:118] Directly load the compiled graph(s) for shape None from the cache, took 18.827 s
(VllmWorker rank=1 pid=3008575) INFO 05-29 23:37:57 [monitor.py:33] torch.compile takes 31.41 s in total
(VllmWorker rank=0 pid=3008574) INFO 05-29 23:37:57 [monitor.py:33] torch.compile takes 31.42 s in total
INFO 05-29 23:37:59 [kv_cache_utils.py:634] GPU KV cache size: 350,128 tokens
INFO 05-29 23:37:59 [kv_cache_utils.py:637] Maximum concurrency for 128,000 tokens per request: 2.74x
INFO 05-29 23:37:59 [kv_cache_utils.py:634] GPU KV cache size: 350,128 tokens
INFO 05-29 23:37:59 [kv_cache_utils.py:637] Maximum concurrency for 128,000 tokens per request: 2.74x
(VllmWorker rank=1 pid=3008575) INFO 05-29 23:38:54 [custom_all_reduce.py:195] Registering 5427 cuda graph addresses
(VllmWorker rank=0 pid=3008574) INFO 05-29 23:38:54 [custom_all_reduce.py:195] Registering 5427 cuda graph addresses
(VllmWorker rank=1 pid=3008575) INFO 05-29 23:38:54 [gpu_model_runner.py:1686] Graph capturing finished in 55 secs, took 0.63 GiB
(VllmWorker rank=0 pid=3008574) INFO 05-29 23:38:54 [gpu_model_runner.py:1686] Graph capturing finished in 55 secs, took 0.63 GiB
INFO 05-29 23:38:54 [core.py:159] init engine (profile, create kv cache, warmup model) took 113.97 seconds
INFO 05-29 23:38:54 [core_client.py:439] Core engine process 0 ready.
INFO 05-29 23:38:54 [api_server.py:1090] Starting vLLM API server on http://0.0.0.0:8000
INFO 05-29 23:38:54 [launcher.py:28] Available routes are:
INFO 05-29 23:38:54 [launcher.py:36] Route: /openapi.json, Methods: GET, HEAD
INFO 05-29 23:38:54 [launcher.py:36] Route: /docs, Methods: GET, HEAD
INFO 05-29 23:38:54 [launcher.py:36] Route: /docs/oauth2-redirect, Methods: GET, HEAD
INFO 05-29 23:38:54 [launcher.py:36] Route: /redoc, Methods: GET, HEAD
INFO 05-29 23:38:54 [launcher.py:36] Route: /health, Methods: GET
INFO 05-29 23:38:54 [launcher.py:36] Route: /load, Methods: GET
INFO 05-29 23:38:54 [launcher.py:36] Route: /ping, Methods: GET, POST
INFO 05-29 23:38:54 [launcher.py:36] Route: /tokenize, Methods: POST
INFO 05-29 23:38:54 [launcher.py:36] Route: /detokenize, Methods: POST
INFO 05-29 23:38:54 [launcher.py:36] Route: /v1/models, Methods: GET
INFO 05-29 23:38:54 [launcher.py:36] Route: /version, Methods: GET
INFO 05-29 23:38:54 [launcher.py:36] Route: /v1/chat/completions, Methods: POST
INFO 05-29 23:38:54 [launcher.py:36] Route: /v1/completions, Methods: POST
INFO 05-29 23:38:54 [launcher.py:36] Route: /v1/embeddings, Methods: POST
INFO 05-29 23:38:54 [launcher.py:36] Route: /pooling, Methods: POST
INFO 05-29 23:38:54 [launcher.py:36] Route: /score, Methods: POST
INFO 05-29 23:38:54 [launcher.py:36] Route: /v1/score, Methods: POST
INFO 05-29 23:38:54 [launcher.py:36] Route: /v1/audio/transcriptions, Methods: POST
INFO 05-29 23:38:54 [launcher.py:36] Route: /rerank, Methods: POST
INFO 05-29 23:38:54 [launcher.py:36] Route: /v1/rerank, Methods: POST
INFO 05-29 23:38:54 [launcher.py:36] Route: /v2/rerank, Methods: POST
INFO 05-29 23:38:54 [launcher.py:36] Route: /invocations, Methods: POST
INFO 05-29 23:38:54 [launcher.py:36] Route: /metrics, Methods: GET
INFO:     Started server process [3008420]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
