Preamble
Versions
$ snakemake --version
$ uv tool run --from snakemake python -c "import importlib.metadata; print(f'snakemake-executor-plugin-slurm: {importlib.metadata.version(\"snakemake-executor-plugin-slurm\")}')"
snakemake-executor-plugin-slurm: 1.3.6
$ sinfo --version
slurm 23.11.8
Description
The Slurm executor adds --ntasks-per-gpu=1 to the sbatch call by default, and I cannot find a way to disable it.
This leads to issues with jobs submitted with 2 GPUs.
A non-breaking fix could be to allow a flag value that disables the option. The plugin currently builds the call with

call += f" --ntasks-per-gpu={job.resources.get('tasks', 1)}"

which could instead become something like:
if gpu_job:
    ntasks_per_gpu_val = job.resources.get('ntasks_per_gpu', job.resources.get('tasks', 1))
    if ntasks_per_gpu_val != 0:  # or whichever sentinel is appropriate to drop the option
        call += f" --ntasks-per-gpu={ntasks_per_gpu_val}"
    else:
        call += f" --ntasks={job.resources.get('tasks', 1)}"
This is just a sketch, as I don't know enough about how this plugin has decided to handle flags, etc.
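For illustration, a rule could then opt out per job by setting the resource to the sentinel value; the ntasks_per_gpu resource, the 0 sentinel, and the rule name below are hypothetical and exist only under the proposal above:

rule example_two_gpu_job:
    resources:
        tasks=1,
        gres="gpu:a40:2",
        ntasks_per_gpu=0  # hypothetical sentinel: tell the plugin not to emit --ntasks-per-gpu
    shell: "..."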
Below are the logs.
The Rule
rule TEST_VLLM_10K:
output: "output/JSON/structured_10k_analysis.json"
params:
model="mistralai/Mistral-Nemo-Instruct-2407",
cik=66740,
daterange=["2000-01-01", "2002-01-01"]
resources: jobs=1, nodes=1, ntasks=1, tasks=1, cpus_per_gpu=2, mem_mb=64000, tmp=32000, slurm_partition="preempt-gpu,msigpu", gres="gpu:a40:2", runtime=30, slurm_account="eloualic"
log: "log/TEST_VLLM_10K.log"
shell: """
echo "=== Job Started: $(date) ===" &> {log}
source {python_venv}/bin/activate # activate uv python env
echo "=== Slurm GPU Allocation Debug ===" &>> {log}
echo "SLURM_JOB_GPUS: ${{SLURM_JOB_GPUS:-'not set'}}" &>> {log}
echo "SLURM_STEP_GPUS: ${{SLURM_STEP_GPUS:-'not set'}}" &>> {log}
echo "SLURM_GPUS_ON_NODE: ${{SLURM_GPUS_ON_NODE:-'not set'}}" &>> {log}
echo "SLURM_JOB_ID: ${{SLURM_JOB_ID:-'not set'}}" &>> {log}
echo "SLURM_NODELIST: ${{SLURM_NODELIST:-'not set'}}" &>> {log}
echo "=== All GPUs on this node ===" &>> {log}
nvidia-smi -L &>> {log}
nvidia-smi --query-gpu=index,name,memory.total,memory.used --format=csv &>> {log}
# Test 1: Default (what Slurm set)
echo "Test 1 - Default CUDA_VISIBLE_DEVICES: ${{CUDA_VISIBLE_DEVICES:-'not set'}}" &>> {log}
python -c "import torch; print(f'Test 1 torch views: {{torch.cuda.device_count()}} GPUs')" &>> {log}
echo "=== GPU Debug Complete - NOT starting VLLM yet ===" &>> {log}
echo "=== Starting VLLM with all GPUs: $(date) ===" &>> {log}
uv run --project {python_project} python -m vllm.entrypoints.openai.api_server --model {params.model} --port 8000 --host 0.0.0.0 --max-model-len 128000 --tensor-parallel-size 2 --gpu-memory-utilization 0.9 &>> {log}
"""
Snakemake execution
I executed the rule with
snakemake --executor slurm -j1 -R TEST_VLLM_10K --verbose
Log
=== All GPUs on this node ===
=== Slurm GPU Allocation Debug ===
SLURM_JOB_GPUS: 0,2
SLURM_STEP_GPUS: 0
SLURM_GPUS_ON_NODE: 2
SLURM_JOB_ID: 36160287
SLURM_NODELIST: agc03
=== All GPUs on this node ===
GPU 0: NVIDIA A40 (UUID: GPU-8bc7ea13-3b8f-69ea-6322-0c6cb001f22a)
GPU 0: NVIDIA A40 (UUID: GPU-9b70f8e4-5777-df04-dc25-ed7316d3335f)
index, name, memory.total [MiB], memory.used [MiB]
0, NVIDIA A40, 46068 MiB, 1 MiB
Test 1 - Default CUDA_VISIBLE_DEVICES: 0
index, name, memory.total [MiB], memory.used [MiB]
0, NVIDIA A40, 46068 MiB, 1 MiB
Test 1 - Default CUDA_VISIBLE_DEVICES: 0
Test 1 torch views: 1 GPUs
Test 1 torch views: 1 GPUs
=== GPU Debug Complete - NOT starting VLLM yet ===
=== Starting VLLM with all GPUs: Thu May 29 23:31:41 CDT 2025 ===
=== GPU Debug Complete - NOT starting VLLM yet ===
=== Starting VLLM with all GPUs: Thu May 29 23:31:41 CDT 2025 ===
INFO 05-29 23:31:52 [__init__.py:239] Automatically detected platform cuda.
INFO 05-29 23:31:52 [__init__.py:239] Automatically detected platform cuda.
INFO 05-29 23:31:56 [api_server.py:1043] vLLM API server version 0.8.5.post1
INFO 05-29 23:31:56 [api_server.py:1044] args: Namespace(host='0.0.0.0', port=8000, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=['*'], allowed_method
INFO 05-29 23:31:56 [api_server.py:1043] vLLM API server version 0.8.5.post1
INFO 05-29 23:31:56 [api_server.py:1044] args: Namespace(host='0.0.0.0', port=8000, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=['*'], allowed_method
INFO 05-29 23:32:07 [config.py:717] This model supports multiple tasks: {'classify', 'score', 'generate', 'embed', 'reward'}. Defaulting to 'generate'.
INFO 05-29 23:32:07 [config.py:717] This model supports multiple tasks: {'generate', 'reward', 'score', 'classify', 'embed'}. Defaulting to 'generate'.
INFO 05-29 23:32:07 [config.py:1770] Defaulting to use ray for distributed inference
INFO 05-29 23:32:07 [config.py:2003] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 05-29 23:32:07 [config.py:1770] Defaulting to use ray for distributed inference
INFO 05-29 23:32:07 [config.py:2003] Chunked prefill is enabled with max_num_batched_tokens=2048.
/scratch.global/eloualic/llm-in-finance/config/python_llm_env/.venv/lib/python3.12/site-packages/vllm/transformers_utils/tokenizer_group.py:23: FutureWarning: It is strongly recommended to run mistral model
  self.tokenizer = get_tokenizer(self.tokenizer_id, **tokenizer_config)
/scratch.global/eloualic/llm-in-finance/config/python_llm_env/.venv/lib/python3.12/site-packages/vllm/transformers_utils/tokenizer_group.py:23: FutureWarning: It is strongly recommended to run mistral model
  self.tokenizer = get_tokenizer(self.tokenizer_id, **tokenizer_config)
INFO 05-29 23:32:15 [__init__.py:239] Automatically detected platform cuda.
INFO 05-29 23:32:15 [__init__.py:239] Automatically detected platform cuda.
INFO 05-29 23:32:19 [core.py:58] Initializing a V1 LLM engine (v0.8.5.post1) with config: model='mistralai/Mistral-Nemo-Instruct-2407', speculative_config=None, tokenizer='mistralai/Mistral-Nemo-Instruct-24
INFO 05-29 23:32:19 [core.py:58] Initializing a V1 LLM engine (v0.8.5.post1) with config: model='mistralai/Mistral-Nemo-Instruct-2407', speculative_config=None, tokenizer='mistralai/Mistral-Nemo-Instruct-24
2025-05-29 23:32:23,970 INFO worker.py:1888 -- Started a local Ray instance.
2025-05-29 23:32:23,979 INFO worker.py:1888 -- Started a local Ray instance.
What seems to happen is that the job runs the code twice, in two task instances, each with access to a different GPU: note that both tasks report "GPU 0", but with different UUIDs.
Torch only ever sees one GPU at a time, which means the memory is never pooled.
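A minimal per-task check along these lines makes the split visible (a sketch; it assumes torch is importable inside the job step and relies only on standard Slurm environment variables): under --ntasks-per-gpu=1 each of the two tasks reports a single GPU, whereas one task holding both GPUs reports two.

import os
import torch

# Report which GPUs this particular Slurm task can address.
print("SLURM_PROCID:", os.environ.get("SLURM_PROCID", "not set"))
print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES", "not set"))
print("torch.cuda.device_count():", torch.cuda.device_count())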
Manual sbatch execution
I copy-pasted the sbatch command from Snakemake's verbose log, removing only the --ntasks-per-gpu=1 option:
sbatch --parsable --job-name f7d8b643-6865-4907-81d8-da0ca1357747 --output "/scratch.global/eloualic/llm-in-finance/llm_testing/.snakemake/slurm_logs/rule_TEST_VLLM_10K/%j.log" --export=ALL --comment "TEST" -A 'eloualic' -p preempt-gpu,msigpu -t 30 --mem 64000 --nodes=1 --cpus-per-gpu=2 -D '/scratch.global/eloualic/llm-in-finance/llm_testing' --gres=gpu:a40:2 --wrap="/home/eloualic/eloualic/.local/uv/tools/snakemake/bin/python -m snakemake --snakefile '/scratch.global/eloualic/llm-in-finance/llm_testing/Snakefile' --target-jobs 'TEST_VLLM_10K:' --allowed-rules TEST_VLLM_10K --cores 'all' --attempt 1 --force-use-threads --resources 'jobs=1' 'nodes=1' 'ntasks=1' 'tasks=1' 'cpus_per_gpu=2' 'mem_mb=64000' 'mem_mib=61036' 'tmp=32000' --wait-for-files '/scratch.global/eloualic/llm-in-finance/llm_testing/.snakemake/tmp.14gj3wul' 'src/test_ollama_api.jl' 'src/jl_routines/VLLMInterface.jl' 'src/jl_routines/M_PULL10K.jl' --force --target-files-omit-workdir-adjustment --max-inventory-time 0 --nocolor --notemp --no-hooks --nolock --ignore-incomplete --verbose --rerun-triggers params software-env mtime input code --conda-frontend 'conda' --shared-fs-usage sources storage-local-copies software-deployment persistence input-output source-cache --wrapper-prefix 'https://github.com/snakemake/snakemake-wrappers/raw/' --latency-wait 5 --scheduler 'greedy' --local-storage-prefix base64//LnNuYWtlbWFrZS9zdG9yYWdl --scheduler-solver-path '/home/eloualic/eloualic/.local/uv/tools/snakemake/bin' --default-resources base64//dG1wZGlyPXN5c3RlbV90bXBkaXI= --executor slurm-jobstep --jobs 1 --mode 'remote'"
The vLLM server started and both GPUs showed up together.
=== Job Started: Thu May 29 23:36:00 CDT 2025 ===
=== Slurm GPU Allocation Debug ===
SLURM_JOB_GPUS: 0,2
SLURM_STEP_GPUS: 0,2
SLURM_GPUS_ON_NODE: 2
SLURM_JOB_ID: 36160324
SLURM_NODELIST: agc03
=== All GPUs on this node ===
GPU 0: NVIDIA A40 (UUID: GPU-9b70f8e4-5777-df04-dc25-ed7316d3335f)
GPU 1: NVIDIA A40 (UUID: GPU-8bc7ea13-3b8f-69ea-6322-0c6cb001f22a)
index, name, memory.total [MiB], memory.used [MiB]
0, NVIDIA A40, 46068 MiB, 1 MiB
1, NVIDIA A40, 46068 MiB, 1 MiB
Test 1 - Default CUDA_VISIBLE_DEVICES: 0,1
Test 1 torch views: 2 GPUs
=== GPU Debug Complete - NOT starting VLLM yet ===
=== Starting VLLM with all GPUs: Thu May 29 23:36:02 CDT 2025 ===
INFO 05-29 23:36:11 [__init__.py:239] Automatically detected platform cuda.
INFO 05-29 23:36:14 [api_server.py:1043] vLLM API server version 0.8.5.post1
INFO 05-29 23:36:14 [api_server.py:1044] args: Namespace(host='0.0.0.0', port=8000, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=No
INFO 05-29 23:36:26 [config.py:717] This model supports multiple tasks: {'score', 'classify', 'generate', 'reward', 'embed'}. Defaulting to 'generate'.
INFO 05-29 23:36:26 [config.py:1770] Defaulting to use mp for distributed inference
INFO 05-29 23:36:26 [config.py:2003] Chunked prefill is enabled with max_num_batched_tokens=2048.
/scratch.global/eloualic/llm-in-finance/config/python_llm_env/.venv/lib/python3.12/site-packages/vllm/transformers_utils/tokenizer_group.py:23: FutureWarning: It is strongly recommended to run mistral models with `--tokenizer-mode "mistral"` to ens
  self.tokenizer = get_tokenizer(self.tokenizer_id, **tokenizer_config)
INFO 05-29 23:36:34 [__init__.py:239] Automatically detected platform cuda.
INFO 05-29 23:36:37 [core.py:58] Initializing a V1 LLM engine (v0.8.5.post1) with config: model='mistralai/Mistral-Nemo-Instruct-2407', speculative_config=None, tokenizer='mistralai/Mistral-Nemo-Instruct-2407', skip_tokenizer_init=False, tokenizer_
INFO 05-29 23:36:37 [shm_broadcast.py:266] vLLM message queue communication handle: Handle(local_reader_ranks=[0, 1], buffer_handle=(2, 10485760, 10, 'psm_8d0f238d'), local_subscribe_addr='ipc:///tmp/7ab51e1d-62cb-472c-805e-c9de12556940', remote_su
INFO 05-29 23:36:46 [__init__.py:239] Automatically detected platform cuda.
INFO 05-29 23:36:46 [__init__.py:239] Automatically detected platform cuda.
WARNING 05-29 23:36:52 [utils.py:2522] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7fdaf8d246e0>
WARNING 05-29 23:36:52 [utils.py:2522] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f56f61ab380>
(VllmWorker rank=1 pid=3008575) INFO 05-29 23:36:52 [shm_broadcast.py:266] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_361fa885'), local_subscribe_addr='ipc:///tmp/112587c6-3966-4a8c-
(VllmWorker rank=0 pid=3008574) INFO 05-29 23:36:52 [shm_broadcast.py:266] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_aeb12a1f'), local_subscribe_addr='ipc:///tmp/10ffba7d-06d5-4373-
(VllmWorker rank=1 pid=3008575) INFO 05-29 23:36:53 [utils.py:1055] Found nccl from library libnccl.so.2
(VllmWorker rank=1 pid=3008575) INFO 05-29 23:36:53 [pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorker rank=0 pid=3008574) INFO 05-29 23:36:53 [utils.py:1055] Found nccl from library libnccl.so.2
(VllmWorker rank=0 pid=3008574) INFO 05-29 23:36:53 [pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorker rank=1 pid=3008575) INFO 05-29 23:36:54 [custom_all_reduce_utils.py:244] reading GPU P2P access cache from /users/7/eloualic/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
(VllmWorker rank=0 pid=3008574) INFO 05-29 23:36:54 [custom_all_reduce_utils.py:244] reading GPU P2P access cache from /users/7/eloualic/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
(VllmWorker rank=0 pid=3008574) INFO 05-29 23:36:54 [shm_broadcast.py:266] vLLM message queue communication handle: Handle(local_reader_ranks=[1], buffer_handle=(1, 4194304, 6, 'psm_1fc07353'), local_subscribe_addr='ipc:///tmp/3e56dcfa-5ae6-4944-9b
(VllmWorker rank=0 pid=3008574) INFO 05-29 23:36:54 [parallel_state.py:1004] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 0
(VllmWorker rank=0 pid=3008574) INFO 05-29 23:36:54 [cuda.py:221] Using Flash Attention backend on V1 engine.
(VllmWorker rank=0 pid=3008574) WARNING 05-29 23:36:54 [topk_topp_sampler.py:69] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(VllmWorker rank=1 pid=3008575) INFO 05-29 23:36:54 [parallel_state.py:1004] rank 1 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 1
(VllmWorker rank=1 pid=3008575) INFO 05-29 23:36:54 [cuda.py:221] Using Flash Attention backend on V1 engine.
(VllmWorker rank=1 pid=3008575) WARNING 05-29 23:36:54 [topk_topp_sampler.py:69] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(VllmWorker rank=1 pid=3008575) INFO 05-29 23:36:54 [gpu_model_runner.py:1329] Starting to load model mistralai/Mistral-Nemo-Instruct-2407...
(VllmWorker rank=0 pid=3008574) INFO 05-29 23:36:54 [gpu_model_runner.py:1329] Starting to load model mistralai/Mistral-Nemo-Instruct-2407...
(VllmWorker rank=0 pid=3008574) INFO 05-29 23:36:55 [weight_utils.py:265] Using model weights format ['*.safetensors']
(VllmWorker rank=1 pid=3008575) INFO 05-29 23:36:55 [weight_utils.py:265] Using model weights format ['*.safetensors']
(VllmWorker rank=0 pid=3008574) Loading safetensors checkpoint shards: 0% Completed | 0/5 [00:00<?, ?it/s]
(VllmWorker rank=0 pid=3008574) Loading safetensors checkpoint shards: 20% Completed | 1/5 [00:00<00:02, 1.83it/s]
(VllmWorker rank=0 pid=3008574) Loading safetensors checkpoint shards: 40% Completed | 2/5 [00:01<00:02, 1.19it/s]
(VllmWorker rank=0 pid=3008574) Loading safetensors checkpoint shards: 60% Completed | 3/5 [00:02<00:01, 1.10it/s]
(VllmWorker rank=0 pid=3008574) Loading safetensors checkpoint shards: 80% Completed | 4/5 [00:03<00:00, 1.07it/s]
(VllmWorker rank=0 pid=3008574) Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:04<00:00, 1.09it/s]
(VllmWorker rank=0 pid=3008574) Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:04<00:00, 1.13it/s]
(VllmWorker rank=0 pid=3008574)
(VllmWorker rank=0 pid=3008574) INFO 05-29 23:37:00 [loader.py:458] Loading weights took 4.60 seconds
(VllmWorker rank=1 pid=3008575) INFO 05-29 23:37:00 [loader.py:458] Loading weights took 4.74 seconds
(VllmWorker rank=0 pid=3008574) INFO 05-29 23:37:00 [gpu_model_runner.py:1347] Model loading took 11.4384 GiB and 5.338305 seconds
(VllmWorker rank=1 pid=3008575) INFO 05-29 23:37:00 [gpu_model_runner.py:1347] Model loading took 11.4384 GiB and 5.682200 seconds
(VllmWorker rank=0 pid=3008574) INFO 05-29 23:37:32 [backends.py:420] Using cache directory: /users/7/eloualic/.cache/vllm/torch_compile_cache/7a51309e4c/rank_0_0 for vLLM's torch.compile
(VllmWorker rank=1 pid=3008575) INFO 05-29 23:37:32 [backends.py:420] Using cache directory: /users/7/eloualic/.cache/vllm/torch_compile_cache/7a51309e4c/rank_1_0 for vLLM's torch.compile
(VllmWorker rank=1 pid=3008575) INFO 05-29 23:37:32 [backends.py:430] Dynamo bytecode transform time: 31.41 s
(VllmWorker rank=0 pid=3008574) INFO 05-29 23:37:32 [backends.py:430] Dynamo bytecode transform time: 31.42 s
(VllmWorker rank=0 pid=3008574) INFO 05-29 23:37:52 [backends.py:118] Directly load the compiled graph(s) for shape None from the cache, took 18.803 s
(VllmWorker rank=1 pid=3008575) INFO 05-29 23:37:52 [backends.py:118] Directly load the compiled graph(s) for shape None from the cache, took 18.827 s
(VllmWorker rank=1 pid=3008575) INFO 05-29 23:37:57 [monitor.py:33] torch.compile takes 31.41 s in total
(VllmWorker rank=0 pid=3008574) INFO 05-29 23:37:57 [monitor.py:33] torch.compile takes 31.42 s in total
INFO 05-29 23:37:59 [kv_cache_utils.py:634] GPU KV cache size: 350,128 tokens
INFO 05-29 23:37:59 [kv_cache_utils.py:637] Maximum concurrency for 128,000 tokens per request: 2.74x
INFO 05-29 23:37:59 [kv_cache_utils.py:634] GPU KV cache size: 350,128 tokens
INFO 05-29 23:37:59 [kv_cache_utils.py:637] Maximum concurrency for 128,000 tokens per request: 2.74x
(VllmWorker rank=1 pid=3008575) INFO 05-29 23:38:54 [custom_all_reduce.py:195] Registering 5427 cuda graph addresses
(VllmWorker rank=0 pid=3008574) INFO 05-29 23:38:54 [custom_all_reduce.py:195] Registering 5427 cuda graph addresses
(VllmWorker rank=1 pid=3008575) INFO 05-29 23:38:54 [gpu_model_runner.py:1686] Graph capturing finished in 55 secs, took 0.63 GiB
(VllmWorker rank=0 pid=3008574) INFO 05-29 23:38:54 [gpu_model_runner.py:1686] Graph capturing finished in 55 secs, took 0.63 GiB
INFO 05-29 23:38:54 [core.py:159] init engine (profile, create kv cache, warmup model) took 113.97 seconds
INFO 05-29 23:38:54 [core_client.py:439] Core engine process 0 ready.
INFO 05-29 23:38:54 [api_server.py:1090] Starting vLLM API server on http://0.0.0.0:8000
INFO 05-29 23:38:54 [launcher.py:28] Available routes are:
INFO 05-29 23:38:54 [launcher.py:36] Route: /openapi.json, Methods: GET, HEAD
INFO 05-29 23:38:54 [launcher.py:36] Route: /docs, Methods: GET, HEAD
INFO 05-29 23:38:54 [launcher.py:36] Route: /docs/oauth2-redirect, Methods: GET, HEAD
INFO 05-29 23:38:54 [launcher.py:36] Route: /redoc, Methods: GET, HEAD
INFO 05-29 23:38:54 [launcher.py:36] Route: /health, Methods: GET
INFO 05-29 23:38:54 [launcher.py:36] Route: /load, Methods: GET
INFO 05-29 23:38:54 [launcher.py:36] Route: /ping, Methods: GET, POST
INFO 05-29 23:38:54 [launcher.py:36] Route: /tokenize, Methods: POST
INFO 05-29 23:38:54 [launcher.py:36] Route: /detokenize, Methods: POST
INFO 05-29 23:38:54 [launcher.py:36] Route: /v1/models, Methods: GET
INFO 05-29 23:38:54 [launcher.py:36] Route: /version, Methods: GET
INFO 05-29 23:38:54 [launcher.py:36] Route: /v1/chat/completions, Methods: POST
INFO 05-29 23:38:54 [launcher.py:36] Route: /v1/completions, Methods: POST
INFO 05-29 23:38:54 [launcher.py:36] Route: /v1/embeddings, Methods: POST
INFO 05-29 23:38:54 [launcher.py:36] Route: /pooling, Methods: POST
INFO 05-29 23:38:54 [launcher.py:36] Route: /score, Methods: POST
INFO 05-29 23:38:54 [launcher.py:36] Route: /v1/score, Methods: POST
INFO 05-29 23:38:54 [launcher.py:36] Route: /v1/audio/transcriptions, Methods: POST
INFO 05-29 23:38:54 [launcher.py:36] Route: /rerank, Methods: POST
INFO 05-29 23:38:54 [launcher.py:36] Route: /v1/rerank, Methods: POST
INFO 05-29 23:38:54 [launcher.py:36] Route: /v2/rerank, Methods: POST
INFO 05-29 23:38:54 [launcher.py:36] Route: /invocations, Methods: POST
INFO 05-29 23:38:54 [launcher.py:36] Route: /metrics, Methods: GET
INFO: Started server process [3008420]
INFO: Waiting for application startup.
INFO: Application startup complete.