Description
CUDA_VISIBLE_DEVICES="2,3,4,5" ACCELERATE_LOG_LEVEL=info \
accelerate launch \
--config_file recipes/zero3.yaml \
--num_processes=3 \
src/x_r1/grpo.py \
--config recipes/X_R1_zero_0dot5B_config.yaml \
> output/x_r1_0dotB_sampling.log 2>&1
[2025-07-11 09:43:51,572] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
W0711 09:43:54.566000 617545 site-packages/torch/distributed/run.py:792]
W0711 09:43:54.566000 617545 site-packages/torch/distributed/run.py:792] *****************************************
W0711 09:43:54.566000 617545 site-packages/torch/distributed/run.py:792] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0711 09:43:54.566000 617545 site-packages/torch/distributed/run.py:792] *****************************************
[2025-07-11 09:44:01,187] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-07-11 09:44:01,194] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-07-11 09:44:01,195] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
INFO 07-11 09:44:03 [importing.py:53] Triton module has been replaced with a placeholder.
INFO 07-11 09:44:03 [__init__.py:239] Automatically detected platform cuda.
INFO 07-11 09:44:03 [importing.py:53] Triton module has been replaced with a placeholder.
INFO 07-11 09:44:03 [__init__.py:239] Automatically detected platform cuda.
INFO 07-11 09:44:03 [importing.py:53] Triton module has been replaced with a placeholder.
INFO 07-11 09:44:03 [__init__.py:239] Automatically detected platform cuda.
[2025-07-11 09:44:04,482] [INFO] [comm.py:669:init_distributed] cdb=None
[2025-07-11 09:44:04,482] [INFO] [comm.py:700:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
2025-07-11 09:44:04 - WARNING - __main__ - Process rank: 0, device: cuda:0, n_gpu: 1 distributed training: True, 16-bits training: False
2025-07-11 09:44:04 - INFO - __main__ - Model parameters ModelConfig(model_name_or_path='/data2/jcxy/llm_model/Qwen2.5-0.5B', model_revision='main', torch_dtype='bfloat16', trust_remote_code=False, attn_implementation='flash_attention_2', use_peft=False, lora_r=16, lora_alpha=32, lora_dropout=0.05, lora_target_modules=None, lora_modules_to_save=None, lora_task_type='CAUSAL_LM', use_rslora=False, use_dora=False, load_in_8bit=False, load_in_4bit=False, bnb_4bit_quant_type='nf4', use_bnb_nested_quant=False)
2025-07-11 09:44:04 - INFO - __main__ - Script parameters GRPOScriptArguments(dataset_name='xiaodongguaAIGC/X-R1-750', dataset_config=None, dataset_train_split='train', dataset_test_split='test', gradient_checkpointing_use_reentrant=False, ignore_bias_buffers=False, reward_funcs=['accuracy', 'format'], cosine_min_value_wrong=0.0, cosine_max_value_wrong=-0.5, cosine_min_value_correct=0.5, cosine_max_value_correct=1.0, cosine_max_len=1000, repetition_n_grams=3, repetition_max_penalty=-1.0)
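Note that reward_funcs=['accuracy', 'format'] selects only the accuracy and format rewards; the cosine_* and repetition_* fields above appear to be unused defaults for this run. As a point of reference, here is a minimal sketch of an open-r1-style 'format' reward (the actual implementation in src/x_r1/ may differ):

```python
import re

# Hedged sketch of a <think>/<answer> format reward, following the
# open-r1 convention; not necessarily the exact X-R1 implementation.
def format_reward(completions, **kwargs):
    """Return 1.0 for each completion shaped as <think>...</think><answer>...</answer>."""
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
    # TRL passes conversational completions: [{'role': 'assistant', 'content': ...}]
    contents = [completion[0]["content"] for completion in completions]
    return [1.0 if re.match(pattern, c, re.DOTALL) else 0.0 for c in contents]
```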
2025-07-11 09:44:04 - INFO - __main__ - Data parameters GRPOConfig(
_n_gpu=1,
accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False},
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
average_tokens_across_devices=False,
batch_eval_metrics=False,
benchmarks=[],
beta=0.04,
bf16=True,
bf16_full_eval=False,
cache_implementation=None,
callbacks=[],
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
dataloader_prefetch_factor=None,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
delta=None,
disable_dropout=False,
disable_tqdm=False,
do_eval=False,
do_predict=False,
do_train=False,
ds3_gather_for_generation=True,
epsilon=0.2,
epsilon_high=None,
eval_accumulation_steps=None,
eval_delay=0,
eval_do_concat_batches=True,
eval_on_start=False,
eval_steps=10,
eval_strategy=no,
eval_use_gather_object=False,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
generation_batch_size=96,
gradient_accumulation_steps=8,
gradient_checkpointing=True,
gradient_checkpointing_kwargs={'use_reentrant': False},
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_always_push=False,
hub_model_id=None,
hub_model_revision=main,
hub_private_repo=None,
hub_strategy=every_save,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_for_metrics=[],
include_inputs_for_metrics=False,
include_num_input_tokens_seen=False,
include_tokens_per_second=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=3e-06,
length_column_name=length,
load_best_model_at_end=False,
local_rank=0,
log_completions=False,
log_level=info,
log_level_replica=warning,
log_on_each_node=True,
logging_dir=output/X-R1-0.5B-bs4-numgen12-gas8-gpu3/runs/Jul11_09-44-04_ubuntu,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=1,
logging_strategy=steps,
loss_type=bnpo,
lr_scheduler_kwargs={},
lr_scheduler_type=cosine,
mask_truncated_completions=False,
max_completion_length=1024,
max_grad_norm=1.0,
max_prompt_length=256,
max_steps=-1,
metric_for_best_model=None,
min_p=None,
model_init_kwargs=None,
mp_parameters=,
neftune_noise_alpha=None,
no_cuda=False,
num_completions_to_print=None,
num_generations=12,
num_iterations=1,
num_train_epochs=3,
optim=adamw_torch,
optim_args=None,
optim_target_modules=None,
output_dir=output/X-R1-0.5B-bs4-numgen12-gas8-gpu3,
overwrite_hub_revision=False,
overwrite_output_dir=True,
past_index=-1,
per_device_eval_batch_size=1,
per_device_train_batch_size=4,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_revision=False,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
ray_scope=last,
ref_model_mixup_alpha=0.6,
ref_model_sync_steps=512,
remove_unused_columns=False,
repetition_penalty=1.0,
report_to=['wandb'],
restore_callback_states_from_checkpoint=False,
resume_from_checkpoint=None,
reward_weights=None,
run_name=output/X-R1-0.5B-bs4-numgen12-gas8-gpu3,
save_on_each_node=False,
save_only_model=False,
save_safetensors=True,
save_steps=500,
save_strategy=epoch,
save_total_limit=None,
scale_rewards=True,
seed=42,
shuffle_dataset=True,
skip_memory_metrics=True,
steps_per_generation=8,
sync_ref_model=False,
system_prompt=None,
temperature=1.0,
tf32=None,
top_k=None,
top_p=1.0,
torch_compile=False,
torch_compile_backend=None,
torch_compile_mode=None,
torch_empty_cache_steps=None,
torchdynamo=None,
tp_size=0,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_cpu=False,
use_ipex=False,
use_legacy_prediction_loop=False,
use_liger_kernel=False,
use_liger_loss=False,
use_mps_device=False,
use_vllm=True,
vllm_gpu_memory_utilization=0.7,
vllm_guided_decoding_regex=None,
vllm_mode=server,
vllm_server_base_url=None,
vllm_server_host=0.0.0.0,
vllm_server_port=8000,
vllm_server_timeout=240.0,
vllm_tensor_parallel_size=1,
wandb_entity=None,
wandb_log_unique_prompts=False,
wandb_project=None,
warmup_ratio=0.1,
warmup_steps=0,
weight_decay=0.0,
)
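A quick consistency check on the batch geometry above: under TRL's documented default, the generation batch is per_device_train_batch_size × num_processes × steps_per_generation, and num_generations must divide it evenly. The numbers in this dump line up:

```python
# Values copied from the GRPOConfig dump above.
per_device_train_batch_size = 4
num_processes = 3         # accelerate launch --num_processes=3
steps_per_generation = 8  # equal to gradient_accumulation_steps here
num_generations = 12

generation_batch_size = per_device_train_batch_size * num_processes * steps_per_generation
assert generation_batch_size == 96                  # matches generation_batch_size=96
assert generation_batch_size % num_generations == 0
print(generation_batch_size // num_generations)     # 8 unique prompts per generation round
```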
[2025-07-11 09:44:04,609] [INFO] [comm.py:669:init_distributed] cdb=None
[2025-07-11 09:44:04,856] [INFO] [comm.py:669:init_distributed] cdb=None
2025-07-11 09:44:05 - WARNING - __main__ - Process rank: 2, device: cuda:2, n_gpu: 1 distributed training: True, 16-bits training: False
2025-07-11 09:44:05 - WARNING - __main__ - Process rank: 1, device: cuda:1, n_gpu: 1 distributed training: True, 16-bits training: False
[2025-07-11 09:44:12,718] [INFO] [config.py:735:__init__] Config mesh_device None world_size = 3
[WARNING|logging.py:328] 2025-07-11 09:44:12,722 >> You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with model.to('cuda').
Overwrite dataset info from restored data version if exists.
2025-07-11 09:44:12 - INFO - datasets.builder - Overwrite dataset info from restored data version if exists.
Loading Dataset info from /home/haolu/.cache/huggingface/datasets/xiaodongguaAIGC___x-r1-750/default/0.0.0/1a2e75b1147e199697374f5decea05e3b13d42ec
2025-07-11 09:44:12 - INFO - datasets.info - Loading Dataset info from /home/haolu/.cache/huggingface/datasets/xiaodongguaAIGC___x-r1-750/default/0.0.0/1a2e75b1147e199697374f5decea05e3b13d42ec
Found cached dataset x-r1-750 (/home/haolu/.cache/huggingface/datasets/xiaodongguaAIGC___x-r1-750/default/0.0.0/1a2e75b1147e199697374f5decea05e3b13d42ec)
2025-07-11 09:44:12 - INFO - datasets.builder - Found cached dataset x-r1-750 (/home/haolu/.cache/huggingface/datasets/xiaodongguaAIGC___x-r1-750/default/0.0.0/1a2e75b1147e199697374f5decea05e3b13d42ec)
Loading Dataset info from /home/haolu/.cache/huggingface/datasets/xiaodongguaAIGC___x-r1-750/default/0.0.0/1a2e75b1147e199697374f5decea05e3b13d42ec
2025-07-11 09:44:12 - INFO - datasets.info - Loading Dataset info from /home/haolu/.cache/huggingface/datasets/xiaodongguaAIGC___x-r1-750/default/0.0.0/1a2e75b1147e199697374f5decea05e3b13d42ec
Loading cached processed dataset at /home/haolu/.cache/huggingface/datasets/xiaodongguaAIGC___x-r1-750/default/0.0.0/1a2e75b1147e199697374f5decea05e3b13d42ec/cache-37799212ae05f40f.arrow
2025-07-11 09:44:13 - INFO - datasets.arrow_dataset - Loading cached processed dataset at /home/haolu/.cache/huggingface/datasets/xiaodongguaAIGC___x-r1-750/default/0.0.0/1a2e75b1147e199697374f5decea05e3b13d42ec/cache-37799212ae05f40f.arrow
Loading cached processed dataset at /home/haolu/.cache/huggingface/datasets/xiaodongguaAIGC___x-r1-750/default/0.0.0/1a2e75b1147e199697374f5decea05e3b13d42ec/cache-2fef31b174fffacc.arrow
2025-07-11 09:44:13 - INFO - datasets.arrow_dataset - Loading cached processed dataset at /home/haolu/.cache/huggingface/datasets/xiaodongguaAIGC___x-r1-750/default/0.0.0/1a2e75b1147e199697374f5decea05e3b13d42ec/cache-2fef31b174fffacc.arrow
2025-07-11 09:44:13 - INFO - __main__ - *** Initializing model kwargs ***
[INFO|configuration_utils.py:691] 2025-07-11 09:44:13,012 >> loading configuration file /data2/jcxy/llm_model/Qwen2.5-0.5B/config.json
[INFO|configuration_utils.py:765] 2025-07-11 09:44:13,015 >> Model config Qwen2Config {
"architectures": [
"Qwen2ForCausalLM"
],
"attention_dropout": 0.0,
"bos_token_id": 151643,
"eos_token_id": 151643,
"hidden_act": "silu",
"hidden_size": 896,
"initializer_range": 0.02,
"intermediate_size": 4864,
"max_position_embeddings": 32768,
"max_window_layers": 24,
"model_type": "qwen2",
"num_attention_heads": 14,
"num_hidden_layers": 24,
"num_key_value_heads": 2,
"rms_norm_eps": 1e-06,
"rope_scaling": null,
"rope_theta": 1000000.0,
"sliding_window": 32768,
"tie_word_embeddings": true,
"torch_dtype": "bfloat16",
"transformers_version": "4.51.3",
"use_cache": true,
"use_mrope": false,
"use_sliding_window": false,
"vocab_size": 151936
}
[INFO|modeling_utils.py:1121] 2025-07-11 09:44:13,070 >> loading weights file /data2/jcxy/llm_model/Qwen2.5-0.5B/model.safetensors
[INFO|modeling_utils.py:2167] 2025-07-11 09:44:13,071 >> Instantiating Qwen2ForCausalLM model under default dtype torch.bfloat16.
[INFO|modeling_utils.py:3726] 2025-07-11 09:44:13,071 >> Detected DeepSpeed ZeRO-3: activating zero.init() for this model
[2025-07-11 09:44:13,071] [INFO] [config.py:735:__init__] Config mesh_device None world_size = 3
[WARNING|logging.py:328] 2025-07-11 09:44:13,075 >> You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with model.to('cuda').
[INFO|configuration_utils.py:1142] 2025-07-11 09:44:13,082 >> Generate config GenerationConfig {
"bos_token_id": 151643,
"eos_token_id": 151643
}
[2025-07-11 09:44:13,183] [INFO] [config.py:735:__init__] Config mesh_device None world_size = 3
[WARNING|logging.py:328] 2025-07-11 09:44:13,188 >> You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with model.to('cuda').
[2025-07-11 09:44:15,017] [INFO] [partition_parameters.py:348:__exit__] finished initializing model - num_params = 291, num_elems = 0.63B
[2025-07-11 09:44:15,733] [INFO] [config.py:735:__init__] Config mesh_device None world_size = 3
[2025-07-11 09:44:15,734] [INFO] [config.py:735:__init__] Config mesh_device None world_size = 3
[INFO|modeling_utils.py:4930] 2025-07-11 09:44:15,755 >> All model checkpoint weights were used when initializing Qwen2ForCausalLM.
[INFO|modeling_utils.py:4938] 2025-07-11 09:44:15,755 >> All the weights of Qwen2ForCausalLM were initialized from the model checkpoint at /data2/jcxy/llm_model/Qwen2.5-0.5B.
If your task is similar to the task the model of the checkpoint was trained on, you can already use Qwen2ForCausalLM for predictions without further training.
[INFO|configuration_utils.py:1095] 2025-07-11 09:44:15,757 >> loading configuration file /data2/jcxy/llm_model/Qwen2.5-0.5B/generation_config.json
[INFO|configuration_utils.py:1142] 2025-07-11 09:44:15,757 >> Generate config GenerationConfig {
"bos_token_id": 151643,
"eos_token_id": 151643,
"max_new_tokens": 2048
}
[INFO|configuration_utils.py:691] 2025-07-11 09:44:15,758 >> loading configuration file /data2/jcxy/llm_model/Qwen2.5-0.5B/config.json
[INFO|configuration_utils.py:765] 2025-07-11 09:44:15,759 >> Model config Qwen2Config {
"architectures": [
"Qwen2ForCausalLM"
],
"attention_dropout": 0.0,
"bos_token_id": 151643,
"eos_token_id": 151643,
"hidden_act": "silu",
"hidden_size": 896,
"initializer_range": 0.02,
"intermediate_size": 4864,
"max_position_embeddings": 32768,
"max_window_layers": 24,
"model_type": "qwen2",
"num_attention_heads": 14,
"num_hidden_layers": 24,
"num_key_value_heads": 2,
"rms_norm_eps": 1e-06,
"rope_scaling": null,
"rope_theta": 1000000.0,
"sliding_window": 32768,
"tie_word_embeddings": true,
"torch_dtype": "bfloat16",
"transformers_version": "4.51.3",
"use_cache": true,
"use_mrope": false,
"use_sliding_window": false,
"vocab_size": 151936
}
[INFO|modeling_utils.py:1121] 2025-07-11 09:44:15,760 >> loading weights file /data2/jcxy/llm_model/Qwen2.5-0.5B/model.safetensors
[INFO|modeling_utils.py:2167] 2025-07-11 09:44:15,760 >> Instantiating Qwen2ForCausalLM model under default dtype torch.bfloat16.
[INFO|modeling_utils.py:3726] 2025-07-11 09:44:15,760 >> Detected DeepSpeed ZeRO-3: activating zero.init() for this model
[2025-07-11 09:44:15,760] [INFO] [config.py:735:__init__] Config mesh_device None world_size = 3
[INFO|configuration_utils.py:1142] 2025-07-11 09:44:15,767 >> Generate config GenerationConfig {
"bos_token_id": 151643,
"eos_token_id": 151643
}
[2025-07-11 09:44:16,232] [INFO] [partition_parameters.py:348:__exit__] finished initializing model - num_params = 582, num_elems = 1.26B
[INFO|modeling_utils.py:4930] 2025-07-11 09:44:16,955 >> All model checkpoint weights were used when initializing Qwen2ForCausalLM.
[INFO|modeling_utils.py:4938] 2025-07-11 09:44:16,955 >> All the weights of Qwen2ForCausalLM were initialized from the model checkpoint at /data2/jcxy/llm_model/Qwen2.5-0.5B.
If your task is similar to the task the model of the checkpoint was trained on, you can already use Qwen2ForCausalLM for predictions without further training.
[INFO|configuration_utils.py:1095] 2025-07-11 09:44:16,957 >> loading configuration file /data2/jcxy/llm_model/Qwen2.5-0.5B/generation_config.json
[INFO|configuration_utils.py:1142] 2025-07-11 09:44:16,957 >> Generate config GenerationConfig {
"bos_token_id": 151643,
"eos_token_id": 151643,
"max_new_tokens": 2048
}
[INFO|tokenization_utils_base.py:2058] 2025-07-11 09:44:16,966 >> loading file vocab.json
[INFO|tokenization_utils_base.py:2058] 2025-07-11 09:44:16,966 >> loading file merges.txt
[INFO|tokenization_utils_base.py:2058] 2025-07-11 09:44:16,966 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2058] 2025-07-11 09:44:16,966 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2058] 2025-07-11 09:44:16,966 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2058] 2025-07-11 09:44:16,966 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2058] 2025-07-11 09:44:16,966 >> loading file chat_template.jinja
[INFO|tokenization_utils_base.py:2323] 2025-07-11 09:44:17,211 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[INFO|trainer.py:748] 2025-07-11 09:44:17,225 >> Using auto half precision backend
[INFO|configuration_utils.py:691] 2025-07-11 09:44:17,677 >> loading configuration file /data2/jcxy/llm_model/Qwen2.5-0.5B/config.json
[INFO|configuration_utils.py:691] 2025-07-11 09:44:17,677 >> loading configuration file /data2/jcxy/llm_model/Qwen2.5-0.5B/config.json
[INFO|configuration_utils.py:765] 2025-07-11 09:44:17,678 >> Model config Qwen2Config {
"architectures": [
"Qwen2ForCausalLM"
],
"attention_dropout": 0.0,
"bos_token_id": 151643,
"eos_token_id": 151643,
"hidden_act": "silu",
"hidden_size": 896,
"initializer_range": 0.02,
"intermediate_size": 4864,
"max_position_embeddings": 32768,
"max_window_layers": 24,
"model_type": "qwen2",
"num_attention_heads": 14,
"num_hidden_layers": 24,
"num_key_value_heads": 2,
"rms_norm_eps": 1e-06,
"rope_scaling": null,
"rope_theta": 1000000.0,
"sliding_window": 32768,
"tie_word_embeddings": true,
"torch_dtype": "bfloat16",
"transformers_version": "4.51.3",
"use_cache": true,
"use_mrope": false,
"use_sliding_window": false,
"vocab_size": 151936
}
[INFO|image_processing_auto.py:311] 2025-07-11 09:44:17,678 >> Could not locate the image processor configuration file, will try to use the model config instead.
INFO 07-11 09:44:24 [config.py:717] This model supports multiple tasks: {'score', 'classify', 'generate', 'embed', 'reward'}. Defaulting to 'generate'.
INFO 07-11 09:44:24 [config.py:2003] Chunked prefill is enabled with max_num_batched_tokens=16384.
[INFO|tokenization_utils_base.py:2058] 2025-07-11 09:44:24,509 >> loading file vocab.json
[INFO|tokenization_utils_base.py:2058] 2025-07-11 09:44:24,509 >> loading file merges.txt
[INFO|tokenization_utils_base.py:2058] 2025-07-11 09:44:24,509 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2058] 2025-07-11 09:44:24,509 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2058] 2025-07-11 09:44:24,509 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2058] 2025-07-11 09:44:24,509 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2058] 2025-07-11 09:44:24,509 >> loading file chat_template.jinja
[INFO|tokenization_utils_base.py:2323] 2025-07-11 09:44:24,836 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[INFO|configuration_utils.py:1095] 2025-07-11 09:44:24,913 >> loading configuration file /data2/jcxy/llm_model/Qwen2.5-0.5B/generation_config.json
[INFO|configuration_utils.py:1142] 2025-07-11 09:44:24,914 >> Generate config GenerationConfig {
"bos_token_id": 151643,
"eos_token_id": 151643,
"max_new_tokens": 2048
}
WARNING 07-11 09:44:24 [utils.py:2382] We must use the spawn multiprocessing start method. Overriding VLLM_WORKER_MULTIPROC_METHOD to 'spawn'. See https://docs.vllm.ai/en/latest/getting_started/troubleshooting.html#python-multiprocessing for more information. Reason: CUDA is initialized
[2025-07-11 09:44:30,024] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
INFO 07-11 09:44:32 [importing.py:53] Triton module has been replaced with a placeholder.
INFO 07-11 09:44:32 [__init__.py:239] Automatically detected platform cuda.
INFO 07-11 09:44:33 [core.py:58] Initializing a V1 LLM engine (v0.8.5) with config: model='/data2/jcxy/llm_model/Qwen2.5-0.5B', speculative_config=None, tokenizer='/data2/jcxy/llm_model/Qwen2.5-0.5B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda:3, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=/data2/jcxy/llm_model/Qwen2.5-0.5B, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"level":3,"custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":512}
WARNING 07-11 09:44:33 [utils.py:2522] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f638869b3d0>
[W711 09:45:29.018593983 socket.cpp:204] [c10d] The hostname of the client socket cannot be retrieved. err=-3
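For reference, the V1 engine line above (model path, bfloat16, tensor_parallel_size=1) together with vllm_gpu_memory_utilization=0.7 from the training config corresponds to an in-process engine built roughly as follows; this is a minimal sketch under those assumptions, not the actual TRL/X-R1 call site:

```python
from vllm import LLM, SamplingParams

# Hedged sketch: constructor arguments mirror the engine-init log line above.
llm = LLM(
    model="/data2/jcxy/llm_model/Qwen2.5-0.5B",
    dtype="bfloat16",
    tensor_parallel_size=1,      # vllm_tensor_parallel_size
    gpu_memory_utilization=0.7,  # vllm_gpu_memory_utilization
)

# Sampling values mirror the GRPOConfig dump:
# temperature=1.0, top_p=1.0, max_completion_length=1024.
params = SamplingParams(temperature=1.0, top_p=1.0, max_tokens=1024)
outputs = llm.generate(["Example prompt"], params)
print(outputs[0].outputs[0].text)
```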