Hi, I'm trying to fine-tune a Qwen3 MoE AWQ model and found a big difference between AWQ compressed-tensors models and AWQ GEMM models.
Is this a bug in the compressed-tensors implementation?
Hardware: 6x A6000 (288 GB total) OOMs with the compressed-tensors model, while 1x A6000 (48 GB) works perfectly with AWQ GEMM.
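To check whether the gap already appears at plain model load, before any training, something like this minimal sketch should work (assuming the AWQ kernels and the compressed-tensors package are installed so transformers can load both checkpoints; everything else is stock transformers):

```python
# Minimal sketch: compare peak GPU memory when just loading the two checkpoints.
# Assumes autoawq kernels and the compressed-tensors package are installed.
import torch
from transformers import AutoModelForCausalLM

MODELS = {
    "AWQ GEMM": "ELVISIO/Qwen3-30B-A3B-Instruct-2507-AWQ",
    "compressed-tensors AWQ": "cpatonn/Qwen3-30B-A3B-Instruct-2507-AWQ-4bit",
}

for name, model_id in MODELS.items():
    for i in range(torch.cuda.device_count()):
        torch.cuda.reset_peak_memory_stats(i)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
    peak = sum(
        torch.cuda.max_memory_allocated(i) for i in range(torch.cuda.device_count())
    )
    print(f"{name}: {peak / 2**30:.1f} GiB peak across visible GPUs")
    del model
    torch.cuda.empty_cache()
```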
Here's my SFT config:
```yaml
# GEMM AWQ works, no OOM
model_name_or_path: ELVISIO/Qwen3-30B-A3B-Instruct-2507-AWQ
# Compressed-tensors AWQ OOMs
model_name_or_path: cpatonn/Qwen3-30B-A3B-Instruct-2507-AWQ-4bit
# dataset
dataset_name: ...
# LoRA
use_peft: True
lora_target_modules:
  - "q_proj"
  - "k_proj"
  - "v_proj"
  - "o_proj"
# training
learning_rate: 2.0e-05
num_train_epochs: 1
packing: true
per_device_train_batch_size: 1
per_device_eval_batch_size: 1
gradient_accumulation_steps: 16
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: true
logging_steps: 1
logging_strategy: "steps"
log_level: "info"
max_length: 8000
warmup_ratio: 0.03
lr_scheduler_type: 'cosine'
bf16: true
bf16_full_eval: true
fp16: false
attn_implementation: 'flash_attention_2'
```
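In code, the relevant part of this config corresponds to roughly the following TRL + PEFT setup (a sketch: the dataset id is a placeholder, as in the YAML, and the LoRA rank/alpha are illustrative since the config doesn't set them):

```python
# Sketch of the TRL + PEFT setup implied by the YAML above.
# "my_dataset" is a placeholder; r / lora_alpha are illustrative defaults.
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("my_dataset", split="train")  # placeholder dataset id

peft_config = LoraConfig(
    r=16,           # not set in the YAML; illustrative
    lora_alpha=32,  # not set in the YAML; illustrative
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

trainer = SFTTrainer(
    model="cpatonn/Qwen3-30B-A3B-Instruct-2507-AWQ-4bit",  # or the GEMM checkpoint
    train_dataset=dataset,
    peft_config=peft_config,
    args=SFTConfig(
        learning_rate=2.0e-5,
        num_train_epochs=1,
        packing=True,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        gradient_checkpointing=True,
        gradient_checkpointing_kwargs={"use_reentrant": True},
        max_length=8000,
        warmup_ratio=0.03,
        lr_scheduler_type="cosine",
        bf16=True,
        model_init_kwargs={"attn_implementation": "flash_attention_2"},
    ),
)
trainer.train()
```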
Note: when serving with vLLM, compressed-tensors models seem to have better throughput. So, if possible, fine-tuning in the compressed-tensors format would be preferable and more future-proof, given that AutoAWQ is deprecated.
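For anyone wanting to reproduce the serving side, a minimal offline vLLM sketch (vLLM picks up the quantization scheme, AWQ or compressed-tensors, from the checkpoint's config):

```python
# Sketch: load either checkpoint offline with vLLM and generate once.
# vLLM infers the quantization method from the model config.
from vllm import LLM, SamplingParams

llm = LLM(
    model="cpatonn/Qwen3-30B-A3B-Instruct-2507-AWQ-4bit",  # or the GEMM checkpoint
    tensor_parallel_size=1,
)
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```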