Description
Checklist
- 1. I have searched related issues but cannot get the expected help.
- 2. The bug has not been fixed in the latest version.
- 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
- 4. If the issue you raised is not a bug but a question, please open a discussion at https://github.com/sgl-project/SpecForge/discussions/new/choose instead; otherwise, it will be closed.
- 5. Please use English, otherwise it will be closed.
Describe the bug
I cloned the latest SpecForge codebase and noticed that it now supports training with DFlash, so I launched a DFlash training job using the script below.
During training, both loss and accuracy behaved normally, and the evaluation set showed only mild overfitting, which did not seem significant. However, when I loaded the trained weights into the official DFlash benchmark script on the GSM8K dataset (https://github.com/z-lab/dflash/blob/main/run_benchmark.sh), I observed an acceptance rate of only 1.29/(1+3), which is extremely low and suggests that training has effectively failed.
I would like to ask whether anyone has successfully trained and run inference with a DFlash model. Any discussion or help in locating the root cause would be greatly appreciated.
For context, I have previously trained an Eagle3 model successfully, which suggests that my data preprocessing, training pipeline, and evaluation setup are generally correct.
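For clarity on the "1.29/(1+3)" figure above, here is a minimal sketch of how such an acceptance rate is typically computed in speculative decoding: the mean number of accepted tokens per verification step divided by the speculation window (1 target token plus the number of draft tokens). The function name and inputs are hypothetical, not from the DFlash benchmark script itself.

```python
def acceptance_rate(accepted_per_step, num_draft_tokens=3):
    """Mean accepted tokens per verification step, divided by the
    speculation window size (1 target token + num_draft_tokens drafts).
    Hypothetical helper; names and inputs are illustrative only."""
    mean_accepted = sum(accepted_per_step) / len(accepted_per_step)
    return mean_accepted / (1 + num_draft_tokens)

# With an average of 1.29 accepted tokens per step and 3 draft tokens,
# the rate is 1.29 / 4 = 0.3225, i.e. roughly a third of the window.
print(acceptance_rate([1.29]))
```

A well-trained draft model would normally accept a much larger fraction of the window, which is why this value points to a training problem rather than a benchmarking one.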
Below is the training script I used:
#!/bin/bash
SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
ROOT_DIR=$(dirname "$SCRIPT_DIR")
export TORCHINDUCTOR_CACHE_DIR=$ROOT_DIR/cache/compiled_kernels
export SPECFORGE_DATA_NUM_PROC=32
NUM_GPUS=16
ATTENTION_BACKEND=sdpa
torchrun \
--standalone \
--nproc_per_node $NUM_GPUS \
$ROOT_DIR/scripts/train_dflash.py \
--target-model-path /data/weights/qwen3-8b \
--draft-config-path $ROOT_DIR/configs/qwen3-8b-dflash.json \
--train-data-path $ROOT_DIR/cache/dataset/perfectblend_train.jsonl \
--output-dir $ROOT_DIR/outputs/qwen3-8b-dflash-perfectblend-baseline \
--num-epochs 15 \
--batch-size 1 \
--learning-rate 1e-4 \
--max-length 2048 \
--chat-template qwen \
--attention-backend $ATTENTION_BACKEND \
--log-interval 100 \
--eval-interval 5000 \
--save-interval 10000 \
--eval-data-path $ROOT_DIR/cache/dataset/opc_test.jsonl \
--cache-dir $ROOT_DIR/cache \
--report-to tensorboard \
--target-model-backend sglang \
--resume

Looking forward to any insights or suggestions. Thanks!
Best regards,
BAI Fan
Reproduction
The training script is the same one shown above.
Environment
SpecForge-main