# [EMNLP 2025] Router-Tuning: A Simple and Effective Approach for Enabling Dynamic-Depth in Transformers
Shwai He, Tao Ge, Guoheng Sun, Bowei Tian, Xiaoyang Wang, Dong Yu
Introduction • News • Why • Results • Quick Start • Citation
## Introduction
This is the official implementation of the paper Router-Tuning: A Simple and Effective Approach for Enabling Dynamic-Depth in Transformers, accepted at EMNLP 2025. We provide a practical framework for efficient dynamic-depth training and inference in Transformers.
Router-Tuning enables dynamic-depth inference by fine-tuning only router-related parameters. Compared with standard MoD-style full tuning, it significantly reduces training cost while keeping model quality competitive.
## News
- Aug 2025: Router-Tuning accepted to EMNLP 2025 main conference.
- Oct 2024: arXiv preprint and code release.
## Why Router-Tuning
Traditional transformers execute a fixed number of layers for every token, which wastes computation on easy tokens.
Mixture of Depths (MoD) addresses this by dynamically skipping less important computations, but two practical issues remain:
- Existing methods usually tune the whole model, causing high training cost.
- Aggressive skipping can hurt quality if routing is not well calibrated.
Router-Tuning tackles both by focusing optimization on routing components and introducing routing strategies that better preserve performance-efficiency tradeoffs.
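As a schematic illustration of the Mixture-of-Depths idea (not the repo's implementation), a MoD step routes only the highest-scoring tokens through a layer and lets the rest pass through unchanged. The function name, scores, and keep ratio below are all hypothetical:

```python
def mod_layer(tokens, scores, layer_fn, keep_ratio=0.5):
    """Schematic Mixture-of-Depths step: only the top-scoring fraction of
    tokens is processed by the layer; the rest skip it (identity path)."""
    k = max(1, int(len(tokens) * keep_ratio))
    # Indices of the k tokens the router considers most important.
    keep = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
    out = list(tokens)
    for i in keep:
        out[i] = layer_fn(tokens[i])
    return out

tokens = [1.0, 2.0, 3.0, 4.0]
scores = [0.1, 0.9, 0.2, 0.8]  # hypothetical router scores
print(mod_layer(tokens, scores, lambda x: x * 10))  # [1.0, 20.0, 3.0, 40.0]
```

With `keep_ratio=0.5`, only the two highest-scoring tokens are transformed; the others skip the layer entirely, which is where the compute savings come from.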
## Results

Router-Tuning consistently improves the efficiency-quality tradeoff over full-parameter MoD tuning baselines. The reported best setting reaches a notable speedup while keeping quality degradation small. Router specialization becomes clearer after tuning: the model learns more stable token-to-layer routing patterns, which supports dynamic-depth execution with less unnecessary computation. Router-Tuning is also compatible with LoRA-based adaptation and can be composed with it for a better efficiency-performance balance. In practice, this enables lightweight deployment recipes without full-model retraining.

## Key Features

- Router-Only Fine-Tuning
  - Tunes router-related parameters instead of performing full-model updates.
  - Substantially reduces optimization cost for dynamic-depth adaptation.
- MoD Attention Routing
  - Uses attention-based routing granularity to improve compute and memory efficiency.
  - Preserves output quality under dynamic-depth execution.
## Repository Layout

- `entrypoints/finetune/finetune_mod.py`: main training entrypoint.
- `scripts/finetune_mod.sh`: reproducible launcher with `accelerate` + DeepSpeed.
- `entrypoints/data/reformat_datasets.py`: convert raw datasets to the unified `messages` format.
- `entrypoints/data/mix_datasets.py`: build mixed instruction-tuning data.
- `utils/pipeline/customized_trainer.py`: router-focused trainer logic.
- `configs/accelerate/`: distributed training launcher configs.
- `configs/deepspeed/`: DeepSpeed runtime configs.
- `ckpt/`: model config/tokenizer files for supported MoD variants.
## Quick Start

### Installation

```bash
conda create -n router-tuning python=3.10 -y
conda activate router-tuning
git clone https://github.com/CASE-Lab-UMD/Router-Tuning-Mixture-of-Depths.git
cd Router-Tuning-Mixture-of-Depths
pip install -r requirements.txt
```

### Data Preparation

Put raw datasets under `data/raw/` using the expected subdirectory names:

- `vicuna_sharegpt`
- `evol_instruct`
- `slim_orca`
- `meta_math_qa`
- `evol_code_alpaca`
- `alpaca`
Then run:
```bash
python entrypoints/data/reformat_datasets.py \
    --raw_data_root ./data/raw \
    --save_path ./data/reformatted
```
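The reformatted files use the unified `messages` format. The exact schema is repo-specific; assuming chat-style `{role, content}` turns in JSONL (an assumption, not the repo's documented schema), a minimal validation sketch looks like:

```python
import json

def validate_messages_jsonl(lines):
    """Count well-formed vs. malformed records, where a well-formed record
    is assumed to carry a non-empty `messages` list of {role, content} dicts."""
    ok, bad = 0, 0
    for raw in lines:
        try:
            record = json.loads(raw)
            msgs = record["messages"]
            assert isinstance(msgs, list) and msgs
            assert all(
                isinstance(m, dict) and {"role", "content"} <= m.keys()
                for m in msgs
            )
            ok += 1
        except (json.JSONDecodeError, KeyError, AssertionError):
            bad += 1
    return ok, bad

sample = [
    json.dumps({"messages": [
        {"role": "user", "content": "Hi"},
        {"role": "assistant", "content": "Hello!"},
    ]}),
    '{"prompt": "missing messages key"}',
]
print(validate_messages_jsonl(sample))  # (1, 1)
```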
```bash
python entrypoints/data/mix_datasets.py \
    --reformatted_dir ./data/reformatted \
    --save_path ./data/mixed
```

### Training

```bash
bash scripts/finetune_mod.sh
```

To override the distributed launch settings:

```bash
NUM_PROCESSES=4 PORT=29501 bash scripts/finetune_mod.sh
```

`finetune_mod.sh` is the recommended launcher. Commonly adjusted fields:
- `folder_name`: base checkpoint directory under `ckpt/`.
- `data_type`: one dataset under `data/reformatted/`, or `mixed`.
- `max_train_samples`: training subset size for quick experiments.
- `mod_n`: MoD keep-ratio control (the legacy alias `mindskip_n` is still supported in the Python entrypoint).
- `granularity`: routing granularity (`attn_sequence` or `mlp_sequence`).
- `router_only`: enable router-only training (default `True`).
- `learning_rate`, `weight_decay`, `num_epochs`.
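The `router_only` setting corresponds to training only router-related parameters while freezing everything else. A minimal sketch of that selection logic, assuming router parameters can be identified by a name substring (the repo's trainer may use a different matching rule):

```python
def select_trainable(param_names, router_keywords=("router",)):
    """Split parameter names into trainable (router-related) and frozen.
    Substring matching on the name is an assumption for illustration."""
    trainable = [n for n in param_names if any(k in n for k in router_keywords)]
    frozen = [n for n in param_names if n not in trainable]
    return trainable, frozen

# Hypothetical parameter names from a MoD-augmented transformer.
names = [
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.0.router.weight",
    "model.layers.1.mlp.router.weight",
    "lm_head.weight",
]
trainable, frozen = select_trainable(names)
print(trainable)  # ['model.layers.0.router.weight', 'model.layers.1.mlp.router.weight']
print(len(frozen))  # 2
```

In a real training loop, the frozen set would have `requires_grad` disabled, which is what makes router-tuning dramatically cheaper than full-model updates.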
Distributed launch overrides:
- `NUM_PROCESSES`: number of GPU processes.
- `PORT`: distributed master port.
| Knob | Where | Typical Values | Effect |
|---|---|---|---|
| `folder_name` | `scripts/finetune_mod.sh` | `mistral-7b-mod`, `qwen-2.5-7b-mod`, `llama3-8b-instruct-mod` | Selects base checkpoint under `ckpt/` |
| `data_type` | `scripts/finetune_mod.sh` | `alpaca`, `mixed`, ... | Chooses training data source |
| `mod_n` | `scripts/finetune_mod.sh` / CLI | `8`, `16`, `32` | Controls dynamic-depth sparsity/keep behavior |
| `granularity` | `scripts/finetune_mod.sh` | `attn_sequence`, `mlp_sequence` | Chooses routing granularity |
| `max_train_samples` | `scripts/finetune_mod.sh` | `1000`, `5000`, all (by removing the cap) | Controls quick debug vs. full tuning |
| `NUM_PROCESSES` | shell env | `1`, `4`, `8` | Number of distributed workers |
| `PORT` | shell env | e.g., `29501` | Master communication port |
## Evaluation

Evaluation is compatible with EleutherAI/lm-evaluation-harness. For strict reproduction of the earlier experiments, see s1ghhh/lm-evaluation-harness.
## Pre-Run Checklist

- Python 3.10 environment with `pip install -r requirements.txt`.
- Valid local model path under `ckpt/` (or customize `folder_name`).
- Reformatted/mixed data exists at `data/reformatted/*/data.jsonl` or `data/mixed/data.jsonl`.
- The `accelerate` config selected in `scripts/finetune_mod.sh` matches your hardware.
- `NUM_PROCESSES` and GPU memory are consistent with `max_seq_length` and the batch setup.
## Citation

```bibtex
@misc{he2024routertuningsimpleeffectiveapproach,
      title={Router-Tuning: A Simple and Effective Approach for Enabling Dynamic-Depth in Transformers},
      author={Shwai He and Tao Ge and Guoheng Sun and Bowei Tian and Xiaoyang Wang and Dong Yu},
      year={2024},
      eprint={2410.13184},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2410.13184}
}
```

## Contact

- Shwai He: shwaihe@umd.edu


