# [EMNLP 2025] Router-Tuning: A Simple and Effective Approach for Enabling Dynamic-Depth in Transformers
Shwai He, Tao Ge, Guoheng Sun, Bowei Tian, Xiaoyang Wang, Dong Yu
Introduction • News • Why • Results • Quick Start • Citation
## Introduction
This is the official implementation of the paper Router-Tuning: A Simple and Effective Approach for Enabling Dynamic-Depth in Transformers, accepted at EMNLP 2025. We provide a practical framework for efficient dynamic-depth training and inference in Transformers.
Router-Tuning enables dynamic-depth inference by fine-tuning only router-related parameters. Compared with standard MoD-style full tuning, it significantly reduces training cost while keeping model quality competitive.
## News
- Aug 2025: Router-Tuning accepted to EMNLP 2025 main conference.
- Oct 2024: arXiv preprint and code release.
## Why Router-Tuning
Traditional transformers execute a fixed number of layers for every token, which wastes computation on easy tokens.
Mixture of Depths (MoD) addresses this by dynamically skipping less important computations, but two practical issues remain:
- Existing methods usually tune the whole model, causing high training cost.
- Aggressive skipping can hurt quality if routing is not well calibrated.
Router-Tuning tackles both by focusing optimization on routing components and introducing routing strategies that better preserve performance-efficiency tradeoffs.
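As a schematic illustration of the Mixture-of-Depths idea (not the repo's implementation), a MoD step routes only the highest-scoring tokens through a layer and lets the rest pass through unchanged. The function name, scores, and keep ratio below are all hypothetical:

```python
def mod_layer(tokens, scores, layer_fn, keep_ratio=0.5):
    """Schematic Mixture-of-Depths step: only the top-scoring fraction of
    tokens is processed by the layer; the rest skip it (identity path)."""
    k = max(1, int(len(tokens) * keep_ratio))
    # Indices of the k tokens the router considers most important.
    keep = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
    out = list(tokens)
    for i in keep:
        out[i] = layer_fn(tokens[i])
    return out

tokens = [1.0, 2.0, 3.0, 4.0]
scores = [0.1, 0.9, 0.2, 0.8]  # hypothetical router scores
print(mod_layer(tokens, scores, lambda x: x * 10))  # [1.0, 20.0, 3.0, 40.0]
```

With `keep_ratio=0.5`, only the two highest-scoring tokens are transformed; the others skip the layer entirely, which is where the compute savings come from.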
## Results

Router-Tuning consistently improves the efficiency-quality tradeoff over full-parameter MoD tuning baselines. The reported best setting reaches a notable speedup while keeping quality degradation small. Router specialization becomes clearer after tuning: the model learns more stable token-to-layer routing patterns, which supports dynamic-depth execution with less unnecessary computation. Router-Tuning is also compatible with LoRA-based adaptation and can be composed with it for a better efficiency-performance balance. In practice, this enables lightweight deployment recipes without full-model retraining.

## Key Features

- Router-Only Fine-Tuning
  - Tunes router-related parameters instead of performing full-model updates.
  - Substantially reduces optimization cost for dynamic-depth adaptation.
- MoD Attention Routing
  - Uses attention-based routing granularity to improve compute and memory efficiency.
  - Preserves output quality under dynamic-depth execution.
## Repository Layout

- `entrypoints/finetune/finetune_mod.py`: main training entrypoint.
- `scripts/finetune_mod.sh`: reproducible launcher with `accelerate` + DeepSpeed.
- `entrypoints/data/reformat_datasets.py`: convert raw datasets to the unified `messages` format.
- `entrypoints/data/mix_datasets.py`: build mixed instruction-tuning data.
- `utils/pipeline/customized_trainer.py`: router-focused trainer logic.
- `configs/accelerate/`: distributed training launcher configs.
- `configs/deepspeed/`: DeepSpeed runtime configs.
- `ckpt/`: model config/tokenizer files for supported MoD variants.
## Quick Start

### Installation

```bash
conda create -n router-tuning python=3.10 -y
conda activate router-tuning
git clone https://github.com/CASE-Lab-UMD/Router-Tuning-Mixture-of-Depths.git
cd Router-Tuning-Mixture-of-Depths
pip install -r requirements.txt
```

### Data Preparation

Put raw datasets under `data/raw/` using the expected subdirectory names:

- `vicuna_sharegpt`
- `evol_instruct`
- `slim_orca`
- `meta_math_qa`
- `evol_code_alpaca`
- `alpaca`
Then run:
```bash
python entrypoints/data/reformat_datasets.py \
    --raw_data_root ./data/raw \
    --save_path ./data/reformatted
```
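The reformatted files use the unified `messages` format. The exact schema is repo-specific; assuming chat-style `{role, content}` turns in JSONL (an assumption, not the repo's documented schema), a minimal validation sketch looks like:

```python
import json

def validate_messages_jsonl(lines):
    """Count well-formed vs. malformed records, where a well-formed record
    is assumed to carry a non-empty `messages` list of {role, content} dicts."""
    ok, bad = 0, 0
    for raw in lines:
        try:
            record = json.loads(raw)
            msgs = record["messages"]
            assert isinstance(msgs, list) and msgs
            assert all(
                isinstance(m, dict) and {"role", "content"} <= m.keys()
                for m in msgs
            )
            ok += 1
        except (json.JSONDecodeError, KeyError, AssertionError):
            bad += 1
    return ok, bad

sample = [
    json.dumps({"messages": [
        {"role": "user", "content": "Hi"},
        {"role": "assistant", "content": "Hello!"},
    ]}),
    '{"prompt": "missing messages key"}',
]
print(validate_messages_jsonl(sample))  # (1, 1)
```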
```bash
python entrypoints/data/mix_datasets.py \
    --reformatted_dir ./data/reformatted \
    --save_path ./data/mixed
```

### Training

```bash
bash scripts/finetune_mod.sh
```

To override the distributed launch settings:

```bash
NUM_PROCESSES=4 PORT=29501 bash scripts/finetune_mod.sh
```

`finetune_mod.sh` is the recommended launcher. Commonly adjusted fields:
- `folder_name`: base checkpoint directory under `ckpt/`.
- `data_type`: one dataset under `data/reformatted/`, or `mixed`.
- `max_train_samples`: training subset size for quick experiments.
- `mod_n`: MoD keep-ratio control (the legacy alias `mindskip_n` is still supported in the Python entrypoint).
- `granularity`: routing granularity (`attn_sequence` or `mlp_sequence`).
- `router_only`: enable router-only training (default `True`).
- `learning_rate`, `weight_decay`, `num_epochs`.
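The `router_only` setting corresponds to training only router-related parameters while freezing everything else. A minimal sketch of that selection logic, assuming router parameters can be identified by a name substring (the repo's trainer may use a different matching rule):

```python
def select_trainable(param_names, router_keywords=("router",)):
    """Split parameter names into trainable (router-related) and frozen.
    Substring matching on the name is an assumption for illustration."""
    trainable = [n for n in param_names if any(k in n for k in router_keywords)]
    frozen = [n for n in param_names if n not in trainable]
    return trainable, frozen

# Hypothetical parameter names from a MoD-augmented transformer.
names = [
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.0.router.weight",
    "model.layers.1.mlp.router.weight",
    "lm_head.weight",
]
trainable, frozen = select_trainable(names)
print(trainable)  # ['model.layers.0.router.weight', 'model.layers.1.mlp.router.weight']
print(len(frozen))  # 2
```

In a real training loop, the frozen set would have `requires_grad` disabled, which is what makes router-tuning dramatically cheaper than full-model updates.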
Distributed launch overrides:
- `NUM_PROCESSES`: number of GPU processes.
- `PORT`: distributed master port.
| Knob | Where | Typical Values | Effect |
|---|---|---|---|
| `folder_name` | `scripts/finetune_mod.sh` | `mistral-7b-mod`, `qwen-2.5-7b-mod`, `llama3-8b-instruct-mod` | Selects base checkpoint under `ckpt/` |
| `data_type` | `scripts/finetune_mod.sh` | `alpaca`, `mixed`, ... | Chooses training data source |
| `mod_n` | `scripts/finetune_mod.sh` / CLI | `8`, `16`, `32` | Controls dynamic-depth sparsity/keep behavior |
| `granularity` | `scripts/finetune_mod.sh` | `attn_sequence`, `mlp_sequence` | Chooses routing granularity |
| `max_train_samples` | `scripts/finetune_mod.sh` | `1000`, `5000`, all (by removing the cap) | Controls quick debug vs. full tuning |
| `NUM_PROCESSES` | shell env | `1`, `4`, `8` | Number of distributed workers |
| `PORT` | shell env | e.g., `29501` | Master communication port |
## Evaluation

Evaluation is compatible with EleutherAI/lm-evaluation-harness. For strict reproduction of the earlier experiments, see s1ghhh/lm-evaluation-harness.
## Pre-Run Checklist

- Python 3.10 environment with `pip install -r requirements.txt`.
- Valid local model path under `ckpt/` (or customize `folder_name`).
- Reformatted/mixed data exists at `data/reformatted/*/data.jsonl` or `data/mixed/data.jsonl`.
- The `accelerate` config selected in `scripts/finetune_mod.sh` matches your hardware.
- `NUM_PROCESSES` and GPU memory are consistent with `max_seq_length` and the batch setup.
## Citation

```bibtex
@misc{he2024routertuningsimpleeffectiveapproach,
      title={Router-Tuning: A Simple and Effective Approach for Enabling Dynamic-Depth in Transformers},
      author={Shwai He and Tao Ge and Guoheng Sun and Bowei Tian and Xiaoyang Wang and Dong Yu},
      year={2024},
      eprint={2410.13184},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2410.13184}
}
```

## Contact

- Shwai He: shwaihe@umd.edu


