- Nov 19, 2025: 🎉 We release the Demo Page
- Nov 19, 2025: 👋 We release the technical report of Step-Audio-R1.
- Inference Code (vLLM)
- Online demo (Gradio)
- Model Checkpoints
Step-Audio-R1 is the first audio language model to successfully unlock test-time compute scaling. It decisively solves the "inverted scaling" anomaly plaguing existing models, where performance paradoxically degrades with longer reasoning chains.
We identify the root cause of this failure as Textual Surrogate Reasoning: because of their text-based initialization, conventional models reason over linguistic abstractions (e.g., analyzing transcripts) rather than over genuine acoustic properties. To resolve this modality mismatch, we introduce Modality-Grounded Reasoning Distillation (MGRD), an iterative training framework that shifts the model's reasoning focus from textual surrogates to acoustic analysis.
This new approach allows us to create Step-Audio-R1, which:
- Is the first audio reasoning model that successfully benefits from test-time compute scaling.
- Surpasses Gemini 2.5 Pro and is comparable to Gemini 3 across comprehensive audio benchmarks.
- Transforms extended deliberation from a liability into a powerful asset for audio intelligence.
Step-Audio-R1 builds on the architecture of our previous Step-Audio 2 and consists of three main components (a minimal sketch of the data flow follows the list):
- Audio Encoder: We use the pre-trained Qwen2 audio encoder. It operates at a 25 Hz frame rate and is frozen during training.
- Audio Adaptor: A simple adaptor (identical to the one in Step-Audio 2) connects the encoder to the LLM and downsamples the feature frame rate to 12.5 Hz.
- LLM Decoder: We use Qwen2.5 32B as the core reasoning component. It directly takes the latent audio features from the adaptor to generate a purely textual output (first the reasoning, then the final reply).
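To make the data flow concrete, here is a minimal PyTorch-style sketch of the encoder → adaptor → decoder pipeline. The class names, the frame-stacking downsampler, and the embedding concatenation are illustrative assumptions for exposition, not the repository's actual implementation; only the 25 Hz → 12.5 Hz rate change and the frozen encoder are taken from the description above.

```python
import torch
import torch.nn as nn


class AudioAdaptor(nn.Module):
    """Illustrative adaptor: projects encoder features to the LLM hidden size
    and halves the frame rate (25 Hz -> 12.5 Hz) by stacking adjacent frames."""

    def __init__(self, enc_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(enc_dim * 2, llm_dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, T, enc_dim) at 25 Hz
        b, t, d = feats.shape
        t = t - (t % 2)                               # drop a trailing frame if T is odd
        pairs = feats[:, :t].reshape(b, t // 2, d * 2)
        return self.proj(pairs)                       # (batch, T/2, llm_dim) at 12.5 Hz


class StepAudioR1Sketch(nn.Module):
    """Frozen audio encoder -> adaptor -> LLM decoder that emits text
    (reasoning first, then the final reply)."""

    def __init__(self, audio_encoder: nn.Module, adaptor: nn.Module, llm: nn.Module):
        super().__init__()
        self.audio_encoder = audio_encoder
        for p in self.audio_encoder.parameters():     # encoder stays frozen during training
            p.requires_grad = False
        self.adaptor = adaptor
        self.llm = llm

    def forward(self, audio: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            feats = self.audio_encoder(audio)         # (batch, T, enc_dim) at 25 Hz
        audio_embeds = self.adaptor(feats)            # (batch, T/2, llm_dim) at 12.5 Hz
        # Latent audio embeddings are prepended to the text prompt embeddings,
        # and the LLM produces a purely textual output.
        inputs = torch.cat([audio_embeds, text_embeds], dim=1)
        return self.llm(inputs)
```

Here the 2× frame stacking merely stands in for whatever downsampling the real adaptor performs; the concrete encoder, adaptor, and LLM modules would be the Qwen2 audio encoder, the Step-Audio 2 adaptor, and Qwen2.5 32B, respectively.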
The key innovation is our training method, Modality-Grounded Reasoning Distillation (MGRD). This process iteratively refines the model's thoughts, progressively strengthening their connection to the underlying audio features until they evolve into genuinely native audio thinking.
This ensures the model's reasoning is not merely about the transcribed text but is deeply grounded in the acoustic nuances of the audio itself.
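The description above maps naturally onto an iterative self-distillation loop: sample reasoning traces from the current model, keep only those grounded in acoustic evidence, and fine-tune on them. The sketch below is a hypothetical outline of that loop; `generate_traces`, `is_acoustically_grounded`, and `finetune` are illustrative placeholders rather than functions from this repository, and the filtering criterion is an assumption based on the prose.

```python
from typing import Callable, Iterable, List


def mgrd_loop(
    model,
    dataset: Iterable,                   # items exposing .audio, .question, .answer
    generate_traces: Callable,           # (model, audio, question) -> list of reasoning traces
    is_acoustically_grounded: Callable,  # trace -> bool: does it cite acoustic evidence?
    finetune: Callable,                  # (model, traces) -> updated model
    num_iterations: int = 3,
):
    """Hypothetical MGRD loop: iteratively distill the model on its own
    acoustically grounded reasoning traces."""
    for _ in range(num_iterations):
        keep: List = []
        for ex in dataset:
            # 1. Sample reasoning chains conditioned on the raw audio features.
            traces = generate_traces(model, ex.audio, ex.question)
            # 2. Retain traces that reference acoustic evidence (prosody, speaker
            #    traits, background sounds, ...) and reach the correct answer;
            #    drop purely transcript-based ("textual surrogate") chains.
            keep.extend(
                t for t in traces
                if is_acoustically_grounded(t) and t.answer == ex.answer
            )
        # 3. Fine-tune on the retained traces, shifting the model's reasoning
        #    toward the audio modality before the next round.
        model = finetune(model, keep)
    return model
```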
If you find our paper and code useful in your research, please consider giving a star ⭐ and citation 📝 :)