Step-Audio-Edit-Benchmark

Introduction

Evaluating controllable speech synthesis remains challenging due to the lack of a single comprehensive benchmark that simultaneously covers fine-grained attributes like emotion, speaking style, and paralinguistics.

We introduce Step-Audio-Edit-Benchmark, an evaluation framework covering emotion, speaking style, and paralinguistics, first presented in the Step-Audio-EditX technical report.

Dataset

Prompt Audios

We selected 8 speakers in total (4 Chinese and 4 English), balanced with two males and two females per language. The Chinese data is sourced from WenetSpeech4TTS, while the English data comes from GLOBE_V2 and Libri-Light.

Additionally, we provide the long audio samples referenced in our Step-Audio-EditX paper, which are intended for evaluating the voice cloning capabilities of closed-source models.

lang	speaker	gender	source
zh	Y0000004339_A_SMNK0c4uM_S00403-S00406	Female	WenetSpeech4TTS
zh	X0000015410_331546220_S00073-S00074	Female	WenetSpeech4TTS
zh	X0000005119_6330761_S01227-S01229	Male	WenetSpeech4TTS
zh	X0000000863_279853194_S00611-S00612	Male	WenetSpeech4TTS
en	7859-102518-0004	Female	Libri-Light
en	20870	Female	GLOBE_V2
en	167	Male	GLOBE_V2
en	502292	Male	GLOBE_V2

Transcripts

The transcript files comprise emotion.jsonl, style.jsonl, and paralinguistic.jsonl.

emotion

The emotion data consists of 2,000 text samples, covering five distinct emotional categories: Happy, Sad, Angry, Surprised, and Fearful.

speaking style

The speaking style dataset comprises 2,800 text samples, encompassing seven distinct styles: Child, Exaggerated, Recite, Generous, Act Coy, Older, and Whisper.

paralinguistic

The paralinguistic data features 4,000 text samples, covering phenomena such as Breathing, Laughter, and Surprise-oh, among others.

The data fields are structured as follows:

{
    "id": "happy-X0000015410_331546220_S00073-S00074-0", 
    "speaker": "X0000015410_331546220_S00073-S00074", 
    "gen_text": "三年没见，你终于回国了！走走走，今晚我请客，咱们不醉不归！", 
    "prompt_audio": "prompt_audios/X0000015410_331546220_S00073-S00074.wav", 
    "prompt_text": "为什么不要随便把父母接到身边，撒贝宁给出了自己的解释。", 
    "lang": "zh", 
    "task": "emotion", 
    "task_sub": "happy",
    "audio_path": "gen_audio/iter_0/happy-X0000015410_331546220_S00073-S00074-0.wav", # Not present in original JSONL; generated by step-audio-editx
    "gemini_res": "happy"   # Not present in original JSONL; generated by Gemini
}
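
A minimal sketch of reading such records in Python, assuming one JSON object per line as in the example above. The names `load_records` and `REQUIRED_FIELDS` are hypothetical and used only for illustration; they are not part of the repository.

```python
import io
import json
from collections import Counter

# Fields shown in the example record that exist in the original JSONL.
# audio_path and gemini_res are added later, so they are not required here.
REQUIRED_FIELDS = {"id", "speaker", "gen_text", "prompt_audio",
                   "prompt_text", "lang", "task", "task_sub"}

def load_records(fp):
    """Parse a JSONL stream and check each record for the required fields."""
    records = []
    for line_no, line in enumerate(fp, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        rec = json.loads(line)
        missing = REQUIRED_FIELDS - rec.keys()
        if missing:
            raise ValueError(f"line {line_no}: missing fields {sorted(missing)}")
        records.append(rec)
    return records

# In-memory stand-in for a transcript file, built from the example record.
sample_record = {
    "id": "happy-X0000015410_331546220_S00073-S00074-0",
    "speaker": "X0000015410_331546220_S00073-S00074",
    "gen_text": "三年没见，你终于回国了！走走走，今晚我请客，咱们不醉不归！",
    "prompt_audio": "prompt_audios/X0000015410_331546220_S00073-S00074.wav",
    "prompt_text": "为什么不要随便把父母接到身边，撒贝宁给出了自己的解释。",
    "lang": "zh",
    "task": "emotion",
    "task_sub": "happy",
}
sample = io.StringIO(json.dumps(sample_record, ensure_ascii=False) + "\n")

records = load_records(sample)
counts = Counter((r["task"], r["task_sub"]) for r in records)
print(counts)
```

To read a real file, pass a handle opened with `encoding="utf-8"` instead of the in-memory sample.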

Evaluation Methodology

We utilize Gemini-2.5-Pro as the core model for all our evaluation tasks. The assessment protocol and metrics vary based on the specific attribute:

Metric

We employ Classification Accuracy for the evaluation of Emotion and Speaking Style. For Paralinguistics, a Rating Score is used.

Procedure for Emotion and Style

For these tasks, Gemini-2.5-Pro performs a forced-choice classification. The model is provided with a fixed set of predefined categories and is instructed to select the most suitable label after listening to the audio sample.
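
As a rough illustration of the metric, the sketch below computes classification accuracy by comparing the label stored in `gemini_res` with the target in `task_sub`. The case-insensitive comparison and the function name `classification_accuracy` are assumptions made for this example, not necessarily what the repository scripts do.

```python
def classification_accuracy(records):
    """Fraction of records whose Gemini label equals the target label."""
    if not records:
        return 0.0
    hits = sum(
        r["gemini_res"].strip().lower() == r["task_sub"].strip().lower()
        for r in records
    )
    return hits / len(records)

# Illustrative records only; real ones come from the Gemini result JSONL.
demo = [
    {"task_sub": "happy", "gemini_res": "happy"},
    {"task_sub": "sad", "gemini_res": "Sad"},
    {"task_sub": "angry", "gemini_res": "surprised"},
    {"task_sub": "fearful", "gemini_res": "fearful"},
]
print(classification_accuracy(demo))  # 0.75
```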

Procedure for Paralinguistics

Conversely, the Paralinguistic task employs a scoring methodology. The model is asked to rate the paralinguistic phenomenon in the audio on a defined scale, specifically ranging from 1 to 3 points.
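
A minimal sketch of aggregating such ratings, assuming each paralinguistic result can be read as an integer between 1 and 3. Averaging is one plausible aggregation; the actual one used by the repository script is not specified here, and `mean_rating` is a hypothetical name.

```python
from statistics import mean

def mean_rating(scores, low=1, high=3):
    """Average a list of ratings after checking they fall on the scale."""
    values = [int(s) for s in scores]  # accept "3" as well as 3
    for v in values:
        if not low <= v <= high:
            raise ValueError(f"rating {v} outside [{low}, {high}]")
    return mean(values)

print(mean_rating([3, 2, 3, 1]))  # 2.25
```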

Detailed prompt content and specific instructions provided to the model can be found in the gemini_prompt.json file.

Script

Run the evaluation scripts only after the data generation process has finished.
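
One way to confirm that generation has finished is to check that every record carries an `audio_path` pointing at an existing file. The helper below is a hypothetical pre-flight check, not part of the repository; it assumes `audio_path` is relative to a root directory you supply.

```python
import json
import os
import tempfile

def find_missing_audio(jsonl_path, root="."):
    """Return ids of records lacking an audio_path or whose file is absent."""
    missing = []
    with open(jsonl_path, encoding="utf-8") as fp:
        for line in fp:
            if not line.strip():
                continue
            rec = json.loads(line)
            path = rec.get("audio_path")
            if not path or not os.path.isfile(os.path.join(root, path)):
                missing.append(rec.get("id"))
    return missing

# Self-contained demo: record "a" has audio on disk, record "b" does not.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "gen_audio"))
    open(os.path.join(tmp, "gen_audio", "a.wav"), "wb").close()
    jsonl = os.path.join(tmp, "dataset.jsonl")
    with open(jsonl, "w", encoding="utf-8") as fp:
        fp.write(json.dumps({"id": "a", "audio_path": "gen_audio/a.wav"}) + "\n")
        fp.write(json.dumps({"id": "b"}) + "\n")
    missing = find_missing_audio(jsonl, root=tmp)

print(missing)  # ['b']
```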

gemini_infer.py

When generating audio, step-audio-editx appends an audio_path field to the JSONL file to record the location of the generated file. This script reads the updated JSONL and has Gemini evaluate each generated audio file according to the specified task.
```
# Assumes your Gemini API key is exported as GEMINI_API_KEY
python3 script/gemini_infer.py \
  --input_jsonl dataset.jsonl \
  --task_type emotion \
  --api_key "${GEMINI_API_KEY}" \
  --prompt_file script/gemini_prompt.json \
  --num_workers 10
```

get_gemini_emotion_style_acc.py

Calculates the accuracy for emotion and speaking style tasks.

python3 get_gemini_emotion_style_acc.py \
  --gemini_res_jsonl dataset_gemini.jsonl \
  --iters "0,1,2,3" \
  --output_excel dataset_gemini.xlsx
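
The `--iters` argument suggests that accuracy is reported per editing iteration. How the script identifies iterations is not documented here; the sketch below assumes the index can be read from the `iter_N` directory in `audio_path`, as in the example record. `accuracy_by_iter` is a hypothetical name.

```python
import re
from collections import defaultdict

ITER_RE = re.compile(r"iter_(\d+)")  # assumed naming, taken from the example path

def accuracy_by_iter(records):
    """Group records by iteration index and compute accuracy for each group."""
    totals = defaultdict(lambda: [0, 0])  # iteration -> [hits, count]
    for r in records:
        m = ITER_RE.search(r["audio_path"])
        if m is None:
            continue  # skip records without a recognizable iteration
        it = int(m.group(1))
        totals[it][0] += r["gemini_res"] == r["task_sub"]
        totals[it][1] += 1
    return {it: hits / n for it, (hits, n) in sorted(totals.items())}

# Illustrative records only.
demo_iters = [
    {"audio_path": "gen_audio/iter_0/a.wav", "task_sub": "happy", "gemini_res": "sad"},
    {"audio_path": "gen_audio/iter_0/b.wav", "task_sub": "sad", "gemini_res": "sad"},
    {"audio_path": "gen_audio/iter_1/a.wav", "task_sub": "happy", "gemini_res": "happy"},
    {"audio_path": "gen_audio/iter_1/b.wav", "task_sub": "sad", "gemini_res": "sad"},
]
print(accuracy_by_iter(demo_iters))  # {0: 0.5, 1: 1.0}
```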

get_gemini_paralingustic_score.py

Calculates the scoring metrics for paralinguistic tasks.

python3 get_gemini_paralingustic_score.py \
  --gemini_res_jsonl paralingustic_res.jsonl \
  --output_excel paralingustic_metric.xlsx

Acknowledgements

The prompt audio for this project is sourced from:

WenetSpeech4TTS

Libri-Light

GLOBE_V2

We sincerely thank the authors of these open-source projects for their contributions!

License Agreement

The code in this open-source repository is licensed under the Apache 2.0 License.

Citation

@misc{yan2025stepaudioeditxtechnicalreport,
      title={Step-Audio-EditX Technical Report}, 
      author={Chao Yan and Boyong Wu and Peng Yang and Pengfei Tan and Guoqiang Hu and Yuxin Zhang and Xiangyu (Tony) Zhang and Fei Tian and Xuerui Yang and Xiangyu Zhang and Daxin Jiang and Gang Yu},
      year={2025},
      eprint={2511.03601},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2511.03601}, 
}
