Evaluating controllable speech synthesis remains challenging due to the lack of a single comprehensive benchmark that simultaneously covers fine-grained attributes like emotion, speaking style, and paralinguistics.
We present step-audio-edit-benchmark, a comprehensive evaluation framework covering emotion, speaking style, and paralinguistics, as described in the Step-Audio-EditX technical report.
We selected 8 speakers in total (4 Chinese and 4 English), balanced with two males and two females per language. The Chinese data is sourced from WenetSpeech4TTS, while the English data comes from GLOBE_V2 and Libri-Light.
Additionally, we provide the long audio samples referenced in our paper Step-Audio-EditX, designed for evaluating the voice cloning capabilities of closed-source models.
| lang | speaker | gender | from |
|---|---|---|---|
| zh | Y0000004339_A_SMNK0c4uM_S00403-S00406 | Female | WenetSpeech4TTS |
| zh | X0000015410_331546220_S00073-S00074 | Female | WenetSpeech4TTS |
| zh | X0000005119_6330761_S01227-S01229 | Male | WenetSpeech4TTS |
| zh | X0000000863_279853194_S00611-S00612 | Male | WenetSpeech4TTS |
| en | 7859-102518-0004 | Female | Libri-Light |
| en | 20870 | Female | GLOBE_V2 |
| en | 167 | Male | GLOBE_V2 |
| en | 502292 | Male | GLOBE_V2 |
The transcript files comprise emotion.jsonl, style.jsonl, and paralinguistic.jsonl.
- emotion: The emotion data consists of 2,000 text samples, covering five distinct emotional categories: Happy, Sad, Angry, Surprised, and Fearful.
- speaking style: The speaking style dataset comprises 2,800 text samples, encompassing seven distinct styles: Child, Exaggerated, Recite, Generous, Act Coy, Older, and Whisper.
- paralinguistic: The paralinguistic data features 4,000 text samples, covering phenomena such as Breathing, Laughter, and Surprise-oh, among others.
The data fields are structured as follows:
{
"id": "happy-X0000015410_331546220_S00073-S00074-0",
"speaker": "X0000015410_331546220_S00073-S00074",
"gen_text": "三年没见,你终于回国了!走走走,今晚我请客,咱们不醉不归!",
"prompt_audio": "prompt_audios/X0000015410_331546220_S00073-S00074.wav",
"prompt_text": "为什么不要随便把父母接到身边,撒贝宁给出了自己的解释。",
"lang": "zh",
"task": "emotion",
"task_sub": "happy",
"audio_path": "gen_audio/iter_0/happy-X0000015410_331546220_S00073-S00074-0.wav", # Not present in original JSONL; generated by step-audio-editx
"gemini_res": "happy" # Not present in original JSONL; generated by Gemini
}
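As a minimal sketch of how records in this layout might be read and sanity-checked, the snippet below parses JSONL text and counts samples per task and sub-task. The helper names `parse_jsonl` and `count_by_subtask` are hypothetical and not part of this repository; the required field set is taken from the example record above, and `audio_path` and `gemini_res` are treated as optional because they are added later.

```python
import json
from collections import Counter

# Fields taken from the example record; audio_path and gemini_res are added
# later by generation and evaluation, so they are not required here.
REQUIRED_FIELDS = {"id", "speaker", "gen_text", "prompt_audio",
                   "prompt_text", "lang", "task", "task_sub"}


def parse_jsonl(text):
    """Parse JSONL text into a list of dicts, skipping blank lines."""
    records = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        record = json.loads(line)
        missing = REQUIRED_FIELDS.difference(record)
        if missing:
            raise ValueError(f"line {line_no}: missing fields {sorted(missing)}")
        records.append(record)
    return records


def count_by_subtask(records):
    """Count records per (task, task_sub) pair."""
    return Counter((r["task"], r["task_sub"]) for r in records)


# Illustrative record with shortened, made-up values (not real dataset content).
sample = json.dumps({
    "id": "happy-spk-0", "speaker": "spk", "gen_text": "text",
    "prompt_audio": "prompt_audios/spk.wav", "prompt_text": "prompt",
    "lang": "zh", "task": "emotion", "task_sub": "happy",
})
records = parse_jsonl(sample + "\n\n")
counts = count_by_subtask(records)
```

For a real file, the text could be obtained with `open("emotion.jsonl", encoding="utf-8").read()` before calling `parse_jsonl`.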
We utilize Gemini-2.5-Pro as the core model for all our evaluation tasks. The assessment protocol and metrics vary based on the specific attribute:
- Metric: We employ Classification Accuracy for the evaluation of Emotion and Speaking Style. For Paralinguistics, a Rating Score is used.
- Procedure for Emotion and Style: For these tasks, Gemini-2.5-Pro performs a forced-choice classification. The model is provided with a fixed set of predefined categories and is instructed to select the most suitable label after listening to the audio sample.
- Procedure for Paralinguistics: Conversely, the Paralinguistic task employs a scoring methodology. The model is asked to rate the paralinguistic phenomenon in the audio on a defined scale, specifically ranging from 1 to 3 points.
Detailed prompt content and specific instructions provided to the model can be found in the gemini_prompt.json file.
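The two metrics can be illustrated with a short sketch. It is not the repository's implementation: it assumes that `gemini_res` holds the predicted label for emotion and style records (as in the example record) and, for paralinguistic records, a numeric rating between 1 and 3. The second assumption is not confirmed by the source. The function names are hypothetical.

```python
def classification_accuracy(records):
    """Fraction of judged records whose predicted label matches task_sub."""
    judged = [r for r in records if "gemini_res" in r]
    if not judged:
        return 0.0
    correct = sum(
        r["gemini_res"].strip().lower() == r["task_sub"].strip().lower()
        for r in judged
    )
    return correct / len(judged)


def mean_rating(records):
    """Average 1-3 rating over judged records (assumed numeric gemini_res)."""
    ratings = [float(r["gemini_res"]) for r in records if "gemini_res" in r]
    for rating in ratings:
        if not 1 <= rating <= 3:
            raise ValueError(f"rating out of range: {rating}")
    return sum(ratings) / len(ratings) if ratings else 0.0


# Made-up judgments for illustration only.
emotion_records = [
    {"task_sub": "happy", "gemini_res": "happy"},
    {"task_sub": "sad", "gemini_res": "angry"},
]
paralinguistic_records = [{"gemini_res": 3}, {"gemini_res": 2}]

accuracy = classification_accuracy(emotion_records)
score = mean_rating(paralinguistic_records)
```

Records without a `gemini_res` field are skipped in both functions, so unevaluated samples do not count against the model in this sketch.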
Run the evaluation scripts only after the data generation process has completed.
- gemini_infer.py: When generating audio, step-audio-editx appends an audio_path field to the JSONL file to record the location of the generated file. This script allows Gemini to read the updated JSONL and perform evaluation based on the specific task.

      python3 script/gemini_infer.py \
        --input_jsonl dataset.jsonl \
        --task_type emotion \
        --api_key ${Your Gemini Key} \
        --prompt_file script/gemini_prompt.json \
        --num_workers 10

- get_gemini_emotion_style_acc.py: Calculates the accuracy for emotion and speaking style tasks.

      python3 get_gemini_emotion_style_acc.py \
        --gemini_res_jsonl dataset_gemini.jsonl \
        --iters "0,1,2,3" \
        --output_excel dataset_gemini.xlsx

- get_gemini_paralingustic_score.py: Calculates the scoring metrics for paralinguistic tasks.

      python3 get_gemini_paralingustic_score.py \
        --gemini_res_jsonl paralingustic_res.jsonl \
        --output_excel paralingustic_metric.xlsx
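For models other than step-audio-editx, the `audio_path` field has to be added to each record before evaluation. The sketch below shows one way to do that, assuming the `gen_audio/iter_<n>/<id>.wav` layout seen in the example record; the layout for other setups and the helper names `parse_iters` and `with_audio_path` are assumptions, not part of the provided scripts.

```python
from pathlib import PurePosixPath


def parse_iters(spec):
    """Turn a comma-separated string such as "0,1,2,3" into a list of ints."""
    return [int(part) for part in spec.split(",") if part.strip()]


def with_audio_path(record, iteration, root="gen_audio"):
    """Return a copy of the record with audio_path set for one iteration."""
    path = PurePosixPath(root) / f"iter_{iteration}" / f"{record['id']}.wav"
    return {**record, "audio_path": str(path)}


# Illustrative record with a shortened, made-up id.
record = {"id": "happy-spk-0", "task": "emotion", "task_sub": "happy"}
updated = [with_audio_path(record, i) for i in parse_iters("0,1")]
```

Each updated record can then be written back with `json.dumps(..., ensure_ascii=False)`, one per line, so that Chinese text stays readable in the JSONL file.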
The prompt audio for this project is sourced from WenetSpeech4TTS, GLOBE_V2, and Libri-Light.
We sincerely thank the authors of these open-source projects for their contributions!
- The code in this open-source repository is licensed under the Apache 2.0 License.
@misc{yan2025stepaudioeditxtechnicalreport,
title={Step-Audio-EditX Technical Report},
author={Chao Yan and Boyong Wu and Peng Yang and Pengfei Tan and Guoqiang Hu and Yuxin Zhang and Xiangyu Zhang and Fei Tian and Xuerui Yang and Xiangyu Zhang and Daxin Jiang and Gang Yu},
year={2025},
eprint={2511.03601},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2511.03601},
}
