Step-Audio-Edit-Benchmark

Introduction

Evaluating controllable speech synthesis remains challenging due to the lack of a single comprehensive benchmark that simultaneously covers fine-grained attributes like emotion, speaking style, and paralinguistics.

We introduce Step-Audio-Edit-Benchmark, a unified evaluation framework covering emotion, speaking style, and paralinguistics, first presented in the Step-Audio-EditX technical report.

Dataset

Prompt Audios

We selected 8 speakers in total (4 Chinese and 4 English), balanced with two males and two females per language. The Chinese data is sourced from WenetSpeech4TTS, while the English data comes from GLOBE_V2 and Libri-Light.

Additionally, we provide the long audio samples referenced in our paper Step-Audio-EditX, designed for evaluating the voice cloning capabilities of closed-source models.

lang  speaker                                 gender  from
zh    Y0000004339_A_SMNK0c4uM_S00403-S00406   Female  WenetSpeech4TTS
zh    X0000015410_331546220_S00073-S00074     Female  WenetSpeech4TTS
zh    X0000005119_6330761_S01227-S01229       Male    WenetSpeech4TTS
zh    X0000000863_279853194_S00611-S00612     Male    WenetSpeech4TTS
en    7859-102518-0004                        Female  Libri-Light
en    20870                                   Female  GLOBE_V2
en    167                                     Male    GLOBE_V2
en    502292                                  Male    GLOBE_V2

Transcripts

The transcript files comprise emotion.jsonl, style.jsonl, and paralinguistic.jsonl.

  • emotion

    The emotion data consists of 2,000 text samples, covering five distinct emotional categories: Happy, Sad, Angry, Surprised, and Fearful.

  • speaking style

    The speaking style dataset comprises 2,800 text samples, encompassing seven distinct styles: Child, Exaggerated, Recite, Generous, Act Coy, Older, and Whisper.

  • paralinguistic

    The paralinguistic data features 4,000 text samples, covering phenomena such as Breathing, Laughter, and Surprise-oh, among others.

The data fields are structured as follows:

{
    "id": "happy-X0000015410_331546220_S00073-S00074-0", 
    "speaker": "X0000015410_331546220_S00073-S00074", 
    "gen_text": "三年没见,你终于回国了!走走走,今晚我请客,咱们不醉不归!", 
    "prompt_audio": "prompt_audios/X0000015410_331546220_S00073-S00074.wav", 
    "prompt_text": "为什么不要随便把父母接到身边,撒贝宁给出了自己的解释。", 
    "lang": "zh", 
    "task": "emotion", 
    "task_sub": "happy",
    "audio_path": "gen_audio/iter_0/happy-X0000015410_331546220_S00073-S00074-0.wav", # Not present in original JSONL; generated by step-audio-editx
    "gemini_res": "happy"   # Not present in original JSONL; generated by Gemini
}
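The record above can be loaded with a few lines of Python. Note that the trailing `#` comments in the example are annotations only; actual JSONL lines are plain JSON. The helper below is an illustrative sketch (not part of the repository's scripts), shown here on two abridged, hypothetical records:

```python
import json
from collections import Counter
from io import StringIO

def load_jsonl(fp):
    """Parse JSON Lines from an open file or file-like object."""
    return [json.loads(line) for line in fp if line.strip()]

# Hypothetical records mirroring the documented fields (values abridged).
example = (
    '{"id": "happy-spk-0", "task": "emotion", "task_sub": "happy", "lang": "zh"}\n'
    '{"id": "sad-spk-0", "task": "emotion", "task_sub": "sad", "lang": "zh"}\n'
)
samples = load_jsonl(StringIO(example))
counts = Counter(s["task_sub"] for s in samples)  # samples per sub-task
```

In practice you would pass an open file handle for emotion.jsonl, style.jsonl, or paralinguistic.jsonl instead of the `StringIO` stand-in.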

Evaluation Methodology

We utilize Gemini-2.5-Pro as the core model for all our evaluation tasks. The assessment protocol and metrics vary based on the specific attribute:

  • Metric

    We employ Classification Accuracy for the evaluation of Emotion and Speaking Style. For Paralinguistics, a Rating Score is used.

  • Procedure for Emotion and Style

    For these tasks, Gemini-2.5-Pro performs a forced-choice classification. The model is provided with a fixed set of predefined categories and is instructed to select the most suitable label after listening to the audio sample.

  • Procedure for Paralinguistics

The Paralinguistic task, in contrast, employs a scoring methodology: the model rates how well the paralinguistic phenomenon is realized in the audio on a scale from 1 to 3.

Detailed prompt content and the specific instructions provided to the model can be found in script/gemini_prompt.json.
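Given the field names from the example record above (`task_sub` as the target label, `gemini_res` as Gemini's output), the two metrics reduce to a label match rate and a mean rating. This is a minimal sketch of that computation, not the repository's actual scoring code:

```python
def emotion_style_accuracy(records):
    """Fraction of records where Gemini's predicted label equals the target sub-task."""
    hits = sum(1 for r in records if r["gemini_res"] == r["task_sub"])
    return hits / len(records)

def paralinguistic_mean_score(records):
    """Mean of the 1-3 ratings returned by Gemini for the paralinguistic task."""
    return sum(int(r["gemini_res"]) for r in records) / len(records)

# Hypothetical classification results: one correct, one incorrect.
records = [
    {"task_sub": "happy", "gemini_res": "happy"},
    {"task_sub": "sad", "gemini_res": "angry"},
]
acc = emotion_style_accuracy(records)  # 0.5
```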

Script

Run the evaluation scripts only after audio generation is complete.

  • gemini_infer.py

    When generating audio, step-audio-editx appends an audio_path field to the JSONL file to record the location of the generated file. This script allows Gemini to read the updated JSONL and perform evaluation based on the specific task.

    python3 script/gemini_infer.py \
      --input_jsonl dataset.jsonl \
      --task_type emotion \
      --api_key ${YOUR_GEMINI_KEY} \
      --prompt_file script/gemini_prompt.json \
      --num_workers 10
    
  • get_gemini_emotion_style_acc.py

    Calculates the accuracy for emotion and speaking style tasks.

    python3 get_gemini_emotion_style_acc.py \
      --gemini_res_jsonl dataset_gemini.jsonl \
      --iters "0,1,2,3" \
      --output_excel dataset_gemini.xlsx
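
The `--iters` flag suggests accuracy is reported per editing iteration, and the example record above embeds the iteration in its path (`gen_audio/iter_0/...`). A sketch of per-iteration grouping under that assumption (the actual script may differ):

```python
import re
from collections import defaultdict

def accuracy_by_iter(records):
    """Group records by the iter_N component of audio_path and compute
    classification accuracy per iteration. Assumes paths shaped like
    'gen_audio/iter_0/...' as in the example record."""
    buckets = defaultdict(list)
    for r in records:
        m = re.search(r"iter_(\d+)", r["audio_path"])
        if m:
            buckets[int(m.group(1))].append(r["gemini_res"] == r["task_sub"])
    return {i: sum(v) / len(v) for i, v in sorted(buckets.items())}

# Hypothetical records: iteration 0 misses, iteration 1 hits.
result = accuracy_by_iter([
    {"audio_path": "gen_audio/iter_0/a.wav", "task_sub": "happy", "gemini_res": "sad"},
    {"audio_path": "gen_audio/iter_1/a.wav", "task_sub": "happy", "gemini_res": "happy"},
])
```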
    
  • get_gemini_paralingustic_score.py

    Calculates the scoring metrics for paralinguistic tasks.

    python3 get_gemini_paralingustic_score.py \
      --gemini_res_jsonl paralingustic_res.jsonl \
      --output_excel paralingustic_metric.xlsx
    

Acknowledgements

The prompt audio for this project is sourced from:

WenetSpeech4TTS

Libri-Light

GLOBE_V2

We sincerely thank the authors of these open-source projects for their contributions!

License Agreement

  • The code in this open-source repository is licensed under the Apache 2.0 License.

Citation

@misc{yan2025stepaudioeditxtechnicalreport,
      title={Step-Audio-EditX Technical Report}, 
      author={Chao Yan and Boyong Wu and Peng Yang and Pengfei Tan and Guoqiang Hu and Yuxin Zhang and Xiangyu Zhang and Fei Tian and Xuerui Yang and Xiangyu Zhang and Daxin Jiang and Gang Yu},
      year={2025},
      eprint={2511.03601},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2511.03601}, 
}
