Shuntaro Suzuki1,*, Kento Tokura1,*, Daichi Yashima1,*, Kanon Amemiya1,*, Komei Sugiura1, Shinnosuke Takamichi1
1 Keio University * Equal contribution.
Interspeech 2026
This repository provides the implementation of ELSA, as presented in our paper: "ELSA: Acoustic Event-Level Semantic Alignment for Fine-Grained Reference-Free Text-to-Audio Evaluation". It includes code, dataset preparation instructions, and scripts for evaluation.
ELSA consists of two main components:
- Event-Aware Audio Representation Extractor: Retrieves event-relevant audio segments via a language-queried audio source separation model and encodes them in a shared text–audio embedding space.
- Hierarchical Semantic Alignment Module: Integrates global text–audio similarity and event-level matching to compute the final evaluation score.
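As a rough illustration of how a global similarity and event-level matching could be combined into one score, here is a minimal sketch. It is not the ELSA implementation: the weighted average, the `alpha` parameter, the function names, and the toy vectors standing in for text–audio embeddings are all assumptions made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def hierarchical_score(text_emb, audio_emb, event_text_embs, event_audio_embs, alpha=0.5):
    """Blend a global text-audio similarity with a mean event-level similarity.

    Hypothetical aggregation: `alpha` weights the global term and
    (1 - alpha) weights the event-level term. The actual ELSA scoring
    rule may differ.
    """
    if len(event_text_embs) != len(event_audio_embs):
        raise ValueError("each event needs one text and one audio embedding")
    global_sim = cosine(text_emb, audio_emb)
    if not event_text_embs:
        # No events were extracted, so fall back to the global term only.
        return global_sim
    event_sims = [cosine(t, a) for t, a in zip(event_text_embs, event_audio_embs)]
    event_sim = sum(event_sims) / len(event_sims)
    return alpha * global_sim + (1.0 - alpha) * event_sim


# Toy example: the caption matches globally, but only one of two events is matched.
score = hierarchical_score(
    text_emb=[1.0, 0.0],
    audio_emb=[1.0, 0.0],
    event_text_embs=[[1.0, 0.0], [0.0, 1.0]],
    event_audio_embs=[[1.0, 0.0], [1.0, 0.0]],
)
print(score)
```

In this toy case the unmatched event lowers the score below the global similarity alone, which is the kind of fine-grained penalty an event-level term is meant to capture.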
Requirements:
- uv
- CUDA 11.8+ (for GPU acceleration)
- 12GB+ VRAM (20GB+ recommended for some features)
# Clone the repository
git clone git@github.com:kento2247/TTAEval.git
cd TTAEval
uv sync
# Download pretrained models
sh scripts/download_model.sh
This will download:
- SAM-Audio model for audio segmentation
- CLAP embeddings for semantic understanding
Run a single audio/text pair through the evaluation model:
python src/oneshot.py \
--audio_file_path data/wav/tango/train/23.wav \
--text "A dog barking and a car honking." \
--metric REL
Arguments:
- --audio_file_path: Path to the input audio file.
- --text: Text description of the audio.
- --metric: Evaluation metric, REL or OVL (default: REL).
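To score several audio/text pairs, the one-shot script can be driven from a small wrapper. The sketch below only uses the three flags documented above; the helper names `build_command` and `evaluate_pairs` are hypothetical, and since the output format of `src/oneshot.py` is not documented here, the wrapper returns the raw standard output of each run without parsing it. It assumes it is run from the repository root with the project environment active.

```python
import subprocess
import sys

VALID_METRICS = ("REL", "OVL")


def build_command(audio_file_path, text, metric="REL"):
    """Build the argument list for one call to src/oneshot.py."""
    if metric not in VALID_METRICS:
        raise ValueError(f"metric must be one of {VALID_METRICS}, got {metric!r}")
    return [
        sys.executable, "src/oneshot.py",
        "--audio_file_path", audio_file_path,
        "--text", text,
        "--metric", metric,
    ]


def evaluate_pairs(pairs, metric="REL"):
    """Run the one-shot script once per (audio_path, text) pair.

    Returns the raw stdout of each run; parsing is left to the caller
    because the script's output format is not assumed here.
    """
    outputs = []
    for audio_file_path, text in pairs:
        proc = subprocess.run(
            build_command(audio_file_path, text, metric),
            capture_output=True,
            text=True,
            check=True,  # raise CalledProcessError if the script fails
        )
        outputs.append(proc.stdout.strip())
    return outputs
```

For example, `evaluate_pairs([("data/wav/tango/train/23.wav", "A dog barking and a car honking.")], metric="OVL")` runs the same command as above with the OVL metric. Each call starts a new Python process, so the models are reloaded per pair; this is simple but slow for large batches.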
We gratefully acknowledge the following GitHub repositories for providing valuable code and resources that contributed to this work:
- AudioBERTScore (https://github.com/lourson1091/audiobertscore)
- SAM-Audio (https://github.com/facebookresearch/sam-audio)
@inproceedings{suzuki2026elsa,
title = {ELSA: Acoustic Event-Level Semantic Alignment for Fine-Grained Reference-Free Text-to-Audio Evaluation},
author = {Shuntaro Suzuki and Kento Tokura and Daichi Yashima and Kanon Amemiya and Komei Sugiura and Shinnosuke Takamichi},
year = {2026},
booktitle = {The 27th Interspeech Conference (INTERSPEECH 2026)},
}