ELSA: Acoustic Event-Level Semantic Alignment for Fine-Grained Reference-Free Text-to-Audio Evaluation

Shuntaro Suzuki¹*, Kento Tokura¹*, Daichi Yashima¹*, Kanon Amemiya¹*, Komei Sugiura¹, Shinnosuke Takamichi¹

¹ Keio University, * Equal contribution

Interspeech 2026

💡 About

This repository provides the implementation of ELSA, as presented in our paper: "ELSA: Acoustic Event-Level Semantic Alignment for Fine-Grained Reference-Free Text-to-Audio Evaluation". It includes code, dataset preparation instructions, and scripts for evaluation.

⚙️ Architecture

ELSA consists of two main components:

Event-Aware Audio Representation Extractor: Retrieves event-relevant audio segments via a language-queried audio source separation model and encodes them in a shared text–audio embedding space.
Hierarchical Semantic Alignment Module: Integrates global text–audio similarity and event-level matching to compute the final evaluation score.
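
To make the two-level scoring idea concrete, here is a minimal sketch of how a global similarity and per-event similarities could be blended into one score. It is an illustration only, not the repository's implementation: the function names, the mean pooling over events, and the `alpha` weight are all assumptions, and the embeddings are plain lists standing in for vectors from a shared text–audio embedding space.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hierarchical_score(text_emb, audio_emb, event_text_embs, event_audio_embs, alpha=0.5):
    """Blend global text-audio similarity with the mean per-event similarity.

    Hypothetical helper: `alpha` and the mean pooling are illustrative
    assumptions, not values or choices taken from the paper.
    """
    global_sim = cosine(text_emb, audio_emb)
    # Each event phrase is paired with the audio segment retrieved for it.
    event_sims = [cosine(t, a) for t, a in zip(event_text_embs, event_audio_embs)]
    # With no events detected, fall back to the global similarity alone.
    event_sim = sum(event_sims) / len(event_sims) if event_sims else global_sim
    return alpha * global_sim + (1 - alpha) * event_sim
```

In this sketch a caption can match the audio well overall while still losing score when one of its events has no matching segment, which is the kind of fine-grained mismatch event-level matching is meant to expose.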

🚀 Getting Started

Prerequisites

uv
CUDA 11.8+ (for GPU acceleration)
12GB+ VRAM (20GB+ recommended for some features)
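
The prerequisites above can be checked before installing. The sketch below is a hypothetical helper, not part of this repository: it looks for `uv` on the `PATH` and, when `nvidia-smi` is available, compares the largest GPU's total memory against the 12GB minimum. The threshold in MiB is an approximation, since cards marketed as 12GB may report slightly less.

```python
import shutil
import subprocess

MIN_VRAM_MIB = 12 * 1024  # 12GB minimum from the prerequisites, expressed in MiB


def parse_vram_mib(nvidia_smi_output):
    """Parse one total-memory value (MiB) per GPU, skipping non-numeric lines."""
    values = []
    for line in nvidia_smi_output.splitlines():
        line = line.strip()
        if line.isdigit():
            values.append(int(line))
    return values


def check_prerequisites():
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if shutil.which("uv") is None:
        problems.append("uv not found on PATH")
    if shutil.which("nvidia-smi") is None:
        problems.append("nvidia-smi not found; cannot check GPU memory")
        return problems
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=False,
    )
    totals = parse_vram_mib(result.stdout)
    if not totals:
        problems.append("no GPU memory information reported by nvidia-smi")
    elif max(totals) < MIN_VRAM_MIB:
        problems.append(f"largest GPU has {max(totals)} MiB VRAM; 12GB+ is required")
    return problems
```

Calling `check_prerequisites()` and printing each returned string gives a quick report. It does not verify the CUDA version.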

Installation

# Clone the repository
git clone git@github.com:kento2247/TTAEval.git
cd TTAEval

# Install dependencies into a uv-managed environment
uv sync

Download Pretrained Models

# Download pretrained models
sh scripts/download_model.sh

This will download:

SAM-Audio model for audio segmentation
CLAP embeddings for semantic understanding

🎯 One-shot Evaluation

Run a single audio/text pair through the evaluation model:

python src/oneshot.py \
  --audio_file_path data/wav/tango/train/23.wav \
  --text "A dog barking and a car honking." \
  --metric REL

Arguments:

--audio_file_path: Path to the input audio file.
--text: Text description of the audio.
--metric: Evaluation metric, REL or OVL (default: REL).
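
To score several pairs in one go, the command above can be wrapped in a small driver. This is a sketch under assumptions: `build_command` and `run_batch` are hypothetical helpers that do not exist in the repository, and they use only the three documented flags. Because the output format of `src/oneshot.py` is not documented here, the sketch lets each run print to the terminal instead of parsing a score.

```python
import subprocess

VALID_METRICS = ("REL", "OVL")


def build_command(audio_file_path, text, metric="REL"):
    """Build the argument list for one src/oneshot.py invocation."""
    if metric not in VALID_METRICS:
        raise ValueError(f"metric must be one of {VALID_METRICS}, got {metric!r}")
    return [
        "python", "src/oneshot.py",
        "--audio_file_path", audio_file_path,
        "--text", text,
        "--metric", metric,
    ]


def run_batch(pairs, metric="REL"):
    """Run the one-shot script once per (audio_path, text) pair, stopping on the first failure."""
    for audio_file_path, text in pairs:
        subprocess.run(build_command(audio_file_path, text, metric), check=True)
```

For example, `run_batch([("data/wav/tango/train/23.wav", "A dog barking and a car honking.")])` issues the same command as the one shown above. Passing the arguments as a list avoids shell-quoting problems with captions that contain spaces or punctuation. Each call starts a fresh process and so reloads the models, which makes this convenient for a handful of pairs but slow for large sets.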

🙌 Acknowledgment

We gratefully acknowledge the following GitHub repositories for providing valuable code and resources that contributed to this work:

AudioBERTScore (https://github.com/lourson1091/audiobertscore)
SAM-Audio (https://github.com/facebookresearch/sam-audio)

📄 Citation

@inproceedings{suzuki2026elsa,
  title = {ELSA: Acoustic Event-Level Semantic Alignment for Fine-Grained Reference-Free Text-to-Audio Evaluation},
  author = {Shuntaro Suzuki and Kento Tokura and Daichi Yashima and Kanon Amemiya and Komei Sugiura and Shinnosuke Takamichi},
  year = {2026},
  booktitle = {The 27th Interspeech Conference (INTERSPEECH 2026)},
}
