
ELSA Architecture Overview

Shuntaro Suzuki¹*, Kento Tokura¹*, Daichi Yashima¹*, Kanon Amemiya¹*, Komei Sugiura¹, Shinnosuke Takamichi¹

¹ Keio University
* Equal contribution

Interspeech 2026


💡 About

This repository provides the implementation of ELSA, as presented in our paper: "ELSA: Acoustic Event-Level Semantic Alignment for Fine-Grained Reference-Free Text-to-Audio Evaluation". It includes code, dataset preparation instructions, and scripts for evaluation.

⚙️ Architecture

ELSA Model Architecture

ELSA consists of two main components:

  1. Event-Aware Audio Representation Extractor: Retrieves event-relevant audio segments via a language-queried audio source separation model and encodes them in a shared text–audio embedding space.
  2. Hierarchical Semantic Alignment Module: Integrates global text–audio similarity and event-level matching to compute the final evaluation score.
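As a rough illustration of how a global similarity and event-level matching could be combined into one score, the sketch below works on precomputed embeddings. Everything in it is an assumption made for illustration: the function names, the per-event pairing, the mean over events, and the mixing weight `alpha` are hypothetical and are not taken from the paper or from this repository's code.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hierarchical_score(
    text_emb: Vector,
    audio_emb: Vector,
    event_text_embs: Sequence[Vector],
    event_audio_embs: Sequence[Vector],
    alpha: float = 0.5,  # hypothetical weight between global and event terms
) -> float:
    """Blend a global text-audio similarity with a mean event-level similarity."""
    # Global term: full caption embedding vs. full audio embedding.
    global_sim = cosine(text_emb, audio_emb)

    # Event term: each event phrase vs. the audio segment retrieved for it.
    event_sims = [cosine(t, a) for t, a in zip(event_text_embs, event_audio_embs)]
    event_sim = sum(event_sims) / len(event_sims) if event_sims else global_sim

    return alpha * global_sim + (1.0 - alpha) * event_sim
```

In this sketch, a caption whose overall embedding matches the audio but whose individual events are missing from the separated segments receives a lower score than one where every event is found, which is the intuition behind adding an event-level term.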

🚀 Getting Started

Prerequisites

  • uv
  • CUDA 11.8+ (for GPU acceleration)
  • 12GB+ VRAM (20GB+ recommended for some features)
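
A quick way to check these prerequisites before installing is sketched below. It is not part of this repository: the helper names are hypothetical, it assumes `nvidia-smi` is on the `PATH` for NVIDIA GPUs, and it only checks for `uv` and total GPU memory (it does not verify the CUDA version).

```python
import shutil
import subprocess


def parse_vram_mib(nvidia_smi_output: str) -> list[int]:
    """Parse per-GPU memory totals (MiB) from nvidia-smi's CSV output."""
    return [int(line.strip()) for line in nvidia_smi_output.splitlines() if line.strip()]


def check_prerequisites(min_vram_gib: float = 12.0) -> dict:
    """Report whether uv is installed and whether any GPU meets the VRAM minimum."""
    report = {"uv": shutil.which("uv") is not None, "gpu_ok": False}
    if shutil.which("nvidia-smi") is None:
        return report  # no NVIDIA driver tools found; leave gpu_ok as False

    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=False,
    )
    try:
        vram_mib = parse_vram_mib(result.stdout)
    except ValueError:
        return report  # unexpected output (for example "[N/A]"); treat as unknown

    report["gpu_ok"] = any(mib / 1024 >= min_vram_gib for mib in vram_mib)
    return report


if __name__ == "__main__":
    print(check_prerequisites())
```

Passing `min_vram_gib=20.0` checks against the higher recommendation instead of the 12 GB minimum.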

Installation

# Clone the repository
git clone git@github.com:kento2247/ELSA.git
cd ELSA

# Install dependencies into a project-local environment
uv sync

Download Pretrained Models

# Download pretrained models
sh scripts/download_model.sh

This will download:

  • SAM-Audio model, used for language-queried audio source separation
  • CLAP model, used to embed text and audio in a shared space

🎯 One-shot Evaluation

Run a single audio/text pair through the evaluation model:

python src/oneshot.py \
  --audio_file_path data/wav/tango/train/23.wav \
  --text "A dog barking and a car honking." \
  --metric REL

Arguments:

  • --audio_file_path: Path to the input audio file.
  • --text: Text description of the audio.
  • --metric: Evaluation metric, REL or OVL (default: REL).
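
To evaluate several pairs in a row, the documented command can be wrapped in a small script. The sketch below is a hypothetical helper, not part of this repository: it assumes it is run from the repository root inside the project's Python environment, and because the output format of `src/oneshot.py` is not described here, it returns the raw standard output of each run without parsing it.

```python
import subprocess
import sys

METRICS = ("REL", "OVL")  # the two metric names listed above


def build_oneshot_command(audio_file_path: str, text: str, metric: str = "REL") -> list[str]:
    """Build the argument list for one run of src/oneshot.py."""
    if metric not in METRICS:
        raise ValueError(f"metric must be one of {METRICS}, got {metric!r}")
    return [
        sys.executable, "src/oneshot.py",
        "--audio_file_path", audio_file_path,
        "--text", text,
        "--metric", metric,
    ]


def run_pairs(pairs: list[tuple[str, str]], metric: str = "REL") -> list[str]:
    """Run the script once per (audio path, text) pair and collect raw stdout."""
    outputs = []
    for audio_file_path, text in pairs:
        result = subprocess.run(
            build_oneshot_command(audio_file_path, text, metric),
            capture_output=True,
            text=True,
            check=True,  # raise if the script exits with an error
        )
        outputs.append(result.stdout)
    return outputs
```

Each call starts a new process, so the models are presumably reloaded for every pair; this is convenient for a handful of files but likely slow for a large set.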

🙌 Acknowledgment

We gratefully acknowledge the following GitHub repositories for providing valuable code and resources that contributed to this work:

📄 Citation

@inproceedings{suzuki2026elsa,
  title = {ELSA: Acoustic Event-Level Semantic Alignment for Fine-Grained Reference-Free Text-to-Audio Evaluation},
  author = {Shuntaro Suzuki and Kento Tokura and Daichi Yashima and Kanon Amemiya and Komei Sugiura and Shinnosuke Takamichi},
  year = {2026},
  booktitle = {The 27th Interspeech Conference (INTERSPEECH 2026)},
}
