TSE (Target Sound Extraction) and SED (Sound Event Detection) pipelines for binaural augmented hearing with on-the-fly spatial audio synthesis using Head-Related Transfer Functions (HRTF).
Paper: arXiv:2603.00395 | MobiSys 2026
```bash
git clone --recursive https://github.com/ooshyun/fine_grained_soundscape_control.git
cd fine_grained_soundscape_control
pip install -r requirements.txt
```

If you already cloned without `--recursive`, initialize the required submodule:

```bash
git submodule update --init third_party/SemanticHearing
```

Note: Only `third_party/SemanticHearing` is required for evaluation (Waveformer baseline). Other submodules are private and can be ignored.
Downloads the public BinauralCuratedDataset tar (~125GB) and builds noise_scaper_fmt/ for TAU noise backgrounds.
```bash
bash scripts/setup_dataset.sh --output_dir /path/to/output
```

If you already have the tar extracted, or TAU raw data at a separate path:

```bash
# Skip download (tar already extracted)
bash scripts/setup_dataset.sh --output_dir /path/to/output --skip_download

# TAU raw data at separate path
bash scripts/setup_dataset.sh --output_dir /path/to/output \
    --tau_raw_dir /path/to/TAU-2019
```

After setup:
```
/path/to/output/
└── BinauralCuratedDataset/
    ├── scaper_fmt/{train,val,test}/{class}/         # foreground audio symlinks
    ├── bg_scaper_fmt/{train,val,test}/{class}/      # background audio symlinks
    ├── noise_scaper_fmt/{train,val,test}/{scene}/   # TAU noise symlinks
    ├── hrtf/CIPIC/{*.sofa, *_hrtf.txt}              # HRTF data
    ├── FSD50K/, ESC-50/, musdb18/, disco_noises/    # raw audio datasets
    ├── TAU-acoustic-sounds/                         # TAU metadata + audio
    └── start_times.csv                              # silence trimming metadata
```
Note on `--data_dir`: All train/eval scripts expect `--data_dir` to point to the parent of `BinauralCuratedDataset/`, because the configs reference paths like `BinauralCuratedDataset/scaper_fmt/...`. This matches `setup_dataset.sh`'s `--output_dir`, so you can pass the same path to both.
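A quick sanity check that the path you pass sits one level above `BinauralCuratedDataset/` can look like this (the helper name is hypothetical, not part of the repo):

```python
from pathlib import Path

def check_data_dir(data_dir: str) -> bool:
    """Return True iff data_dir is the parent of BinauralCuratedDataset/
    and contains the subdirectories the train/eval configs reference."""
    root = Path(data_dir) / "BinauralCuratedDataset"
    return all((root / sub).is_dir() for sub in ("scaper_fmt", "bg_scaper_fmt", "hrtf"))
```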
The public tar does not include noise_scaper_fmt/ (TAU noise symlinks) or
per-dataset train/val/test CSV splits. Building these from scratch requires creating
~21,600 symlinks which can be very slow on network filesystems (e.g. Lustre).
To avoid this, we ship a prebuilt metadata archive (898KB) that contains:
- `noise_scaper_fmt/{train,val,test}/{scene}/`: relative symlinks to TAU audio
- `{FSD50K,ESC-50,musdb18,disco_noises,TAU-acoustic-sounds}/{train,val,test}.csv`: per-dataset split CSVs
- `hrtf/CIPIC/{train,val,test}_hrtf.txt`: HRTF split lists
`setup_dataset.sh` automatically extracts this archive if found. To apply it manually:

```bash
tar xzf data/prebuilt/metadata.tar.gz -C /path/to/BinauralCuratedDataset/
```

| Dataset | License | Source |
|---|---|---|
| FSD50K | Mixed CC (CC0/BY/BY-NC) | Zenodo |
| ESC-50 | CC-BY-NC 3.0 | GitHub |
| musdb18 | Academic/non-commercial | Zenodo |
| DISCO | CC-BY 4.0 | Zenodo |
| TAU-2019 | Tampere Univ. custom (NC) | Zenodo |
| CIPIC HRTF | Public Domain | UC Davis |
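For reference, the relative-symlink layout restored by the prebuilt metadata archive can be reproduced in a few lines of Python. This is an illustrative sketch (function name and path assumptions are hypothetical; `scripts/build_noise_scaper_fmt.py` is the actual implementation):

```python
import os
from pathlib import Path

def link_tau_clips(tau_audio_dir: Path, out_root: Path, split: str) -> int:
    """Create relative symlinks under noise_scaper_fmt/<split>/<scene>/.

    Assumes the TAU filename convention <scene>-<id>-....wav. Relative link
    targets survive moving the whole dataset root (useful on e.g. Lustre).
    """
    n = 0
    for wav in sorted(tau_audio_dir.glob("*.wav")):
        scene = wav.name.split("-")[0]
        dst_dir = out_root / split / scene
        dst_dir.mkdir(parents=True, exist_ok=True)
        dst = dst_dir / wav.name
        if not dst.exists():
            # os.path.relpath gives a target relative to the link's directory
            os.symlink(os.path.relpath(wav, dst_dir), dst)
            n += 1
    return n
```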
All scripts take `<data_dir>` as the first argument; this should be the parent of
`BinauralCuratedDataset/` (i.e. the same path as `--output_dir` from Step 2).
```bash
# Example: dataset at /path/to/data_dir/BinauralCuratedDataset/
# → data_dir = /path/to/data_dir

# TSE (default: Orange Pi config)
bash scripts/train/run_tse.sh /path/to/data_dir [orange_pi|raspberry_pi|neuralaid]

# SED (default: AST finetune config)
bash scripts/train/run_sed.sh /path/to/data_dir [ast_finetune]
```

Same `<data_dir>` convention as training.
```bash
# Table 1: TSE model comparison (Orange Pi, Raspberry Pi, NeuralAids)
bash scripts/eval/run_tse.sh /path/to/data_dir

# Table 2: Multi-output TSE (5-out, 20-out)
bash scripts/eval/run_multiout.sh /path/to/data_dir

# Table 3: FiLM ablation (first / all / all-except-first)
bash scripts/eval/run_ablation.sh /path/to/data_dir

# Table 4, Figure 4: SED (Fine-tuned AST)
bash scripts/eval/run_sed.sh /path/to/data_dir
```

Build and run all evaluations in an isolated Docker environment. Requires the NVIDIA Container Toolkit.
```bash
# Build image
bash docker/build.sh

# Run all paper evaluations (Tables 1-4)
bash docker/eval_all.sh /path/to/data_dir ./eval_results

# Run a single model
bash docker/eval_single.sh tse orange_pi /path/to/data_dir
bash docker/eval_single.sh sed finetuned_ast /path/to/data_dir ./results \
    --num_fg_min 1 --num_fg_max 1 --num_bg_min 1 --num_bg_max 1

# Train inside Docker
bash docker/train_single.sh tse configs/tse/orange_pi.yaml /path/to/data_dir
```

The evaluation pipeline reproduces the metrics from Table 3 of the paper (Orange Pi, FiLM=All blocks, 5 output channels, 1-5 targets in mixture):
| Metric | This Repo | Paper |
|---|---|---|
| SNRi (dB) | 12.31 ± 4.08 | 12.26 ± 4.38 |
| SI-SNRi (dB) | 10.18 ± 5.43 | 10.16 ± 5.72 |
Evaluated on 2000 on-the-fly synthesized test samples with 1-5 target sources, 1-3 interfering sources, and urban noise backgrounds.
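SNRi and SI-SNRi report improvement over the unprocessed mixture. A minimal NumPy sketch of SI-SNRi for a single channel (illustrative only; the repo's implementation lives in `src/metrics/tse.py`):

```python
import numpy as np

def si_snr(est: np.ndarray, ref: np.ndarray) -> float:
    """Scale-invariant SNR in dB between an estimate and a reference signal."""
    ref_zm = ref - ref.mean()
    est_zm = est - est.mean()
    # Project the estimate onto the reference (optimal scaling makes the
    # metric invariant to the estimate's gain)
    s_target = (est_zm @ ref_zm) / (ref_zm @ ref_zm) * ref_zm
    e_noise = est_zm - s_target
    return 10 * np.log10((s_target @ s_target) / (e_noise @ e_noise))

def si_snr_improvement(est: np.ndarray, ref: np.ndarray, mix: np.ndarray) -> float:
    """SI-SNRi: gain of the estimate over the raw mixture."""
    return si_snr(est, ref) - si_snr(mix, ref)
```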
| Task | HuggingFace Repository | Models |
|---|---|---|
| TSE | ooshyun/fine_grained_soundscape_control | 11 models |
| SED | ooshyun/sound_event_detection | Fine-tuned AST |
See docs/pretrained_models.md for full model details, download instructions, and STFT configuration.
D, H, and B are the model width hyperparameters (embedding dimension, hidden units, and number of blocks, in TFGridNet's notation).

| Name | Architecture | D | H | B | Outputs | FiLM |
|---|---|---|---|---|---|---|
| Orange Pi | TFGridNet | 32 | 64 | 6 | 5 | All |
| Raspberry Pi | TFGridNet | 16 | 64 | 3 | 5 | All |
| NeuralAids | TFMLPNet | 32 | 32 | 6 | 5 | All |
| Model | Source | Config |
|---|---|---|
| AST (pretrained) | MIT/ast-finetuned-audioset-10-10-0.4593 | -- |
| Fine-tuned AST | ooshyun/sound_event_detection | configs/sed/ast_finetune.yaml |
The training pipeline uses six public datasets synthesized into binaural mixtures on-the-fly:
| Dataset | Description |
|---|---|
| FSD50K | Freesound Dataset -- 50k clips of diverse sound events |
| ESC-50 | Environmental Sound Classification -- 2k clips, 50 classes |
| musdb18 | Music source separation dataset -- 150 tracks |
| DISCO | Diverse Indoor Sound Corpus -- environmental noise recordings |
| TAU-2019 | TAU Urban Acoustic Scenes 2019 -- urban noise backgrounds |
| CIPIC HRTF | Head-Related Transfer Function database -- 45 subjects, 1250 directions |
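Conceptually, on-the-fly binaural synthesis convolves each mono source with a left/right HRIR pair for its direction and sums the sources per ear. A toy sketch under that assumption (not the repo's `src/datasets/multi_ch_simulator.py`):

```python
import numpy as np

def spatialize(mono: np.ndarray, hrir_l: np.ndarray, hrir_r: np.ndarray) -> np.ndarray:
    """Convolve a mono source with left/right HRIRs -> (2, len(mono)) binaural pair."""
    left = np.convolve(mono, hrir_l)[: len(mono)]
    right = np.convolve(mono, hrir_r)[: len(mono)]
    return np.stack([left, right])

def mix_sources(sources, hrirs) -> np.ndarray:
    """Sum several spatialized sources into one binaural mixture."""
    return sum(spatialize(s, hl, hr) for s, (hl, hr) in zip(sources, hrirs))
```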
```
fine_grained_soundscape_control_for_augmented_hearing/
├── configs/
│   ├── tse/                         # TSE training configs
│   │   ├── orange_pi.yaml
│   │   ├── raspberry_pi.yaml
│   │   └── neuralaid.yaml
│   └── sed/                         # SED training configs
│       └── ast_finetune.yaml
├── data/
│   ├── setup_data.py                # CLI entrypoint for data pipeline
│   ├── class_map.yaml               # Sound class definitions
│   ├── ontology.json                # AudioSet ontology
│   ├── pipeline/                    # Modular data pipeline
│   │   ├── download.py              # Stage 1: download datasets
│   │   ├── collect.py               # Stage 2: collect + split CSVs
│   │   ├── prepare.py               # Stage 3: Scaper format + HRTF
│   │   ├── ontology.py              # AudioSet ontology wrapper
│   │   ├── silence.py               # Silence trimming utility
│   │   └── sources/                 # Per-dataset logic
│   │       ├── fsd50k.py, esc50.py, disco.py
│   │       ├── cipic.py, musdb18.py, tau.py
│   │       └── base.py              # BaseSource ABC
│   └── hf_upload/                   # HuggingFace dataset upload
│       ├── README.md                # Dataset Card
│       └── upload.py                # Upload script
├── src/
│   ├── datasets/
│   │   ├── MisophoniaDataset.py     # On-the-fly binaural synthesis
│   │   ├── soundscape_dataset.py    # Simplified dataset interface
│   │   ├── multi_ch_simulator.py    # HRTF spatialization (CIPIC, etc.)
│   │   ├── motion_simulator.py      # Sound source motion
│   │   ├── augmentations/           # Audio augmentations (speed, pitch, etc.)
│   │   └── gen/                     # Dataset generation utilities
│   ├── trainer/
│   │   ├── base.py                  # Base trainer interface
│   │   ├── lightning.py             # PyTorch Lightning backend
│   │   └── fabric.py                # Lightning Fabric backend
│   ├── metrics/
│   │   ├── tse.py                   # SI-SNRi, SNRi, per-channel metrics
│   │   └── sed.py                   # mAP, F1, AUC-ROC, d-prime
│   ├── tse/
│   │   ├── model.py                 # Pretrained model loading
│   │   ├── net.py                   # TFGridNet STFT wrapper
│   │   ├── multiflim_guided_tfnet.py  # FiLM-conditioned separator
│   │   ├── gridnet_block.py         # Time-frequency processing block
│   │   ├── loss.py                  # Multi-resolution STFT + L1 loss
│   │   ├── train.py                 # TSE training entry
│   │   └── eval.py                  # TSE evaluation entry
│   └── sed/
│       ├── model.py                 # Pretrained AST loading
│       ├── ast_hf.py                # HuggingFace AST wrapper
│       ├── loss.py                  # BCE + Focal loss
│       ├── train.py                 # SED training entry
│       └── eval.py                  # SED evaluation entry
├── scripts/
│   ├── setup_dataset.sh             # Full dataset setup (download + extract + noise)
│   ├── build_noise_scaper_fmt.py    # Build TAU noise symlinks
│   ├── train/
│   │   ├── run_tse.sh               # Train TSE model
│   │   └── run_sed.sh               # Train SED model
│   └── eval/
│       ├── run_tse.sh               # Table 1: TSE model comparison
│       ├── run_multiout.sh          # Table 2: Multi-output TSE
│       ├── run_ablation.sh          # Table 3: FiLM ablation
│       └── run_sed.sh               # Table 4, Fig 4: SED
├── docker/
│   ├── build.sh                     # Build Docker image
│   ├── eval_all.sh                  # Run all paper evals (Tables 1-4)
│   ├── eval_single.sh               # Run single TSE/SED eval
│   └── train_single.sh              # Run single training job
├── Dockerfile
├── requirements.txt
└── README.md
```
The training pipeline supports two backends, configurable via the YAML config:
```yaml
training:
  backend: "lightning"  # or "fabric"
```

- Lightning (`src/trainer/lightning.py`): Full PyTorch Lightning Trainer with built-in logging, checkpointing, and multi-GPU support. Recommended for standard training.
- Fabric (`src/trainer/fabric.py`): Lightweight Lightning Fabric backend with manual training loop control. Useful for custom training logic or debugging.

Both backends share the same base interface (`src/trainer/base.py`) and are interchangeable without modifying model or dataset code.
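To illustrate the idea (names here are hypothetical; the real contract lives in `src/trainer/base.py`), a backend-agnostic trainer interface can be as small as:

```python
from abc import ABC, abstractmethod

class BaseTrainer(ABC):
    """Backend-agnostic trainer contract: models/datasets only see this interface."""

    def __init__(self, config: dict):
        self.config = config

    @abstractmethod
    def fit(self, model, train_loader, val_loader=None) -> None:
        """Run the training loop for the configured number of epochs."""

def make_trainer(config: dict, registry: dict) -> BaseTrainer:
    """Select a backend class by the YAML `training.backend` key."""
    name = config.get("training", {}).get("backend", "lightning")
    return registry[name](config)
```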
```bibtex
@article{oh2026fine,
  title   = {Fine-Grained Soundscape Control for Augmented Hearing},
  author  = {Oh, Seunghyun and others},
  journal = {arXiv preprint arXiv:2603.00395},
  year    = {2026},
  url     = {https://arxiv.org/abs/2603.00395},
}
```

MIT
If you need the raw source datasets (e.g. for custom preprocessing), you can download them individually:
```bash
# FSD50K (~59GB, split zip)
# Requires manual merge of split archives
wget https://zenodo.org/records/4060432/files/FSD50K.dev_audio.zip
wget https://zenodo.org/records/4060432/files/FSD50K.eval_audio.zip
wget https://zenodo.org/records/4060432/files/FSD50K.metadata.zip

# ESC-50 (~600MB)
wget https://github.com/karolpiczak/ESC-50/archive/refs/heads/master.zip -O ESC-50.zip

# musdb18 (~5GB, academic use only)
wget https://zenodo.org/records/1117372/files/musdb18.zip

# DISCO (~3GB)
wget https://zenodo.org/api/records/4019030/files/disco_noises.zip/content -O disco_noises.zip

# TAU-2019 (~20GB, 10 parts + meta, non-commercial)
for i in $(seq 1 10); do
    wget "https://zenodo.org/records/2589280/files/TAU-urban-acoustic-scenes-2019-development.audio.${i}.zip"
done
wget https://zenodo.org/records/2589280/files/TAU-urban-acoustic-scenes-2019-development.meta.zip

# CIPIC HRTF (~183MB, SOFA files)
# Available from our HuggingFace dataset repo:
pip install huggingface_hub
python -c "from huggingface_hub import snapshot_download; snapshot_download('ooshyun/fine_grained_soundscape_control', repo_type='dataset', allow_patterns='cipic_hrtf/**', local_dir='.')"
```

Alternatively, use the automated pipeline downloader:

```bash
python data/setup_data.py --output_dir ./data --stage download
python data/setup_data.py --output_dir ./data --datasets fsd50k,esc50 --stage download
python data/setup_data.py --output_dir ./data --dry-run
```