End-to-end, fully automated dialect embedding + coarse-label classification pipeline:
Audio QC + preprocessing
→ WavLM-Large embeddings (chunked aggregation)
→ speaker-disjoint split (by uploader_id)
→ label coarsening (train-only label-centroid KMeans)
→ coarse-label training (stacked: Linear SVM + MLP → meta LogisticRegression)
→ evaluation + visualizations + report artifacts
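The label-coarsening step above (train-only label-centroid KMeans) can be sketched as follows. This is a minimal illustration, not the project's actual API; function and variable names are assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

def coarsen_labels(X_train, y_train, n_clusters=3, seed=0):
    """Map fine labels to coarse clusters via KMeans over per-label centroids.

    The mapping is computed on TRAIN embeddings only and then reused for
    val/test, so no information leaks across splits.
    """
    labels = sorted(set(y_train))
    y_arr = np.array(y_train)
    # One centroid embedding per fine label.
    centroids = np.stack([X_train[y_arr == lab].mean(axis=0) for lab in labels])
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(centroids)
    return {lab: int(c) for lab, c in zip(labels, km.labels_)}

# Toy usage with random embeddings and 6 fine labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 8))
y = [f"dialect_{i % 6}" for i in range(60)]
mapping = coarsen_labels(X, y, n_clusters=3)
```

Clustering label centroids (rather than individual clips) keeps the coarse mapping stable and cheap: KMeans runs on as many points as there are fine labels.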
make smoke

Artifacts are written to `artifacts/<run_name>/` (for the smoke config: `artifacts/smoke/`).
make ui CONFIG=configs/smoke.json

The UI includes a Realtime page: it captures fixed-length chunks from the microphone and progressively plots confidence line charts for all candidate clusters, so predictions can be watched in real time.
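The core of such a realtime loop can be sketched as below: classify each fixed-length chunk and accumulate a per-cluster probability history for plotting. The chunk source, `embed` function, and model are stand-ins here, not the project's actual objects:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def realtime_confidences(chunks, embed, model):
    """Run the coarse classifier on each chunk; return the confidence history.

    Returns an array of shape (n_chunks, n_clusters), one probability row
    per chunk -- exactly what a line chart over time needs.
    """
    history = [model.predict_proba(embed(wav).reshape(1, -1))[0] for wav in chunks]
    return np.array(history)

# Dummy stand-ins: a trivial embedding and a classifier fit on random data.
rng = np.random.default_rng(0)
Xd = rng.normal(size=(40, 4))
yd = np.arange(40) % 3  # 3 dummy clusters
clf = LogisticRegression(max_iter=200).fit(Xd, yd)
embed = lambda wav: np.array([wav.mean(), wav.std(), wav.min(), wav.max()])

chunks = [rng.normal(size=16000) for _ in range(5)]  # five 1 s chunks @ 16 kHz
H = realtime_confidences(chunks, embed, clf)
```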
- Python 3.10+
- `make` (GNU Make). If you don't have it, either install it (Linux: `sudo apt-get install make`) or run the CLI commands directly (see below / RUN_WINDOWS.md).
- `ffmpeg` is required for `.ogg` decoding + silence trimming. The Makefile bootstraps a local static `ffmpeg` into `.cache/ffmpeg/` if you don't have a system `ffmpeg`.
- Python deps: `make deps` (handled automatically by `make smoke` / `make ui`)
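For reference, the kind of `ffmpeg` invocation the preprocessing relies on looks like this: decode an `.ogg` clip to 16 kHz mono WAV and trim leading/trailing silence. The thresholds below are illustrative placeholders, not the pipeline's actual config values:

```shell
ffmpeg -i clip.ogg -ac 1 -ar 16000 \
  -af "silenceremove=start_periods=1:start_threshold=-40dB:stop_periods=1:stop_threshold=-40dB" \
  out.wav
```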
The Makefile is the recommended “one-command” runner on Linux/macOS/WSL:
make smoke
make ui CONFIG=configs/smoke.json
make clean CONFIG=configs/smoke.json

You can switch configs via `CONFIG=...`:
make smoke CONFIG=configs/smoke.json
make preprocess embed split coarsen train eval report CONFIG=configs/full.json

By default, configs/full.json uses a stacked coarse classifier (SVM + MLP → meta LR) to improve accuracy. If you only want to retrain the model and evaluate:
make train eval report CONFIG=configs/full.json

If your environment does not have make (common on Windows), follow RUN_WINDOWS.md and run the Python CLI commands instead.
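The stacked coarse classifier (Linear SVM + MLP → meta logistic regression) can be sketched with scikit-learn's `StackingClassifier`. All hyperparameters and the synthetic data below are illustrative, not the values from configs/full.json:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in for the embedding features / coarse labels.
X, y = make_classification(n_samples=300, n_features=32, n_informative=10,
                           n_classes=4, random_state=0)

stack = StackingClassifier(
    estimators=[
        # LinearSVC has no predict_proba; StackingClassifier falls back to
        # its decision_function for the meta features.
        ("svm", make_pipeline(StandardScaler(), LinearSVC(C=1.0))),
        ("mlp", MLPClassifier(hidden_layer_sizes=(64,), max_iter=500,
                              random_state=0)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=3,  # out-of-fold predictions feed the meta learner
)
stack.fit(X, y)
acc = stack.score(X, y)
```

The meta learner sees out-of-fold base predictions, which is what lets stacking outperform either base model alone without overfitting to their training-set outputs.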
After make smoke, look at:
- `artifacts/smoke/audio_qc.csv` (per-clip preprocessing/QC decisions)
- `artifacts/smoke/splits.csv` (speaker-disjoint train/val/test)
- `artifacts/smoke/label_to_cluster.json` + `artifacts/smoke/cluster_summary.md` (coarse mapping)
- `artifacts/smoke/models/coarse_model.joblib` (trained coarse classifier)
- `artifacts/smoke/report_coarse.json` + `artifacts/smoke/top_confusions.csv` (metrics + confusions)
- `artifacts/smoke/figures/` (PNG plots: UMAP/t-SNE, confusion matrix, QC plots, etc.)
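The speaker-disjoint split recorded in splits.csv can be reproduced in spirit with scikit-learn's `GroupShuffleSplit`, grouping by `uploader_id` so no speaker appears in more than one split. The data below is synthetic and the column names are illustrative:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
uploader_id = np.repeat(np.arange(10), 5)  # 10 speakers, 5 clips each
X = rng.normal(size=(50, 8))               # stand-in clip embeddings
y = rng.integers(0, 4, size=50)            # stand-in coarse labels

gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(gss.split(X, y, groups=uploader_id))

# Disjointness check: no uploader is on both sides of the split.
overlap = set(uploader_id[train_idx]) & set(uploader_id[test_idx])
```

Splitting by clip instead of by speaker would leak speaker identity across splits and inflate accuracy, which is why the pipeline keys the split on `uploader_id`.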
Each stage is runnable independently (and reuses cached artifacts when present):
.venv/bin/python -m dialectsense.cli preprocess --config configs/smoke.json
.venv/bin/python -m dialectsense.cli embed --config configs/smoke.json
.venv/bin/python -m dialectsense.cli split --config configs/smoke.json
.venv/bin/python -m dialectsense.cli coarsen --config configs/smoke.json
.venv/bin/python -m dialectsense.cli train --config configs/smoke.json
.venv/bin/python -m dialectsense.cli eval --config configs/smoke.json
.venv/bin/python -m dialectsense.cli report --config configs/smoke.json
.venv/bin/python -m dialectsense.cli ui --config configs/smoke.json
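The chunked aggregation used by the embed stage can be sketched as below: encode fixed-length windows of a long clip and mean-pool the per-window embeddings. The real pipeline uses WavLM-Large hidden states as the encoder; here `encode` is a dummy stand-in, and the tail shorter than one window is simply dropped for brevity:

```python
import numpy as np

def embed_long_clip(wav, sr, encode, chunk_s=10.0):
    """Mean-pool per-window embeddings of a long waveform.

    wav: 1-D float array; sr: sample rate; encode: maps a waveform chunk to
    an embedding vector (WavLM-Large in the real pipeline; a stub here).
    """
    chunk = int(chunk_s * sr)
    # For clips shorter than one window, encode the whole clip.
    starts = range(0, max(len(wav) - chunk, 0) + 1, chunk)
    embs = [encode(wav[s:s + chunk]) for s in starts]
    return np.mean(np.stack(embs), axis=0)

# Dummy encoder: 2-D "embedding" of a 25 s clip at 16 kHz (two full windows).
emb = embed_long_clip(np.ones(400_000), 16_000,
                      lambda w: np.array([w.mean(), w.std()]))
```

Pooling fixed windows keeps memory bounded for arbitrarily long clips while still producing one embedding per clip for the downstream classifier.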