Graduate student in Shanghai working on audio and sound foundation models.
I build small, hackable, pure-PyTorch tooling for the audio-LM stack — from self-supervised representation learning, through neural audio codecs and tokenizers, to connecting audio encoders to language models. The guiding principles across everything below are the same:
- Offline by default: the core paths depend on little more than `torch` and `numpy`; no `torchaudio`/`librosa`/`libsndfile` required, and every project runs end-to-end on CPU with built-in synthetic data, no weights to download.
- Reproducible & typed: clean APIs, `ruff` + `mypy`, tested across recent Python versions, CI on every push.
The three projects compose into one pipeline — learn representations, discretize audio into tokens, then bridge audio into a language model:
audio ──► resona ──► pretrained encoders
audio ──► acoustok ─► discrete acoustic tokens
audio ──► auralink ─► captions + sound-event tags (via an LLM)
| Project | What it is |
|---|---|
| resona | Self-supervised audio representation learning. A pure-PyTorch log-mel frontend (built on `torch.stft` plus a hand-rolled Slaney/HTK filterbank), a spectrogram-transformer encoder, and three pretraining objectives over one backbone: masked-spectrogram (MAE), contrastive (SimCLR/NT-Xent), and BYOL. |
| acoustok | Neural audio codec and tokenizer for audio LMs. A SEANet-style convolutional encoder/decoder with residual vector quantization that turns waveforms into compact discrete acoustic tokens — adjustable bitrate from a single model, LM-ready flatten/unflatten helpers, and stable EMA codebooks with dead-code expiry. |
| auralink | Bridges audio encoders to language models for audio captioning and sound-event understanding. A connector turns an encoder's frame sequence into prefix tokens in the LM's embedding space; a tiny built-in LM runs the whole encode→bridge→generate→train loop on CPU, and you can swap in a Hugging Face LLM to scale up. |
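
To make the acoustok row concrete, here is a minimal sketch of residual vector quantization, assuming random, untrained codebooks. The function names (`rvq_encode`, `rvq_decode`, `flatten_codes`, `unflatten_codes`) are hypothetical and are not acoustok's actual API. The sketch only illustrates two ideas from the table: keeping the first few quantizer stages lowers the bitrate without a second model, and a `(stages, frames)` code grid can be flattened into a single token stream for a language model.

```python
import torch


def rvq_encode(x, codebooks, n_stages=None):
    """Quantize frames x (T, D) with the first n_stages of codebooks (Q, K, D).

    Returns integer codes of shape (n_stages, T). Hypothetical helper.
    """
    residual = x
    codes = []
    for cb in codebooks[:n_stages]:  # n_stages=None keeps every stage
        # Squared distance from each residual frame to each codeword: (T, K)
        dists = (residual[:, None, :] - cb[None, :, :]).pow(2).sum(-1)
        idx = dists.argmin(dim=1)
        codes.append(idx)
        residual = residual - cb[idx]  # next stage quantizes what is left
    return torch.stack(codes)


def rvq_decode(codes, codebooks):
    """Sum the selected codewords of each stage to approximate the frames."""
    return sum(cb[idx] for cb, idx in zip(codebooks, codes))


def flatten_codes(codes, codebook_size):
    """(Q, T) codes -> (T * Q,) tokens, frame by frame, one id range per stage."""
    offsets = torch.arange(codes.shape[0]).unsqueeze(1) * codebook_size
    return (codes + offsets).t().reshape(-1)


def unflatten_codes(tokens, n_stages, codebook_size):
    """Inverse of flatten_codes: (T * Q,) tokens -> (Q, T) codes."""
    offsets = torch.arange(n_stages).unsqueeze(1) * codebook_size
    return tokens.reshape(-1, n_stages).t() - offsets


torch.manual_seed(0)
codebooks = torch.randn(4, 16, 8)  # 4 stages, 16 codewords each, 8-dim frames
frames = torch.randn(10, 8)        # stand-in for encoder output, not real audio

full = rvq_encode(frames, codebooks)             # all 4 stages
low = rvq_encode(frames, codebooks, n_stages=2)  # half the tokens per frame
tokens = flatten_codes(full, codebook_size=16)   # 40 tokens for an LM
```

Because each stage only sees the residual left by the stages before it, the low-bitrate codes are a prefix of the full codes, so truncating the stage list at inference time is enough to change the bitrate. With untrained codebooks like these, reconstruction quality is meaningless; the sketch shows the data flow only.
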
Self-supervised audio representation learning · neural audio codecs & tokenization · audio–language models · reproducible, dependency-light research tooling.
I care about implementations that are easy to read, easy to extend, and that run without a GPU or a download before you can see them work.