Qu Wang jamesa94

Qu Wang

Graduate student in Shanghai working on audio and sound foundation models.

I build small, hackable, pure-PyTorch tooling for the audio-LM stack — from self-supervised representation learning, through neural audio codecs and tokenizers, to connecting audio encoders to language models. The guiding principles across everything below are the same:

Offline by default — the core paths depend on little more than torch and numpy; no torchaudio/librosa/libsndfile required, and every project runs end-to-end on CPU with built-in synthetic data, no weights to download.
Reproducible & typed — clean APIs, ruff + mypy, tested across recent Python versions, CI on every push.

The stack

The three projects compose into one pipeline — learn representations, discretize audio into tokens, then bridge audio into a language model:

audio ──► resona ──►  pretrained encoders
audio ──► acoustok ─► discrete acoustic tokens
audio ──► auralink ─► captions + sound-event tags (via an LLM)

Project	What it is
resona	Self-supervised audio representation learning. A pure-PyTorch log-mel frontend (built on `torch.stft` + a hand-rolled Slaney/HTK filterbank), a spectrogram-transformer encoder, and three pretraining objectives over one backbone: masked-spectrogram (MAE), contrastive (SimCLR/NT-Xent), and BYOL.
acoustok	Neural audio codec and tokenizer for audio LMs. A SEANet-style convolutional encoder/decoder with residual vector quantization that turns waveforms into compact discrete acoustic tokens — adjustable bitrate from a single model, LM-ready flatten/unflatten helpers, and stable EMA codebooks with dead-code expiry.
auralink	Bridges audio encoders to language models for audio captioning and sound-event understanding. A connector turns an encoder's frame sequence into prefix tokens in the LM's embedding space; a tiny built-in LM runs the whole encode→bridge→generate→train loop on CPU, and you can swap in a Hugging Face LLM to scale up.
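
To make the acoustok row concrete, here is a minimal NumPy sketch of residual vector quantization together with flatten/unflatten helpers of the kind described there. It is an illustration only, not acoustok's code or API: the function names (`rvq_encode`, `rvq_decode`, `flatten`, `unflatten`), the frame-major token layout, and the per-level vocabulary offset are all assumptions, and the codebooks are random rather than learned. Each level quantizes whatever the previous levels left over, so keeping only the first `n_q` levels of one model lowers the bitrate to `n_q * log2(K)` bits per frame.

```python
import numpy as np


def rvq_encode(x, codebooks, n_q=None):
    """Greedy residual VQ. x: (T, D) frames; codebooks: list of (K, D) arrays.

    Returns integer codes of shape (n_q, T). Using fewer levels (smaller n_q)
    gives a lower bitrate from the same set of codebooks.
    """
    n_q = len(codebooks) if n_q is None else n_q
    residual = x.copy()
    codes = []
    for cb in codebooks[:n_q]:
        # Squared Euclidean distance from every frame to every codeword: (T, K).
        dist = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(axis=-1)
        idx = dist.argmin(axis=1)
        codes.append(idx)
        # The next level only sees what this level failed to explain.
        residual = residual - cb[idx]
    return np.stack(codes)


def rvq_decode(codes, codebooks):
    """Sum the selected codeword from each level. Returns (T, D)."""
    return sum(cb[idx] for cb, idx in zip(codebooks, codes))


def flatten(codes, K):
    """(n_q, T) codes -> 1-D token stream of length n_q * T.

    Assumed layout: frame-major (all levels of frame 0, then frame 1, ...),
    with level q shifted by q * K so each level has its own vocabulary range.
    """
    n_q = codes.shape[0]
    offsets = np.arange(n_q)[:, None] * K
    return (codes + offsets).T.reshape(-1)


def unflatten(tokens, n_q, K):
    """Inverse of flatten: 1-D token stream -> (n_q, T) codes."""
    codes = tokens.reshape(-1, n_q).T
    return codes - np.arange(n_q)[:, None] * K


# Tiny CPU-only demo with synthetic data (random, untrained codebooks).
rng = np.random.default_rng(0)
K, D, T, LEVELS = 16, 4, 5, 3
demo_codebooks = [rng.normal(size=(K, D)) for _ in range(LEVELS)]
demo_frames = rng.normal(size=(T, D))

demo_codes = rvq_encode(demo_frames, demo_codebooks)       # (3, 5)
demo_tokens = flatten(demo_codes, K)                       # (15,)
demo_recon = rvq_decode(unflatten(demo_tokens, LEVELS, K), demo_codebooks)
```

Because the encoding is greedy, the codes from `rvq_encode(x, codebooks, n_q=2)` match the first two rows of the full encoding, which is what lets one model serve several bitrates. With random codebooks the reconstruction is not meaningful; a trained codec would learn the codebooks (for example with EMA updates) so that each extra level refines the previous one.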

Research interests

Self-supervised audio representation learning · neural audio codecs & tokenization · audio–language models · reproducible, dependency-light research tooling.

I care about implementations that are easy to read, easy to extend, and that run without a GPU or a download before you can see them work.
