  • Shanghai Jiao Tong University
  • Shanghai, China
  • Joined Jun 8, 2026

jamesa94/README.md

Qu Wang

Graduate student in Shanghai working on audio and sound foundation models.

I build small, hackable, pure-PyTorch tooling for the audio-LM stack — from self-supervised representation learning, through neural audio codecs and tokenizers, to connecting audio encoders to language models. The guiding principles across everything below are the same:

  • Offline by default — the core paths depend on little more than torch and numpy; no torchaudio/librosa/libsndfile required, and every project runs end-to-end on CPU with built-in synthetic data, no weights to download.
  • Reproducible & typed — clean APIs, ruff + mypy, tested across recent Python versions, CI on every push.

PyTorch · Python · Self-Supervised Learning · Neural Audio Codec · Audio-Language Models · License: MIT


The stack

The three projects compose into one pipeline — learn representations, discretize audio into tokens, then bridge audio into a language model:

audio ──► resona ──►  pretrained encoders
audio ──► acoustok ─► discrete acoustic tokens
audio ──► auralink ─► captions + sound-event tags (via an LLM)

| Project | What it is |
| --- | --- |
| resona | Self-supervised audio representation learning. A pure-PyTorch log-mel frontend (built on torch.stft + a hand-rolled Slaney/HTK filterbank), a spectrogram-transformer encoder, and three pretraining objectives over one backbone: masked-spectrogram (MAE), contrastive (SimCLR/NT-Xent), and BYOL. |
| acoustok | Neural audio codec and tokenizer for audio LMs. A SEANet-style convolutional encoder/decoder with residual vector quantization that turns waveforms into compact discrete acoustic tokens, with adjustable bitrate from a single model, LM-ready flatten/unflatten helpers, and stable EMA codebooks with dead-code expiry. |
| auralink | Bridges audio encoders to language models for audio captioning and sound-event understanding. A connector turns an encoder's frame sequence into prefix tokens in the LM's embedding space; a tiny built-in LM runs the whole encode→bridge→generate→train loop on CPU, and you can swap in a Hugging Face LLM to scale up. |

Research interests

Self-supervised audio representation learning · neural audio codecs & tokenization · audio–language models · reproducible, dependency-light research tooling.

I care about implementations that are easy to read, easy to extend, and that run without a GPU or a download before you can see them work.

Pinned

  1. acoustok (Public)

    Neural audio codec and tokenizer for audio language models — SEANet encoder with residual vector quantization in PyTorch

    Python 1

  2. auralink (Public)

    Bridge audio encoders to LLMs for audio captioning and sound-event understanding — pure-PyTorch, offline-by-default

    Python 1

  3. resona (Public)

    Self-supervised audio representation learning toolkit in PyTorch — masked-spectrogram (MAE), contrastive (NT-Xent) and BYOL pretraining on a pure-torch log-mel frontend.

    Python 1