ai-audio-tools

Community list of open-source AI tools, models, and datasets for audio, music, and speech applications

To contribute to the list

Edit the README and make a PR

Audio

Benchmark

PaperWithCode: SoTA audio benchmarks on Papers with Code
PaperWithCode: SoTA music benchmarks on Papers with Code
SAILResearch: A ranking of audio models from Sail Research
HEAR: Holistic Evaluation of Audio Representations
MSEB: The Massive Sound Embedding Benchmark (MSEB) from Google Research

Dataset

HuggingFace: datasets with tag "audio" on Hugging Face
PaperWithCode: datasets with tag "audio" on Papers with Code
Kaggle: datasets with tag "audio" on Kaggle

Annotation

audino: Open source audio annotation tool for humans
audiomentations: A Python library for audio data augmentation.

Model

HuggingFace: models with tag "audio" on Hugging Face
Kaggle: models with tag "audio" on Kaggle

Security

Wavmark: AI-based Audio Watermarking Tool
audio-steganography-algorithms: A Library of Audio Steganography & Watermarking Algorithms

Music

Benchmark

PaperWithCode: Text-to-Music Generation on MusicCaps
MusGO_framework: MusGO Framework, Assessing Openness in Music-Generative AI

Analysis

Essentia: open-source C++ library for audio analysis and audio-based music information retrieval
Librosa: Python library for audio and music analysis
DDSP: DDSP is a library of differentiable versions of common DSP functions (such as synthesizers, waveshapers, and filters). This allows these interpretable elements to be used as part of a deep learning model, especially as the output layers for audio generation
MIDI-DDSP: MIDI-DDSP is a hierarchical audio generation model for synthesizing MIDI expanded from DDSP
TorchAudio: Data manipulation and transformation for audio signal processing, powered by PyTorch
nnAudio: Audio processing using PyTorch 1D convolutional networks
pyAudioAnalysis: Python Audio Analysis Library: Feature Extraction, Classification, Segmentation and Applications
mutagen: a Python module to handle audio metadata
dejavu: Audio fingerprinting and recognition in Python
audiomentations: A Python library for audio data augmentation. Inspired by albumentations. Useful for machine learning
soundata: Python library for downloading, loading, and working with sound datasets
EfficientAT: This repository aims at providing efficient CNNs for Audio Tagging. We provide AudioSet pre-trained models ready for downstream training and extraction of audio embeddings
AugLy: A data augmentations library for audio, image, text, and video
Pedalboard: A Python library for working with audio
TinyTag: a Python library for reading audio file metadata
OpenSmile: The Munich Open-Source Large-Scale Multimedia Feature Extractor
Madmom: Python audio and music signal processing library
Beets: a music library manager and MusicBrainz tagger
Mirdata: Python library for working with Music Information Retrieval datasets
Partitura: A Python package for handling modern staff notation of music
msaf: a Python package for the analysis of music structural segmentation algorithms
basic-pitch: A lightweight yet powerful audio-to-MIDI converter with pitch bend detection
jams: A JSON Annotated Music Specification for Reproducible MIR Research
Audio-Flamingo-2: Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities
machinehearing: Machine Learning applied to sound
machinelistening: The slides & materials for the lecture Computational Analysis of Sound and Music at TU Ilmenau
Audio-Flamingo-3: Audio Flamingo 3 (AF3) is a fully open, state-of-the-art Large Audio-Language Model (LALM) that advances reasoning and understanding across speech, sounds, and music
Sci-Phi: A Large Language Model Spatial Audio Descriptor

Production

OpenVINO: OpenVINO AI effects for Audacity (Windows, Linux)
TuneFlow: TuneFlow is a next-gen DAW that aims to boost music making productivity through the power of AI
Spleeter: Deezer source separation library including pretrained models
DeepAFx: Third-party audio effects plugins as differentiable layers within deep neural networks
matchering: open source audio matching and mastering
AudioDec: An Open-source Streaming High-fidelity Neural Audio Codec
USS: This is the PyTorch implementation of the Universal Source Separation with Weakly labelled Data
FAST-RIR: This is the official implementation of our neural-network-based fast diffuse room impulse response generator (FAST-RIR) for generating room impulse responses (RIRs) for a given rectangular acoustic environment
FoleyCrafter: FoleyCrafter is a video-to-audio generation framework which can produce realistic sound effects semantically relevant and synchronized with videos.
Woosh: Public release of the Sound Effect Foundation model by Sony AI.

Generation

StableAudio: Generative models for conditional audio generation
AudioCraft: a PyTorch library for deep learning research on audio generation. AudioCraft contains inference and training code for two state-of-the-art AI generative models producing high-quality audio: AudioGen and MusicGen.
Jukebox: A generative model for music
Magenta: symbolic music generation with diffusion models
TorchSynth: A GPU-optional modular synthesizer in PyTorch, 16200x faster than realtime, for audio ML researchers
audiobox: Audiobox is Meta’s new foundation research model for audio generation. It can generate voices and sound effects using a combination of voice inputs and natural language text prompts
Amphion: Amphion is a toolkit for Audio, Music, and Speech Generation
AudioGPT: AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head
WaveGAN: WaveGAN: Learn to synthesize raw audio with generative adversarial networks
RAVE: Official implementation of the RAVE model: a Realtime Audio Variational autoEncoder
AudioLDM: This toolbox aims to unify audio generation model evaluation for easier comparison
Make-An-Audio: a conditional diffusion probabilistic model capable of generating high fidelity audio efficiently from X modality
Diffuser: Diffusers is the go-to library for state-of-the-art pretrained diffusion models for generating images, audio, and even 3D structures of molecules
stable-audio-tools: Generative models for conditional audio generation
MidiTok: MIDI / symbolic music tokenizers for Deep Learning models
muspy: an open source Python library for symbolic music generation
MusicLM: a model generating high-fidelity music from text descriptions
riffusion: Stable diffusion for real-time music generation
muzic: Music Understanding and Generation with Artificial Intelligence
midi-lm: Generative modeling of MIDI files
UniAudio: The Open Source Code of UniAudio
MuseGAN: An AI for Music Generation
YuE: Open Full-song Music Generation Foundation Model, something similar to Suno.ai but open
Bark: Bark is Suno's open-source text-to-speech+ model. Text-Prompted Generative Audio Model
MG²: Awesome music generation model
MusicGPT: Generate music based on natural language prompts using LLMs running locally
InspireMusic: InspireMusic: A Unified Framework for Music, Song, Audio Generation.
riffusion-hobby: Stable diffusion for real-time music generation
MusicVAE: A hierarchical recurrent variational autoencoder for music
MagentaRT: a Python library for live music audio generation on your local device
OmniAudio: a model for generating spatial audio from 360-degree videos.
MidiLLM: an LLM for generating multitrack MIDI music from free-form text prompts
Vision2Audio: A curated list of Vision (video/image) to Audio Generation
AudioX: A Unified Framework for Anything-to-Audio Generation
MMAudio: Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis
ACE-Step-1.5: a highly efficient open-source music foundation model that brings commercial-grade generation to consumer hardware.

Speech

Benchmark

ArtificialAnalysis: Speech to Text AI Model & Provider Leaderboard on Artificial Analysis
ArtificialAnalysis: Text to Speech AI Model & Provider Leaderboard on Artificial Analysis

Recognition

Whisper: a multitasking model that can perform multilingual speech recognition, speech translation, and language identification
Deep Speech: Mozilla's open-source speech-to-text engine
Kaldi ASR: open-source speech recognition toolkit written in C++
PaddleSpeech: Easy-to-use Speech Toolkit including Self-Supervised Learning model, SOTA/Streaming ASR with punctuation, Streaming TTS with text frontend, Speaker Verification System, End-to-End Speech Translation and Keyword Spotting
NeMo: a framework for generative AI
julius: Open-Source Large Vocabulary Continuous Speech Recognition Engine
speechbrain: an open-source and all-in-one conversational AI toolkit based on PyTorch
pocketsphinx: A small speech recognizer
FunASR: A Fundamental End-to-End Speech Recognition Toolkit and Open Source SOTA Pretrained Models
NeuralSpeech: a research project at Microsoft Research Asia, which focuses on neural network based speech processing, including automatic speech recognition (ASR), text-to-speech synthesis (TTS), spatial audio synthesis, video dubbing, etc
espnet: End-to-End Speech Processing Toolkit
RealTimeSTT: A robust, efficient, low-latency speech-to-text library with advanced voice activity detection, wake word activation and instant transcription
AudenAI: A comprehensive toolbox for audio & multimodal understanding tasks including ASR, CLAP, audio captioning, speaker identification, speech-llm and more.

Production

Descript audio codec: State-of-the-art audio codec with 90x compression factor. Supports 44.1kHz, 24kHz, and 16kHz mono/stereo audio
Descript audio tools: Object-oriented handling of audio data, with GPU-powered augmentations, and more
Meta encodec: State-of-the-art deep learning based audio codec supporting both mono 24 kHz audio and stereo 48 kHz audio
pipecat: Open Source framework for voice and multimodal conversational AI

Synthesis

Coqui TTS: a deep learning toolkit for Text-to-Speech, battle-tested in research and production
DiffSinger: singing voice synthesis via shallow diffusion mechanism
Real-Time-Voice-Cloning: Clone a voice in 5 seconds to generate arbitrary speech in real-time
wavenet: A TensorFlow implementation of DeepMind's WaveNet paper
FastSpeech2: An implementation of Microsoft's "FastSpeech 2: Fast and High-Quality End-to-End Text to Speech"
MelGAN: Unofficial PyTorch implementation of MelGAN vocoder
hifi-gan: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis
elevenlabs-pythons: The official Python API for ElevenLabs Text to Speech.
tortoise-tts: A multi-voice TTS system trained with an emphasis on quality
lyrebird: Simple and powerful voice changer for Linux, written with Python & GTK
elevenlabs: The official Python API for ElevenLabs Text to Speech
piper: A fast, local neural text to speech system
tts-generation-webui: TTS Generation Web UI (Bark, MusicGen + AudioGen, Tortoise, RVC, Vocos, Demucs, SeamlessM4T, MAGNet)
GPT-SoVITS: 1 min voice data can also be used to train a good TTS model! (few shot voice cloning)
metavoice-src: Foundational model for human-like, expressive TTS
Retrieval-based-Voice-Conversion-WebUI: Voice data <= 10 mins can also be used to train a good VC model!
midi2voice: Singing synthesis from MIDI file
OpenVoice: Instant voice cloning by MyShell
ChatTTS: A generative speech model for daily dialogue
csm: A Conversational Speech Generation Model
chatterbox: Resemble AI's first production-grade open source TTS model
Fish-Audio: Fish Audio S2 Pro is the most advanced multimodal model developed by Fish Audio
KittenTTS: State-of-the-art TTS model under 25MB
Covo-Audio: Covo-Audio is a 7B-parameter end-to-end large audio language model that directly processes continuous audio inputs and generates audio outputs within a single unified architecture.
TTS-arxiv-daily: Automatically updated list of text-to-speech (TTS) papers, refreshed every 12 hours using GitHub Actions
kokoro: an open-weight TTS model with 82 million parameters
