
yyf/ai-audio-tools

ai-audio-tools

Community list of open-source AI tools, models, and datasets for audio, music, and speech applications

To contribute to the list

Edit the README and make a PR

Audio

Benchmark

  • PaperWithCode: SoTA audio benchmarks on Papers with Code
  • PaperWithCode: SoTA music benchmarks on Papers with Code
  • SAILResearch: A ranking of audio models from Sail Research
  • HEAR: Holistic Evaluation of Audio Representations
  • MSEB: The Massive Sound Embedding Benchmark (MSEB) from Google Research

Dataset

  • HuggingFace: datasets with tag "audio" on Hugging Face
  • PaperWithCode: datasets with tag "audio" on Papers with Code
  • Kaggle: datasets with tag "audio" on Kaggle

Annotation

  • audino: Open source audio annotation tool for humans
  • audiomentations: A Python library for audio data augmentation.

Model

  • HuggingFace: models with tag "audio" on Hugging Face
  • Kaggle: models with tag "audio" on Kaggle

Security

Music

Benchmark

Analysis

  • Essentia: open-source C++ library for audio analysis and audio-based music information retrieval
  • Librosa: Python library for audio and music analysis
  • DDSP: DDSP is a library of differentiable versions of common DSP functions (such as synthesizers, waveshapers, and filters). This allows these interpretable elements to be used as part of a deep learning model, especially as the output layers for audio generation
  • MIDI-DDSP: MIDI-DDSP is a hierarchical audio generation model for synthesizing MIDI expanded from DDSP
  • TorchAudio: Data manipulation and transformation for audio signal processing, powered by PyTorch
  • nnAudio: Audio processing using PyTorch 1D convolution networks
  • pyAudioAnalysis: Python Audio Analysis Library: Feature Extraction, Classification, Segmentation and Applications
  • mutagen: a Python module to handle audio metadata
  • dejavu: Audio fingerprinting and recognition in Python
  • audiomentations: A Python library for audio data augmentation. Inspired by albumentations. Useful for machine learning
  • soundata: Python library for downloading, loading, and working with sound datasets
  • EfficientAT: This repository aims at providing efficient CNNs for Audio Tagging. We provide AudioSet pre-trained models ready for downstream training and extraction of audio embeddings
  • AugLy: A data augmentations library for audio, image, text, and video
  • Pedalboard: A Python library for working with audio
  • TinyTag: a Python library for reading audio file metadata
  • OpenSmile: The Munich Open-Source Large-Scale Multimedia Feature Extractor
  • Madmom: Python audio and music signal processing library
  • Beets: a music library manager and MusicBrainz tagger
  • Mirdata: Python library for working with Music Information Retrieval datasets
  • Partitura: A Python package for handling modern staff notation of music
  • msaf: a Python package for the analysis of music structural segmentation algorithms
  • basic-pitch: A lightweight yet powerful audio-to-MIDI converter with pitch bend detection
  • jams: A JSON Annotated Music Specification for Reproducible MIR Research
  • Audio-Flamingo-2: Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities
  • machinehearing: Machine Learning applied to sound
  • machinelistening: The slides & materials for the lecture Computational Analysis of Sound and Music at TU Ilmenau
  • Audio-Flamingo-3: Audio Flamingo 3 (AF3) is a fully open, state-of-the-art Large Audio-Language Model (LALM) that advances reasoning and understanding across speech, sounds, and music
  • Sci-Phi: A Large Language Model Spatial Audio Descriptor

Production

  • OpenVINO: OpenVINO AI effects for Audacity (Windows, Linux)
  • TuneFlow: TuneFlow is a next-gen DAW that aims to boost music making productivity through the power of AI
  • Spleeter: Deezer source separation library including pretrained models
  • DeepAFx: Third-party audio effects plugins as differentiable layers within deep neural networks
  • matchering: open source audio matching and mastering
  • AudioDec: An Open-source Streaming High-fidelity Neural Audio Codec
  • USS: This is the PyTorch implementation of the Universal Source Separation with Weakly labelled Data
  • FAST-RIR: This is the official implementation of our neural-network-based fast diffuse room impulse response generator (FAST-RIR) for generating room impulse responses (RIRs) for a given rectangular acoustic environment
  • FoleyCrafter: FoleyCrafter is a video-to-audio generation framework which can produce realistic sound effects semantically relevant and synchronized with videos.
  • Woosh: Public release of the Sound Effect Foundation model by Sony AI.

Generation

  • StableAudio: Generative models for conditional audio generation
  • AudioCraft: a PyTorch library for deep learning research on audio generation. AudioCraft contains inference and training code for two state-of-the-art AI generative models producing high-quality audio: AudioGen and MusicGen.
  • Jukebox: A generative model for music
  • Magenta: symbolic music generation with diffusion models
  • TorchSynth: A GPU-optional modular synthesizer in pytorch, 16200x faster than realtime, for audio ML researchers
  • audiobox: Audiobox is Meta’s new foundation research model for audio generation. It can generate voices and sound effects using a combination of voice inputs and natural language text prompts
  • Amphion: Amphion is a toolkit for Audio, Music, and Speech Generation
  • AudioGPT: AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head
  • WaveGAN: WaveGAN: Learn to synthesize raw audio with generative adversarial networks
  • RAVE: Official implementation of the RAVE model: a Realtime Audio Variational autoEncoder
  • AudioLDM: This toolbox aims to unify audio generation model evaluation for easier comparison
  • Make-An-Audio: a conditional diffusion probabilistic model capable of generating high fidelity audio efficiently from X modality
  • Diffuser: Diffusers is the go-to library for state-of-the-art pretrained diffusion models for generating images, audio, and even 3D structures of molecules
  • stable-audio-tools: Generative models for conditional audio generation
  • MidiTok: MIDI / symbolic music tokenizers for Deep Learning models
  • muspy: an open source Python library for symbolic music generation
  • MusicLM: a model generating high-fidelity music from text descriptions
  • riffusion: Stable diffusion for real-time music generation
  • muzic: Music Understanding and Generation with Artificial Intelligence
  • midi-lm: Generative modeling of MIDI files
  • UniAudio: The Open Source Code of UniAudio
  • MuseGAN: An AI for Music Generation
  • YuE: Open Full-song Music Generation Foundation Model, something similar to Suno.ai but open
  • Bark: Bark is Suno's open-source text-to-speech+ model. Text-Prompted Generative Audio Model
  • MG²: Awesome music generation model
  • MusicGPT: Generate music based on natural language prompts using LLMs running locally
  • InspireMusic: InspireMusic: A Unified Framework for Music, Song, Audio Generation.
  • riffusion-hobby: Stable diffusion for real-time music generation
  • MusicVAE: A hierarchical recurrent variational autoencoder for music
  • MagentaRT: a Python library for live music audio generation on your local device
  • OmniAudio: a model for generating spatial audio from 360-degree videos.
  • MidiLLM: an LLM for generating multitrack MIDI music from free-form text prompts
  • Vision2Audio: A curated list of Vision (video/image) to Audio Generation
  • AudioX: A Unified Framework for Anything-to-Audio Generation
  • MMAudio: Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis
  • ACE-Step-1.5: a highly efficient open-source music foundation model that brings commercial-grade generation to consumer hardware.

Speech

Benchmark

  • ArtificialAnalysis: Speech to Text AI Model & Provider Leaderboard on Artificial Analysis
  • ArtificialAnalysis: Text to Speech AI Model & Provider Leaderboard on Artificial Analysis

Recognition

  • Whisper: a multitasking model that can perform multilingual speech recognition, speech translation, and language identification
  • Deep Speech: Mozilla's open-source speech-to-text engine
  • Kaldi ASR: open-source speech recognition toolkit written in C++
  • PaddleSpeech: Easy-to-use Speech Toolkit including Self-Supervised Learning model, SOTA/Streaming ASR with punctuation, Streaming TTS with text frontend, Speaker Verification System, End-to-End Speech Translation and Keyword Spotting
  • NeMo: a framework for generative AI
  • julius: Open-Source Large Vocabulary Continuous Speech Recognition Engine
  • speechbrain: an open-source and all-in-one conversational AI toolkit based on PyTorch
  • pocketsphinx: A small speech recognizer
  • FunASR: A Fundamental End-to-End Speech Recognition Toolkit and Open Source SOTA Pretrained Models
  • NeuralSpeech: a research project at Microsoft Research Asia, which focuses on neural network based speech processing, including automatic speech recognition (ASR), text-to-speech synthesis (TTS), spatial audio synthesis, video dubbing, etc
  • espnet: End-to-End Speech Processing Toolkit
  • RealTimeSTT: A robust, efficient, low-latency speech-to-text library with advanced voice activity detection, wake word activation and instant transcription
  • AudenAI: A comprehensive toolbox for audio & multimodal understanding tasks including ASR, CLAP, audio captioning, speaker identification, speech-llm and more.

Production

  • Descript audio codec: State-of-the-art audio codec with 90x compression factor. Supports 44.1kHz, 24kHz, and 16kHz mono/stereo audio
  • Descript audio tools: Object-oriented handling of audio data, with GPU-powered augmentations, and more
  • Meta encodec: State-of-the-art deep learning based audio codec supporting both mono 24 kHz audio and stereo 48 kHz audio
  • pipecat: Open Source framework for voice and multimodal conversational AI

Synthesis

  • Coqui TTS: a deep learning toolkit for Text-to-Speech, battle-tested in research and production
  • DiffSinger: singing voice synthesis via shallow diffusion mechanism
  • Real-Time-Voice-Cloning: Clone a voice in 5 seconds to generate arbitrary speech in real-time
  • wavenet: A TensorFlow implementation of DeepMind's WaveNet paper
  • FastSpeech2: An implementation of Microsoft's "FastSpeech 2: Fast and High-Quality End-to-End Text to Speech"
  • MelGAN: Unofficial PyTorch implementation of MelGAN vocoder
  • hifi-gan: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis
  • tortoise-tts: A multi-voice TTS system trained with an emphasis on quality
  • lyrebird: Simple and powerful voice changer for Linux, written with Python & GTK
  • elevenlabs: The official Python API for ElevenLabs Text to Speech
  • piper: A fast, local neural text to speech system
  • tts-generation-webui: TTS Generation Web UI (Bark, MusicGen + AudioGen, Tortoise, RVC, Vocos, Demucs, SeamlessM4T, MAGNet)
  • GPT-SoVITS: 1 min voice data can also be used to train a good TTS model! (few shot voice cloning)
  • metavoice-src: Foundational model for human-like, expressive TTS
  • Retrieval-based-Voice-Conversion-WebUI: Voice data <= 10 mins can also be used to train a good VC model!
  • midi2voice: Singing synthesis from MIDI file
  • OpenVoice: Instant voice cloning by MyShell
  • ChatTTS: A generative speech model for daily dialogue
  • csm: A Conversational Speech Generation Model
  • chatterbox: Resemble AI's first production-grade open source TTS model
  • Fish-Audio: Fish Audio S2 Pro is the most advanced multimodal model developed by Fish Audio
  • KittenTTS: State-of-the-art TTS model under 25MB
  • Covo-Audio: Covo-Audio is a 7B-parameter end-to-end large audio language model that directly processes continuous audio inputs and generates audio outputs within a single unified architecture.
  • TTS-arxiv-daily: Automatically updated list of text-to-speech (TTS) papers, refreshed every 12 hours using GitHub Actions
  • kokoro: an open-weight TTS model with 82 million parameters
