AI Voice Agents - Exploring the Next Generation of Human-Machine Interaction!

| Source | Description | Code | Paper | Model |
|---|---|---|---|---|
| Bland AI | Bland AI is a platform for AI phone calling. It automates inbound and outbound enterprise phone calls with conversational AI that sounds human, and its API lets you send or receive calls with a programmable voice agent. | API | | |
| GPT-4o | GPT-4o ("o" for "omni") is a step towards much more natural human-computer interaction: it accepts as input any combination of text, audio, image, and video and generates any combination of text, audio, and image outputs. | API | | |
| Retell AI | Retell AI: build advanced voice AI, powered by LLMs. | API | | |

| Source | Description | Code | Paper | Model |
|---|---|---|---|---|
| ChatTTS | ChatTTS is a text-to-speech model designed specifically for dialogue scenarios such as LLM assistants. | GitHub | | Hugging Face |
| CosyVoice | Multilingual large voice generation model with full-stack support for inference, training, and deployment. | GitHub | | |
| ElevenLabs | ElevenLabs: Text to Speech & AI Voice Generator. | API | ||
| Matcha-TTS | Matcha-TTS: A fast TTS architecture with conditional flow matching. | GitHub | arXiv | |
| StyleTTS 2 | Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models. | GitHub | arXiv | |
| XTTS | 🐸TTS is a library for advanced Text-to-Speech generation. | GitHub | | |

| Source | Description | Code | Paper | Model |
|---|---|---|---|---|
| SenseVoice | SenseVoice is a speech foundation model with multiple speech understanding capabilities, including automatic speech recognition (ASR), spoken language identification (LID), speech emotion recognition (SER), and audio event detection (AED). | GitHub | | Hugging Face |
| TeleSpeech-ASR | Large speech model for super multi-dialect ASR. | GitHub | | Hugging Face |
| Whisper | Whisper is a general-purpose speech recognition model. | GitHub | arXiv | Hugging Face |

| Source | Description | Code | Paper | Model |
|---|---|---|---|---|
| Make-An-Audio 3 | Transforming Text into Audio via Flow-based Large Diffusion Transformers. | GitHub | arXiv | Hugging Face |
