Live on: https://whosaywhat.duckdns.org

This is the live production version of the project. You can upload your audio files directly and test the AI-based speech recognition and speaker diarization system from this link.
This project is a full-stack, production-ready, AI-powered speech analysis system. Users can upload audio files, transcribe speech to text using state-of-the-art ASR (Automatic Speech Recognition) models, and automatically separate the audio into distinct speakers (speaker diarization). The system features a dual-mode processing engine (Fast vs. Pro) and exports color-coded transcripts in Word format.
The project is designed as a complete End-to-End AI application, covering:
- Model serving & optimization
- Audio signal processing
- Speaker clustering algorithms
- Containerization
- Cloud deployment on limited resources (CPU-only)
The main goal of this project is to build a robust, scalable, and efficient AI application that converts spoken language into structured, speaker-labeled text. The focus is not only on machine learning performance but also on:
- Resource Optimization: Running heavy AI models on CPU-only infrastructure.
- User Experience: Providing distinct modes for speed vs. accuracy.
- Real-world Utility: Generating usable, formatted reports (.docx).
- Production Deployment: Dockerized environment on AWS EC2.
- Task: Automatic Speech Recognition (ASR) & Speaker Diarization
- Models: WhisperX (high-precision alignment & diarization), Faster-Whisper (optimized inference speed), Pyannote.audio (speaker segmentation & embedding)
- Algorithms: K-Means clustering (for custom speaker separation in Fast Mode)
- Libraries: PyTorch, Torchaudio, Librosa, Scikit-learn
- Language: Python 3.13
- Logic: Custom pipelines for audio chunking, feature extraction (MFCC), and transcript alignment
- Export: python-docx (for generating formatted Word documents)
- Environment: Python-Dotenv (configuration management)
- Framework: Streamlit
- Design: Custom layout with distinct processing modes (Fast/Pro)
- Interactivity: Real-time audio playback and file handling
- Containerization: Docker (optimized slim image)
- Cloud Provider: AWS EC2 (Ubuntu, Free Tier optimized)
- Network: DuckDNS (dynamic DNS)
- Port Management: Docker port mapping (80 -> 8501)
- Model Auth: Hugging Face token authentication
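To illustrate the Fast Mode speaker-separation idea from the stack above, here is a minimal sketch of clustering per-segment feature vectors (e.g. mean MFCCs) with Scikit-learn's K-Means. The function name and synthetic features are assumptions for the demo; the real pipeline extracts features from the uploaded audio.

```python
import numpy as np
from sklearn.cluster import KMeans

def label_speakers(segment_features, n_speakers=2):
    """Assign a speaker label to each segment by clustering feature vectors.

    segment_features: array of shape (n_segments, n_features), e.g. the
    mean MFCC vector of each transcript segment.
    """
    km = KMeans(n_clusters=n_speakers, n_init=10, random_state=0)
    labels = km.fit_predict(segment_features)
    return [f"SPEAKER_{label:02d}" for label in labels]

# Synthetic demo: two well-separated "voices" of 3 segments each.
rng = np.random.default_rng(0)
feats = np.vstack([
    rng.normal(0.0, 0.1, (3, 13)),
    rng.normal(5.0, 0.1, (3, 13)),
])
print(label_speakers(feats))
```

In the real system each segment's feature vector comes from Librosa's MFCC extraction; the clustering step is the same.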
- `ai/main.py` - Core logic containing `slow_model` (WhisperX) and `fast_model` (Faster-Whisper + custom clustering); model loaders and caching mechanisms (`@st.cache_resource`).
- `frontend/the_page.py` - UI layout, file uploader, and mode selection buttons.
- `app.py` - Main entry point for the Streamlit application.
- `Dockerfile` - `python:3.13.3` image with CPU-only PyTorch build.
- `requirements.txt` - List of dependencies optimized for cloud deployment.
- `.env` - Environment variables (contains the Hugging Face token).
You can run the system in two different ways:
- Using Docker (recommended for isolation)
- Running manually with Python
You need a Hugging Face token to access the gated segmentation models (`pyannote/speaker-diarization`). Create a file named `.env` in the project root:

```
HUGGING_FACE_TOKEN=hf_your_hugging_face_token_here
```

Ensure you have accepted the user agreement for `pyannote/speaker-diarization-3.1` on Hugging Face.
This is the most stable way to run the project, ensuring all system dependencies (FFmpeg, etc.) are installed.
Build the image:

```shell
docker build -t speech-app .
```

Run the container:

```shell
docker run -p 8501:8501 --env-file .env speech-app
```

Then open your browser at http://localhost:8501.
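For reference, a slim CPU-only image like the one this project ships can be sketched as follows; the exact `Dockerfile` in the repository may differ in details:

```dockerfile
# Sketch of a slim, CPU-only image (assumptions; see the project's Dockerfile)
FROM python:3.13.3-slim

# FFmpeg is required for audio decoding
RUN apt-get update \
    && apt-get install -y --no-install-recommends ffmpeg \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app
COPY requirements.txt .

# Installing the CPU-only PyTorch wheel avoids pulling CUDA libraries into the image
RUN pip install --no-cache-dir torch --index-url https://download.pytorch.org/whl/cpu \
    && pip install --no-cache-dir -r requirements.txt

COPY . .
EXPOSE 8501
CMD ["streamlit", "run", "app.py", "--server.address=0.0.0.0"]
```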
You must have FFmpeg installed on your system.
```shell
pip install -r requirements.txt
streamlit run app.py
```

The frontend runs at http://localhost:8501.
Since Pyannote.audio models are gated, the application requires authentication:
- Application starts and looks for
HUGGING_FACE_TOKEN. - Fast Mode: Uses
Faster-Whisper(No auth required) + Custom KMeans clustering. - Pro Mode: Authenticates with Hugging Face via the token to download/load
pyannote/speaker-diarizationmodels. - If the token is invalid or missing, the Pro model will fail to load.
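The token check described above can be sketched as a simple mode-selection guard. `select_mode` is a hypothetical helper, not the app's actual function; the commented `Pipeline.from_pretrained` call shows how the gated pyannote pipeline is typically loaded with a token.

```python
import os

def select_mode(requested: str) -> str:
    """Fall back to Fast Mode when Pro Mode's token is missing.

    Pro Mode needs HUGGING_FACE_TOKEN to load the gated pipeline, e.g.:
        Pipeline.from_pretrained("pyannote/speaker-diarization-3.1",
                                 use_auth_token=token)
    Fast Mode (Faster-Whisper + KMeans) needs no authentication.
    """
    if requested == "pro" and not os.getenv("HUGGING_FACE_TOKEN"):
        return "fast"
    return requested
```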
- Deployed on: AWS EC2 (Ubuntu)
- Optimization: Configured with 8 GB of swap memory to handle model loading on limited RAM.
- Docker Optimization: Installs PyTorch with `--index-url .../cpu` to reduce image size by removing CUDA dependencies.
- Access: Served via DuckDNS with port 80 redirection.
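The swap configuration mentioned above can be reproduced on a fresh Ubuntu EC2 host with the standard commands below (run once, as root; file path and size per the setup described here):

```shell
# Create and enable an 8 GB swap file
sudo fallocate -l 8G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

# Persist the swap file across reboots
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab
```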
This project is more than a sophisticated audio analysis tool: it is a complete, cloud-deployed AI system that demonstrates:
- Advanced Audio Processing: Combining ASR and diarization.
- Resource Management: Running large deep-learning models on CPU-only hardware.
- Containerization: Custom Docker optimization for AI workloads.
- Full Stack Integration: From raw audio processing to user-friendly UI.