
Speech Recognition & Speaker Diarization (End-to-End AI Project)

🔗 Live on: https://whosaywhat.duckdns.org

This is the live production version of the project. You can upload your audio files and test the AI-based speech recognition and speaker diarization system directly from this link.



Project Description

This project is a full-stack, production-ready AI-powered speech analysis system. Users can upload audio files, transcribe speech to text using state-of-the-art ASR (Automatic Speech Recognition) models, and automatically separate distinct speakers via Speaker Diarization. The system features a dual-mode processing engine (Fast vs. Pro) and exports color-coded transcripts in Word format.

The project is designed as a complete End-to-End AI application, covering:

  • Model serving & optimization
  • Audio signal processing
  • Speaker clustering algorithms
  • Containerization
  • Cloud deployment on limited resources (CPU-only)

Project Goal

The main goal of this project is to build a robust, scalable, and efficient AI application that converts spoken language into structured, speaker-labeled text. The focus is not only on machine learning performance but also on:

  • Resource Optimization: Running heavy AI models on CPU-only infrastructure.
  • User Experience: Providing distinct modes for speed vs. accuracy.
  • Real-world Utility: Generating usable, formatted reports (.docx).
  • Production Deployment: Dockerized environment on AWS EC2.

๐Ÿ› ๏ธ Technologies Used

AI & Audio Processing

  • Task: Automatic Speech Recognition (ASR) & Speaker Diarization
  • Models:
    • WhisperX (High-precision alignment & diarization)
    • Faster-Whisper (Optimized inference speed)
    • Pyannote.audio (Speaker segmentation & embedding)
  • Algorithms: K-Means Clustering (For custom speaker separation in Fast Mode)
  • Libraries: PyTorch, Torchaudio, Librosa, Scikit-learn
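As a toy illustration of the Fast Mode clustering step, the sketch below clusters per-chunk embeddings with K-Means. The synthetic vectors stand in for the MFCC embeddings that librosa would produce; the actual chunking and feature-extraction code is an assumption, not taken from the project source.

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-ins for per-chunk embeddings: the real Fast Mode pipeline
# (an assumption here) extracts MFCC features with librosa from short
# audio chunks and averages them into one 13-dim vector per chunk.
rng = np.random.default_rng(0)
speaker_a = rng.normal(loc=0.0, scale=0.3, size=(5, 13))  # chunks 0-4
speaker_b = rng.normal(loc=3.0, scale=0.3, size=(5, 13))  # chunks 5-9
embeddings = np.vstack([speaker_a, speaker_b])

# Cluster chunk embeddings into n speakers; each chunk gets a speaker label
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)
```

With well-separated embeddings, chunks from the same speaker land in the same cluster, which is what makes the unauthenticated Fast Mode possible.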

Application Logic & Backend

  • Language: Python 3.13
  • Logic: Custom pipelines for audio chunking, feature extraction (MFCC), and transcript alignment.
  • Export: python-docx (For generating formatted Word documents)
  • Environment: Python-Dotenv (Configuration management)

Frontend (User Interface)

  • Framework: Streamlit
  • Design: Custom layout with distinct processing modes (Fast/Pro)
  • Interactivity: Real-time audio playback and file handling

DevOps, Container & Cloud

  • Containerization: Docker (Optimized Slim Image)
  • Cloud Provider: AWS EC2 (Ubuntu - Free Tier Optimized)
  • Network: DuckDNS (Dynamic DNS)
  • Port Management: Docker Port Mapping (80 -> 8501)
  • Model Auth: Hugging Face Token Authentication

Libraries Used

PyTorch, Torchaudio, WhisperX, Faster-Whisper, Pyannote, Streamlit, Librosa, Scikit-Learn, NumPy, Docker, AWS, Python-Docx


📂 Project Structure

  • ai/
    • main.py → Core logic containing slow_model (WhisperX) and fast_model (Faster-Whisper + Custom Clustering).
    • Model loaders and caching mechanisms (@st.cache_resource).
  • frontend/
    • the_page.py → UI layout, file uploader, and mode selection buttons.
  • app.py
    • Main entry point for the Streamlit application.
  • Dockerfile
    • python:3.13.3 image with CPU-only PyTorch build.
  • requirements.txt
    • List of dependencies optimized for cloud deployment.
  • .env
    • Environment variables (Contains Hugging Face Token).

How to Run Locally

You can run the system in two different ways:

  • Using Docker (recommended for isolation)
  • Running manually with Python

1. Environment Variable Configuration (.env)

You need a Hugging Face token to access segmentation models (pyannote/speaker-diarization). Create a file named .env in the project root:

HUGGING_FACE_TOKEN=hf_your_hugging_face_token_here

Ensure you have accepted the user agreement for pyannote/speaker-diarization-3.1 on Hugging Face.


2. Option A: Run with Docker (Recommended)

This is the most stable way to run the project, ensuring all system dependencies (FFmpeg, etc.) are installed.

Steps

Build the image:

docker build -t speech-app .

Run the container:

docker run -p 8501:8501 --env-file .env speech-app

Open your browser:

http://localhost:8501
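For reference, a Dockerfile along the lines described in Project Structure might look like the sketch below. The base image and CPU-only torch install match the description above, but the exact package steps and index URL here are assumptions, not the project's actual file.

```dockerfile
FROM python:3.13.3-slim

# FFmpeg is required for audio decoding
RUN apt-get update && apt-get install -y --no-install-recommends ffmpeg \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app
COPY requirements.txt .
# CPU-only PyTorch wheel keeps the image small (no CUDA libraries)
RUN pip install --no-cache-dir torch --index-url https://download.pytorch.org/whl/cpu \
    && pip install --no-cache-dir -r requirements.txt

COPY . .
EXPOSE 8501
CMD ["streamlit", "run", "app.py", "--server.port=8501", "--server.address=0.0.0.0"]
```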

3. Option B: Run Without Docker (Manual Setup)

System Requirements

You must have FFmpeg installed on your system.

Install Dependencies

pip install -r requirements.txt

Start the Application

streamlit run app.py

Frontend runs at:

http://localhost:8501

Model Authentication Flow

Since Pyannote.audio models are gated, the application requires authentication:

  1. Application starts and looks for HUGGING_FACE_TOKEN.
  2. Fast Mode: Uses Faster-Whisper (No auth required) + Custom KMeans clustering.
  3. Pro Mode: Authenticates with Hugging Face via the token to download/load pyannote/speaker-diarization models.
  4. If the token is invalid or missing, the Pro model will fail to load.

Production Deployment

  • Deployed on: AWS EC2 (Ubuntu)
  • Optimization: Configured with Swap Memory (8GB) to handle model loading on limited RAM.
  • Docker Optimization: Installs the CPU-only PyTorch build (torch --index-url .../cpu) to reduce image size by removing CUDA dependencies.
  • Access: Served via DuckDNS with Port 80 redirection.
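An 8 GB swap file on Ubuntu is typically configured with the standard commands below; the exact setup used on this instance is not shown in the source.

```shell
# Create an 8 GB swap file (matches the EC2 setup described above)
sudo fallocate -l 8G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
# Persist the swap file across reboots
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab
```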

Summary

This project is a sophisticated audio analysis tool. It is a complete, cloud-deployed AI system that demonstrates:

  • Advanced Audio Processing: Combining ASR and Diarization.
  • Resource Management: Running large speech models on CPU-only hardware.
  • Containerization: Custom Docker optimization for AI workloads.
  • Full Stack Integration: From raw audio processing to user-friendly UI.

About

AI-Powered Speech Recognition & Diarization: A robust Streamlit application leveraging WhisperX and Faster-Whisper for accurate transcription and speaker separation. Features dual-mode processing (Fast/Pro), automatic speaker identification, color-coded Word (.docx) export, and CPU-optimized Docker deployment on AWS EC2.
