Skip to content

Raahim24/VizzAI

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

17 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

VizzAI - MultiModal Video Analysis Platform

An intelligent video analysis platform that combines AI transcript processing with visual frame analysis. Users can ask natural language questions about YouTube videos and get precise answers with clickable timestamps.

What VizzAI Does

VizzAI is a full-stack multimodal AI application that revolutionizes how people interact with video content. Instead of watching entire videos, users can simply ask questions and get instant, precise answers.

The Problem It Solves

  • People waste time watching long videos to find specific information
  • Existing solutions only work with text, not visual content
  • No easy way to jump to relevant moments in videos
  • Manual timestamp creation is time-consuming

The Solution

  • Smart Question Routing: AI automatically detects if questions need visual analysis
  • Multi-Modal Analysis: Combines transcript (what people say) with visual frames (what they show)
  • Intelligent Caching: Process videos once, get instant answers forever
  • Auto-Generated Chapters: Creates clickable timestamps automatically
  • 3-Layer Fallback System: Ensures high transcript extraction success rate

Key Features

Intelligent Question Classification

"What did he say about cars?"Text Analysis (faster)
"What color is the car?"Visual Analysis (comprehensive)

Multi-Method Transcript Extraction

# 3-layer fallback system for maximum reliability
Method 1: YouTube API (fastest) → 
Method 2: yt-dlp captionsMethod 3: Whisper AI (most reliable)

Smart Caching Architecture

# Intelligent caching reduces processing time significantly
First Request:  YouTube URLProcessCacheAnswer
Future Requests: CacheAnswer (instant)

Auto-Chapter Generation

  • AI creates video chapters with clickable timestamps
  • Output: "0:00:30 - Introduction, 0:02:15 - Main Topic"
  • Frontend converts to clickable buttons that jump to exact moments

Tech Stack

Frontend

  • React.js - Component-based UI with state management
  • Custom CSS - Cyberpunk design with glassmorphism effects
  • Real-time Notifications - Loading states and error handling
  • Responsive Design - Mobile and desktop optimized

Backend

  • FastAPI - High-performance async web framework
  • Google Gemini AI - Latest multimodal AI for text + vision
  • OpenCV - Video frame extraction and processing
  • Whisper AI - Audio transcription fallback
  • yt-dlp - YouTube video/caption downloading
  • Smart Caching - In-memory data persistence

System Architecture

┌─────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   React     │    │   FastAPI        │    │   AI Processors │
│   Frontend  │◄──►│   Backend        │◄──►│   (4 Engines)   │
└─────────────┘    └──────────────────┘    └─────────────────┘
       │                     │                       │
       │                     │                       │
   User Input           API Routes              ┌──────────┐
   Video URLs           Caching                 │ YouTube  │
   Questions            Validation              │ Gemini   │
   Timestamps                                   │ Whisper  │
                                               │ OpenCV   │
                                               └──────────┘

Project Structure

Frontend (frontend/)

src/
├── App.js         # Main UI component with video + chat interface
├── App.css        # Cyberpunk styling with animations
├── index.js       # React app entry point
└── package.json   # Dependencies and scripts

Backend (backend/)

api/
├── main.py        # FastAPI server setup and CORS configuration
├── endpoints.py   # All API routes (/smart-question, /health, etc.)
├── models.py      # Data structures for requests/responses
└── cache.py       # Smart caching system for processed videos

processors/
├── video_analysis_coordinator.py  # Main brain - routes all requests
├── ai_processor.py                # Gemini AI integration + question routing
├── youtube_processor.py           # 3-method transcript extraction
└── visual_processor.py            # Video frame extraction + processing

How It Works

1. Video Processing Pipeline

User Input: YouTube URL
    ↓
Check Cache: Already processed?
    ↓ (if new)
Extract Transcript: YouTube API → yt-dlp → Whisper AI
    ↓
Generate Chapters: AI creates timestamps
    ↓
Cache Everything: Store for future use
    ↓
Return: Video ready for questions

2. Question Analysis Pipeline

User Question: "What color is the car at 2:30?"
    ↓
AI Detection: Needs visual analysis
    ↓
Frame Extraction: Get video frames around timestamp
    ↓
Multimodal AI: Analyze transcript + frames
    ↓
Generate Answer: With precise timestamps
    ↓
Frontend: Makes timestamps clickable

3. Smart Caching Logic

class VideoCache:
    def process_video(self, url):
        if url in cache:
            return cache[url]  # Instant response
        
        # Process fresh
        transcript = extract_transcript(url)
        frames = extract_frames(url) if needed
        
        cache[url] = {transcript, frames, metadata}
        return cache[url]

Installation & Setup

Prerequisites

# Required software
Node.js 16+
Python 3.8+
Git

Quick Start

1. Clone repository

git clone https://github.com/your-username/vizzai.git
cd vizzai

2. Backend setup

cd backend
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install -r requirements.txt

3. Environment configuration

cp .env.example .env
GEMINI_API_KEY=your_gemini_api_key_here
# Add your GEMINI_API_KEY to .env file
4. Start backend server
python -m api.main # IMP: You should be in backend folder

5. Frontend setup (new terminal)

cd frontend
npm install
npm start

6. Open application

# http://localhost:3000

How to Use

1. Load a Video

  1. Paste YouTube URL in the input field
  2. Click "Load Video"
  3. Wait for transcript extraction
  4. Video chapters appear automatically

2. Ask Questions

Text Questions (fast):

  • "What is this video about?"
  • "What did the speaker say about AI?"
  • "Summarize the main points"

Visual Questions (comprehensive):

  • "What color is the car?"
  • "How many people are in the video?"
  • "What's written on the screen at 2:30?"

3. Navigate with Timestamps

  • AI generates answers with timestamps: (0:02:30)
  • Click any timestamp to jump to that moment
  • Auto-generated chapters provide video overview

API Endpoints

Core Endpoints

POST /smart-question
# Main endpoint - handles all question types
# Automatically routes to text or visual analysis

POST /process-youtube  
# Extract transcript only (for caching)

GET /health
# System status and processor availability

GET /cache-stats
# Current cache usage statistics

Example API Usage

// Ask a question about a video
const response = await fetch('/smart-question', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    url: 'https://youtube.com/watch?v=example',
    question: 'What color is the car?'
  })
});

const result = await response.json();
// Returns: AI answer with clickable timestamps

Technical Challenges Solved

Multimodal AI Integration

  • Problem: Combining text and visual analysis seamlessly
  • Solution: Created intelligent routing system that detects question types
  • Implementation:
def check_needs_visual_analysis(self, question):
    # Custom AI prompt to classify questions
    if "color" in question or "appearance" in question:
        return "visual"
    return "transcript"

Video Processing Performance

  • Problem: Frame extraction is slow and resource-intensive
  • Solution: Smart frame sampling (1 frame per 5 seconds) + caching
  • Result: Optimized processing time and resource usage

Transcript Reliability

  • Problem: Not all YouTube videos have captions
  • Solution: Built 3-layer fallback system
  • Implementation:
def process_youtube_video(self, url):
    # Method 1: Official YouTube captions (fastest)
    result = self.try_youtube_api(url)
    if result: return result
    
    # Method 2: Downloaded caption files (medium)
    result = self.try_ytdlp_captions(url)
    if result: return result
    
    # Method 3: AI audio transcription (most reliable)
    return self.try_whisper(url)

Real-time User Experience

  • Problem: Long processing times create poor UX
  • Solution: Real-time notifications + progress tracking
  • Features: Loading animations, progress updates, error handling

Future Enhancements

Planned Features

  • Multi-language support - Analyze videos in different languages
  • Batch processing - Handle multiple videos simultaneously
  • Advanced search - Natural language search within video libraries
  • Export functionality - Generate video summaries and reports
  • Real-time analysis - Live stream processing capabilities

Architecture Decisions

Why FastAPI?

  • Async support for handling multiple video processing requests
  • Automatic API documentation with OpenAPI/Swagger
  • Type hints integration for better code reliability
  • High performance for AI workloads

Why React?

  • Component reusability for video player and chat interface
  • Rich ecosystem for real-time updates and animations
  • Excellent development experience with hot reloading
  • Industry standard for modern web applications

Why Gemini AI?

  • Multimodal capabilities built-in (text + vision)
  • Cost-effective for high-volume processing
  • Latest technology with superior performance
  • Google ecosystem integration

Code Quality & Best Practices

Backend Architecture

  • Separation of concerns with dedicated processors
  • Error handling at every level with graceful fallbacks
  • Type hints throughout for code reliability
  • Async/await for non-blocking operations
  • Comprehensive logging for debugging and monitoring

Frontend Architecture

  • Component-based design with clear state management
  • Error boundaries for graceful error handling
  • Responsive design with mobile-first approach
  • Performance optimization with React best practices
  • Accessibility with semantic HTML and ARIA labels

About

Created by: Raahim Khan
Tech Stack: React.js, FastAPI, Google Gemini AI, OpenCV, Whisper
Architecture: Full-stack multimodal AI application


License

This project is licensed under the MIT License - see the LICENSE file for details.


VizzAI - Transforming how we interact with video content through intelligent AI analysis.

About

VizzAI - MultiModal Video Analysis Platform

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors