VizzAI - MultiModal Video Analysis Platform

An intelligent video analysis platform that combines AI transcript processing with visual frame analysis. Users can ask natural language questions about YouTube videos and get precise answers with clickable timestamps.

What VizzAI Does

VizzAI is a full-stack multimodal AI application that revolutionizes how people interact with video content. Instead of watching entire videos, users can simply ask questions and get instant, precise answers.

The Problem It Solves

People waste time watching long videos to find specific information
Existing solutions only work with text, not visual content
No easy way to jump to relevant moments in videos
Manual timestamp creation is time-consuming

The Solution

Smart Question Routing: AI automatically detects if questions need visual analysis
Multi-Modal Analysis: Combines transcript (what people say) with visual frames (what they show)
Intelligent Caching: Process videos once, get instant answers forever
Auto-Generated Chapters: Creates clickable timestamps automatically
3-Layer Fallback System: Ensures high transcript extraction success rate

Key Features

Intelligent Question Classification

"What did he say about cars?" → Text Analysis (faster)
"What color is the car?"     → Visual Analysis (comprehensive)

Multi-Method Transcript Extraction

# 3-layer fallback system for maximum reliability
Method 1: YouTube API (fastest) → 
Method 2: yt-dlp captions → 
Method 3: Whisper AI (most reliable)

Smart Caching Architecture

# Intelligent caching reduces processing time significantly
First Request:  YouTube URL → Process → Cache → Answer
Future Requests: Cache → Answer (instant)

Auto-Chapter Generation

AI creates video chapters with clickable timestamps
Output: "0:00:30 - Introduction, 0:02:15 - Main Topic"
Frontend converts to clickable buttons that jump to exact moments

Tech Stack

Frontend

React.js - Component-based UI with state management
Custom CSS - Cyberpunk design with glassmorphism effects
Real-time Notifications - Loading states and error handling
Responsive Design - Mobile and desktop optimized

Backend

FastAPI - High-performance async web framework
Google Gemini AI - Latest multimodal AI for text + vision
OpenCV - Video frame extraction and processing
Whisper AI - Audio transcription fallback
yt-dlp - YouTube video/caption downloading
Smart Caching - In-memory data persistence

System Architecture

┌─────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   React     │    │   FastAPI        │    │   AI Processors │
│   Frontend  │◄──►│   Backend        │◄──►│   (4 Engines)   │
└─────────────┘    └──────────────────┘    └─────────────────┘
       │                     │                       │
       │                     │                       │
   User Input           API Routes              ┌──────────┐
   Video URLs           Caching                 │ YouTube  │
   Questions            Validation              │ Gemini   │
   Timestamps                                   │ Whisper  │
                                               │ OpenCV   │
                                               └──────────┘

Project Structure

Frontend (`frontend/`)

src/
├── App.js         # Main UI component with video + chat interface
├── App.css        # Cyberpunk styling with animations
├── index.js       # React app entry point
└── package.json   # Dependencies and scripts

Backend (`backend/`)

api/
├── main.py        # FastAPI server setup and CORS configuration
├── endpoints.py   # All API routes (/smart-question, /health, etc.)
├── models.py      # Data structures for requests/responses
└── cache.py       # Smart caching system for processed videos

processors/
├── video_analysis_coordinator.py  # Main brain - routes all requests
├── ai_processor.py                # Gemini AI integration + question routing
├── youtube_processor.py           # 3-method transcript extraction
└── visual_processor.py            # Video frame extraction + processing

How It Works

1. Video Processing Pipeline

User Input: YouTube URL
    ↓
Check Cache: Already processed?
    ↓ (if new)
Extract Transcript: YouTube API → yt-dlp → Whisper AI
    ↓
Generate Chapters: AI creates timestamps
    ↓
Cache Everything: Store for future use
    ↓
Return: Video ready for questions

2. Question Analysis Pipeline

User Question: "What color is the car at 2:30?"
    ↓
AI Detection: Needs visual analysis
    ↓
Frame Extraction: Get video frames around timestamp
    ↓
Multimodal AI: Analyze transcript + frames
    ↓
Generate Answer: With precise timestamps
    ↓
Frontend: Makes timestamps clickable

3. Smart Caching Logic

class VideoCache:
    def process_video(self, url):
        if url in cache:
            return cache[url]  # Instant response
        
        # Process fresh
        transcript = extract_transcript(url)
        frames = extract_frames(url) if needed
        
        cache[url] = {transcript, frames, metadata}
        return cache[url]

Installation & Setup

Prerequisites

# Required software
Node.js 16+
Python 3.8+
Git

Quick Start

1. Clone repository

git clone https://github.com/your-username/vizzai.git
cd vizzai

2. Backend setup

cd backend
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install -r requirements.txt

3. Environment configuration

cp .env.example .env
GEMINI_API_KEY=your_gemini_api_key_here
# Add your GEMINI_API_KEY to .env file

4. Start backend server

python -m api.main # IMP: You should be in backend folder

5. Frontend setup (new terminal)

cd frontend
npm install
npm start

6. Open application

# http://localhost:3000

How to Use

1. Load a Video

Paste YouTube URL in the input field
Click "Load Video"
Wait for transcript extraction
Video chapters appear automatically

2. Ask Questions

Text Questions (fast):

"What is this video about?"
"What did the speaker say about AI?"
"Summarize the main points"

Visual Questions (comprehensive):

"What color is the car?"
"How many people are in the video?"
"What's written on the screen at 2:30?"

3. Navigate with Timestamps

AI generates answers with timestamps: (0:02:30)
Click any timestamp to jump to that moment
Auto-generated chapters provide video overview

API Endpoints

Core Endpoints

POST /smart-question
# Main endpoint - handles all question types
# Automatically routes to text or visual analysis

POST /process-youtube  
# Extract transcript only (for caching)

GET /health
# System status and processor availability

GET /cache-stats
# Current cache usage statistics

Example API Usage

// Ask a question about a video
const response = await fetch('/smart-question', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    url: 'https://youtube.com/watch?v=example',
    question: 'What color is the car?'
  })
});

const result = await response.json();
// Returns: AI answer with clickable timestamps

Technical Challenges Solved

Multimodal AI Integration

Problem: Combining text and visual analysis seamlessly
Solution: Created intelligent routing system that detects question types
Implementation:

def check_needs_visual_analysis(self, question):
    # Custom AI prompt to classify questions
    if "color" in question or "appearance" in question:
        return "visual"
    return "transcript"

Video Processing Performance

Problem: Frame extraction is slow and resource-intensive
Solution: Smart frame sampling (1 frame per 5 seconds) + caching
Result: Optimized processing time and resource usage

Transcript Reliability

Problem: Not all YouTube videos have captions
Solution: Built 3-layer fallback system
Implementation:

def process_youtube_video(self, url):
    # Method 1: Official YouTube captions (fastest)
    result = self.try_youtube_api(url)
    if result: return result
    
    # Method 2: Downloaded caption files (medium)
    result = self.try_ytdlp_captions(url)
    if result: return result
    
    # Method 3: AI audio transcription (most reliable)
    return self.try_whisper(url)

Real-time User Experience

Problem: Long processing times create poor UX
Solution: Real-time notifications + progress tracking
Features: Loading animations, progress updates, error handling

Future Enhancements

Planned Features

Multi-language support - Analyze videos in different languages
Batch processing - Handle multiple videos simultaneously
Advanced search - Natural language search within video libraries
Export functionality - Generate video summaries and reports
Real-time analysis - Live stream processing capabilities

Architecture Decisions

Why FastAPI?

Async support for handling multiple video processing requests
Automatic API documentation with OpenAPI/Swagger
Type hints integration for better code reliability
High performance for AI workloads

Why React?

Component reusability for video player and chat interface
Rich ecosystem for real-time updates and animations
Excellent development experience with hot reloading
Industry standard for modern web applications

Why Gemini AI?

Multimodal capabilities built-in (text + vision)
Cost-effective for high-volume processing
Latest technology with superior performance
Google ecosystem integration

Code Quality & Best Practices

Backend Architecture

Separation of concerns with dedicated processors
Error handling at every level with graceful fallbacks
Type hints throughout for code reliability
Async/await for non-blocking operations
Comprehensive logging for debugging and monitoring

Frontend Architecture

Component-based design with clear state management
Error boundaries for graceful error handling
Responsive design with mobile-first approach
Performance optimization with React best practices
Accessibility with semantic HTML and ARIA labels

About

Created by: Raahim Khan
Tech Stack: React.js, FastAPI, Google Gemini AI, OpenCV, Whisper
Architecture: Full-stack multimodal AI application

License

This project is licensed under the MIT License - see the LICENSE file for details.

VizzAI - Transforming how we interact with video content through intelligent AI analysis.

Name		Name	Last commit message	Last commit date
Latest commit History 17 Commits
backend		backend
frontend		frontend
.gitignore		.gitignore
README.md		README.md
VizzAI.pdf		VizzAI.pdf

Folders and files

Latest commit

History

Repository files navigation

VizzAI - MultiModal Video Analysis Platform

What VizzAI Does

The Problem It Solves

The Solution

Key Features

Intelligent Question Classification

Multi-Method Transcript Extraction

Smart Caching Architecture

Auto-Chapter Generation

Tech Stack

Frontend

Backend

System Architecture

Project Structure

Frontend (frontend/)

Backend (backend/)

How It Works

1. Video Processing Pipeline

2. Question Analysis Pipeline

3. Smart Caching Logic

Installation & Setup

Prerequisites

Quick Start

1. Clone repository

2. Backend setup

3. Environment configuration

4. Start backend server

5. Frontend setup (new terminal)

6. Open application

How to Use

1. Load a Video

2. Ask Questions

3. Navigate with Timestamps

API Endpoints

Core Endpoints

Example API Usage

Technical Challenges Solved

Multimodal AI Integration

Video Processing Performance

Transcript Reliability

Real-time User Experience

Future Enhancements

Planned Features

Architecture Decisions

Why FastAPI?

Why React?

Why Gemini AI?

Code Quality & Best Practices

Backend Architecture

Frontend Architecture

About

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Frontend (`frontend/`)

Backend (`backend/`)

Packages