An intelligent video analysis platform that combines AI transcript processing with visual frame analysis. Users can ask natural language questions about YouTube videos and get precise answers with clickable timestamps.
VizzAI is a full-stack multimodal AI application for querying video content. Instead of watching an entire video, users ask a question and get an answer that points to the relevant moments.
- People waste time watching long videos to find specific information
- Existing solutions only work with text, not visual content
- No easy way to jump to relevant moments in videos
- Manual timestamp creation is time-consuming
- Smart Question Routing: AI automatically detects if questions need visual analysis
- Multi-Modal Analysis: Combines transcript (what people say) with visual frames (what they show)
- Intelligent Caching: Each video is processed once; later questions about it are answered from the cache
- Auto-Generated Chapters: Creates clickable timestamps automatically
- 3-Layer Fallback System: Ensures high transcript extraction success rate
"What did he say about cars?" → Text Analysis (faster)
"What color is the car?" → Visual Analysis (comprehensive)
# 3-layer fallback system for maximum reliability
Method 1: YouTube API (fastest) →
Method 2: yt-dlp captions →
Method 3: Whisper AI (most reliable)
# Intelligent caching reduces processing time significantly
First Request: YouTube URL → Process → Cache → Answer
Future Requests: Cache → Answer (instant)
- AI creates video chapters with clickable timestamps
- Output: "0:00:30 - Introduction, 0:02:15 - Main Topic"
- Frontend converts to clickable buttons that jump to exact moments
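The chapter string above can be turned into structured data before it reaches the UI. The following is a minimal sketch, assuming the output always follows the `H:MM:SS - Title` pattern shown in the sample and that titles contain no commas; `parse_chapters` and `CHAPTER_RE` are hypothetical names, not part of the project.

```python
import re

# Hypothetical helper; the pattern mirrors the sample output "0:00:30 - Introduction".
CHAPTER_RE = re.compile(r"(\d+):(\d{2}):(\d{2})\s*-\s*([^,]+)")

def parse_chapters(text):
    """Turn 'H:MM:SS - Title, ...' into a list of chapter dicts."""
    chapters = []
    for match in CHAPTER_RE.finditer(text):
        hours, minutes, seconds, title = match.groups()
        chapters.append({
            # Offset in seconds, which is what a video player needs for seeking
            "seconds": int(hours) * 3600 + int(minutes) * 60 + int(seconds),
            "label": f"{hours}:{minutes}:{seconds}",
            "title": title.strip(),
        })
    return chapters

chapters = parse_chapters("0:00:30 - Introduction, 0:02:15 - Main Topic")
for chapter in chapters:
    print(chapter["seconds"], chapter["label"], chapter["title"])
```

Each entry carries both the display label and the offset in seconds, so a button can show the former and seek to the latter.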
- React.js - Component-based UI with state management
- Custom CSS - Cyberpunk design with glassmorphism effects
- Real-time Notifications - Loading states and error handling
- Responsive Design - Mobile and desktop optimized
- FastAPI - High-performance async web framework
- Google Gemini AI - Latest multimodal AI for text + vision
- OpenCV - Video frame extraction and processing
- Whisper AI - Audio transcription fallback
- yt-dlp - YouTube video/caption downloading
- Smart Caching - In-memory store for processed videos
┌─────────────┐    ┌──────────────────┐    ┌─────────────────┐
│    React    │    │     FastAPI      │    │  AI Processors  │
│  Frontend   │◄──►│     Backend      │◄──►│   (4 Engines)   │
└─────────────┘    └──────────────────┘    └─────────────────┘
       │                    │                       │
       │                    │                       │
  User Input            API Routes            ┌──────────┐
  Video URLs            Caching               │ YouTube  │
  Questions             Validation            │ Gemini   │
  Timestamps                                  │ Whisper  │
                                              │ OpenCV   │
                                              └──────────┘
src/
├── App.js # Main UI component with video + chat interface
├── App.css # Cyberpunk styling with animations
├── index.js # React app entry point
└── package.json # Dependencies and scripts
api/
├── main.py # FastAPI server setup and CORS configuration
├── endpoints.py # All API routes (/smart-question, /health, etc.)
├── models.py # Data structures for requests/responses
└── cache.py # Smart caching system for processed videos
processors/
├── video_analysis_coordinator.py # Main brain - routes all requests
├── ai_processor.py # Gemini AI integration + question routing
├── youtube_processor.py # 3-method transcript extraction
└── visual_processor.py # Video frame extraction + processing
User Input: YouTube URL
↓
Check Cache: Already processed?
↓ (if new)
Extract Transcript: YouTube API → yt-dlp → Whisper AI
↓
Generate Chapters: AI creates timestamps
↓
Cache Everything: Store for future use
↓
Return: Video ready for questions
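The transcript step of this flow can be sketched as a generic fallback runner. This is an illustration under assumptions, not the project's implementation: `extract_with_fallback` is a hypothetical name, and the three `fake_*` functions are stand-ins for the YouTube API, yt-dlp, and Whisper extractors. Unlike a plain truthiness check, it also catches exceptions so one failing method does not stop the chain.

```python
def extract_with_fallback(url, methods):
    """Try each (name, function) pair in order; return the first non-empty transcript."""
    errors = {}
    for name, method in methods:
        try:
            transcript = method(url)
        except Exception as exc:  # a failing method should not abort the chain
            errors[name] = str(exc)
            continue
        if transcript:
            return {"method": name, "transcript": transcript, "errors": errors}
        errors[name] = "no transcript returned"
    raise RuntimeError(f"all transcript methods failed: {errors}")

# Stand-in extractors for demonstration only.
def fake_youtube_api(url):
    raise ConnectionError("captions disabled")

def fake_ytdlp(url):
    return None

def fake_whisper(url):
    return "hello world"

result = extract_with_fallback(
    "https://youtube.com/watch?v=example",
    [("youtube_api", fake_youtube_api), ("yt_dlp", fake_ytdlp), ("whisper", fake_whisper)],
)
print(result["method"], result["errors"])
```

Recording which method succeeded and why the earlier ones failed makes the result easier to log and debug.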
User Question: "What color is the car at 2:30?"
↓
AI Detection: Needs visual analysis
↓
Frame Extraction: Get video frames around timestamp
↓
Multimodal AI: Analyze transcript + frames
↓
Generate Answer: With precise timestamps
↓
Frontend: Makes timestamps clickable
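The "frames around timestamp" step can be illustrated with two small helpers. This is a sketch only: `timestamp_in_question` and `frame_window` are hypothetical names, and the 10-second radius and 5-second step are assumed values chosen for the example.

```python
import re

def timestamp_in_question(question):
    """Return the first M:SS or H:MM:SS timestamp in the question as seconds, or None."""
    match = re.search(r"\b(?:(\d+):)?(\d{1,2}):(\d{2})\b", question)
    if match is None:
        return None
    hours, minutes, seconds = match.groups()
    return int(hours or 0) * 3600 + int(minutes) * 60 + int(seconds)

def frame_window(center, radius=10, step=5, duration=None):
    """Seconds at which to grab frames: from center - radius to center + radius."""
    start = max(0, center - radius)
    end = center + radius
    if duration is not None:
        end = min(duration, end)  # do not sample past the end of the video
    return list(range(start, end + 1, step))

center = timestamp_in_question("What color is the car at 2:30?")
window = frame_window(center)
print(center, window)
```

A question without a timestamp returns `None`, in which case the caller would need another way to choose frames, such as sampling the whole video.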
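class VideoCache:
    def __init__(self):
        self.cache = {}  # url -> processed video data

    def process_video(self, url, needs_frames=False):
        if url in self.cache:
            return self.cache[url]  # Instant response
        # Process fresh (extract_transcript and extract_frames are defined elsewhere)
        transcript = extract_transcript(url)
        frames = extract_frames(url) if needs_frames else None
        self.cache[url] = {"transcript": transcript, "frames": frames, "metadata": {"url": url}}
        return self.cache[url]
# Required software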
Node.js 16+
Python 3.8+
Git
git clone https://github.com/your-username/vizzai.git
cd vizzai
cd backend
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -r requirements.txt
cp .env.example .env
# Add your GEMINI_API_KEY to the .env file
GEMINI_API_KEY=your_gemini_api_key_here
python -m api.main # Important: run this from the backend folder
cd frontend # from the project root, in a second terminal
npm install
npm start
# http://localhost:3000
- Paste YouTube URL in the input field
- Click "Load Video"
- Wait for transcript extraction
- Video chapters appear automatically
Text Questions (fast):
- "What is this video about?"
- "What did the speaker say about AI?"
- "Summarize the main points"
Visual Questions (comprehensive):
- "What color is the car?"
- "How many people are in the video?"
- "What's written on the screen at 2:30?"
- AI generates answers with timestamps: (0:02:30)
- Click any timestamp to jump to that moment
- Auto-generated chapters provide video overview
POST /smart-question
# Main endpoint - handles all question types
# Automatically routes to text or visual analysis
POST /process-youtube
# Extract transcript only (for caching)
GET /health
# System status and processor availability
GET /cache-stats
# Current cache usage statistics
// Ask a question about a video
const response = await fetch('/smart-question', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({
url: 'https://youtube.com/watch?v=example',
question: 'What color is the car?'
})
});
const result = await response.json();
// Returns: AI answer with clickable timestamps
- Problem: Combining text and visual analysis seamlessly
- Solution: Created intelligent routing system that detects question types
- Implementation:
def check_needs_visual_analysis(self, question):
    # Simplified keyword check; the full routing classifies questions with a custom AI prompt
    q = question.lower()
    if "color" in q or "appearance" in q:
        return "visual"
    return "transcript"
- Problem: Frame extraction is slow and resource-intensive
- Solution: Smart frame sampling (1 frame per 5 seconds) + caching
- Result: Optimized processing time and resource usage
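The sampling rule (1 frame per 5 seconds) comes down to simple index arithmetic. A minimal sketch, assuming the frame count and frame rate are already known; `sample_frame_indices` is a hypothetical name. With OpenCV, those two values are typically read from a `cv2.VideoCapture` via `CAP_PROP_FRAME_COUNT` and `CAP_PROP_FPS`.

```python
def sample_frame_indices(total_frames, fps, interval_seconds=5.0):
    """Indices of the frames to keep when sampling one frame every `interval_seconds`."""
    if fps <= 0 or total_frames <= 0:
        return []
    # Number of frames between two samples, never less than one
    step = max(1, round(fps * interval_seconds))
    return list(range(0, total_frames, step))

# A 60-second clip at 30 fps: 1800 frames reduced to 12 samples
indices = sample_frame_indices(total_frames=1800, fps=30)
print(len(indices), indices[:3])
```

Only the selected indices need to be decoded and sent to the vision model, which is where the reduction in processing time comes from.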
- Problem: Not all YouTube videos have captions
- Solution: Built 3-layer fallback system
- Implementation:
def process_youtube_video(self, url):
    # Method 1: Official YouTube captions (fastest)
    result = self.try_youtube_api(url)
    if result:
        return result
    # Method 2: Downloaded caption files (medium)
    result = self.try_ytdlp_captions(url)
    if result:
        return result
    # Method 3: AI audio transcription (most reliable)
    return self.try_whisper(url)
- Problem: Long processing times create poor UX
- Solution: Real-time notifications + progress tracking
- Features: Loading animations, progress updates, error handling
- Multi-language support - Analyze videos in different languages
- Batch processing - Handle multiple videos simultaneously
- Advanced search - Natural language search within video libraries
- Export functionality - Generate video summaries and reports
- Real-time analysis - Live stream processing capabilities
- Async support for handling multiple video processing requests
- Automatic API documentation with OpenAPI/Swagger
- Type hints integration for better code reliability
- High performance for AI workloads
- Component reusability for video player and chat interface
- Rich ecosystem for real-time updates and animations
- Excellent development experience with hot reloading
- Industry standard for modern web applications
- Multimodal capabilities built-in (text + vision)
- Cost-effective for high-volume processing
- Latest technology with superior performance
- Google ecosystem integration
- Separation of concerns with dedicated processors
- Error handling at every level with graceful fallbacks
- Type hints throughout for code reliability
- Async/await for non-blocking operations
- Comprehensive logging for debugging and monitoring
- Component-based design with clear state management
- Error boundaries for graceful error handling
- Responsive design with mobile-first approach
- Performance optimization with React best practices
- Accessibility with semantic HTML and ARIA labels
Created by: Raahim Khan
Tech Stack: React.js, FastAPI, Google Gemini AI, OpenCV, Whisper
Architecture: Full-stack multimodal AI application
This project is licensed under the MIT License - see the LICENSE file for details.
VizzAI - Transforming how we interact with video content through intelligent AI analysis.