╔══════════════════════════════════════════════════════════════════╗
║ ║
║ H Y B R I D R E C ║
║ ───────────────────────────────────────────────────────── ║
║ Hybrid Recommender System · Leona Goel ║
║ ║
╚══════════════════════════════════════════════════════════════════╝
Important
🟢 This is the active GSSoC project repo — open all issues and PRs here only.
A production-ready recommender fusing Content-Based Filtering (TF-IDF), Collaborative Filtering (SVD), and NLP Sentiment Analysis (VADER) with a tunable weighted scoring engine — backed by Supabase PostgreSQL, served via FastAPI, and built to be dataset-agnostic by design.
25,000+ products · Sub-50ms search · 3 ML models fused · ~60% faster integration
- Architecture
- Features
- Tech Stack
- Project Structure
- Quick Start
- API Reference
- Security
- FAQ
- Screenshots
- Troubleshooting
- Setup Verification
- Beginner Contributor Tips
- Contributors
- License
- Knowledge Graph Embeddings
The core insight: blend three independent signals, each capturing something the others miss.
User Reviews (text)          ──→ NLP Engine (VADER Sentiment)   ──┐
Item Metadata (title/desc)   ──→ Content Vectorization (TF-IDF) ──┼──→ Weighted Hybrid Engine ──→ Ranked Results
User Purchases (clicks/buys) ──→ Matrix Factorization (SVD)     ──┘
Hybrid Score = α · content_score [TF-IDF cosine similarity]
+ β · collab_score [Truncated SVD latent space]
+ γ · sentiment_score [VADER compound polarity]
// α, β, γ are live-tunable via API or UI sliders
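As a minimal sketch of the formula above, the blend is a plain weighted sum of three scores, each assumed to lie in [0, 1]. The function name `hybrid_score`, the default weights, and the candidate scores are illustrative assumptions, not taken from `hybrid_model.py`.

```python
def hybrid_score(content_score, collab_score, sentiment_score,
                 alpha=0.4, beta=0.4, gamma=0.2):
    """Weighted sum of the three signals. The default weights are placeholders."""
    return alpha * content_score + beta * collab_score + gamma * sentiment_score


# Rank two hypothetical candidates by their blended score.
# Each tuple is (content_score, collab_score, sentiment_score).
candidates = {
    "item_a": (0.9, 0.2, 0.5),
    "item_b": (0.4, 0.8, 0.9),
}
ranked = sorted(candidates, key=lambda k: hybrid_score(*candidates[k]), reverse=True)
print(ranked)
```

With these placeholder weights, `item_b` ranks first because its stronger collaborative and sentiment scores outweigh `item_a`'s content match. Raising `alpha` would eventually reverse the order.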
α — Content Model · TF-IDF + Cosine Similarity
Item metadata (title + description + category) vectorized with TF-IDF (unigrams + bigrams, max 5,000 features). On-the-fly cosine similarity yields content_score ∈ [0, 1]. Fast, interpretable, and requires zero user history — ideal for cold-start.
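To make the content signal concrete, here is a dependency-free sketch of TF-IDF over unigrams and bigrams followed by cosine similarity. It stands in for scikit-learn's `TfidfVectorizer` rather than reproducing `content_model.py`: the helper names and sample titles are invented, and the 5,000-feature cap is not implemented.

```python
import math
from collections import Counter


def ngrams(text):
    """Lowercased unigrams plus bigrams, mirroring the n-gram range described above."""
    words = text.lower().split()
    return words + [" ".join(pair) for pair in zip(words, words[1:])]


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, using smoothed IDF."""
    counts = [Counter(ngrams(d)) for d in docs]
    df = Counter(term for c in counts for term in c)  # document frequency per term
    n = len(docs)
    idf = {t: math.log((1 + n) / (1 + f)) + 1 for t, f in df.items()}
    return [{t: tf * idf[t] for t, tf in c.items()} for c in counts]


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


# Hypothetical item titles.
items = [
    "wireless bluetooth headphones",
    "bluetooth headphones with microphone",
    "stainless steel water bottle",
]
vecs = tfidf_vectors(items)
sim_close = cosine(vecs[0], vecs[1])  # shares "bluetooth", "headphones", and a bigram
sim_far = cosine(vecs[0], vecs[2])    # no shared terms
print(round(sim_close, 3), round(sim_far, 3))
```

Because TF-IDF weights are non-negative, the cosine always falls in [0, 1], which is why `content_score` needs no further normalization.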
β — Collaborative Model · Truncated SVD
User-item interaction matrix built from purchases + implicit feedback (views, clicks). SVD reduces to 50 latent factors; cosine similarity in latent space yields collab_score. Adaptive rank automatically reduces SVD components for sparse matrices.
γ — Sentiment Model · NLTK VADER
Review text analyzed for compound polarity ∈ [-1, 1]. Per-item aggregation → Min-Max normalization → sentiment_score ∈ [0, 1]. Surfaces genuinely loved products, not just popular ones.
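The aggregation and normalization steps can be sketched as follows. This assumes per-item aggregation means a simple average of compound scores, which the text above does not specify. The function name and the 0.5 fallback for identical scores are also assumptions.

```python
from collections import defaultdict


def sentiment_scores(reviews):
    """reviews: iterable of (item_id, compound) pairs with compound in [-1, 1].

    Returns {item_id: score in [0, 1]} via a per-item mean, then Min-Max scaling.
    """
    by_item = defaultdict(list)
    for item_id, compound in reviews:
        by_item[item_id].append(compound)
    means = {item: sum(v) / len(v) for item, v in by_item.items()}
    if not means:
        return {}
    lo, hi = min(means.values()), max(means.values())
    if hi == lo:  # every item has the same mean: avoid dividing by zero
        return {item: 0.5 for item in means}
    return {item: (m - lo) / (hi - lo) for item, m in means.items()}


# Compound values here are made up; in the pipeline they would come from VADER.
reviews = [("a", 0.8), ("a", 0.6), ("b", -0.4), ("c", 0.1)]
scores = sentiment_scores(reviews)
print(scores)
```

Min-Max scaling makes the score relative to the catalogue: the best-reviewed item always maps to 1 and the worst to 0, regardless of their absolute polarity.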
❄ Cold-Start Handling
- Bayesian average rating — prevents 1-review, 5-star bias
- Popularity-based fallback — ranks new items by review count and category similarity
- Mock user seeding — synthetic purchase history to bootstrap collaborative filtering
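The Bayesian average in the first bullet can be sketched as below. The formula is the standard shrinkage toward a global mean. The `prior_weight` of 10 and the sample ratings are assumptions, not values read from `hybrid_model.py`.

```python
def bayesian_average(item_mean, item_count, global_mean, prior_weight=10):
    """Shrink an item's mean rating toward the global mean.

    prior_weight behaves like a number of "virtual" reviews at the global mean.
    """
    return (prior_weight * global_mean + item_count * item_mean) / (prior_weight + item_count)


global_mean = 3.8  # hypothetical catalogue-wide average rating
one_review = bayesian_average(5.0, 1, global_mean)      # a single 5-star review
many_reviews = bayesian_average(4.6, 200, global_mean)  # a well-reviewed item
print(round(one_review, 2), round(many_reviews, 2))
```

The single 5-star review is pulled most of the way back to the global mean, while the item with 200 reviews keeps a score close to its own average, so it ranks higher.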
| Feature | Detail |
|---|---|
| PostgreSQL FTS | GIN-indexed full-text search — sub-50ms on 250k+ rows |
| Supabase Auth | Guest (anonymous) and email/password, Row-Level Security on all tables |
| Tunable Weights | Live α/β/γ sliders to adjust recommendation blend in real time |
| Dataset-Agnostic | Fuzzy column detection (product_name → title) cuts integration time by ~60% |
| Cold-Start Resilient | Bayesian avg rating + popularity fallback for new users and items |
| Type-to-Search | Global keyboard capture — start typing anywhere to search instantly |
| Responsive UI | Amazon-inspired dark header, 4→3→2→1 column card grid across breakpoints |
| Secure by Default | Pydantic validation, parameterized queries, CORS-restricted, no stack-trace leakage |
| Streamlit UI | Local CSV upload → build models → recommendations, no Supabase or server required |
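The fuzzy column detection listed under Dataset-Agnostic could look roughly like this sketch, which uses the standard library's `difflib`. The alias table, the cutoff, and the function name are hypothetical. The real mapping lives in `data_adapter.py` and may work differently.

```python
import difflib

# Hypothetical canonical fields and aliases, for illustration only.
ALIASES = {
    "title": ["title", "product_name", "name", "item_name"],
    "description": ["description", "desc", "product_description"],
    "category": ["category", "product_category", "genre"],
}


def detect_columns(columns, cutoff=0.8):
    """Map raw column names to canonical fields using fuzzy string matching."""
    mapping = {}
    for raw in columns:
        key = raw.strip().lower().replace(" ", "_")
        for field, aliases in ALIASES.items():
            if field in mapping.values():
                continue  # each canonical field is assigned at most once
            if difflib.get_close_matches(key, aliases, n=1, cutoff=cutoff):
                mapping[raw] = field
                break
    return mapping


# "Descripton" is deliberately misspelled; "price" has no canonical counterpart.
print(detect_columns(["Product Name", "Descripton", "price"]))
```

Columns that match nothing (here `price`) are simply left out of the mapping, so the caller can decide whether to keep or drop them.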
┌─────────────────┬────────────────────────────────────────────────┐
│ Layer │ Technology │
├─────────────────┼────────────────────────────────────────────────┤
│ Backend │ Python 3.10+, FastAPI, Uvicorn │
│ Database │ Supabase (PostgreSQL), Row-Level Security │
│ Search │ PostgreSQL FTS (GIN indexes, ts_rank) │
│ Auth │ Supabase Auth (anonymous + email/password) │
│ ML — Content │ scikit-learn: TF-IDF Vectorizer, Cosine Sim │
│ ML — Collab │ scikit-learn: TruncatedSVD, SciPy sparse │
│ NLP │ NLTK VADER SentimentIntensityAnalyzer │
│ Data │ Pandas, NumPy │
│ Frontend │ HTML5, CSS3, Vanilla JS, Supabase JS v2 │
└─────────────────┴────────────────────────────────────────────────┘
hybrid-recommender/
│
├── backend/
│ └── main.py # FastAPI server — search, upload, build, recommend
│
├── frontend/
│ ├── index.html # Single-page UI (Amazon-like layout)
│ ├── styles.css # Design system (dark header, cards, animations)
│ └── app.js # Frontend logic (auth, search, rendering)
│
├── scripts/
│ ├── generate_sample_data.py # Synthetic test dataset generator
│ ├── import_to_supabase.py # Batch import CSV/JSON → PostgreSQL
│ └── seed_mock_data.py # Mock users + purchases for cold-start bootstrap
│
├── data_adapter.py # ⭐ Auto column detection + schema normalization
├── content_model.py # TF-IDF content-based recommender
├── collaborative_model.py # SVD collaborative recommender + implicit feedback
├── hybrid_model.py # Weighted hybrid engine (Bayesian avg, popularity)
├── nlp_engine.py # VADER sentiment analysis pipeline
├── evaluation.py # Precision@K, Recall@K, NDCG@K benchmarks
├── db.py # Supabase client singleton (anon + admin)
├── app.py # Streamlit UI — upload CSV, build models, get recommendations
├── requirements.txt
├── .env.example
└── SETUP.md
Prerequisites: Python 3.10+ · Supabase account (free tier works)
# 1 — Clone & install
git clone https://github.com/leonagoel/hybrid-recommender.git
cd hybrid-recommender
pip install -r requirements.txt

# 2 — Configure Supabase
cp .env.example .env
# Fill in from: Supabase Dashboard → Settings → API

SUPABASE_URL=https://your-project-ref.supabase.co
SUPABASE_ANON_KEY=your-anon-key
SUPABASE_SERVICE_KEY=your-service-role-key

# 3 — Run SQL migrations
# See SETUP.md for full schema → paste into Supabase SQL Editor

# 4 — Start the server (PowerShell syntax)
if (-not $env:HOST) { $env:HOST = "0.0.0.0" }
if (-not $env:PORT) { $env:PORT = "8000" }
python -m uvicorn backend.main:app --host $env:HOST --port $env:PORT

Open http://localhost:8000, upload any CSV/JSON from datasets/, click Build Models, then start typing to search.
Check the active backend version:
curl "http://localhost:8000/api/version"

Async recommendation tasks require Redis and a running Celery worker.
1 — Start Redis (Docker recommended):
docker run -d -p 6379:6379 redis:7-alpine
2 — Add to .env:
REDIS_URL=redis://localhost:6379/0
3 — Start the Celery worker (separate terminal, from project root):
celery -A celery_app worker --loglevel=info
4 — Use async recommendations:
# Dispatch — returns task_id instantly (202 Accepted)
curl -X POST "http://localhost:8000/api/recommend?item_title=YourItem&top_n=10"
# Poll for results using the returned task_id
curl "http://localhost:8000/api/task/<task_id>"
Response flow:
POST /api/recommend → { "task_id": "abc123", "status": "PENDING" }
GET /api/task/abc123 → { "status": "SUCCESS", "result": { ... } }
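A client for the response flow above might poll like this. It is a sketch under assumptions: only `PENDING` and `SUCCESS` appear in this README, so treating `FAILURE` (a standard Celery state) as terminal assumes the API passes it through. The function names, polling interval, and attempt limit are illustrative.

```python
import json
import time
import urllib.request

TERMINAL_STATES = ("SUCCESS", "FAILURE")  # FAILURE is assumed, see the note above


def fetch_json(url):
    """GET a URL and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))


def wait_for_task(task_id, base_url="http://localhost:8000", fetch=fetch_json,
                  interval=1.0, max_attempts=30):
    """Poll GET /api/task/<task_id> until the task reaches a terminal state."""
    for _ in range(max_attempts):
        payload = fetch(f"{base_url}/api/task/{task_id}")
        if payload.get("status") in TERMINAL_STATES:
            return payload
        time.sleep(interval)
    raise TimeoutError(f"task {task_id} not finished after {max_attempts} polls")
```

Call `wait_for_task("abc123")` with the `task_id` returned by the POST request. The `fetch` parameter exists so the loop can be exercised without a running server.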
streamlit run app.py
Upload any CSV file, click Build Models, then enter an item name or User ID to get recommendations directly in your browser — no database or server setup needed.
Docker Compose starts the full stack — backend API and static frontend — with a single command. No manual port juggling, no missing env vars.
- Docker Desktop (includes Compose)
1. Copy and fill in your environment file
cp .env.example .env
# Edit .env with your Supabase credentials
2. Start the stack
docker-compose up --build
--build forces a fresh image build. Omit it on subsequent runs when code hasn't changed.
3. Access the app
| Service | URL |
|---|---|
| Frontend | http://localhost:3000 |
| Backend | http://localhost:8000 |
| API docs | http://localhost:8000/docs |
| Health | http://localhost:8000/health |
4. Stop the stack
docker-compose down
Add -v to also remove named volumes if you want a completely clean state.
| Problem | Fix |
|---|---|
| Error: .env file not found | Run cp .env.example .env and fill in credentials |
| Backend unhealthy / frontend won't start | Check docker-compose logs backend |
| Port 8000 already in use | Stop other services on 8000, or change "8000:8000" to "8001:8000" |
| Dataset not found at runtime | Make sure datasets/ folder exists in project root |
Retrieve frontend configuration (Supabase URL + anon key):
GET /api/config
Check if the API server is running:
GET /api/status
Full-text search across items (PostgreSQL FTS):
GET /api/search?q=...&limit=20
Upload a CSV or JSON dataset:
POST /api/upload
Build / rebuild the ML models from uploaded data:
POST /api/build
Get hybrid recommendations for a given item title:
GET /api/recommend/{title}
Paginated list of all items:
GET /api/items?page=1&per_page=50
List all distinct product categories:
GET /api/categories
Read the current α / β / γ blending weights:
GET /api/weights
Update the α / β / γ blending weights:
PUT /api/weights
Get purchase history for a specific user:
GET /api/purchases/{user_id}
Record a new purchase event:
POST /api/purchases
All examples use http://localhost:8000 as the base URL.
The Docker Compose setup exposes the backend at http://localhost:8000 as well. Change the host/port if your server runs elsewhere.
curl http://localhost:8000/api/status
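From Python, the search endpoint can be called with the standard library alone. This sketch relies only on the `q` and `limit` parameters listed above. The helper names are invented, and the response is returned as parsed JSON without assuming its shape.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:8000"  # change if the server runs elsewhere


def search_url(query, limit=20, base_url=BASE_URL):
    """Build the GET /api/search URL with a safely encoded query string."""
    params = urllib.parse.urlencode({"q": query, "limit": limit})
    return f"{base_url}/api/search?{params}"


def search(query, limit=20):
    """Call the search endpoint and return the decoded JSON response."""
    with urllib.request.urlopen(search_url(query, limit), timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))


print(search_url("noise cancelling headphones", limit=5))
# → http://localhost:8000/api/search?q=noise+cancelling+headphones&limit=5
```

Encoding the query with `urlencode` keeps spaces and special characters in search terms from breaking the URL.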
---
## 07 — Evaluation
```bash
# Run evaluation benchmarks
python evaluation.py
```
Benchmarks Content-Only, Collab-Only, Sentiment-Only, and Hybrid across:
- Precision@K — fraction of the top-K results that are relevant
- Recall@K — fraction of all relevant items retrieved in the top-K
- NDCG@K — ranking quality (normalized discounted cumulative gain)
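For reference, the three metrics can be computed as in the sketch below. It assumes binary relevance (an item is either relevant or not). `evaluation.py` may define relevance or handle edge cases differently, and the sample ranking is hypothetical.

```python
import math


def precision_at_k(ranked, relevant, k):
    """Share of the top-k recommendations that are relevant."""
    return sum(1 for item in ranked[:k] if item in relevant) / k


def recall_at_k(ranked, relevant, k):
    """Share of all relevant items that appear in the top-k."""
    if not relevant:
        return 0.0
    return sum(1 for item in ranked[:k] if item in relevant) / len(relevant)


def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(i + 2) for i, item in enumerate(ranked[:k]) if item in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(relevant))))
    return dcg / ideal if ideal else 0.0


ranked = ["a", "b", "c", "d", "e"]  # hypothetical model output, best first
relevant = {"a", "c", "f"}          # hypothetical ground truth
print(precision_at_k(ranked, relevant, 3),
      recall_at_k(ranked, relevant, 3),
      ndcg_at_k(ranked, relevant, 3))
```

Precision and recall only count hits, while NDCG also rewards placing relevant items near the top, which is why the three are reported together.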
✓ No hardcoded credentials — config served via /api/config
✓ .env excluded from git via .gitignore
✓ CORS restricted to explicit configured origins; wildcard origins are rejected
✓ Row-Level Security (RLS) on all Supabase tables
✓ Input validation via Pydantic models
✓ Generic error messages — no stack trace leakage
✓ SQL injection safe (Supabase SDK parameterized queries)
How do I set up the project locally?
Clone the repository and install the dependencies with pip install -r requirements.txt. Then copy .env.example to .env, fill in your Supabase credentials, start the backend with python -m uvicorn backend.main:app, and open http://localhost:8000. To try the project without Supabase or a server, run streamlit run app.py and upload a CSV instead.
What datasets does this project use?
This project uses datasets related to user interactions, ratings, and item metadata to generate recommendations. The exact dataset files are usually stored inside the datasets/ directory. You can check the project documentation for download links and formatting details.
How do the alpha/beta/gamma weights affect recommendations?
The weights control how much each signal contributes to the final hybrid score: alpha (α) scales the content score (TF-IDF similarity), beta (β) scales the collaborative score (SVD), and gamma (γ) scales the sentiment score (VADER). Raising one weight relative to the others shifts the ranking toward that signal. You can change the weights live through the UI sliders or PUT /api/weights, so experiment to find the blend that suits your use case.
What is Bayesian rating and why is it used?
Bayesian rating is a method used to balance average ratings with the number of votes an item has received. It prevents items with very few ratings from unfairly appearing at the top of recommendations. This makes the ranking system more stable and reliable.
How do I run the tests?
Run pytest from the project root. Make sure all dependencies are installed first. A passing run helps verify that the application still works correctly after your changes.
The backend shows "Backend offline" — what do I do?
First, check whether the backend server is running on the correct port. Verify that your environment variables and database connections are configured properly. If the issue continues, restart the backend server and review the console logs for errors.
Can I use my own dataset with this project?
Yes. The data adapter (data_adapter.py) auto-detects common column names (for example, product_name is mapped to title), so many CSV or JSON files work without manual renaming. If your columns are not detected, you may need to rename them or adjust the preprocessing steps. Testing with a smaller dataset first is recommended to ensure compatibility.
Install (or reinstall) the dependencies:
pip install -r requirements.txt
Run the backend on a different port (here 8001):
python -m uvicorn backend.main:app --port 8001
Download the VADER lexicon for NLTK:
import nltk
nltk.download('vader_lexicon')
Check your .env file — no extra spaces, no quotes, correct project credentials:
SUPABASE_URL=your_url
SUPABASE_ANON_KEY=your_key
SUPABASE_SERVICE_KEY=your_service_key
# Backend
python -m uvicorn backend.main:app --host 0.0.0.0 --port 8000
# Visit: http://localhost:8000/api/status → { "status": "ok" }
# Streamlit
streamlit run app.py
# Browser opens automatically with CSV upload interface
Verify that the backend is running:
curl http://localhost:8000/api/status
Example output when backend is running:
```text
✅ Backend is running
⏱ Response time: 42 ms
📦 Response: {'status': 'ok'}
```
Example output when backend is offline:
❌ Could not connect to backend server
Ensure the following variables are configured in your .env file:
- SUPABASE_URL
- SUPABASE_ANON_KEY
- SUPABASE_SERVICE_KEY
Example output:
❌ Missing environment variables:
- SUPABASE_URL
- SUPABASE_ANON_KEY
- SUPABASE_SERVICE_KEY
Or:
✅ Environment setup looks good
git remote add upstream https://github.com/leonagoel/hybrid-recommender.git
git fetch upstream
git merge upstream/main
- Open conflicted files
- Remove conflict markers (<<<<<<<, =======, >>>>>>>)
- Keep correct code, save, then commit
- Project runs successfully
- README formatting checked
- No unnecessary files added
- Branch name follows guidelines
- Commit message follows convention
- PR linked to issue
MIT — see LICENSE
This usually happens when the virtual environment is not activated or dependencies are not installed.
Create a virtual environment:
python -m venv venv
Activate it.
Windows:
venv\Scripts\activate
macOS/Linux:
source venv/bin/activate
Install dependencies:
pip install -r requirements.txt
If pip install -r requirements.txt fails because of version conflicts:
Upgrade pip first:
python -m pip install --upgrade pip
Then reinstall dependencies:
pip install -r requirements.txt
If issues persist, recreate the virtual environment.
Windows:
deactivate
rmdir /s /q venv
python -m venv venv
macOS/Linux:
deactivate
rm -rf venv
python -m venv venv
Run these commands to confirm everything is working correctly.
Check Python version:
python --version
Check installed packages:
pip list
Run the test suite:
pytest
If tests run successfully, the environment is ready for development.
Built by Leona Goel
B.Tech CSE · Vellore Institute of Technology
National Finalist · Smart India Hackathon 2025 · Top 8% of 950+ Teams
Thanks to all the amazing people who contribute to this project ❤️
See CONTRIBUTING.md to get started — all skill levels welcome!
| Step | Action |
|---|---|
| 1️⃣ | Fork the repo |
| 2️⃣ | Pick a good first issue |
| 3️⃣ | Submit a Pull Request |
This project now supports semantic item relationships using TransE-style knowledge graph embeddings.
Features:
- Semantic similarity learning
- Graph-based recommendation enrichment
- Hybrid recommendation integration
- Category/author relationship modeling
Run:
python scripts/generate_kg_embeddings.py
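For readers new to TransE, the idea is that a true triple (head, relation, tail) should satisfy head + relation ≈ tail in embedding space. The sketch below scores triples by negative L2 distance. The toy vectors and the choice of L2 norm are assumptions, not details taken from `scripts/generate_kg_embeddings.py`.

```python
import math


def transe_score(head, relation, tail):
    """TransE plausibility: negative L2 distance between head + relation and tail."""
    return -math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))


# Toy 2-D embeddings, invented for illustration only.
book = [0.1, 0.3]
written_by = [0.4, -0.1]
author_a = [0.5, 0.2]   # book + written_by lands (almost exactly) here
author_b = [-0.6, 0.9]  # far from book + written_by

score_a = transe_score(book, written_by, author_a)
score_b = transe_score(book, written_by, author_b)
print(score_a > score_b)
```

A higher (less negative) score marks the more plausible triple, which is how such embeddings can rank related items or authors during recommendation enrichment.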
---


