
leonagoel/hybrid-recommender

 
 

╔══════════════════════════════════════════════════════════════════╗
║                                                                  ║
║    H Y B R I D R E C                                             ║
║    ─────────────────────────────────────────────────────────     ║
║    Hybrid Recommender System · Leona Goel                        ║
║                                                                  ║
╚══════════════════════════════════════════════════════════════════╝



Important

🟢 This is the active GSSoC project repo — open all issues and PRs here only.


A production-ready recommender fusing Content-Based Filtering (TF-IDF), Collaborative Filtering (SVD), and NLP Sentiment Analysis (VADER) with a tunable weighted scoring engine — backed by Supabase PostgreSQL, served via FastAPI, and built to be dataset-agnostic by design.

25,000+ products  ·  Sub-50ms search  ·  3 ML models fused  ·  ~60% faster integration



01 — Architecture

The core insight: blend three independent signals, each capturing something the others miss.

User Reviews (text)           ──→  NLP Engine (VADER Sentiment)    ──┐
Item Metadata (title/desc)    ──→  Content Vectorization (TF-IDF)  ──┼──→  Weighted Hybrid  ──→  Ranked Results
User Purchases (clicks/buys)  ──→  Matrix Factorization (SVD)      ──┘         Engine

     Hybrid Score  =  α · content_score        [TF-IDF cosine similarity]
                    + β · collab_score          [Truncated SVD latent space]
                    + γ · sentiment_score       [VADER compound polarity]

     // α, β, γ are live-tunable via API or UI sliders
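
The weighted sum above can be sketched in a few lines of Python. This is an illustration only, not the code in hybrid_model.py: the `Weights` class, the `rank` helper, and the default weight values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Weights:
    """Blend weights; the fields mirror α, β, γ in the formula."""
    alpha: float = 0.5  # content weight (illustrative default, not from the project)
    beta: float = 0.3   # collaborative weight (illustrative default)
    gamma: float = 0.2  # sentiment weight (illustrative default)


def hybrid_score(content: float, collab: float, sentiment: float,
                 w: Weights = Weights()) -> float:
    """Return α·content_score + β·collab_score + γ·sentiment_score."""
    return w.alpha * content + w.beta * collab + w.gamma * sentiment


def rank(candidates: dict[str, tuple[float, float, float]],
         w: Weights = Weights()) -> list[str]:
    """Sort item ids by descending hybrid score."""
    return sorted(candidates,
                  key=lambda item: hybrid_score(*candidates[item], w),
                  reverse=True)


# Hypothetical (content, collab, sentiment) scores for two items.
candidates = {"a": (0.9, 0.1, 0.5), "b": (0.2, 0.9, 0.9)}
content_heavy = rank(candidates)                        # default weights favor "a"
collab_heavy = rank(candidates, Weights(0.1, 0.7, 0.2))  # shifting weight to β favors "b"
```

Because the three inputs are all on a [0, 1] scale, moving weight from one term to another reorders results without any retraining, which is what makes live tuning cheap.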
α — Content Model  ·  TF-IDF + Cosine Similarity

Item metadata (title + description + category) vectorized with TF-IDF (unigrams + bigrams, max 5,000 features). On-the-fly cosine similarity yields content_score ∈ [0, 1]. Fast, interpretable, and requires zero user history — ideal for cold-start.
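
A minimal sketch of this step with scikit-learn, assuming each item has already been flattened into one text string. The sample items and the `content_scores` helper are hypothetical; only the n-gram range and feature cap come from the description above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical items; each string stands in for title + description + category.
items = [
    "wireless noise cancelling headphones electronics",
    "bluetooth wireless earbuds electronics",
    "stainless steel kitchen knife set home",
]

# Unigrams + bigrams, capped at 5,000 features, as described above.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=5000)
tfidf = vectorizer.fit_transform(items)


def content_scores(query_index: int) -> list[float]:
    """Cosine similarity between one item and every item, computed on demand."""
    row = cosine_similarity(tfidf[query_index], tfidf).ravel()
    return [float(score) for score in row]


scores = content_scores(0)
```

TF-IDF weights are non-negative, so the cosine similarity stays in [0, 1]: items with no shared terms score 0 and an item compared with itself scores 1.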

β — Collaborative Model  ·  Truncated SVD

User-item interaction matrix built from purchases + implicit feedback (views, clicks). SVD reduces to 50 latent factors; cosine similarity in latent space yields collab_score. Adaptive rank automatically reduces SVD components for sparse matrices.
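
The following sketch shows one way to do this with scikit-learn's TruncatedSVD. The toy matrix, the `item_latent_factors` helper, and the rule used to shrink the rank are assumptions made for illustration; the README does not spell out how the adaptive rank is chosen.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical user-item matrix: rows are users, columns are items,
# values are interaction strengths (e.g. purchase = 1.0, view = 0.5).
interactions = csr_matrix(np.array([
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 0.5, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 0.5],
]))


def item_latent_factors(matrix, target_rank: int = 50):
    """Factor the item-user matrix, shrinking the rank for small matrices."""
    item_user = matrix.T  # one row per item
    # Assumed adaptive rule: stay below the smaller matrix dimension.
    rank = max(1, min(target_rank, min(item_user.shape) - 1))
    svd = TruncatedSVD(n_components=rank, random_state=0)
    return svd.fit_transform(item_user)


factors = item_latent_factors(interactions)
similarity = cosine_similarity(factors)  # item-item similarity in latent space
```

In this toy data, items 0 and 1 share the same users and end up close in latent space, while items 0 and 2 have disjoint audiences and score near zero.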

γ — Sentiment Model  ·  NLTK VADER

Review text analyzed for compound polarity ∈ [-1, 1]. Per-item aggregation → Min-Max normalization → sentiment_score ∈ [0, 1]. Surfaces genuinely loved products, not just popular ones.
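
The aggregation and normalization steps can be sketched without NLTK, assuming the VADER compound score for each review has already been computed. The `sentiment_scores` helper is hypothetical, and using the mean as the per-item aggregate is an assumption; the README does not name the aggregate.

```python
from collections import defaultdict
from statistics import mean


def sentiment_scores(reviews: list[tuple[str, float]]) -> dict[str, float]:
    """Map each item to a sentiment_score in [0, 1].

    `reviews` holds (item_id, compound) pairs, where compound is the
    VADER compound polarity in [-1, 1].
    """
    per_item: dict[str, list[float]] = defaultdict(list)
    for item_id, compound in reviews:
        per_item[item_id].append(compound)
    if not per_item:
        return {}

    # Per-item aggregation (assumed to be the mean).
    averages = {item: mean(values) for item, values in per_item.items()}

    lo, hi = min(averages.values()), max(averages.values())
    if hi == lo:
        # Min-Max is undefined when every item has the same average.
        return {item: 0.5 for item in averages}
    return {item: (avg - lo) / (hi - lo) for item, avg in averages.items()}


normalized = sentiment_scores([("a", 0.8), ("a", 0.6), ("b", -0.4), ("c", 0.1)])
```

Min-Max normalization is relative to the catalog: the best-reviewed item always maps to 1 and the worst to 0, so scores shift whenever the set of items changes.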

❄ Cold-Start Handling
  • Bayesian average rating — prevents 1-review, 5-star bias
  • Popularity-based fallback — ranks new items by review count and category similarity
  • Mock user seeding — synthetic purchase history to bootstrap collaborative filtering

02 — Features

| Feature | Detail |
|---|---|
| PostgreSQL FTS | GIN-indexed full-text search; sub-50ms on 250k+ rows |
| Supabase Auth | Guest (anonymous) and email/password, Row-Level Security on all tables |
| Tunable Weights | Live α/β/γ sliders to adjust recommendation blend in real time |
| Dataset-Agnostic | Fuzzy column detection (product_name → title) cuts integration time by ~60% |
| Cold-Start Resilient | Bayesian avg rating + popularity fallback for new users and items |
| Type-to-Search | Global keyboard capture; start typing anywhere to search instantly |
| Responsive UI | Amazon-inspired dark header, 4→3→2→1 column card grid across breakpoints |
| Secure by Default | Pydantic validation, parameterized queries, CORS-restricted, no stack-trace leakage |
| Streamlit UI | Local CSV upload → build models → recommendations, no Supabase or server required |
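
To illustrate the fuzzy column detection listed among the features, here is a minimal sketch using the standard library's difflib. It is not the logic in data_adapter.py: the alias table, the `normalize` and `detect_columns` helpers, and the 0.8 cutoff are all assumptions.

```python
import difflib
import re

# Hypothetical alias table: canonical field -> names that datasets commonly use.
ALIASES = {
    "title": ["title", "product_name", "name", "item_name"],
    "description": ["description", "desc", "about", "summary"],
    "category": ["category", "categories", "genre", "department"],
}


def normalize(column: str) -> str:
    """Lowercase and collapse runs of non-alphanumerics into underscores."""
    return re.sub(r"[^a-z0-9]+", "_", column.strip().lower()).strip("_")


def detect_columns(columns: list[str], cutoff: float = 0.8) -> dict[str, str]:
    """Map canonical field names to the best-matching source column."""
    normalized = {normalize(col): col for col in columns}
    mapping: dict[str, str] = {}
    for field, aliases in ALIASES.items():
        for alias in aliases:
            match = difflib.get_close_matches(alias, list(normalized), n=1, cutoff=cutoff)
            if match:
                mapping[field] = normalized[match[0]]
                break
    return mapping


detected = detect_columns(["Product Name", "Descriptions", "Category ", "price"])
```

Columns that match no alias, such as `price` here, are simply left out of the mapping so the caller can decide whether to ignore or reject them.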

03 — Tech Stack

┌─────────────────┬────────────────────────────────────────────────┐
│ Layer           │ Technology                                      │
├─────────────────┼────────────────────────────────────────────────┤
│ Backend         │ Python 3.10+, FastAPI, Uvicorn                 │
│ Database        │ Supabase (PostgreSQL), Row-Level Security       │
│ Search          │ PostgreSQL FTS (GIN indexes, ts_rank)          │
│ Auth            │ Supabase Auth (anonymous + email/password)      │
│ ML — Content    │ scikit-learn: TF-IDF Vectorizer, Cosine Sim    │
│ ML — Collab     │ scikit-learn: TruncatedSVD, SciPy sparse       │
│ NLP             │ NLTK VADER SentimentIntensityAnalyzer           │
│ Data            │ Pandas, NumPy                                   │
│ Frontend        │ HTML5, CSS3, Vanilla JS, Supabase JS v2        │
└─────────────────┴────────────────────────────────────────────────┘

04 — Project Structure

hybrid-recommender/
│
├── backend/
│   └── main.py                  # FastAPI server — search, upload, build, recommend
│
├── frontend/
│   ├── index.html               # Single-page UI (Amazon-like layout)
│   ├── styles.css               # Design system (dark header, cards, animations)
│   └── app.js                   # Frontend logic (auth, search, rendering)
│
├── scripts/
│   ├── generate_sample_data.py  # Synthetic test dataset generator
│   ├── import_to_supabase.py    # Batch import CSV/JSON → PostgreSQL
│   └── seed_mock_data.py        # Mock users + purchases for cold-start bootstrap
│
├── data_adapter.py              # ⭐ Auto column detection + schema normalization
├── content_model.py             # TF-IDF content-based recommender
├── collaborative_model.py       # SVD collaborative recommender + implicit feedback
├── hybrid_model.py              # Weighted hybrid engine (Bayesian avg, popularity)
├── nlp_engine.py                # VADER sentiment analysis pipeline
├── evaluation.py                # Precision@K, Recall@K, NDCG@K benchmarks
├── db.py                        # Supabase client singleton (anon + admin)
├── app.py                       # Streamlit UI — upload CSV, build models, get recommendations
├── requirements.txt
├── .env.example
└── SETUP.md

05 — Quick Start

Prerequisites: Python 3.10+ · Supabase account (free tier works)

# 1 — Clone & install
git clone https://github.com/leonagoel/hybrid-recommender.git
cd hybrid-recommender
pip install -r requirements.txt
# 2 — Configure Supabase
cp .env.example .env
# Fill in from: Supabase Dashboard → Settings → API
SUPABASE_URL=https://your-project-ref.supabase.co
SUPABASE_ANON_KEY=your-anon-key
SUPABASE_SERVICE_KEY=your-service-role-key
# 3 — Run SQL migrations
# See SETUP.md for full schema → paste into Supabase SQL Editor
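# 4 — Start the server (PowerShell; HOST and PORT default to 0.0.0.0 and 8000)
if (-not $env:HOST) { $env:HOST = "0.0.0.0" }
if (-not $env:PORT) { $env:PORT = "8000" }

python -m uvicorn backend.main:app --host $env:HOST --port $env:PORT

# macOS/Linux (bash) equivalent:
# python -m uvicorn backend.main:app --host "${HOST:-0.0.0.0}" --port "${PORT:-8000}"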

Open http://localhost:8000, upload any CSV/JSON from datasets/, click Build Models, then start typing to search.

Check the active backend version:

curl "http://localhost:8000/api/version"

Async Recommendations — Celery Worker Setup

Async recommendation tasks require Redis and a running Celery worker.

1 — Start Redis (Docker recommended):

docker run -d -p 6379:6379 redis:7-alpine

2 — Add to .env:

REDIS_URL=redis://localhost:6379/0

3 — Start the Celery worker (separate terminal, from project root):

celery -A celery_app worker --loglevel=info

4 — Use async recommendations:

# Dispatch — returns task_id instantly (202 Accepted)
curl -X POST "http://localhost:8000/api/recommend?item_title=YourItem&top_n=10"

# Poll for results using the returned task_id
curl "http://localhost:8000/api/task/<task_id>"

Response flow:

POST /api/recommend  →  { "task_id": "abc123", "status": "PENDING" }
GET  /api/task/abc123  →  { "status": "SUCCESS", "result": { ... } }
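
The dispatch-then-poll flow can be wrapped in a small client helper. The sketch below is hypothetical: `wait_for_task` and its parameters are not part of the project, and treating FAILURE as a terminal status is an assumption based on Celery's standard state names, since the README only shows PENDING and SUCCESS.

```python
import time
from typing import Any, Callable

# SUCCESS appears in the README; FAILURE is assumed from Celery's standard states.
TERMINAL_STATES = {"SUCCESS", "FAILURE"}


def wait_for_task(fetch_status: Callable[[str], dict[str, Any]], task_id: str,
                  interval: float = 0.5, timeout: float = 30.0) -> dict[str, Any]:
    """Poll until the task reaches a terminal state or the timeout expires.

    fetch_status is any callable that returns the JSON body of
    GET /api/task/<task_id> as a dict.
    """
    deadline = time.monotonic() + timeout
    while True:
        body = fetch_status(task_id)
        if body.get("status") in TERMINAL_STATES:
            return body
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task {task_id} not finished after {timeout}s")
        time.sleep(interval)
```

Passing the HTTP call in as `fetch_status` keeps the loop independent of any particular HTTP library; in practice it would be a function that requests `http://localhost:8000/api/task/<task_id>` and decodes the JSON response.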

Alternative — Streamlit UI (no Supabase required)

streamlit run app.py

Upload any CSV file, click Build Models, then enter an item name or User ID to get recommendations directly in your browser — no database or server setup needed.


Run with Docker Compose (Recommended for Contributors)

Docker Compose starts the full stack — backend API and static frontend — with a single command. No manual port juggling, no missing env vars.

Prerequisites

Docker and Docker Compose installed and running.

Steps

1. Copy and fill in your environment file

cp .env.example .env
# Edit .env with your Supabase credentials

2. Start the stack

docker-compose up --build
  • --build forces a fresh image build. Omit it on subsequent runs when code hasn't changed.

3. Access the app

| Service | URL |
|---|---|
| Frontend | http://localhost:3000 |
| Backend | http://localhost:8000 |
| API docs | http://localhost:8000/docs |
| Health | http://localhost:8000/health |

4. Stop the stack

docker-compose down

Add -v to also remove named volumes if you want a completely clean state.

Troubleshooting

| Problem | Fix |
|---|---|
| Error: .env file not found | Run cp .env.example .env and fill in credentials |
| Backend unhealthy / frontend won't start | Check docker-compose logs backend |
| Port 8000 already in use | Stop other services on 8000, or change "8000:8000" to "8001:8000" |
| Dataset not found at runtime | Make sure datasets/ folder exists in project root |

06 — API Reference

Retrieve frontend configuration (Supabase URL + anon key):

GET /api/config

Check if the API server is running:

GET /api/status

Full-text search across items (PostgreSQL FTS):

GET /api/search?q=...&limit=20

Upload a CSV or JSON dataset:

POST /api/upload

Build / rebuild the ML models from uploaded data:

POST /api/build

Get hybrid recommendations for a given item title:

GET /api/recommend/{title}

Paginated list of all items:

GET /api/items?page=1&per_page=50

List all distinct product categories:

GET /api/categories

Read the current α / β / γ blending weights:

GET /api/weights

Update the α / β / γ blending weights:

PUT /api/weights

Get purchase history for a specific user:

GET /api/purchases/{user_id}

Record a new purchase event:

POST /api/purchases
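
Search queries and item titles are user-supplied text, so a client has to encode them before placing them in a URL. A minimal sketch with the standard library, assuming the endpoint paths listed above; the helper names are hypothetical.

```python
from urllib.parse import quote, urlencode

BASE_URL = "http://localhost:8000"  # default from the Quick Start


def search_url(query: str, limit: int = 20, base_url: str = BASE_URL) -> str:
    """Build the GET /api/search URL with an encoded query string."""
    return f"{base_url}/api/search?{urlencode({'q': query, 'limit': limit})}"


def recommend_url(title: str, base_url: str = BASE_URL) -> str:
    """Build the GET /api/recommend/{title} URL with the title as one path segment."""
    # safe="" also encodes "/" so a title cannot split into extra path segments.
    return f"{base_url}/api/recommend/{quote(title, safe='')}"
```

Query parameters and path segments follow different encoding rules, which is why the two helpers use `urlencode` and `quote` respectively.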

API Examples (curl)

All examples use http://localhost:8000 as the base URL.
Change the host or port if your server runs elsewhere. The Docker Compose setup also serves the backend at http://localhost:8000.

Get server status

curl http://localhost:8000/api/status
07 — Evaluation

# Run evaluation benchmarks
python evaluation.py

Benchmarks Content-Only, Collab-Only, Sentiment-Only, and Hybrid across:

Precision@K  —  fraction of relevant items in top-K
Recall@K     —  fraction of all relevant items retrieved
NDCG@K       —  ranking quality (discounted cumulative gain)
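
For reference, the three metrics can be written in a few lines for binary relevance (an item is either relevant or not). This is a generic sketch of the standard definitions, not the code in evaluation.py; the function signatures are assumptions.

```python
import math


def precision_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    if k <= 0:
        return 0.0
    return sum(item in relevant for item in ranked[:k]) / k


def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(item in relevant for item in ranked[:k]) / len(relevant)


def ndcg_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """DCG of the ranking divided by the DCG of an ideal ranking."""
    dcg = sum(1.0 / math.log2(pos + 2)
              for pos, item in enumerate(ranked[:k]) if item in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg else 0.0


# Hypothetical ranking and ground truth.
ranked = ["a", "b", "c", "d"]
relevant = {"a", "c", "e"}
```

Precision and recall ignore order within the top K, while NDCG rewards placing relevant items earlier, so it is the one that separates two rankings with the same hits.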

07 — Security

✓  No hardcoded credentials — config served via /api/config
✓  .env excluded from git via .gitignore
✓  CORS restricted to explicit configured origins; wildcard origins are rejected
✓  Row-Level Security (RLS) on all Supabase tables
✓  Input validation via Pydantic models
✓  Generic error messages — no stack trace leakage
✓  SQL injection safe (Supabase SDK parameterized queries)

08 — FAQ

How do I set up the project locally?

Clone the repository and install the dependencies with pip install -r requirements.txt. Copy .env.example to .env and fill in your Supabase credentials, run the SQL migrations described in SETUP.md, then start the backend with python -m uvicorn backend.main:app. To skip Supabase entirely, run streamlit run app.py instead.

What datasets does this project use?

The models are built from three kinds of data: item metadata (title, description, category), user interactions (purchases, views, clicks), and review text. Dataset files live in the datasets/ directory, and scripts/generate_sample_data.py creates a synthetic test dataset. Check the project documentation for download links and formatting details.

How do the alpha/beta/gamma weights affect recommendations?

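Each weight scales one model's contribution to the hybrid score: alpha weights the content score (TF-IDF similarity), beta the collaborative score (SVD latent space), and gamma the sentiment score (VADER). Raising alpha favors items with similar metadata, raising beta favors items that similar users interacted with, and raising gamma favors items with positive reviews. Adjust them with the UI sliders or PUT /api/weights to tune recommendation quality for your use case.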

What is Bayesian rating and why is it used?

Bayesian rating is a method used to balance average ratings with the number of votes an item has received. It prevents items with very few ratings from unfairly appearing at the top of recommendations. This makes the ranking system more stable and reliable.

How do I run the tests?

Install the dependencies with pip install -r requirements.txt, then run pytest from the project root. A passing run confirms that the application still works after your changes.

The backend shows "Backend offline" — what do I do?

First, check that the backend is running on the expected port (8000 by default) by requesting http://localhost:8000/api/status. Then verify that the Supabase variables in your .env file are set correctly. If the issue continues, restart the backend server and review the console logs for errors.

Can I use my own dataset with this project?

Yes. The data adapter (data_adapter.py) detects columns automatically and normalizes them to the project's schema, so many CSV or JSON files work without manual renaming. If a column is not detected, you may need to rename it or adjust the preprocessing steps. Testing with a small dataset first is recommended to confirm compatibility.


09 — Screenshots

Home Page (screenshot)

Recommendation Results (screenshot)

API Documentation: Swagger Docs (screenshot)


10 — Troubleshooting

ModuleNotFoundError

pip install -r requirements.txt

Port Already In Use

python -m uvicorn backend.main:app --port 8001

NLTK VADER Download Error

import nltk
nltk.download('vader_lexicon')

Supabase Connection Error

Check your .env file — no extra spaces, no quotes, correct project credentials:

SUPABASE_URL=your_url
SUPABASE_ANON_KEY=your_key
SUPABASE_SERVICE_KEY=your_service_key

11 — Setup Verification

# Backend
python -m uvicorn backend.main:app --host 0.0.0.0 --port 8000
# Visit: http://localhost:8000/api/status → { "status": "ok" }

# Streamlit
streamlit run app.py
# Browser opens automatically with CSV upload interface

Backend Health Check

Verify that the backend is running with:

curl http://localhost:8000/api/status

Example output when backend is running:
✅ Backend is running
⏱ Response time: 42 ms
📦 Response: {'status': 'ok'}

Example output when backend is offline:

❌ Could not connect to backend server
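
The sample output above prints the response as a Python dict, which suggests a small Python script, though the README does not name it. A minimal sketch of such a check using only the standard library; the `check_backend` function and its `opener` parameter are hypothetical.

```python
import json
import time
import urllib.request
from typing import Callable


def check_backend(url: str = "http://localhost:8000/api/status",
                  opener: Callable = urllib.request.urlopen,
                  timeout: float = 5.0) -> str:
    """Request the status endpoint and return a human-readable report."""
    start = time.perf_counter()
    try:
        with opener(url, timeout=timeout) as response:
            payload = json.load(response)
    except OSError:
        # urllib.error.URLError and socket timeouts are both OSError subclasses.
        return "❌ Could not connect to backend server"
    elapsed_ms = (time.perf_counter() - start) * 1000
    return (
        "✅ Backend is running\n"
        f"⏱ Response time: {elapsed_ms:.0f} ms\n"
        f"📦 Response: {payload}"
    )


if __name__ == "__main__":
    print(check_backend())
```

The `opener` argument exists only so the function can be exercised without a live server; by default it uses `urllib.request.urlopen`.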

Environment Validation

Ensure the following variables are configured in your .env file:

  • SUPABASE_URL
  • SUPABASE_ANON_KEY
  • SUPABASE_SERVICE_KEY

Example output:

❌ Missing environment variables:
 - SUPABASE_URL
 - SUPABASE_ANON_KEY
 - SUPABASE_SERVICE_KEY

Or:

✅ Environment setup looks good
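
A check that produces the messages above could look like the sketch below. It is an assumption about how such a validator might work, not the project's script, and it reads variables from the process environment, so the .env file must already be loaded (for example by the shell or a dotenv loader).

```python
import os
from typing import Mapping

REQUIRED_VARS = ("SUPABASE_URL", "SUPABASE_ANON_KEY", "SUPABASE_SERVICE_KEY")


def missing_env_vars(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the required variables that are unset or blank."""
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


def report(env: Mapping[str, str] = os.environ) -> str:
    """Format the validation result in the style shown above."""
    missing = missing_env_vars(env)
    if not missing:
        return "✅ Environment setup looks good"
    lines = ["❌ Missing environment variables:"]
    lines += [f" - {name}" for name in missing]
    return "\n".join(lines)


if __name__ == "__main__":
    print(report())
```

Treating blank values as missing catches the common case where .env.example was copied but never filled in.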

12 — Beginner Contributor Tips

Sync Your Fork Before Starting

git remote add upstream https://github.com/leonagoel/hybrid-recommender.git
git fetch upstream
git merge upstream/main

Resolve Merge Conflicts

  1. Open conflicted files
  2. Remove conflict markers (<<<<<<<, =======, >>>>>>>)
  3. Keep correct code, save, then commit

Pull Request Checklist

  • Project runs successfully
  • README formatting checked
  • No unnecessary files added
  • Branch name follows guidelines
  • Commit message follows convention
  • PR linked to issue

License

MIT — see LICENSE


🛠️ Troubleshooting Local Setup

1. ModuleNotFoundError while running the project

This usually happens when the virtual environment is not activated or dependencies are not installed.

Create a virtual environment:

python -m venv venv

Activate it:

Windows

venv\Scripts\activate

macOS/Linux

source venv/bin/activate

Install dependencies:

pip install -r requirements.txt

2. Dependency conflicts during installation

If pip install -r requirements.txt fails because of version conflicts:

Upgrade pip first:

python -m pip install --upgrade pip

Then reinstall dependencies:

pip install -r requirements.txt

If issues persist, recreate the virtual environment.

Windows

deactivate
rmdir /s /q venv
python -m venv venv

macOS/Linux

deactivate
rm -rf venv
python -m venv venv

3. Verify local environment setup

Run these commands to confirm everything is working correctly.

Check Python version:

python --version

Check installed packages:

pip list

Run the test suite:

pytest

If tests run successfully, the environment is ready for development.


Built by Leona Goel
B.Tech CSE · Vellore Institute of Technology
National Finalist · Smart India Hackathon 2025 · Top 8% of 950+ Teams



👥 Contributors

Thanks to all the amazing people who contribute to this project ❤️


Want to contribute?

See CONTRIBUTING.md to get started — all skill levels welcome!

| Step | Action |
|---|---|
| 1️⃣ | Fork the repo |
| 2️⃣ | Pick a good first issue |
| 3️⃣ | Submit a Pull Request |

Knowledge Graph Embeddings

This project now supports semantic item relationships using TransE-style knowledge graph embeddings.

Features:

  • Semantic similarity learning
  • Graph-based recommendation enrichment
  • Hybrid recommendation integration
  • Category/author relationship modeling

Run:

python scripts/generate_kg_embeddings.py
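
For readers new to TransE, the core idea is that a true triple (head, relation, tail) should satisfy head + relation ≈ tail in embedding space. The sketch below shows only that scoring rule with hand-picked toy vectors; it is not taken from scripts/generate_kg_embeddings.py, and the entity names, the relation name, and the use of the L2 norm are assumptions.

```python
import math


def transe_distance(head: list[float], relation: list[float], tail: list[float]) -> float:
    """L2 norm of head + relation - tail; smaller means a more plausible triple."""
    return math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))


# Hypothetical 2-dimensional embeddings, chosen by hand for illustration.
book_a = [0.1, 0.2]
written_by = [0.5, 0.0]
author_x = [0.6, 0.2]   # placed so that book_a + written_by lands on it
author_y = [-0.4, 0.9]

plausible = transe_distance(book_a, written_by, author_x)
implausible = transe_distance(book_a, written_by, author_y)
```

In a trained model these vectors are learned so that observed triples get small distances, and the distances can then feed into item similarity alongside the other signals.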
