GitHub - leonagoel/hybrid-recommender: A hybrid recommender system using content-based and collaborative filtering with a data adapter for dynamic datasets.

╔══════════════════════════════════════════════════════════════════╗
║                                                                  ║
║    H Y B R I D R E C                                             ║
║    ─────────────────────────────────────────────────────────     ║
║    Hybrid Recommender System · Leona Goel                        ║
║                                                                  ║
╚══════════════════════════════════════════════════════════════════╝

📌 Table of Contents

🚀 Features
⚙️ Installation
💻 Usage
📂 Project Structure
🤝 Contributing
📄 License

Important

🟢 This is the active GSSoC project repo — open all issues and PRs here only.

A production-ready recommender fusing Content-Based Filtering (TF-IDF), Collaborative Filtering (SVD), and NLP Sentiment Analysis (VADER) with a tunable weighted scoring engine — backed by Supabase PostgreSQL, served via FastAPI, and built to be dataset-agnostic by design.

25,000+ products  ·  Sub-50ms search  ·  3 ML models fused  ·  ~60% faster integration

01 — Architecture

The core insight: blend three independent signals, each capturing something the others miss.

User Reviews (text)           ──→  NLP Engine (VADER Sentiment)    ──┐
Item Metadata (title/desc)    ──→  Content Vectorization (TF-IDF)  ──┼──→  Weighted Hybrid  ──→  Ranked Results
User Purchases (clicks/buys)  ──→  Matrix Factorization (SVD)      ──┘         Engine

     Hybrid Score  =  α · content_score        [TF-IDF cosine similarity]
                    + β · collab_score          [Truncated SVD latent space]
                    + γ · sentiment_score       [VADER compound polarity]

     // α, β, γ are live-tunable via API or UI sliders
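
The weighted sum above can be sketched in a few lines of Python. This is a minimal illustration, not the code in hybrid_model.py: the `Weights` class, the `rank` helper, the default weight values, and the sample scores are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Weights:
    """Blending weights. The defaults are illustrative, not the project's."""
    alpha: float = 0.5  # content (TF-IDF cosine similarity)
    beta: float = 0.3   # collaborative (SVD latent space)
    gamma: float = 0.2  # sentiment (normalized VADER polarity)


def hybrid_score(content: float, collab: float, sentiment: float,
                 w: Weights = Weights()) -> float:
    """Weighted sum of the three component scores, each assumed to lie in [0, 1]."""
    return w.alpha * content + w.beta * collab + w.gamma * sentiment


def rank(candidates: dict[str, tuple[float, float, float]],
         w: Weights = Weights()) -> list[str]:
    """Return item titles sorted by descending hybrid score."""
    return sorted(candidates, key=lambda t: hybrid_score(*candidates[t], w), reverse=True)


# Hypothetical (content, collab, sentiment) scores for two items.
items = {
    "Wireless Mouse": (0.9, 0.2, 0.5),
    "Mouse Pad": (0.3, 0.8, 0.9),
}
print(rank(items))                                           # content-heavy weights
print(rank(items, Weights(alpha=0.1, beta=0.6, gamma=0.3)))  # collaborative-heavy weights
```

Shifting weight from α to β flips the order of the two items, which is the effect the sliders expose.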

α — Content Model · TF-IDF + Cosine Similarity

Item metadata (title + description + category) vectorized with TF-IDF (unigrams + bigrams, max 5,000 features). On-the-fly cosine similarity yields content_score ∈ [0, 1]. Fast, interpretable, and requires zero user history — ideal for cold-start.
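
To make the content model concrete, here is a standard-library sketch of TF-IDF over unigrams and bigrams followed by cosine similarity. It is not the project's implementation, which uses scikit-learn; the helper names and the three-item catalog are assumptions, and the 5,000-feature cap is omitted for brevity.

```python
import math
import re
from collections import Counter


def ngrams(text: str) -> list[str]:
    """Lowercase word unigrams plus adjacent-word bigrams."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return words + [f"{a} {b}" for a, b in zip(words, words[1:])]


def tfidf_vectors(docs: list[str]) -> list[dict[str, float]]:
    """One sparse TF-IDF vector (term -> weight) per document."""
    counts = [Counter(ngrams(d)) for d in docs]
    doc_freq = Counter(term for c in counts for term in c)
    n = len(docs)
    # Smoothed IDF, so a term found in every document keeps a positive weight.
    idf = {t: math.log((1 + n) / (1 + df)) + 1 for t, df in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in c.items()} for c in counts]


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


# Hypothetical catalog: title + description + category joined into one string per item.
catalog = [
    "Wireless Mouse ergonomic wireless mouse for office Electronics",
    "Wireless Keyboard compact wireless keyboard Electronics",
    "Yoga Mat non-slip exercise mat Fitness",
]
vecs = tfidf_vectors(catalog)
scores = [cosine(vecs[0], v) for v in vecs]  # similarity of every item to the mouse
print([round(s, 3) for s in scores])
```

Because TF-IDF weights are non-negative, the cosine stays within [0, 1], and no user history is involved at any step.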

β — Collaborative Model · Truncated SVD

User-item interaction matrix built from purchases + implicit feedback (views, clicks). SVD reduces to 50 latent factors; cosine similarity in latent space yields collab_score. Adaptive rank automatically reduces SVD components for sparse matrices.
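
One way to picture the adaptive rank is as a cap on the number of SVD components. The heuristic below is an assumption for illustration, not necessarily the rule used in collaborative_model.py: it relies only on the fact that a matrix's rank cannot exceed its smaller dimension or its number of non-zero entries.

```python
def adaptive_rank(n_users: int, n_items: int, n_interactions: int,
                  max_rank: int = 50) -> int:
    """Pick an SVD component count that a sparse interaction matrix can support.

    Hypothetical heuristic: stay strictly below the smaller matrix dimension
    and never exceed the number of observed interactions.
    """
    if min(n_users, n_items) < 2 or n_interactions < 1:
        raise ValueError("need at least a 2x2 matrix with one interaction")
    upper = min(n_users, n_items) - 1
    return max(1, min(max_rank, upper, n_interactions))


print(adaptive_rank(1000, 5000, 20000))  # large matrix: the full 50 factors
print(adaptive_rank(30, 400, 900))       # few users: limited by the user count
print(adaptive_rank(200, 300, 12))       # very sparse: limited by the interactions
```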

γ — Sentiment Model · NLTK VADER

Review text analyzed for compound polarity ∈ [-1, 1]. Per-item aggregation → Min-Max normalization → sentiment_score ∈ [0, 1]. Surfaces genuinely loved products, not just popular ones.
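
The aggregation and normalization steps can be sketched as follows. The compound scores are taken as given (in the project they come from VADER), and using the mean as the per-item aggregate is an assumption, as are the function name and the sample data.

```python
from collections import defaultdict
from statistics import mean


def sentiment_scores(reviews: list[tuple[str, float]]) -> dict[str, float]:
    """Map item -> sentiment_score in [0, 1] from (item, compound polarity) pairs."""
    by_item: dict[str, list[float]] = defaultdict(list)
    for item, compound in reviews:
        by_item[item].append(compound)
    if not by_item:
        return {}
    # Per-item aggregation (mean is an assumption), then Min-Max normalization.
    means = {item: mean(vals) for item, vals in by_item.items()}
    lo, hi = min(means.values()), max(means.values())
    if hi == lo:
        return {item: 0.5 for item in means}  # all items tie: use a neutral score
    return {item: (m - lo) / (hi - lo) for item, m in means.items()}


# Hypothetical compound polarities in [-1, 1].
reviews = [("A", 0.8), ("A", 0.6), ("B", -0.4), ("C", 0.15)]
print(sentiment_scores(reviews))
```

Min-Max normalization is relative to the current catalog: the best-reviewed item always scores 1 and the worst scores 0, regardless of their absolute polarity.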

❄ Cold-Start Handling

Bayesian average rating — prevents 1-review, 5-star bias
Popularity-based fallback — ranks new items by review count and category similarity
Mock user seeding — synthetic purchase history to bootstrap collaborative filtering
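
The Bayesian average mentioned above pulls an item's mean rating toward the global mean until it has collected enough reviews. The sketch below uses the common formulation with a prior weight; the function name, the prior weight of 10, and the sample numbers are assumptions rather than values taken from hybrid_model.py.

```python
def bayesian_average(item_mean: float, item_count: int,
                     global_mean: float, prior_weight: float = 10.0) -> float:
    """Blend an item's mean rating with the global mean.

    prior_weight acts like a number of "virtual" reviews at the global mean,
    so items with few reviews stay close to it.
    """
    return (prior_weight * global_mean + item_count * item_mean) / (prior_weight + item_count)


global_mean = 3.8  # hypothetical catalog-wide mean rating
one_review = bayesian_average(5.0, 1, global_mean)       # a single 5-star review
many_reviews = bayesian_average(4.6, 200, global_mean)   # 200 reviews averaging 4.6
print(round(one_review, 2), round(many_reviews, 2))
```

With these numbers the single 5-star item ends up ranked below the well-reviewed 4.6-star item, which is the bias the cold-start handling is meant to prevent.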

02 — Features

Feature	Detail
`PostgreSQL FTS`	GIN-indexed full-text search — sub-50ms on 250k+ rows
`Supabase Auth`	Guest (anonymous) and email/password, Row-Level Security on all tables
`Tunable Weights`	Live α/β/γ sliders to adjust recommendation blend in real time
`Dataset-Agnostic`	Fuzzy column detection (`product_name` → `title`) cuts integration time by ~60%
`Cold-Start Resilient`	Bayesian avg rating + popularity fallback for new users and items
`Type-to-Search`	Global keyboard capture — start typing anywhere to search instantly
`Responsive UI`	Amazon-inspired dark header, 4→3→2→1 column card grid across breakpoints
`Secure by Default`	Pydantic validation, parameterized queries, CORS-restricted, no stack-trace leakage
`Streamlit UI`	Local CSV upload → build models → recommendations, no Supabase or server required
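
The fuzzy column detection in the Dataset-Agnostic row can be approximated with the standard library's difflib. This is a sketch under assumptions: the alias table, the 0.8 cutoff, and the function names are hypothetical, and data_adapter.py may work differently.

```python
import difflib
import re

# Hypothetical alias table: canonical field -> column names seen in the wild.
CANONICAL_ALIASES = {
    "title": ["title", "name", "product_name", "item_name"],
    "description": ["description", "desc", "details", "summary"],
    "category": ["category", "genre", "department"],
    "rating": ["rating", "stars", "score"],
}


def normalize(name: str) -> str:
    """Lowercase and collapse punctuation/whitespace to underscores."""
    return re.sub(r"[^a-z0-9]+", "_", name.strip().lower()).strip("_")


def detect_columns(columns: list[str], cutoff: float = 0.8) -> dict[str, str]:
    """Map each recognised source column to a canonical field name."""
    mapping: dict[str, str] = {}
    for col in columns:
        key = normalize(col)
        for canonical, aliases in CANONICAL_ALIASES.items():
            if canonical in mapping.values():
                continue  # each canonical field is assigned at most once
            if difflib.get_close_matches(key, aliases, n=1, cutoff=cutoff):
                mapping[col] = canonical
                break
    return mapping


print(detect_columns(["Product Name", "Desc.", "Categroy", "price"]))
```

Columns that match nothing, such as `price` here, are simply left out of the mapping so the caller can decide whether to keep or drop them.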

03 — Tech Stack

┌─────────────────┬────────────────────────────────────────────────┐
│ Layer           │ Technology                                      │
├─────────────────┼────────────────────────────────────────────────┤
│ Backend         │ Python 3.10+, FastAPI, Uvicorn                 │
│ Database        │ Supabase (PostgreSQL), Row-Level Security       │
│ Search          │ PostgreSQL FTS (GIN indexes, ts_rank)          │
│ Auth            │ Supabase Auth (anonymous + email/password)      │
│ ML — Content    │ scikit-learn: TF-IDF Vectorizer, Cosine Sim    │
│ ML — Collab     │ scikit-learn: TruncatedSVD, SciPy sparse       │
│ NLP             │ NLTK VADER SentimentIntensityAnalyzer           │
│ Data            │ Pandas, NumPy                                   │
│ Frontend        │ HTML5, CSS3, Vanilla JS, Supabase JS v2        │
└─────────────────┴────────────────────────────────────────────────┘

04 — Project Structure

hybrid-recommender/
│
├── backend/
│   └── main.py                  # FastAPI server — search, upload, build, recommend
│
├── frontend/
│   ├── index.html               # Single-page UI (Amazon-like layout)
│   ├── styles.css               # Design system (dark header, cards, animations)
│   └── app.js                   # Frontend logic (auth, search, rendering)
│
├── scripts/
│   ├── generate_sample_data.py  # Synthetic test dataset generator
│   ├── import_to_supabase.py    # Batch import CSV/JSON → PostgreSQL
│   └── seed_mock_data.py        # Mock users + purchases for cold-start bootstrap
│
├── data_adapter.py              # ⭐ Auto column detection + schema normalization
├── content_model.py             # TF-IDF content-based recommender
├── collaborative_model.py       # SVD collaborative recommender + implicit feedback
├── hybrid_model.py              # Weighted hybrid engine (Bayesian avg, popularity)
├── nlp_engine.py                # VADER sentiment analysis pipeline
├── evaluation.py                # Precision@K, Recall@K, NDCG@K benchmarks
├── db.py                        # Supabase client singleton (anon + admin)
├── app.py                       # Streamlit UI — upload CSV, build models, get recommendations
├── requirements.txt
├── .env.example
└── SETUP.md

05 — Quick Start

Prerequisites: Python 3.10+ · Supabase account (free tier works)

# 1 — Clone & install
git clone https://github.com/leonagoel/hybrid-recommender.git
cd hybrid-recommender
pip install -r requirements.txt

# 2 — Configure Supabase
cp .env.example .env
# Fill in from: Supabase Dashboard → Settings → API

SUPABASE_URL=https://your-project-ref.supabase.co
SUPABASE_ANON_KEY=your-anon-key
SUPABASE_SERVICE_KEY=your-service-role-key

# 3 — Run SQL migrations
# See SETUP.md for full schema → paste into Supabase SQL Editor

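# 4 — Start the server (PowerShell; falls back to 0.0.0.0:8000 when HOST/PORT are unset)
if (-not $env:HOST) { $env:HOST = "0.0.0.0" }
if (-not $env:PORT) { $env:PORT = "8000" }

python -m uvicorn backend.main:app --host $env:HOST --port $env:PORT

# macOS/Linux equivalent (bash/zsh)
python -m uvicorn backend.main:app --host "${HOST:-0.0.0.0}" --port "${PORT:-8000}"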

Open http://localhost:8000, upload any CSV/JSON from datasets/, click Build Models, then start typing to search.

Check the active backend version:

curl "http://localhost:8000/api/version"

Async Recommendations — Celery Worker Setup

Async recommendation tasks require Redis and a running Celery worker.

1 — Start Redis (Docker recommended):

docker run -d -p 6379:6379 redis:7-alpine

2 — Add to .env:

REDIS_URL=redis://localhost:6379/0

3 — Start the Celery worker (separate terminal, from project root):

celery -A celery_app worker --loglevel=info

4 — Use async recommendations:

# Dispatch — returns task_id instantly (202 Accepted)
curl -X POST "http://localhost:8000/api/recommend?item_title=YourItem&top_n=10"

# Poll for results using the returned task_id
curl "http://localhost:8000/api/task/<task_id>"

Response flow:

POST /api/recommend  →  { "task_id": "abc123", "status": "PENDING" }
GET  /api/task/abc123  →  { "status": "SUCCESS", "result": { ... } }
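
The dispatch-then-poll flow above could be wrapped in a small client helper. This is a sketch under assumptions: only the PENDING and SUCCESS statuses appear in this README, so the FAILURE branch assumes the API passes through Celery's standard state name, and `wait_for_task` and `http_fetch` are hypothetical names.

```python
import json
import time
import urllib.request
from typing import Callable


def http_fetch(task_id: str, base_url: str = "http://localhost:8000") -> dict:
    """GET /api/task/<task_id> and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/api/task/{task_id}") as resp:
        return json.load(resp)


def wait_for_task(fetch_status: Callable[[str], dict], task_id: str,
                  interval: float = 0.5, timeout: float = 30.0) -> dict:
    """Poll until the task succeeds, fails, or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        body = fetch_status(task_id)
        status = body.get("status")
        if status == "SUCCESS":
            return body["result"]
        if status == "FAILURE":  # assumed: Celery's failure state is exposed as-is
            raise RuntimeError(f"task {task_id} failed: {body}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task {task_id} still {status} after {timeout}s")
        time.sleep(interval)


# Demo with canned responses instead of a live server.
canned = iter([{"status": "PENDING"}, {"status": "SUCCESS", "result": {"items": []}}])
print(wait_for_task(lambda _task_id: next(canned), "abc123", interval=0.01))
```

Against a running backend, pass `http_fetch` as the first argument together with the task_id returned by the POST request.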

Alternative — Streamlit UI (no Supabase required)

streamlit run app.py

Upload any CSV file, click Build Models, then enter an item name or User ID to get recommendations directly in your browser — no database or server setup needed.

Run with Docker Compose (Recommended for Contributors)

Docker Compose starts the full stack — backend API and static frontend — with a single command. No manual port juggling, no missing env vars.

Prerequisites

Docker Desktop (includes Compose)

Steps

1. Copy and fill in your environment file

cp .env.example .env
# Edit .env with your Supabase credentials

2. Start the stack

docker-compose up --build

--build forces a fresh image build. Omit it on subsequent runs when code hasn't changed.

3. Access the app

Service	URL
Frontend	http://localhost:3000
Backend	http://localhost:8000
API docs	http://localhost:8000/docs
Health	http://localhost:8000/health

4. Stop the stack

docker-compose down

Add -v to also remove named volumes if you want a completely clean state.

Troubleshooting

Problem	Fix
`Error: .env file not found`	Run `cp .env.example .env` and fill in credentials
Backend unhealthy / frontend won't start	Check `docker-compose logs backend`
Port 8000 already in use	Stop other services on 8000, or change `"8000:8000"` to `"8001:8000"`
Dataset not found at runtime	Make sure `datasets/` folder exists in project root

06 — API Reference

Retrieve frontend configuration (Supabase URL + anon key):

GET /api/config

Check if the API server is running:

GET /api/status

Full-text search across items (PostgreSQL FTS):

GET /api/search?q=...&limit=20

Upload a CSV or JSON dataset:

POST /api/upload

Build / rebuild the ML models from uploaded data:

POST /api/build

Get hybrid recommendations for a given item title:

GET /api/recommend/{title}

Paginated list of all items:

GET /api/items?page=1&per_page=50

List all distinct product categories:

GET /api/categories

Read the current α / β / γ blending weights:

GET /api/weights

Update the α / β / γ blending weights:

PUT /api/weights

Get purchase history for a specific user:

GET /api/purchases/{user_id}

Record a new purchase event:

POST /api/purchases

API Examples (curl)

All examples use http://localhost:8000 as the base URL.
Change the host or port if your server runs elsewhere. The Docker Compose setup also serves the backend at http://localhost:8000.

Get server status

curl http://localhost:8000/api/status

07 — Evaluation

```bash
# Run evaluation benchmarks
python evaluation.py
```

Benchmarks Content-Only, Collab-Only, Sentiment-Only, and Hybrid across:

Precision@K  —  fraction of relevant items in top-K
Recall@K     —  fraction of all relevant items retrieved
NDCG@K       —  ranking quality (discounted cumulative gain)
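
For reference, the three metrics can be written down compactly for the binary-relevance case. This is an illustrative sketch, not the contents of evaluation.py; the function names and the sample ranking are assumptions.

```python
import math


def precision_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k recommendations that are relevant."""
    return sum(1 for item in ranked[:k] if item in relevant) / k


def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for item in ranked[:k] if item in relevant) / len(relevant)


def ndcg_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """DCG of the top k divided by the DCG of an ideal ranking (binary gains)."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, item in enumerate(ranked[:k]) if item in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(relevant))))
    return dcg / ideal if ideal else 0.0


# Hypothetical ranking and ground truth.
ranked = ["a", "b", "c", "d", "e"]
relevant = {"a", "c", "f"}
print(precision_at_k(ranked, relevant, 3),
      recall_at_k(ranked, relevant, 3),
      round(ndcg_at_k(ranked, relevant, 3), 3))
```

Precision and recall ignore position within the top K, while NDCG rewards placing relevant items earlier, so the two hits at ranks 1 and 3 score lower than hits at ranks 1 and 2 would.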

07 — Security

✓  No hardcoded credentials — config served via /api/config
✓  .env excluded from git via .gitignore
✓  CORS restricted to explicit configured origins; wildcard origins are rejected
✓  Row-Level Security (RLS) on all Supabase tables
✓  Input validation via Pydantic models
✓  Generic error messages — no stack trace leakage
✓  SQL injection safe (Supabase SDK parameterized queries)

08 — FAQ

How do I set up the project locally?

Clone the repository and install the dependencies with pip install -r requirements.txt. Copy .env.example to .env, fill in your Supabase credentials, and run the SQL migrations described in SETUP.md. Then start the backend with python -m uvicorn backend.main:app, or use Docker Compose to start the backend and the static frontend together. To skip Supabase entirely, run streamlit run app.py and upload a CSV file.

What datasets does this project use?

The project works with item metadata (title, description, category), user reviews and ratings, and user interactions (purchases, views, clicks). Dataset files are stored in the datasets/ directory, and scripts/generate_sample_data.py generates a synthetic test dataset. Check the project documentation for download links and formatting details.

How do the alpha/beta/gamma weights affect recommendations?

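The weights set how much each model contributes to the final hybrid score: α scales the content score (TF-IDF similarity), β scales the collaborative score (SVD), and γ scales the sentiment score (VADER). Raising α favors items with similar metadata, raising β favors items that similar users interacted with, and raising γ favors items with positive reviews. Adjust them with the UI sliders or PUT /api/weights to tune recommendation quality for your use case.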

What is Bayesian rating and why is it used?

Bayesian rating is a method used to balance average ratings with the number of votes an item has received. It prevents items with very few ratings from unfairly appearing at the top of recommendations. This makes the ranking system more stable and reliable.

How do I run the tests?

Run pytest from the project root after installing the dependencies with pip install -r requirements.txt. Passing tests confirm that the application still works after your changes.

The backend shows "Backend offline" — what do I do?

First, check whether the backend server is running on the correct port, for example with curl http://localhost:8000/api/status. Then verify that your environment variables and database connection are configured properly. If the issue continues, restart the backend server and review the console logs for errors.

Can I use my own dataset with this project?

Yes. data_adapter.py detects columns automatically and normalizes them to the project's schema (for example, product_name → title). Depending on your data structure, you may still need to update file paths or preprocessing steps. Test with a small dataset first to confirm compatibility.

09 — Screenshots

Home Page

Recommendation Results

API Documentation

10 — Troubleshooting

ModuleNotFoundError

pip install -r requirements.txt

Port Already In Use

python -m uvicorn backend.main:app --port 8001

NLTK VADER Download Error

import nltk
nltk.download('vader_lexicon')

Supabase Connection Error

Check your .env file — no extra spaces, no quotes, correct project credentials:

SUPABASE_URL=your_url
SUPABASE_ANON_KEY=your_key
SUPABASE_SERVICE_KEY=your_service_key

11 — Setup Verification

# Backend
python -m uvicorn backend.main:app --host 0.0.0.0 --port 8000
# Visit: http://localhost:8000/api/status → { "status": "ok" }

# Streamlit
streamlit run app.py
# Browser opens automatically with CSV upload interface

Backend Health Check

Verify that the backend is running by requesting the status endpoint:

curl http://localhost:8000/api/status

Example output when backend is running:
```text
✅ Backend is running
⏱ Response time: 42 ms
📦 Response: {'status': 'ok'}
```

Example output when backend is offline:

❌ Could not connect to backend server

Environment Validation

Ensure the following variables are configured in your .env file:

SUPABASE_URL
SUPABASE_ANON_KEY
SUPABASE_SERVICE_KEY

Example output:

❌ Missing environment variables:
 - SUPABASE_URL
 - SUPABASE_ANON_KEY
 - SUPABASE_SERVICE_KEY

Or:

✅ Environment setup looks good
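
A check that produces output of this shape might look like the sketch below. It is an assumption about how such a validator could be written, not the project's script: the function names are hypothetical, and the variables are read from the process environment, so a .env file must already have been loaded or exported.

```python
import os
from typing import Mapping

REQUIRED_VARS = ("SUPABASE_URL", "SUPABASE_ANON_KEY", "SUPABASE_SERVICE_KEY")


def missing_env_vars(env: Mapping[str, str] = os.environ) -> list[str]:
    """Names of required variables that are unset or blank."""
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


def report(env: Mapping[str, str] = os.environ) -> str:
    """Format the validation result in the style shown above."""
    missing = missing_env_vars(env)
    if not missing:
        return "✅ Environment setup looks good"
    return "\n".join(["❌ Missing environment variables:",
                      *(f" - {name}" for name in missing)])


if __name__ == "__main__":
    print(report())
```

Passing the environment as a parameter keeps the check easy to test with a plain dictionary instead of the real process environment.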

12 — Beginner Contributor Tips

Sync Your Fork Before Starting

git remote add upstream https://github.com/leonagoel/hybrid-recommender.git
git fetch upstream
git merge upstream/main

Resolve Merge Conflicts

Open conflicted files
Remove conflict markers (<<<<<<<, =======, >>>>>>>)
Keep correct code, save, then commit

Pull Request Checklist

License

MIT — see LICENSE

🛠️ Troubleshooting Local Setup

1. ModuleNotFoundError while running the project

This usually happens when the virtual environment is not activated or dependencies are not installed.

Create a virtual environment:

python -m venv venv

Activate it:

Windows

venv\Scripts\activate

macOS/Linux

source venv/bin/activate

Install dependencies:

pip install -r requirements.txt

2. Dependency conflicts during installation

If pip install -r requirements.txt fails because of version conflicts:

Upgrade pip first:

python -m pip install --upgrade pip

Then reinstall dependencies:

pip install -r requirements.txt

If issues persist, recreate the virtual environment.

Windows

deactivate
rmdir /s /q venv
python -m venv venv

macOS/Linux

deactivate
rm -rf venv
python -m venv venv

3. Verify local environment setup

Run these commands to confirm everything is working correctly.

Check Python version:

python --version

Check installed packages:

pip list

Run the test suite:

pytest

If tests run successfully, the environment is ready for development.

Documentation

CHANGELOG

Built by Leona Goel
B.Tech CSE · Vellore Institute of Technology
National Finalist · Smart India Hackathon 2025 · Top 8% of 950+ Teams

👥 Contributors

Thanks to all the amazing people who contribute to this project ❤️

Want to contribute?

See CONTRIBUTING.md to get started — all skill levels welcome!

Step	Action
1️⃣	Fork the repo
2️⃣	Pick a good first issue
3️⃣	Submit a Pull Request

Knowledge Graph Embeddings

This project now supports semantic item relationships using TransE-style knowledge graph embeddings.

Features:

Semantic similarity learning
Graph-based recommendation enrichment
Hybrid recommendation integration
Category/author relationship modeling

Run:

python scripts/generate_kg_embeddings.py
