An AI-powered search engine that understands what you mean, not just what you type. Built on a four-layer architecture spanning metadata enrichment, semantic vectors, agentic query planning, and a knowledge graph.
Live demo: movieagent.streamlit.app
Traditional search matches keywords. CINESEEK matches meaning.
You can search for something like "a dark film where nobody can be trusted" or "something lighthearted for a Friday night" and get genuinely relevant results — even if those exact words don't appear anywhere in the dataset.
The raw TMDB dataset provides only basic fields such as title, genre, year, and rating. An LLM reads every movie's plot description and infers structured attributes that were never explicit in the data: tone (dark, lighthearted, tense), themes (identity, revenge, survival), and pacing (slow-burn, moderate, fast-paced). This enrichment is what makes everything downstream possible.
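As a rough sketch of how the enrichment step can work (the actual prompt and field names live in `src/ingest.py`; everything below, including the sample model reply, is illustrative):

```python
import json

# Hypothetical enrichment prompt; the real one is defined in src/ingest.py.
ENRICH_PROMPT = """Read this movie overview and reply with a JSON object containing:
"tone" (e.g. dark, lighthearted, tense), "themes" (a list of short phrases),
and "pacing" (slow-burn, moderate, or fast-paced).

Overview: {overview}"""

def build_enrich_prompt(overview: str) -> str:
    return ENRICH_PROMPT.format(overview=overview)

def parse_enrichment(raw_reply: str) -> dict:
    """Extract the JSON object from the model's reply, tolerating extra text around it."""
    start, end = raw_reply.find("{"), raw_reply.rfind("}")
    return json.loads(raw_reply[start:end + 1])

# Simulated LLM reply, for illustration only:
reply = '{"tone": "dark", "themes": ["revenge", "identity"], "pacing": "slow-burn"}'
record = parse_enrichment(reply)
assert record["tone"] == "dark"
```

Parsing defensively matters here: LLMs occasionally wrap JSON in prose, so extracting the outermost braces before `json.loads` keeps the ingest loop robust.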
Every enriched movie record is converted into a 384-dimensional meaning vector using sentence-transformers and stored in ChromaDB. Similar meanings cluster together in vector space, so concept-based queries find the right results even when the exact words don't match.
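"Similar meanings cluster together" concretely means that nearby vectors have high cosine similarity. A minimal illustration with toy 3-dimensional vectors standing in for the real 384-dimensional embeddings (the values are made up purely to show the idea):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy stand-ins for real embeddings (illustrative values only):
noir_thriller = [0.9, 0.1, 0.2]
dark_mystery  = [0.8, 0.2, 0.3]
rom_com       = [0.1, 0.9, 0.1]

# A query embedded near "noir thriller" lands closer to "dark mystery" than to "rom com".
assert cosine_similarity(noir_thriller, dark_mystery) > cosine_similarity(noir_thriller, rom_com)
```

ChromaDB performs this nearest-neighbour lookup over the stored vectors, so a query like "a dark film where nobody can be trusted" retrieves noir-adjacent movies without any keyword overlap.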
When a query arrives, it doesn't go straight to the database. A Groq LLaMA-3 model acts as a query planner, deconstructing your intent into structured filters (year, genre, rating), a rewritten semantic query, and keywords for exact matching. If your query is too vague, it asks a clarifying question before searching. That back-and-forth is what makes it feel like a conversation rather than a search box.
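A hypothetical sketch of the plan the planner might emit and how the app branches on it (the real schema lives in `src/planner.py`; the field names and example plans below are assumptions):

```python
# Hypothetical plan shape: the planner either returns a search plan
# or flags that the query is too vague to execute.
def handle_plan(plan: dict):
    if plan.get("needs_clarification"):
        return ("ask", plan["clarifying_question"])
    return ("search", plan)

# Illustrative plan for "top rated action movies from the 1990s":
plan = {
    "needs_clarification": False,
    "filters": {"genre": "Action", "year_min": 1990, "year_max": 1999, "rating_min": 7.0},
    "semantic_query": "highly rated 1990s action films",
    "keywords": ["action", "1990s"],
}

# Illustrative response to a vague query like "something fun for the weekend":
vague = {
    "needs_clarification": True,
    "clarifying_question": "Fun how: a comedy, a feel-good drama, or an adventure?",
}

assert handle_plan(plan)[0] == "search"
assert handle_plan(vague)[0] == "ask"
```

Splitting the plan into filters, a semantic query, and keywords is what lets the three retrieval strategies in the next layer each consume the part of the intent they handle best.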
Movies, genres, tones, themes, and pacing levels are nodes in a NetworkX graph, connected by edges that record their relationships. At retrieval time, three strategies run in parallel: ChromaDB vector search, BM25 keyword search, and metadata filtering. Their rankings are fused with Reciprocal Rank Fusion (RRF) into a single result list, and Groq LLaMA-3 then synthesizes a natural language explanation of why those results matched your query.
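RRF itself is simple: each document earns 1/(k + rank) from every list it appears in, so items ranked well by multiple strategies rise to the top. A minimal sketch (the movie IDs are illustrative; `src/search.py` holds the real implementation):

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists: each item scores 1/(k + rank) per list it appears in."""
    scores = {}
    for results in ranked_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Illustrative outputs of the three parallel strategies:
vector_hits = ["memento", "se7en", "oldboy"]
bm25_hits   = ["se7en", "memento", "heat"]
filter_hits = ["se7en", "heat", "oldboy"]

fused = reciprocal_rank_fusion([vector_hits, bm25_hits, filter_hits])
# "se7en" ranks highly in all three lists, so it comes first after fusion.
assert fused[0] == "se7en"
```

Because RRF works only on ranks, it sidesteps the problem of combining incompatible score scales (cosine distances, BM25 scores, and boolean filter matches), which is why it is a common fusion choice for hybrid search. The constant `k=60` is the value from the original RRF paper; it damps the influence of any single top-ranked hit.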
| Tool | Purpose |
|---|---|
| Python | Core language |
| Streamlit | Web UI and cloud deployment |
| Pandas | Data loading and cleaning |
| ChromaDB | Vector storage and filtered search |
| sentence-transformers (all-MiniLM-L6-v2) | Local text embeddings |
| Groq LLaMA-3.3-70b | Query planning and answer synthesis |
| rank_bm25 | Keyword retrieval |
| NetworkX | Knowledge graph construction |
| pyvis | Interactive graph visualisation |
| Streamlit Community Cloud | Free public hosting |
```
semantic-search-agent/
├── src/
│   ├── app.py            # Streamlit UI
│   ├── planner.py        # Groq query planner + clarification logic
│   ├── search.py         # Hybrid search + RRF fusion + answer synthesis
│   ├── embeddings.py     # Embed records and upsert into ChromaDB
│   ├── ingest.py         # Load TMDB CSV + LLM metadata enrichment
│   ├── graph_builder.py  # Build NetworkX knowledge graph
│   └── graph_viz.py      # Export interactive graph as HTML
├── data/
│   ├── enriched.json     # LLM-enriched movie records
│   └── chroma_db/        # Persisted vector embeddings
├── .streamlit/
│   └── config.toml       # Streamlit configuration
├── requirements.txt
├── packages.txt
├── project.md
└── .gitignore
```
1. Clone the repo

```bash
git clone https://github.com/yourusername/semantic-search-agent.git
cd semantic-search-agent
```

2. Create a virtual environment

```bash
python3 -m venv venv
source venv/bin/activate
```

3. Install dependencies

```bash
pip install -r requirements.txt
```

4. Add your API key

Create a `.env` file in the project root:

```
GROQ_API_KEY=your_groq_key_here
```

Get a free key at console.groq.com.

5. Run the app

```bash
streamlit run src/app.py
```

The app will open at http://localhost:8501.
| Query | What it tests |
|---|---|
| top rated action movies from the 1990s | Metadata filters (year + genre) |
| films with a dark twisted ending you don't see coming | Pure semantic search |
| movies about a man seeking revenge for his family | Hybrid search |
| something fun for the weekend | Clarification flow |
| a film like Blade Runner but more hopeful | Semantic similarity |
```
User query
    ↓
Groq LLaMA-3 query planner
    ↓ (if vague → asks clarifying question)
Structured search plan
    ↓
┌──────────────────────────────────────────┐
│ Vector search │ BM25 │ Metadata filter   │
└──────────────────────────────────────────┘
    ↓
Reciprocal Rank Fusion
    ↓
Ranked results + Groq-synthesised explanation
```
Built on the TMDB 5000 Movie Dataset from Kaggle. The dataset covers older films and does not include recent releases. The focus of this project is the search architecture, not data freshness.
This project was inspired by conversations at the Gartner conference around the future of enterprise search — the idea that search should understand intent rather than match keywords. Movies were used as the playground, but the same four-layer architecture applies directly to enterprise use cases like internal knowledge bases, document retrieval, product catalogs, and customer support systems.
The architecture demonstrated here maps directly to real enterprise search problems:
- Metadata enrichment → enrich support tickets, contracts, or product listings with LLM-inferred tags
- Semantic layer → enable concept-based search over internal documents regardless of terminology
- Context layer → agentic query planning for enterprise knowledge bases
- Knowledge graph → connect entities across departments, projects, or product lines
Developed using Cursor as an AI coding assistant alongside Claude. Deployed for free on Streamlit Community Cloud.
MIT