TDS Virtual TA Project

Hi! I’m Ashka Pathak, and this is my Virtual Teaching Assistant project. This assistant is designed to:

Scrape and store TDS course content
Download Discourse forum posts with authentication
Respond to student queries via a lightweight API

This project reflects my hands-on learning in web scraping, API development, and automation. It was both challenging and rewarding—and it greatly deepened my understanding of backend systems and data structuring.

Live Demo — Try it out using the Swagger UI

What I Learned

This project taught me:

How to structure and scrape real-world educational content
How to use browser cookies for session-based authenticated downloads
How a FAISS vectorstore is stored on disk: a binary index.faiss file for the embeddings and an index.pkl file for the metadata
How to mock embeddings when OpenAI wasn’t usable
How to deploy using Render + manage vectorstores in production

Overview

This repository includes:

A website scraper that collects TDS course pages in Markdown format
A Discourse post downloader with session-based authentication
A FastAPI server that processes student queries and generates responses

Project Structure

.
├── discourse_posts.json               # Combined Discourse data
├── discourse_json/                    # Topic-wise JSON dumps
├── tds_pages_md/                      # Markdown pages of the course site
├── TDS_Project1_Data/                 # Code for scraping and downloading
│ ├── discourse_scraper.py
│ ├── website_downloader_full.py
│ ├── tds_discourse_downloader.py
│ └── ...
├── fastapi_app.py                     # FastAPI server with LLM response generation
├── embed_all_posts.py                 # Generates FAISS index
├── faiss_index/                       # Stored FAISS vectorstore (index.faiss, index.pkl)
├── requirements.txt                   # Python dependencies
├── Procfile                           # Render deployment entrypoint
└── README.md

Features

Automatically scrapes TDS course content from the official website into markdown
Downloads all Discourse forum posts from Jan 1 to Apr 14, 2025 using cookie-based authentication
Answers student questions through a FastAPI REST API backed by FAISS vector search over the scraped content
Generates responses using either OpenAI or a local LLM (LLaMA)
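
To make the vector-search feature concrete, here is a minimal brute-force sketch of the idea: score every stored vector against the query and keep the best matches. It is not FAISS itself and not the code in this repository. The function names `cosine_similarity` and `top_k` are hypothetical, and cosine similarity is an assumed metric (the project's index may use a different one).

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    query: Sequence[float],
    vectors: List[Sequence[float]],
    k: int = 3,
) -> List[Tuple[int, float]]:
    """Return (index, score) pairs for the k vectors most similar to the query."""
    scored = [(i, cosine_similarity(query, v)) for i, v in enumerate(vectors)]
    # Highest similarity first, then keep only the first k results.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]
```

A library such as FAISS does the same job with optimized index structures, which matters once the number of stored vectors grows.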

Installation

1. Clone and create a virtual environment

```
git clone https://github.com/AshkaPathak/TDS-Virtual-TA.git
cd TDS-Virtual-TA
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

2. Scrape the course content

```
python TDS_Project1_Data/website_downloader_full.py
```

3. Download discourse posts (requires cookies)

```
python TDS_Project1_Data/tds_discourse_downloader.py
```

4. Generate embeddings (FAISS)

```
python embed_all_posts.py
```

5. Start API server

```
uvicorn fastapi_app:app --reload
```

Go to: http://localhost:8000/docs

Deployed Version

You can access the hosted version here

Data Format

discourse_posts.json – Full Discourse post dump
discourse_json/ – Individual JSON files per topic
tds_pages_md/ – Markdown-formatted course pages
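
As a rough illustration of how the per-topic files relate to the combined dump, the sketch below merges every JSON file in a folder into a single JSON list. This is an assumption about the relationship, not the repository's actual script. The function name `combine_topic_files` is hypothetical, and the real schema of each topic file is not described here.

```python
import json
from pathlib import Path
from typing import Any, List


def combine_topic_files(topic_dir: str, output_file: str) -> int:
    """Merge every *.json file in `topic_dir` into one JSON list; return the count."""
    combined: List[Any] = []
    # Sort the paths so the output order is stable between runs.
    for path in sorted(Path(topic_dir).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            combined.append(json.load(handle))
    with open(output_file, "w", encoding="utf-8") as handle:
        json.dump(combined, handle, ensure_ascii=False, indent=2)
    return len(combined)
```

Under these assumptions, a call would look like `combine_topic_files("discourse_json", "discourse_posts.json")`.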

License

This project is licensed under the MIT License.

Notes & Reflections

This wasn’t just a regular course project—it genuinely pushed me to explore real-world tools beyond theory.

Firstly, I didn’t have access to a personal OpenAI key, so I had to mock the embedding system to get things running. Figuring out that LangChain's FAISS wrapper expects an embeddings object with embed_documents() and embed_query() methods gave me a solid look into how LangChain works under the hood.
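
A minimal sketch of what such a mock can look like, assuming all that is needed is an object exposing `embed_documents()` and `embed_query()`. The class name `MockEmbeddings`, the vector size, and the hashing scheme are illustrative choices, not the code used in this repository. The vectors are deterministic but carry no semantic meaning, so search results built on them only exercise the pipeline.

```python
import hashlib
from typing import List


class MockEmbeddings:
    """Deterministic stand-in for a real embedding model (illustrative only)."""

    def __init__(self, dim: int = 8) -> None:
        # Hypothetical size; real embedding models use far more dimensions.
        self.dim = dim

    def _embed(self, text: str) -> List[float]:
        # Hash the text and reuse the digest bytes cyclically as floats in [0, 1].
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        return [digest[i % len(digest)] / 255.0 for i in range(self.dim)]

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        # Used when building the index: one vector per document.
        return [self._embed(text) for text in texts]

    def embed_query(self, text: str) -> List[float]:
        # Used at query time: a single vector for the question.
        return self._embed(text)
```

Because the same text always hashes to the same vector, an exact-match query retrieves its own document, which is enough to test loading, indexing, and the API end to end.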

FAISS gave me a hard time, especially when the index wouldn’t load and all I got was a vague "FileIOReader failed" error. It turns out you need both index.faiss (the actual vectors) and index.pkl (the metadata) properly placed and committed. I learnt that the hard way.
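
One way to catch this earlier is to check for both files before trying to load the index. The sketch below is a hypothetical pre-flight check, not part of the repository. The helper name `missing_index_files` is made up for illustration, and the folder name `faiss_index` follows the project structure listed earlier.

```python
from pathlib import Path
from typing import List

# The two files that must sit side by side in the index folder.
REQUIRED_FILES = ("index.faiss", "index.pkl")


def missing_index_files(folder: str) -> List[str]:
    """Return the names of required index files that are absent from `folder`."""
    base = Path(folder)
    return [name for name in REQUIRED_FILES if not (base / name).is_file()]


if __name__ == "__main__":
    missing = missing_index_files("faiss_index")
    if missing:
        print("Missing index files:", ", ".join(missing))
    else:
        print("Both index files are present.")
```

Running a check like this at startup turns a vague loader error into a message that names the missing file.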

Moreover, downloading Discourse posts wasn’t straightforward. I had to dig through Chrome DevTools, figure out which cookies to grab, and pass them correctly in the request header. Through this project, I learned more about headers, sessions, and auth than in any previous scraping work.
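
The cookie step can be sketched roughly as follows, using only the standard library. This is an illustration, not the downloader in this repository. The cookie names `_t` and `_forum_session` are assumptions about a typical Discourse session, the URL is a placeholder, and the helper names are hypothetical. The request is only prepared here, not sent.

```python
from typing import Dict
from urllib.request import Request


def build_cookie_header(cookies: Dict[str, str]) -> str:
    """Join cookie name/value pairs into a single Cookie header value."""
    return "; ".join(f"{name}={value}" for name, value in cookies.items())


def authenticated_request(url: str, cookies: Dict[str, str]) -> Request:
    """Prepare (but do not send) a GET request carrying the session cookies."""
    request = Request(url)
    request.add_header("Cookie", build_cookie_header(cookies))
    request.add_header("Accept", "application/json")
    return request


# Placeholder values: copy the real ones from DevTools and never commit them.
cookies = {"_t": "PASTE_VALUE", "_forum_session": "PASTE_VALUE"}
request = authenticated_request("https://discourse.example.com/latest.json", cookies)
```

Passing the prepared request to `urllib.request.urlopen` would send it. Keeping the cookie values out of version control matters because they grant access to the logged-in account.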

FastAPI + Swagger UI made testing fun. Seeing the API work live and actually respond to queries felt super rewarding after all the backend pieces came together.

Running LLaMA 3 locally via Ollama was a great fallback. It made me realize how far open-source models have come. It was slower, but it worked—and that gave me the confidence to say: this project doesn’t need OpenAI to run.
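
The fallback idea can be sketched as a small selection function. This is a hypothetical illustration, not the logic in fastapi_app.py. The function name `choose_backend`, the backend labels, and the `OPENAI_API_KEY` environment variable are all assumptions.

```python
import os
from typing import Mapping, Optional


def choose_backend(env: Optional[Mapping[str, str]] = None) -> str:
    """Return "openai" when an API key is configured, otherwise "ollama"."""
    # Default to the process environment; a mapping can be passed in for testing.
    env = os.environ if env is None else env
    return "openai" if env.get("OPENAI_API_KEY", "").strip() else "ollama"
```

With a check like this, the server can start without any key configured and route queries to the local model instead.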

Right when I thought this was finally going well, I accidentally committed the entire venv/, hit GitHub’s file size limit, and had to use git filter-repo to clean it up. Annoying at the time, but now I know how to keep my Git history clean and lightweight.

This whole process helped me connect scraping, embeddings, APIs, and deployment into one working pipeline. It wasn’t smooth, but it was worth it.
