Added pre-processing for implementing RAG for Louis:Chatbot #1118

Isha-Sovasaria · 2025-11-06T12:21:19Z

This PR adds a structure-aware XML parsing and meso-level chunking pipeline for the SICP-style corpus. It walks chapters → sections → subsections, resolves entity-like references, and propagates document context into every extracted unit. Non-content tags (indexing, Scheme-only, editorial) are pruned, and text and code are segmented and normalized. Finally, content is regrouped by location and packed into token-bounded, markdown-friendly chunks suitable for RAG ingestion. Below is a detailed report:
Report for Parsing+Chunking.pdf
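
For readers who want a feel for the flow described above without opening the report, here is a minimal sketch of the walk, prune, and pack steps. It is not the code in this PR. The tag names (`INDEX`, `SCHEME`, `COMMENT`, `SNIPPET`), the `name` attribute used for location, the whitespace-based token count, and the `max_tokens` default are all assumptions made for illustration; entity resolution is left out entirely.

```python
import xml.etree.ElementTree as ET

# Hypothetical tag names; the actual prune list in the PR may differ.
PRUNED_TAGS = {"INDEX", "SCHEME", "COMMENT"}
CODE_TAGS = {"SNIPPET"}
FENCE = "`" * 3  # markdown code fence, built here to avoid a literal fence


def collect_units(elem, path=()):
    """Yield (path, kind, text) units in document order, skipping pruned subtrees."""
    if elem.tag in PRUNED_TAGS:
        return
    if elem.tag in CODE_TAGS:
        code = "".join(elem.itertext()).strip()
        if code:
            yield path, "code", code
        return
    # Assumption: structural elements carry their title in a `name` attribute.
    name = elem.get("name")
    if name:
        path = path + (name,)
    if elem.text and elem.text.strip():
        yield path, "text", " ".join(elem.text.split())
    for child in elem:
        yield from collect_units(child, path)
        # Text after a child (even a pruned one) belongs to the parent.
        if child.tail and child.tail.strip():
            yield path, "text", " ".join(child.tail.split())


def count_tokens(text):
    # Whitespace split stands in for a real tokenizer (assumption).
    return len(text.split())


def pack_chunks(units, max_tokens=200):
    """Pack consecutive units from the same location into bounded markdown chunks.

    A single unit larger than max_tokens is kept whole rather than split.
    """
    chunks = []
    current_path, parts, used = None, [], 0

    def flush():
        nonlocal parts, used
        if parts:
            header = "## " + (" > ".join(current_path) or "(root)")
            chunks.append(header + "\n\n" + "\n\n".join(parts))
        parts, used = [], 0

    for path, kind, text in units:
        block = f"{FENCE}\n{text}\n{FENCE}" if kind == "code" else text
        cost = count_tokens(block)
        if path != current_path or (parts and used + cost > max_tokens):
            flush()
            current_path = path
        parts.append(block)
        used += cost
    flush()
    return chunks
```

Each chunk starts with a heading built from the chapter and section path, which is one simple way to carry document context into every unit. Usage would look like `pack_chunks(collect_units(ET.parse(path_to_xml).getroot()))`, where `path_to_xml` is a placeholder for a corpus file.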

Isha-Sovasaria added 3 commits November 2, 2025 12:09

Add parser updates and chunking logic

4e827f6

updated chunking logic

4cec811

further edits to chunking logic

310941c

2 participants