@Isha-Sovasaria
This PR adds a structure-aware XML parsing and meso-level chunking pipeline for the SICP-style corpus. It walks chapters → sections → subsections, resolves entity-like references, and propagates document context into every extracted unit. Non-content tags (indexing, Scheme-only, editorial) are pruned, and text and code are segmented and normalized. Finally, content is regrouped by location and packed into token-bounded, markdown-friendly chunks suitable for RAG ingestion. A detailed report is attached below:
Report for Parsing+Chunking.pdf
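
For readers skimming the description, here is a rough sketch of the flow it describes: walk the hierarchy, carry the chapter/section context into each unit, prune non-content tags, then pack units into token-bounded markdown chunks. It is not the code in this PR. The tag names (`BOOK`, `CHAPTER`, `SECTION`, `SUBSECTION`, `NAME`, `TEXT`, `SNIPPET`, `INDEX`, `SCHEME`, `COMMENT`), the helper names, and the whitespace-based token count are all assumptions made for illustration, and entity resolution is left out.

```python
import xml.etree.ElementTree as ET

# Hypothetical tag names; the corpus schema used in the PR may differ.
PRUNED_TAGS = {"INDEX", "SCHEME", "COMMENT"}
CONTAINER_TAGS = {"CHAPTER", "SECTION", "SUBSECTION"}
FENCE = "`" * 3  # markdown code fence, built here to avoid nesting fences

SAMPLE = """
<BOOK>
  <CHAPTER><NAME>Building Abstractions</NAME>
    <SECTION><NAME>Elements of Programming</NAME>
      <TEXT>Every language has <INDEX>primitive</INDEX>primitive expressions.</TEXT>
      <SNIPPET>const size = 2;</SNIPPET>
      <SCHEME>(define size 2)</SCHEME>
    </SECTION>
  </CHAPTER>
</BOOK>
"""


def visible_text(elem):
    """Collect text under elem, skipping the contents of pruned tags."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag not in PRUNED_TAGS:
            parts.append(visible_text(child))
        parts.append(child.tail or "")  # text after a pruned tag is kept
    return "".join(parts)


def iter_units(elem, context=()):
    """Yield (context, kind, text) for each text or code unit under elem."""
    for child in elem:
        if child.tag in PRUNED_TAGS:
            continue
        if child.tag in CONTAINER_TAGS:
            title = child.findtext("NAME", default="").strip()
            yield from iter_units(child, context + (title,))
        elif child.tag == "TEXT":
            # Normalize prose by collapsing runs of whitespace.
            yield context, "text", " ".join(visible_text(child).split())
        elif child.tag == "SNIPPET":
            # Keep internal line breaks in code; trim only the edges.
            yield context, "code", visible_text(child).strip()


def count_tokens(text):
    """Whitespace word count, standing in for a real tokenizer."""
    return len(text.split())


def render(context, parts):
    """Build one markdown chunk headed by its location in the document."""
    return "\n\n".join(["## " + " > ".join(context), *parts])


def pack_chunks(units, max_tokens=200):
    """Group units by location and split whenever the budget would be exceeded.

    A single unit larger than max_tokens still becomes its own chunk.
    """
    chunks, current_ctx, parts, used = [], None, [], 0
    for context, kind, text in units:
        body = f"{FENCE}\n{text}\n{FENCE}" if kind == "code" else text
        cost = count_tokens(body)
        if parts and (context != current_ctx or used + cost > max_tokens):
            chunks.append(render(current_ctx, parts))
            parts, used = [], 0
        current_ctx = context
        parts.append(body)
        used += cost
    if parts:
        chunks.append(render(current_ctx, parts))
    return chunks
```

Calling `pack_chunks(iter_units(ET.fromstring(SAMPLE.strip())))` on the sample yields a single chunk headed by the chapter and section titles, with the Scheme-only element dropped. A real pipeline would presumably swap `count_tokens` for the tokenizer of the target embedding model.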