The goal of this project is to automate the extraction of structured information from a Bangla document and store it in clean Excel files. The system identifies chapters, bold sub-sections, and numbered hadith entries in the `.docx` file and writes them to three separate `.xlsx` files with sequential IDs. It keeps the original text unchanged while removing blank lines and empty cells. This reduces manual work, prevents duplicate effort, and makes the data easier to use for further analysis or processing.
- ✅ Automatic Chapter extraction
- ✅ Smart detection of Bold Sub-sections
- ✅ Accurate Hadith identification (`[১]`, `[২]`, ...)
- ✅ Removes blank lines & noise
- ✅ Keeps original text unchanged
- ✅ Generates 3 Excel files instantly
- ✅ Clean structure with auto ID generation
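
As a rough illustration of the "removes blank lines" and "keeps original text unchanged" features, the filter below drops empty or whitespace-only lines and returns everything else untouched. This is a minimal sketch, not the notebook's actual code; the function name `drop_blank_lines` and the sample list are hypothetical.

```python
def drop_blank_lines(lines):
    """Return the non-blank lines, each one exactly as it appeared in the input."""
    # str.strip() is used only to test for emptiness; kept strings are not modified.
    return [line for line in lines if line.strip()]


# Hypothetical sample input: two real lines and two blank ones.
raw = ["অধ্যায়: পিতা-মাতার সাথে সদ্ব্যবহার", "", "   ", "[১] Full hadith text..."]
cleaned = drop_blank_lines(raw)
print(len(raw), "->", len(cleaned))
```

Testing with `strip()` but keeping the original string means leading or trailing spaces inside real entries survive, which matches the goal of not altering the source text.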
- 📂 Load `.docx` file
- 🔍 Detect patterns:
  - `অধ্যায়:` → Chapter
  - Bold + spacing → Sub-section
  - `[number]` → Hadith
- 🧹 Clean data (remove blank lines)
- 📊 Export to Excel
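
The detection steps above could be sketched roughly as follows. This is an assumption-laden illustration, not the notebook's actual implementation: the names `read_docx`, `classify`, `CHAPTER_PREFIX`, and `HADITH_RE` are hypothetical, a paragraph counts as bold only when every non-empty run has bold set directly, and the "spacing" part of the sub-section rule is not modelled.

```python
import re

# Chapter marker named in the steps above.
CHAPTER_PREFIX = "অধ্যায়:"
# A bracketed number at the start of a line, in ASCII or Bangla digits: [1] or [১].
HADITH_RE = re.compile(r"^\[[0-9০-৯]+\]")


def read_docx(path):
    """Yield (text, is_bold) for each paragraph of a .docx file."""
    # python-docx is imported here so classify() can be used without it installed.
    from docx import Document

    for para in Document(path).paragraphs:
        runs = [r for r in para.runs if r.text.strip()]
        # Assumption: bold inherited from a style appears as None and is
        # treated as "not bold" in this sketch.
        yield para.text, bool(runs) and all(r.bold for r in runs)


def classify(paragraphs):
    """Sort (text, is_bold) pairs into chapters, sub-sections, and hadith."""
    result = {"chapters": [], "subsections": [], "hadith": []}
    for text, is_bold in paragraphs:
        stripped = text.strip()
        if not stripped:
            continue  # blank line: dropped
        if stripped.startswith(CHAPTER_PREFIX):
            result["chapters"].append(text)
        elif HADITH_RE.match(stripped):
            result["hadith"].append(text)
        elif is_bold:
            result["subsections"].append(text)
        # Any other paragraph is ignored in this sketch.
    return result


# Hypothetical in-memory sample, so the sketch runs without a .docx file.
sample = [
    (CHAPTER_PREFIX + " পিতা-মাতার সাথে সদ্ব্যবহার", True),
    ("", False),
    ("আমি মানুষকে তার পিতা-মাতার সাথে সদ্ব্যবহারের নির্দেশ প্রদান করেছি", True),
    ("[১] Full hadith text...", False),
]
groups = classify(sample)
print({key: len(values) for key, values in groups.items()})
```

With a real document, `classify(read_docx("book.docx"))` would produce the same three lists (the file name is a placeholder). The chapter check runs before the bold check because chapter headings may themselves be bold.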

`chapters.xlsx`

| id | name |
|---|---|
| 1 | অধ্যায়: পিতা-মাতার সাথে সদ্ব্যবহার |

`subsections.xlsx`

| id | name |
|---|---|
| 1 | আমি মানুষকে তার পিতা-মাতার সাথে সদ্ব্যবহারের নির্দেশ প্রদান করেছি |

`hadith.xlsx`

| id | hadith |
|---|---|
| 1 | Full hadith text... |
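
The `id` column in the tables above can be produced by numbering each extracted list from 1. The sketch below shows one way to do that and to write the result with pandas; the helper names `with_ids` and `write_xlsx` are hypothetical, and the notebook may build its sheets differently.

```python
def with_ids(values, column):
    """Attach sequential IDs starting at 1: [{'id': 1, column: value}, ...]."""
    return [{"id": i, column: value} for i, value in enumerate(values, start=1)]


def write_xlsx(rows, path):
    """Write the rows to an .xlsx file without the pandas index column."""
    # Imported here so with_ids() works even where pandas is not installed.
    # Writing .xlsx through pandas also requires openpyxl.
    import pandas as pd

    pd.DataFrame(rows).to_excel(path, index=False)


chapters = with_ids(["অধ্যায়: পিতা-মাতার সাথে সদ্ব্যবহার"], "name")
hadith = with_ids(["Full hadith text..."], "hadith")
print(chapters[0]["id"], hadith[0]["id"])
```

Calling `write_xlsx(chapters, "chapters.xlsx")` would then create the first of the three files. Passing `index=False` keeps pandas from adding its own row index next to the `id` column.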

- Python
- python-docx
- pandas
- openpyxl

`pip install python-docx pandas openpyxl`

Open and run:

`automatic excel sheet generat.ipynb`

You will get:
- `chapters.xlsx`
- `subsections.xlsx`
- `hadith.xlsx`

This project is not just academic; it solves real problems 👇
- Quickly convert books into structured datasets
- Save hours of manual typing
- Prepare data for research or ML models
- Extract hadith collections into a searchable format
- Build apps/websites using structured religious data
- Organize large texts easily
- Replace repetitive manual Excel work
- Avoid human errors and duplication
- Handle large documents efficiently
- Use as a preprocessing step for NLP tasks
- Convert unstructured text → structured dataset
- Build training data easily

👉 Manual data entry from books to Excel is slow, boring, and error-prone.

👉 This tool automates the whole process in seconds.

- ⏳ Saves time
- 🎯 Improves accuracy
- 📈 Makes data usable
- 🤝 Reduces repetitive work
Feel free to fork this repo and improve it 🚀
If you find this useful, give it a ⭐ on GitHub!