This repository contains the code and evaluation material for a Question-Answering (QA) system developed for the Geoportale Nazionale per l’Archeologia (GNA). The project was carried out during the internship in preparation for my Master’s thesis in Digital Humanities and Digital Knowledge at the University of Bologna.
The GNA QA system is a retrieval-augmented question-answering assistant that answers natural-language queries using the official GNA documentation. It integrates web crawling, document chunking, vector-based retrieval, and generation in a modular, scalable architecture.
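The vector-based retrieval step can be illustrated with a minimal, standard-library sketch. It is not the code in this repository: the toy 3-dimensional vectors stand in for real sentence embeddings, the brute-force scan stands in for the FAISS index, and the names `cosine`, `top_k`, and the chunk IDs are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, index, k=5):
    """Return the k (chunk_id, score) pairs most similar to the query."""
    scored = [(chunk_id, cosine(query_vec, vec)) for chunk_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors standing in for embeddings of three document chunks (assumption).
index = {
    "chunk-a": [1.0, 0.0, 0.0],
    "chunk-b": [0.7, 0.7, 0.0],
    "chunk-c": [0.0, 0.0, 1.0],
}

# The query vector points mostly along the first axis, so chunk-a ranks first.
results = top_k([1.0, 0.1, 0.0], index, k=2)
for chunk_id, score in results:
    print(f"{chunk_id}: {score:.3f}")
```

In a real deployment the exhaustive scan would be replaced by an approximate or exact nearest-neighbour index, which is the role the FAISS vector store plays here.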
- Focused crawling and sitemap generation from the GNA wiki operating manual
- Chunked document processing and metadata annotation
- Dense embedding generation using multilingual Sentence Transformers
- Retrieval-augmented generation
- Citation-aware prompting
- Streamlit-based user interface and feedback tracking
- Evaluation suite for retrieval metrics (Precision@k, Recall@k, MRR)
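As a rough illustration of how chunking with metadata and citation-aware prompting fit together, here is a minimal sketch using only the standard library. It is an assumption-laden example, not the logic in `create_chunks.py` or `rag_sys.py`: the `Chunk` fields, the word-window sizes, the prompt wording, and the example URL are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """A piece of a source page plus the metadata needed to cite it."""
    text: str
    source_url: str
    title: str
    position: int


def chunk_text(text, source_url, title, max_words=120, overlap=20):
    """Split text into overlapping word windows (sizes are illustrative)."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    chunks = []
    step = max_words - overlap
    for position, start in enumerate(range(0, len(words), step)):
        window = words[start:start + max_words]
        chunks.append(Chunk(" ".join(window), source_url, title, position))
        # Stop once a window has reached the end of the text.
        if start + max_words >= len(words):
            break
    return chunks


def build_prompt(question, retrieved):
    """Number each retrieved chunk so the model can cite it as [n]."""
    context = "\n\n".join(
        f"[{i}] {c.title} ({c.source_url})\n{c.text}"
        for i, c in enumerate(retrieved, start=1)
    )
    return (
        "Answer using only the numbered sources below and cite them as [n].\n"
        "If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


# Hypothetical page content and URL, for demonstration only.
page = " ".join(f"w{i}" for i in range(250))
chunks = chunk_text(page, "https://example.org/manual/page", "Example page")
prompt = build_prompt("What does the manual say?", chunks[:2])
```

Keeping the source URL and title on every chunk is what lets the prompt number its sources, so that an answer's `[n]` markers can be mapped back to pages of the manual.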
- generate_sitemap.py, create_chunks.py, vector_store.py: Knowledge base preparation
- rag_sys.py: Retrieval-Augmented Generation pipeline
- main.py: Streamlit application logic
- feedback_handling.py: Feedback management
- evaluate_retrieval.py: Evaluation framework
- create_test_data.py: Test set generation
- main_preprocess.py: Combined pipeline for sitemap, chunking, and vectorization
- data/: Document chunks, test datasets, metrics, logs
- feedback/: Local SQLite database for user feedback
- sitemap/: XML sitemap of the GNA website
- OCR/: OCR-related scripts
- .faiss_db/: FAISS vector store
- .streamlit/: Streamlit configuration files
- requirements.txt: Python dependencies
- packages.txt: Additional system requirements for Streamlit Cloud
Automated evaluation was performed using a synthetic test set of 400 domain-specific questions. Key metrics include:
- Precision@5
- Recall@5
- MRR (Mean Reciprocal Rank)
- Average retrieval time
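For reference, the ranking metrics listed above can be computed as in the following sketch. It uses the standard definitions and assumes each test question comes with a ranked list of unique retrieved chunk IDs and a set of relevant chunk IDs; the function names and the toy data are hypothetical and do not reproduce `evaluate_retrieval.py`.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved IDs that are relevant."""
    relevant = set(relevant)
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant IDs that appear in the top k."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant ID, or 0.0 if none is retrieved."""
    relevant = set(relevant)
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(runs, k=5):
    """Average the metrics over (retrieved, relevant) pairs, one per question."""
    n = len(runs)
    if n == 0:
        raise ValueError("runs must contain at least one question")
    return {
        f"precision@{k}": sum(precision_at_k(r, rel, k) for r, rel in runs) / n,
        f"recall@{k}": sum(recall_at_k(r, rel, k) for r, rel in runs) / n,
        "mrr": sum(reciprocal_rank(r, rel) for r, rel in runs) / n,
    }


# Two toy questions with made-up chunk IDs (not from the real test set).
runs = [
    (["a", "b", "c", "d", "e"], {"a", "c"}),
    (["x", "y", "z", "w", "v"], {"z", "q"}),
]
report = evaluate(runs, k=5)
for name, value in report.items():
    print(f"{name}: {value:.3f}")
```

Average retrieval time can be measured separately by wrapping each retrieval call with `time.perf_counter()` and averaging the elapsed values.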
This project was supervised and supported by Mario Caruso and Simone Persiani from BUP Solutions, whose guidance and technical insights were instrumental throughout the internship. I sincerely thank them for their time, encouragement, and valuable mentorship.