Intelligent search for your Calibre library
A privacy-first RAG system that brings semantic search to your personal research library. Built for scholars, researchers, and anyone with a serious book collection.
Quick Start • Features • Documentation • Roadmap • archilles.org
If you're a researcher, you know this problem: You've spent years building a carefully curated library in Calibre—hundreds or thousands of books, annotated and tagged. But when you need to find that specific argument about medieval trade routes, or compare how three different authors approach consciousness, you're stuck with keyword search. You know the passage exists. You just can't find it.
Archilles solves this.
It's a semantic search system built specifically for Calibre libraries. Instead of matching keywords, it understands meaning. Ask it "discussions of political legitimacy in early modern Europe" and it finds relevant passages—even if they never use those exact words.
Everything runs locally on your machine. Your library, your annotations, your research—they stay private. No cloud services, no data uploads, no subscriptions.
- Retrieval-Augmented Generation (RAG): Combines semantic embeddings with keyword search for best-of-both-worlds accuracy
- Model Context Protocol (MCP): Native integration with Claude and other AI assistants
- Calibre Integration: Works seamlessly with your existing library structure
- Local-First: LanceDB for vector storage, all processing happens on your hardware
Find books by meaning, not just keywords. Ask natural questions and get relevant passages from across your entire library.
All data stays on your machine. No cloud uploads, no telemetry, no tracking. Your research library remains private.
Seamless integration with Claude Desktop and other MCP-compatible tools. Your AI assistant can search your library directly.
Currently recommended: Claude Desktop (free tier available). Support for ChatGPT, OpenAI Codex, and other HTTP/SSE-based MCP clients is in active development.
Reads directly from your Calibre library structure. Extracts metadata, tags, comments, annotations, and custom fields automatically.
Searches beyond book text. Your Calibre comments, highlights, and notes are all indexed and searchable.
Filter by Calibre tags, combine searches across custom fields, leverage all the organization you've already done.
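To illustrate what reading from the Calibre library structure can involve, here is a minimal sketch. It is not Archilles's actual extractor. It assumes Calibre's usual `metadata.db` SQLite layout (tables `books`, `tags`, `books_tags_link`, `comments`), and the function name `books_with_tag` is hypothetical.

```python
import sqlite3


def books_with_tag(conn: sqlite3.Connection, tag: str) -> list[dict]:
    """Return id, title, relative path, and comment for every book carrying `tag`.

    Assumes Calibre's usual metadata.db schema. Illustrative helper only,
    not part of Archilles.
    """
    rows = conn.execute(
        """
        SELECT b.id, b.title, b.path, c.text
        FROM books AS b
        JOIN books_tags_link AS l ON l.book = b.id
        JOIN tags AS t ON t.id = l.tag
        LEFT JOIN comments AS c ON c.book = b.id
        WHERE t.name = ?
        ORDER BY b.id
        """,
        (tag,),
    ).fetchall()
    # Books without a comment still appear (LEFT JOIN); their comment is None.
    return [
        {"id": r[0], "title": r[1], "path": r[2], "comment": r[3]}
        for r in rows
    ]
```

When pointing this at a real library, it is safer to open `metadata.db` read-only so that Calibre's own data cannot be modified by accident.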
Built-in language detection for 75+ languages. Search in German, English, Latin, Greek, French—or all at once.
Combines semantic understanding (BGE-M3 embeddings) with keyword precision (BM25). Get the best of both approaches.
Enable a second-stage reranker (BAAI/bge-reranker-v2-m3) that scores each query-document pair for significantly improved relevance ranking. It falls back gracefully if your system has limited memory.
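The second-stage idea can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: the `rerank` function and the `PairScorer` type are hypothetical, and returning the first-stage order on failure is only one plausible reading of the fallback behavior.

```python
from typing import Callable, Optional, Sequence

# Hypothetical scorer type: takes (query, passage) pairs, returns one score each.
PairScorer = Callable[[Sequence[tuple[str, str]]], Sequence[float]]


def rerank(
    query: str,
    passages: Sequence[str],
    scorer: Optional[PairScorer],
    top_k: int = 5,
) -> list[str]:
    """Reorder first-stage results by cross-encoder score, best first.

    If no scorer is available, or scoring fails for lack of memory,
    the first-stage order is returned unchanged.
    """
    if scorer is None:
        return list(passages[:top_k])
    try:
        scores = scorer([(query, p) for p in passages])
    except (MemoryError, RuntimeError):
        # PyTorch typically surfaces out-of-memory conditions as RuntimeError.
        return list(passages[:top_k])
    # sorted() is stable, so tied passages keep their first-stage order.
    ranked = sorted(zip(scores, passages), key=lambda sp: sp[0], reverse=True)
    return [p for _, p in ranked[:top_k]]
```

With the sentence-transformers library, a scorer could plausibly be the `predict` method of a `CrossEncoder` loaded with BAAI/bge-reranker-v2-m3. That wiring is an assumption for illustration, not a statement about Archilles's code.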
| Archilles | Cloud RAG Services | Calibre Search | Other MCP Servers |
|---|---|---|---|
| Privacy-first, local processing | Your data uploaded to cloud | Basic keyword matching | Often single-purpose |
| Semantic + keyword hybrid | Usually semantic only | No semantic understanding | Varying capabilities |
| Calibre-native integration | Generic document handling | Built-in but limited | May not support Calibre |
| One-time setup, no subscriptions | Monthly fees, usage limits | Free (included) | Varies widely |
| Full control over your data | Terms of service apply | Your data, basic search | Depends on service |
Archilles gives you the semantic search capabilities of modern RAG systems while keeping everything under your control. If you've invested years in building and organizing your Calibre library, Archilles lets you get far more out of that investment.
- Python 3.11 or higher
- Calibre with your book library
- (Optional) Claude Desktop for MCP integration (recommended — free tier available; ChatGPT/Codex support in development)
# Clone the repository
git clone https://github.com/kasssandr/archilles.git
cd archilles
# Install dependencies
pip install -r requirements.txt
# Set your Calibre library path (optional - defaults to C:/Calibre Library)
# Windows PowerShell:
$env:CALIBRE_LIBRARY_PATH = "D:\Your-Calibre-Library"
# Linux/Mac:
export CALIBRE_LIBRARY_PATH="/path/to/your/Calibre Library"
# Index a single book
python scripts/rag_demo.py index "/path/to/Calibre Library/Author/Book/book.pdf"
# Check your index
python scripts/rag_demo.py stats
# Preview what would be indexed (dry run)
python scripts/batch_index.py --tag "Your-Tag" --dry-run
# Index all books with a specific Calibre tag
python scripts/batch_index.py --tag "History"
# Index with progress logging
python scripts/batch_index.py --tag "History" --log indexing.json
# Resume interrupted indexing (skip already indexed books)
python scripts/batch_index.py --tag "History" --skip-existing
# Hybrid search (recommended - combines semantic + keyword)
python scripts/rag_demo.py query "trade networks in medieval Europe"
# Filter by language
python scripts/rag_demo.py query "Rex" --language la
# Filter by tags
python scripts/rag_demo.py query "political theory" --tag-filter Philosophy History
# Export results to Markdown (for Joplin/Obsidian)
python scripts/rag_demo.py query "consciousness" --export results.md
Add to your Claude Desktop config (%APPDATA%\Claude\claude_desktop_config.json on Windows):
{
"mcpServers": {
"archilles": {
"command": "python",
"args": ["C:/Users/YOU/archilles/mcp_server.py"],
"env": {
"CALIBRE_LIBRARY_PATH": "D:/Your-Calibre-Library"
}
}
}
}
Then in Claude Desktop, you can use natural language:
- "Search my books for discussions of political legitimacy"
- "Find annotations about consciousness"
- "What did I highlight about medieval trade?"
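If your claude_desktop_config.json already lists other MCP servers, the archilles entry has to be merged into the existing "mcpServers" object rather than replacing it. A minimal sketch of that merge follows. The helper name `add_archilles_server` is hypothetical, and the paths are the same placeholders used in the config example above.

```python
import json


def add_archilles_server(config: dict, server_script: str, library_path: str) -> dict:
    """Return a copy of a Claude Desktop config with an "archilles" entry added.

    Existing entries under "mcpServers" and all other top-level keys are
    preserved. The input dict is not modified.
    """
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))
    servers["archilles"] = {
        "command": "python",
        "args": [server_script],
        "env": {"CALIBRE_LIBRARY_PATH": library_path},
    }
    merged["mcpServers"] = servers
    return merged


if __name__ == "__main__":
    # Example: a config that already contains another (made-up) server.
    existing = {"mcpServers": {"other": {"command": "node", "args": ["server.js"]}}}
    updated = add_archilles_server(
        existing,
        "C:/Users/YOU/archilles/mcp_server.py",
        "D:/Your-Calibre-Library",
    )
    print(json.dumps(updated, indent=2))
```

Claude Desktop generally needs a restart before it picks up a changed config file.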
Archilles builds a semantic index of your Calibre library that enables intelligent search:
┌─────────────┐
│ Calibre │ ← Your existing library (books, metadata, tags, comments)
└──────┬──────┘
│
▼
┌─────────────┐
│ Extractors │ ← PyMuPDF (primary), pdfplumber, EPUB, MOBI, DJVU...
└──────┬──────┘
│
▼
┌─────────────┐
│ LanceDB │ ← BGE-M3 embeddings + BM25 full-text (hybrid search)
└──────┬──────┘
│
▼
┌─────────────┐
│ Retriever │ ← RRF fusion + optional cross-encoder reranking
└──────┬──────┘
│
▼
┌─────────────┐
│ Service │ ← ArchillesService: central facade for all consumers
└──┬───┬───┬──┘
│ │ │
▼ ▼ ▼
MCP Web CLI ← Claude Desktop, Streamlit UI, command line
- Book text: Full-text extraction from 30+ formats (PDF via PyMuPDF, EPUB, MOBI, DJVU, etc.)
- Calibre metadata: Title, author, publisher, ISBN, language
- Tags: Your Calibre tags become searchable
- Comments: Calibre's comments field (HTML cleaned automatically)
- Custom fields: Any custom Calibre fields you've defined (reading status, projects, ratings, etc.)
- Annotations: Your Calibre highlights and notes (searchable via search_annotations)
- BGE-M3 embeddings: State-of-the-art multilingual semantic understanding (1024 dimensions, GPU)
- BM25 keyword search: Precision matching for exact terms (names, Latin phrases, technical terms)
- Reciprocal Rank Fusion (RRF): Intelligently combines semantic and keyword results (stage 1)
- Cross-encoder reranking (optional): BAAI/bge-reranker-v2-m3 rescores top candidates for more accurate ranking (stage 2, CPU)
- Section filtering: Exclude bibliography, index, and front matter noise from results
- Context expansion: Small-to-Big retrieval shows surrounding text for better understanding
- Smart boosting: Calibre comments and tag matches get priority in results
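The fusion step in stage 1 can be sketched as below. This is the standard Reciprocal Rank Fusion formula, not a copy of Archilles's code: the function name `rrf_fuse` is hypothetical, and the constant k = 60 is the value commonly used in the literature, assumed here rather than taken from the project.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of chunk IDs with Reciprocal Rank Fusion.

    Each ID scores the sum of 1 / (k + rank) over the lists it appears in,
    with rank starting at 1. IDs are returned best first.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)


if __name__ == "__main__":
    semantic = ["a", "b", "c"]  # e.g. order from the embedding search
    keyword = ["b", "d", "a"]   # e.g. order from BM25
    print(rrf_fuse([semantic, keyword]))  # ['b', 'a', 'd', 'c']
```

Because only ranks enter the formula, the embedding similarities and BM25 scores never have to be put on a common scale, which is the usual reason for choosing RRF in hybrid search.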
Archilles reads optional configuration from .archilles/config.json inside your Calibre library:
{
"enable_reranking": true,
"reranker_device": "cpu"
}
| Option | Default | Description |
|---|---|---|
| enable_reranking | false | Enable cross-encoder reranking (more accurate but slower; downloads ~560MB model on first use) |
| reranker_device | "cpu" | Device for reranker inference ("cpu" or "cuda"). CPU recommended when GPU runs BGE-M3 |
| rag_db_path | .archilles/rag_db | Custom path for the vector database |
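A loader for these options might look like the sketch below. It overlays whatever is in .archilles/config.json on the defaults from the table. `load_config` and `DEFAULTS` are hypothetical names, and the project's real loader may behave differently, for example in how it validates values.

```python
import json
from pathlib import Path

# Defaults as listed in the options table.
DEFAULTS = {
    "enable_reranking": False,
    "reranker_device": "cpu",
    "rag_db_path": ".archilles/rag_db",
}


def load_config(library_path: str) -> dict:
    """Read <library>/.archilles/config.json and overlay it on the defaults.

    A missing file simply yields the defaults. Hypothetical helper,
    not Archilles's actual loader.
    """
    config_file = Path(library_path) / ".archilles" / "config.json"
    config = dict(DEFAULTS)
    if config_file.is_file():
        config.update(json.loads(config_file.read_text(encoding="utf-8")))
    return config
```

Keeping the defaults in one place means a config file only needs to list the options that differ, as in the two-key example above.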
"Find all discussions of trade routes between Mediterranean and Northern Europe before 1500"
Archilles searches across your entire collection—Latin primary sources, German monographs, English translations—and surfaces relevant passages based on concepts, not just keywords.
"Trace the motif of unreliable narrators across these 50 twentieth-century novels"
Semantic search finds passages that demonstrate unreliable narration, even when the texts never use that term. Your annotations and comments help prioritize the most relevant examples.
"Compare views on the hard problem of consciousness across Chalmers, Dennett, and Nagel"
Hybrid search combines precise name matching with semantic understanding of philosophical concepts. Your Calibre tags help filter to relevant texts.
"Find theoretical discussions of modal harmony in Renaissance treatises"
Multilingual search works across Latin treatises, Italian commentary, and modern scholarship. Technical terms get exact matching while broader concepts use semantic search.
"Locate all references to customary law in medieval court records"
Search through your collection of primary sources and secondary literature simultaneously. Custom Calibre fields (like "source_type" or "jurisdiction") help organize results.
✅ Core functionality complete:
- Full-text indexing (30+ formats, PyMuPDF primary for PDFs)
- Semantic + keyword hybrid search (LanceDB native)
- Two-stage retrieval: RRF fusion + optional cross-encoder reranking
- Calibre metadata integration (tags, comments, custom fields)
- MCP server for Claude Desktop integration (production-ready)
- Multi-language support (75+ languages)
- BGE-M3 embeddings (multilingual, 1024 dimensions)
- OCR support for scanned PDFs (Tesseract)
- Hardware-adaptive indexing profiles
- Streamlit Web UI (experimental)
- Section-type filtering (exclude bibliography/index noise)
- Context expansion (Small-to-Big retrieval with window_text)
- Parent-child chunk hierarchy
- Service layer architecture (decoupled MCP/Web-UI/CLI)
- Page labels (printed page numbers) for citation accuracy
🚧 Planned improvements:
- Improved embedding models (domain-specific options)
- VLM-based OCR (LightOnOCR-2, GOT-OCR 2.0)
- Graph RAG (entity relationships)
🔮 On the horizon:
- Graph RAG (entity relationships, timeline views)
- Special Editions (discipline-specific extensions)
- Multi-library support
- Advanced citation export (BibTeX, Zotero)
Archilles is being developed as a modular platform. The core (what you're using now) will always be free and open source.
Special Editions will extend Archilles with discipline-specific features for researchers who need them:
- 📜 Historical Edition: Timeline visualization, prosopography, chronology-aware search
- 📖 Literary Edition: Motif tracking, intertextual connections, narrative structure analysis
- ⚖️ Legal Edition: Citation networks, precedent tracking, jurisdiction-aware search
- 🎵 Musical Edition: Score analysis integration, theoretical terminology, composer networks
These editions are commercial add-ons to support ongoing development. The core will remain MIT licensed and fully functional.
- Issues: GitHub Issues for bugs and feature requests
- Discussions: GitHub Discussions for questions and ideas
- Documentation: Full documentation for guides and troubleshooting
Archilles is open source (MIT License). Contributions are welcome!
- Found a bug? Open an issue
- Want to add a feature? Check CONTRIBUTING.md
- Improved documentation? Pull requests appreciated
We're actively seeking beta testers from diverse research disciplines. If you have a substantial Calibre library (500+ books) and want to help shape Archilles, join our beta program.
Code of Conduct: We're committed to building a welcoming community. See CODE_OF_CONDUCT.md.
- 📖 Installation Guide – Detailed setup instructions
- 📘 User Guide – How to use Archilles effectively
- 🏗️ Architecture – Technical deep dive
- 🔌 MCP Integration – Connect Archilles to Claude
- ❓ FAQ – Frequently asked questions
- 🔧 Troubleshooting – Common issues and solutions
Archilles is released under the MIT License. Free to use, modify, and distribute.
Archilles is local-first software. We collect no telemetry, no analytics, no usage data. Your library stays on your machine.
You are responsible for ensuring your use of Archilles complies with copyright law in your jurisdiction. Archilles is a tool for searching your own legally acquired library.
Archilles is built on the shoulders of giants:
- Calibre by Kovid Goyal – The gold standard for e-book library management
- LanceDB – High-performance vector database with native hybrid search
- Model Context Protocol by Anthropic – Standardized AI assistant integration
- BGE-M3 – State-of-the-art multilingual embeddings
- Anthropic Claude – AI assistant that respects user privacy
Inspired by NotebookLM, Zotero, and decades of digital humanities research.
- 🌐 Website: archilles.org • archilles.de
- 💻 GitHub: github.com/kasssandr/archilles
- 💬 Discussions: GitHub Discussions
- 📧 Contact: hello@archilles.org
Built for researchers, by a researcher.
Archilles: Because your library deserves better than keyword search.