Archilles is a semantic search system for Calibre libraries. It uses RAG (Retrieval-Augmented Generation) technology to let you search your books by meaning, not just keywords. Ask a natural-language question and get relevant passages with full citations — from across your entire library, in any language.
Yes. Archilles core is MIT licensed and free to use. Future "Special Editions" (discipline-specific extensions) may be offered as commercial add-ons. The core will always remain free and open source.
Yes. Everything runs locally on your machine. The only network access is downloading the BGE-M3 embedding model on first run (~2.2 GB, cached permanently afterwards). Once downloaded, Archilles works fully offline.
No. Zero telemetry, no analytics, no tracking of any kind. Your library and your research stay on your machine.
- Python 3.11 or higher
- 8 GB RAM minimum (16 GB recommended for large libraries)
- ~5 GB free disk space (model ~2.2 GB + index storage)
- Calibre with an existing library
See INSTALLATION.md for the full guide.
Windows is the primary supported platform (tested on Windows 11). macOS and Linux should work — the code is cross-platform — but are not officially tested. If you run into platform-specific issues, please open a GitHub issue.
Yes. Archilles automatically detects Apple MPS (Metal Performance Shaders) and uses it for GPU-accelerated indexing on Apple Silicon Macs. No configuration needed.
Yes. The command-line interface (scripts/rag_demo.py) works completely standalone. The MCP integration for Claude Desktop is optional — it's the recommended way to use Archilles, but not required.
It depends heavily on the format, length, and content of each book. Times on a 4 GB NVIDIA GPU (minimal profile):
| Book type | Typical range |
|---|---|
| EPUB / MOBI (any length) | 1–5 minutes |
| PDF, standard text (200–400 pages) | 5–15 minutes |
| PDF, dense academic content (400–600 pages) | 15–30 minutes |
| CPU only (no GPU) | roughly 3–5× slower than GPU |
Apple Silicon (MPS) is significantly faster than CPU-only. Modern chips (M2 Pro, M3, M3 Max) are broadly comparable to a modest NVIDIA GPU. Older Apple Silicon (M1, M2 Air) falls between GPU and CPU speed.
The first run also downloads the BGE-M3 model (~2.2 GB). After that, only the book processing time applies.
For a large library, batch indexing runs unattended in the background over days or weeks. Use --skip-existing to resume after any interruption.
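For a rough back-of-envelope estimate of a full batch run, you can multiply book counts by per-book times. The averages below are illustrative assumptions picked from the middle of the ranges in the table above, not measured values:

```python
# Rough batch-indexing time estimate. The per-book averages are
# illustrative assumptions drawn from the ranges quoted above,
# not benchmarks for any specific library.
def estimate_hours(book_counts: dict[str, int], avg_minutes: dict[str, float]) -> float:
    """Total estimated indexing time in hours."""
    total_minutes = sum(book_counts[k] * avg_minutes[k] for k in book_counts)
    return total_minutes / 60

# Example: 1,500 EPUBs plus 500 dense academic PDFs.
counts = {"epub": 1500, "pdf_dense": 500}
minutes = {"epub": 3.0, "pdf_dense": 20.0}   # mid-range assumptions
print(f"{estimate_hours(counts, minutes):.0f} hours")  # ≈ 242 hours of machine time
```

At roughly ten days of unattended machine time for that hypothetical library, the resumable background workflow above is the practical approach.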
No. Archilles reads from Calibre but never writes to it. Your Calibre library and its metadata.db are strictly read-only.
Nothing is lost. LanceDB writes are atomic per chunk. Use --skip-existing to resume where you left off:
```bash
python scripts/batch_index.py --tag "History" --skip-existing
```

Not currently. Archilles requires the Calibre library structure (it reads metadata directly from Calibre's metadata.db).
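The resume behaviour described above amounts to a set difference over already-indexed book IDs. This is an illustrative sketch of the idea, not Archilles' actual implementation:

```python
# Illustrative sketch of --skip-existing resume logic (not the real
# Archilles code): books already present in the index are skipped, so
# an interrupted run restarts without redoing finished work.
def books_to_index(all_book_ids: set[int], indexed_ids: set[int]) -> set[int]:
    return all_book_ids - indexed_ids

library = {1, 2, 3, 4, 5}
already_done = {1, 2, 3}   # chunks written atomically before the interruption
print(sorted(books_to_index(library, already_done)))  # [4, 5]
```

Because writes are atomic per chunk, a book is either cleanly in the index or cleanly absent, which is what makes this skip-and-resume pattern safe.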
PDF (primary), EPUB, MOBI, DJVU, HTML, TXT, and more — 30+ formats via PyMuPDF and format-specific extractors. Scanned PDFs are supported via Tesseract OCR (Linux requires tesseract-ocr installed separately).
- Hybrid (default): Best for most queries — combines semantic and keyword search
- Semantic: Better for broad concepts, thematic searches, and cross-language queries
- Keyword: Best for exact names, dates, technical terms, and Latin phrases
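Conceptually, hybrid mode fuses the semantic and keyword result lists into one ranking. A common way to do this is reciprocal-rank fusion (RRF); whether Archilles uses RRF specifically is an assumption here, so treat this as a generic sketch of list fusion rather than the project's actual algorithm:

```python
# Generic reciprocal-rank fusion (RRF) sketch of hybrid search.
# Each document's fused score is the sum of 1/(k + rank) over every
# ranked list it appears in; items ranked well in both lists win.
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranked in rankings:
        for rank, doc in enumerate(ranked, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

semantic = ["chunk_a", "chunk_b", "chunk_c"]
keyword  = ["chunk_c", "chunk_a", "chunk_d"]
print(rrf([semantic, keyword]))  # chunk_a first: ranked highly in both lists
```

This is why hybrid is a good default: a passage only needs to do reasonably well on either axis, meaning or exact wording, to surface near the top.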
75+ languages with automatic detection. BGE-M3 is multilingual — you can search in German, English, Latin, Greek, French, or any combination without changing settings. Use --language la to restrict results to a specific language.
Common reasons:
- The book hasn't been indexed yet — run rag_demo.py stats to see what's in the index
- Wrong language filter — try removing --language
- The query is too specific — try hybrid mode or broaden the phrasing
- The passage is in a bibliography, index, or front matter — these are excluded by default to reduce noise
Yes. Annotations (Calibre highlights and notes) and Calibre comments are indexed as separate chunk types and searchable via search_annotations in the MCP interface, or via query --mode hybrid in the CLI.
Not currently. Multi-library support is planned for a future release.
BGE-M3 (BAAI/bge-m3) — state-of-the-art multilingual embeddings with 1024 dimensions. It handles 75+ languages and performs well on both short and long texts.
Yes, automatically:
- NVIDIA CUDA: detected and used automatically if PyTorch with CUDA is installed
- Apple Silicon MPS: detected and used automatically on M1/M2/M3/M4 Macs — faster than CPU, roughly comparable to a modest NVIDIA GPU on modern chips (M2 Pro, M3 and above)
- CPU fallback: always available if no GPU is detected; indexing is slower but search quality is identical
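Automatic detection of this kind typically follows a CUDA → MPS → CPU preference order. The snippet below is a generic PyTorch-style sketch of that pattern, not Archilles' actual code; it degrades gracefully to "cpu" when PyTorch is absent:

```python
# Generic device-selection sketch (CUDA -> MPS -> CPU), the common
# PyTorch pattern. Not Archilles' actual implementation.
def pick_device() -> str:
    try:
        import torch
    except ImportError:
        return "cpu"                          # no PyTorch: CPU-only
    if torch.cuda.is_available():
        return "cuda"                         # NVIDIA GPU
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"                          # Apple Silicon GPU
    return "cpu"

print(pick_device())
```

Whichever branch is taken, only indexing speed changes; the embeddings, and therefore search quality, are the same on every device.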
By default in .archilles/rag_db/ inside your Calibre library folder. The path is configurable via .archilles/config.json:
```json
{ "rag_db_path": "/custom/path/to/rag_db" }
```

Yes — one subfolder. Archilles creates .archilles/ inside your Calibre library directory and stores the vector database there (rag_db/). Your Calibre library files and metadata.db are never modified.
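Resolving that setting with a fallback to the default location could look like the following. This is an illustrative sketch; only the rag_db_path key is documented above, and the helper name is hypothetical:

```python
import json
from pathlib import Path

# Illustrative config resolution: use rag_db_path from
# .archilles/config.json if present, otherwise fall back to the
# default .archilles/rag_db inside the Calibre library folder.
def rag_db_path(library_dir: str) -> Path:
    config_file = Path(library_dir) / ".archilles" / "config.json"
    default = Path(library_dir) / ".archilles" / "rag_db"
    if config_file.exists():
        config = json.loads(config_file.read_text(encoding="utf-8"))
        return Path(config.get("rag_db_path", default))
    return default

print(rag_db_path("/books/CalibreLibrary"))
```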
Backup and sync implications: The .archilles/rag_db/ folder can grow to several gigabytes for large libraries. If you back up or sync your Calibre library (Dropbox, OneDrive, NAS mirror, Time Machine), this folder will be included. You may want to:
- Exclude it from sync if you don't need the index on other devices — it can always be rebuilt by re-indexing
- Include it in backups if re-indexing thousands of books would be costly in time
The index is fully reproducible from your books at any time. If it is lost or deleted, simply run batch_index.py again with --skip-existing to rebuild.
Yes, optionally. Enable it in .archilles/config.json:
```json
{ "enable_reranking": true }
```

This downloads an additional ~560 MB model (bge-reranker-v2-m3) and improves result ranking quality. Disabled by default.
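Reranking re-scores a small set of retrieval candidates with a heavier model and reorders them. The sketch below shows the shape of that step with a stand-in word-overlap scorer; the real reranker is a cross-encoder (bge-reranker-v2-m3), and score_fn here is purely hypothetical:

```python
# Sketch of a rerank step: take the top-k retrieval candidates and
# reorder them by a relevance score from a second-stage model.
def rerank(query: str, candidates: list[str], score_fn, top_k: int = 5) -> list[str]:
    scored = [(score_fn(query, c), c) for c in candidates[:top_k]]
    return [c for _, c in sorted(scored, reverse=True)]

# Stand-in scorer: shared-word count. A real reranker would run the
# query/passage pair through a cross-encoder instead.
def overlap_score(query: str, passage: str) -> int:
    return len(set(query.lower().split()) & set(passage.lower().split()))

hits = ["the fall of rome", "gardening tips", "rome and its fall"]
print(rerank("fall of rome", hits, overlap_score))
```

Because only a handful of candidates are re-scored, the extra model cost is paid per query, not per book, which is why it can be toggled on without re-indexing.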
No. All processing — text extraction, embedding, search — happens locally on your machine. Nothing is uploaded anywhere.
Indexing for personal research use is generally covered by fair use / fair dealing in most jurisdictions. You are responsible for compliance with copyright law in your jurisdiction. Archilles is designed for searching your own legally acquired library.
The software is MIT licensed (free for commercial use). Compliance with copyright law for the content you index is your responsibility.
- Improved embedding models (domain-specific options)
- VLM-based OCR for better scanned PDF support
- Graph RAG (entity relationships)
See docs/ROADMAP.md for the full roadmap.
Planned discipline-specific commercial extensions (Historical, Literary, Legal, Musical). The core will remain MIT licensed. See EDITIONS.md for details.
Yes. See CONTRIBUTING.md for how to report bugs, request features, and submit pull requests.
For specific error messages and common issues, see TROUBLESHOOTING.md.
Didn't find your answer? Ask in GitHub Discussions or open an issue.