|
1 | 1 | # clio-agentic-search |
2 | 2 |
|
3 | | -`clio-agentic-search` is a hybrid retrieval engine for scientific computing corpora. It indexes |
4 | | -documents into namespace-specific backends and supports lexical, vector, graph, metadata, and |
5 | | -scientific-operator retrieval in one pipeline. |
6 | | - |
7 | | -## Current scope |
8 | | - |
9 | | -- Multi-namespace registry with runtime/auth config bundles. |
10 | | -- Connectors: |
11 | | - - `local_fs` (filesystem + DuckDB persistence) |
12 | | - - `object_s3` (in-memory S3-compatible object store + DuckDB) |
13 | | - - `vector_qdrant` (in-memory vector store) |
14 | | - - `graph_neo4j` (in-memory graph traversal) |
15 | | - - `kv_redis` (in-memory log stream retrieval) |
16 | | -- Scientific retrieval operators: |
17 | | - - numeric range (`unit`, `min`, `max`) |
18 | | - - unit matching (`unit`, optional `value`) |
19 | | - - formula targeting (normalized signatures) |
20 | | -- Background indexing job API with cancellation tokens and per-namespace serialized execution. |
21 | | -- Retry wrappers for connect/index operations with exponential backoff. |
22 | | -- Telemetry: |
23 | | - - tracing (`NoopTracer` by default, OpenTelemetry when enabled) |
24 | | - - Prometheus-style metrics export at `/metrics` |
| 3 | +[License: BSD-3-Clause](https://opensource.org/licenses/BSD-3-Clause) |
| 4 | +[PyPI: clio-kit](https://pypi.org/project/clio-kit/) |
| 5 | +[Quality control workflow](https://github.com/iowarp/clio-kit/actions/workflows/quality_control.yml) |
| 6 | +[Python](https://www.python.org/) |
| 7 | + |
| 8 | +> **Status: Experimental** — API surface and storage format may change between minor releases. Suitable for research and evaluation; not yet recommended for production workloads. |
| 9 | +
|
| 10 | +Part of [**CLIO Kit**](https://github.com/iowarp/clio-kit) — the IoWarp platform's tooling layer for AI agents. |
| 11 | + |
| 12 | +--- |
| 13 | + |
| 14 | +Hybrid retrieval engine for scientific computing corpora. Indexes documents into namespace-specific backends and supports lexical (BM25), vector, graph, metadata, and scientific-operator retrieval in one pipeline. DuckDB storage, FastAPI server, async job queue, OpenTelemetry tracing, Prometheus metrics. |
25 | 15 |
|
26 | 16 | ## Quick start |
27 | 17 |
|
28 | 18 | ```bash |
29 | | -UV_CACHE_DIR=.uv-cache uv sync --all-groups |
30 | | -UV_CACHE_DIR=.uv-cache uv run clio --help |
31 | | -UV_CACHE_DIR=.uv-cache uv run clio index --namespace local_fs |
32 | | -UV_CACHE_DIR=.uv-cache uv run clio query --namespace local_fs --q "pressure between 190 and 360 kPa" |
33 | | -UV_CACHE_DIR=.uv-cache uv run uvicorn clio_agentic_search.api.app:app --reload |
| 19 | +# Via the CLIO Kit launcher (recommended) |
| 20 | +uvx clio-kit search serve # Start the API server |
| 21 | +uvx clio-kit search query --namespace local_fs --q "pressure between 190 and 360 kPa" |
| 22 | +uvx clio-kit search index --namespace local_fs |
| 23 | +uvx clio-kit search list --namespace local_fs |
34 | 24 | ``` |
35 | 25 |
|
36 | | -## API |
| 26 | +### Development mode |
37 | 27 |
|
38 | | -- `GET /health`: liveness probe. |
39 | | -- `GET /version`: package version. |
40 | | -- `GET /documents?namespace=<ns>`: list indexed documents and chunk counts. |
41 | | -- `POST /query`: run retrieval and return citations + trace events. |
42 | | -- `POST /jobs/index`: submit async index job (`namespace`, `full_rebuild`). |
43 | | -- `GET /jobs/{job_id}`: fetch job status/result. |
44 | | -- `DELETE /jobs/{job_id}`: request cancellation. |
45 | | -- `GET /metrics`: Prometheus text exposition format. |
| 28 | +```bash |
| 29 | +cd clio-agentic-search |
| 30 | +uv sync --all-extras --dev |
| 31 | +uv run clio serve # Start dev server with hot reload |
| 32 | +uv run clio query --namespace local_fs --q "pressure > 200 kPa" |
| 33 | +uv run clio index --namespace local_fs |
| 34 | +``` |
| 35 | + |
| 36 | +## Features |
| 37 | + |
| 38 | +- **Multi-namespace registry** with runtime/auth config bundles |
| 39 | +- **Connectors**: filesystem + DuckDB (`local_fs`), S3 object store, Qdrant vector store, Neo4j graph, Redis KV log |
| 40 | +- **Scientific retrieval operators**: numeric range (`unit`, `min`, `max`), unit matching, formula targeting (normalized signatures) |
| 41 | +- **Background indexing** job API with cancellation tokens and per-namespace serialized execution |
| 42 | +- **Retry/backoff** wrappers for connect/index operations |
| 43 | +- **Telemetry**: OpenTelemetry tracing (opt-in), Prometheus metrics at `/metrics` |
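The numeric range operator listed above takes `unit`, `min`, and `max`. As a rough illustration of what such a filter does, here is a minimal sketch that assumes measurements have already been extracted from chunks as value/unit pairs. `Measurement` and `numeric_range` are hypothetical names for this example, not the package's API; the inclusive bounds and the absence of unit conversion are also assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    """Hypothetical record for one numeric value extracted from a chunk."""

    chunk_id: str
    value: float
    unit: str


def numeric_range(measurements, unit, minimum=None, maximum=None):
    """Keep measurements in `unit` whose value lies in [minimum, maximum].

    `minimum`/`maximum` stand in for the operator's `min`/`max` parameters
    (renamed here to avoid shadowing Python builtins). Bounds are inclusive
    by assumption, and no unit conversion is attempted.
    """
    hits = []
    for m in measurements:
        if m.unit != unit:
            continue
        if minimum is not None and m.value < minimum:
            continue
        if maximum is not None and m.value > maximum:
            continue
        hits.append(m)
    return hits


sample = [
    Measurement("c1", 150.0, "kPa"),
    Measurement("c2", 250.0, "kPa"),
    Measurement("c3", 250.0, "psi"),
    Measurement("c4", 360.0, "kPa"),
]
matches = numeric_range(sample, unit="kPa", minimum=190, maximum=360)
print([m.chunk_id for m in matches])  # ['c2', 'c4']
```

A query such as "pressure between 190 and 360 kPa" would map onto `unit="kPa"`, `min=190`, `max=360` under this reading.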
| 44 | + |
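The retry/backoff wrappers mentioned in the feature list follow a common pattern: retry a failing connect or index call, waiting longer after each failure. The sketch below shows that general pattern only. `retry_with_backoff`, its parameters, and the default delays are assumptions made for illustration and are not taken from the package.

```python
import time


def retry_with_backoff(operation, attempts=4, base_delay=0.1, factor=2.0,
                       sleep=time.sleep):
    """Call `operation()` until it succeeds or `attempts` is exhausted.

    After the n-th failure (n starting at 0) the wrapper waits
    base_delay * factor**n seconds. The last failure is re-raised.
    `sleep` is injectable so the delays can be observed without waiting.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * factor ** attempt)


# Demo: a hypothetical connect call that fails twice, then succeeds.
calls = {"count": 0}
delays = []


def flaky_connect():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("backend not ready")
    return "connected"


result = retry_with_backoff(flaky_connect, sleep=delays.append)
print(result, delays)  # connected [0.1, 0.2]
```

A production wrapper would usually catch only transient error types and cap the maximum delay; both are left out here to keep the sketch short.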
| 45 | +## API endpoints |
| 46 | + |
| 47 | +| Method | Path | Description | |
| 48 | +|--------|------|-------------| |
| 49 | +| `GET` | `/health` | Liveness probe | |
| 50 | +| `GET` | `/version` | Package version | |
| 51 | +| `GET` | `/documents?namespace=<ns>` | List indexed documents and chunk counts | |
| 52 | +| `POST` | `/query` | Run retrieval, return citations + trace events | |
| 53 | +| `POST` | `/jobs/index` | Submit async index job | |
| 54 | +| `GET` | `/jobs/{job_id}` | Fetch job status/result | |
| 55 | +| `DELETE` | `/jobs/{job_id}` | Request cancellation | |
| 56 | +| `GET` | `/metrics` | Prometheus text exposition format | |
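To show how the job endpoints in the table fit together, here is a minimal sketch that builds the HTTP requests with the Python standard library. The base URL is an assumption (adjust it to wherever the server is listening), and the `namespace` and `full_rebuild` body fields come from the earlier README text for `POST /jobs/index`. The helper names are hypothetical, and response schemas are not documented here, so the sketch stops at request construction.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "http://localhost:8000"  # assumption: local server address


def index_job_request(namespace, full_rebuild=False):
    """POST /jobs/index: submit an async index job."""
    body = json.dumps(
        {"namespace": namespace, "full_rebuild": full_rebuild}
    ).encode("utf-8")
    return Request(
        f"{BASE_URL}/jobs/index",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def job_status_request(job_id):
    """GET /jobs/{job_id}: fetch job status/result."""
    return Request(f"{BASE_URL}/jobs/{job_id}", method="GET")


def cancel_job_request(job_id):
    """DELETE /jobs/{job_id}: request cancellation."""
    return Request(f"{BASE_URL}/jobs/{job_id}", method="DELETE")


def documents_request(namespace):
    """GET /documents?namespace=<ns>: list indexed documents."""
    query = urlencode({"namespace": namespace})
    return Request(f"{BASE_URL}/documents?{query}", method="GET")


req = index_job_request("local_fs", full_rebuild=True)
print(req.get_method(), req.full_url)
# With a server running, send it via urllib.request.urlopen(req).
```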
46 | 57 |
|
47 | 58 | ## CLI commands |
48 | 59 |
|
49 | | -- `clio query` |
50 | | -- `clio index` |
51 | | -- `clio list` |
52 | | -- `clio seed` |
53 | | -- `clio serve` |
| 60 | +| Command | Description | |
| 61 | +|---------|-------------| |
| 62 | +| `clio query` | Run retrieval queries against a namespace | |
| 63 | +| `clio index` | Index documents into a namespace | |
| 64 | +| `clio list` | List indexed documents | |
| 65 | +| `clio seed` | Seed sample data for testing | |
| 66 | +| `clio serve` | Start the FastAPI server | |
54 | 67 |
|
55 | 68 | ## Environment variables |
56 | 69 |
|
57 | | -- `CLIO_LOCAL_ROOT` (default `.`) |
58 | | -- `CLIO_STORAGE_PATH` (default `.clio-agentic-search.duckdb`) |
59 | | -- `CLIO_CORS_ORIGINS` (default `*`) |
60 | | -- `CLIO_OTEL_ENABLED` (`1`/`true`/`yes` to enable OTel tracer) |
61 | | -- `OTEL_EXPORTER_OTLP_ENDPOINT` (default `http://localhost:4317`) |
62 | | -- `CLIO_ANN_BACKEND` (`exact` default, `hnsw` when `clio-agentic-search[ann]` installed) |
63 | | -- `CLIO_CACHE_SHARDS` (default `16`, vector index shard count) |
64 | | -- `CLIO_VECTOR_WARMUP_ASYNC` (default `1`, background vector index warmup on connect) |
65 | | -- `CLIO_INDEX_DOCUMENT_BATCH_SIZE` (default `32`, batched document bundle writes per index pass) |
66 | | -- `CLIO_LEXICAL_BATCH_SIZE` (default `50000`, lexical posting write batch size) |
67 | | -- `CLIO_LEXICAL_DF_PRUNE_THRESHOLD` (default `0.98`, prune tokens above this chunk-frequency ratio) |
68 | | -- `CLIO_LEXICAL_DF_PRUNE_MIN_CHUNKS` (default `200`, minimum indexed chunks before DF pruning applies) |
69 | | -- `CLIO_LEXICAL_MAX_TOKENS_PER_CHUNK` (default `96`, keep top-frequency tokens per chunk) |
70 | | -- `CLIO_LEXICAL_PRUNE_STOPWORDS` (default `1`, remove built-in stopwords from lexical postings) |
71 | | -- `CLIO_LEXICAL_POSTINGS_COMPRESSION` (`none` default, `gzip` for compressed staging during indexing) |
72 | | -- `CLIO_OBJECT_*`, `CLIO_VECTOR_*`/`CLIO_QDRANT_*`, `CLIO_GRAPH_*`/`CLIO_NEO4J_*`, |
73 | | - `CLIO_KV_*`/`CLIO_REDIS_*` for namespace-specific connector config |
| 70 | +| Variable | Default | Description | |
| 71 | +|----------|---------|-------------| |
| 72 | +| `CLIO_LOCAL_ROOT` | `.` | Root directory for local filesystem connector | |
| 73 | +| `CLIO_STORAGE_PATH` | `.clio-agentic-search.duckdb` | DuckDB database path | |
| 74 | +| `CLIO_CORS_ORIGINS` | `*` | Allowed CORS origins | |
| 75 | +| `CLIO_OTEL_ENABLED` | `false` | Enable OpenTelemetry tracing (`1`/`true`/`yes`) | |
| 76 | +| `CLIO_ANN_BACKEND` | `exact` | ANN backend (`hnsw` when `[ann]` extra installed) | |
| 77 | +| `CLIO_CACHE_SHARDS` | `16` | Vector index shard count | |
| 78 | +| `CLIO_INDEX_DOCUMENT_BATCH_SIZE` | `32` | Documents per index batch | |
| 79 | +| `CLIO_LEXICAL_BATCH_SIZE` | `50000` | Lexical posting write batch size | |
| 80 | + |
| 81 | +See source for additional `CLIO_LEXICAL_*`, `CLIO_OBJECT_*`, `CLIO_VECTOR_*`, `CLIO_GRAPH_*`, `CLIO_KV_*` variables. |
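For reference, the defaults in the table can be read as follows. This is a sketch of how such variables are typically resolved, not the package's actual configuration loader: `load_settings` and the dictionary keys are hypothetical, and treating the `CLIO_OTEL_ENABLED` values case-insensitively is an assumption.

```python
import os

# Values the table lists as enabling OpenTelemetry tracing.
TRUTHY = {"1", "true", "yes"}


def load_settings(env=None):
    """Resolve the documented variables, falling back to the table defaults."""
    env = os.environ if env is None else env
    return {
        "local_root": env.get("CLIO_LOCAL_ROOT", "."),
        "storage_path": env.get("CLIO_STORAGE_PATH", ".clio-agentic-search.duckdb"),
        "cors_origins": env.get("CLIO_CORS_ORIGINS", "*"),
        # Assumption: comparison ignores case and surrounding whitespace.
        "otel_enabled": env.get("CLIO_OTEL_ENABLED", "false").strip().lower() in TRUTHY,
        "ann_backend": env.get("CLIO_ANN_BACKEND", "exact"),
        "cache_shards": int(env.get("CLIO_CACHE_SHARDS", "16")),
        "index_document_batch_size": int(env.get("CLIO_INDEX_DOCUMENT_BATCH_SIZE", "32")),
        "lexical_batch_size": int(env.get("CLIO_LEXICAL_BATCH_SIZE", "50000")),
    }


settings = load_settings({"CLIO_OTEL_ENABLED": "yes", "CLIO_CACHE_SHARDS": "8"})
print(settings["otel_enabled"], settings["cache_shards"], settings["ann_backend"])
# True 8 exact
```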
74 | 82 |
|
75 | 83 | ## Quality checks |
76 | 84 |
|
77 | 85 | ```bash |
78 | | -UV_CACHE_DIR=.uv-cache uv run ruff check . |
79 | | -UV_CACHE_DIR=.uv-cache uv run ruff format --check . |
80 | | -UV_CACHE_DIR=.uv-cache uv run mypy src/ |
81 | | -UV_CACHE_DIR=.uv-cache uv run pytest --ignore=tests/benchmarks |
82 | | -UV_CACHE_DIR=.uv-cache uv run python -m clio_agentic_search.evals.quality_gate |
| 86 | +uv run ruff check . |
| 87 | +uv run ruff format --check . |
| 88 | +uv run mypy src/ |
| 89 | +uv run pytest --ignore=tests/benchmarks -v |
| 90 | +uv run python -m clio_agentic_search.evals.quality_gate |
83 | 91 | ``` |
84 | 92 |
|
85 | | -## Benchmark note |
| 93 | +## Benchmarks |
86 | 94 |
|
87 | | -`tests/benchmarks/test_throughput.py` enforces p95 latency for smaller corpora by default. |
88 | | -For the 10k-chunk p95 assertion, enable hardware-specific enforcement with: |
| 95 | +`tests/benchmarks/test_throughput.py` enforces p95 latency for smaller corpora by default. For 10k-chunk SLO enforcement: |
89 | 96 |
|
90 | 97 | ```bash |
91 | | -CLIO_ENFORCE_LARGE_SLO=1 UV_CACHE_DIR=.uv-cache uv run pytest tests/benchmarks/ -v --benchmark-disable -k "10000_chunks" |
| 98 | +CLIO_ENFORCE_LARGE_SLO=1 uv run pytest tests/benchmarks/ -v --benchmark-disable -k "10000_chunks" |
92 | 99 | ``` |
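For readers unfamiliar with the metric, p95 latency is the value below which 95% of measured latencies fall. The sketch below uses the nearest-rank definition. It illustrates the metric only; how `test_throughput.py` computes p95 and which threshold it enforces are not stated here, and the function name is hypothetical.

```python
def p95(latencies_ms):
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    # Integer ceiling of 0.95 * n, avoiding floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


samples = [float(i) for i in range(1, 101)]  # 1.0 .. 100.0 ms
print(p95(samples))  # 95.0
```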