
feat: add Milvus storage backend (Lite / self-hosted / Zilliz Cloud)#721

Open
zc277584121 wants to merge 1 commit into doobidoo:main from zc277584121:milvus-storage-backend

Conversation

@zc277584121

Summary

Adds a new milvus storage backend to the Strategy Pattern under storage/, alongside sqlite-vec, cloudflare, and hybrid. Implemented against the existing MemoryStorage abstract interface. Supports three deployment modes from the same code path — Milvus Lite (local .db file, zero deps, good for scripts and tests), self-hosted Milvus via Docker (recommended for MCP servers and long-lived services), and Zilliz Cloud (managed, at-scale).

What's in this PR

  • src/mcp_memory_service/storage/milvus.py (new) — MilvusMemoryStorage built on pymilvus.MilvusClient. AUTOINDEX + COSINE, dynamic fields for tags/metadata, one MilvusClient per storage instance, asyncio.to_thread to run the sync pymilvus client from async handlers. All functions graded ≤ B (cyclomatic complexity ≤ 8).
  • Factory wiring — storage/factory.py, server_impl.py (eager + lazy init paths), cli/utils.py, web/api/configuration.py, utils/db_utils.py — four dispatch points in the codebase all now recognize milvus.
  • Config — MCP_MILVUS_URI, MCP_MILVUS_TOKEN, MCP_MILVUS_COLLECTION_NAME. The MCP_ prefix is intentional; pymilvus reserves the unprefixed MILVUS_URI and validates it at import time.
  • Optional dependency group — pyproject.toml adds [project.optional-dependencies].milvus = [pymilvus, milvus-lite, <conditional setuptools<81 on Py 3.13>]. No impact on default installs.
  • Graceful reconnect for Milvus Lite — when Milvus Lite's embedded daemon exits after idle, the backend detects the specific error signatures and reconnects once per RPC. Only active when uri.endswith(".db"); remote backends never take the reconnect path. See upstream milvus-lite#334 for context.
  • Tests — 32 new Milvus tests in tests/storage/test_milvus.py. Full suite: 1620 passed.
  • Docs — docs/milvus-backend.md (new dedicated guide), docs/guides/STORAGE_BACKENDS.md extended with a Milvus column and decision flowchart, README.md and .env.example both reference the new backend with appropriate caveats.
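The concurrency and reconnect behaviour described above (sync pymilvus run via asyncio.to_thread, one reconnect attempt per RPC, gated to Lite) can be sketched roughly as follows. This is illustrative only: the function names and the error signatures matched are assumptions, not the PR's actual code.

```python
import asyncio
from typing import Any, Callable

# Illustrative signatures; the real backend matches specific milvus-lite error text.
_LITE_DEAD_SIGNATURES = ("channel", "connection refused")

async def call_with_reconnect(
    call: Callable[[], Any],
    reconnect: Callable[[], None],
    is_lite: bool,
) -> Any:
    """Run a sync client call off the event loop; on a Lite dead-channel
    error, reconnect once and retry. A second failure propagates."""
    try:
        return await asyncio.to_thread(call)
    except Exception as exc:
        msg = str(exc).lower()
        if is_lite and any(sig in msg for sig in _LITE_DEAD_SIGNATURES):
            reconnect()
            return await asyncio.to_thread(call)  # single retry per RPC
        raise
```

Remote backends pass is_lite=False and therefore never take the retry branch, matching the uri.endswith(".db") gating described in the PR.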

Which URI to use

| URI form | Best for |
| --- | --- |
| ./milvus.db (Milvus Lite) | Scripts, tests, short-lived CLIs. Not recommended for MCP servers or other long-lived services — see docs for why. |
| http://host:19530 | Recommended for MCP deployments. Stable, concurrent-safe, easy to run via the milvusdb/milvus Docker image. |
| https://<cluster>.zillizcloud.com + token | Managed, at scale. |
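A minimal sketch of how a backend might dispatch on the URI forms above; the helper name is hypothetical and the checks are simplified for illustration:

```python
def classify_milvus_uri(uri: str) -> str:
    """Map a Milvus URI to its deployment mode (illustrative helper)."""
    if uri.endswith(".db"):
        return "lite"    # local embedded file, Milvus Lite
    if uri.startswith("https://"):
        return "cloud"   # e.g. Zilliz Cloud; pair with a token
    return "server"      # self-hosted http endpoint
```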

Verification

  • Full test suite: pytest -q tests/ --timeout=180 → 1620 passed, 0 failed (baseline 1588 + 32 Milvus).
  • Real Claude Code MCP end-to-end against Docker Milvus (uri=http://localhost:19530): health, 3 stores, semantic search (Milvus memory ranked #1 for "vector database"), tag filter (exactly 1 match for "user-preference"), total_memories=3 — all green.
  • Milvus Lite sequential 75 s idle probe: store → sleep 75 s → store — both succeed, retrieve returns 2 hits, thanks to the reconnect helper.

Migration / compatibility

  • Zero breaking changes. Existing deployments on sqlite_vec, hybrid, cloudflare are untouched.
  • New backend is opt-in via MCP_MEMORY_STORAGE_BACKEND=milvus and the [milvus] optional-dependency group.
  • Schema validator and health-check registry both recognize MilvusMemoryStorage so memory_health and the web dashboard report correct backend identity.

Introduces MilvusMemoryStorage, a pluggable backend that uses pymilvus's
MilvusClient API. One configuration supports three deployment targets:

  * Milvus Lite (default): single local file, zero external dependencies
  * Self-hosted Milvus server: http(s) endpoint
  * Zilliz Cloud: managed endpoint with a token

Schema and semantics:
  * Primary key is the memory's content hash, so exact-hash dedup is free
    and cleanup_duplicates() is a no-op.
  * Tags are stored as ",tag1,tag2," with leading/trailing commas and
    matched with "tags like '%,tag,%'" — exact match (no substring).
  * AUTOINDEX + COSINE metric work identically on Lite, server, and Cloud.
  * Semantic deduplication honours the same MCP_SEMANTIC_DEDUP_* env vars
    as sqlite_vec, so the service layer sees consistent behaviour.
  * Hard delete (no tombstone table); soft-delete/sync scenarios should
    continue to use the hybrid backend.
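The comma-sentinel tag scheme above can be illustrated with two small helpers. The names are hypothetical (the PR's own functions may differ), but the encoding and filter expression follow what the commit message describes:

```python
from typing import List, Optional

def tags_to_string(tags: Optional[List[str]]) -> str:
    """Encode ["a", "b"] as ",a,b," so every tag is bracketed by commas."""
    return "," + ",".join(tags) + "," if tags else ""

def tag_filter_expr(tag: str) -> str:
    """Exact-tag Milvus filter: the sentinels prevent substring false positives."""
    return f"tags like '%,{tag},%'"
```

Because the stored string for ["notes"] is ",notes,", a search for the tag "note" produces the pattern '%,note,%' and cannot match it, which is why the match is exact rather than substring.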

Wiring:
  * storage/factory.py routes backend == "milvus" to the new class and
    reads MCP_MILVUS_URI / MCP_MILVUS_TOKEN / MCP_MILVUS_COLLECTION_NAME.
  * config.py exposes those three variables and extends SUPPORTED_BACKENDS.
    The MCP_ prefix is deliberate — pymilvus reserves MILVUS_URI and
    validates it at import time, so a Lite file path in the unprefixed
    env var would error before our code runs.
  * pyproject.toml ships a new "milvus" optional-dependency group that
    pulls in pymilvus>=2.5.0, milvus-lite>=2.4.10, and pins setuptools<81
    on Python 3.13+ (milvus-lite still imports pkg_resources).

Tests:
  * tests/storage/test_milvus.py (23 tests) covers initialize, store,
    retrieve, batch, exact-tag match, AND/OR tag search, delete paths,
    update, stats, and semantic dedup. Uses Milvus Lite with a shared DB
    file + per-test collection to avoid spawning a daemon per test.
  * Full suite: 1611 passed (baseline 1588 + 23 new).

Docs:
  * docs/milvus-backend.md — per-target deployment guide, env vars,
    schema, limitations, troubleshooting.
  * docs/guides/STORAGE_BACKENDS.md — adds Milvus to the comparison
    table, selection flowchart, and config reference.
  * README.md / .env.example mention the new backend option.
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a Milvus storage backend to the MCP Memory Service, supporting Milvus Lite, self-hosted Milvus, and Zilliz Cloud. The implementation includes a new MilvusMemoryStorage class, updated configuration handling, and comprehensive unit tests. Feedback focuses on improving the scalability and correctness of the Milvus implementation, specifically addressing issues with server-side filtering for substring searches, fixing broken pagination and sorting in query methods, ensuring consistent timestamp metadata, and optimizing fetch limits during vector searches.

Comment on lines +1156 to +1170
    async def get_by_exact_content(self, content: str) -> List[Memory]:
        """Case-insensitive substring match on content.

        Matches sqlite_vec's semantics (which uses ``LIKE '%...%' COLLATE NOCASE``).
        Milvus ``like`` is case-sensitive, so we scan candidates and filter in
        Python. For expected collection sizes (thousands to low millions) this
        is acceptable; production users with huge collections should prefer
        full-text search on a dedicated field instead.
        """
        if not self._ensure_initialized():
            return []

        memories = await self.get_all_memories()
        needle = content.lower()
        return [m for m in memories if needle in (m.content or "").lower()]

high

This method performs a full scan of the collection (up to _MILVUS_MAX_LIMIT) and filters in Python, which has significant drawbacks:

  1. Incompleteness: It only searches the first 16,384 records. In a collection with millions of memories, matches beyond this limit will never be found.
  2. Performance/Memory Risk: Fetching thousands of records into memory as Python objects to perform a substring search is inefficient and poses an OOM (Out of Memory) risk for large collections.

Consider using the Milvus like operator for server-side filtering to narrow down candidates. Even if case-insensitivity requires a final Python check, the number of records fetched would be drastically reduced. For true case-insensitive scalability, consider storing a normalized lowercase field or using a Milvus analyzer (available for TEXT fields in 2.4+).

Comment on lines +1499 to +1535
    async def _query_memories(
        self,
        filter_expr: str,
        limit: int,
        offset: int = 0,
        sort_desc_key: Optional[str] = None,
    ) -> List[Memory]:
        """Run a scalar ``query`` and return Memory objects, sorted client-side."""
        limit = max(1, min(limit, _MILVUS_MAX_LIMIT))
        # Milvus has a limit on offset (~16384), but total rows < offset+limit still works.
        fetch_limit = min(offset + limit, _MILVUS_MAX_LIMIT)

        try:
            rows = await self._call_client(
                "query",
                collection_name=self.collection_name,
                filter=filter_expr or "",
                output_fields=list(self._OUTPUT_FIELDS),
                limit=fetch_limit,
            )
        except Exception as exc:  # noqa: BLE001
            logger.warning("Milvus query failed (filter=%r): %s", filter_expr, exc)
            return []

        memories: List[Memory] = []
        for row in rows:
            mem = self._entity_to_memory(row)
            if mem is not None:
                memories.append(mem)

        if sort_desc_key:
            memories.sort(key=lambda m: getattr(m, sort_desc_key) or 0.0, reverse=True)

        if offset:
            memories = memories[offset:]
        return memories[:limit]


high

The current implementation of _query_memories has several issues that affect correctness and scalability:

  1. Incorrect Sorting: Sorting is performed in Python after fetching a subset of records. Without an order_by clause in the Milvus query, the database returns an arbitrary set of records. This means get_recent_memories will not reliably return the most recent entries.
  2. Broken Pagination: Slicing is done in Python on a result set capped by _MILVUS_MAX_LIMIT. If offset + limit exceeds this limit, the function will return fewer records than requested or even an empty list (e.g., if offset >= 16384), even if more records exist.
  3. Efficiency: Milvus supports server-side sorting and pagination via order_by, limit, and offset. Using these parameters directly in the query call is significantly more efficient and correct.

Note: While Milvus has a default max_query_limit (typically 16384), using the offset parameter allows paginating through the entire collection.

    async def _query_memories(
        self,
        filter_expr: str,
        limit: int,
        offset: int = 0,
        sort_desc_key: Optional[str] = None,
    ) -> List[Memory]:
        """Run a scalar query and return Memory objects, sorted server-side."""
        query_params = {
            "collection_name": self.collection_name,
            "filter": filter_expr or "",
            "output_fields": list(self._OUTPUT_FIELDS),
            "limit": limit,
            "offset": offset,
        }
        if sort_desc_key:
            query_params["order_by"] = [f"{sort_desc_key} desc"]

        try:
            rows = await self._call_client("query", **query_params)
        except Exception as exc:  # noqa: BLE001
            logger.warning("Milvus query failed (filter=%r): %s", filter_expr, exc)
            return []

        memories: List[Memory] = []
        for row in rows:
            mem = self._entity_to_memory(row)
            if mem is not None:
                memories.append(mem)

        return memories

Comment on lines +576 to +588
    def _memory_to_entity(self, memory: Memory, embedding: List[float]) -> Dict[str, Any]:
        return {
            "id": memory.content_hash,
            "vector": embedding,
            "content": memory.content or "",
            "tags": _tags_to_string(memory.tags),
            "memory_type": memory.memory_type or "",
            "metadata": json.dumps(memory.metadata) if memory.metadata else "{}",
            "created_at": float(memory.created_at) if memory.created_at is not None else time.time(),
            "updated_at": float(memory.updated_at) if memory.updated_at is not None else time.time(),
            "created_at_iso": memory.created_at_iso or "",
            "updated_at_iso": memory.updated_at_iso or "",
        }

medium

There is an inconsistency in timestamp handling when fields are missing from the Memory object. If memory.created_at is None, it is initialized with time.time(), but created_at_iso is stored as an empty string. This results in records having a valid float timestamp but an empty ISO string, which may break consumers expecting consistent temporal metadata. Both fields should be populated using the same timestamp value. A similar pattern should be applied to _build_update_entity.

    def _memory_to_entity(self, memory: Memory, embedding: List[float]) -> Dict[str, Any]:
        now = time.time()
        created_at = float(memory.created_at) if memory.created_at is not None else now
        updated_at = float(memory.updated_at) if memory.updated_at is not None else now

        def to_iso(ts):
            return datetime.fromtimestamp(ts, tz=timezone.utc).isoformat().replace("+00:00", "Z")

        return {
            "id": memory.content_hash,
            "vector": embedding,
            "content": memory.content or "",
            "tags": _tags_to_string(memory.tags),
            "memory_type": memory.memory_type or "",
            "metadata": json.dumps(memory.metadata) if memory.metadata else "{}",
            "created_at": created_at,
            "updated_at": updated_at,
            "created_at_iso": memory.created_at_iso or to_iso(created_at),
            "updated_at_iso": memory.updated_at_iso or to_iso(updated_at),
        }

Comment on lines +879 to +884
    def _retrieve_fetch_limit(n_results: int, tag_filter: str) -> int:
        # Over-fetch when a tag filter is active — Milvus applies filters during
        # search but effective selectivity can vary.
        base = n_results * 3 if tag_filter else n_results
        return max(1, min(base, _MILVUS_MAX_LIMIT))


medium

Over-fetching by a factor of 3 when a tag filter is active is unnecessary in Milvus. Milvus performs integrated filtering during the vector search process, ensuring that the limit requested is satisfied by matching entities (if they exist). Unlike systems that perform post-filtering, Milvus's search with a filter expression is efficient and exact. Reducing fetch_n to n_results would reduce network overhead and database load.

Suggested change

    @staticmethod
    def _retrieve_fetch_limit(n_results: int, tag_filter: str) -> int:
        # Milvus applies filters during search, so over-fetching is not required.
        return max(1, min(n_results, _MILVUS_MAX_LIMIT))

@doobidoo
Owner

Review findings — Gemini cross-check + additions

Read the four Gemini comments and verified each against src/mcp_memory_service/storage/milvus.py. Summary of where I agree, where I'd push back, and a few things Gemini missed.

Gemini #1 — get_by_exact_content full-scan (lines 1156-1170, high) — ✅ agree

Real risk. Two problems compound:

  • Hard cap at _MILVUS_MAX_LIMIT (16 384) — silently truncates on larger collections.
  • Full Python materialization of every row before substring filter — OOM risk + RTT bandwidth.

Caveat on Gemini's fix: pushing down tags like "%,...,%" works because tags are normalized lowercase; a generic like "%needle%" on content is case-sensitive in Milvus and so doesn't match sqlite_vec's LIKE … COLLATE NOCASE semantics (verified at sqlite_vec.py:2668). To get parity and server-side filtering, store a normalized content_lower field at insert time and push like down on that. Otherwise the function will quietly diverge from sqlite_vec on case-mixed content.
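One way the normalized-field approach could look. The field name content_lower and both helper names are hypothetical, and the quote escaping is deliberately naive; this is a sketch of the idea, not a hardened implementation:

```python
def add_normalized_content(entity: dict) -> dict:
    """At insert time, mirror content into a lowercase field for push-down filtering."""
    out = dict(entity)
    out["content_lower"] = (entity.get("content") or "").lower()
    return out

def substring_filter_expr(needle: str) -> str:
    """Server-side, case-insensitive substring filter against the mirrored field."""
    safe = needle.lower().replace("'", "\\'")  # naive quote escaping for the sketch
    return f"content_lower like '%{safe}%'"
```

With the lowercase mirror in place, the filter runs server-side and matches sqlite_vec's NOCASE semantics without materializing the whole collection in Python.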

Gemini #2 — _query_memories sort + pagination (lines 1499-1534, high) — ✅ agree, this is the most important finding

Two real bugs once a collection grows past 16 384 rows:

  1. Sort is fenced inside a fetch window. pymilvus.query() doesn't guarantee row order, so get_recent_memories will sort the first arbitrary 16 384 rows — not the actually-most-recent ones. On any non-trivial collection, "recent" becomes "random subset that happens to be recent".
  2. Pagination silently caps. fetch_limit = min(offset + limit, _MILVUS_MAX_LIMIT) plus client-side slicing means once offset + limit > 16 384 you can never page past row 16 384.

Recommended fix: use pymilvus.QueryIterator for sorted/paginated scans, or push sorting into Milvus via order_by (Milvus 2.5+ scalar query supports it). Either is preferable to the current client-side approach.
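The iterator-based scan could be sketched like this. It assumes MilvusClient.query_iterator (available in recent pymilvus releases); the exact signature should be verified against the installed version, and the function name is illustrative:

```python
def scan_all(client, collection_name: str, filter_expr: str = "",
             output_fields=None, batch_size: int = 1000):
    """Drain a query iterator so results are not fenced by the ~16 384-row query cap."""
    it = client.query_iterator(
        collection_name=collection_name,
        filter=filter_expr,
        output_fields=output_fields or [],
        batch_size=batch_size,
    )
    rows = []
    try:
        while True:
            batch = it.next()  # an empty batch signals exhaustion
            if not batch:
                break
            rows.extend(batch)
    finally:
        it.close()
    return rows
```

Sorting and slicing applied after a full drain are then correct for any collection size, at the cost of scanning; server-side order_by, where supported, avoids even that.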

Gemini #3 — timestamp inconsistency in _memory_to_entity (lines 576-588, medium) — ✅ agree, trivial fix

Confirmed at lines 584-587:

"created_at": float(memory.created_at) if memory.created_at is not None else time.time(),
...
"created_at_iso": memory.created_at_iso or "",

Two clocks, one record. Compute now = time.time() once when either is missing, derive the ISO string from the same value. Same fix in _build_update_entity.

Gemini #4 — over-fetch 3× when tag filter active (lines 878-883, medium) — ❌ I'd push back

Gemini says Milvus's filtered ANN guarantees limit is satisfied. That's only true for brute-force search; with AUTOINDEX/HNSW + a selective filter, the search can hit ef_search boundary before collecting enough matching candidates and return fewer than limit results. The 3× defensive over-fetch is a standard pattern (Pinecone, Weaviate, Qdrant clients all do similar) and is correct here.

Two improvements that would address Gemini's underlying concern without losing the safety margin:

  • Add a one-line comment at line 880 explaining why (filtered ANN under HNSW can under-return) — would have pre-empted this finding.
  • Consider 2× instead of 3× as a small win on bandwidth; 3× is conservative.

I'd close this thread as "won't fix" with the rationale.


Additional findings Gemini missed

_EMBEDDING_CACHE keyed by hash(text) (lines 466-480, medium)

    _EMBEDDING_CACHE: Dict[int, List[float]] = {}
    ...
    cache_key = hash(text)

Two issues:

  • Hash collision → wrong embedding returned. Python hash() for strings is 64-bit; collisions are rare but real, and a wrong embedding silently corrupts retrieval. Key by text itself.
  • Unbounded growth. No eviction — long-lived MCP processes will grow this monotonically. Use functools.lru_cache or cachetools.LRUCache with an explicit cap.

This is a latent correctness + memory-leak bug that won't surface in the 32 unit tests.
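A bounded, collision-safe replacement along the lines suggested could look like this; the factory name is illustrative and embed_fn stands in for the backend's real embedding call:

```python
from functools import lru_cache
from typing import Callable, Sequence, Tuple

def make_embedding_cache(
    embed_fn: Callable[[str], Sequence[float]],
    maxsize: int = 4096,
) -> Callable[[str], Tuple[float, ...]]:
    """Key by the text itself (no hash-collision aliasing) with LRU eviction."""
    @lru_cache(maxsize=maxsize)
    def cached(text: str) -> Tuple[float, ...]:
        return tuple(embed_fn(text))  # tuples keep cached values immutable
    return cached
```

Keying on the string removes the collision risk entirely, the maxsize cap bounds memory in long-lived MCP processes, and cached.cache_info() gives hit/miss stats for free.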

Milvus Lite reconnect retry — single-attempt is correct, worth a test

The reconnect helper at _call_client (around line 540-573) only retries once per RPC, which is the right call (avoids retry storms). Worth adding a test that simulates two consecutive dead-channel errors and asserts the second one propagates — currently the 75 s idle probe only validates the happy path.

Hybrid backend integration

hybrid.py isn't extended to support Milvus as the cloud target. Not a blocker for this PR, but worth a follow-up issue so Milvus + local SQLite-Vec mirror is achievable later.


Bottom line

Strong PR overall — clean separation, three deployment modes from one code path, defensive concurrency, real reconnect handling for a known upstream issue. The two high-severity Gemini findings (get_by_exact_content, _query_memories) are real and worth fixing before merge. The medium timestamp fix is trivial. I'd push back on Gemini's over-fetch finding. The embedding-cache collision is the one Gemini missed and I'd add it to the must-fix list.

Happy to take any of these as follow-up commits if it helps move the PR forward.
