Add memory hashing, versioning, and forget support by kodahhhhh · Pull Request #187 · XortexAI/XMem

kodahhhhh · 2026-05-20T00:14:04Z

Summary

add SHA-256 content hashing to vector-memory metadata and skip duplicate ADD/UPDATE writes before re-embedding
store memory lineage metadata (parent_memory_id, version, is_current) and preserve prior versions by adding new current vectors on UPDATE
forget full memory lineages on DELETE and filter superseded/forgotten memories out of judge/retrieval paths
expand focused Weaver tests for duplicate detection, versioned updates, and lineage deletion
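
The hash-based duplicate check from the first summary item can be sketched as follows. This is a minimal illustration, not the PR's code: the helper names `content_hash` and `is_duplicate`, the whitespace stripping, and the choice to treat a missing `is_current` flag as active are all assumptions. The PR only states that SHA-256 hashes are stored in metadata and used to skip duplicate writes before re-embedding.

```python
import hashlib


def content_hash(text: str) -> str:
    """Return a hex SHA-256 digest of the text (hypothetical helper).

    Stripping surrounding whitespace is an assumption made here so that
    trivially different inputs hash the same; the PR may normalize differently.
    """
    return hashlib.sha256(text.strip().encode("utf-8")).hexdigest()


def is_duplicate(text: str, existing_metadata: list[dict]) -> bool:
    """True if an active record already carries the same content hash."""
    digest = content_hash(text)
    return any(
        meta.get("content_hash") == digest and meta.get("is_current", True)
        for meta in existing_metadata
    )


stored = [{"content_hash": content_hash("User likes tea"), "is_current": True}]
print(is_duplicate("User likes tea ", stored))    # same content after stripping
print(is_duplicate("User likes coffee", stored))  # different content
```

Because the digest is computed from the text alone, the check can run before any embedding call, which is where the saving described in the summary would come from.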

Fixes #166

Tests

.venv/bin/python -m pytest tests/test_deterministic_memory_layer.py tests/integration/test_weaver_pipeline.py -q
git diff --check
.venv/bin/python -m compileall -q src/agents/judge.py src/pipelines/retrieval.py src/pipelines/weaver.py tests/test_deterministic_memory_layer.py tests/integration/test_weaver_pipeline.py

gemini-code-assist

Code Review

This pull request introduces memory versioning and lineage tracking, including content hashing for duplicate detection and a new update flow that preserves history by marking old records as superseded. Review feedback highlights several critical areas: a hard limit on lineage retrieval that could impact data deletion, potential consistency issues allowing multiple 'current' versions, and opportunities to optimize network performance by reducing redundant data transmission. There is also a noted inconsistency between the hard deletion implementation and the soft deletion checks used during retrieval.

gemini-code-assist · 2026-05-20T00:16:56Z

+            for match in search_fn(
+                filters={"parent_memory_id": parent_memory_id},
+                top_k=100,
+            ) or []:
+                ids.add(match.id)


The use of top_k=100 here creates a hard limit on the number of versions that can be 'forgotten' in a single lineage. If a memory has been updated more than 100 times, older versions will remain in the vector store, potentially violating privacy requirements or causing data leaks. Consider using a much larger top_k or implementing a paginated search if the underlying store supports it.

gemini-code-assist · 2026-05-20T00:16:56Z

            )
-            return await self._vector_add(op, domain, user_id)
+
+        self._mark_memory_superseded(op.embedding_id, previous, new_id)


There is a potential consistency issue here. If op.embedding_id refers to a version that is already superseded (e.g., due to a race condition or stale retrieval), this logic will mark that old version as superseded by the new one, but it won't affect the actual current version of the lineage. This results in multiple versions having is_current: True simultaneously. You should ensure that only one version per lineage is marked as current, perhaps by searching for the existing current version of the lineage before performing the update.
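
A rough sketch of the suggested fix: resolve the lineage from whatever ID the caller holds, then supersede every version that is still marked current. The dict-based `records` store and the `apply_update` helper are illustrative assumptions; the field names `parent_memory_id`, `version`, `is_current`, and `superseded_by` come from the PR.

```python
def apply_update(records: dict[str, dict], stale_id: str, new_id: str, content: str) -> None:
    """Append a new version and supersede every active version in the lineage.

    `records` maps id -> metadata and stands in for the vector store.
    """
    lineage = records[stale_id]["parent_memory_id"]
    # Look up the active versions by lineage, not by the ID the caller passed.
    active = [
        rid for rid, meta in records.items()
        if meta["parent_memory_id"] == lineage and meta["is_current"]
    ]
    next_version = max(
        meta["version"] for meta in records.values()
        if meta["parent_memory_id"] == lineage
    ) + 1
    records[new_id] = {
        "parent_memory_id": lineage,
        "version": next_version,
        "is_current": True,
        "content": content,
    }
    for rid in active:
        records[rid]["is_current"] = False
        records[rid]["superseded_by"] = new_id


store = {
    "a": {"parent_memory_id": "a", "version": 1, "is_current": False, "content": "v1"},
    "b": {"parent_memory_id": "a", "version": 2, "is_current": True, "content": "v2"},
}
apply_update(store, stale_id="a", new_id="c", content="v3")  # caller holds a stale ID
print([rid for rid, meta in store.items() if meta["is_current"]])
```

This sketch does not address concurrent writers; two simultaneous updates could still both append a current version unless the store provides some form of conditional write.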

gemini-code-assist · 2026-05-20T00:16:56Z

+    def _set_parent_memory_id(self, memory_id: Optional[str]) -> None:
+        if not memory_id:
+            return
+        try:
+            self.vector_store.update(
+                id=memory_id,
+                metadata={"parent_memory_id": memory_id},
+            )
+        except Exception as exc:
+            logger.warning("Could not set parent_memory_id for %s: %s", memory_id, exc)


This method introduces an unnecessary network round-trip for every new memory added. Since BaseVectorStore.add allows providing explicit IDs, you can generate a UUID in the Weaver, include it in the initial metadata as parent_memory_id, and pass it to the add call. This would eliminate the need for a follow-up update call.
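
The single-write approach might look like the sketch below. `InMemoryStore` is a hypothetical stand-in that only mimics an `add` accepting explicit `ids`; the real `BaseVectorStore.add` signature is not shown in this thread, so the keyword names here are assumptions.

```python
import uuid


class InMemoryStore:
    """Hypothetical stand-in; only mimics an add() that accepts explicit ids."""

    def __init__(self):
        self.records = {}
        self.calls = 0  # counts round-trips to the "store"

    def add(self, texts, metadata, ids):
        self.calls += 1
        for rid, text, meta in zip(ids, texts, metadata):
            self.records[rid] = {"text": text, **meta}
        return list(ids)


def add_memory(store: InMemoryStore, text: str) -> str:
    new_id = str(uuid.uuid4())
    # Self-referential root: the first version's ID doubles as the lineage key,
    # so no follow-up update is needed to set parent_memory_id.
    metadata = {"parent_memory_id": new_id, "version": 1, "is_current": True}
    store.add(texts=[text], metadata=[metadata], ids=[new_id])
    return new_id


store = InMemoryStore()
memory_id = add_memory(store, "User likes tea")
print(store.calls, store.records[memory_id]["parent_memory_id"] == memory_id)
```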

gemini-code-assist · 2026-05-20T00:16:56Z

+        self.vector_store.update(
+            id=memory_id,
+            text=previous.get("content"),
+            embedding=previous.get("embedding"),
+            metadata={
+                "is_current": False,
+                "superseded_by": superseded_by,
+            },
+        )


Passing text and embedding back to the vector store during a metadata-only update is redundant and increases payload size. Most vector store providers (like Pinecone) allow updating metadata without re-sending the vector data. You should check if your BaseVectorStore implementation supports metadata-only updates and omit these fields if possible.

References

Avoid redundant data transmission to external services to improve performance and reduce costs.
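
A metadata-only update relies on the store merging the supplied keys into the existing metadata rather than replacing the record. The fake store below illustrates that contract under stated assumptions: the class and its `update` signature are hypothetical and loosely modeled on the call shown in the diff, not on any specific provider's API.

```python
class FakeStore:
    """Hypothetical store whose update() leaves unspecified fields untouched."""

    def __init__(self):
        self.records = {}

    def update(self, id, text=None, embedding=None, metadata=None):
        record = self.records[id]
        if text is not None:
            record["text"] = text
        if embedding is not None:
            record["embedding"] = embedding
        if metadata:
            # Merge rather than replace, so existing keys survive.
            record["metadata"] = {**record["metadata"], **metadata}


store = FakeStore()
store.records["m1"] = {
    "text": "User likes tea",
    "embedding": [0.1, 0.2],
    "metadata": {"parent_memory_id": "m1", "is_current": True},
}
# Only the changed metadata is sent; text and embedding are omitted.
store.update(id="m1", metadata={"is_current": False, "superseded_by": "m2"})
print(store.records["m1"])
```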

gemini-code-assist · 2026-05-20T00:16:56Z

    async def _vector_delete(self, op: Operation) -> ExecutedOp:
-        success = self.vector_store.delete(ids=[op.embedding_id])
+        ids = self._lineage_ids_for_forget(op.embedding_id)
+        success = self.vector_store.delete(ids=ids)


The DELETE operation performs a hard delete from the vector store, yet the retrieval logic in retrieval.py and judge.py explicitly checks for a forgotten_at metadata field. If the intention of 'forget support' is to allow for soft deletes or auditing, _vector_delete should probably update the lineage with a forgotten_at timestamp instead of calling delete(). If hard deletion is intended, the forgotten_at checks in the retrieval paths are dead code.
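
The two options the comment contrasts can be sketched side by side. The helper names and the dict-based store are illustrative assumptions; only the `forgotten_at` field name comes from the review comment.

```python
from datetime import datetime, timezone


def hard_forget(records: dict[str, dict], lineage_ids: list[str]) -> None:
    """Remove the records outright; nothing remains to filter at read time."""
    for rid in lineage_ids:
        records.pop(rid, None)


def soft_forget(records: dict[str, dict], lineage_ids: list[str]) -> None:
    """Keep the records but stamp them, so readers must check forgotten_at."""
    stamp = datetime.now(timezone.utc).isoformat()
    for rid in lineage_ids:
        if rid in records:
            records[rid]["forgotten_at"] = stamp


def is_visible(meta: dict) -> bool:
    """Read-side filter that only matters under the soft-forget variant."""
    return meta.get("forgotten_at") is None


hard = {"a": {}, "b": {}}
hard_forget(hard, ["a"])
soft = {"a": {}, "b": {}}
soft_forget(soft, ["a"])
print(list(hard), [rid for rid, meta in soft.items() if is_visible(meta)])
```

With hard deletion there is never a record carrying `forgotten_at`, which is why the comment describes the read-side check as dead code in that case.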

kodahhhhh · 2026-05-20T00:24:44Z

Addressed the Gemini review feedback in 5bc7006:

New memory IDs are generated before add and passed through BaseVectorStore.add(..., ids=...), so parent_memory_id is set in the initial write without a follow-up update.
Superseding old versions now uses metadata-only vector store updates.
Updates first resolve the current active version for the lineage, so stale update operations do not leave multiple is_current versions behind.
Forget now collects lineage IDs with a much larger metadata lookup window (10_000) before hard deleting the lineage.
Removed the unused forgotten_at soft-delete checks because this implementation intentionally performs hard forget/deletion.
Added a regression test for stale updates superseding the active current version.

Tests: .venv/bin/python -m pytest tests/test_deterministic_memory_layer.py tests/integration/test_weaver_pipeline.py -q (11 passed), git diff --check, and compileall on touched files.

ishaanxgupta · 2026-05-20T09:19:55Z

Hi @kodahhhhh, can you please discuss your approach in the issue thread?

greptile-apps · 2026-05-23T09:23:00Z

Greptile Summary

This PR adds content-addressed deduplication (SHA-256 hashes), immutable versioning (UPDATE creates a new vector and marks previous versions is_current=False), and full-lineage deletion (DELETE removes every version sharing the same parent_memory_id) to the vector memory layer. All retrieval and judge paths are updated to skip inactive memories by over-fetching and filtering.

weaver.py: New lineage helpers (_find_active_duplicate, _lineage_matches, _mark_lineage_superseded, _lineage_ids_for_forget) replace in-place UPDATEs with append-only versioning; ADD seeds a self-referential parent_memory_id as the lineage root key.
judge.py / retrieval.py: Both pipelines now over-fetch (top_k × 5 / × 2) and apply an is_current filter to exclude superseded memories from judge comparisons and retrieval results.
Tests: FakeVectorStore gains get and partial-metadata-merge support; three new focused tests cover duplicate skipping, stale-ID updates, and full-lineage deletion.
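
The over-fetch-and-filter pattern described for judge.py and retrieval.py can be sketched as below. `active_results` and the `search(query, k)` callable are hypothetical names, and treating a missing `is_current` flag as active is an assumption about how records written before this PR would be handled.

```python
def active_results(search, query: str, top_k: int, overfetch: int = 5) -> list[dict]:
    """Fetch top_k * overfetch candidates, then keep the first top_k active ones.

    `search(query, k)` is a hypothetical callable returning ranked dicts that
    each carry a "metadata" mapping.
    """
    candidates = search(query, top_k * overfetch)
    active = [
        hit for hit in candidates
        # Treating a missing flag as active is an assumption for legacy records.
        if hit["metadata"].get("is_current", True)
    ]
    return active[:top_k]


def fake_search(query, k):
    hits = [
        {"id": "old-1", "metadata": {"is_current": False}},
        {"id": "new-1", "metadata": {"is_current": True}},
        {"id": "legacy", "metadata": {}},
        {"id": "old-2", "metadata": {"is_current": False}},
        {"id": "new-2", "metadata": {"is_current": True}},
    ]
    return hits[:k]


print([hit["id"] for hit in active_results(fake_search, "tea", top_k=2)])
```

Without the multiplier (`overfetch=1`), the same query would fetch only the two top-ranked hits and return a single active result, which is the shortfall the over-fetch is meant to avoid. A fixed multiplier is still a heuristic: a lineage with many superseded versions can crowd out active results.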

Confidence Score: 3/5

The lineage versioning and deduplication logic is well-designed overall, but the batch DELETE path dropped its exception handler when it was refactored from a single-bulk call to a per-op loop, leaving the batch uncleared on error.

The versioning and filtering changes to judge.py and retrieval.py look correct. The concern is in weaver.py: flush_delete_batch no longer wraps _vector_delete in try/except, so any exception from vector_store.get, vector_store.delete, or the lineage lookup escapes execute entirely and leaves delete_batch_ops uncleared. The rest of the implementation is sound.

src/pipelines/weaver.py (specifically flush_delete_batch at line 287) needs a try/except guard to match the error-handling pattern in flush_add_batch.

Important Files Changed

Filename	Overview
src/pipelines/weaver.py	Core change: adds SHA-256 content hashing, lineage versioning (ADD creates self-referential root; UPDATE writes a new vector and marks old versions superseded), and full-lineage forget on DELETE. Batch DELETE lost its try/except wrapper, which is a regression.
src/agents/judge.py	Adds _active_memory_results helper and applies it at all search return sites; search_by_text and search_by_metadata now over-fetch (top_k * 5) so enough active results survive the inactive filter. Logic looks correct.
src/pipelines/retrieval.py	Adds _is_active_memory filter to summary-domain and profile-catalog search loops; over-fetches (top_k * 2) then skips inactive entries. Clean and self-contained change.
tests/conftest.py	One-line fix: update now merges metadata instead of replacing it, correctly reflecting the partial-update contract needed by the new lineage marking logic.
tests/test_deterministic_memory_layer.py	Extends FakeVectorStore with get, partial-update in update, and optional ids in add; adds three focused tests for duplicate detection, versioned update, and full-lineage delete.
tests/integration/test_weaver_pipeline.py	Updates integration assertions to match the new version-per-UPDATE semantics, but replaces the standard import with a fragile importlib.exec_module path-based loader that will break on layout changes.

Sequence Diagram

sequenceDiagram
    participant C as Caller
    participant W as Weaver.execute
    participant FA as flush_add_batch
    participant FD as flush_delete_batch
    participant VS as VectorStore

    C->>W: execute(JudgeResult)
    W->>FA: flush_add_batch()
    FA->>VS: search_by_metadata(content_hash)
    VS-->>FA: matches
    alt duplicate found
        FA-->>W: OpStatus.SKIPPED
    else new content
        FA->>VS: add(texts, embeddings, ids, metadata)
        VS-->>FA: stored_ids
        FA-->>W: OpStatus.SUCCESS (new_id)
    end
    W->>VS: get(embedding_id)
    VS-->>W: previous doc
    W->>VS: search_by_metadata(parent_memory_id)
    VS-->>W: lineage matches
    alt content_hash unchanged
        W-->>C: OpStatus.SKIPPED
    else new content
        W->>VS: add(new version vector)
        W->>VS: "update(old ids, is_current=False)"
        W-->>C: OpStatus.SUCCESS (new_id)
    end
    W->>FD: flush_delete_batch()
    FD->>VS: get(embedding_id)
    VS-->>FD: target doc
    FD->>VS: search_by_metadata(parent_memory_id)
    VS-->>FD: all versions
    FD->>VS: delete(all lineage ids)
    FD-->>W: OpStatus.SUCCESS / FAILED
    W-->>C: WeaverResult

Comments Outside Diff (2)

src/pipelines/weaver.py, lines 245-256

Store-returned IDs discarded silently in batch ADD

ids is re-assigned with the return value from vector_store.add, but successful_ids (the pre-generated UUIDs) are used for ExecutedOp.new_id. If a vector store implementation ignores the supplied ids and generates its own, the new_id recorded in every ExecutedOp will be wrong and subsequent UPDATE/DELETE operations referencing those IDs will fail. Consider using the store-returned IDs instead, or assert that they match.
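
A small reconciliation step along the lines suggested here might look like this. `reconcile_ids` and its `strict` flag are hypothetical; the sketch assumes the store returns IDs in the same order as the records it was given.

```python
def reconcile_ids(requested: list[str], returned: list[str], strict: bool = True) -> list[str]:
    """Return the IDs the store reports, failing loudly when they look wrong.

    With strict=True, any difference from the pre-generated IDs is an error.
    With strict=False, the store's IDs win as long as the counts agree.
    """
    if len(returned) != len(requested):
        raise ValueError(
            f"store returned {len(returned)} ids for {len(requested)} records"
        )
    if strict and list(returned) != list(requested):
        raise ValueError("store ignored the supplied ids")
    return list(returned)


print(reconcile_ids(["a", "b"], ["a", "b"]))
print(reconcile_ids(["a"], ["store-1"], strict=False))
```

Since the PR writes the pre-generated ID into `parent_memory_id` as well, a store that substitutes its own IDs would also leave the lineage key out of step with the record ID, so the strict check may be the safer default.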

tests/integration/test_weaver_pipeline.py, lines 565-569

Fragile importlib.exec_module replaces a working import

The original file used from src.pipelines.weaver import Weaver and worked fine. The new approach hardcodes an OS path and re-executes the module outside the normal package system. If the src/ layout changes, or if a CI runner runs tests from a different working directory, weaver_spec.loader.exec_module(weaver_module) will either fail to find the file or re-execute the module in an unexpected package context. There's no apparent benefit over the standard import.

Reviews (1): Last reviewed commit: "Address memory lineage review feedback"

greptile-apps · 2026-05-23T09:23:04Z

            if valid_ops:
-                loop = asyncio.get_running_loop()
-                try:
-                    success = await loop.run_in_executor(None, partial(self.vector_store.delete, ids=ids_to_delete))
-                    status = OpStatus.SUCCESS if success else OpStatus.FAILED
-                    for op in valid_ops:
-                        executed_ops.append(ExecutedOp(
-                            type=op.type, status=status,
-                            embedding_id=op.embedding_id
-                        ))
-                except Exception as exc:
-                    logger.error("Vector batch DELETE failed: %s", exc)
-                    for op in valid_ops:
-                        executed_ops.append(ExecutedOp(
-                            type=op.type, status=OpStatus.FAILED,
-                            embedding_id=op.embedding_id, error=str(exc)
-                        ))
+                for op in valid_ops:
+                    executed_ops.append(await self._vector_delete(op))

            delete_batch_ops.clear()


Batch DELETE drops exception handling

The old flush_delete_batch wrapped the delete call in try/except and appended OpStatus.FAILED entries on error. The new per-op loop has no exception guard: if _vector_delete raises (e.g., vector_store.get or vector_store.delete throws), the exception escapes flush_delete_batch, the delete_batch_ops.clear() on line 291 is never reached, and the same ops may be reprocessed on the next call. Compare with flush_add_batch, which still has the try/except pattern.
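
One possible shape for the guard is sketched below, using simplified stand-ins: `Result` replaces the PR's `ExecutedOp`, string statuses replace `OpStatus`, and `delete_one` replaces `_vector_delete`. It catches failures per operation and clears the batch in a `finally` block so the clear is always reached.

```python
import asyncio
from dataclasses import dataclass
from typing import Optional


@dataclass
class Result:
    """Simplified stand-in for the PR's ExecutedOp."""
    embedding_id: str
    status: str
    error: Optional[str] = None


async def flush_deletes(pending: list[str], delete_one) -> list[Result]:
    """Run each delete, recording failures instead of letting them escape."""
    results: list[Result] = []
    try:
        for embedding_id in pending:
            try:
                ok = await delete_one(embedding_id)
                results.append(Result(embedding_id, "success" if ok else "failed"))
            except Exception as exc:  # one bad op should not abort the batch
                results.append(Result(embedding_id, "failed", error=str(exc)))
    finally:
        pending.clear()  # always reached, so ops are not reprocessed
    return results


async def flaky_delete(embedding_id: str) -> bool:
    if embedding_id == "bad":
        raise RuntimeError("store unavailable")
    return True


batch = ["a", "bad", "b"]
outcome = asyncio.run(flush_deletes(batch, flaky_delete))
print([(r.embedding_id, r.status) for r in outcome], batch)
```

Catching per operation, rather than around the whole loop as the old bulk call did, lets the remaining deletes proceed after one failure. Whether that is the desired behavior for lineage deletion is a design decision for the PR author.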

greptile-apps · 2026-05-23T09:23:08Z

+        new_id = self._new_memory_id()
+        metadata = _memory_metadata(
+            op.content,
+            domain,
+            user_id,
+            parent_memory_id=new_id,
+        )


Version 1 memories are their own parent_memory_id

new_id is generated, placed in parent_memory_id, and then used as the record's own ID. This self-referential root is deliberate (it lets _lineage_matches find all versions of a memory), but adding a comment would prevent future maintainers from "fixing" it incorrectly.

Suggested change

+        # For a brand-new memory, use the record's own ID as its parent_memory_id.
+        # This self-referential root is the stable lineage key shared by all
+        # future versions of this memory (see _lineage_matches / _lineage_ids_for_forget).
         new_id = self._new_memory_id()
         metadata = _memory_metadata(
             op.content,
             domain,
             user_id,
             parent_memory_id=new_id,
         )

Add memory hashing versioning and forget support

f7a49a1

kodahhhhh requested review from ishaanxgupta and ved015 as code owners May 20, 2026 00:14

github-actions Bot added the tests, pipelines, and agents labels on May 20, 2026

gemini-code-assist Bot reviewed May 20, 2026

Address memory lineage review feedback

5bc7006

greptile-apps Bot reviewed May 23, 2026