Add memory hashing, versioning, and forget support #187

Open
kodahhhhh wants to merge 2 commits into XortexAI:main from kodahhhhh:fix-memory-hashing-versioning-forget

Conversation

kodahhhhh commented May 20, 2026

Summary

  • add SHA-256 content hashing to vector-memory metadata and skip duplicate ADD/UPDATE writes before re-embedding
  • store memory lineage metadata (parent_memory_id, version, is_current) and preserve prior versions by adding new current vectors on UPDATE
  • forget full memory lineages on DELETE and filter superseded/forgotten memories out of judge/retrieval paths
  • expand focused Weaver tests for duplicate detection, versioned updates, and lineage deletion
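
The hashing and lineage fields listed above could look roughly like the following. This is a minimal sketch rather than the PR's code: the helper names (`content_hash`, `memory_metadata`, `is_duplicate`) and the whitespace normalization step are assumptions made for illustration; only the metadata keys come from the summary.

```python
import hashlib
from typing import Dict, List, Optional


def content_hash(text: str) -> str:
    """Return the SHA-256 hex digest of the text.

    Collapsing whitespace first is an assumption of this sketch, so that
    trivially reformatted content hashes to the same value.
    """
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def memory_metadata(
    text: str,
    memory_id: str,
    parent_memory_id: Optional[str] = None,
    version: int = 1,
) -> Dict[str, object]:
    """Build the lineage metadata for one stored version of a memory."""
    return {
        "content_hash": content_hash(text),
        # A brand-new memory is the root of its own lineage.
        "parent_memory_id": parent_memory_id or memory_id,
        "version": version,
        "is_current": True,
    }


def is_duplicate(text: str, existing: List[Dict[str, object]]) -> bool:
    """True if a current memory with the same content hash already exists."""
    digest = content_hash(text)
    return any(
        m.get("is_current") and m.get("content_hash") == digest for m in existing
    )
```

Checking `is_duplicate` before computing an embedding is what lets a duplicate ADD or UPDATE be skipped without paying for re-embedding.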

Fixes #166

Tests

  • .venv/bin/python -m pytest tests/test_deterministic_memory_layer.py tests/integration/test_weaver_pipeline.py -q
  • git diff --check
  • .venv/bin/python -m compileall -q src/agents/judge.py src/pipelines/retrieval.py src/pipelines/weaver.py tests/test_deterministic_memory_layer.py tests/integration/test_weaver_pipeline.py


@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces memory versioning and lineage tracking, including content hashing for duplicate detection and a new update flow that preserves history by marking old records as superseded. Review feedback highlights several critical areas: a hard limit on lineage retrieval that could impact data deletion, potential consistency issues allowing multiple 'current' versions, and opportunities to optimize network performance by reducing redundant data transmission. There is also a noted inconsistency between the hard deletion implementation and the soft deletion checks used during retrieval.

Comment thread src/pipelines/weaver.py Outdated
Comment on lines +586 to +590
for match in search_fn(
    filters={"parent_memory_id": parent_memory_id},
    top_k=100,
) or []:
    ids.add(match.id)

high

The use of top_k=100 here creates a hard limit on the number of versions that can be 'forgotten' in a single lineage. If a memory has been updated more than 100 times, older versions will remain in the vector store, potentially violating privacy requirements or causing data leaks. Consider using a much larger top_k or implementing a paginated search if the underlying store supports it.
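
One way to avoid a fixed cap when the store has no pagination is to widen the lookup window until a query returns fewer matches than requested. The sketch below assumes a `search_fn(filters=..., top_k=...)` callable shaped like the one in the snippet above; `Match`, `collect_lineage_ids`, `page_size`, and `max_window` are hypothetical names for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Set


@dataclass
class Match:
    """Stand-in for a vector-store search hit (only the id is needed here)."""
    id: str


def collect_lineage_ids(
    search_fn: Callable[..., Optional[Sequence[Match]]],
    parent_memory_id: str,
    page_size: int = 100,
    max_window: int = 100_000,
) -> Set[str]:
    """Collect every id in a lineage by doubling top_k until the result is short.

    A result shorter than top_k means the store had nothing more to return.
    max_window is a safety stop so a misbehaving store cannot loop forever.
    """
    top_k = page_size
    while True:
        matches = search_fn(
            filters={"parent_memory_id": parent_memory_id}, top_k=top_k
        ) or []
        if len(matches) < top_k or top_k >= max_window:
            return {m.id for m in matches}
        top_k *= 2
```

If the lineage ever reaches `max_window` entries the result is still truncated, so a caller that needs a hard guarantee would have to treat that case as an error.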

Comment thread src/pipelines/weaver.py Outdated
)
return await self._vector_add(op, domain, user_id)

self._mark_memory_superseded(op.embedding_id, previous, new_id)

high

There is a potential consistency issue here. If op.embedding_id refers to a version that is already superseded (e.g., due to a race condition or stale retrieval), this logic will mark that old version as superseded by the new one, but it won't affect the actual current version of the lineage. This results in multiple versions having is_current: True simultaneously. You should ensure that only one version per lineage is marked as current, perhaps by searching for the existing current version of the lineage before performing the update.
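
The suggested fix can be sketched against a plain dictionary of metadata records. This is an illustration under stated assumptions, not the PR's implementation: `supersede_current` and the `records` layout (id mapped to metadata) are hypothetical, and a real store would do the lookup with a metadata-filtered search.

```python
from typing import Dict, List


def supersede_current(
    records: Dict[str, dict], target_id: str, new_id: str
) -> List[str]:
    """Mark every still-current version in target_id's lineage as superseded.

    The lineage is resolved through parent_memory_id, so it does not matter
    whether target_id itself is the current version or a stale one.
    """
    lineage = records[target_id]["parent_memory_id"]
    superseded = []
    for record_id, meta in records.items():
        if record_id == new_id:
            continue  # never supersede the version being written
        if meta["parent_memory_id"] == lineage and meta.get("is_current"):
            meta["is_current"] = False
            meta["superseded_by"] = new_id
            superseded.append(record_id)
    return superseded
```

Resolving by lineage rather than by the incoming ID is what keeps a stale `op.embedding_id` from leaving two versions marked current.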

Comment thread src/pipelines/weaver.py Outdated
Comment on lines +545 to +554
def _set_parent_memory_id(self, memory_id: Optional[str]) -> None:
    if not memory_id:
        return
    try:
        self.vector_store.update(
            id=memory_id,
            metadata={"parent_memory_id": memory_id},
        )
    except Exception as exc:
        logger.warning("Could not set parent_memory_id for %s: %s", memory_id, exc)

medium

This method introduces an unnecessary network round-trip for every new memory added. Since BaseVectorStore.add allows providing explicit IDs, you can generate a UUID in the Weaver, include it in the initial metadata as parent_memory_id, and pass it to the add call. This would eliminate the need for a follow-up update call.
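
A rough sketch of the single-write approach follows. The `add(texts, embeddings, ids, metadata)` keyword signature is an assumption based on how the call appears elsewhere in this thread, and `InMemoryStore` and `add_root_memory` are hypothetical stand-ins for the real store and Weaver code.

```python
import uuid
from typing import Dict, List, Sequence


class InMemoryStore:
    """Toy store that honors caller-supplied ids (assumed add() signature)."""

    def __init__(self) -> None:
        self.docs: Dict[str, dict] = {}
        self.calls = 0  # number of round-trips, to show only one write happens

    def add(
        self,
        texts: Sequence[str],
        embeddings: Sequence[List[float]],
        ids: Sequence[str],
        metadata: Sequence[dict],
    ) -> List[str]:
        self.calls += 1
        for doc_id, text, emb, meta in zip(ids, texts, embeddings, metadata):
            self.docs[doc_id] = {"text": text, "embedding": emb, "metadata": dict(meta)}
        return list(ids)


def add_root_memory(store: InMemoryStore, text: str, embedding: List[float]) -> str:
    """Write a new memory whose parent_memory_id is set in the initial add."""
    new_id = str(uuid.uuid4())
    metadata = {"parent_memory_id": new_id, "version": 1, "is_current": True}
    store.add(texts=[text], embeddings=[embedding], ids=[new_id], metadata=[metadata])
    return new_id
```

Because the ID is known before the write, the lineage root key is part of the first payload and no follow-up `update` is needed.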

Comment thread src/pipelines/weaver.py Outdated
Comment on lines +564 to +572
self.vector_store.update(
    id=memory_id,
    text=previous.get("content"),
    embedding=previous.get("embedding"),
    metadata={
        "is_current": False,
        "superseded_by": superseded_by,
    },
)

medium

Passing text and embedding back to the vector store during a metadata-only update is redundant and increases payload size. Most vector store providers (like Pinecone) allow updating metadata without re-sending the vector data. You should check if your BaseVectorStore implementation supports metadata-only updates and omit these fields if possible.

References
  1. Avoid redundant data transmission to external services to improve performance and reduce costs.
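
For illustration, a metadata-only update contract might behave like the toy store below. Whether the real `BaseVectorStore.update` treats omitted `text` and `embedding` as "leave unchanged" and merges metadata is an assumption here; `FakeVectorStore` is a hypothetical stand-in, not the project's class.

```python
from typing import Dict, List, Optional


class FakeVectorStore:
    """Toy store whose update() only touches the fields that are passed."""

    def __init__(self) -> None:
        self.docs: Dict[str, dict] = {}

    def update(
        self,
        id: str,
        text: Optional[str] = None,
        embedding: Optional[List[float]] = None,
        metadata: Optional[dict] = None,
    ) -> bool:
        doc = self.docs[id]
        if text is not None:
            doc["text"] = text
        if embedding is not None:
            doc["embedding"] = embedding
        if metadata:
            # Merge rather than replace, so existing keys such as
            # version and parent_memory_id survive a partial update.
            doc["metadata"].update(metadata)
        return True
```

With that contract, superseding a version only sends the two changed keys, for example `store.update(id="m1", metadata={"is_current": False, "superseded_by": "m2"})`.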

Comment thread src/pipelines/weaver.py
  async def _vector_delete(self, op: Operation) -> ExecutedOp:
-     success = self.vector_store.delete(ids=[op.embedding_id])
+     ids = self._lineage_ids_for_forget(op.embedding_id)
+     success = self.vector_store.delete(ids=ids)

medium

The DELETE operation performs a hard delete from the vector store, yet the retrieval logic in retrieval.py and judge.py explicitly checks for a forgotten_at metadata field. If the intention of 'forget support' is to allow for soft deletes or auditing, _vector_delete should probably update the lineage with a forgotten_at timestamp instead of calling delete(). If hard deletion is intended, the forgotten_at checks in the retrieval paths are dead code.
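
The two options can be contrasted in a few lines. This is a sketch over a plain dictionary of metadata records; `soft_forget`, `hard_forget`, and `is_active` are hypothetical names, and only the `forgotten_at` and `is_current` keys come from the review.

```python
from datetime import datetime, timezone
from typing import Dict, Iterable, Optional


def soft_forget(
    records: Dict[str, dict], ids: Iterable[str], now: Optional[datetime] = None
) -> None:
    """Keep the records for auditing but stamp them as forgotten."""
    stamp = (now or datetime.now(timezone.utc)).isoformat()
    for record_id in ids:
        records[record_id]["forgotten_at"] = stamp
        records[record_id]["is_current"] = False


def hard_forget(records: Dict[str, dict], ids: Iterable[str]) -> None:
    """Remove the records entirely; nothing is left for retrieval to filter."""
    for record_id in ids:
        records.pop(record_id, None)


def is_active(meta: dict) -> bool:
    """Retrieval-side check; the forgotten_at test only matters with soft_forget."""
    return bool(meta.get("is_current", True) and not meta.get("forgotten_at"))
```

With `hard_forget`, the `forgotten_at` branch in `is_active` can never be true, which is the dead-code concern raised above.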

Author

kodahhhhh commented May 20, 2026

Addressed the Gemini review feedback in 5bc7006:

  • New memory IDs are generated before add and passed through BaseVectorStore.add(..., ids=...), so parent_memory_id is set in the initial write without a follow-up update.
  • Superseding old versions now uses metadata-only vector store updates.
  • Updates first resolve the current active version for the lineage, so stale update operations do not leave multiple is_current versions behind.
  • Forget now collects lineage IDs with a much larger metadata lookup window (10_000) before hard deleting the lineage.
  • Removed the unused forgotten_at soft-delete checks because this implementation intentionally performs hard forget/deletion.
  • Added a regression test for stale updates superseding the active current version.

Tests: .venv/bin/python -m pytest tests/test_deterministic_memory_layer.py tests/integration/test_weaver_pipeline.py -q (11 passed), git diff --check, and compileall on touched files.

@ishaanxgupta
Member

Hi @kodahhhhh, can you please discuss your approach in the issue thread?

greptile-apps Bot commented May 23, 2026

Greptile Summary

This PR adds content-addressed deduplication (SHA-256 hashes), immutable versioning (UPDATE creates a new vector and marks previous versions is_current=False), and full-lineage deletion (DELETE removes every version sharing the same parent_memory_id) to the vector memory layer. All retrieval and judge paths are updated to skip inactive memories by over-fetching and filtering.

  • weaver.py: New lineage helpers (_find_active_duplicate, _lineage_matches, _mark_lineage_superseded, _lineage_ids_for_forget) replace in-place UPDATEs with append-only versioning; ADD seeds a self-referential parent_memory_id as the lineage root key.
  • judge.py / retrieval.py: Both pipelines now over-fetch (top_k × 5 / × 2) and apply an is_current filter to exclude superseded memories from judge comparisons and retrieval results.
  • Tests: FakeVectorStore gains get and partial-metadata-merge support; three new focused tests cover duplicate skipping, stale-ID updates, and full-lineage deletion.
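
The over-fetch-and-filter pattern described for judge.py and retrieval.py can be sketched as follows. The `search_fn(query, top_k=...)` shape, the dictionary result format, and the name `active_results` are assumptions for illustration; treating a missing `is_current` key as active (for records written before versioning) is also an assumption.

```python
from typing import Callable, List, Optional, Sequence


def active_results(
    search_fn: Callable[..., Optional[Sequence[dict]]],
    query: str,
    top_k: int,
    overfetch: int = 5,
) -> List[dict]:
    """Fetch top_k * overfetch candidates, drop superseded ones, keep top_k."""
    candidates = search_fn(query, top_k=top_k * overfetch) or []
    active = [
        c for c in candidates
        if c.get("metadata", {}).get("is_current", True)
    ]
    return active[:top_k]
```

The multiplier is a heuristic: if more than `top_k * (overfetch - 1)` of the nearest candidates are superseded, the caller still gets fewer than `top_k` results.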

Confidence Score: 3/5

The lineage versioning and deduplication logic is well-designed overall, but the batch DELETE path dropped its exception handler when it was refactored from a single-bulk call to a per-op loop, leaving the batch uncleared on error.

The versioning and filtering changes to judge.py and retrieval.py look correct. The concern is in weaver.py: flush_delete_batch no longer wraps _vector_delete in try/except, so any exception from vector_store.get, vector_store.delete, or the lineage lookup escapes execute entirely and leaves delete_batch_ops uncleared. The rest of the implementation is sound.

src/pipelines/weaver.py (specifically flush_delete_batch at line 287) needs a try/except guard to match the error-handling pattern in flush_add_batch.

Important Files Changed

| Filename | Overview |
| --- | --- |
| src/pipelines/weaver.py | Core change: adds SHA-256 content hashing, lineage versioning (ADD creates self-referential root; UPDATE writes a new vector and marks old versions superseded), and full-lineage forget on DELETE. Batch DELETE lost its try/except wrapper, which is a regression. |
| src/agents/judge.py | Adds _active_memory_results helper and applies it at all search return sites; search_by_text and search_by_metadata now over-fetch (top_k * 5) so enough active results survive the inactive filter. Logic looks correct. |
| src/pipelines/retrieval.py | Adds _is_active_memory filter to summary-domain and profile-catalog search loops; over-fetches (top_k * 2) then skips inactive entries. Clean and self-contained change. |
| tests/conftest.py | One-line fix: update now merges metadata instead of replacing it, correctly reflecting the partial-update contract needed by the new lineage marking logic. |
| tests/test_deterministic_memory_layer.py | Extends FakeVectorStore with get, partial-update in update, and optional ids in add; adds three focused tests for duplicate detection, versioned update, and full-lineage delete. |
| tests/integration/test_weaver_pipeline.py | Updates integration assertions to match the new version-per-UPDATE semantics, but replaces the standard import with a fragile importlib.exec_module path-based loader that will break on layout changes. |

Sequence Diagram

sequenceDiagram
    participant C as Caller
    participant W as Weaver.execute
    participant FA as flush_add_batch
    participant FD as flush_delete_batch
    participant VS as VectorStore

    C->>W: execute(JudgeResult)
    W->>FA: flush_add_batch()
    FA->>VS: search_by_metadata(content_hash)
    VS-->>FA: matches
    alt duplicate found
        FA-->>W: OpStatus.SKIPPED
    else new content
        FA->>VS: add(texts, embeddings, ids, metadata)
        VS-->>FA: stored_ids
        FA-->>W: OpStatus.SUCCESS (new_id)
    end
    W->>VS: get(embedding_id)
    VS-->>W: previous doc
    W->>VS: search_by_metadata(parent_memory_id)
    VS-->>W: lineage matches
    alt content_hash unchanged
        W-->>C: OpStatus.SKIPPED
    else new content
        W->>VS: add(new version vector)
        W->>VS: "update(old ids, is_current=False)"
        W-->>C: OpStatus.SUCCESS (new_id)
    end
    W->>FD: flush_delete_batch()
    FD->>VS: get(embedding_id)
    VS-->>FD: target doc
    FD->>VS: search_by_metadata(parent_memory_id)
    VS-->>FD: all versions
    FD->>VS: delete(all lineage ids)
    FD-->>W: OpStatus.SUCCESS / FAILED
    W-->>C: WeaverResult

Comments Outside Diff (2)

  1. src/pipelines/weaver.py, line 245-256 (link)

    P2 Store-returned IDs discarded silently in batch ADD

    ids is re-assigned with the return value from vector_store.add, but successful_ids (the pre-generated UUIDs) are used for ExecutedOp.new_id. If a vector store implementation ignores the supplied ids and generates its own, the new_id recorded in every ExecutedOp will be wrong and subsequent UPDATE/DELETE operations referencing those IDs will fail. Consider using the store-returned IDs instead, or assert that they match.
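
A small reconciliation helper is one way to act on this. The sketch below is hypothetical (`resolve_stored_ids` is not part of the PR) and assumes that an empty or `None` return value means the store accepted the requested IDs as given.

```python
import logging
from typing import List, Optional, Sequence

logger = logging.getLogger(__name__)


def resolve_stored_ids(
    requested: Sequence[str], returned: Optional[Sequence[str]]
) -> List[str]:
    """Pick the ids to record in ExecutedOp.new_id after a batch add.

    The store's answer is treated as authoritative when it is present, so
    later UPDATE/DELETE operations reference ids that actually exist.
    """
    if not returned:
        # Assumption: a store that returns nothing honored the requested ids.
        return list(requested)
    if len(returned) != len(requested):
        raise ValueError(
            f"store returned {len(returned)} ids for {len(requested)} writes"
        )
    if list(returned) != list(requested):
        logger.warning("Vector store ignored supplied ids; using store ids")
    return list(returned)
```

A stricter variant would raise on any mismatch instead of logging, since a self-referential `parent_memory_id` written with the requested ID would no longer point at the stored record.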

  2. tests/integration/test_weaver_pipeline.py, line 565-569 (link)

    P2 Fragile importlib.exec_module replaces a working import

    The original file used from src.pipelines.weaver import Weaver and worked fine. The new approach hardcodes an OS path and re-executes the module outside the normal package system. If the src/ layout changes, or if a CI runner runs tests from a different working directory, weaver_spec.loader.exec_module(weaver_module) will either fail to find the file or re-execute the module in an unexpected package context. There's no apparent benefit over the standard import.

Reviews (1): Last reviewed commit: "Address memory lineage review feedback"

Comment thread src/pipelines/weaver.py
Comment on lines 287 to 291
  if valid_ops:
-     loop = asyncio.get_running_loop()
-     try:
-         success = await loop.run_in_executor(None, partial(self.vector_store.delete, ids=ids_to_delete))
-         status = OpStatus.SUCCESS if success else OpStatus.FAILED
-         for op in valid_ops:
-             executed_ops.append(ExecutedOp(
-                 type=op.type, status=status,
-                 embedding_id=op.embedding_id
-             ))
-     except Exception as exc:
-         logger.error("Vector batch DELETE failed: %s", exc)
-         for op in valid_ops:
-             executed_ops.append(ExecutedOp(
-                 type=op.type, status=OpStatus.FAILED,
-                 embedding_id=op.embedding_id, error=str(exc)
-             ))
+     for op in valid_ops:
+         executed_ops.append(await self._vector_delete(op))

  delete_batch_ops.clear()

P1 Batch DELETE drops exception handling

The old flush_delete_batch wrapped the delete call in try/except and appended OpStatus.FAILED entries on error. The new per-op loop has no exception guard: if _vector_delete raises (e.g., vector_store.get or vector_store.delete throws), the exception escapes flush_delete_batch, the delete_batch_ops.clear() on line 291 is never reached, and the same ops may be reprocessed on the next call. Compare with flush_add_batch, which still has the try/except pattern.
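
One possible shape for the guard is sketched below. It is a standalone illustration, not the Weaver code: `Result`, `flush_delete_batch`, and the `delete_one` coroutine are hypothetical stand-ins for `ExecutedOp`, the real flush function, and `_vector_delete`.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, List, Optional


@dataclass
class Result:
    """Stand-in for ExecutedOp: one outcome per requested delete."""
    embedding_id: str
    status: str
    error: Optional[str] = None


async def flush_delete_batch(
    batch: List[str], delete_one: Callable[[str], Awaitable[bool]]
) -> List[Result]:
    """Delete each id in the batch, recording failures instead of raising."""
    results: List[Result] = []
    try:
        for embedding_id in batch:
            try:
                ok = await delete_one(embedding_id)
                results.append(Result(embedding_id, "success" if ok else "failed"))
            except Exception as exc:
                # One bad op must not abort the rest of the batch.
                results.append(Result(embedding_id, "failed", str(exc)))
    finally:
        # Always clear, so the same ops are not reprocessed on the next call.
        batch.clear()
    return results
```

The per-op `try/except` keeps one failing lineage lookup from failing the whole batch, and the `finally` guarantees the batch is cleared on every path.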

Comment thread src/pipelines/weaver.py
Comment on lines +437 to +443
new_id = self._new_memory_id()
metadata = _memory_metadata(
    op.content,
    domain,
    user_id,
    parent_memory_id=new_id,
)

P2 Version 1 memories are their own parent_memory_id

new_id is generated, placed in parent_memory_id, and then used as the record's own ID. This self-referential root is deliberate (it lets _lineage_matches find all versions of a memory), but adding a comment would prevent future maintainers from "fixing" it incorrectly.

Suggested change
- new_id = self._new_memory_id()
- metadata = _memory_metadata(
-     op.content,
-     domain,
-     user_id,
-     parent_memory_id=new_id,
- )
+ # For a brand-new memory, use the record's own ID as its parent_memory_id.
+ # This self-referential root is the stable lineage key shared by all
+ # future versions of this memory (see _lineage_matches / _lineage_ids_for_forget).
+ new_id = self._new_memory_id()
+ metadata = _memory_metadata(
+     op.content,
+     domain,
+     user_id,
+     parent_memory_id=new_id,
+ )

Development

Successfully merging this pull request may close these issues.

content hashing, memory versioning, memory decay
