Problem
The entity extraction pipeline produces misclassified entity types, duplicate/near-duplicate entities, and nonsensical entities that pollute the graph and degrade recall quality.
Examples from Real Data
Type misclassification:
entity:people:scalemath — ScaleMath is an organization, not a person
entity:people:completed — "completed" is not a person
entity:people:advocacy — "advocacy" is not a person
entity:people:involvement — "involvement" is not a person
entity:people:key-findings — "key-findings" is not a person
entity:people:deployed-automem — not a person
entity:people:config-file-approach — not a person
entity:people:recommended — not a person
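Near-duplicate entities:
entity:people:alex-panagis vs entity:people:alex-panagis-founder — should be one canonical entity
entity:people:alex-beck vs entity:people:alex-beck-s vs entity:people:alex-beck-a — possessive/suffix variants of the same person
entity:people:alex — overly generic, bridges unrelated people (see recall: Full-name entity disambiguation (Alex Panagis vs Alex Beck problem) #71)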
Nonsensical entities:
entity:people:word — what is this?
entity:people:ud83d-udc4d — this is a Unicode emoji escape, not a person
entity:people:falkor — partial extraction from "FalkorDB"
Impact
Each bad entity becomes a node in the graph that can be traversed during expansion, creating false connections between unrelated memories. The entity:people:alex node alone connects two completely different people (Alex Panagis and Alex Beck), causing major recall pollution.
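The bridging effect can be reproduced with a toy model. This is only a sketch: the memory IDs, the in-memory dictionaries, and the `expand` helper are hypothetical stand-ins for the real graph store and expansion logic, which this issue does not show.

```python
from collections import defaultdict

# Hypothetical memories and their extracted entities (illustration only).
memory_entities = {
    "mem-1": {"entity:people:alex-panagis", "entity:people:alex"},
    "mem-2": {"entity:people:alex-beck", "entity:people:alex"},
}


def build_index(memories):
    """Invert memory -> entities into entity -> memories."""
    index = defaultdict(set)
    for memory_id, entities in memories.items():
        for entity in entities:
            index[entity].add(memory_id)
    return index


def expand(seed, memories):
    """One-hop expansion: every other memory sharing an entity with the seed."""
    index = build_index(memories)
    related = set()
    for entity in memories[seed]:
        related |= index[entity]
    related.discard(seed)
    return related


# With the generic node present, the two unrelated memories are linked.
with_generic = expand("mem-1", memory_entities)

# Dropping the generic node removes the false connection.
cleaned = {
    memory_id: entities - {"entity:people:alex"}
    for memory_id, entities in memory_entities.items()
}
without_generic = expand("mem-1", cleaned)
```

Here `with_generic` contains `mem-2` only because both memories carry entity:people:alex, while `without_generic` is empty.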
Proposed Solutions
- Post-extraction validation: Run extracted entities through a validation step that checks for common patterns (single common words, Unicode escapes, possessive suffixes, etc.)
- Entity type verification: Cross-reference extracted entities against the content to verify type classification. "ScaleMath specializes in B2B SaaS" should not produce a people entity.
- Canonical entity merging: When a new entity is extracted that's a substring or variant of an existing entity (e.g. alex-panagis-founder when alex-panagis exists), merge into the canonical form.
- Minimum entity quality threshold: Reject single-word generic entities (entity:people:alex, entity:people:completed) when more specific alternatives exist.
- Periodic graph cleanup job: Admin endpoint that scans for and merges/removes low-quality entity nodes.
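A minimal sketch of what the validation and merging steps might look like, operating on the slug part of an entity ID (the text after entity:people:). The function names, the `GENERIC_WORDS` stop list, and the two-token minimum for canonical forms are assumptions made for illustration, not the pipeline's existing code. Type verification (e.g. scalemath) and partial extractions (e.g. falkor) need the source content and are not covered here.

```python
import re
from typing import Optional, Set

# Hypothetical stop list built from the examples in this issue;
# a real list would need tuning against production data.
GENERIC_WORDS = {"word", "completed", "advocacy", "involvement", "recommended"}

# An escaped UTF-16 surrogate pair in slug form, e.g. "ud83d-udc4d".
# Restricting to surrogate ranges avoids rejecting names that merely
# start with "u" followed by hex-looking letters.
SURROGATE_PAIR_RE = re.compile(r"ud[89ab][0-9a-f]{2}-ud[c-f][0-9a-f]{2}")


def validate_slug(slug: str) -> Optional[str]:
    """Return a rejection reason, or None if the slug passes these checks."""
    if SURROGATE_PAIR_RE.search(slug):
        return "unicode-escape"
    if slug in GENERIC_WORDS:
        return "generic-word"
    return None


def canonicalize(slug: str, existing: Set[str]) -> str:
    """Map a variant slug onto the longest existing prefix of 2+ tokens."""
    tokens = slug.split("-")
    # Try progressively shorter token prefixes. Stopping at two tokens
    # means a bare first name such as "alex" never absorbs a full name.
    for end in range(len(tokens) - 1, 1, -1):
        candidate = "-".join(tokens[:end])
        if candidate in existing:
            return candidate
    return slug


known = {"alex", "alex-panagis", "alex-beck"}
merged_founder = canonicalize("alex-panagis-founder", known)
merged_possessive = canonicalize("alex-beck-s", known)
```

With these inputs, `merged_founder` resolves to alex-panagis and `merged_possessive` to alex-beck. Token-prefix matching is a deliberately narrow reading of "substring or variant": it would also merge two distinct people whose names share a two-token prefix, so a production version would likely need an extra check before merging.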