enrichment: Entity extraction produces misclassified types and duplicate entities #72

@jack-arturo

Problem

The entity extraction pipeline produces misclassified entity types, duplicate/near-duplicate entities, and nonsensical entities that pollute the graph and degrade recall quality.

Examples from Real Data

Type misclassification:

  • entity:people:scalemath — ScaleMath is an organization, not a person
  • entity:people:completed — "completed" is not a person
  • entity:people:advocacy — "advocacy" is not a person
  • entity:people:involvement — "involvement" is not a person
  • entity:people:key-findings — "key-findings" is not a person
  • entity:people:deployed-automem — not a person
  • entity:people:config-file-approach — not a person
  • entity:people:recommended — not a person

Near-duplicate entities:

Nonsensical entities:

  • entity:people:word — what is this?
  • entity:people:ud83d-udc4d — this is a Unicode emoji escape, not a person
  • entity:people:falkor — partial extraction from "FalkorDB"

Impact

Each bad entity becomes a node in the graph that can be traversed during expansion, creating false connections between unrelated memories. The entity:people:alex node alone connects two completely different people (Alex Panagis and Alex Beck), causing major recall pollution.
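To make the failure mode concrete, here is a minimal sketch of one-hop expansion through shared entity nodes. The memory IDs, the tag assignments, and the `expand` helper are all invented for illustration; they are not taken from the actual codebase or data.

```python
from collections import defaultdict

# Hypothetical data: which entity tags each memory carries.
memory_entities = {
    "memory-1": {"entity:people:alex"},          # imagine this is about Alex Panagis
    "memory-2": {"entity:people:alex"},          # imagine this is about Alex Beck
    "memory-3": {"entity:people:alex-panagis"},  # specific tag, no collision
}


def build_entity_index(memory_entities):
    """Invert the mapping: entity node -> set of memories attached to it."""
    index = defaultdict(set)
    for memory_id, entities in memory_entities.items():
        for entity in entities:
            index[entity].add(memory_id)
    return index


def expand(memory_id, memory_entities, index):
    """Return memories reachable in one hop through any shared entity node."""
    related = set()
    for entity in memory_entities[memory_id]:
        related |= index[entity]
    related.discard(memory_id)
    return related


index = build_entity_index(memory_entities)
print(expand("memory-1", memory_entities, index))  # {'memory-2'}
```

In this toy setup the generic `alex` node links two memories about different people, while the more specific `alex-panagis` node links nothing it should not.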

Proposed Solutions

  1. Post-extraction validation: Run extracted entities through a validation step that rejects known bad patterns (single common words, Unicode escapes, possessive suffixes, etc.)

  2. Entity type verification: Cross-reference extracted entities against the content to verify type classification. "ScaleMath specializes in B2B SaaS" should not produce a people entity.

  3. Canonical entity merging: When a new entity is extracted that's a substring or variant of an existing entity (e.g. alex-panagis-founder when alex-panagis exists), merge into the canonical form.

  4. Minimum entity quality threshold: Reject single-word generic entities (entity:people:alex, entity:people:completed) when more specific alternatives exist.

  5. Periodic graph cleanup job: Add an admin endpoint that scans for low-quality entity nodes and merges or removes them.
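As a rough sketch of how proposals 1 and 3 might look, the snippet below rejects a few of the patterns listed in this issue and maps a variant slug onto an existing canonical one. Everything here is an assumption for discussion: the function names `rejection_reason` and `canonical_slug`, the `GENERIC_WORDS` stoplist (seeded from the examples above), the `entity:<type>:<slug>` ID layout, and the two-token minimum for canonical forms. Possessive handling is left out because the slugified form of a possessive is not shown in the issue.

```python
import re

# Assumed stoplist, seeded from the bad entities listed in this issue.
GENERIC_WORDS = {"completed", "advocacy", "involvement", "recommended", "word"}

# Matches slugified UTF-16 escape fragments such as "ud83d-udc4d".
UNICODE_ESCAPE = re.compile(r"^u[0-9a-f]{4}(-u[0-9a-f]{4})*$")


def rejection_reason(entity_id):
    """Return why an entity ID should be rejected, or None if it passes."""
    parts = entity_id.split(":")
    if len(parts) != 3 or parts[0] != "entity":
        return "malformed id"
    slug = parts[2]
    if UNICODE_ESCAPE.match(slug):
        return "unicode escape"
    if slug in GENERIC_WORDS:
        return "generic word"
    return None


def canonical_slug(new_slug, existing_slugs, min_tokens=2):
    """Return the longest existing slug that is a token prefix of new_slug.

    Existing slugs with fewer than min_tokens tokens are ignored, so a
    generic "alex" never absorbs "alex-panagis" or "alex-beck".
    Falls back to new_slug when nothing qualifies.
    """
    new_tokens = new_slug.split("-")
    best_tokens = None
    for slug in existing_slugs:
        tokens = slug.split("-")
        if len(tokens) < min_tokens or len(tokens) >= len(new_tokens):
            continue
        if new_tokens[: len(tokens)] == tokens:
            if best_tokens is None or len(tokens) > len(best_tokens):
                best_tokens = tokens
    return "-".join(best_tokens) if best_tokens else new_slug


existing = {"alex", "alex-panagis", "alex-beck"}
print(rejection_reason("entity:people:ud83d-udc4d"))       # unicode escape
print(rejection_reason("entity:people:completed"))         # generic word
print(canonical_slug("alex-panagis-founder", existing))    # alex-panagis
```

Pattern checks like these would not catch `scalemath`, `falkor`, or `key-findings`; those cases likely need the type verification in proposal 2, since nothing in the slug alone marks them as wrong.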

Labels: enhancement