
Allow updating existing documents by removing duplicates before upload #112


Draft · wants to merge 2 commits into main

Conversation

@Copilot Copilot AI commented May 30, 2025

Currently, uploading the same document twice creates duplicate chunks in the vector database. This PR adds deduplication: all existing chunks with the same filename are removed before the new ones are added, so a document can be updated without creating duplicates.

Changes Made

Azure CosmosDB NoSQL Vector Store Path

  • Added deletion by SQL filter before adding new documents
  • Uses the SQL query SELECT * FROM c WHERE c.metadata.source = "${filename}" to find existing documents
  • Includes error handling for cases where the container doesn't exist yet (first upload)
  • Properly escapes filenames containing quotes to prevent SQL injection
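The steps above can be sketched as follows. This is an illustrative version, not the PR's exact code: `buildDeleteFilter`, `deleteExistingChunks`, and the minimal store interface are hypothetical names standing in for the real vector store API.

```typescript
// Minimal stand-in for the vector store's delete-by-filter API (assumption).
interface DeletableStore {
  delete(params: { filter: string }): Promise<void>;
}

// Build the SQL filter, escaping double quotes in the filename so a name
// like `my "draft".pdf` cannot break out of the string literal.
function buildDeleteFilter(filename: string): string {
  const escaped = filename.replace(/"/g, '\\"');
  return `SELECT * FROM c WHERE c.metadata.source = "${escaped}"`;
}

// Delete any existing chunks for this filename, tolerating a missing
// container on the very first upload.
async function deleteExistingChunks(store: DeletableStore, filename: string): Promise<void> {
  try {
    await store.delete({ filter: buildDeleteFilter(filename) });
  } catch {
    // Container does not exist yet (first upload): nothing to delete.
  }
}
```

Swallowing the error keeps the first upload path simple; a stricter version could inspect the error code and only ignore "container not found".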

FAISS Vector Store Path (Local Development)

  • Added removeDuplicateDocuments() helper function to find and delete existing documents
  • Iterates through the document mapping to find documents with matching metadata.source
  • Deletes documents by their IDs using the FAISS delete() method
  • Only runs when an existing FAISS store is loaded
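A sketch of the removeDuplicateDocuments() helper described above. The FaissLikeStore interface is an assumption standing in for LangChain.js's FaissStore, whose actual document-mapping API may differ:

```typescript
interface StoredDoc {
  metadata: { source: string };
}

// Assumed shape of the FAISS store: an id -> document mapping plus
// delete-by-ids, mirroring the behavior described in the PR.
interface FaissLikeStore {
  getMapping(): Record<string, StoredDoc>;
  delete(params: { ids: string[] }): Promise<void>;
}

async function removeDuplicateDocuments(store: FaissLikeStore, filename: string): Promise<void> {
  // Find every chunk whose metadata.source matches the uploaded filename.
  const ids = Object.entries(store.getMapping())
    .filter(([, doc]) => doc.metadata.source === filename)
    .map(([id]) => id);

  // Delete matching chunks so re-uploading the file replaces them.
  if (ids.length > 0) {
    await store.delete({ ids });
  }
}
```

The `ids.length > 0` guard avoids calling delete() with an empty list on a file's first upload.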

Implementation Details

Before this change:

```js
// Azure CosmosDB path
await AzureCosmosDBNoSQLVectorStore.fromDocuments(documents, embeddings, { credentials });

// FAISS path
const store = await FaissStore.load(faissStoreFolder, embeddings);
await store.addDocuments(documents);
```

After this change:

```js
// Azure CosmosDB path
const store = await AzureCosmosDBNoSQLVectorStore.fromDocuments([], embeddings, { credentials });
await store.delete({ filter: `SELECT * FROM c WHERE c.metadata.source = "${filename}"` });
await store.addDocuments(documents);

// FAISS path
const store = await FaissStore.load(faissStoreFolder, embeddings);
await removeDuplicateDocuments(store, filename);
await store.addDocuments(documents);
```

The solution is minimal and preserves all existing functionality while adding the requested deduplication capability.

Fixes #36.

Warning

Firewall rules blocked me from connecting to one or more addresses

I tried to connect to the following addresses, but was blocked by firewall rules:

  • functionscdn.azureedge.net
    • Triggering command: node lib/install.js (dns block)

If you need me to access, download, or install something from one of these locations, let me know.



@Copilot Copilot AI changed the title [WIP] Allow updating existing documents Allow updating existing documents by removing duplicates before upload May 30, 2025
@Copilot Copilot AI requested a review from sinedied May 30, 2025 09:55
Copilot finished work on behalf of sinedied May 30, 2025 09:55

This PR is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed.

@github-actions github-actions bot added the Stale label Jul 30, 2025
@sinedied sinedied removed the Stale label Jul 30, 2025
Successfully merging this pull request may close these issues.

Allow updating existing documents
2 participants