Product Catalogue Deduplication Pipeline

Overview

This project builds a three-stage filtering pipeline that identifies and merges duplicate products across multiple datasets to produce a master catalogue. The pipeline removed 34.3% of products (37,103 out of 108,172), reducing the final catalogue to 71,069 unique products.

Methodology

Three-Stage Approach

The deduplication strategy consists of three stages:

  1. Exact Product Name Check

    • Finds all products that share exactly the same name.
    • Removed: 1,832 products
    • Example: "3veta" = "3veta"
  2. Fuzzy Product Name Matching

    • Slightly more sophisticated than stage 1: it identifies spelling and formatting variations of the same name.
    • Algorithm: each pair of names receives a similarity score; pairs scoring at or above a user-defined threshold are treated as duplicates (a minimal sketch follows this list).
    • Threshold: 80% (fuzzywuzzy.token_sort_ratio)
    • Removed: 12,504 additional products beyond those removed in stage 1
    • Example: "101 Blockchains" ≈ "101-blockchains"
  3. Semantic Similarity Matching

    • The most sophisticated of the three stages.
    • Identifies conceptually similar products using AI embeddings.
    • Embedding model: sentence-transformers (all-MiniLM-L6-v2)
    • Columns used: name + description + category
    • Threshold: 70% cosine similarity
    • Optimization: FAISS for fast similarity search
    • Removed: 23,623 additional products beyond those removed in stage 2
    • Examples: "adobe-photoshop" ≈ "photoroom" ≈ "adobe-photoshop-express" ≈ "photoshop-elements"
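
A minimal sketch of stages 1 and 2 is shown below. The file name, the name normalisation, and the helper function are illustrative assumptions; the actual logic lives in main.ipynb.

import pandas as pd
from fuzzywuzzy import fuzz

df = pd.read_csv("products.csv")  # hypothetical input file

# Stage 1: exact name check -- keep the first occurrence of each name.
# Lower-casing/stripping is an assumption about how names are normalised.
df["name_key"] = df["product_name"].str.strip().str.lower()
df = df.drop_duplicates(subset="name_key", keep="first")

# Stage 2: fuzzy name matching with token_sort_ratio at a threshold of 80.
# A full pairwise scan is O(n^2); this only illustrates the scoring rule.
def is_fuzzy_duplicate(name_a: str, name_b: str, threshold: int = 80) -> bool:
    return fuzz.token_sort_ratio(name_a, name_b) >= threshold

print(is_fuzzy_duplicate("101 Blockchains", "101-blockchains"))  # True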

Key Technical Decisions

The pipeline is implemented as a Jupyter notebook so the code can be read step by step alongside explanatory notes.

Why 3-stage approach?

  • Each method is effective at a different level of duplication, from trivial to subtle.
  • Running the stages sequentially is computationally efficient: the cheap exact and fuzzy checks shrink the dataset before the more expensive semantic stage (exact → fuzzy → semantic, in order of increasing cost).

Why These Thresholds?

  • Fuzzy (80%): Balances precision and recall for name variations; the value was chosen by manually inspecting matches at different thresholds.
  • Semantic (70%): Captures meaningful conceptual similarity without being overly strict.

Why FAISS for Semantic Matching?

  • Far faster than implementing semantic similarity search from scratch.
  • FAISS is written in C++ and exploits the hardware efficiently, making nearest-neighbour search over ~95K embeddings practical (a sketch follows below).
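
Below is a minimal sketch of the semantic stage, assuming the input columns match the format described under Usage and that embeddings are L2-normalised so an inner-product FAISS index behaves as a cosine-similarity search. The file name and the k value are illustrative.

import faiss
import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer

df = pd.read_csv("products.csv")  # hypothetical input file
texts = (df["product_name"].fillna("") + " "
         + df["description"].fillna("") + " "
         + df["category"].fillna("")).tolist()

model = SentenceTransformer("all-MiniLM-L6-v2")
# Normalised embeddings make inner product equal to cosine similarity.
embeddings = model.encode(texts, normalize_embeddings=True, show_progress_bar=True)
embeddings = np.asarray(embeddings, dtype="float32")

index = faiss.IndexFlatIP(embeddings.shape[1])  # exact inner-product index
index.add(embeddings)

# For each product, look at its nearest neighbours; any neighbour other than
# itself with cosine similarity >= 0.70 is flagged as a semantic duplicate.
scores, ids = index.search(embeddings, k=5)
duplicate_pairs = [
    (i, int(j))
    for i, (row_scores, row_ids) in enumerate(zip(scores, ids))
    for s, j in zip(row_scores, row_ids)
    if j != i and s >= 0.70
]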

Results

Pipeline Performance

Stage 1: 108,172 → 107,196 products (exact matching)
Stage 2: 107,196 → 94,692 products (fuzzy matching)  
Stage 3: 94,692 → 71,069 products (semantic matching)
Total Reduction: 37,103 products (34.3%)

Catalogue Statistics

  • Final unique products: 71,069
  • Products with duplicates found: 11,722 (16% of catalogue)
  • Average duplicates per product: 1.33

Implementation Details

Dependencies

pandas
numpy
sentence-transformers
fuzzywuzzy
faiss-cpu  # or faiss-gpu
tqdm
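
These can typically be installed with pip (use faiss-gpu instead of faiss-cpu if a GPU build is preferred):

pip install pandas numpy sentence-transformers fuzzywuzzy faiss-cpu tqdm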

Runtime Performance

  • Total pipeline time: ~6 minutes
  • Fuzzy matching: ~20 seconds
  • Embedding creation: ~1 minute 43 seconds (94K products)
  • FAISS similarity search: ~2 minutes 45 seconds

Limitations & Future Improvements

Current Limitations

  1. Threshold Sensitivity: Performance depends on manual threshold tuning
  2. Computational Cost: Semantic matching requires significant resources, but the additional duplicates it catches justify the cost.

Potential Improvements

  1. Adaptive Thresholds: Machine learning-based threshold optimization
  2. Product Updates: A pipeline for adding new products without fully reprocessing the catalogue

Usage

Running the Pipeline

# Run the Jupyter notebook
jupyter notebook main.ipynb

Input Format

  • CSV files with columns: product_name, description, category
  • Missing values in descriptions are handled automatically

Output

  • final_master_catalogue.csv: Final deduplicated product catalogue
  • Includes metadata: duplicate_count, deduplication_method
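
As an illustrative check (a sketch, assuming the column names listed above), the metadata can be inspected with pandas:

import pandas as pd

catalogue = pd.read_csv("final_master_catalogue.csv")
print(len(catalogue))  # expected: 71,069 unique products
print(catalogue[["duplicate_count", "deduplication_method"]].head())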

Evaluation Metrics

  • Precision: High-quality matches with minimal false positives
  • Recall: Comprehensive coverage across different duplicate types
  • Efficiency and Scalability: The algorithms were chosen primarily for speed and scalability.

Author: Chortis Alexandros
Framework: Python 3.10, Jupyter Notebook
