This project builds a 3-stage deduplication pipeline that identifies and merges duplicate products from multiple datasets into a single master catalogue. The pipeline deduplicated 34.3% of the products (37,103 out of 108,172), reducing the final catalogue to 71,069 unique products.
The deduplication strategy consists of three stages, described below:
Stage 1: Exact Product Name Check
- Finds all products that share exactly the same name (see the code sketch below).
- Removed: 1,832 products
- Example: "3veta" = "3veta"
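The exact-name stage can be illustrated with a short pandas sketch. The DataFrame and column names below (`products`, `product_name`) are illustrative assumptions, not the notebook's actual variables.

```python
import pandas as pd

# Illustrative data; the real pipeline loads the product CSVs instead.
products = pd.DataFrame({
    "product_name": ["3veta", "3veta", "101 Blockchains"],
    "description": ["Scheduling tool", "Scheduling tool", "Blockchain courses"],
    "category": ["productivity", "productivity", "education"],
})

# Keep only the first occurrence of each exact product name.
deduped = products.drop_duplicates(subset="product_name", keep="first")

print(f"Removed {len(products) - len(deduped)} exact-name duplicates")  # -> 1
```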
Stage 2: Fuzzy Product Name Matching
- More sophisticated than Stage 1: it catches spelling and formatting variations of the same name.
- Algorithm: computes a similarity score for each pair of names; pairs scoring at or above the chosen threshold are treated as duplicates (see the code sketch below).
- Threshold: 80%, using fuzzywuzzy.token_sort_ratio
- Removed: 12,504 additional products (on top of Stage 1)
- Example: "101 Blockchains" ≈ "101-blockchains"
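A minimal sketch of the fuzzy stage, assuming a plain pairwise comparison with `fuzzywuzzy.fuzz.token_sort_ratio` and the 80% threshold; on ~100K products the notebook presumably avoids a full O(n²) loop, so treat this as an illustration of the scoring logic only.

```python
from fuzzywuzzy import fuzz

FUZZY_THRESHOLD = 80  # token_sort_ratio score (0-100)

names = ["101 Blockchains", "101-blockchains", "3veta", "Adobe Photoshop"]

# Compare every pair of names and mark the later occurrence of any pair
# scoring at or above the threshold as a duplicate.
duplicate_idx = set()
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        if fuzz.token_sort_ratio(names[i], names[j]) >= FUZZY_THRESHOLD:
            duplicate_idx.add(j)

kept = [name for idx, name in enumerate(names) if idx not in duplicate_idx]
print(kept)  # "101-blockchains" is dropped as a near-duplicate of "101 Blockchains"
```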
Stage 3: Semantic Similarity Matching
- The most sophisticated of the three stages.
- Identifies similar products using AI embeddings (see the code sketch below).
- Embedding model: sentence-transformers (all-MiniLM-L6-v2)
- Columns used: name + description + category
- Threshold: 70% cosine similarity
- Optimization: FAISS for similarity search
- Removed: 23,623 additional products (on top of Stage 2)
- Examples: "adobe-photoshop" ≈ "photoroom" ≈ "adobe-photoshop-express" ≈ "photoshop-elements"
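A minimal sketch of the embedding side of this stage, assuming the name, description, and category columns are concatenated into one text per product and compared by cosine similarity against the 0.70 threshold; the variable names and example products are illustrative.

```python
from sentence_transformers import SentenceTransformer
import numpy as np

SEMANTIC_THRESHOLD = 0.70  # cosine similarity cut-off
model = SentenceTransformer("all-MiniLM-L6-v2")

# Each product is represented by its name + description + category.
products = [
    {"name": "adobe-photoshop", "description": "Image editing software", "category": "design"},
    {"name": "photoshop-elements", "description": "Consumer photo editing tool", "category": "design"},
]
texts = [f"{p['name']} {p['description']} {p['category']}" for p in products]

# With L2-normalised embeddings, cosine similarity is a simple dot product.
embeddings = model.encode(texts, normalize_embeddings=True)
similarity = float(np.dot(embeddings[0], embeddings[1]))

if similarity >= SEMANTIC_THRESHOLD:
    print(f"Likely duplicates (cosine similarity = {similarity:.2f})")
```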
The pipeline is implemented as a Jupyter notebook so the code can be read step by step alongside explanatory notes.
Why a 3-stage approach?
- Each method targets a different class of duplicates (identical names, near-identical names, conceptually similar products).
- Running the stages sequentially is computationally efficient: the cheap exact check shrinks the dataset before the more expensive fuzzy and semantic stages run (exact → fuzzy → semantic, in increasing order of cost).
Why These Thresholds?
- Fuzzy (80%): balances precision and recall for name variations; the value was chosen after manually inspecting matches at several threshold settings.
- Semantic (70%): captures meaningful conceptual similarity without being so strict that genuine duplicates are missed.
Why FAISS for Semantic Matching?
- Far faster than implementing nearest-neighbour search over the embeddings from scratch.
- FAISS is written in C++ and exploits hardware-level optimizations, which makes the similarity search much faster (see the sketch below).
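The speed-up comes from FAISS doing the neighbour search in batched, vectorised C++ rather than Python loops. A minimal sketch, assuming L2-normalised float32 embeddings (so inner product equals cosine similarity) and an exact IndexFlatIP index; the array shapes and k value are illustrative.

```python
import faiss
import numpy as np

# Stand-in for the (n_products, 384) all-MiniLM-L6-v2 embedding matrix.
rng = np.random.default_rng(0)
embeddings = rng.standard_normal((1000, 384)).astype("float32")
faiss.normalize_L2(embeddings)  # after normalisation, inner product == cosine

index = faiss.IndexFlatIP(embeddings.shape[1])  # exact inner-product search
index.add(embeddings)

# One batched call returns each product's 10 nearest neighbours.
scores, neighbours = index.search(embeddings, 10)

# Candidate duplicate pairs: neighbours (other than the product itself)
# whose cosine similarity clears the 0.70 threshold.
pairs = [
    (i, int(j))
    for i, (row_scores, row_ids) in enumerate(zip(scores, neighbours))
    for score, j in zip(row_scores, row_ids)
    if j != i and score >= 0.70
]
print(f"{len(pairs)} candidate duplicate pairs")
```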
Stage 1: 108,172 → 107,196 products (exact matching)
Stage 2: 107,196 → 94,692 products (fuzzy matching)
Stage 3: 94,692 → 71,069 products (semantic matching)
Total Reduction: 37,103 products (34.3%)
- Final unique products: 71,069
- Products with duplicates found: 11,722 (16% of catalogue)
- Average duplicates per product: 1.33
```
pandas
numpy
sentence-transformers
fuzzywuzzy
faiss-cpu  # or faiss-gpu
tqdm
```
- Total pipeline time: ~6 minutes
- Fuzzy matching: ~20 seconds
- Embedding creation: ~1 min 43 s (94K products)
- FAISS similarity search: ~2 min 45 s
- Threshold Sensitivity: Performance depends on manual threshold tuning
- Computational Cost: semantic matching requires significant resources, but the extra duplicates it catches justify the cost.
- Adaptive Thresholds: Machine learning-based threshold optimization
- Incremental Product Updates: a pipeline for adding new products without reprocessing the full catalogue
```
# Run the Jupyter notebook
jupyter notebook main.ipynb
```
- Input: CSV files with the columns product_name, description, category
- Missing values in descriptions are handled automatically (see the loading sketch below)
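A minimal loading sketch, assuming a single input file named `products.csv` (the file name is hypothetical) and that "handled automatically" means missing descriptions are replaced with an empty string before the matching text is built.

```python
import pandas as pd

# Hypothetical file name; the real pipeline may read several CSVs.
df = pd.read_csv("products.csv", usecols=["product_name", "description", "category"])

# Replace missing descriptions so downstream string operations never fail.
df["description"] = df["description"].fillna("")
```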
- Output: final_master_catalogue.csv, the final deduplicated product catalogue
- Includes metadata columns: duplicate_count, deduplication_method
- Precision: High-quality matches with minimal false positives
- Recall: Comprehensive coverage across different duplicate types
- Efficiency and Scalability: the algorithms were chosen specifically to be fast, scalable, and efficient
Author: Chortis Alexandros
Framework: Python 3.10, Jupyter Notebook