This project builds a 3-stage deduplication pipeline that identifies and merges duplicate products from multiple datasets into a single master catalogue. The pipeline deduplicated 34.3% of the products (37,103 out of 108,172), reducing the final catalogue to 71,069 unique products.
The deduplication strategy consists of three stages, described below:
Stage 1: Exact Product Name Check
- Finds all products that share exactly the same name (see the code sketch below).
- Removed: 1,832 products
- Example: "3veta" = "3veta"
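The exact-name stage can be illustrated with a short pandas sketch. The DataFrame and column names below (`products`, `product_name`) are illustrative assumptions, not the notebook's actual variables.

```python
import pandas as pd

# Illustrative data; the real pipeline loads the product CSVs instead.
products = pd.DataFrame({
    "product_name": ["3veta", "3veta", "101 Blockchains"],
    "description": ["Scheduling tool", "Scheduling tool", "Blockchain courses"],
    "category": ["productivity", "productivity", "education"],
})

# Keep only the first occurrence of each exact product name.
deduped = products.drop_duplicates(subset="product_name", keep="first")

print(f"Removed {len(products) - len(deduped)} exact-name duplicates")  # -> 1
```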
Stage 2: Fuzzy Product Name Matching
- More sophisticated than Stage 1: it catches spelling and formatting variations of the same name.
- Algorithm: computes a similarity score for each pair of names; pairs scoring at or above the chosen threshold are treated as duplicates (see the code sketch below).
- Threshold: 80%, using fuzzywuzzy.token_sort_ratio
- Removed: 12,504 additional products (on top of Stage 1)
- Example: "101 Blockchains" ≈ "101-blockchains"
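A minimal sketch of the fuzzy stage, assuming a plain pairwise comparison with `fuzzywuzzy.fuzz.token_sort_ratio` and the 80% threshold; on ~100K products the notebook presumably avoids a full O(n²) loop, so treat this as an illustration of the scoring logic only.

```python
from fuzzywuzzy import fuzz

FUZZY_THRESHOLD = 80  # token_sort_ratio score (0-100)

names = ["101 Blockchains", "101-blockchains", "3veta", "Adobe Photoshop"]

# Compare every pair of names and mark the later occurrence of any pair
# scoring at or above the threshold as a duplicate.
duplicate_idx = set()
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        if fuzz.token_sort_ratio(names[i], names[j]) >= FUZZY_THRESHOLD:
            duplicate_idx.add(j)

kept = [name for idx, name in enumerate(names) if idx not in duplicate_idx]
print(kept)  # "101-blockchains" is dropped as a near-duplicate of "101 Blockchains"
```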
Stage 3: Semantic Similarity Matching
- The most sophisticated of the three stages.
- Identifies similar products using AI embeddings (see the code sketch below).
- Embedding model: sentence-transformers (all-MiniLM-L6-v2)
- Columns used: name + description + category
- Threshold: 70% cosine similarity
- Optimization: FAISS for similarity search
- Removed: 23,623 additional products (on top of Stage 2)
- Examples: "adobe-photoshop" ≈ "photoroom" ≈ "adobe-photoshop-express" ≈ "photoshop-elements"
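A minimal sketch of the embedding side of this stage, assuming the name, description, and category columns are concatenated into one text per product and compared by cosine similarity against the 0.70 threshold; the variable names and example products are illustrative.

```python
from sentence_transformers import SentenceTransformer
import numpy as np

SEMANTIC_THRESHOLD = 0.70  # cosine similarity cut-off
model = SentenceTransformer("all-MiniLM-L6-v2")

# Each product is represented by its name + description + category.
products = [
    {"name": "adobe-photoshop", "description": "Image editing software", "category": "design"},
    {"name": "photoshop-elements", "description": "Consumer photo editing tool", "category": "design"},
]
texts = [f"{p['name']} {p['description']} {p['category']}" for p in products]

# With L2-normalised embeddings, cosine similarity is a simple dot product.
embeddings = model.encode(texts, normalize_embeddings=True)
similarity = float(np.dot(embeddings[0], embeddings[1]))

if similarity >= SEMANTIC_THRESHOLD:
    print(f"Likely duplicates (cosine similarity = {similarity:.2f})")
```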
The pipeline is implemented as a Jupyter notebook so the code can be read step by step alongside explanatory notes.
Why a 3-stage approach?
- Each method targets a different class of duplicates (identical names, near-identical names, conceptually similar products).
- Running the stages sequentially is computationally efficient: the cheap exact check shrinks the dataset before the more expensive fuzzy and semantic stages run (exact → fuzzy → semantic, in increasing order of cost).
Why These Thresholds?
- Fuzzy (80%): balances precision and recall for name variations; the value was chosen after manually inspecting matches at several threshold settings.
- Semantic (70%): captures meaningful conceptual similarity without being so strict that genuine duplicates are missed.
Why FAISS for Semantic Matching?
- Far faster than implementing nearest-neighbour search over the embeddings from scratch.
- FAISS is written in C++ and exploits hardware-level optimizations, which makes the similarity search much faster (see the sketch below).
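The speed-up comes from FAISS doing the neighbour search in batched, vectorised C++ rather than Python loops. A minimal sketch, assuming L2-normalised float32 embeddings (so inner product equals cosine similarity) and an exact IndexFlatIP index; the array shapes and k value are illustrative.

```python
import faiss
import numpy as np

# Stand-in for the (n_products, 384) all-MiniLM-L6-v2 embedding matrix.
rng = np.random.default_rng(0)
embeddings = rng.standard_normal((1000, 384)).astype("float32")
faiss.normalize_L2(embeddings)  # after normalisation, inner product == cosine

index = faiss.IndexFlatIP(embeddings.shape[1])  # exact inner-product search
index.add(embeddings)

# One batched call returns each product's 10 nearest neighbours.
scores, neighbours = index.search(embeddings, 10)

# Candidate duplicate pairs: neighbours (other than the product itself)
# whose cosine similarity clears the 0.70 threshold.
pairs = [
    (i, int(j))
    for i, (row_scores, row_ids) in enumerate(zip(scores, neighbours))
    for score, j in zip(row_scores, row_ids)
    if j != i and score >= 0.70
]
print(f"{len(pairs)} candidate duplicate pairs")
```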
Stage 1: 108,172 → 107,196 products (exact matching)
Stage 2: 107,196 → 94,692 products (fuzzy matching)
Stage 3: 94,692 → 71,069 products (semantic matching)
Total Reduction: 37,103 products (34.3%)
- Final unique products: 71,069
- Products with duplicates found: 11,722 (16% of catalogue)
- Average duplicates per product: 1.33
```
pandas
numpy
sentence-transformers
fuzzywuzzy
faiss-cpu  # or faiss-gpu
tqdm
```
- Total pipeline time: ~6 minutes
- Fuzzy matching: ~20 seconds
- Embedding creation: ~1 min 43 s (94K products)
- FAISS similarity search: ~2 min 45 s
- Threshold Sensitivity: Performance depends on manual threshold tuning
- Computational Cost: semantic matching requires significant resources, but the extra duplicates it catches justify the cost.
- Adaptive Thresholds: Machine learning-based threshold optimization
- Incremental Product Updates: a pipeline for adding new products without reprocessing the full catalogue
```
# Run the Jupyter notebook
jupyter notebook main.ipynb
```
- Input: CSV files with the columns product_name, description, category
- Missing values in descriptions are handled automatically (see the loading sketch below)
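A minimal loading sketch, assuming a single input file named `products.csv` (the file name is hypothetical) and that "handled automatically" means missing descriptions are replaced with an empty string before the matching text is built.

```python
import pandas as pd

# Hypothetical file name; the real pipeline may read several CSVs.
df = pd.read_csv("products.csv", usecols=["product_name", "description", "category"])

# Replace missing descriptions so downstream string operations never fail.
df["description"] = df["description"].fillna("")
```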
- Output: final_master_catalogue.csv, the final deduplicated product catalogue
- Includes metadata columns: duplicate_count, deduplication_method
- Precision: High-quality matches with minimal false positives
- Recall: Comprehensive coverage across different duplicate types
- Efficiency and Scalability: the algorithms were chosen specifically to be fast, scalable, and efficient
Author: Chortis Alexandros
Framework: Python 3.10, Jupyter Notebook