Skip to content

Ucazin/olist-ecommerce-analytics

Repository files navigation

Olist E-commerce SQL Analytics

Architecture

flowchart LR
    GEN["src/generate_synthetic_olist.py<br/>(deterministic, seed 42)"] --> RAW["9 CSVs<br/>(orders, items, products,<br/>customers, sellers, reviews,<br/>payments, geolocation,<br/>category translations)"]
    RAW --> LOAD["sql/01_load_raw.sql<br/>→ DuckDB raw.*"]
    LOAD --> MODEL["sql/02_dim_model.sql<br/>→ fact_orders + 5 dims"]
    MODEL --> QS["sql/03_business_questions.sql<br/>10 analytical queries"]
    MODEL --> WIN["sql/04_window_functions.sql<br/>5 window-function showcases"]
    QS --> ANS["outputs/business_answers.md"]
    QS --> CHARTS["notebooks/eda.py<br/>→ 10 PNG charts"]
    MODEL --> STATS["src/validate_late_delivery_significance.py<br/>Welch t-test + Cohen's d + bootstrap CI"]
    STATS --> SVAL["outputs/statistical_validation.md"]
Loading

🌐 Live walkthrough: https://ucazin.github.io/olist-ecommerce-analytics/

End-to-end SQL analytics on an Olist-shaped Brazilian e-commerce dataset — 96,497 delivered orders, 92,315 unique customers, R$ 14.7M in revenue, joined across 9 raw tables (orders, items, products, customers, sellers, reviews, payments, geolocation, category translations).

This is the SQL anchor of the portfolio: a Kimball-style dimensional model, ten business-question queries, and a chart deck built on top.

Data note. The CSVs are procedurally generated by src/generate_synthetic_olist.py with a fixed seed (no Kaggle download needed). Shape, cardinality, and category distribution mirror the real Olist public dataset; absolute revenue/CLV numbers are not directly comparable. Every SQL query and dashboard chart runs unchanged if you swap in the original Kaggle CSVs.

Headline findings

Insight Number
Late deliveries (8+ days past estimate) collapse review score 1.64 / 5 vs 3.98 on-time
Share of 8+d-late orders rated 1–2 stars 86.8%
The bottleneck is carrier transit, not seller dispatch 25.6 d vs 13.1 d on late orders
Revenue concentration — top 5 categories 53% of total revenue
bed_bath_table alone 24.3% of revenue
Revenue-at-risk from late + low-review orders R$ 560k
Repeat-purchase rate (90 d) 1.0–1.2% — flat across lateness buckets

Full numerical answers for the 10 business questions: outputs/business_answers.md.

Business Questions

  1. Revenue concentration — which product categories drive revenue, and what share belongs to the top 10%?
  2. State-level customer value — which Brazilian states deliver the highest customer lifetime value (CLV), and where should the marketing budget concentrate?
  3. Delivery → Review impact — by how many points does a late delivery (vs estimate) drop the review score, and how does that change repeat-purchase probability?
  4. Operational bottleneck decomposition — for orders that arrived late, where was the delay introduced: seller dispatch, carrier transit, or last-mile?
  5. Cohort retention — what does the monthly cohort retention curve look like for first-time buyers, and which acquisition month retained best?
  6. Payment behavior — what's the average installment count by category, and which categories see the most boleto vs credit-card mix?
  7. Seller performance scorecard — which sellers have the worst review-to-volume ratio, and what's their estimated revenue at risk?
  8. Freight as a margin drag — what percentage of order value is consumed by freight, by region and category?
  9. Seasonality — what's the day-of-week and month-of-year demand pattern, and how should staffing react?
  10. Geographic spread — which seller-to-customer state pairs concentrate the most volume, and where would a regional hub reduce shipping time?

Charts

All 10 charts are committed in outputs/:

# Question Chart
01 Revenue concentration 01_revenue_by_category.png
02 State CLV 02_state_clv.png
03 Review × delivery 03_review_by_delivery.png
04 Bottleneck decomposition 04_bottleneck_decomposition.png
05 Cohort retention 05_cohort_retention.png
06 Payment mix 06_payment_mix.png
07 Revenue at risk by seller state 07_worst_sellers.png
08 Freight share by state 08_freight_share.png
09 Seasonality heatmap 09_seasonality_heatmap.png
10 Monthly revenue trend 10_monthly_revenue.png

How to run

# 1. Create venv and install deps
py -3.12 -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install -r requirements.txt

# 2. Generate the synthetic dataset (~12s, ~58 MB across 9 CSVs)
python src/generate_synthetic_olist.py

# 3. Build the warehouse + answer the 10 business questions
python src/run_pipeline.py
#    creates olist.duckdb and outputs/business_answers.md

# 4. Render the 10 charts
python notebooks/eda.py

Each script is idempotent — re-running rebuilds outputs in place.

Tech Stack

  • DuckDB — embedded analytical DB, runs the entire warehouse from a single file, no setup
  • SQL — PostgreSQL-flavored, portable to Snowflake / BigQuery / Redshift with minor changes
  • Python (pandas, matplotlib, seaborn) — synthetic data generator + EDA + chart export
  • Power BI / Tableau — full wireframe + DAX measures in dashboard/DASHBOARD_SPEC.md, implemented in Python charts here

Project Structure

01-olist-ecommerce-sql/
├── data/
│   ├── DOWNLOAD.md             # Original Kaggle dump (optional — synthetic gen is default)
│   └── raw/                    # 9 generated CSVs (gitignored, regenerable)
├── src/
│   ├── generate_synthetic_olist.py   # Deterministic Olist-shaped data generator
│   ├── run_pipeline.py               # Load + model + answer 10 Qs end-to-end
│   └── extract_key_numbers.py        # Pull headline numbers for README/MEMO
├── sql/
│   ├── 01_load_raw.sql         # Raw CSV → DuckDB tables
│   ├── 02_dim_model.sql        # Kimball star schema: fact_orders, dim_*
│   └── 03_business_questions.sql # 10 analytical queries
├── notebooks/
│   └── eda.py                  # Renders the 10 charts to outputs/
├── dashboard/
│   └── DASHBOARD_SPEC.md       # Power BI / Tableau wireframe + DAX measures
├── outputs/                    # Generated charts + business_answers.md (committed)
├── MEMO.md                     # One-page business memo with recommendations
├── README.md
├── LICENSE                     # MIT
├── requirements.txt
└── .gitignore

Data Model (Kimball star schema)

┌─────────────────┐         ┌──────────────────┐
│  dim_customer   │◄────────│   fact_orders    │
│  customer_key   │         │   order_key      │
│  customer_state │         │   customer_key  ─┘
│  customer_city  │         │   product_key   ─┐
└─────────────────┘         │   seller_key    ─┤
                            │   date_key      ─┤
┌─────────────────┐         │   payment_type   │      ┌──────────────────┐
│  dim_product    │◄────────│   item_count     │      │   dim_date       │
│  product_key    │         │   gross_revenue  │      │   date_key       │
│  category       │         │   freight_value  │      │   year, month    │
│  weight_g, ...  │         │   review_score   │      │   day_of_week    │
└─────────────────┘         │   delivery_lag   │      └──────────────────┘
                            └──────────────────┘
┌─────────────────┐                  ▲
│   dim_seller    │──────────────────┘
│   seller_key    │
│   seller_state  │
└─────────────────┘

Grain of fact_orders is one row per order (items aggregated to order level). Full reasoning in sql/02_dim_model.sql.

What I learned (Olist quirks worth knowing)

  • ~3% of orders are flagged canceled or unavailable — exclude from revenue but keep in conversion analyses.
  • freight_value lives on the item table, not the order — easy to double-count if you SUM(freight_value) after joining items to orders without aggregating first.
  • order_purchase_timestamp and order_approved_at differ by hours to days — for cohort analysis, purchase_timestamp is the canonical "acquisition" timestamp.
  • November 2017 is a strong Black Friday peak; treating it as a normal month inflates cohort baselines and rolling averages.
  • Reviews can arrive months after delivery — filter by review timestamp, not order timestamp, when measuring fresh sentiment.

Skills Demonstrated

  • Kimball-style dimensional modeling (facts, conformed dimensions, surrogate keys)
  • Multi-table SQL with CTEs, window functions, QUALIFY (DuckDB)
  • Cohort retention via DATE_TRUNC + self-join pattern
  • Business framing — every query answers a question a Head of Operations would actually ask
  • Reproducibility — deterministic synthetic data, idempotent SQL pipeline, single-file warehouse, no external credentials

Dataset reference

Olist, & Sionek, A. (2018). Brazilian E-Commerce Public Dataset by Olist [Data set]. Kaggle. https://doi.org/10.34740/KAGGLE/DSV/195341 — referenced for shape and cardinality; this repo regenerates a synthetic equivalent.

About

End-to-end SQL analytics on a synthetic Olist e-commerce dataset — DuckDB warehouse, Kimball star schema, 10 business-question queries, Python chart deck.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages