flowchart LR
GEN["src/generate_synthetic_olist.py<br/>(deterministic, seed 42)"] --> RAW["9 CSVs<br/>(orders, items, products,<br/>customers, sellers, reviews,<br/>payments, geolocation,<br/>category translations)"]
RAW --> LOAD["sql/01_load_raw.sql<br/>→ DuckDB raw.*"]
LOAD --> MODEL["sql/02_dim_model.sql<br/>→ fact_orders + 5 dims"]
MODEL --> QS["sql/03_business_questions.sql<br/>10 analytical queries"]
MODEL --> WIN["sql/04_window_functions.sql<br/>5 window-function showcases"]
QS --> ANS["outputs/business_answers.md"]
QS --> CHARTS["notebooks/eda.py<br/>→ 10 PNG charts"]
MODEL --> STATS["src/validate_late_delivery_significance.py<br/>Welch t-test + Cohen's d + bootstrap CI"]
STATS --> SVAL["outputs/statistical_validation.md"]
🌐 Live walkthrough: https://ucazin.github.io/olist-ecommerce-analytics/
End-to-end SQL analytics on an Olist-shaped Brazilian e-commerce dataset — 96,497 delivered orders, 92,315 unique customers, R$ 14.7M in revenue, joined across 9 raw tables (orders, items, products, customers, sellers, reviews, payments, geolocation, category translations).
This is the SQL anchor of the portfolio: a Kimball-style dimensional model, ten business-question queries, and a chart deck built on top.
Data note. The CSVs are procedurally generated by
src/generate_synthetic_olist.pywith a fixed seed (no Kaggle download needed). Shape, cardinality, and category distribution mirror the real Olist public dataset; absolute revenue/CLV numbers are not directly comparable. Every SQL query and dashboard chart runs unchanged if you swap in the original Kaggle CSVs.
| Insight | Number |
|---|---|
| Late deliveries (8+ days past estimate) collapse review score | 1.64 / 5 vs 3.98 on-time |
| Share of 8+d-late orders rated 1–2 stars | 86.8% |
| The bottleneck is carrier transit, not seller dispatch | 25.6 d vs 13.1 d on late orders |
| Revenue concentration — top 5 categories | 53% of total revenue |
bed_bath_table alone |
24.3% of revenue |
| Revenue-at-risk from late + low-review orders | R$ 560k |
| Repeat-purchase rate (90 d) | 1.0–1.2% — flat across lateness buckets |
Full numerical answers for the 10 business questions: outputs/business_answers.md.
- Revenue concentration — which product categories drive revenue, and what share belongs to the top 10%?
- State-level customer value — which Brazilian states deliver the highest customer lifetime value (CLV), and where should the marketing budget concentrate?
- Delivery → Review impact — by how many points does a late delivery (vs estimate) drop the review score, and how does that change repeat-purchase probability?
- Operational bottleneck decomposition — for orders that arrived late, where was the delay introduced: seller dispatch, carrier transit, or last-mile?
- Cohort retention — what does the monthly cohort retention curve look like for first-time buyers, and which acquisition month retained best?
- Payment behavior — what's the average installment count by category, and which categories see the most boleto vs credit-card mix?
- Seller performance scorecard — which sellers have the worst review-to-volume ratio, and what's their estimated revenue at risk?
- Freight as a margin drag — what percentage of order value is consumed by freight, by region and category?
- Seasonality — what's the day-of-week and month-of-year demand pattern, and how should staffing react?
- Geographic spread — which seller-to-customer state pairs concentrate the most volume, and where would a regional hub reduce shipping time?
All 10 charts are committed in outputs/:
| # | Question | Chart |
|---|---|---|
| 01 | Revenue concentration | 01_revenue_by_category.png |
| 02 | State CLV | 02_state_clv.png |
| 03 | Review × delivery | 03_review_by_delivery.png |
| 04 | Bottleneck decomposition | 04_bottleneck_decomposition.png |
| 05 | Cohort retention | 05_cohort_retention.png |
| 06 | Payment mix | 06_payment_mix.png |
| 07 | Revenue at risk by seller state | 07_worst_sellers.png |
| 08 | Freight share by state | 08_freight_share.png |
| 09 | Seasonality heatmap | 09_seasonality_heatmap.png |
| 10 | Monthly revenue trend | 10_monthly_revenue.png |
# 1. Create venv and install deps
py -3.12 -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install -r requirements.txt
# 2. Generate the synthetic dataset (~12s, ~58 MB across 9 CSVs)
python src/generate_synthetic_olist.py
# 3. Build the warehouse + answer the 10 business questions
python src/run_pipeline.py
# creates olist.duckdb and outputs/business_answers.md
# 4. Render the 10 charts
python notebooks/eda.pyEach script is idempotent — re-running rebuilds outputs in place.
- DuckDB — embedded analytical DB, runs the entire warehouse from a single file, no setup
- SQL — PostgreSQL-flavored, portable to Snowflake / BigQuery / Redshift with minor changes
- Python (pandas, matplotlib, seaborn) — synthetic data generator + EDA + chart export
- Power BI / Tableau — full wireframe + DAX measures in
dashboard/DASHBOARD_SPEC.md, implemented in Python charts here
01-olist-ecommerce-sql/
├── data/
│ ├── DOWNLOAD.md # Original Kaggle dump (optional — synthetic gen is default)
│ └── raw/ # 9 generated CSVs (gitignored, regenerable)
├── src/
│ ├── generate_synthetic_olist.py # Deterministic Olist-shaped data generator
│ ├── run_pipeline.py # Load + model + answer 10 Qs end-to-end
│ └── extract_key_numbers.py # Pull headline numbers for README/MEMO
├── sql/
│ ├── 01_load_raw.sql # Raw CSV → DuckDB tables
│ ├── 02_dim_model.sql # Kimball star schema: fact_orders, dim_*
│ └── 03_business_questions.sql # 10 analytical queries
├── notebooks/
│ └── eda.py # Renders the 10 charts to outputs/
├── dashboard/
│ └── DASHBOARD_SPEC.md # Power BI / Tableau wireframe + DAX measures
├── outputs/ # Generated charts + business_answers.md (committed)
├── MEMO.md # One-page business memo with recommendations
├── README.md
├── LICENSE # MIT
├── requirements.txt
└── .gitignore
┌─────────────────┐ ┌──────────────────┐
│ dim_customer │◄────────│ fact_orders │
│ customer_key │ │ order_key │
│ customer_state │ │ customer_key ─┘
│ customer_city │ │ product_key ─┐
└─────────────────┘ │ seller_key ─┤
│ date_key ─┤
┌─────────────────┐ │ payment_type │ ┌──────────────────┐
│ dim_product │◄────────│ item_count │ │ dim_date │
│ product_key │ │ gross_revenue │ │ date_key │
│ category │ │ freight_value │ │ year, month │
│ weight_g, ... │ │ review_score │ │ day_of_week │
└─────────────────┘ │ delivery_lag │ └──────────────────┘
└──────────────────┘
┌─────────────────┐ ▲
│ dim_seller │──────────────────┘
│ seller_key │
│ seller_state │
└─────────────────┘
Grain of fact_orders is one row per order (items aggregated to order level). Full reasoning in sql/02_dim_model.sql.
- ~3% of orders are flagged
canceledorunavailable— exclude from revenue but keep in conversion analyses. freight_valuelives on the item table, not the order — easy to double-count if youSUM(freight_value)after joining items to orders without aggregating first.order_purchase_timestampandorder_approved_atdiffer by hours to days — for cohort analysis,purchase_timestampis the canonical "acquisition" timestamp.- November 2017 is a strong Black Friday peak; treating it as a normal month inflates cohort baselines and rolling averages.
- Reviews can arrive months after delivery — filter by review timestamp, not order timestamp, when measuring fresh sentiment.
- Kimball-style dimensional modeling (facts, conformed dimensions, surrogate keys)
- Multi-table SQL with CTEs, window functions,
QUALIFY(DuckDB) - Cohort retention via
DATE_TRUNC+ self-join pattern - Business framing — every query answers a question a Head of Operations would actually ask
- Reproducibility — deterministic synthetic data, idempotent SQL pipeline, single-file warehouse, no external credentials
Olist, & Sionek, A. (2018). Brazilian E-Commerce Public Dataset by Olist [Data set]. Kaggle. https://doi.org/10.34740/KAGGLE/DSV/195341 — referenced for shape and cardinality; this repo regenerates a synthetic equivalent.