Olist E-commerce SQL Analytics

Architecture

flowchart LR
    GEN["src/generate_synthetic_olist.py<br/>(deterministic, seed 42)"] --> RAW["9 CSVs<br/>(orders, items, products,<br/>customers, sellers, reviews,<br/>payments, geolocation,<br/>category translations)"]
    RAW --> LOAD["sql/01_load_raw.sql<br/>→ DuckDB raw.*"]
    LOAD --> MODEL["sql/02_dim_model.sql<br/>→ fact_orders + 5 dims"]
    MODEL --> QS["sql/03_business_questions.sql<br/>10 analytical queries"]
    MODEL --> WIN["sql/04_window_functions.sql<br/>5 window-function showcases"]
    QS --> ANS["outputs/business_answers.md"]
    QS --> CHARTS["notebooks/eda.py<br/>→ 10 PNG charts"]
    MODEL --> STATS["src/validate_late_delivery_significance.py<br/>Welch t-test + Cohen's d + bootstrap CI"]
    STATS --> SVAL["outputs/statistical_validation.md"]

🌐 Live walkthrough: https://ucazin.github.io/olist-ecommerce-analytics/

End-to-end SQL analytics on an Olist-shaped Brazilian e-commerce dataset — 96,497 delivered orders, 92,315 unique customers, R$ 14.7M in revenue, joined across 9 raw tables (orders, items, products, customers, sellers, reviews, payments, geolocation, category translations).

This is the SQL anchor of the portfolio: a Kimball-style dimensional model, ten business-question queries, and a chart deck built on top.

Data note. The CSVs are procedurally generated by src/generate_synthetic_olist.py with a fixed seed (no Kaggle download needed). Shape, cardinality, and category distribution mirror the real Olist public dataset; absolute revenue/CLV numbers are not directly comparable. Every SQL query and dashboard chart runs unchanged if you swap in the original Kaggle CSVs.

Headline findings

Insight	Number
Late deliveries (8+ days past estimate) collapse review score	1.64 / 5 vs 3.98 on-time
Share of 8+d-late orders rated 1–2 stars	86.8%
The bottleneck is carrier transit, not seller dispatch	25.6 d vs 13.1 d on late orders
Revenue concentration — top 5 categories	53% of total revenue
`bed_bath_table` alone	24.3% of revenue
Revenue-at-risk from late + low-review orders	R$ 560k
Repeat-purchase rate (90 d)	1.0–1.2% — flat across lateness buckets

Full numerical answers for the 10 business questions: outputs/business_answers.md.

Business Questions

Revenue concentration — which product categories drive revenue, and what share belongs to the top 10%?
State-level customer value — which Brazilian states deliver the highest customer lifetime value (CLV), and where should the marketing budget concentrate?
Delivery → Review impact — by how many points does a late delivery (vs estimate) drop the review score, and how does that change repeat-purchase probability?
Operational bottleneck decomposition — for orders that arrived late, where was the delay introduced: seller dispatch, carrier transit, or last-mile?
Cohort retention — what does the monthly cohort retention curve look like for first-time buyers, and which acquisition month retained best?
Payment behavior — what's the average installment count by category, and which categories see the most boleto vs credit-card mix?
Seller performance scorecard — which sellers have the worst review-to-volume ratio, and what's their estimated revenue at risk?
Freight as a margin drag — what percentage of order value is consumed by freight, by region and category?
Seasonality — what's the day-of-week and month-of-year demand pattern, and how should staffing react?
Geographic spread — which seller-to-customer state pairs concentrate the most volume, and where would a regional hub reduce shipping time?

Charts

All 10 charts are committed in outputs/:

#	Question	Chart
01	Revenue concentration	`01_revenue_by_category.png`
02	State CLV	`02_state_clv.png`
03	Review × delivery	`03_review_by_delivery.png`
04	Bottleneck decomposition	`04_bottleneck_decomposition.png`
05	Cohort retention	`05_cohort_retention.png`
06	Payment mix	`06_payment_mix.png`
07	Revenue at risk by seller state	`07_worst_sellers.png`
08	Freight share by state	`08_freight_share.png`
09	Seasonality heatmap	`09_seasonality_heatmap.png`
10	Monthly revenue trend	`10_monthly_revenue.png`

How to run

# 1. Create venv and install deps
py -3.12 -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install -r requirements.txt

# 2. Generate the synthetic dataset (~12s, ~58 MB across 9 CSVs)
python src/generate_synthetic_olist.py

# 3. Build the warehouse + answer the 10 business questions
python src/run_pipeline.py
#    creates olist.duckdb and outputs/business_answers.md

# 4. Render the 10 charts
python notebooks/eda.py

Each script is idempotent — re-running rebuilds outputs in place.

Tech Stack

DuckDB — embedded analytical DB, runs the entire warehouse from a single file, no setup
SQL — PostgreSQL-flavored, portable to Snowflake / BigQuery / Redshift with minor changes
Python (pandas, matplotlib, seaborn) — synthetic data generator + EDA + chart export
Power BI / Tableau — full wireframe + DAX measures in dashboard/DASHBOARD_SPEC.md, implemented in Python charts here

Project Structure

01-olist-ecommerce-sql/
├── data/
│   ├── DOWNLOAD.md             # Original Kaggle dump (optional — synthetic gen is default)
│   └── raw/                    # 9 generated CSVs (gitignored, regenerable)
├── src/
│   ├── generate_synthetic_olist.py   # Deterministic Olist-shaped data generator
│   ├── run_pipeline.py               # Load + model + answer 10 Qs end-to-end
│   └── extract_key_numbers.py        # Pull headline numbers for README/MEMO
├── sql/
│   ├── 01_load_raw.sql         # Raw CSV → DuckDB tables
│   ├── 02_dim_model.sql        # Kimball star schema: fact_orders, dim_*
│   └── 03_business_questions.sql # 10 analytical queries
├── notebooks/
│   └── eda.py                  # Renders the 10 charts to outputs/
├── dashboard/
│   └── DASHBOARD_SPEC.md       # Power BI / Tableau wireframe + DAX measures
├── outputs/                    # Generated charts + business_answers.md (committed)
├── MEMO.md                     # One-page business memo with recommendations
├── README.md
├── LICENSE                     # MIT
├── requirements.txt
└── .gitignore

Data Model (Kimball star schema)

┌─────────────────┐         ┌──────────────────┐
│  dim_customer   │◄────────│   fact_orders    │
│  customer_key   │         │   order_key      │
│  customer_state │         │   customer_key  ─┘
│  customer_city  │         │   product_key   ─┐
└─────────────────┘         │   seller_key    ─┤
                            │   date_key      ─┤
┌─────────────────┐         │   payment_type   │      ┌──────────────────┐
│  dim_product    │◄────────│   item_count     │      │   dim_date       │
│  product_key    │         │   gross_revenue  │      │   date_key       │
│  category       │         │   freight_value  │      │   year, month    │
│  weight_g, ...  │         │   review_score   │      │   day_of_week    │
└─────────────────┘         │   delivery_lag   │      └──────────────────┘
                            └──────────────────┘
┌─────────────────┐                  ▲
│   dim_seller    │──────────────────┘
│   seller_key    │
│   seller_state  │
└─────────────────┘

Grain of fact_orders is one row per order (items aggregated to order level). Full reasoning in sql/02_dim_model.sql.

What I learned (Olist quirks worth knowing)

~3% of orders are flagged canceled or unavailable — exclude from revenue but keep in conversion analyses.
freight_value lives on the item table, not the order — easy to double-count if you SUM(freight_value) after joining items to orders without aggregating first.
order_purchase_timestamp and order_approved_at differ by hours to days — for cohort analysis, purchase_timestamp is the canonical "acquisition" timestamp.
November 2017 is a strong Black Friday peak; treating it as a normal month inflates cohort baselines and rolling averages.
Reviews can arrive months after delivery — filter by review timestamp, not order timestamp, when measuring fresh sentiment.

Skills Demonstrated

Kimball-style dimensional modeling (facts, conformed dimensions, surrogate keys)
Multi-table SQL with CTEs, window functions, QUALIFY (DuckDB)
Cohort retention via DATE_TRUNC + self-join pattern
Business framing — every query answers a question a Head of Operations would actually ask
Reproducibility — deterministic synthetic data, idempotent SQL pipeline, single-file warehouse, no external credentials

Dataset reference

Olist, & Sionek, A. (2018). Brazilian E-Commerce Public Dataset by Olist [Data set]. Kaggle. https://doi.org/10.34740/KAGGLE/DSV/195341 — referenced for shape and cardinality; this repo regenerates a synthetic equivalent.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Olist E-commerce SQL Analytics

Architecture

Headline findings

Business Questions

Charts

How to run

Tech Stack

Project Structure

Data Model (Kimball star schema)

What I learned (Olist quirks worth knowing)

Skills Demonstrated

Dataset reference

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
.github/workflows		.github/workflows
dashboard		dashboard
data		data
notebooks		notebooks
outputs		outputs
sql		sql
src		src
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
LICENSE		LICENSE
MEMO.md		MEMO.md
README.md		README.md
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

Olist E-commerce SQL Analytics

Architecture

Headline findings

Business Questions

Charts

How to run

Tech Stack

Project Structure

Data Model (Kimball star schema)

What I learned (Olist quirks worth knowing)

Skills Demonstrated

Dataset reference

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages