ML-powered demand predictions for retail store operations, reducing inventory waste by $192K annually through 64% more accurate forecasting.
Multi-location retail operations lose 8-15% of revenue to food waste caused by inaccurate demand forecasting. Store managers currently use simple moving averages and manual judgment for ordering decisions, leading to:
- Overstocking on slow days → Food expires → Money thrown away
- Understocking on peak days → Empty shelves → Lost revenue

This system solves both problems by predicting daily item-level demand across store locations with 9.42% error, enabling precise, data-driven inventory ordering.
| Model | MAE | RMSE | MAPE | R² | vs Baseline |
|---|---|---|---|---|---|
| Baseline (7-Day MA) | 721.8 | 1336.5 | 26.47% | 0.8519 | – |
| Linear Regression | 402.8 | 710.5 | 21.02% | 0.9581 | 20.6% better |
| Random Forest | 254.7 | 496.8 | 10.76% | 0.9795 | 59.4% better |
| XGBoost | 221.0 | 440.9 | 9.42% | 0.9839 | 64.4% better ⭐ |
| LightGBM | 229.2 | 446.2 | 9.98% | 0.9835 | 62.3% better |

Best Model: XGBoost, predicting daily demand with only 9.42% average error.
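For reference, the four evaluation metrics in the table can be computed with a few lines of NumPy. A minimal sketch (the `evaluate` helper is illustrative, not part of this repo):

```python
import numpy as np

def evaluate(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    """Compute the four metrics used in the model comparison table."""
    err = y_true - y_pred
    mae = np.mean(np.abs(err))
    rmse = np.sqrt(np.mean(err ** 2))
    mape = np.mean(np.abs(err / y_true)) * 100  # assumes y_true has no zeros
    r2 = 1 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    return {"MAE": mae, "RMSE": rmse, "MAPE": mape, "R2": r2}

# Toy example with three days of demand
y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 310.0])
print(evaluate(y_true, y_pred))
```

In production, item-families with frequent zero-sales days (see the EDA findings below) would need a different error metric than MAPE.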
| Impact Area | Value | Details |
|---|---|---|
| Forecast Error Reduction | 64.4% | vs moving average baseline |
| Annual Waste Reduction | $192,000 | across 10 high-revenue stores |
| Marketing Reallocation | $200,000+ | identified misallocated promotion spend |
| Weekend Demand Insight | 45% | Saturday sales higher than Tuesday |
| Promotion Effectiveness | 42% vs 3% | GROCERY vs BABY CARE response gap |
```
BigQuery (3M+ transactions, SQL EDA)
        ↓
PySpark (60+ engineered features)
        ↓
XGBoost Model (9.42% MAPE, R² = 0.9839)
        ↓
FastAPI (REST API) + Streamlit (Dashboard)
        ↓
Docker (Containerized Deployment)
```
```
demand-forecasting-system/
│
├── app/
│   ├── __init__.py              # Python package init
│   ├── main.py                  # FastAPI application (5 endpoints)
│   └── model.py                 # Model loading & prediction logic
│
├── dashboard/
│   └── app.py                   # Streamlit dashboard (5 tabs)
│
├── models/
│   ├── best_model_xgb.pkl       # Trained model file
│   └── feature_columns.json     # Feature list (60+ features)
│
├── notebooks/
│   ├── 01_EDA_BigQuery.ipynb
│   ├── 02_Feature_Engineering.ipynb
│   └── 03_Model_Training.ipynb
│
├── results/
│   ├── eda/                     # EDA visualizations
│   │   ├── category_sales.png
│   │   ├── weekly_seasonality.png
│   │   ├── monthly_seasonality.png
│   │   ├── sales_trend.png
│   │   ├── promotion_impact.png
│   │   ├── store_analysis.png
│   │   └── zero_sales_analysis.png
│   ├── model_comparison.png
│   ├── feature_importance.png
│   ├── actual_vs_predicted.png
│   ├── model_results.csv
│   └── feature_importance.csv
│
├── Dockerfile                   # Container configuration
├── docker-compose.yml           # Multi-service orchestration
├── requirements.txt             # Python dependencies
└── .gitignore
```
```bash
# Clone the repository
git clone https://github.com/anushkundu/demand-forecasting-system.git
cd demand-forecasting-system

# Build and run both API + Dashboard
docker-compose up --build

# API:       http://localhost:8000/docs
# Dashboard: http://localhost:8501
```
```bash
# Clone the repository
git clone https://github.com/anushkundu/demand-forecasting-system.git
cd demand-forecasting-system

# Create virtual environment
python -m venv venv

# Activate virtual environment
# Windows:
venv\Scripts\activate
# Mac/Linux:
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Start API (Terminal 1)
uvicorn app.main:app --reload --port 8000

# Start Dashboard (Terminal 2 – activate venv first)
streamlit run dashboard/app.py
```

Minimal API-only install:

```bash
pip install fastapi uvicorn lightgbm scikit-learn pandas numpy pydantic
uvicorn app.main:app --reload --port 8000
# Open: http://localhost:8000/docs
```

| Endpoint | Method | Description |
|---|---|---|
| `/` | GET | Health check – confirms API is running |
| `/health` | GET | Detailed health status |
| `/quick-predict` | POST | Simple prediction (4-5 inputs) |
| `/predict` | POST | Full prediction (all features) |
| `/model-info` | GET | Model performance and metadata |
```python
import requests

response = requests.post(
    "http://localhost:8000/quick-predict",
    json={
        "yesterday_sales": 600,
        "last_week_same_day": 550,
        "weekly_average": 500,
        "day_of_week": 7,
        "is_promotion": True,
        "is_holiday": False
    }
)
print(response.json())
# {
#   "predicted_demand": 650,
#   "confidence": "medium",
#   "recommendation": "High demand – increase stock order",
#   "features_provided": 7,
#   "features_expected": 60
# }
```

Full prediction with all key features:

```python
response = requests.post(
    "http://localhost:8000/predict",
    json={
        "sales_lag_1": 420,
        "sales_lag_7": 450,
        "rolling_mean_7": 410,
        "day_of_week": 7,
        "month": 12,
        "onpromotion": 5,
        "is_weekend": 1,
        "is_holiday": 0,
        "rolling_mean_14": 400,
        "rolling_mean_30": 390,
        "oil_price": 52.3,
        "store_type_encoded": 4,
        "family_encoded": 0,
        "category_avg_all_stores": 380
    }
)
print(response.json())
```

The same quick prediction via cURL:

```bash
curl -X POST "http://localhost:8000/quick-predict" \
  -H "Content-Type: application/json" \
  -d '{
    "yesterday_sales": 420,
    "last_week_same_day": 450,
    "weekly_average": 410,
    "day_of_week": 7,
    "is_promotion": true,
    "is_holiday": false
  }'
```

Engineered 60+ predictive features across 4 tiers using PySpark:
| Tier | Category | Features | Examples |
|---|---|---|---|
| 1 | Calendar | 19 | day_of_week, month, cyclical sin/cos encoding, is_weekend, is_december |
| 2 | Lag & Rolling | 16 | sales_lag_1/7/14/28/365, rolling_mean_7/14/30, rolling_std |
| 3 | Advanced | 15 | promotion saturation, holiday proximity, WoW/MoM/YoY momentum |
| 4 | Expert | 14 | demand regime, cross-store comparison, z-scores, CV, interactions |
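As an illustration of the Tier 1 cyclical sin/cos encoding, here is a hedged pandas sketch (column names are assumptions, not taken verbatim from the repo's PySpark pipeline):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"date": pd.date_range("2017-12-01", periods=7, freq="D")})
df["day_of_week"] = df["date"].dt.dayofweek  # 0 = Monday ... 6 = Sunday
df["month"] = df["date"].dt.month

# Cyclical encoding maps day 6 next to day 0, so the model sees the week as a circle
df["dow_sin"] = np.sin(2 * np.pi * df["day_of_week"] / 7)
df["dow_cos"] = np.cos(2 * np.pi * df["day_of_week"] / 7)
df["month_sin"] = np.sin(2 * np.pi * (df["month"] - 1) / 12)
df["month_cos"] = np.cos(2 * np.pi * (df["month"] - 1) / 12)

# Simple binary calendar flags from the same tier
df["is_weekend"] = (df["day_of_week"] >= 5).astype(int)
df["is_december"] = (df["month"] == 12).astype(int)
```

Without the sin/cos pair, a tree model would see Sunday (6) and Monday (0) as maximally distant, even though they are adjacent days.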
| Rank | Feature | Importance | Business Meaning |
|---|---|---|---|
| 1 | rolling_mean_7 | 49.87% | 7-day average demand level |
| 2 | sales_lag_7 | 32.25% | Same day last week |
| 3 | category_avg_all_stores | 10.60% | Chain-wide demand signal |
| 4 | sales_lag_14 | 2.19% | 2 weeks ago sales |
| 5 | sales_lag_1 | 1.05% | Yesterday's sales |
| 6 | expanding_std | 0.93% | Long-term volatility |
| 7 | sales_lag_28 | 0.50% | Monthly cycle |
| 8 | cluster | 0.28% | Store cluster grouping |
| 9 | rolling_std_7 | 0.22% | Recent demand volatility |
| 10 | rolling_max_7 | 0.19% | Recent peak demand |
Key Insight: The top 3 features account for 92.7% of prediction power, all related to recent sales history and cross-store patterns.
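Importances like these can be read off any fitted tree ensemble. A toy sketch using scikit-learn's `RandomForestRegressor` as a stand-in for the trained XGBoost model (data and the dominance ratio here are synthetic):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 1000
X = pd.DataFrame({
    "rolling_mean_7": rng.normal(400, 50, n),
    "sales_lag_7": rng.normal(400, 50, n),
    "day_of_week": rng.integers(0, 7, n).astype(float),
})
# Synthetic target dominated by the rolling mean, echoing the real ranking
y = 0.8 * X["rolling_mean_7"] + 0.2 * X["sales_lag_7"] + rng.normal(0, 5, n)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
ranking = sorted(zip(X.columns, model.feature_importances_),
                 key=lambda kv: kv[1], reverse=True)
for name, imp in ranking:
    print(f"{name:20s} {imp:.2%}")
```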
- All lag features use only past data via `F.lag()`
- Rolling windows exclude the current row: `rowsBetween(-N, -1)`
- Expanding windows exclude the current row: `rowsBetween(Window.unboundedPreceding, -1)`
- Train/test split is strictly temporal (no future data in training)
- Verified through manual spot-checks on random samples
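The same leak-free windowing can be illustrated in pandas (a stand-in for the repo's PySpark pipeline, with synthetic data): calling `shift(1)` before `rolling` plays the role of `rowsBetween(-N, -1)`, so each feature sees only strictly past values.

```python
import pandas as pd

df = pd.DataFrame({"sales": [100, 120, 90, 110, 130, 80, 95, 105]})

# Lag features: strictly past values only
df["sales_lag_1"] = df["sales"].shift(1)
df["sales_lag_7"] = df["sales"].shift(7)

# Rolling mean over up to 7 prior days, EXCLUDING today (shift first, then roll)
df["rolling_mean_7"] = df["sales"].shift(1).rolling(7, min_periods=1).mean()

# Expanding std over all prior days, again excluding today
df["expanding_std"] = df["sales"].shift(1).expanding(min_periods=2).std()
```

Forgetting the `shift(1)` would let today's sales leak into today's features, inflating offline metrics while the deployed model underperforms.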
| # | Finding | Business Impact |
|---|---|---|
| 1 | Saturday sales 45% higher than Tuesday | Adjust daily order quantities by day of week → $80K savings |
| 2 | December 45% above annual average | Pre-position inventory by late November → prevent stockouts |
| 3 | Promotions: GROCERY +42% vs BABY CARE +3% | Reallocate marketing budget → $200K+ incremental revenue |
| 4 | Top 10 stores = 55% of total revenue | Prioritize ML deployment to high-value stores first |
| 5 | 12 of 33 categories have >70% zero-sales days | Exclude from ML, keep on manual ordering |
| 6 | Pre-holiday surge +25%, holiday drop -60% | Create holiday proximity features, not just binary flags |
| 7 | Oil price correlation: 0.15 | Weak but measurable โ included as external feature |
| 8 | Year-over-year growth: ~8% | Include trend feature to avoid systematic underprediction |
| Layer | Technology | Purpose |
|---|---|---|
| Data Storage | Google BigQuery | Cloud data warehouse, SQL-based EDA |
| Data Processing | PySpark | Distributed feature engineering at scale |
| Machine Learning | XGBoost, LightGBM, scikit-learn | Model training and comparison |
| API | FastAPI | REST API for predictions |
| Dashboard | Streamlit, Plotly | Interactive visualization |
| Containerization | Docker, Docker Compose | Reproducible deployment |
| Version Control | Git, GitHub | Code management |
| Notebook | Description | Key Output |
|---|---|---|
| `01_EDA_BigQuery.ipynb` | SQL EDA on 3M+ rows, 12 queries, 7 visualizations | Business insights, feature hypotheses |
| `02_Feature_Engineering.ipynb` | PySpark feature pipeline, 60+ features, leakage prevention | `train_features.parquet`, `test_features.parquet` |
| `03_Model_Training.ipynb` | 5-model comparison, evaluation, business impact calculation | `best_model_xgb.pkl`, results |
The Streamlit dashboard includes 5 interactive tabs:
| Tab | Feature |
|---|---|
| Predict Demand | Enter sales data → get AI-powered forecast with confidence gauge |
| EDA Insights | Interactive EDA visualizations with business context |
| Model Performance | Model comparison charts, radar plot, downloadable results |
| Feature Insights | Feature importance with bar, lollipop, and treemap views |
| About System | Architecture diagram, key results, tech stack |
| Improvement | Expected Impact |
|---|---|
| Add weather data (Open-Meteo API) | +1-2% MAPE improvement |
| Hierarchical models per store type | Better local predictions |
| Prediction intervals (confidence bands) | Inform safety stock decisions |
| Model monitoring & drift detection | Maintain accuracy over time |
| Automated retraining pipeline | Keep model fresh with new data |
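As a sketch of the prediction-interval improvement above (not this repo's implementation), fitting two quantile models yields a demand band that can drive safety-stock decisions; the data and parameters here are synthetic:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(500, 1))
y = 50 + 10 * X[:, 0] + rng.normal(0, 5, size=500)  # synthetic "demand"

# Two quantile regressors bracket an ~80% prediction band
lo = GradientBoostingRegressor(loss="quantile", alpha=0.1).fit(X, y)
hi = GradientBoostingRegressor(loss="quantile", alpha=0.9).fit(X, y)

X_new = np.array([[5.0]])
band = (lo.predict(X_new)[0], hi.predict(X_new)[0])
print(f"80% interval for demand: {band[0]:.1f} to {band[1]:.1f}")
```

Ordering against the upper band instead of the point forecast trades a small amount of waste for far fewer stockouts on volatile items.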
```text
fastapi
uvicorn
python-multipart
lightgbm
scikit-learn
pandas
numpy
streamlit
plotly
pydantic
requests
```

Python 3.10+ recommended. Tested on Python 3.12.
```bash
# Test API is running
curl http://localhost:8000/health

# Test prediction
curl -X POST http://localhost:8000/quick-predict \
  -H "Content-Type: application/json" \
  -d '{"yesterday_sales": 4200, "last_week_same_day": 4050, "weekly_average": 4100, "day_of_week": 7, "is_promotion": false, "is_holiday": false}'

# Test model info
curl http://localhost:8000/model-info
```

Anush Kundu
- Nagpur, India
- MSc Data Science, Kingston University London
- 2.5 years in retail analytics (Compass Group UK, Cognizant)
- anushkundu55@gmail.com
- LinkedIn
- GitHub
This project was inspired by real-world experience at Compass Group (London), where I managed demand forecasting for 100+ menu items using manual methods. The moving average approach reduced waste by 12% but left significant room for improvement. This system explores how machine learning can push accuracy further, achieving 64.4% better forecasting than the baseline methods used in operations.
This project is open source under the MIT License.
DemandCast AI – Built with BigQuery · PySpark · XGBoost · FastAPI · Streamlit · Docker