# MichiganCast

A production-oriented project for leakage-safe forecasting (`t+6h`, `t+24h`) using satellite image sequences and meteorological time-series.
- Overview
- Project Background
- Architecture
- Core Capabilities
- Mathematical Foundations
- Pipeline Surface
- Getting Started
- Notebook Demo Flow
- Engineering Depth
- Repository Structure
- Documentation
## Overview

MichiganCast focuses on binary precipitation-event forecasting over the Lake Michigan region with strict anti-leakage rules:

- Inputs are built from historical windows ending at anchor time `t`
- Labels are generated at future timestamp `y_time = t + horizon`
- Forecast horizons currently supported: `6h` and `24h`
- Time-based train/validation/test split is enforced by year range
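The anti-leakage rules above can be sketched in a few lines. This is a hypothetical illustration only: the column names (`timestamp`, `precip_event`) and the year cutoff are assumptions, not the actual schema or configuration used in `src/features/labeling.py` and `src/data/split.py`.

```python
import pandas as pd

# Toy frame standing in for the cleaned tabular data (names are assumptions).
df = pd.DataFrame({
    "timestamp": pd.date_range("2019-06-01", periods=8, freq=pd.Timedelta(hours=6)),
    "precip_event": [0, 0, 1, 0, 1, 1, 0, 0],
})

horizon = pd.Timedelta(hours=6)
labels = df.set_index("timestamp")["precip_event"]

samples = []
for t in df["timestamp"]:
    y_time = t + horizon
    if y_time in labels.index:      # a sample exists only if the future label exists
        assert t < y_time           # leakage guard: X_time < y_time
        samples.append({"anchor_time": t, "y_time": y_time, "y": labels[y_time]})

index = pd.DataFrame(samples)

# Strict time-based split: partition by the *target* year so no future
# information leaks into training.
train = index[index["y_time"].dt.year <= 2019]
```

The key point is that the label is read at `y_time`, never at the anchor time, and the split is keyed on the target year rather than on a random shuffle.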
Current codebase implements:
- Data contract checks and data quality validation
- Reproducible cleaning pipeline and EDA report generation
- Traditional ML baselines (Logistic Regression, Random Forest, Gradient Boosting)
- Modular multimodal PyTorch training stack (`Dataset`, `Model`, train loop, CLI entry)
## Project Background

The project originates from a practical weather challenge in the Lake Michigan region, where precipitation events are strongly influenced by lake-effect dynamics and can change quickly across time. The initial exploratory workflow was built in notebooks and revealed two critical realities: satellite signals are informative but noisy, and tabular weather variables alone do not fully capture spatial cloud evolution.
During early exploration, several data constraints were identified and later formalized into the engineering pipeline: daytime-only image reliability windows, missing or corrupted image segments, coastline padding artifacts, and class imbalance in rain events. These constraints shaped the current design choices in src/, including contract checks, data validation gates, deterministic cleaning, and explicit leakage-safe labeling with time-based splits.
Model selection follows the same reasoning. Classical baselines are kept as lower-bound references. The final training stack uses multimodal temporal learning because the task is inherently spatiotemporal: cloud structures evolve across image sequences, while meteorological signals provide complementary dynamics. This is why the project integrates a visual sequence branch with a meteorological sequence branch, then evaluates the fused model under rare-event-aware metrics.
For the full narrative on motivation, data realities, and architecture rationale, see Project Background and Modeling Rationale.
## Architecture

```text
raw tabular CSV + satellite PNG
            │
            ▼
src/data/contracts.py    -> schema/range/missing-rule validation
src/data/validate.py     -> alignment, continuity, image inventory checks
            │
            ▼
src/data/clean.py        -> reproducible cleaning pipeline
            │
            ▼
src/features/labeling.py -> forecast sample index + leakage guards (X_time < y_time)
            │
            ▼
src/data/split.py        -> strict time-based split by year (train/val/test)
            │
     ┌──────┴─────────────────────┐
     ▼                            ▼
src/models/baselines.py    src/models/multimodal/train.py
(traditional ML)           (ConvLSTM + LSTM multimodal network)
     │                            │
     └──────┬─────────────────────┘
            ▼
artifacts/reports + artifacts/figures + models/pytorch
```
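The two-branch design (a ConvLSTM over the image sequence, an LSTM over the meteorological sequence, fused before the classification head) can be sketched minimally as below. All layer sizes and class names here are illustrative assumptions; the actual architecture lives in `src/models/multimodal/`.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Minimal ConvLSTM cell: all four gates from one conv over [x, h]."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.hid_ch = hid_ch
        self.conv = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.conv(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

class MultimodalNet(nn.Module):
    """Visual ConvLSTM branch fused with a meteorological LSTM branch."""
    def __init__(self, img_ch=1, hid_ch=16, meteo_dim=8, meteo_hid=32):
        super().__init__()
        self.cell = ConvLSTMCell(img_ch, hid_ch)
        self.meteo = nn.LSTM(meteo_dim, meteo_hid, batch_first=True)
        self.head = nn.Linear(hid_ch + meteo_hid, 1)  # single logit: rain event

    def forward(self, imgs, meteo):
        # imgs: (B, T, C, H, W); meteo: (B, T, F)
        B, T, C, H, W = imgs.shape
        h = imgs.new_zeros(B, self.cell.hid_ch, H, W)
        c = torch.zeros_like(h)
        for t in range(T):                      # unroll over the image sequence
            h, c = self.cell(imgs[:, t], (h, c))
        vis = h.mean(dim=(2, 3))                # global average pool of last state
        _, (hm, _) = self.meteo(meteo)          # last hidden state of meteo branch
        return self.head(torch.cat([vis, hm[-1]], dim=1))  # (B, 1) logit

model = MultimodalNet()
logit = model(torch.randn(2, 4, 1, 64, 64), torch.randn(2, 4, 8))
```

The fusion is late (concatenation of pooled branch states), which keeps each branch independently replaceable during tuning.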
## Core Capabilities

| Concern | Implementation | Output |
|---|---|---|
| Data contract validation | `src/data/contracts.py` | `artifacts/reports/data_contract_report*.json` |
| Data quality checks | `src/data/validate.py` | `artifacts/data_quality_report.json` |
| Reproducible cleaning | `src/data/clean.py` | `data/interim/*clean*.csv`, cleaning summary JSON |
| Leakage-safe labeling | `src/features/labeling.py` | Forecast sample index with temporal constraints |
| Time-based split | `src/data/split.py` | Year-range split summaries / split CSVs |
| EDA automation | `src/analysis/eda_report.py` | EDA JSON + figures (distribution/correlation/seasonality) |
| Baseline modeling | `src/models/baselines.py` | PR-AUC, F1, Recall, Recall@Precision, Brier, confusion matrices |
| Multimodal training | `src/models/multimodal/*` | Best checkpoint + train summary JSON |
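The rare-event metrics listed above (PR-AUC, Recall@Precision, Brier) can be computed with scikit-learn roughly as follows. This is a sketch on synthetic scores, not the exact implementation in `src/models/baselines.py`; the precision floor of 0.5 is an assumed example value.

```python
import numpy as np
from sklearn.metrics import (average_precision_score, brier_score_loss,
                             precision_recall_curve)

rng = np.random.default_rng(0)
y_true = (rng.random(1000) < 0.1).astype(int)   # ~10% positives: rare events
y_prob = np.clip(0.1 + 0.5 * y_true + 0.2 * rng.standard_normal(1000), 0, 1)

pr_auc = average_precision_score(y_true, y_prob)  # PR-AUC (average precision)
brier = brier_score_loss(y_true, y_prob)          # probability calibration

# Recall@Precision: best recall among thresholds meeting a precision floor.
prec, rec, _ = precision_recall_curve(y_true, y_prob)
target = 0.5
recall_at_precision = rec[prec >= target].max() if (prec >= target).any() else 0.0
```

Under class imbalance, PR-AUC is preferred over ROC-AUC because it does not reward the model for ranking the abundant negatives correctly.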
## Mathematical Foundations

This project uses a forecasting-oriented mathematical stack rather than same-time classification assumptions:

- Forecast target definition with strict temporal ordering (`X_time < y_time`)
- Time-based split by target year to prevent future leakage
- ConvLSTM plus LSTM multimodal fusion for spatiotemporal representation learning
- Imbalance-aware optimization with weighted BCE, focal loss, and threshold moving
- Rare-event evaluation with PR-AUC, Recall@Precision, and Brier score
- Stability criterion based on repeated-run metric deltas

For full formulas, derivations, code mapping, and academic references, see the Mathematical Foundations documents listed under Documentation.
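The imbalance-aware losses named above can be sketched in PyTorch as below. The `pos_weight` value and the moved threshold of 0.3 are illustrative assumptions; the actual loss configuration lives in `src/models/multimodal/`.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Focal loss on logits: down-weights easy examples by (1 - p_t)^gamma."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = torch.exp(-bce)                       # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()

logits = torch.tensor([2.0, -1.0, 0.5])
targets = torch.tensor([1.0, 0.0, 1.0])

# Weighted BCE: pos_weight > 1 boosts the rare positive class
# (e.g. ~9 for a ~10% positive rate).
wbce = F.binary_cross_entropy_with_logits(
    logits, targets, pos_weight=torch.tensor([9.0]))

fl = focal_loss(logits, targets)

# Threshold moving: pick the decision threshold on validation data
# instead of defaulting to 0.5.
preds = (torch.sigmoid(logits) >= 0.3).int()
```

Weighted BCE and focal loss reshape the gradient signal during training, while threshold moving is a post-hoc correction applied at evaluation time; the three are complementary rather than interchangeable.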
## Pipeline Surface

All commands below can be run with the project-bound conda environment wrapper:

```bash
scripts/run_in_pytorch_env.sh <command ...>
```
| Stage | Command |
|---|---|
| Data contract | `python -m src.data.contracts --input data/processed/tabular/traverse_city_daytime_meteo_preprocessed.csv` |
| Data quality | `python -m src.data.validate --input data/processed/tabular/traverse_city_daytime_meteo_preprocessed.csv --image-dir data/processed/images/lake_michigan_64_png` |
| Cleaning pipeline | `python -m src.data.clean --input data/processed/tabular/traverse_city_daytime_meteo_preprocessed.csv --output data/interim/traverse_city_daytime_clean_v1.csv` |
| EDA report | `python -m src.analysis.eda_report --input data/interim/traverse_city_daytime_clean_v1.csv --summary artifacts/reports/eda_summary.json --fig-dir artifacts/figures` |
| Baselines | `python -m src.models.baselines --input data/interim/traverse_city_daytime_clean_v1.csv --report artifacts/reports/baseline_results.json --fig-dir artifacts/figures` |
| Multimodal train | `python -m src.models.multimodal.train --input-csv data/interim/traverse_city_daytime_clean_v1.csv --image-dir data/processed/images/lake_michigan_64_png --output-dir artifacts/reports --checkpoint-path models/pytorch/michigancast_multimodal_best.pth` |
## Getting Started

Prerequisites:

- Conda available in shell
- Local conda env: `pytorch_env`
- Dataset files under `data/processed/` (tabular + images)

Activate the environment:

```bash
source scripts/activate_pytorch_env.sh
```

Or run directly without activating the shell:

```bash
scripts/run_in_pytorch_env.sh python -V
```

Optional Jupyter kernel registration:

```bash
scripts/register_jupyter_kernel.sh
```

Quick smoke-test runs on small samples:

```bash
scripts/run_in_pytorch_env.sh python -m src.data.clean --nrows 5000
scripts/run_in_pytorch_env.sh python -m src.analysis.eda_report --nrows 5000
scripts/run_in_pytorch_env.sh python -m src.models.baselines --nrows 5000
scripts/run_in_pytorch_env.sh python -m src.models.multimodal.train --nrows 5000 --max-samples-per-split 256 --epochs 2
```

## Notebook Demo Flow

Primary demo notebooks (run in order):
| Order | Notebook | Goal |
|---|---|---|
| 00 | `notebooks/demo/00_demo_index.ipynb` | Demo roadmap and capability mapping |
| 01 | `notebooks/demo/01_data_foundation.ipynb` | Data contracts, validation, cleaning, data engineering foundation |
| 02 | `notebooks/demo/02_analysis_and_baselines.ipynb` | EDA and traditional ML baselines |
| 03 | `notebooks/demo/03_multimodal_training_and_tuning.ipynb` | Multimodal training and manual tuning |
| 04 | `notebooks/demo/04_imbalance_stability_export_serve.ipynb` | Imbalance strategy, stability, export, and serving |
Archived historical notebooks:

- `notebooks/legacy/00_util/read_large_weather_csv_experiments.ipynb`
- `notebooks/legacy/01_eda/lake_effect_precipitation_exploration.ipynb`
- `notebooks/legacy/02_preprocessing/lake_michigan_satellite_preprocessing.ipynb`
- `notebooks/legacy/03_training/multimodal_rainfall_training_pipeline.ipynb`
## Engineering Depth

- Data cleaning + analysis + ML: `contracts.py` / `validate.py` / `clean.py` / `eda_report.py` / `baselines.py` form a reproducible tabular ML pipeline with explicit quality gates and rare-event metrics.
- Manual neural-network tuning + local training: `src/models/multimodal/` decouples dataset construction, architecture, and training loop, enabling controlled local experiments on horizon/lookback/loss-threshold parameters.
- Data engineering: the repository uses layered data directories (`raw` / `interim` / `processed` / `reference`), versioned artifacts, environment scripts, and CLI-first pipelines to reduce notebook-only workflow risk.
## Repository Structure

```text
.
├── artifacts/         # generated reports/figures
├── configs/           # task + environment + experiment configs
├── data/              # raw / interim / processed / reference
├── docs/              # active docs, planning docs, and historical archive
├── models/            # trained checkpoints (.pth/.h5)
├── notebooks/demo/    # demo notebooks (00-04)
├── notebooks/legacy/  # archived historical notebooks
├── scripts/           # env helpers
├── src/               # production Python modules
└── tests/             # test placeholders (to be expanded)
```
## Documentation

- Documentation Index
- Project Background and Modeling Rationale
- Mathematical Foundations (ZH)
- Mathematical Foundations (EN)
- Project Structure
- Task Definition
To explore more of my work, visit odieyang.com/work.
This repository is a reconstruction of a course project from my time studying in Boston, built for a machine learning and neural networks class taught by Professor Dino. My teammate on the original project was Tianyu Zhang.
Back then, it looked like an assignment. Years later, it reads like a beginning.