🌍 Spatial Predictions

A reproducible geospatial ETL and spatial machine learning workflow for historical sub-national population estimation in Europe (1850-2020).

🧪 Project

This repository supports comparative population reconstruction across countries with heterogeneous historical records. National tabular and boundary datasets are harmonized into a common panel, enriched with geospatial predictors, and used to train spatially aware polygon-level models plus a mass-preserving dasymetric allocation workflow.

Primary goal: estimate population patterns for administrative units and periods where direct observations are sparse, inconsistent, or missing, and allocate known polygon totals to grid cells using explicit dasymetric weighting.

⚙️ Method overview

1) Country-level harmonization

For each enabled country, the pipeline:

Loads configured tabular and geometry inputs.
Harmonizes changing administrative units via optional crosswalks.
Reprojects to a shared canonical CRS (EPSG:3035 by default).
Applies QA checks (key uniqueness, geometry validity, CRS consistency, join coverage).
Extracts configured raster covariates and adds derived geometric covariates.
Writes country panel outputs.
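
Two of the QA checks listed above (key uniqueness and join coverage) could look roughly like the base R sketch below. The function names, the `unit_id` and `year` column names, and the `min_coverage` argument are illustrative assumptions, not the pipeline's actual API; the real thresholds are set in config/global/qa.yml.

```r
# Hypothetical QA helpers; names and columns are assumptions for illustration.

# Key uniqueness: each (unit_id, year) pair should appear at most once.
check_key_uniqueness <- function(panel, keys = c("unit_id", "year")) {
  dup <- duplicated(panel[, keys, drop = FALSE])
  list(ok = !any(dup), n_duplicates = sum(dup))
}

# Join coverage: share of distinct tabular units that find a matching geometry.
check_join_coverage <- function(tabular_ids, geometry_ids, min_coverage = 1) {
  ids <- unique(tabular_ids)
  coverage <- mean(ids %in% geometry_ids)
  list(
    ok = coverage >= min_coverage,
    coverage = coverage,
    unmatched = setdiff(ids, geometry_ids)
  )
}

# Toy panel with one duplicated key (B, 1900) and one unit without geometry (B).
panel <- data.frame(
  unit_id = c("A", "A", "B", "B", "B"),
  year = c(1890, 1900, 1890, 1900, 1900),
  population = c(100, 120, 80, 90, 90),
  stringsAsFactors = FALSE
)

key_qa <- check_key_uniqueness(panel)
join_qa <- check_join_coverage(panel$unit_id, geometry_ids = c("A", "C"))
```

Returning a small list per check, rather than stopping immediately, lets the caller decide whether a failed check is a warning or an error.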

2) Cross-country polygon-level modeling

After country panels are assembled, the workflow:

Combines country panels into a global panel.
Constructs SoilGrids PCA features (soil_pc1 to soil_pcN).
Creates spatial block cross-validation folds.
Trains and tunes ranger, xgboost, and lightgbm models.
Selects the best model by cross-validated RMSE.
Evaluates final performance on a holdout country (DEU by default).
Uses the selected model family as a score-proxy model for ML-weighted dasymetric allocation (not direct cell-count prediction).
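
The fold construction and RMSE-based selection can be illustrated with a small base R sketch. It is not the pipeline's implementation: `lm()` stands in for ranger, xgboost, and lightgbm to avoid extra dependencies, and the block size, the `log_density` target, the helper names, and the toy data are all assumptions.

```r
# Assign observations to square spatial blocks, then deal whole blocks out to
# folds, so units in the same block never straddle training and validation.
make_spatial_block_folds <- function(x, y, block_size, n_folds, seed = 1) {
  block_id <- paste(floor(x / block_size), floor(y / block_size), sep = "_")
  blocks <- sort(unique(block_id))
  set.seed(seed)
  shuffled <- sample(blocks)
  fold_of_block <- setNames(rep_len(seq_len(n_folds), length(shuffled)), shuffled)
  unname(fold_of_block[block_id])
}

rmse <- function(observed, predicted) sqrt(mean((observed - predicted)^2))

# Synthetic data: the target depends on elevation plus noise.
set.seed(42)
n <- 200
toy <- data.frame(x = runif(n, 0, 100), y = runif(n, 0, 100))
toy$elevation_mean <- runif(n, 0, 500)
toy$log_density <- 5 - 0.004 * toy$elevation_mean + rnorm(n, sd = 0.3)
toy$fold <- make_spatial_block_folds(toy$x, toy$y, block_size = 25, n_folds = 4)

# Candidate "models" (stand-ins for the real learners).
candidates <- list(
  intercept_only = log_density ~ 1,
  elevation = log_density ~ elevation_mean
)

# Mean cross-validated RMSE per candidate.
cv_rmse <- sapply(candidates, function(f) {
  fold_rmse <- sapply(sort(unique(toy$fold)), function(k) {
    fit <- lm(f, data = toy[toy$fold != k, ])
    held_out <- toy[toy$fold == k, ]
    rmse(held_out$log_density, predict(fit, newdata = held_out))
  })
  mean(fold_rmse)
})

best_model <- names(which.min(cv_rmse))
```

Assigning whole blocks to folds reduces leakage from spatial autocorrelation, which would otherwise make cross-validated RMSE look better than out-of-region performance. The holdout country would be removed before this step and scored only once with the selected model.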

3) Constrained dasymetric allocation (production path)

For polygon-years with known population totals, the workflow:

Builds a prediction grid in the canonical equal-area CRS.
Computes exact polygon-grid overlap areas.
Generates a nonnegative cell-level weighting surface. The current baseline is uniform area weighting; an optional ML-weighted score/intensity surface comes from a score-proxy model trained without the polygon-only log_area covariate.
Normalizes weights within each polygon-year and allocates polygon totals to cells exactly.
Writes constrained population count rasters plus allocation diagnostics (mass-preservation QA).
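
The normalization and exact allocation steps can be sketched in base R as follows. The table layout (`polygon_id`, `year`, `cell_id`, `overlap_area`, `score`), the function name, and the toy numbers are assumptions for illustration. The sketch also assumes every polygon-year has a positive weight sum, so it does not show the fallback behavior configured in config/global/allocation.yml.

```r
# overlaps: one row per polygon-cell intersection.
# totals:   one row per polygon-year with a known population total.
allocate_dasymetric <- function(overlaps, totals) {
  # Raw weight = overlap area x cell score; a constant score gives area weighting.
  overlaps$raw_weight <- overlaps$overlap_area * overlaps$score
  stopifnot(all(overlaps$raw_weight >= 0))

  # Normalize within each polygon-year so the weights sum to 1.
  key <- paste(overlaps$polygon_id, overlaps$year, sep = "_")
  weight_sum <- ave(overlaps$raw_weight, key, FUN = sum)
  stopifnot(all(weight_sum > 0))
  overlaps$weight <- overlaps$raw_weight / weight_sum

  # Attach the polygon total and split it across cells.
  idx <- match(key, paste(totals$polygon_id, totals$year, sep = "_"))
  overlaps$allocated <- overlaps$weight * totals$population[idx]
  overlaps
}

overlaps <- data.frame(
  polygon_id = c("P1", "P1", "P2", "P2"),
  year = 1900,
  cell_id = c("c1", "c2", "c2", "c3"),
  overlap_area = c(2, 6, 4, 4),
  score = 1,  # uniform-area baseline
  stringsAsFactors = FALSE
)
totals <- data.frame(
  polygon_id = c("P1", "P2"),
  year = 1900,
  population = c(800, 100),
  stringsAsFactors = FALSE
)

result <- allocate_dasymetric(overlaps, totals)

# Cell counts: sum contributions from every polygon touching the cell.
cell_counts <- tapply(result$allocated, result$cell_id, sum)

# Mass-preservation QA: allocated mass per polygon must equal the known total.
allocated_by_polygon <- tapply(result$allocated, result$polygon_id, sum)
max_abs_error <- max(abs(
  as.numeric(allocated_by_polygon[totals$polygon_id]) - totals$population
))
```

Because weights are normalized per polygon-year before multiplication, the allocated values sum back to each polygon total up to floating-point error, whatever score surface is plugged in.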

🌐 Included countries (current config)

The table below reflects countries.enabled in config/global/project.yml and the corresponding country config files.

| ISO3 | Country | Time frame in panel | Administrative unit |
| --- | --- | --- | --- |
| `DEU` | Germany | `1890`, `1900`, `1910` | Harmonized electoral districts (`target_unit_id: ADM_HARM_DEU_V1`; raw ID: `Wahlkreis_Nummer`) |
| `NLD` | Netherlands | `1850`, `1860`, `1870`, `1880`, `1890`, `1900`, `1910`, `1920`, `1930`, `1940`, `1950`, `1960`, `1970`, `1990`, `2000`, `2010`, `2020` | Harmonized municipalities (`target_unit_id: ADM_HARM_NLD_V1`; raw ID: `GMDNR`) |

🧭 Predictor set

Configured raster predictors:

elevation_mean
slope_mean
tri_mean
dist_coast_km
dist_river_km

Additional modeling covariates:

soil_pc1 to soil_pcN (SoilGrids PCA)
log_area
lon, lat
year_<value> dummy variables
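
As a rough illustration of how the additional covariates might be derived, the base R sketch below builds PCA scores, a log-area term, and year dummies on synthetic data. The input column names (`soil_clay`, `soil_sand`, `soil_ph`, `area_m2`), the number of components, and the use of the natural logarithm are assumptions; the actual definitions are in docs/FEATURE_SOURCES.md.

```r
# Synthetic panel; soil values are random placeholders, not SoilGrids data.
set.seed(1)
panel <- data.frame(
  year = c(1890, 1900, 1910, 1890, 1900, 1910),
  area_m2 = c(2.0e8, 2.0e8, 2.0e8, 5.5e8, 5.5e8, 5.5e8),
  soil_clay = rnorm(6),
  soil_sand = rnorm(6),
  soil_ph = rnorm(6)
)

# Soil PCA: center and scale the soil layers, keep the first N components.
n_components <- 2
soil_cols <- c("soil_clay", "soil_sand", "soil_ph")
pca <- prcomp(panel[, soil_cols], center = TRUE, scale. = TRUE)
soil_pcs <- pca$x[, seq_len(n_components), drop = FALSE]
colnames(soil_pcs) <- paste0("soil_pc", seq_len(n_components))

# Log of polygon area (natural log assumed here).
panel$log_area <- log(panel$area_m2)

# One dummy column per observed year, named year_<value>.
year_dummies <- model.matrix(~ factor(year) - 1, data = panel)
colnames(year_dummies) <- paste0("year_", sort(unique(panel$year)))

features <- cbind(panel, soil_pcs, year_dummies)
```

In a train/predict setting, the PCA rotation and scaling fitted on the training panel would be reused for new data via `predict(pca, newdata = ...)` rather than refitted.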

Source definitions, provenance, and licensing are documented in docs/FEATURE_SOURCES.md.

📦 Outputs

Core output artifacts:

data/final/<ISO3>/<ISO3>_panel.gpkg
data/final/<ISO3>/<ISO3>_panel.parquet
data/final/global_panel.gpkg
data/final/global_panel.parquet
data/final/predictions/global_<year>_intensity_<model_id>.tif
data/final/predictions/global_<year>_population_count_constrained_<model_id>.tif
data/final/predictions/global_<year>_population_count_calibrated_<model_id>.tif (optional; when calibration totals are configured)
data/final/diagnostics/allocation_diagnostics_all_<model_id>.csv
models/model_summary.csv
models/cv_summary.csv
models/<model_id>_folds.csv
models/<model_id>_final.rds

Column-level schema is specified in docs/DATA_SCHEMA.md.

🚀 Running the pipeline

🐳 Docker

docker compose run --rm pipeline

💻 Local

R -q -e "renv::restore(prompt=FALSE)"
R -q -e "targets::tar_make()"

🧪 Tests

R -q -e "testthat::test_dir('tests/testthat')"

🗂️ Configuration structure

config/global/project.yml: enabled countries, project seed
config/global/crs.yml: canonical CRS
config/global/ml.yml: model setup, CV, holdout country, raster prediction settings
config/global/allocation.yml: constrained allocation settings (area denominator, fallback, allocation QA tolerance, optional calibration totals)
config/global/qa.yml: QA thresholds and behavior
config/global/paths.yml: output directory settings
config/countries/<ISO3>.yml: country-specific input mappings and assembly logic
config/crosswalks/<ISO3>.csv: harmonization crosswalk tables
config/sources/features.yml: active feature registry
config/sources/*.yml: source-specific acquisition and processing settings

➕ Adding a country case

Add raw files under data/raw/<ISO3>/.
Add config/countries/<ISO3>.yml.
Add config/crosswalks/<ISO3>.csv if harmonization is required.
Add <ISO3> under countries.enabled in config/global/project.yml.
Re-run targets::tar_make().
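
Before re-running the pipeline, a quick pre-flight check along these lines can confirm the files from the steps above are in place. This helper is hypothetical and not part of the repository; `FRA` is only an example code, and the check skips the countries.enabled entry because parsing YAML would need an extra package.

```r
# Hypothetical helper: report which expected inputs exist for a new country.
check_country_inputs <- function(iso3, root = ".", needs_crosswalk = FALSE) {
  paths <- c(
    raw_dir = file.path(root, "data", "raw", iso3),
    country_config = file.path(root, "config", "countries", paste0(iso3, ".yml"))
  )
  if (needs_crosswalk) {
    paths["crosswalk"] <- file.path(root, "config", "crosswalks", paste0(iso3, ".csv"))
  }
  data.frame(
    item = names(paths),
    path = unname(paths),
    found = unname(file.exists(paths)),
    stringsAsFactors = FALSE
  )
}

# Demo in a throwaway directory: raw folder and country config exist,
# the crosswalk has not been added yet.
root <- tempfile("repo_")
dir.create(file.path(root, "data", "raw", "FRA"), recursive = TRUE)
dir.create(file.path(root, "config", "countries"), recursive = TRUE)
invisible(file.create(file.path(root, "config", "countries", "FRA.yml")))

status <- check_country_inputs("FRA", root = root, needs_crosswalk = TRUE)
missing_items <- status$item[!status$found]
```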

🔁 Reproducibility

Dependency versions are locked in renv.lock.
Workflow orchestration and caching are managed by targets.
The targets cache lives in _targets/ (local/regenerable; ignored by git).
Containerized execution is defined by Dockerfile and docker-compose.yml.
Pipeline behavior is config-driven and country-extensible.

📁 Repository layout

_targets.R              Pipeline definition
_targets/               Local targets cache (git-ignored, regenerable)
R/                      Pipeline functions
config/                 Global, country, and source configuration
data/raw/               Input data (not committed)
data/final/             Final outputs
models/                 Trained models and evaluation artifacts
docs/                   Project spec, schema, and feature source documentation
tests/testthat/         Test suite

📚 Canonical documentation

Use these files as source of truth:

docs/PROJECT_SPEC.md
docs/DATA_SCHEMA.md
docs/FEATURE_SOURCES.md
