A reproducible geospatial ETL and spatial machine learning workflow for historical sub-national population estimation in Europe (1850-2020).
This repository supports comparative population reconstruction across countries with heterogeneous historical records. National tabular and boundary datasets are harmonized into a common panel, enriched with geospatial predictors, and used to train spatially aware polygon-level models plus a mass-preserving dasymetric allocation workflow.
Primary goal:
- Estimate population patterns for administrative units and periods where direct observations are sparse, inconsistent, or missing, and allocate known polygon totals to grid cells using explicit dasymetric weighting.
For each enabled country, the pipeline:
- Loads configured tabular and geometry inputs.
- Harmonizes changing administrative units via optional crosswalks.
- Reprojects to a shared canonical CRS (`EPSG:3035` by default).
- Applies QA checks (key uniqueness, geometry validity, CRS consistency, join coverage).
- Extracts configured raster covariates and adds derived geometric covariates.
- Writes country panel outputs.
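Two of the QA checks listed above, key uniqueness and join coverage, can be illustrated with a minimal base R sketch. The column names (`unit_id`, `year`, `population`), the helper functions, and the `min_coverage` threshold are hypothetical and chosen for illustration; the pipeline's actual checks and thresholds are defined by its own code and `config/global/qa.yml`.

```r
# Hypothetical toy inputs: a tabular panel and a boundary attribute table.
tab <- data.frame(
  unit_id = c("A", "B", "C", "C"),
  year = c(1900, 1900, 1900, 1910),
  population = c(1200, 3400, 560, 610)
)
geom <- data.frame(unit_id = c("A", "B"))

# Key uniqueness: each (unit_id, year) pair should appear exactly once.
check_key_unique <- function(df, keys) {
  !any(duplicated(df[, keys, drop = FALSE]))
}

# Join coverage: share of tabular rows whose key has a matching geometry.
join_coverage <- function(tab, geom, key = "unit_id") {
  mean(tab[[key]] %in% geom[[key]])
}

key_ok <- check_key_unique(tab, c("unit_id", "year"))
coverage <- join_coverage(tab, geom)

# Hypothetical threshold; a real value would come from the QA configuration.
min_coverage <- 0.95
if (coverage < min_coverage) {
  message(sprintf("Join coverage %.2f is below threshold %.2f", coverage, min_coverage))
}
```

In this toy input the keys are unique, but unit `C` has no geometry, so only half of the rows would survive the join.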
After country panels are assembled, the workflow:
- Combines country panels into a global panel.
- Constructs SoilGrids PCA features (`soil_pc1` to `soil_pcN`).
- Creates spatial block cross-validation folds.
- Trains and tunes `ranger`, `xgboost`, and `lightgbm` models.
- Selects the best model by cross-validated RMSE.
- Evaluates final performance on a holdout country (`DEU` by default).
- Uses the selected model family as a score-proxy model for ML-weighted dasymetric allocation (not direct cell-count prediction).
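The idea behind spatial block cross-validation can be sketched in base R as follows. This is an illustration under stated assumptions, not the repository's implementation: the function `make_block_folds`, the square 100 km blocks, and the random round-robin assignment of blocks to folds are all hypothetical choices. The point is that whole blocks, not individual units, are assigned to folds, so neighbouring units do not end up on both sides of a train/validation split.

```r
# Assign each unit (by centroid) to a square block, then assign whole blocks
# to folds. Coordinates are assumed to be in a projected CRS in metres.
make_block_folds <- function(x, y, block_size, k) {
  block_id <- paste(floor(x / block_size), floor(y / block_size), sep = "_")
  blocks <- sort(unique(block_id))
  # Shuffle the blocks, then deal them out round-robin to k folds.
  shuffled <- sample(blocks)
  fold_of_block <- setNames(rep_len(seq_len(k), length(shuffled)), shuffled)
  unname(fold_of_block[block_id])
}

# Hypothetical centroids in an equal-area CRS such as EPSG:3035.
set.seed(42)
x <- runif(200, 4000000, 4400000)
y <- runif(200, 3000000, 3400000)
block_size <- 100000  # 100 km blocks (illustrative)

folds <- make_block_folds(x, y, block_size = block_size, k = 4)

# Sanity check: every block maps to exactly one fold.
block_id <- paste(floor(x / block_size), floor(y / block_size), sep = "_")
folds_per_block <- tapply(folds, block_id, function(f) length(unique(f)))
```

Block size controls how much spatial autocorrelation is kept out of the validation error; larger blocks give a more conservative estimate but fewer distinct folds.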
For polygon-years with known population totals, the workflow:
- Builds a prediction grid in the canonical equal-area CRS.
- Computes exact polygon-grid overlap areas.
- Generates a nonnegative cell-level weighting surface (current baseline: uniform area).
  - Optional: ML-weighted score/intensity surface from a score-proxy model trained without the polygon-only `log_area` covariate.
- Normalizes weights within each polygon-year and allocates polygon totals to cells exactly.
- Writes constrained population count rasters plus allocation diagnostics (mass-preservation QA).
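The allocation steps above can be sketched in base R. This is a minimal illustration, not the pipeline's code: the overlap table, its column names, the `allocate_totals` function, and the fallback to area weights when all scores in a polygon are zero are assumptions made for the example. Weights are normalized within each polygon, so the allocated cell values sum back to the known polygon total.

```r
# One row per polygon-cell intersection (hypothetical values).
overlaps <- data.frame(
  polygon_id = c("P1", "P1", "P1", "P2", "P2"),
  cell_id = c(1, 2, 3, 3, 4),
  overlap_area = c(2, 1, 1, 3, 1),  # overlap area, e.g. km^2
  score = c(1, 4, 0.5, 2, 2)        # nonnegative intensity per unit area
)
totals <- c(P1 = 1000, P2 = 400)    # known polygon totals for one year

allocate_totals <- function(overlaps, totals, use_score = FALSE) {
  # Raw weight: overlap area alone (uniform baseline) or area times score.
  raw <- if (use_score) overlaps$overlap_area * overlaps$score else overlaps$overlap_area
  stopifnot(all(raw >= 0))
  raw_sum <- ave(raw, overlaps$polygon_id, FUN = sum)
  area_sum <- ave(overlaps$overlap_area, overlaps$polygon_id, FUN = sum)
  # Normalize within each polygon; fall back to area shares if all weights are zero.
  share <- ifelse(raw_sum > 0, raw / raw_sum, overlaps$overlap_area / area_sum)
  overlaps$allocated <- share * unname(totals[as.character(overlaps$polygon_id)])
  overlaps
}

uniform <- allocate_totals(overlaps, totals)
weighted <- allocate_totals(overlaps, totals, use_score = TRUE)

# Cell counts are the sum of contributions from all overlapping polygons.
cell_counts <- tapply(uniform$allocated, uniform$cell_id, sum)
# Uniform-area baseline gives cells 1..4: 500, 250, 550, 100

# Mass-preservation check: allocated values must sum to the polygon totals.
mass_error <- function(res) {
  sums <- tapply(res$allocated, res$polygon_id, sum)
  max(abs(sums - totals[names(sums)]))
}
```

Both `mass_error(uniform)` and `mass_error(weighted)` should be zero up to floating-point tolerance; the score only changes how a polygon's total is distributed among its cells, never the total itself.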
The table below reflects `countries.enabled` in `config/global/project.yml` and the corresponding country config files.
| ISO3 | Country | Time frame in panel | Administrative unit |
|---|---|---|---|
| `DEU` | Germany | 1890, 1900, 1910 | Harmonized electoral districts (`target_unit_id`: `ADM_HARM_DEU_V1`; raw ID: `Wahlkreis_Nummer`) |
| `NLD` | Netherlands | 1850, 1860, 1870, 1880, 1890, 1900, 1910, 1920, 1930, 1940, 1950, 1960, 1970, 1990, 2000, 2010, 2020 | Harmonized municipalities (`target_unit_id`: `ADM_HARM_NLD_V1`; raw ID: `GMDNR`) |
Configured raster predictors:
- `elevation_mean`
- `slope_mean`
- `tri_mean`
- `dist_coast_km`
- `dist_river_km`
Additional modeling covariates:
- `soil_pc1` to `soil_pcN` (SoilGrids PCA)
- `log_area`
- `lon`, `lat`
- `year_<value>` dummy variables
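As a rough illustration of how the `log_area` and `year_<value>` covariates might be derived, here is a base R sketch. The input column `area_km2`, the use of the natural logarithm, and the choice to keep one indicator per observed year (rather than dropping a reference year) are assumptions made for the example, not confirmed details of the pipeline.

```r
# Hypothetical panel with one row per unit-year.
panel <- data.frame(
  unit_id = c("A", "B", "A", "B"),
  year = c(1900, 1900, 1910, 1910),
  area_km2 = c(10, 100, 10, 100)  # hypothetical column name and unit
)

# Log-transformed polygon area (natural log assumed here).
panel$log_area <- log(panel$area_km2)

# One indicator column per observed year, named year_<value>.
year_f <- factor(panel$year)
dummies <- model.matrix(~ year_f - 1)
colnames(dummies) <- paste0("year_", levels(year_f))
panel <- cbind(panel, dummies)
```

Because `log_area` describes the polygon rather than a grid cell, it is only meaningful for polygon-level models.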
Source definitions, provenance, and licensing are documented in `docs/FEATURE_SOURCES.md`.
Core output artifacts:
- `data/final/<ISO3>/<ISO3>_panel.gpkg`
- `data/final/<ISO3>/<ISO3>_panel.parquet`
- `data/final/global_panel.gpkg`
- `data/final/global_panel.parquet`
- `data/final/predictions/global_<year>_intensity_<model_id>.tif`
- `data/final/predictions/global_<year>_population_count_constrained_<model_id>.tif`
- `data/final/predictions/global_<year>_population_count_calibrated_<model_id>.tif` (optional; when calibration totals are configured)
- `data/final/diagnostics/allocation_diagnostics_all_<model_id>.csv`
- `models/model_summary.csv`
- `models/cv_summary.csv`
- `models/<model_id>_folds.csv`
- `models/<model_id>_final.rds`
Column-level schema is specified in `docs/DATA_SCHEMA.md`.

    docker compose run --rm pipeline

    R -q -e "renv::restore(prompt=FALSE)"
    R -q -e "targets::tar_make()"
    R -q -e "testthat::test_dir('tests/testthat')"

- `config/global/project.yml`: enabled countries, project seed
- `config/global/crs.yml`: canonical CRS
- `config/global/ml.yml`: model setup, CV, holdout country, raster prediction settings
- `config/global/allocation.yml`: constrained allocation settings (area denominator, fallback, allocation QA tolerance, optional calibration totals)
- `config/global/qa.yml`: QA thresholds and behavior
- `config/global/paths.yml`: output directory settings
- `config/countries/<ISO3>.yml`: country-specific input mappings and assembly logic
- `config/crosswalks/<ISO3>.csv`: harmonization crosswalk tables
- `config/sources/features.yml`: active feature registry
- `config/sources/*.yml`: source-specific acquisition and processing settings
- Add raw files under `data/raw/<ISO3>/`.
- Add `config/countries/<ISO3>.yml`.
- Add `config/crosswalks/<ISO3>.csv` if harmonization is required.
- Add `<ISO3>` under `countries.enabled` in `config/global/project.yml`.
- Re-run `targets::tar_make()`.
- Dependency versions are locked in `renv.lock`.
- Workflow orchestration and caching are managed by `targets`.
- The `targets` cache lives in `_targets/` (local/regenerable; ignored by git).
- Containerized execution is defined by `Dockerfile` and `docker-compose.yml`.
- Pipeline behavior is config-driven and country-extensible.
_targets.R Pipeline definition
_targets/ Local targets cache (git-ignored, regenerable)
R/ Pipeline functions
config/ Global, country, and source configuration
data/raw/ Input data (not committed)
data/final/ Final outputs
models/ Trained models and evaluation artifacts
docs/ Project spec, schema, and feature source documentation
tests/testthat/ Test suite
Use these files as source of truth:
- `docs/PROJECT_SPEC.md`
- `docs/DATA_SCHEMA.md`
- `docs/FEATURE_SOURCES.md`