UCL/t2e-soar

ebdp-lightweight

Lightweight modelling workflow for the EBDP TWIN2EXPAND project, supporting evidence-based approaches to urban design and planning.

Data Quality Confidence Scores

NEW: This project includes a comprehensive data quality analysis framework that assigns confidence scores to each city based on POI data completeness and reliability.

Key Features:

  • Confidence scores (HIGH/MEDIUM/LOW) for all cities
  • Validated against official datasets (France/Netherlands)
  • 4 filtering scenarios for different research goals
  • Automatic outputs (visualizations, reports, scores)

Development

Project configuration is managed via a pyproject.toml file. For development, uv is used to install and manage the packages and related upgrades. For example, uv sync installs the packages listed in pyproject.toml and creates a self-contained development environment in a .venv folder.

Data Loading

See the data_loading.md markdown file for data loading guidelines.

Licenses

This repo depends on copy-left open source packages licensed as AGPLv3 and therefore adopts the same license. This is also in keeping with the intention of the TWIN2EXPAND project to create openly reproducible workflows.

The Overture Maps data source is licensed under the Community Data License Agreement – Permissive, Version 2.0, with some layers licensed under the Open Data Commons Open Database License. OpenStreetMap data is © OpenStreetMap contributors.

Loading Notes

The data source is a combination of EU Copernicus data and Overture Maps, which largely mirrors OpenStreetMap. Overture aims to provide a higher degree of data verification and issues fixed releases.

Boundaries

Boundaries are extracted from the 2021 Urban Centres / High Density Clusters dataset. This is a 1×1 km raster in which high-density clusters are contiguous 1 km² cells with at least 1,500 residents per km², forming cumulative urban clusters with at least 50,000 people.
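The cluster rule above can be sketched in plain Python on a synthetic grid. This is illustrative only — the actual script operates on the downloaded TIFF, not on hand-built lists:

```python
from collections import deque

# Synthetic 1 km² population grid (residents per cell)
pop = [
    [0,    200,  1600,  1700, 0],
    [0,    1800, 2000,  1900, 0],
    [100,  1600, 40000, 2500, 0],
    [0,    0,    1600,  0,    0],
    [0,    0,    0,     0,    1600],
]
R, C = len(pop), len(pop[0])

# High-density cells: at least 1,500 residents per km²
dense = [[pop[r][c] >= 1500 for c in range(C)] for r in range(R)]

# Flood-fill contiguous dense cells (4-connectivity) and sum each cluster
seen = [[False] * C for _ in range(R)]
clusters = []
for r in range(R):
    for c in range(C):
        if dense[r][c] and not seen[r][c]:
            total, q = 0, deque([(r, c)])
            seen[r][c] = True
            while q:
                y, x = q.popleft()
                total += pop[y][x]
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < R and 0 <= nx < C and dense[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        q.append((ny, nx))
            clusters.append(total)

# Urban centres: clusters with cumulative population of at least 50,000
urban = [t for t in clusters if t >= 50_000]
print(urban)
```

Here the first cluster (summing to 54,700 residents) qualifies as an urban centre, while the isolated dense cell does not.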

Download the dataset from the above link, then run the generate_boundary_polys.py script to generate the vector boundaries from the raster source. Provide the input data path to the TIFF file and the output file path for the generated vector boundaries in GPKG format. The generated GPKG will contain three layers named bounds, unioned_bounds_2000, and unioned_bounds_10000. The script automatically removes boundaries intersecting the UK.

Example:

python -m src.data.generate_boundary_polys temp/HDENS-CLST-2021/HDENS_CLST_2021.tif temp/datasets/boundaries.gpkg

Urban Atlas

Urban Atlas (~37 GB vectors).

Run the load_urban_atlas_blocks.py script to generate the blocks data. Provide the path to the boundaries GPKG generated previously, as well as the downloaded Urban Atlas data and an output path for the generated blocks GPKG.

Example:

python -m src.data.load_urban_atlas_blocks \
    temp/datasets/boundaries.gpkg \
    temp/UA_2018_3035_eu \
    temp/datasets/blocks.gpkg

Tree cover

Tree cover (~36 GB vectors).

Run the load_urban_atlas_trees.py script to generate the tree cover data. Provide the path to the boundaries GPKG generated previously, as well as the downloaded STL data and an output path for the generated tree cover GPKG.

Example:

python -m src.data.load_urban_atlas_trees \
    temp/datasets/boundaries.gpkg \
    temp/STL_2018_3035_eu \
    temp/datasets/tree_canopies.gpkg

Building Heights

Digital Height Model (~1 GB raster).

Run the load_bldg_hts_raster.py script to generate the building heights data. Provide the path to the boundaries GPKG generated previously, as well as the downloaded building height data and an output folder path for the extracted building heights TIFF files.

Example:

python -m src.data.load_bldg_hts_raster \
    temp/datasets/boundaries.gpkg \
    temp/Building_Height_2012_3035_eu \
    temp/cities_data/heights

Ingesting Overture data

Run the load_overture.py script to download and prepare the Overture data. The script downloads the relevant Overture GPKG files for each boundary, clips them to the boundary, and saves them to the output directory. Provide the path to the boundaries GPKG generated previously, as well as an output directory for the clipped Overture data. Optionally, specify the number of parallel workers to speed up processing (2 by default). Pass --overwrite to redo processing for boundaries that already have corresponding Overture data in the output directory; otherwise, existing data is skipped. Each boundary is saved as a separate GPKG file named with the boundary ID, containing layers for buildings, street edges, street nodes, a cleaned version of street edges (clean_edges), POI places, and infrastructure.

python -m src.data.load_overture \
    temp/datasets/boundaries.gpkg \
    temp/cities_data/overture \
    --parallel_workers 6
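A simplified sketch of the skip/--overwrite behaviour described above (the function name and layout are illustrative, not the script's actual code):

```python
from pathlib import Path

def boundaries_to_process(boundary_ids, out_dir, overwrite=False):
    """Return the boundary IDs that still need Overture processing.

    Illustrative only: mirrors the documented behaviour, not the real script.
    """
    out = Path(out_dir)
    todo = []
    for bid in boundary_ids:
        target = out / f"{bid}.gpkg"  # one GPKG per boundary, named by ID
        if target.exists() and not overwrite:
            continue  # existing data is skipped unless --overwrite is passed
        todo.append(bid)
    return todo
```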

The Overture POI schema is based on overture_categories.csv (docs/schema/concepts/by-theme/places/overture_categories.csv).

Census Data (2021)

GeoStat Census data for 2021 is downloaded from Eurostat. These census statistics are aggregated to 1 km² cells.

Download the census ZIP dataset for Version 2021 (22 January 2025).

Metrics

Compute metrics using the generate_metrics.py script. Provide the path to the boundaries GPKG, the directory containing the processed Overture data, the blocks GPKG, the tree canopies GPKG, the census GPKG, and an output directory for the generated metrics GPKG files.

python -m src.processing.generate_metrics \
    temp/datasets/boundaries.gpkg \
    temp/cities_data/overture \
    temp/datasets/blocks.gpkg \
    temp/datasets/tree_canopies.gpkg \
    temp/cities_data/heights \
    temp/Eurostat_Census-GRID_2021_V2/ESTAT_Census_2021_V2.gpkg \
    temp/cities_data/processed

Data Quality Analysis

After computing metrics, you can assess POI (Point of Interest) data saturation across all cities using a grid-based multi-scale regression approach. This analysis compares each city's POI density against population-based expectations, identifying cities that are undersaturated (fewer POIs than expected) or saturated (at or above expected levels).

POI Saturation Assessment Workflow

Open src/analysis/poi_saturation_notebook.py in VS Code and run all cells sequentially. The notebook uses # %% cell markers for interactive execution.

Configuration: Modify the paths in the configuration cell:

  • BOUNDS_PATH - Path to boundaries GPKG
  • OVERTURE_DATA_DIR - Directory with Overture data
  • CENSUS_PATH - Path to census GPKG
  • OUTPUT_DIR - Where to save results

This workflow performs seven steps:

  1. Grid-level aggregation: Counts POIs within 1km² census grid cells and computes multi-scale population neighborhoods (local, intermediate, large radii)
  2. Random Forest regression: For each POI category, fits a model in log-space: log(POI_count) ~ log(pop_local) + log(pop_intermediate) + log(pop_large). Log transformation linearizes power-law relationships between population and POI counts.
  3. Z-score computation: Standardized residuals identify grid cells with more/fewer POIs than expected given their multi-scale population context
  4. City-level aggregation: Aggregates grid z-scores per city, computing mean (saturation level) and standard deviation (spatial variability)
  5. Quadrant classification: Cities are classified into four quadrants based on mean z-score × variability:
    • Consistently Undersaturated: Low POI coverage, uniform distribution (potential data gap)
    • Variable Undersaturated: Low coverage, high spatial variability
    • Consistently Saturated: Expected or above POI coverage, uniform distribution
    • Variable Saturated: High coverage, high spatial variability
  6. Feature importance analysis: Compares which population scale (local, intermediate, large) best predicts each POI category
  7. Report generation: Creates comprehensive markdown report with quadrant classifications and visualizations
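Steps 2 and 3 can be illustrated with a small synthetic example. Note that the sketch below uses ordinary least squares rather than the notebook's Random Forest, purely to show how the log transform linearizes the power-law relationship; all data are synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic multi-scale population neighbourhoods (residents within each radius)
pop_local = rng.uniform(1e3, 1e5, n)
pop_inter = pop_local * rng.uniform(1.5, 3.0, n)
pop_large = pop_inter * rng.uniform(1.5, 3.0, n)

# POI counts follow a power law of local population, with multiplicative noise
poi = 0.01 * pop_local**0.8 * rng.lognormal(0.0, 0.1, n)

# The log transform linearizes the power law:
# log(poi) = b0 + b1*log(pop_local) + b2*log(pop_inter) + b3*log(pop_large)
X = np.column_stack([np.ones(n), np.log(pop_local),
                     np.log(pop_inter), np.log(pop_large)])
y = np.log(poi)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Standardized residuals: z < 0 means fewer POIs than expected for the
# multi-scale population context, z > 0 means more
resid = y - X @ beta
z = (resid - resid.mean()) / resid.std()
```

The fitted coefficient on log(pop_local) recovers the power-law exponent (0.8 here), and the z-scores feed directly into the city-level aggregation of step 4.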

Output Files

The analysis generates:

  • grid_stats.gpkg - Grid-level POI counts with multi-scale population neighborhoods
  • grid_counts_regress.gpkg - Grid cells with z-scores and predicted values per category
  • city_analysis_results.gpkg - City-level z-score statistics and quadrant classifications
  • city_assessment_report.md - Comprehensive analysis report

Visualizations:

  • eda_analysis.png - Model fit (R²) and z-score distributions by category
  • regression_diagnostics.png - Predicted vs observed POI counts per category
  • feature_importance.png - Population scale importance for each POI type
  • city_quadrant_analysis.png - 12-panel visualization showing per-category and between-category quadrant classification

Note: All GeoPackage outputs can be opened in QGIS for spatial visualization and exploration.

Interpreting Results

Z-scores represent continuous deviations from expected POI counts:

  • z < 0: Fewer POIs than expected (undersaturated)
  • z > 0: More POIs than expected (saturated)

The quadrant analysis combines:

  • Mean z-score: Overall saturation level (negative = undersaturated, positive = saturated)
  • Std z-score: Spatial variability (low = consistent across grids, high = variable)
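The mean × variability combination can be expressed as a tiny classifier. The variability threshold below is a hypothetical illustration, not the notebook's actual cut-off:

```python
def classify_city(mean_z, std_z, var_threshold=1.0):
    """Quadrant classification from city-level z-score statistics.

    var_threshold is a hypothetical value for illustration only.
    """
    saturation = "Undersaturated" if mean_z < 0 else "Saturated"
    consistency = "Consistently" if std_z < var_threshold else "Variable"
    return f"{consistency} {saturation}"

print(classify_city(-0.6, 0.3))  # → "Consistently Undersaturated"
print(classify_city(0.4, 1.5))   # → "Variable Saturated"
```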

Recommended interpretation:

  • Consistently Undersaturated cities: May indicate data quality issues; use with caution
  • Variable Undersaturated cities: Partial coverage; some areas may be reliable
  • Consistently/Variable Saturated cities: Suitable for most analyses

Validation Datasets

The data quality vignette (EG1) validates SOAR POI anomaly detection against official national datasets. These datasets are used for multi-city validation but are not required to run the main SOAR pipeline.

France: SIRENE Business Registry

Official Name: Base Sirène des entreprises et de leurs établissements (SIREN, SIRET)

Source: Institut national de la statistique et des études économiques (INSEE)

License: Open License 2.0 (Licence Ouverte / Etalab v2.0)

Description: National registry of all French businesses and establishments with economic activity codes (APE - Activité Principale Exercée), geographic coordinates, and administrative status.

Coverage: ~31 million establishments (as of January 2026)

Download:

Citation:

INSEE (2026). Base Sirène des entreprises et de leurs établissements (SIREN, SIRET).
Institut National de la Statistique et des Études Économiques.
https://www.data.gouv.fr/en/datasets/base-sirene-des-entreprises-et-de-leurs-etablissements-siren-siret/
Retrieved: January 2026

Required columns:

  • siret - Unique establishment identifier (14 digits)
  • activitePrincipaleEtablissement - Economic activity code (APE/NAF)
  • etatAdministratifEtablissement - Administrative status (A=active, F=closed)
  • coordonneeLambertAbscisseEtablissement - X coordinate (Lambert 93 projection)
  • coordonneeLambertOrdonneeEtablissement - Y coordinate (Lambert 93 projection)
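The active-establishment filter used in validation preparation can be sketched with the column names listed above (sample values are hypothetical, and the real script presumably works on the full Parquet/CSV extract rather than in-memory records):

```python
# Hypothetical sample records using the SIRENE column names listed above
rows = [
    {"siret": "12345678900011",
     "activitePrincipaleEtablissement": "56.10A",
     "etatAdministratifEtablissement": "A",   # active
     "coordonneeLambertAbscisseEtablissement": 652000.0,
     "coordonneeLambertOrdonneeEtablissement": 6862000.0},
    {"siret": "98765432100022",
     "activitePrincipaleEtablissement": "47.11F",
     "etatAdministratifEtablissement": "F",   # closed establishment
     "coordonneeLambertAbscisseEtablissement": None,
     "coordonneeLambertOrdonneeEtablissement": None},
]

# Keep only active establishments ("A") with usable Lambert 93 coordinates
active = [
    r for r in rows
    if r["etatAdministratifEtablissement"] == "A"
    and r["coordonneeLambertAbscisseEtablissement"] is not None
    and r["coordonneeLambertOrdonneeEtablissement"] is not None
]
print(len(active))
```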

Classification system: APE codes follow the French NAF classification (Nomenclature d'Activités Française), derived from EU NACE Rev. 2 standard. See: https://www.insee.fr/en/metadonnees/nafr2/

Validation usage: Harmonized to 5 POI categories via APE code mapping (see paper_research/code/eg1_data_quality/VALIDATION_FRAMEWORK.md for details).


Netherlands: BAG Building Registry

Official Name: Basisregistratie Adressen en Gebouwen (BAG) - Basic Registration of Addresses and Buildings

Source: Kadaster (Dutch Cadastre, Land Registry and Mapping Agency)

License: CC0 1.0 Universal (Public Domain)

Description: National registry of all buildings and addresses in the Netherlands with usage designations (gebruiksdoel), geometric footprints, and construction status.

Coverage: ~10 million buildings with ~18 million address objects (as of January 2026)

Download:

Citation:

Kadaster (2026). Basisregistratie Adressen en Gebouwen (BAG) 2.0 Extract.
Kadaster, Dutch Land Registry and Mapping Agency.
https://www.kadaster.nl/zakelijk/producten/adressen-en-gebouwen/bag-2.0-extract
Retrieved: January 2026

Required file: 9999VBO08012026.zip from the BAG extract

  • VBO = Verblijfsobject (dwelling object / address object with usage purpose)

Required fields:

  • identificatie - Unique object identifier (16 digits)
  • gebruiksdoel - Usage purpose designation (functional category)
  • geometry - Building footprint or address point (RD New projection, EPSG:28992)
  • status - Object status (use only active records)

Classification system: Gebruiksdoel (usage purposes) include:

  • woonfunctie - Residential function
  • winkelfunctie - Shop/retail function
  • logiesfunctie - Lodging/accommodation function
  • bijeenkomstfunctie - Meeting/assembly function
  • gezondheidszorgfunctie - Healthcare function
  • onderwijsfunctie - Education function
  • And others (see BAG documentation)
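A dictionary-based harmonisation of gebruiksdoel values might look like the sketch below. The category names are hypothetical: the real 5-category mapping is defined in VALIDATION_FRAMEWORK.md and may differ.

```python
# Hypothetical mapping; the actual harmonisation lives in
# paper_research/code/eg1_data_quality/VALIDATION_FRAMEWORK.md
CATEGORY_MAP = {
    "winkelfunctie": "retail",
    "logiesfunctie": "accommodation",
    "bijeenkomstfunctie": "community",
    "gezondheidszorgfunctie": "health",
    "onderwijsfunctie": "education",
}

def harmonise(gebruiksdoel):
    # Residential (woonfunctie) and unmapped purposes return None and are
    # excluded from POI validation in this sketch
    return CATEGORY_MAP.get(gebruiksdoel)

print(harmonise("winkelfunctie"))  # → retail
print(harmonise("woonfunctie"))    # → None
```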

Documentation: https://zakelijk.kadaster.nl/bag-2.0-extract

Validation usage: Harmonized to 5 POI categories via usage purpose mapping (see paper_research/code/eg1_data_quality/VALIDATION_FRAMEWORK.md for details).


Validation Data Preparation

Location: Place validation datasets in temp/validation/ directory

Preparation script: paper_research/code/eg1_data_quality/prepare_validation_data.py

This script:

  1. Loads SIRENE Parquet/CSV and filters to active establishments with coordinates
  2. Extracts BAG VBO data from ZIP and filters to objects with usage purposes
  3. Converts coordinates to WGS84 (EPSG:4326)
  4. Filters to European geographic extent
  5. Saves processed GeoPackage files: sirene_france.gpkg, bag_netherlands.gpkg
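Step 4 can be sketched as a simple bounding-box test once coordinates are in WGS84. The exact extent used by prepare_validation_data.py is an assumption; the box below is illustrative:

```python
# Hypothetical European bounding box (lon_min, lat_min, lon_max, lat_max);
# the real script's extent may differ
EUROPE_BBOX = (-25.0, 34.0, 45.0, 72.0)

def in_europe(lon, lat, bbox=EUROPE_BBOX):
    """True if a WGS84 point falls inside the (assumed) European extent."""
    lon_min, lat_min, lon_max, lat_max = bbox
    return lon_min <= lon <= lon_max and lat_min <= lat <= lat_max

print(in_europe(2.35, 48.85))   # Paris → True
print(in_europe(-74.0, 40.7))   # New York → False
```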

Usage:

# Place raw data in temp/validation/
# - StockEtablissement_utf8.parquet (or .csv)
# - lvbag-extract-nl/ (directory with ZIP files)

# Run preparation
python paper_research/code/eg1_data_quality/prepare_validation_data.py

# Outputs:
# - temp/validation/sirene_france.gpkg (~8-12 GB)
# - temp/validation/bag_netherlands.gpkg (~4-6 GB)

Validation methodology: See paper_research/code/eg1_data_quality/VALIDATION_FRAMEWORK.md and VALIDATION_PROCESS_SUMMARY.md for complete documentation of:

  • Category harmonization approach
  • City selection strategy (24 cities stratified by population)
  • Statistical analysis methods (Spearman rank correlation, systematic bias quantification)
  • Interpretation guidelines
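The Spearman rank correlation used in the statistical analysis can be illustrated with a minimal pure-Python version (no tie handling; in practice a library routine such as scipy.stats.spearmanr would be used):

```python
def rank(xs):
    """Assign 1-based ranks by sorted order (ties not handled)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    for rank_i, i in enumerate(order, start=1):
        r[i] = float(rank_i)
    return r

def spearman(x, y):
    """Spearman's rho via the rank-difference formula (assumes no ties)."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n**2 - 1))

# Perfectly monotone relationship → rho = 1.0
print(spearman([1, 2, 3, 4], [10, 20, 30, 40]))
```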

Data Citation Guidelines

When using SOAR with validation:

For methods section:

The SOAR dataset was validated against official national registries: the French SIRENE business registry (INSEE, 2026) covering ~31 million establishments and the Netherlands BAG building registry (Kadaster, 2026) covering ~10 million buildings.

For acknowledgments:

This research used data from INSEE (Institut National de la Statistique et des Études Économiques) and Kadaster (Dutch Land Registry and Mapping Agency).

For data availability statement:

SIRENE data are publicly available from https://www.data.gouv.fr/en/datasets/base-sirene-des-entreprises-et-de-leurs-etablissements-siren-siret/ under Open License 2.0. BAG data are publicly available from https://www.kadaster.nl/zakelijk/producten/adressen-en-gebouwen/bag-2.0-extract under CC0 1.0 Universal license.
