When performing a spatial join between region polygons and point coordinates, points that fall exactly on a shared boundary between two regions are matched to both adjacent polygons. This causes the resulting joined GeoDataFrame to have more rows than the input coords, silently producing incorrect results.
|
x_coords, y_coords = zip(*[(x, y) for y, x in coords]) |
|
coords = gpd.GeoDataFrame( |
|
{"geometry": gpd.points_from_xy(x_coords, y_coords)}, crs="EPSG:4326" |
|
).to_crs(regions.crs) |
|
regions.set_index("region_id", inplace=True) |
|
joined = regions.sjoin(coords, how="right").to_crs(regions.crs) |
Expected behaviour
len(joined) == len(coords)
Actual behaviour
len(joined) == len(coords) + N where N is the number of points landing on shared boundaries.
Inspecting the duplicates:
region_id index geometry
351695 E01019227 18235.0 POINT (-2.96825 55.00966)
351695 E01019228 18236.0 POINT (-2.96825 55.00966)
The same point is matched to two adjacent regions because the default predicate="intersects" counts boundary-straddling points as belonging to both polygons.
When performing a spatial join between region polygons and point coordinates, points that fall exactly on a shared boundary between two regions are matched to both adjacent polygons. This causes the resulting
joinedGeoDataFrame to have more rows than the inputcoords, silently producing incorrect results.Geocode/geocode/utilities.py
Lines 262 to 267 in 4312fa8
Expected behaviour
len(joined) == len(coords)Actual behaviour
len(joined) == len(coords) + Nwhere N is the number of points landing on shared boundaries.Inspecting the duplicates:
The same point is matched to two adjacent regions because the default
predicate="intersects"counts boundary-straddling points as belonging to both polygons.