Skip to content

Bug: sjoin produces duplicate rows for points on shared polygon boundaries #82

@Herald-TUOS

Description

@Herald-TUOS

When performing a spatial join between region polygons and point coordinates, points that fall exactly on a shared boundary between two regions are matched to both adjacent polygons. This causes the resulting joined GeoDataFrame to have more rows than the input coords, silently producing incorrect results.

x_coords, y_coords = zip(*[(x, y) for y, x in coords])
coords = gpd.GeoDataFrame(
{"geometry": gpd.points_from_xy(x_coords, y_coords)}, crs="EPSG:4326"
).to_crs(regions.crs)
regions.set_index("region_id", inplace=True)
joined = regions.sjoin(coords, how="right").to_crs(regions.crs)

Expected behaviour

len(joined) == len(coords)

Actual behaviour

len(joined) == len(coords) + N where N is the number of points landing on shared boundaries.

Inspecting the duplicates:

        region_id    index                   geometry
351695  E01019227  18235.0  POINT (-2.96825 55.00966)
351695  E01019228  18236.0  POINT (-2.96825 55.00966)

The same point is matched to two adjacent regions because the default predicate="intersects" counts boundary-straddling points as belonging to both polygons.

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions