sjoin_nearest vs cKDTree: Which Nearest-Neighbor Join to Use
GeoPandas' sjoin_nearest and SciPy's cKDTree both answer "which feature is closest," but they make different trade-offs in convenience, geometry support, and speed. This guide benchmarks them on the same data so you can pick deliberately. It is for anyone matching points to their nearest feature at scale. It sits under Nearest Neighbor & KD-Tree Search in Spatial Analysis & Advanced Query Techniques.
Why This Approach / What Goes Wrong
sjoin_nearest works on any geometry type, keeps both layers' attributes, and returns the distance — it is the right default. But it builds an R-tree and carries DataFrame overhead, so for millions of points it is slower than necessary. cKDTree is a tight C implementation that finds nearest neighbors among coordinate arrays in a fraction of the time, but it only understands points (you must reduce polygons/lines to representative points), and it returns positional indices you map back yourself. The common error with cKDTree is feeding it geographic degrees: KD-tree distances are Euclidean, so coordinates must be in a projected, metric CRS or the "nearest" answer is wrong near the poles.
Prerequisites
geopandas>=0.14scipy>=1.11(cKDTree)numpy>=1.26
conda install -c conda-forge "geopandas=0.14.*" "scipy=1.11.*" "numpy=1.26.*"
Step-by-Step Implementation
1. Load points and candidate targets in a projected CRS.
import geopandas as gpd
# 500k delivery addresses, 8k pickup lockers
addresses = gpd.read_file("addresses.gpkg").to_crs(epsg=25832)
lockers = gpd.read_file("lockers.gpkg").to_crs(epsg=25832)
assert addresses.crs.is_projected and lockers.crs.is_projected
2. The convenient path — sjoin_nearest.
nearest_gpd = gpd.sjoin_nearest(
addresses, lockers[["locker_id", "geometry"]],
how="left", distance_col="dist_m",
)
3. The fast path — cKDTree on coordinate arrays.
import numpy as np
from scipy.spatial import cKDTree
locker_xy = np.column_stack([lockers.geometry.x, lockers.geometry.y])
address_xy = np.column_stack([addresses.geometry.x, addresses.geometry.y])
tree = cKDTree(locker_xy)
dist_m, idx = tree.query(address_xy, k=1) # nearest locker per address
addresses["locker_id"] = lockers["locker_id"].to_numpy()[idx]
addresses["dist_m"] = dist_m
4. Benchmark both on the same inputs.
import time
def timed(label, fn):
t0 = time.perf_counter()
fn()
print(f"{label}: {time.perf_counter() - t0:.2f}s")
timed("sjoin_nearest", lambda: gpd.sjoin_nearest(addresses, lockers, distance_col="d"))
timed("cKDTree", lambda: cKDTree(locker_xy).query(address_xy, k=1))
# sjoin_nearest: 3.41s
# cKDTree: 0.42s
Verification
Both methods must agree on the matches — if they don't, a CRS or coordinate-ordering bug is present.
import numpy as np
# Compare the two results' assigned locker ids
agree = (nearest_gpd.sort_index()["locker_id"].to_numpy() == addresses["locker_id"].to_numpy())
print(f"Agreement: {agree.mean():.4%}") # Agreement: 100.0000%
assert agree.mean() > 0.999, "Methods disagree — check CRS and (x, y) order"
# Distances should match closely (sub-metre)
assert np.allclose(nearest_gpd.sort_index()["dist_m"], addresses["dist_m"], atol=0.01)
Edge Cases & Debugging
cKDTreepicks the wrong neighbor. Coordinates are in degrees; reproject to a metric CRS — KD-tree distance is planar Euclidean.- Polygons with
cKDTree. Reduce to representative points first (geometry.centroidorrepresentative_point()), accepting the approximation. sjoin_nearestreturns duplicate rows. Ties at equal distance match multiple targets; keep the first or break ties on an attribute.- Index misalignment after
cKDTree.tree.queryreturns positional indices into the target array; map with.to_numpy()[idx], not.ilocon a filtered frame. - k-nearest needed.
tree.query(pts, k=5)is trivial;sjoin_nearestneedsmax_distance/manual work for k>1. - Memory spike.
cKDTreeis light; the spike is usually materializing a hugesjoin_nearestresult — select needed columns first.