sjoin_nearest vs cKDTree: Which Nearest-Neighbor Join to Use

GeoPandas' sjoin_nearest and SciPy's cKDTree both answer "which feature is closest," but they make different trade-offs in convenience, geometry support, and speed. This guide benchmarks them on the same data so you can pick deliberately. It is for anyone matching points to their nearest feature at scale. It sits under Nearest Neighbor & KD-Tree Search in Spatial Analysis & Advanced Query Techniques.

Both solve nearest-neighbor matching; the choice is convenience and geometry support versus raw point speed.

Why This Approach / What Goes Wrong

sjoin_nearest works on any geometry type, keeps both layers' attributes, and returns the distance — it is the right default. But it builds an R-tree and carries DataFrame overhead, so for millions of points it is slower than necessary. cKDTree is a tight C implementation that finds nearest neighbors among coordinate arrays in a fraction of the time, but it only understands points (you must reduce polygons/lines to representative points), and it returns positional indices you map back yourself. The common error with cKDTree is feeding it geographic degrees: KD-tree distances are Euclidean, so coordinates must be in a projected, metric CRS or the "nearest" answer is wrong near the poles.

Prerequisites

geopandas>=0.14
scipy>=1.11 (cKDTree)
numpy>=1.26

conda install -c conda-forge "geopandas=0.14.*" "scipy=1.11.*" "numpy=1.26.*"

Step-by-Step Implementation

1. Load points and candidate targets in a projected CRS.

import geopandas as gpd

# 500k delivery addresses, 8k pickup lockers
addresses = gpd.read_file("addresses.gpkg").to_crs(epsg=25832)
lockers = gpd.read_file("lockers.gpkg").to_crs(epsg=25832)
assert addresses.crs.is_projected and lockers.crs.is_projected

2. The convenient path — sjoin_nearest.

nearest_gpd = gpd.sjoin_nearest(
    addresses, lockers[["locker_id", "geometry"]],
    how="left", distance_col="dist_m",
)

3. The fast path — cKDTree on coordinate arrays.

import numpy as np
from scipy.spatial import cKDTree

locker_xy = np.column_stack([lockers.geometry.x, lockers.geometry.y])
address_xy = np.column_stack([addresses.geometry.x, addresses.geometry.y])

tree = cKDTree(locker_xy)
dist_m, idx = tree.query(address_xy, k=1)          # nearest locker per address

addresses["locker_id"] = lockers["locker_id"].to_numpy()[idx]
addresses["dist_m"] = dist_m

4. Benchmark both on the same inputs.

import time

def timed(label, fn):
    t0 = time.perf_counter()
    fn()
    print(f"{label}: {time.perf_counter() - t0:.2f}s")

timed("sjoin_nearest", lambda: gpd.sjoin_nearest(addresses, lockers, distance_col="d"))
timed("cKDTree", lambda: cKDTree(locker_xy).query(address_xy, k=1))
# sjoin_nearest: 3.41s
# cKDTree: 0.42s

Verification

Both methods must agree on the matches — if they don't, a CRS or coordinate-ordering bug is present.

import numpy as np

# Compare the two results' assigned locker ids
agree = (nearest_gpd.sort_index()["locker_id"].to_numpy() == addresses["locker_id"].to_numpy())
print(f"Agreement: {agree.mean():.4%}")            # Agreement: 100.0000%
assert agree.mean() > 0.999, "Methods disagree — check CRS and (x, y) order"

# Distances should match closely (sub-metre)
assert np.allclose(nearest_gpd.sort_index()["dist_m"], addresses["dist_m"], atol=0.01)

Edge Cases & Debugging

cKDTree picks the wrong neighbor. Coordinates are in degrees; reproject to a metric CRS — KD-tree distance is planar Euclidean.
Polygons with cKDTree. Reduce to representative points first (geometry.centroid or representative_point()), accepting the approximation.
sjoin_nearest returns duplicate rows. Ties at equal distance match multiple targets; keep the first or break ties on an attribute.
Index misalignment after cKDTree. tree.query returns positional indices into the target array; map with .to_numpy()[idx], not .iloc on a filtered frame.
k-nearest needed. tree.query(pts, k=5) is trivial; sjoin_nearest needs max_distance/manual work for k>1.
Memory spike. cKDTree is light; the spike is usually materializing a huge sjoin_nearest result — select needed columns first.