sjoin_nearest vs cKDTree: Which Nearest-Neighbor Join to Use

GeoPandas' sjoin_nearest and SciPy's cKDTree both answer "which feature is closest," but they make different trade-offs in convenience, geometry support, and speed. This guide benchmarks them on the same data so you can pick deliberately. It is for anyone matching points to their nearest feature at scale. It sits under Nearest Neighbor & KD-Tree Search in Spatial Analysis & Advanced Query Techniques.

sjoin_nearest versus cKDTree comparison A side-by-side comparison: sjoin_nearest handles any geometry and keeps attributes but is slower; cKDTree is fastest but points-only and returns indices. gpd.sjoin_nearest convenience scipy cKDTree raw speed Any geometry (lines, polys) Points / centroids only Keeps attributes + distance Returns array indices R-tree under the hood Balanced KD-tree k = 1 nearest by default k-nearest, easy ~1x baseline speed ~5-10x faster, points Rule of thumb: sjoin_nearest unless points-only speed dominates — then cKDTree
Both solve nearest-neighbor matching; the choice is convenience and geometry support versus raw point speed.

Why This Approach / What Goes Wrong

sjoin_nearest works on any geometry type, keeps both layers' attributes, and returns the distance — it is the right default. But it builds an R-tree and carries DataFrame overhead, so for millions of points it is slower than necessary. cKDTree is a tight C implementation that finds nearest neighbors among coordinate arrays in a fraction of the time, but it only understands points (you must reduce polygons/lines to representative points), and it returns positional indices you map back yourself. The common error with cKDTree is feeding it geographic degrees: KD-tree distances are Euclidean, so coordinates must be in a projected, metric CRS or the "nearest" answer is wrong near the poles.

Prerequisites

conda install -c conda-forge "geopandas=0.14.*" "scipy=1.11.*" "numpy=1.26.*"

Step-by-Step Implementation

1. Load points and candidate targets in a projected CRS.

import geopandas as gpd

# 500k delivery addresses, 8k pickup lockers
addresses = gpd.read_file("addresses.gpkg").to_crs(epsg=25832)
lockers = gpd.read_file("lockers.gpkg").to_crs(epsg=25832)
assert addresses.crs.is_projected and lockers.crs.is_projected

2. The convenient path — sjoin_nearest.

nearest_gpd = gpd.sjoin_nearest(
    addresses, lockers[["locker_id", "geometry"]],
    how="left", distance_col="dist_m",
)

3. The fast path — cKDTree on coordinate arrays.

import numpy as np
from scipy.spatial import cKDTree

locker_xy = np.column_stack([lockers.geometry.x, lockers.geometry.y])
address_xy = np.column_stack([addresses.geometry.x, addresses.geometry.y])

tree = cKDTree(locker_xy)
dist_m, idx = tree.query(address_xy, k=1)          # nearest locker per address

addresses["locker_id"] = lockers["locker_id"].to_numpy()[idx]
addresses["dist_m"] = dist_m

4. Benchmark both on the same inputs.

import time

def timed(label, fn):
    t0 = time.perf_counter()
    fn()
    print(f"{label}: {time.perf_counter() - t0:.2f}s")

timed("sjoin_nearest", lambda: gpd.sjoin_nearest(addresses, lockers, distance_col="d"))
timed("cKDTree", lambda: cKDTree(locker_xy).query(address_xy, k=1))
# sjoin_nearest: 3.41s
# cKDTree: 0.42s

Verification

Both methods must agree on the matches — if they don't, a CRS or coordinate-ordering bug is present.

import numpy as np

# Compare the two results' assigned locker ids
agree = (nearest_gpd.sort_index()["locker_id"].to_numpy() == addresses["locker_id"].to_numpy())
print(f"Agreement: {agree.mean():.4%}")            # Agreement: 100.0000%
assert agree.mean() > 0.999, "Methods disagree — check CRS and (x, y) order"

# Distances should match closely (sub-metre)
assert np.allclose(nearest_gpd.sort_index()["dist_m"], addresses["dist_m"], atol=0.01)

Edge Cases & Debugging