Spatial Join vs Attribute Join in GeoPandas

A merge joins rows on a shared key; an sjoin joins rows on a spatial relationship. Mixing them up produces either an empty result or a needlessly expensive one. This guide draws the line clearly and shows when each is correct. It is for anyone combining two datasets in GeoPandas.

Why This Approach / What Goes Wrong

If both datasets already share a key, such as a parcel id or a region code, a plain attribute merge is exact and fast; there is no reason to involve geometry. You need a spatial join only when the relationship is geometric: which district contains this point, which parcels a flood polygon overlaps. The mistakes go both ways. People reach for sjoin when a key exists, paying for an R-tree and risking border ambiguity they didn't need. Others try to merge spatial data that has no common key and get nothing. And when an sjoin is right, the inputs must share a CRS: if they do not, GeoPandas issues a warning rather than an error, and the join typically comes back with no matches. The detailed sjoin mechanics are in Performing Left Joins with GeoPandas sjoin.
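The CRS point is easy to see on a toy case. The sketch below is a minimal illustration, not part of the original workflow: the single sensor, the single district, and the choice of EPSG:4326 and EPSG:3857 are all assumptions made up for the demonstration. It runs the same join twice, once with mismatched CRSs and once after reprojecting.

```python
import warnings

import geopandas as gpd
from shapely.geometry import Point, box

# Hypothetical data: one sensor in lon/lat, one district stored in Web Mercator.
sensors = gpd.GeoDataFrame(
    {"sensor_id": [1]}, geometry=[Point(10.0, 50.0)], crs="EPSG:4326"
)
districts = gpd.GeoDataFrame(
    {"district_name": ["Demo"]},
    geometry=[box(9.0, 49.0, 11.0, 51.0)],
    crs="EPSG:4326",
).to_crs("EPSG:3857")

# Mismatched CRS: the join still runs, but the coordinates do not line up.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    mismatched = gpd.sjoin(sensors, districts, how="left", predicate="within")

# Same CRS: reproject the sensors to the districts' CRS before joining.
aligned = gpd.sjoin(
    sensors.to_crs(districts.crs), districts, how="left", predicate="within"
)

print("warnings captured:", len(caught))
print("mismatched match:", mismatched["district_name"].tolist())
print("aligned match:", aligned["district_name"].tolist())
```

The first join leaves district_name empty for the sensor, while the second finds the district. Comparing sensors.crs with districts.crs before any sjoin is a cheap guard against the first outcome.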

Prerequisites

geopandas>=0.14
pandas>=2.0

conda install -c conda-forge "geopandas>=0.14" "pandas>=2.0"

Step-by-Step Implementation

1. When a shared key exists, use an attribute merge.

import geopandas as gpd
import pandas as pd

# parcels (geometry + parcel_id) and a non-spatial assessment table sharing parcel_id
parcels = gpd.read_file("parcels.gpkg")
assessments = pd.read_csv("assessments.csv")    # columns: parcel_id, assessed_value

parcels_valued = parcels.merge(assessments, on="parcel_id", how="left")
# Geometry untouched; rows matched purely on the key
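A key merge can also check its own assumptions at merge time. The sketch below uses two pandas merge options, validate and indicator. The tiny tables are hypothetical stand-ins for parcels and assessments, and plain DataFrames are used for brevity. GeoDataFrame.merge should accept the same keyword arguments, since it delegates to pandas.

```python
import pandas as pd

# Hypothetical stand-ins for the parcel attributes and the assessment table.
parcels = pd.DataFrame({"parcel_id": [101, 102, 103]})
assessments = pd.DataFrame(
    {"parcel_id": [101, 102], "assessed_value": [250_000, 310_000]}
)

merged = parcels.merge(
    assessments,
    on="parcel_id",
    how="left",
    validate="many_to_one",  # raises MergeError if parcel_id repeats on the right
    indicator=True,          # adds a _merge column: "both" or "left_only"
)

# Parcels that found no row in the assessment table.
unmatched = merged.loc[merged["_merge"] == "left_only", "parcel_id"].tolist()
print("Parcels with no assessment:", unmatched)
```

With validate set, a non-unique key fails loudly at the merge instead of quietly multiplying rows, and the _merge column separates "no match" from "matched, but the value is null".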

2. When the only relationship is location, use sjoin.

import geopandas as gpd

# sensors (points) gain the district they fall within
sensors = gpd.read_file("sensors.gpkg")
districts = gpd.read_file("districts.gpkg")

# Both MUST share a CRS for the predicate results to be meaningful
sensors = sensors.to_crs(districts.crs)

sensors_in_district = gpd.sjoin(
    sensors, districts[["district_name", "geometry"]],
    how="left", predicate="within",
)
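A left sjoin can return more rows than the left input when a point matches several polygons, and the choice of predicate changes what happens on shared borders. The sketch below is a hypothetical layout of two square districts sharing an edge. The tie-break shown, keeping the alphabetically first district, is only one possible rule and is an assumption here.

```python
import geopandas as gpd
from shapely.geometry import Point, box

# Hypothetical layout: districts A and B share the edge x = 1.
districts = gpd.GeoDataFrame(
    {"district_name": ["A", "B"]},
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1)],
    crs="EPSG:3857",
)
sensors = gpd.GeoDataFrame(
    {"sensor_id": [1, 2]},
    geometry=[Point(0.5, 0.5), Point(1.0, 0.5)],  # sensor 2 sits on the shared edge
    crs="EPSG:3857",
)

# "within" excludes boundary points; "intersects" includes them.
within = gpd.sjoin(sensors, districts, how="left", predicate="within")
intersects = gpd.sjoin(sensors, districts, how="left", predicate="intersects")

# Tie-break (an assumed rule): keep the alphabetically first district
# for every sensor that matched more than once.
ordered = intersects.sort_values(["sensor_id", "district_name"])
one_per_sensor = ordered.drop_duplicates(subset="sensor_id", keep="first")

print(within[["sensor_id", "district_name"]])
print(intersects[["sensor_id", "district_name"]])
print(one_per_sensor[["sensor_id", "district_name"]])
```

With "within", the border sensor matches no district. With "intersects", it matches both and produces an extra row, which the tie-break then collapses back to one row per sensor.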

3. Combine both when appropriate — spatially assign a region, then merge regional attributes by key.

import pandas as pd

region_stats = pd.read_csv("region_stats.csv")   # district_name, avg_income
enriched = sensors_in_district.merge(region_stats, on="district_name", how="left")

Verification

Check that a key-based merge preserved row count and a spatial join matched plausibly.

# Attribute merge: a left join must not change the number of parcels
assert len(parcels_valued) == len(parcels), "Key merge duplicated rows — non-unique key"
print("Parcels without an assessment:", parcels_valued["assessed_value"].isna().sum())

# Spatial join: most sensors should land in some district
matched = sensors_in_district["district_name"].notna().mean()
print(f"Sensors matched to a district: {matched:.1%}")   # example output: Sensors matched to a district: 98.7%
assert matched > 0.5, "Few matches — likely a CRS mismatch between inputs"

Edge Cases & Debugging

sjoin returns all nulls. In a left join, every right-hand column is null when nothing matched. The inputs are most likely in different CRSs; reproject one to match before joining.
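merge returns no matches. The key values are formatted differently on each side (leading zeros, stray whitespace, letter case), or the key was read with a different dtype. Recent pandas versions typically raise a ValueError when asked to merge an integer key with a string key instead of returning an empty match. Either way, normalize the key to one dtype and one format first.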
Row count explodes after merge. The key isn't unique on the right side; deduplicate or aggregate before merging.
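Points on borders match two polygons in sjoin. This happens with predicate="intersects". With predicate="within", a point lying exactly on a border matches neither polygon, so it comes back unmatched instead. Pick the predicate deliberately, then apply a tie-break for duplicates or snap the borders.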
Used sjoin where a key existed. Slower and introduces border ambiguity — prefer merge when a reliable key is present.
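Lost geometry after merge. Call merge on the GeoDataFrame (left), not on the plain DataFrame. Calling it on the plain DataFrame returns a plain DataFrame: the geometry column is still present, but GeoDataFrame behavior such as the CRS and spatial methods is gone.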
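Three of the merge fixes above can be sketched together: normalizing the key, making the right side unique, and merging from the GeoDataFrame. The data, the three-digit zero padding, and the keep-the-last-row rule are all hypothetical choices for the example; real data will need its own normalization and aggregation rule.

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import Point

# Hypothetical data: zero-padded string ids on the parcels, integers in the
# table, and one parcel that was assessed twice.
parcels = gpd.GeoDataFrame(
    {"parcel_id": ["001", "002", "003"]},
    geometry=[Point(0, 0), Point(1, 1), Point(2, 2)],
    crs="EPSG:4326",
)
assessments = pd.DataFrame(
    {"parcel_id": [1, 2, 2], "assessed_value": [100, 200, 250]}
)

# Normalize the key to the parcels' format: string, stripped, zero-padded to 3.
assessments["parcel_id"] = (
    assessments["parcel_id"].astype(str).str.strip().str.zfill(3)
)

# Make the right side unique; keeping the last row per parcel is an assumed rule.
assessments = assessments.drop_duplicates(subset="parcel_id", keep="last")

# Call merge on the GeoDataFrame so the result stays a GeoDataFrame.
parcels_valued = parcels.merge(assessments, on="parcel_id", how="left")

print(type(parcels_valued).__name__, len(parcels_valued))
print(parcels_valued[["parcel_id", "assessed_value"]])
```

The row count equals the number of parcels, the parcel with no assessment keeps a null value, and the result is still a GeoDataFrame with its CRS intact.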