Spatial Join vs Attribute Join in GeoPandas
A merge joins rows on a shared key; an sjoin joins rows on a spatial relationship. Mixing them up produces either an empty result or a needlessly expensive one. This guide draws the line clearly and shows when each is correct. It is for anyone combining two datasets in GeoPandas. It sits under Spatial Joins & Merging in Geospatial Data Ingestion & Processing Workflows.
Why This Approach / What Goes Wrong
If both datasets already share a key, such as a parcel id or a region code, a plain attribute merge is exact and fast; there is no reason to involve geometry. You need a spatial join only when the relationship is geometric: which district contains this point, which parcels a flood polygon overlaps. The mistakes go both ways. People reach for sjoin when a key exists, paying for an R-tree and risking border ambiguity they didn't need. Others try to merge spatial data that has no common key and get nothing. And when an sjoin is right, the inputs must share a CRS: GeoPandas warns about a mismatch but does not raise, and with coordinates in different units the predicates typically match nothing. The detailed sjoin mechanics are in Performing Left Joins with GeoPandas sjoin.
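To make the CRS point concrete, here is a minimal sketch on in-memory data. The sensor, the district, and their coordinates are invented for illustration; the only assumption is that one layer is stored in EPSG:4326 and the other in EPSG:3857.

```python
import warnings

import geopandas as gpd
from shapely.geometry import Point, box

# Hypothetical sensor stored in lon/lat degrees (EPSG:4326).
points = gpd.GeoDataFrame(
    {"sensor": ["s1"]}, geometry=[Point(10.0, 50.0)], crs="EPSG:4326"
)

# Hypothetical district around the same location, stored in Web Mercator metres.
districts = gpd.GeoDataFrame(
    {"district_name": ["north"]},
    geometry=[box(9.0, 49.0, 11.0, 51.0)],
    crs="EPSG:4326",
).to_crs("EPSG:3857")

# Joining across CRSs: GeoPandas warns (suppressed here) but still runs.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    mismatched = gpd.sjoin(points, districts, how="left", predicate="within")

# Reprojecting first lets the same predicate find the district.
aligned = gpd.sjoin(
    points.to_crs(districts.crs), districts, how="left", predicate="within"
)

print(mismatched["district_name"].isna().all())  # no district found
print(aligned["district_name"].iloc[0])          # district found after to_crs
```

In a real pipeline it is usually better to leave the warning visible, since it is the only signal that the inputs disagree.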
Prerequisites
geopandas>=0.14
pandas>=2.0
conda install -c conda-forge "geopandas=0.14.*" "pandas=2.0.*"
Step-by-Step Implementation
1. When a shared key exists, use an attribute merge.
import geopandas as gpd
import pandas as pd
# parcels (geometry + parcel_id) and a non-spatial assessment table sharing parcel_id
parcels = gpd.read_file("parcels.gpkg")
assessments = pd.read_csv("assessments.csv") # columns: parcel_id, assessed_value
parcels_valued = parcels.merge(assessments, on="parcel_id", how="left")
# Geometry untouched; rows matched purely on the key
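A key merge can also check itself. The sketch below uses two pandas options, validate and indicator, on small made-up tables; a plain DataFrame stands in for the parcels layer, and the values are illustrative only.

```python
import pandas as pd

# Made-up stand-ins for the parcels layer and the assessment table.
parcels = pd.DataFrame({"parcel_id": [1, 2, 3]})
assessments = pd.DataFrame(
    {"parcel_id": [1, 2], "assessed_value": [100_000, 250_000]}
)

# validate="many_to_one" raises if parcel_id repeats on the right side;
# indicator=True adds a _merge column recording where each row came from.
checked = parcels.merge(
    assessments,
    on="parcel_id",
    how="left",
    validate="many_to_one",
    indicator=True,
)
unmatched = checked.loc[checked["_merge"] == "left_only", "parcel_id"].tolist()

# The same call fails loudly when the right-hand key is duplicated.
dupes = pd.concat([assessments, assessments.iloc[[0]]])
try:
    parcels.merge(dupes, on="parcel_id", how="left", validate="many_to_one")
    duplicate_key_detected = False
except pd.errors.MergeError:
    duplicate_key_detected = True
```

Here unmatched lists the parcels that have no assessment, and duplicate_key_detected shows that a non-unique key is caught before it can multiply rows.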
2. When the only relationship is location, use sjoin.
import geopandas as gpd
# sensors (points) gain the district they fall within
sensors = gpd.read_file("sensors.gpkg")
districts = gpd.read_file("districts.gpkg")
# Both MUST share a CRS for predicates to evaluate
sensors = sensors.to_crs(districts.crs)
sensors_in_district = gpd.sjoin(
    sensors, districts[["district_name", "geometry"]],
    how="left", predicate="within",
)
3. Combine both when appropriate — spatially assign a region, then merge regional attributes by key.
import pandas as pd
region_stats = pd.read_csv("region_stats.csv") # district_name, avg_income
enriched = sensors_in_district.merge(region_stats, on="district_name", how="left")
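The combined pattern can be tried without any files. The following sketch rebuilds steps 2 and 3 on in-memory data; the sensor ids, district boxes, and income figures are invented for illustration.

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import Point, box

# Invented sensors: two inside a district, one outside all districts.
sensors = gpd.GeoDataFrame(
    {"sensor_id": ["s1", "s2", "s3"]},
    geometry=[Point(0.5, 0.5), Point(1.5, 0.5), Point(5.0, 5.0)],
    crs="EPSG:4326",
)
# Invented districts: two adjacent unit squares.
districts = gpd.GeoDataFrame(
    {"district_name": ["west", "east"]},
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1)],
    crs="EPSG:4326",
)
# Invented non-spatial table keyed by district_name.
region_stats = pd.DataFrame(
    {"district_name": ["west", "east"], "avg_income": [41000, 52000]}
)

# Step 2: location decides the district.
located = gpd.sjoin(sensors, districts, how="left", predicate="within")

# Step 3: the district name is now an ordinary key for an attribute merge.
enriched = located.merge(region_stats, on="district_name", how="left")

print(enriched[["sensor_id", "district_name", "avg_income"]])
```

The sensor outside every district keeps its row with missing values for district_name and avg_income, and the result is still a GeoDataFrame because merge was called on the spatial side.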
Verification
Check that a key-based merge preserved row count and a spatial join matched plausibly.
# Attribute merge: a left join must not change the number of parcels
assert len(parcels_valued) == len(parcels), "Key merge duplicated rows — non-unique key"
print("Parcels without an assessment:", parcels_valued["assessed_value"].isna().sum())
# Spatial join: most sensors should land in some district
matched = sensors_in_district["district_name"].notna().mean()
print(f"Sensors matched to a district: {matched:.1%}") # Sensors matched to a district: 98.7%
assert matched > 0.5, "Few matches — likely a CRS mismatch between inputs"
Edge Cases & Debugging
- sjoin returns all nulls. The inputs are in different CRSs; reproject one to match before joining.
- merge returns no matches. The key dtypes differ (int vs string) or values are formatted differently; normalize the key first.
- Row count explodes after merge. The key isn't unique on the right side; deduplicate or aggregate before merging.
- Points on borders match two polygons in sjoin. Use predicate="within" and a tie-break, or snap borders.
- Used sjoin where a key existed. Slower and introduces border ambiguity; prefer merge when a reliable key is present.
- Lost geometry after merge. Call merge on the GeoDataFrame (left), not on the plain DataFrame, so the geometry column is retained.
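The border case above can be reproduced on a toy layout. This is a sketch under stated assumptions: the point lies exactly on the edge shared by two invented districts, and the tie-break rule (keep the alphabetically first district) is an arbitrary choice made for illustration.

```python
import geopandas as gpd
from shapely.geometry import Point, box

# Two invented districts sharing the edge x = 1.
districts = gpd.GeoDataFrame(
    {"district_name": ["west", "east"]},
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1)],
    crs="EPSG:3857",
)
# One invented sensor sitting exactly on the shared edge.
sensors = gpd.GeoDataFrame(
    {"sensor_id": ["edge"]}, geometry=[Point(1.0, 0.5)], crs="EPSG:3857"
)

# "intersects" counts boundary contact, so the sensor matches both districts.
touching = gpd.sjoin(sensors, districts, how="left", predicate="intersects")

# "within" requires the point to be in a polygon's interior, so it matches neither.
strict = gpd.sjoin(sensors, districts, how="left", predicate="within")

# Arbitrary tie-break: keep one row per sensor, alphabetically first district.
resolved = touching.sort_values("district_name").drop_duplicates(
    subset="sensor_id", keep="first"
)

print(len(touching), strict["district_name"].isna().all(), len(resolved))
```

A practical tie-break would normally use a domain rule, such as the larger district or a fixed priority list, rather than alphabetical order.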