Querying GeoParquet with DuckDB Spatial
This guide runs spatial SQL straight against a GeoParquet file — local or on object storage — and returns a GeoDataFrame, all without a database server or a full in-memory load. It is for analysts whose data has outgrown a comfortable GeoDataFrame but who don't want to stand up PostGIS. It sits under DuckDB Spatial Analytics in Spatial Analysis & Advanced Query Techniques.
Why This Approach / What Goes Wrong
GeoParquet stores geometry as WKB in a columnar layout with per-row-group bounding-box statistics. DuckDB exploits both: it reads only the columns a query references and skips row groups whose bbox can't satisfy a spatial filter — so a windowed query over a multi-gigabyte file touches a fraction of it. The pitfalls are mechanical: forgetting to load the spatial extension, computing metric quantities on EPSG:4326 degrees, and losing the CRS when converting DuckDB output to GeoPandas. Pass geometry across the boundary as WKB and set the CRS explicitly, and the hand-off is clean.
Prerequisites
duckdb>=0.10geopandas>=0.14shapely>=2.0(WKB conversion)- A GeoParquet file (
.parquetwith ageometry/geomcolumn)
pip install "duckdb>=0.10" "geopandas>=0.14" "shapely>=2.0"
Step-by-Step Implementation
1. Connect and load the spatial extension.
import duckdb
con = duckdb.connect()
con.install_extension("spatial")
con.load_extension("spatial")
2. Inspect the file's schema and geometry column name.
schema = con.sql("DESCRIBE SELECT * FROM 'buildings.parquet'").df()
print(schema[["column_name", "column_type"]].to_string(index=False))
# column_name column_type
# building_id BIGINT
# area_m2 DOUBLE
# geom GEOMETRY
3. Run a windowed spatial query, returning geometry as WKB.
query = """
SELECT building_id, area_m2, ST_AsWKB(geom) AS wkb
FROM 'buildings.parquet'
WHERE ST_Intersects(
geom,
ST_MakeEnvelope(7.60, 45.00, 7.80, 45.10) -- lon/lat window, EPSG:4326
)
AND area_m2 > 120
"""
result_df = con.sql(query).df()
4. Rebuild a GeoDataFrame, setting the CRS explicitly.
import geopandas as gpd
buildings = gpd.GeoDataFrame(
result_df.drop(columns="wkb"),
geometry=gpd.GeoSeries.from_wkb(result_df["wkb"]),
crs="EPSG:4326",
)
5. (Optional) Query remote GeoParquet over HTTP range requests.
con.install_extension("httpfs")
con.load_extension("httpfs")
remote_count = con.sql(
"SELECT count(*) FROM 'https://data.example.com/parcels.parquet' "
"WHERE ST_Area(geom) > 1000"
).fetchone()[0]
print("Matching parcels:", remote_count)
Verification
Confirm the result decoded to real geometries with the expected CRS and that the spatial filter actually narrowed the set.
print("Rows returned:", len(buildings)) # Rows returned: 4127
print("CRS:", buildings.crs.to_epsg()) # CRS: 4326
print("Geom type:", buildings.geometry.geom_type.unique()) # ['Polygon']
assert buildings.geometry.notnull().all()
# All features fall inside the requested window
minx, miny, maxx, maxy = buildings.total_bounds
assert 7.60 <= minx and maxx <= 7.80, "Filter window not respected"
Edge Cases & Debugging
spatialfunctions undefined. Load the extension every session:con.load_extension("spatial").- CRS is
Noneafter conversion. Passcrs="EPSG:4326"(or the file's real CRS) when building theGeoDataFrame. - Areas in the millions or billions. You computed on degrees; wrap geometry in
ST_Transform(geom, 'EPSG:4326', 'EPSG:32632')beforeST_Area. - Slow despite filtering. The file lacks row-group bbox stats (older writers); rewrite it with a recent GeoParquet writer to enable skipping.
- Remote read fails. Load
httpfsand ensure the host serves HTTP range requests with CORS. - Geometry column named differently. It may be
geometry, notgeom; checkDESCRIBEoutput and adjust the SQL.