Querying GeoParquet with DuckDB Spatial

This guide runs spatial SQL straight against a GeoParquet file — local or on object storage — and returns a GeoDataFrame, all without a database server or a full in-memory load. It is for analysts whose data has outgrown a comfortable GeoDataFrame but who don't want to stand up PostGIS. It sits under DuckDB Spatial Analytics in Spatial Analysis & Advanced Query Techniques.

Why This Approach / What Goes Wrong

GeoParquet stores geometry as WKB in a columnar layout with per-row-group bounding-box statistics. DuckDB exploits both: it reads only the columns a query references and skips row groups whose bbox can't satisfy a spatial filter — so a windowed query over a multi-gigabyte file touches a fraction of it. The pitfalls are mechanical: forgetting to load the spatial extension, computing metric quantities on EPSG:4326 degrees, and losing the CRS when converting DuckDB output to GeoPandas. Pass geometry across the boundary as WKB and set the CRS explicitly, and the hand-off is clean.

Prerequisites

pip install "duckdb>=0.10" "geopandas>=0.14" "shapely>=2.0"

Step-by-Step Implementation

1. Connect and load the spatial extension.

import duckdb

con = duckdb.connect()
con.install_extension("spatial")
con.load_extension("spatial")

2. Inspect the file's schema and geometry column name.

schema = con.sql("DESCRIBE SELECT * FROM 'buildings.parquet'").df()
print(schema[["column_name", "column_type"]].to_string(index=False))
# column_name column_type
#  building_id      BIGINT
#     area_m2      DOUBLE
#        geom    GEOMETRY

3. Run a windowed spatial query, returning geometry as WKB.

query = """
SELECT building_id, area_m2, ST_AsWKB(geom) AS wkb
FROM 'buildings.parquet'
WHERE ST_Intersects(
    geom,
    ST_MakeEnvelope(7.60, 45.00, 7.80, 45.10)   -- lon/lat window, EPSG:4326
)
AND area_m2 > 120
"""
result_df = con.sql(query).df()

4. Rebuild a GeoDataFrame, setting the CRS explicitly.

import geopandas as gpd

buildings = gpd.GeoDataFrame(
    result_df.drop(columns="wkb"),
    geometry=gpd.GeoSeries.from_wkb(result_df["wkb"]),
    crs="EPSG:4326",
)

5. (Optional) Query remote GeoParquet over HTTP range requests.

con.install_extension("httpfs")
con.load_extension("httpfs")
remote_count = con.sql(
    "SELECT count(*) FROM 'https://data.example.com/parcels.parquet' "
    "WHERE ST_Area(geom) > 1000"
).fetchone()[0]
print("Matching parcels:", remote_count)

Verification

Confirm the result decoded to real geometries with the expected CRS and that the spatial filter actually narrowed the set.

print("Rows returned:", len(buildings))            # Rows returned: 4127
print("CRS:", buildings.crs.to_epsg())             # CRS: 4326
print("Geom type:", buildings.geometry.geom_type.unique())  # ['Polygon']
assert buildings.geometry.notnull().all()

# All features fall inside the requested window
minx, miny, maxx, maxy = buildings.total_bounds
assert 7.60 <= minx and maxx <= 7.80, "Filter window not respected"

Edge Cases & Debugging