Querying GeoParquet with DuckDB Spatial

This guide runs spatial SQL straight against a GeoParquet file — local or on object storage — and returns a GeoDataFrame, all without a database server or a full in-memory load. It is for analysts whose data has outgrown a comfortable GeoDataFrame but who don't want to stand up PostGIS. It sits under DuckDB Spatial Analytics in Spatial Analysis & Advanced Query Techniques.

Why This Approach / What Goes Wrong

GeoParquet stores geometry as WKB in a columnar layout with per-row-group bounding-box statistics. DuckDB exploits both: it reads only the columns a query references and skips row groups whose bbox can't satisfy a spatial filter — so a windowed query over a multi-gigabyte file touches a fraction of it. The pitfalls are mechanical: forgetting to load the spatial extension, computing metric quantities on EPSG:4326 degrees, and losing the CRS when converting DuckDB output to GeoPandas. Pass geometry across the boundary as WKB and set the CRS explicitly, and the hand-off is clean.

Prerequisites

duckdb>=0.10
geopandas>=0.14
shapely>=2.0 (WKB conversion)
A GeoParquet file (.parquet with a geometry/geom column)

pip install "duckdb>=0.10" "geopandas>=0.14" "shapely>=2.0"

Step-by-Step Implementation

1. Connect and load the spatial extension.

import duckdb

con = duckdb.connect()
con.install_extension("spatial")
con.load_extension("spatial")

2. Inspect the file's schema and geometry column name.

schema = con.sql("DESCRIBE SELECT * FROM 'buildings.parquet'").df()
print(schema[["column_name", "column_type"]].to_string(index=False))
# column_name column_type
#  building_id      BIGINT
#     area_m2      DOUBLE
#        geom    GEOMETRY

3. Run a windowed spatial query, returning geometry as WKB.

query = """
SELECT building_id, area_m2, ST_AsWKB(geom) AS wkb
FROM 'buildings.parquet'
WHERE ST_Intersects(
    geom,
    ST_MakeEnvelope(7.60, 45.00, 7.80, 45.10)   -- lon/lat window, EPSG:4326
)
AND area_m2 > 120
"""
result_df = con.sql(query).df()

4. Rebuild a GeoDataFrame, setting the CRS explicitly.

import geopandas as gpd

buildings = gpd.GeoDataFrame(
    result_df.drop(columns="wkb"),
    geometry=gpd.GeoSeries.from_wkb(result_df["wkb"]),
    crs="EPSG:4326",
)

5. (Optional) Query remote GeoParquet over HTTP range requests.

con.install_extension("httpfs")
con.load_extension("httpfs")
remote_count = con.sql(
    "SELECT count(*) FROM 'https://data.example.com/parcels.parquet' "
    "WHERE ST_Area(geom) > 1000"
).fetchone()[0]
print("Matching parcels:", remote_count)

Verification

Confirm the result decoded to real geometries with the expected CRS and that the spatial filter actually narrowed the set.

print("Rows returned:", len(buildings))            # Rows returned: 4127
print("CRS:", buildings.crs.to_epsg())             # CRS: 4326
print("Geom type:", buildings.geometry.geom_type.unique())  # ['Polygon']
assert buildings.geometry.notnull().all()

# All features fall inside the requested window
minx, miny, maxx, maxy = buildings.total_bounds
assert 7.60 <= minx and maxx <= 7.80, "Filter window not respected"

Edge Cases & Debugging

spatial functions undefined. Load the extension every session: con.load_extension("spatial").
CRS is None after conversion. Pass crs="EPSG:4326" (or the file's real CRS) when building the GeoDataFrame.
Areas in the millions or billions. You computed on degrees; wrap geometry in ST_Transform(geom, 'EPSG:4326', 'EPSG:32632') before ST_Area.
Slow despite filtering. The file lacks row-group bbox stats (older writers); rewrite it with a recent GeoParquet writer to enable skipping.
Remote read fails. Load httpfs and ensure the host serves HTTP range requests with CORS.
Geometry column named differently. It may be geometry, not geom; check DESCRIBE output and adjust the SQL.