DuckDB Spatial Analytics
DuckDB brings database-grade spatial SQL into the Python process itself — no server, no connection pool, just an in-process engine that reads GeoParquet, Shapefiles, and GeoJSON through its spatial extension. For analytical workloads that are too big for a comfortable GeoDataFrame but don't justify a PostGIS deployment, it is often the fastest path. This guide covers the spatial extension and its place beside PostGIS Integration with Python and Scaling with Dask-GeoPandas within Spatial Analysis & Advanced Query Techniques.
Architecture & Data Structures
DuckDB is a columnar, vectorized OLAP engine that runs embedded in your process, like SQLite but built for analytics. The spatial extension adds a GEOMETRY type, the full ST_* function family (backed by GEOS, the same library under Shapely), and readers for common geospatial formats via GDAL. Because it is columnar, it reads only the columns a query touches, and because it understands GeoParquet's row-group statistics, it can skip whole chunks that fall outside a bounding box.
import duckdb
con = duckdb.connect()
con.install_extension("spatial")
con.load_extension("spatial")
# Query a GeoParquet file directly — no load step
con.sql("""
SELECT count(*) AS n
FROM 'buildings.parquet'
WHERE ST_Area(geom) > 500
""").show()
Environment Configuration & Dependency Resolution
pip install "duckdb>=0.10" "geopandas>=0.14"
# The spatial extension is downloaded on first install_extension("spatial")
DuckDB ships as a single wheel with no system dependencies — a major reason to choose it on locked-down or Windows machines where the GDAL stack is painful. The spatial extension is fetched at runtime; in air-gapped environments, pre-download it and point con.execute("SET extension_directory=...") at the cached copy.
Vectorized Operations & Core Workflow
The core pattern reads GeoParquet, filters with a spatial predicate, and hands the result to GeoPandas. The full recipe is in Querying GeoParquet with DuckDB Spatial.
import duckdb
import geopandas as gpd
from shapely import wkb
con = duckdb.connect()
con.load_extension("spatial")
# Spatial filter in SQL; return geometry as WKB for a clean GeoPandas hand-off
df = con.sql("""
SELECT parcel_id, land_use, ST_AsWKB(geom) AS wkb
FROM 'parcels.parquet'
WHERE ST_Intersects(geom, ST_MakeEnvelope(7.6, 45.0, 7.8, 45.1))
""").df()
parcels = gpd.GeoDataFrame(
df.drop(columns="wkb"),
geometry=gpd.GeoSeries.from_wkb(df["wkb"]),
crs="EPSG:4326",
)
Geometry / Data Processing Details
DuckDB's spatial functions mirror PostGIS naming, so SQL ports across with little change. It excels at large aggregations — counting points in polygons, dissolving by attribute, computing per-zone statistics — that would be memory-heavy in pure GeoPandas. For the predicate semantics themselves, see Geometric Intersections & Overlays.
import duckdb
con = duckdb.connect()
con.load_extension("spatial")
# Points-in-polygons aggregation across two GeoParquet files, streamed from disk
con.sql("""
SELECT z.zone_id, count(*) AS sensor_count
FROM 'zones.parquet' z
JOIN 'sensors.parquet' s ON ST_Contains(z.geom, s.geom)
GROUP BY z.zone_id
ORDER BY sensor_count DESC
""").show()
CRS Alignment & Projection Pipeline
DuckDB's spatial extension treats coordinates as planar numbers; it carries CRS metadata from GeoParquet but does not automatically reproject between SRIDs in a join. Confirm all inputs share one CRS before spatial predicates, and use ST_Transform (which the extension supports via PROJ) when they don't. Establish the canonical CRS upstream with Coordinate Systems with PyProj.
import duckdb
con = duckdb.connect()
con.load_extension("spatial")
# Reproject 4326 → UTM 32N inside SQL before a metric area calculation
con.sql("""
SELECT id, ST_Area(ST_Transform(geom, 'EPSG:4326', 'EPSG:32632')) AS area_m2
FROM 'fields.parquet'
""").show()
Computing ST_Area on raw EPSG:4326 geometry yields square degrees — reproject first, exactly as you would in GeoPandas.
Production Export & Integration
- Read remote data in place. With the
httpfsextension, DuckDB queries GeoParquet on S3/R2 over range requests, fitting the Cloud-Native Geospatial Formats model. - Write GeoParquet back out.
COPY (...) TO 'out.parquet' (FORMAT PARQUET)for derived datasets. - Hand off to the web layer. Export filtered results as GeoJSON or feed a tiling pipeline for Web Mapping & Interactive Visualization.
- Choose DuckDB vs PostGIS. DuckDB for embedded, read-mostly analytics on files; PostGIS for shared, transactional, indexed storage.
Windows / Platform Edge Cases & Debugging
Catalog Error: Table Function with name ST_Read does not exist. The spatial extension isn't loaded;con.load_extension("spatial")each session.- Extension download fails offline. Pre-fetch and set
extension_directory. - Areas/distances absurd. Geometry is in degrees;
ST_Transformto a projected CRS first. - Join returns nothing. Inputs are in different CRSs; reproject one side.
- Out-of-memory on huge joins. Set
SET memory_limit='8GB'and let DuckDB spill to disk viaSET temp_directory. - GeoPandas hand-off loses CRS. Set
crs=explicitly when constructing theGeoDataFramefrom WKB.