DuckDB Spatial Analytics

DuckDB brings database-grade spatial SQL into the Python process itself — no server, no connection pool, just an in-process engine that reads GeoParquet, Shapefiles, and GeoJSON through its spatial extension. For analytical workloads that are too big for a comfortable GeoDataFrame but don't justify a PostGIS deployment, it is often the fastest path. This guide covers the spatial extension and its place beside PostGIS Integration with Python and Scaling with Dask-GeoPandas within Spatial Analysis & Advanced Query Techniques.

DuckDB in-process spatial query DuckDB reads GeoParquet directly from disk or object storage, runs vectorized spatial SQL in-process, and returns results to GeoPandas without a server. GeoParquet local / S3 / R2 columnar DuckDB (in-process) spatial extension ST_* + R-tree vectorized, columnar predicate pushdown GeoDataFrame result rows no server
DuckDB runs columnar spatial SQL inside your Python process, reading GeoParquet directly — including remotely over HTTP range requests.

Architecture & Data Structures

DuckDB is a columnar, vectorized OLAP engine that runs embedded in your process, like SQLite but built for analytics. The spatial extension adds a GEOMETRY type, the full ST_* function family (backed by GEOS, the same library under Shapely), and readers for common geospatial formats via GDAL. Because it is columnar, it reads only the columns a query touches, and because it understands GeoParquet's row-group statistics, it can skip whole chunks that fall outside a bounding box.

import duckdb

con = duckdb.connect()
con.install_extension("spatial")
con.load_extension("spatial")

# Query a GeoParquet file directly — no load step
con.sql("""
    SELECT count(*) AS n
    FROM 'buildings.parquet'
    WHERE ST_Area(geom) > 500
""").show()

Environment Configuration & Dependency Resolution

pip install "duckdb>=0.10" "geopandas>=0.14"
# The spatial extension is downloaded on first install_extension("spatial")

DuckDB ships as a single wheel with no system dependencies — a major reason to choose it on locked-down or Windows machines where the GDAL stack is painful. The spatial extension is fetched at runtime; in air-gapped environments, pre-download it and point con.execute("SET extension_directory=...") at the cached copy.

Vectorized Operations & Core Workflow

The core pattern reads GeoParquet, filters with a spatial predicate, and hands the result to GeoPandas. The full recipe is in Querying GeoParquet with DuckDB Spatial.

import duckdb
import geopandas as gpd
from shapely import wkb

con = duckdb.connect()
con.load_extension("spatial")

# Spatial filter in SQL; return geometry as WKB for a clean GeoPandas hand-off
df = con.sql("""
    SELECT parcel_id, land_use, ST_AsWKB(geom) AS wkb
    FROM 'parcels.parquet'
    WHERE ST_Intersects(geom, ST_MakeEnvelope(7.6, 45.0, 7.8, 45.1))
""").df()

parcels = gpd.GeoDataFrame(
    df.drop(columns="wkb"),
    geometry=gpd.GeoSeries.from_wkb(df["wkb"]),
    crs="EPSG:4326",
)

Geometry / Data Processing Details

DuckDB's spatial functions mirror PostGIS naming, so SQL ports across with little change. It excels at large aggregations — counting points in polygons, dissolving by attribute, computing per-zone statistics — that would be memory-heavy in pure GeoPandas. For the predicate semantics themselves, see Geometric Intersections & Overlays.

import duckdb

con = duckdb.connect()
con.load_extension("spatial")

# Points-in-polygons aggregation across two GeoParquet files, streamed from disk
con.sql("""
    SELECT z.zone_id, count(*) AS sensor_count
    FROM 'zones.parquet' z
    JOIN 'sensors.parquet' s ON ST_Contains(z.geom, s.geom)
    GROUP BY z.zone_id
    ORDER BY sensor_count DESC
""").show()

CRS Alignment & Projection Pipeline

DuckDB's spatial extension treats coordinates as planar numbers; it carries CRS metadata from GeoParquet but does not automatically reproject between SRIDs in a join. Confirm all inputs share one CRS before spatial predicates, and use ST_Transform (which the extension supports via PROJ) when they don't. Establish the canonical CRS upstream with Coordinate Systems with PyProj.

import duckdb

con = duckdb.connect()
con.load_extension("spatial")

# Reproject 4326 → UTM 32N inside SQL before a metric area calculation
con.sql("""
    SELECT id, ST_Area(ST_Transform(geom, 'EPSG:4326', 'EPSG:32632')) AS area_m2
    FROM 'fields.parquet'
""").show()

Computing ST_Area on raw EPSG:4326 geometry yields square degrees — reproject first, exactly as you would in GeoPandas.

Production Export & Integration

Windows / Platform Edge Cases & Debugging