DuckDB Spatial Analytics

DuckDB brings database-grade spatial SQL into the Python process itself — no server, no connection pool, just an in-process engine that reads GeoParquet, Shapefiles, and GeoJSON through its spatial extension. For analytical workloads that are too big for a comfortable GeoDataFrame but don't justify a PostGIS deployment, it is often the fastest path. This guide covers the spatial extension and its place beside PostGIS Integration with Python and Scaling with Dask-GeoPandas within Spatial Analysis & Advanced Query Techniques.

DuckDB runs columnar spatial SQL inside your Python process, reading GeoParquet directly — including remotely over HTTP range requests.

Architecture & Data Structures

DuckDB is a columnar, vectorized OLAP engine that runs embedded in your process, like SQLite but built for analytics. The spatial extension adds a GEOMETRY type, the full ST_* function family (backed by GEOS, the same library under Shapely), and readers for common geospatial formats via GDAL. Because it is columnar, it reads only the columns a query touches, and because it understands GeoParquet's row-group statistics, it can skip whole chunks that fall outside a bounding box.

import duckdb

con = duckdb.connect()
con.install_extension("spatial")
con.load_extension("spatial")

# Query a GeoParquet file directly — no load step
con.sql("""
    SELECT count(*) AS n
    FROM 'buildings.parquet'
    WHERE ST_Area(geom) > 500
""").show()

Environment Configuration & Dependency Resolution

pip install "duckdb>=0.10" "geopandas>=0.14"
# The spatial extension is downloaded on first install_extension("spatial")

DuckDB ships as a single wheel with no system dependencies — a major reason to choose it on locked-down or Windows machines where the GDAL stack is painful. The spatial extension is fetched at runtime; in air-gapped environments, pre-download it and point con.execute("SET extension_directory=...") at the cached copy.

Vectorized Operations & Core Workflow

The core pattern reads GeoParquet, filters with a spatial predicate, and hands the result to GeoPandas. The full recipe is in Querying GeoParquet with DuckDB Spatial.

import duckdb
import geopandas as gpd
from shapely import wkb

con = duckdb.connect()
con.load_extension("spatial")

# Spatial filter in SQL; return geometry as WKB for a clean GeoPandas hand-off
df = con.sql("""
    SELECT parcel_id, land_use, ST_AsWKB(geom) AS wkb
    FROM 'parcels.parquet'
    WHERE ST_Intersects(geom, ST_MakeEnvelope(7.6, 45.0, 7.8, 45.1))
""").df()

parcels = gpd.GeoDataFrame(
    df.drop(columns="wkb"),
    geometry=gpd.GeoSeries.from_wkb(df["wkb"]),
    crs="EPSG:4326",
)

Geometry / Data Processing Details

DuckDB's spatial functions mirror PostGIS naming, so SQL ports across with little change. It excels at large aggregations — counting points in polygons, dissolving by attribute, computing per-zone statistics — that would be memory-heavy in pure GeoPandas. For the predicate semantics themselves, see Geometric Intersections & Overlays.

import duckdb

con = duckdb.connect()
con.load_extension("spatial")

# Points-in-polygons aggregation across two GeoParquet files, streamed from disk
con.sql("""
    SELECT z.zone_id, count(*) AS sensor_count
    FROM 'zones.parquet' z
    JOIN 'sensors.parquet' s ON ST_Contains(z.geom, s.geom)
    GROUP BY z.zone_id
    ORDER BY sensor_count DESC
""").show()

CRS Alignment & Projection Pipeline

DuckDB's spatial extension treats coordinates as planar numbers; it carries CRS metadata from GeoParquet but does not automatically reproject between SRIDs in a join. Confirm all inputs share one CRS before spatial predicates, and use ST_Transform (which the extension supports via PROJ) when they don't. Establish the canonical CRS upstream with Coordinate Systems with PyProj.

import duckdb

con = duckdb.connect()
con.load_extension("spatial")

# Reproject 4326 → UTM 32N inside SQL before a metric area calculation
con.sql("""
    SELECT id, ST_Area(ST_Transform(geom, 'EPSG:4326', 'EPSG:32632')) AS area_m2
    FROM 'fields.parquet'
""").show()

Computing ST_Area on raw EPSG:4326 geometry yields square degrees — reproject first, exactly as you would in GeoPandas.

Production Export & Integration

Read remote data in place. With the httpfs extension, DuckDB queries GeoParquet on S3/R2 over range requests, fitting the Cloud-Native Geospatial Formats model.
Write GeoParquet back out. COPY (...) TO 'out.parquet' (FORMAT PARQUET) for derived datasets.
Hand off to the web layer. Export filtered results as GeoJSON or feed a tiling pipeline for Web Mapping & Interactive Visualization.
Choose DuckDB vs PostGIS. DuckDB for embedded, read-mostly analytics on files; PostGIS for shared, transactional, indexed storage.

Windows / Platform Edge Cases & Debugging

Catalog Error: Table Function with name ST_Read does not exist. The spatial extension isn't loaded; con.load_extension("spatial") each session.
Extension download fails offline. Pre-fetch and set extension_directory.
Areas/distances absurd. Geometry is in degrees; ST_Transform to a projected CRS first.
Join returns nothing. Inputs are in different CRSs; reproject one side.
Out-of-memory on huge joins. Set SET memory_limit='8GB' and let DuckDB spill to disk via SET temp_directory.
GeoPandas hand-off loses CRS. Set crs= explicitly when constructing the GeoDataFrame from WKB.