Streaming Overture Maps Data with DuckDB

Overture Maps publishes global buildings, places, and transportation as GeoParquet on public cloud storage — datasets far too large to download whole. This guide uses DuckDB to stream just the rows in a bounding box straight from the remote files into a GeoDataFrame. It is for anyone who needs authoritative global features for one city without ingesting the planet. It sits under Cloud-Native Geospatial Formats in Geospatial Data Ingestion & Processing Workflows.

Why This Approach / What Goes Wrong

The Overture building layer is hundreds of gigabytes of GeoParquet. Downloading it to filter locally is wasteful; the cloud-native move is to push the filter to the data. DuckDB with the httpfs and spatial extensions reads the remote Parquet over HTTP range requests and uses the per-row-group min/max statistics on the bbox columns to skip most of the files, so a city-sized query transfers only a small fraction of the data. Three things commonly go wrong: an extension was never loaded, the bounding-box filter targets the wrong column (Overture stores a bbox struct, and filtering on it is what enables pushdown), or the spatial predicate is so loose that the query tries to materialize a continent-scale result.
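
To make the row-group skipping concrete, here is a minimal pure-Python sketch of the idea. It is not DuckDB's implementation: the RowGroupStats class, the sample statistics, and the function name are all illustrative assumptions. Each hypothetical row group carries the minimum and maximum of one column (think bbox.xmin), and a reader can discard any group whose statistics prove that no row falls inside the requested range.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RowGroupStats:
    """Hypothetical min/max statistics for one column of one row group."""
    name: str
    col_min: float  # smallest value of the column in this group
    col_max: float  # largest value of the column in this group


def may_contain_matches(rg: RowGroupStats, lo: float, hi: float) -> bool:
    """Return False only when the stats prove no row has lo <= value <= hi."""
    if rg.col_max < lo:   # every value in the group is below the range
        return False
    if rg.col_min > hi:   # every value in the group is above the range
        return False
    return True           # the group might hold matches, so it must be read


# Made-up statistics for bbox.xmin in three row groups.
row_groups = [
    RowGroupStats("paris", 2.20, 2.45),
    RowGroupStats("berlin", 13.10, 13.75),
    RowGroupStats("tokyo", 139.50, 139.90),
]

# Longitude range of the Berlin-Mitte box used in this guide.
to_read = [rg.name for rg in row_groups if may_contain_matches(rg, 13.36, 13.43)]
print(to_read)  # ['berlin']
```

Only the group whose value range overlaps the query range is fetched; the other two are skipped without reading their rows. This is why the filter has to reference plain numeric columns that carry such statistics.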

Prerequisites

duckdb>=0.10
geopandas>=0.14, shapely>=2.0
pyarrow (required by GeoDataFrame.to_parquet in step 4)
Network access to the Overture S3/Azure buckets

pip install "duckdb>=0.10" "geopandas>=0.14" "shapely>=2.0" pyarrow

Step-by-Step Implementation

1. Connect and load the required extensions.

import duckdb

con = duckdb.connect()
for ext in ("httpfs", "spatial"):
    con.install_extension(ext)
    con.load_extension(ext)
con.execute("SET s3_region='us-west-2'")   # Overture's bucket region

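2. Filter on the bbox struct so DuckDB can skip row groups. For a rectangular area of interest the bbox comparison alone is enough; for an irregular one, follow it with a precise predicate such as ST_Intersects on the geometry. Keep the area of interest tight.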

# Berlin-Mitte bounding box in EPSG:4326 (lon/lat)
xmin, ymin, xmax, ymax = 13.36, 52.50, 13.43, 52.54
release = "s3://overturemaps-us-west-2/release/2024-09-18.0/theme=buildings/type=building/*"

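# Keep features whose bbox lies entirely inside the AOI. Comparing all four
# bbox fields lets DuckDB prune row groups and keeps every result within the box.
# To also keep features that straddle the edge, use the overlap form instead:
# bbox.xmin <= xmax AND bbox.xmax >= xmin (and likewise for y).
query = f"""
SELECT id, names.primary AS name, height, ST_AsWKB(geometry) AS wkb
FROM read_parquet('{release}', filename=true, hive_partitioning=1)
WHERE bbox.xmin >= {xmin} AND bbox.xmax <= {xmax}
  AND bbox.ymin >= {ymin} AND bbox.ymax <= {ymax}
"""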
df = con.sql(query).df()
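
The bbox comparison can be generated by a small helper so that the two common semantics stay explicit. The sketch below is a hypothetical helper, not part of DuckDB or Overture: the function name and the mode values are assumptions chosen for illustration. It only builds a string; the bbox.xmin, bbox.xmax, bbox.ymin, and bbox.ymax field names are the ones used in this guide.

```python
def bbox_predicate(xmin, ymin, xmax, ymax, mode="within"):
    """Build a WHERE fragment on Overture's bbox struct.

    mode="within":   keep features whose bbox lies entirely inside the AOI.
    mode="overlaps": keep features whose bbox intersects the AOI.
    """
    # Coerce to float so that only numbers are ever interpolated into SQL.
    xmin, ymin, xmax, ymax = (float(v) for v in (xmin, ymin, xmax, ymax))
    if not (xmin < xmax and ymin < ymax):
        raise ValueError("expected xmin < xmax and ymin < ymax")

    if mode == "within":
        parts = [
            f"bbox.xmin >= {xmin}", f"bbox.xmax <= {xmax}",
            f"bbox.ymin >= {ymin}", f"bbox.ymax <= {ymax}",
        ]
    elif mode == "overlaps":
        parts = [
            f"bbox.xmin <= {xmax}", f"bbox.xmax >= {xmin}",
            f"bbox.ymin <= {ymax}", f"bbox.ymax >= {ymin}",
        ]
    else:
        raise ValueError("mode must be 'within' or 'overlaps'")
    return " AND ".join(parts)


print(bbox_predicate(13.36, 52.50, 13.43, 52.54))
# bbox.xmin >= 13.36 AND bbox.xmax <= 13.43 AND bbox.ymin >= 52.5 AND bbox.ymax <= 52.54
```

The "within" form guarantees that nothing extends past the box, at the cost of dropping features that cross its edge. The "overlaps" form keeps those features, so the result's extent can exceed the AOI slightly.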

3. Rebuild a GeoDataFrame with an explicit CRS.

import geopandas as gpd

buildings = gpd.GeoDataFrame(
    df.drop(columns="wkb"),
    # DuckDB may hand BLOBs back as bytearray; convert to bytes for the WKB reader
    geometry=gpd.GeoSeries.from_wkb(df["wkb"].apply(bytes)),
    crs="EPSG:4326",     # Overture geometry is WGS84
)

4. Persist locally as GeoParquet for repeated analysis.

buildings.to_parquet("berlin_mitte_buildings.parquet")

Verification

Confirm only the AOI came back and the geometry decoded correctly.

print("Buildings fetched:", len(buildings))     # e.g. Buildings fetched: 9241 (varies by release)
assert buildings.crs.to_epsg() == 4326

minx, miny, maxx, maxy = buildings.total_bounds
assert 13.36 <= minx and maxx <= 13.43, "Rows leaked outside the AOI — check bbox filter"
assert buildings.geometry.is_valid.mean() > 0.99
print(f"Extent: {minx:.3f},{miny:.3f} → {maxx:.3f},{maxy:.3f}")

If the row count is in the millions, the bbox filter didn't apply — verify the column path and that you queried the bbox struct, not the geometry.
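
The assertions above check only longitude. A small helper can report every side on which the result extends past the AOI, with an optional tolerance in degrees. This is a sketch under assumptions: the function name, the side labels, and the tol parameter are hypothetical, and the tuples are in the same (minx, miny, maxx, maxy) order that total_bounds returns.

```python
def leaked_sides(bounds, aoi, tol=0.0):
    """Return the sides on which `bounds` exceeds `aoi` by more than `tol`.

    Both arguments are (minx, miny, maxx, maxy) tuples in degrees.
    An empty list means the result stays inside the AOI.
    """
    minx, miny, maxx, maxy = bounds
    a_minx, a_miny, a_maxx, a_maxy = aoi
    sides = []
    if minx < a_minx - tol:
        sides.append("west")
    if miny < a_miny - tol:
        sides.append("south")
    if maxx > a_maxx + tol:
        sides.append("east")
    if maxy > a_maxy + tol:
        sides.append("north")
    return sides


aoi = (13.36, 52.50, 13.43, 52.54)
print(leaked_sides((13.361, 52.501, 13.429, 52.539), aoi))  # []
print(leaked_sides((13.30, 52.501, 13.44, 52.539), aoi))    # ['west', 'east']
```

With the GeoDataFrame from step 3 this would be called as leaked_sides(buildings.total_bounds, (xmin, ymin, xmax, ymax)). A small tol is useful when the filter keeps features that straddle the AOI edge.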

Edge Cases & Debugging

Query scans the whole dataset. You filtered on geometry instead of the bbox struct, defeating row-group skipping; filter on the bbox fields (bbox.xmin, bbox.xmax, bbox.ymin, bbox.ymax) first.
httpfs not loaded. Remote reads fail; load it every session.
Slow or throttled reads. Set the correct s3_region; cross-region reads are slow and may cost egress.
Schema changed between releases. Overture evolves its schema; pin a specific release directory (for example release/2024-09-18.0) and re-check column paths when you move to a newer one.
names.primary missing. Some themes nest names differently; inspect with DESCRIBE SELECT * FROM read_parquet(...) LIMIT 1.
CRS lost on hand-off. Set crs="EPSG:4326" explicitly when constructing the GeoDataFrame.
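
One way to keep the release pinned is to build the path in a single place. The helper below is a hypothetical sketch: the function name is invented, the bucket and path layout are copied from the path used in step 2, and the release pattern (a date followed by a numeric suffix) is an assumption inferred from that one example.

```python
import re

# Bucket used in this guide; other mirrors are not covered by this sketch.
BUCKET = "s3://overturemaps-us-west-2"

# Assumed release format, inferred from "2024-09-18.0": YYYY-MM-DD.N
RELEASE_RE = re.compile(r"\d{4}-\d{2}-\d{2}\.\d+")


def overture_path(release, theme, feature_type):
    """Return the glob path for one theme/type of a pinned release."""
    if not RELEASE_RE.fullmatch(release):
        raise ValueError(f"unexpected release string: {release!r}")
    return f"{BUCKET}/release/{release}/theme={theme}/type={feature_type}/*"


print(overture_path("2024-09-18.0", "buildings", "building"))
# s3://overturemaps-us-west-2/release/2024-09-18.0/theme=buildings/type=building/*
```

Rejecting anything that does not look like a dated release makes an accidental "latest"-style value fail fast instead of silently querying a different schema.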