Streaming Overture Maps Data with DuckDB
Overture Maps publishes global buildings, places, and transportation as GeoParquet on public cloud storage, in datasets far too large to download whole. This guide uses DuckDB to stream just the rows in a bounding box straight from the remote files into a GeoDataFrame. It is for anyone who needs authoritative global features for one city without ingesting the planet.
Why This Approach / What Goes Wrong
The Overture building layer is hundreds of gigabytes of GeoParquet. Downloading it to filter locally is absurd; the cloud-native move is to push the filter to the data. DuckDB with the httpfs and spatial extensions reads the remote Parquet over HTTP range requests and, thanks to row-group bounding-box statistics, skips the vast majority of the file — so a city-sized query transfers a small fraction. The failure modes are a forgotten extension, a bounding-box filter on the wrong column (Overture stores a bbox struct you should filter on for pushdown), and trying to materialize a continent-scale result because the spatial predicate was too loose.
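To make the row-group skipping concrete, here is a minimal pure-Python sketch of the idea. It is not DuckDB's implementation: the `RowGroupStats` class, the `can_skip` function, and the sample extents are hypothetical, and the sketch simplifies the per-column min/max statistics into one overall extent per row group.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RowGroupStats:
    """Hypothetical summary of one row group's bbox statistics."""
    xmin_lo: float  # smallest bbox.xmin stored in the row group
    xmax_hi: float  # largest bbox.xmax stored in the row group
    ymin_lo: float  # smallest bbox.ymin stored in the row group
    ymax_hi: float  # largest bbox.ymax stored in the row group


def can_skip(stats, xmin, ymin, xmax, ymax):
    """True when the row group's extent is disjoint from the AOI,
    so no row inside it can match and its bytes never need fetching."""
    return (
        stats.xmax_hi < xmin or stats.xmin_lo > xmax
        or stats.ymax_hi < ymin or stats.ymin_lo > ymax
    )


# Illustrative extents only, not real Overture metadata.
row_groups = [
    RowGroupStats(13.30, 13.50, 52.45, 52.60),    # covers Berlin: must be read
    RowGroupStats(2.20, 2.45, 48.80, 48.95),      # around Paris: skipped
    RowGroupStats(-74.10, -73.90, 40.60, 40.85),  # around New York: skipped
]
aoi = (13.36, 52.50, 13.43, 52.54)
to_read = [rg for rg in row_groups if not can_skip(rg, *aoi)]
print(f"Reading {len(to_read)} of {len(row_groups)} row groups")
```

A filter on the geometry column gives the reader no such statistics to compare against, which is why it ends up scanning everything.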
Prerequisites
- duckdb>=0.10, geopandas>=0.14, shapely>=2.0
- Network access to the Overture S3/Azure buckets
pip install "duckdb>=0.10" "geopandas>=0.14" "shapely>=2.0"
Step-by-Step Implementation
1. Connect and load the required extensions.
import duckdb

con = duckdb.connect()
for ext in ("httpfs", "spatial"):
    con.install_extension(ext)  # downloads the extension on first use
    con.load_extension(ext)     # must be repeated in every new session
con.execute("SET s3_region='us-west-2'")  # Overture's bucket region
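2. Filter on the bbox struct so DuckDB can skip row groups. Requiring each feature's bbox to lie inside the area of interest also guarantees that every returned geometry lies inside it, so this containment query needs no further geometry predicate. Use a tight area of interest.
# Berlin-Mitte bounding box in EPSG:4326 (lon/lat)
xmin, ymin, xmax, ymax = 13.36, 52.50, 13.43, 52.54
release = "s3://overturemaps-us-west-2/release/2024-09-18.0/theme=buildings/type=building/*"
# Assumes the geometry column is read as GEOMETRY (a GeoParquet-aware DuckDB);
# if it arrives as a raw WKB BLOB, select the column directly instead of ST_AsWKB(geometry).
query = f"""
SELECT id, names.primary AS name, height, ST_AsWKB(geometry) AS wkb
FROM read_parquet('{release}', filename=true, hive_partitioning=1)
WHERE bbox.xmin >= {xmin} AND bbox.xmax <= {xmax}
  AND bbox.ymin >= {ymin} AND bbox.ymax <= {ymax}
"""
df = con.sql(query).df()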
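A bbox filter can mean "entirely inside the area of interest" or "overlapping it at all", and the two return different rows. The helper below is a minimal sketch of both variants. `bbox_where` and its `mode` argument are hypothetical names, and it assumes the bbox struct exposes `xmin`, `xmax`, `ymin`, and `ymax` fields.

```python
def bbox_where(xmin, ymin, xmax, ymax, mode="within"):
    """Build a WHERE-clause fragment on the bbox struct (hypothetical helper).

    mode="within":   keep features whose bbox lies entirely inside the AOI.
    mode="overlaps": keep features whose bbox intersects the AOI at all.
    """
    # Coerce to float so only numbers are ever interpolated into SQL.
    xmin, ymin, xmax, ymax = (float(v) for v in (xmin, ymin, xmax, ymax))
    if not (xmin < xmax and ymin < ymax):
        raise ValueError("expected xmin < xmax and ymin < ymax")
    if mode == "within":
        parts = [
            f"bbox.xmin >= {xmin}", f"bbox.xmax <= {xmax}",
            f"bbox.ymin >= {ymin}", f"bbox.ymax <= {ymax}",
        ]
    elif mode == "overlaps":
        parts = [
            f"bbox.xmin <= {xmax}", f"bbox.xmax >= {xmin}",
            f"bbox.ymin <= {ymax}", f"bbox.ymax >= {ymin}",
        ]
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return " AND ".join(parts)


within_clause = bbox_where(13.36, 52.50, 13.43, 52.54)
overlaps_clause = bbox_where(13.36, 52.50, 13.43, 52.54, mode="overlaps")
print(within_clause)
```

With `mode="overlaps"`, features that straddle the AOI edge are returned too, so the extent of the result can exceed the AOI and any bounds check on it needs to allow for that.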
3. Rebuild a GeoDataFrame with an explicit CRS.
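import geopandas as gpd

# DuckDB may hand BLOBs to pandas as bytearray; coerce to bytes before decoding.
wkb_bytes = df["wkb"].apply(bytes)
buildings = gpd.GeoDataFrame(
    df.drop(columns="wkb"),
    geometry=gpd.GeoSeries.from_wkb(wkb_bytes),
    crs="EPSG:4326",  # Overture geometry is WGS84
)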
4. Persist locally as GeoParquet for repeated analysis.
buildings.to_parquet("berlin_mitte_buildings.parquet")
Verification
Confirm only the AOI came back and the geometry decoded correctly.
print("Buildings fetched:", len(buildings))  # e.g. "Buildings fetched: 9241"; varies by release
assert buildings.crs.to_epsg() == 4326
minx, miny, maxx, maxy = buildings.total_bounds
# Check both axes against the AOI defined in step 2.
assert xmin <= minx and maxx <= xmax, "Rows leaked outside the AOI in x: check bbox filter"
assert ymin <= miny and maxy <= ymax, "Rows leaked outside the AOI in y: check bbox filter"
assert buildings.geometry.is_valid.mean() > 0.99
print(f"Extent: {minx:.3f},{miny:.3f} → {maxx:.3f},{maxy:.3f}")
If the row count is in the millions, the bbox filter didn't apply — verify the column path and that you queried the bbox struct, not the geometry.
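One cheap guard against an accidentally huge result is to check the size of the area of interest before running the query. The sketch below is an assumption-laden illustration: `aoi_area_km2` and `check_aoi` are hypothetical helpers, the 111.32 km per degree figure is a rough spherical approximation, and the 500 km² limit is an arbitrary placeholder to tune for your use case.

```python
import math

KM_PER_DEGREE = 111.32  # rough length of one degree of latitude (approximation)


def aoi_area_km2(xmin, ymin, xmax, ymax):
    """Approximate the area of a lon/lat bounding box in square kilometres."""
    mid_lat = math.radians((ymin + ymax) / 2)
    width_km = (xmax - xmin) * KM_PER_DEGREE * math.cos(mid_lat)
    height_km = (ymax - ymin) * KM_PER_DEGREE
    return width_km * height_km


def check_aoi(xmin, ymin, xmax, ymax, max_km2=500.0):
    """Raise before querying if the AOI is larger than max_km2 (placeholder limit)."""
    area = aoi_area_km2(xmin, ymin, xmax, ymax)
    if area > max_km2:
        raise ValueError(f"AOI is about {area:,.0f} km², above the {max_km2:,.0f} km² limit")
    return area


berlin_area = check_aoi(13.36, 52.50, 13.43, 52.54)
print(f"AOI is roughly {berlin_area:.0f} km²")
```

A swapped coordinate order or a continent-sized box fails here in microseconds instead of after a long remote scan.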
Edge Cases & Debugging
- Query scans the whole dataset. You filtered on geometry instead of the bbox struct, defeating row-group skipping; filter bbox.xmin/ymin first.
- httpfs not loaded. Remote reads fail; load it every session.
- Slow or throttled reads. Set the correct s3_region; cross-region reads are slow and may cost egress.
- Schema changed between releases. Overture evolves its schema; pin a specific release date and check column paths.
- names.primary missing. Some themes nest names differently; inspect with DESCRIBE SELECT * FROM read_parquet(...) LIMIT 1.
- CRS lost on hand-off. Set crs="EPSG:4326" explicitly when constructing the GeoDataFrame.
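Pinning a release is easier to enforce when the path is built in one place. The following is a small sketch under stated assumptions: `overture_path`, `PINNED_RELEASE`, and the release-name pattern are hypothetical, while the bucket and path layout are copied from the query in step 2.

```python
import re

OVERTURE_BUCKET = "s3://overturemaps-us-west-2"  # bucket used in step 2
PINNED_RELEASE = "2024-09-18.0"                  # release used in step 2

# Assumed naming pattern: a date, an optional stage suffix, and a revision number.
_RELEASE_RE = re.compile(r"\d{4}-\d{2}-\d{2}(-[a-z]+)?\.\d+")


def overture_path(release, theme, feature_type):
    """Build the glob for one theme/type of a pinned release (hypothetical helper)."""
    if not _RELEASE_RE.fullmatch(release):
        raise ValueError(f"unexpected release name: {release!r}")
    return f"{OVERTURE_BUCKET}/release/{release}/theme={theme}/type={feature_type}/*"


path = overture_path(PINNED_RELEASE, "buildings", "building")
# Schema inspection query to run with con.sql(...) before relying on column paths.
describe_sql = f"DESCRIBE SELECT * FROM read_parquet('{path}')"
print(describe_sql)
```

Bumping `PINNED_RELEASE` then becomes a deliberate, reviewable change, and the DESCRIBE query shows whether columns such as names.primary still exist at the new release.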