Cloud-Native Geospatial Formats
Cloud-native formats are designed to be read in place over HTTP, a few bytes at a time, instead of downloaded whole. Cloud Optimized GeoTIFF (COG) does this for rasters; GeoParquet and FlatGeobuf do it for vectors; PMTiles does it for tiles. Together they let a pipeline query terabytes on object storage while pulling only the windows it needs. This guide frames the format family within Geospatial Data Ingestion & Processing Workflows and connects to the engines in DuckDB Spatial Analytics and the delivery layer in Web Mapping & Interactive Visualization.
Architecture & Data Structures
The shared idea is an internal index plus HTTP range requests. A COG arranges a GeoTIFF into tiles with overviews and a header that maps each tile to a byte range, so a client reads one window without fetching the file. GeoParquet stores features columnar, in row groups with per-group bounding-box statistics, so a reader skips groups outside a query window. PMTiles indexes map tiles by z/x/y. In every case the file lives on plain object storage — no database, no tile server — and the client is smart about which bytes to ask for.
import rasterio
from rasterio.windows import Window
# Read a single window from a COG on S3 — only that window transfers
cog_url = "https://example-bucket.s3.amazonaws.com/ortho_cog.tif"
with rasterio.open(cog_url) as src:
    window = Window(col_off=4096, row_off=4096, width=512, height=512)
    patch = src.read(1, window=window)
print(patch.shape)  # (512, 512); the full image was never downloaded
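The index-plus-range-request idea can also be sketched without any geospatial library. This is a minimal illustration, assuming a hypothetical in-memory tile index (TILE_INDEX) with made-up offsets and lengths; in a real COG these values live in the TIFF TileOffsets and TileByteCounts tags, and GDAL issues the requests for you.

```python
from typing import Dict, Tuple

# Hypothetical tile index: (tile_col, tile_row) -> (byte offset, byte length).
# The numbers are invented for illustration only.
TILE_INDEX: Dict[Tuple[int, int], Tuple[int, int]] = {
    (0, 0): (16_384, 40_000),
    (1, 0): (56_384, 38_500),
    (0, 1): (94_884, 41_200),
}


def range_header(tile: Tuple[int, int]) -> Dict[str, str]:
    """Build the HTTP Range header that fetches exactly one tile."""
    offset, length = TILE_INDEX[tile]
    # HTTP byte ranges are inclusive on both ends, hence the -1.
    return {"Range": f"bytes={offset}-{offset + length - 1}"}


print(range_header((1, 0)))  # {'Range': 'bytes=56384-94883'}
```

A client that knows the index needs one small request for the header and one request per tile, which is why the file can sit on plain object storage.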
Environment Configuration & Dependency Resolution
conda install -c conda-forge "rasterio=1.3.*" "geopandas=0.14.*" "pyarrow=15.*" "gdal=3.8.*"
pip install "pmtiles>=3.2"
# Remote reads need GDAL's virtual filesystem (/vsis3/, /vsicurl/), bundled with conda-forge GDAL
For authenticated cloud reads, set the GDAL environment knobs (AWS_*, GDAL_DISABLE_READDIR_ON_OPEN=EMPTY_DIR, CPL_VSIL_CURL_ALLOWED_EXTENSIONS=.tif) so GDAL issues efficient range requests instead of listing buckets. The same rasterio install covered in Raster Data Handling with Rasterio provides COG support out of the box.
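One way to apply those settings from Python is sketched below. The helper name gdal_cloud_env is hypothetical, and the sketch covers only the two non-credential options named above; AWS_* credentials are assumed to come from your usual credential chain.

```python
import os
from typing import Dict


def gdal_cloud_env(allowed_extensions: str = ".tif") -> Dict[str, str]:
    """Return the GDAL settings named above as a plain dict (hypothetical helper)."""
    return {
        # Skip the directory listing GDAL would otherwise attempt on open.
        "GDAL_DISABLE_READDIR_ON_OPEN": "EMPTY_DIR",
        # Only issue HTTP requests for files with these extensions.
        "CPL_VSIL_CURL_ALLOWED_EXTENSIONS": allowed_extensions,
    }


# Apply before the first rasterio/GDAL open call in the process.
os.environ.update(gdal_cloud_env())
print(os.environ["GDAL_DISABLE_READDIR_ON_OPEN"])  # EMPTY_DIR
```

The same dict can likely be scoped to a block instead of the whole process by passing it as keyword arguments to rasterio.Env, which accepts GDAL configuration options.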
Vectorized Operations & Core Workflow
The everyday workflow: store the analytical dataset as GeoParquet, query windows with DuckDB or GeoPandas, and derive COGs/PMTiles as needed. The windowed-raster recipe is in Windowed Reads from Cloud Optimized GeoTIFF.
import geopandas as gpd
# Write GeoParquet — preserves CRS, compresses well, supports row-group skipping
parcels = gpd.read_file("parcels.gpkg").to_crs(epsg=4326)
parcels.to_parquet("parcels.parquet")
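# Read back, then slice to a bounding-box window. Note: this loads the whole file and filters in memory
parcels_all = gpd.read_parquet("parcels.parquet")
subset = parcels_all.cx[7.6:7.8, 45.0:45.1]  # coordinate-based slice: x (lon) range, then y (lat) range
# With geopandas >= 1.0, read_parquet(..., bbox=(7.6, 45.0, 7.8, 45.1)) can push the filter down to the reader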
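Row-group skipping itself is simple to picture. The sketch below is a pure-Python illustration, not any library's API: the RowGroup type and the per-group extents are hypothetical stand-ins for the bounding-box statistics a GeoParquet reader would consult before deciding which groups to fetch.

```python
from typing import List, NamedTuple, Tuple

BBox = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


class RowGroup(NamedTuple):
    index: int
    bbox: BBox  # extent of all features in the group, taken from its statistics


def intersects(a: BBox, b: BBox) -> bool:
    """True when two axis-aligned boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def groups_to_read(groups: List[RowGroup], query: BBox) -> List[int]:
    """Keep only the row groups whose extent overlaps the query window."""
    return [g.index for g in groups if intersects(g.bbox, query)]


# Hypothetical per-group extents for a spatially sorted file.
groups = [
    RowGroup(0, (7.0, 44.5, 7.5, 45.0)),
    RowGroup(1, (7.5, 44.9, 8.0, 45.3)),
    RowGroup(2, (8.0, 45.3, 8.5, 45.8)),
]
print(groups_to_read(groups, (7.6, 45.0, 7.8, 45.1)))  # [1]
```

Skipping only pays off when features are spatially sorted before writing; if every group spans the whole dataset, every group overlaps every query.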
Geometry / Data Processing Details
GeoParquet keeps geometry as WKB with CRS metadata in the file's schema, so a round trip is lossless. Shapefile, by contrast, truncates field names to 10 characters and keeps the CRS in a separate .prj sidecar that is easily lost. This makes GeoParquet the right interchange format between processing stages; the trade-offs against legacy formats are detailed in GeoParquet vs Shapefile for Storage. For very large vector sources like global building or road datasets, streaming ingestion avoids ever holding the whole thing in memory; see Streaming Overture Maps Data with DuckDB.
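To make the WKB encoding concrete, here is a minimal sketch that packs and unpacks a single 2D point with the standard library. The function names are hypothetical, and real pipelines would rely on Shapely or GeoPandas for this; the sketch only shows why the bytes survive a round trip unchanged.

```python
import struct
from typing import Tuple


def point_to_wkb(x: float, y: float) -> bytes:
    """Encode a 2D point as little-endian WKB: byte order, type, x, y."""
    # "<" little-endian, "B" byte-order flag (1), "I" geometry type (1 = Point),
    # "dd" two float64 coordinates.
    return struct.pack("<BIdd", 1, 1, x, y)


def wkb_to_point(blob: bytes) -> Tuple[float, float]:
    """Decode the little-endian point WKB produced above."""
    byte_order, geom_type, x, y = struct.unpack("<BIdd", blob)
    if byte_order != 1 or geom_type != 1:
        raise ValueError("not a little-endian WKB Point")
    return x, y


blob = point_to_wkb(7.68, 45.07)
print(len(blob))           # 21 bytes: 1 + 4 + 8 + 8
print(wkb_to_point(blob))  # (7.68, 45.07)
```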
CRS Alignment & Projection Pipeline
Cloud-native vector formats embed CRS metadata, so the discipline is to tag data correctly at write time and reproject deliberately, using Coordinate Systems with PyProj. COGs store their CRS in the GeoTIFF tags; windowed reads return data in the file's native CRS, so reproject the result, not the whole raster, to stay cheap.
import geopandas as gpd
fields = gpd.read_parquet("fields.parquet")
print(fields.crs) # EPSG:4326 — read from file metadata
# Reproject only the windowed subset you actually need
subset = fields.cx[10.0:10.2, 45.0:45.2].to_crs(epsg=32632)
subset["area_ha"] = subset.geometry.area / 1e4
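The target EPSG:32632 above is the UTM zone covering longitude 10°E in the northern hemisphere. When the window moves, the zone has to move with it. A minimal sketch of that lookup follows; utm_epsg is a hypothetical helper, and it ignores the special zone boundaries around Norway and Svalbard.

```python
import math


def utm_epsg(lon: float, lat: float) -> int:
    """Return the EPSG code of the WGS 84 / UTM zone containing (lon, lat)."""
    zone = int(math.floor((lon + 180.0) / 6.0)) + 1
    zone = min(max(zone, 1), 60)  # clamp lon = 180 into zone 60
    # 326xx codes are northern-hemisphere zones, 327xx southern.
    return (32600 if lat >= 0 else 32700) + zone


print(utm_epsg(10.1, 45.1))  # 32632
```

Passing the centre of the windowed subset to a helper like this keeps the area calculation in metres without hard-coding one zone for every dataset.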
Production Export & Integration
- GeoParquet as the canonical analytical store: columnar, compressed, CRS-aware, queryable by DuckDB in place.
- COG for rasters: one file serves full-resolution windows and pre-built overviews to web clients and pipelines alike.
- PMTiles as the derived rendering artifact for Vector Tile Pipelines with PMTiles.
- Validate cloud-readiness. Use rio cogeo validate to confirm a GeoTIFF is genuinely cloud-optimized (tiled with overviews), not just a renamed TIFF.
Windows / Platform Edge Cases & Debugging
- Remote read downloads the whole file. The GeoTIFF isn't actually a COG (no internal tiling/overviews); re-encode with rio cogeo create.
- Slow S3 reads. Set GDAL_DISABLE_READDIR_ON_OPEN=EMPTY_DIR so GDAL stops listing the bucket on open.
- pyarrow can't read the Parquet. Writer/reader version skew; pin pyarrow across the pipeline.
- CRS missing after a Shapefile round trip. Expected; migrate to GeoParquet to retain CRS in-file.
- /vsicurl/ 403s. Credentials/headers not set; configure the AWS_* or GDAL_HTTP_HEADERS environment variables.
- Row-group skipping doesn't help. The Parquet was written without bbox statistics by an old writer; rewrite with a current GeoParquet writer.