Cloud-Native Geospatial Formats

Cloud-native formats are designed to be read in place over HTTP, a few bytes at a time, instead of downloaded whole. Cloud Optimized GeoTIFF (COG) does this for rasters; GeoParquet and FlatGeobuf do it for vectors; PMTiles does it for tiles. Together they let a pipeline query terabytes on object storage while pulling only the windows it needs. This guide frames the format family within Geospatial Data Ingestion & Processing Workflows and connects to the engines in DuckDB Spatial Analytics and the delivery layer in Web Mapping & Interactive Visualization.

Each cloud-native format carries an internal index so a client can range-request exactly the window, row group, or tile it needs.

Architecture & Data Structures

The shared idea is an internal index plus HTTP range requests. A COG arranges a GeoTIFF into internal tiles with overviews, and its header records the byte offset and length of every tile, so a client reads one window without fetching the whole file. GeoParquet stores features in columnar row groups with per-group bounding-box statistics, so a reader skips groups that fall outside a query window. PMTiles packs a tile pyramid into one archive whose directory maps each z/x/y tile to a byte range. In every case the file lives on plain object storage, with no database and no tile server, and the client decides which bytes to request.

import rasterio
from rasterio.windows import Window

# Read a single window from a COG over HTTPS; only the header and the tiles
# that intersect the window are fetched
cog_url = "https://example-bucket.s3.amazonaws.com/ortho_cog.tif"
with rasterio.open(cog_url) as src:
    window = Window(col_off=4096, row_off=4096, width=512, height=512)
    patch = src.read(1, window=window)   # band 1 only
    print(patch.shape)   # (512, 512); the full image was never downloaded
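
The mechanics behind that windowed read can be sketched without any geospatial library. This is a toy model, not the TIFF byte layout: the hypothetical tile_index dict stands in for the TileOffsets and TileByteCounts tables in a COG header, the offsets are made up, and range_header builds the HTTP Range value a client would send for one tile.

```python
from typing import Dict, Tuple

# Hypothetical index: (tile_row, tile_col) -> (byte offset, byte length).
# A real COG stores this as TileOffsets / TileByteCounts tags in its header.
tile_index: Dict[Tuple[int, int], Tuple[int, int]] = {
    (0, 0): (16_384, 48_210),
    (0, 1): (64_594, 51_877),
    (1, 0): (116_471, 47_003),
    (1, 1): (163_474, 50_122),
}

TILE_SIZE = 512  # pixels per tile edge (assumed)


def tile_for_pixel(row: int, col: int, tile_size: int = TILE_SIZE) -> Tuple[int, int]:
    """Return the (tile_row, tile_col) that contains a pixel."""
    return row // tile_size, col // tile_size


def range_header(offset: int, length: int) -> str:
    """Build an HTTP Range header value; the end byte is inclusive."""
    return f"bytes={offset}-{offset + length - 1}"


tile = tile_for_pixel(row=700, col=100)   # pixel row 700 lies in the second tile row
offset, length = tile_index[tile]
header = range_header(offset, length)
print(tile, header)
```

A window that spans several tiles simply repeats the lookup per tile; GDAL additionally merges adjacent ranges into fewer requests.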

Environment Configuration & Dependency Resolution

conda install -c conda-forge "rasterio=1.3.*" "geopandas=0.14.*" "pyarrow=15.*" "gdal=3.8.*"
pip install "pmtiles>=3.2" rio-cogeo   # rio-cogeo provides the "rio cogeo" commands used later
# Remote reads need GDAL's virtual filesystem (/vsis3/, /vsicurl/), bundled with conda-forge GDAL

For authenticated cloud reads, set GDAL's configuration options before opening any dataset: the AWS_* variables for credentials, GDAL_DISABLE_READDIR_ON_OPEN=EMPTY_DIR so GDAL does not list the bucket looking for sidecar files, and CPL_VSIL_CURL_ALLOWED_EXTENSIONS=.tif so it only probes .tif paths. With these set, GDAL goes straight to range requests on the target file. The same rasterio install covered in Raster Data Handling with Rasterio provides COG support out of the box.
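
One way to apply those options from Python is sketched below. The helper names (apply_gdal_options, read_band_window) are hypothetical; the option names come from the text above, and rasterio.Env is used to scope them to a single read. Treat it as a starting point under those assumptions, not a complete credential setup.

```python
import os
from typing import Dict

# GDAL options named in the text; the values are one reasonable starting point.
GDAL_REMOTE_OPTIONS: Dict[str, str] = {
    "GDAL_DISABLE_READDIR_ON_OPEN": "EMPTY_DIR",   # skip directory listing on open
    "CPL_VSIL_CURL_ALLOWED_EXTENSIONS": ".tif",    # only probe .tif paths
}


def apply_gdal_options(options: Dict[str, str]) -> None:
    """Export options as environment variables without overriding existing ones."""
    for key, value in options.items():
        os.environ.setdefault(key, value)


def read_band_window(url: str, window, band: int = 1):
    """Read one window of one band with the options scoped to this call."""
    import rasterio  # imported lazily so the config helper works without it

    with rasterio.Env(**GDAL_REMOTE_OPTIONS):
        with rasterio.open(url) as src:
            return src.read(band, window=window)


# Process-wide defaults, set before the first dataset is opened
apply_gdal_options(GDAL_REMOTE_OPTIONS)
```

Environment variables suit containers and batch jobs where every read is remote; the scoped rasterio.Env form is easier to reason about when local and remote reads are mixed in one process.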

Vectorized Operations & Core Workflow

The everyday workflow: store the analytical dataset as GeoParquet, query windows with DuckDB or GeoPandas, and derive COGs/PMTiles as needed. The windowed-raster recipe is in Windowed Reads from Cloud Optimized GeoTIFF.

import geopandas as gpd

# Write GeoParquet — preserves CRS, compresses well, supports row-group skipping
parcels = gpd.read_file("parcels.gpkg").to_crs(epsg=4326)
parcels.to_parquet("parcels.parquet")

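# Read back and slice to a bounding-box window. With the pinned geopandas 0.14,
# read_parquet loads the whole file and .cx filters in memory; geopandas 1.0+
# adds a bbox= argument to read_parquet that filters at read time.
loaded = gpd.read_parquet("parcels.parquet")
subset = loaded.cx[7.6:7.8, 45.0:45.1]   # xmin:xmax, ymin:ymax in EPSG:4326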
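
The row-group skipping mentioned above comes down to a bounding-box intersection test against per-group statistics. The sketch below models only that decision in plain Python; the RowGroup type and the three extents are hypothetical, and a real reader takes the extents from Parquet column statistics rather than a list.

```python
from typing import List, NamedTuple, Tuple

BBox = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


class RowGroup(NamedTuple):
    index: int
    bbox: BBox  # extent of all features in the group, from column statistics


def intersects(a: BBox, b: BBox) -> bool:
    """True when two boxes overlap or touch."""
    return a[0] <= b[2] and a[2] >= b[0] and a[1] <= b[3] and a[3] >= b[1]


def groups_to_read(groups: List[RowGroup], query: BBox) -> List[int]:
    """Keep only row groups whose extent could contain matching features."""
    return [g.index for g in groups if intersects(g.bbox, query)]


# Hypothetical statistics for a file with three row groups
groups = [
    RowGroup(0, (7.0, 44.5, 7.5, 45.0)),
    RowGroup(1, (7.5, 44.9, 8.0, 45.3)),
    RowGroup(2, (8.0, 45.3, 8.6, 45.9)),
]
query = (7.6, 45.0, 7.8, 45.1)  # same window as the .cx slice above
selected = groups_to_read(groups, query)
print(selected)
```

Skipping only pays off when features are sorted spatially before writing; if every row group spans the whole dataset extent, every group intersects every query.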

Geometry / Data Processing Details

GeoParquet keeps geometry as WKB and records the CRS in the file's own "geo" metadata, so a round trip is lossless. Shapefile, by contrast, truncates field names to 10 characters and stores the CRS in a separate .prj sidecar that is easily lost. This makes GeoParquet the right interchange format between processing stages; the trade-offs against legacy formats are detailed in GeoParquet vs Shapefile for Storage. For very large vector sources like global building or road datasets, streaming ingestion avoids ever holding the whole thing in memory; see Streaming Overture Maps Data with DuckDB.

CRS Alignment & Projection Pipeline

Cloud-native vector formats embed CRS metadata, so the discipline is to tag data correctly at write time and reproject deliberately, using Coordinate Systems with PyProj. COGs store their CRS in the GeoTIFF tags; windowed reads return data in the file's native CRS, so reproject the result, not the whole raster, to stay cheap.

import geopandas as gpd

fields = gpd.read_parquet("fields.parquet")
print(fields.crs)                       # EPSG:4326 — read from file metadata

# Reproject only the windowed subset you actually need
subset = fields.cx[10.0:10.2, 45.0:45.2].to_crs(epsg=32632)
subset["area_ha"] = subset.geometry.area / 1e4
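
For the raster side, a window is derived from bounds that are already expressed in the file's native CRS. The arithmetic is sketched below for a north-up raster with square pixels; the origin, pixel size, and bounds are made-up values, and in practice rasterio.windows.from_bounds performs this conversion from the dataset's transform.

```python
import math
from typing import NamedTuple, Tuple


class PixelWindow(NamedTuple):
    col_off: int
    row_off: int
    width: int
    height: int


def window_from_bounds(
    bounds: Tuple[float, float, float, float],  # (left, bottom, right, top) in the raster CRS
    origin_x: float,      # x of the raster's top-left corner
    origin_y: float,      # y of the raster's top-left corner
    pixel_size: float,    # square pixels assumed, in CRS units
) -> PixelWindow:
    """Convert CRS bounds to a pixel window, snapping outward to whole pixels."""
    left, bottom, right, top = bounds
    col_start = math.floor((left - origin_x) / pixel_size)
    col_stop = math.ceil((right - origin_x) / pixel_size)
    row_start = math.floor((origin_y - top) / pixel_size)   # rows count down from the top
    row_stop = math.ceil((origin_y - bottom) / pixel_size)
    return PixelWindow(col_start, row_start, col_stop - col_start, row_stop - row_start)


# Hypothetical 10 m raster in EPSG:32632 with its top-left corner at (500000, 5000000)
win = window_from_bounds(
    bounds=(505000.0, 4990000.0, 507560.0, 4992560.0),
    origin_x=500000.0,
    origin_y=5000000.0,
    pixel_size=10.0,
)
print(win)
```

If the query box arrives in EPSG:4326, transform its four bounds into the raster's CRS first, then compute the window; that keeps the reprojection cost proportional to the window rather than the file.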

Production Export & Integration

GeoParquet as the canonical analytical store: columnar, compressed, CRS-aware, queryable by DuckDB in place.
COG for rasters: one file serves full-resolution windows and pre-built overviews to web clients and pipelines alike.
PMTiles as the derived rendering artifact for Vector Tile Pipelines with PMTiles.
Validate cloud-readiness. Use rio cogeo validate (from the rio-cogeo plugin) to confirm a GeoTIFF is genuinely cloud-optimized (internally tiled, with overviews), not just a renamed TIFF.
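
A rough pre-check for the two properties named above, internal tiling and overviews, can be sketched as a pure function. The function name cog_problems and the 512-pixel threshold are assumptions for illustration; the check ignores the internal byte layout that a full validator also inspects, so it complements rio cogeo validate rather than replacing it.

```python
from typing import List, Sequence, Tuple


def cog_problems(
    width: int,
    height: int,
    block_shape: Tuple[int, int],   # (block_height, block_width) of band 1
    overviews: Sequence[int],       # decimation factors, e.g. [2, 4, 8]
) -> List[str]:
    """Return human-readable reasons a GeoTIFF is unlikely to be a COG."""
    problems: List[str] = []
    block_h, block_w = block_shape
    # Striped TIFFs have blocks as wide as the image and only a few rows tall
    if block_w == width and block_h < height:
        problems.append("striped layout, not internally tiled")
    # Large images need overviews so zoomed-out reads stay small (threshold assumed)
    if max(width, height) > 512 and not overviews:
        problems.append("no overviews")
    return problems


# With rasterio installed, the inputs could come from an open dataset:
#   with rasterio.open(path) as src:
#       cog_problems(src.width, src.height, src.block_shapes[0], src.overviews(1))

striped = cog_problems(20000, 20000, (1, 20000), [])
tiled = cog_problems(20000, 20000, (512, 512), [2, 4, 8, 16])
print(striped)
print(tiled)
```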

Windows / Platform Edge Cases & Debugging

Remote read downloads the whole file. The GeoTIFF isn't actually a COG (no internal tiling/overviews); re-encode with rio cogeo create.
Slow S3 reads. Set GDAL_DISABLE_READDIR_ON_OPEN=EMPTY_DIR so GDAL stops listing the bucket on open.
pyarrow can't read the Parquet. Writer/reader version skew; pin pyarrow across the pipeline.
CRS missing after a Shapefile round trip. Expected whenever the .prj sidecar is lost or not copied along; migrate to GeoParquet to retain the CRS in-file.
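/vsicurl/ 403s. Credentials/headers not set. The AWS_* variables only take effect on /vsis3/ paths; for plain HTTPS URLs read through /vsicurl/, pass auth headers with GDAL_HTTP_HEADERS or use a presigned URL.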
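Row-group skipping doesn't help. The Parquet was written without a bbox covering column (and so without bbox statistics) by an old writer, or the features are not spatially sorted; rewrite with a current GeoParquet writer, for example GeoPandas 1.0+ with write_covering_bbox=True.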