Cloud-Native Geospatial Formats

Cloud-native formats are designed to be read in place over HTTP, a few bytes at a time, instead of downloaded whole. Cloud Optimized GeoTIFF (COG) does this for rasters; GeoParquet and FlatGeobuf do it for vectors; PMTiles does it for tiles. Together they let a pipeline query terabytes on object storage while pulling only the windows it needs. This guide frames the format family within Geospatial Data Ingestion & Processing Workflows and connects to the engines in DuckDB Spatial Analytics and the delivery layer in Web Mapping & Interactive Visualization.

Figure: Cloud-native read model. An object store (S3/R2) holds COG (raster windows), GeoParquet (row groups), and PMTiles (map tiles) files with internal indexes. A Python client (rasterio / duckdb) uses the internal index to find a byte offset, then issues an HTTP range request for only the bytes it needs.
Each cloud-native format carries an internal index so a client can range-request exactly the window, row group, or tile it needs.

Architecture & Data Structures

The shared idea is an internal index plus HTTP range requests. A COG arranges a GeoTIFF into tiles with overviews and a header that maps each tile to a byte range, so a client reads one window without fetching the whole file. GeoParquet stores features in columnar form, split into row groups; when the writer includes bounding-box statistics for each group, a reader can skip groups that fall outside a query window. PMTiles indexes map tiles by z/x/y. In every case the file lives on plain object storage, with no database and no tile server, and the client decides which bytes to ask for.

import rasterio
from rasterio.windows import Window

# Read a single window from a COG on S3 — only that window transfers
cog_url = "https://example-bucket.s3.amazonaws.com/ortho_cog.tif"
with rasterio.open(cog_url) as src:
    window = Window(col_off=4096, row_off=4096, width=512, height=512)
    patch = src.read(1, window=window)
    print(patch.shape)   # (512, 512) — the full image was never downloaded
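
To make the index-plus-range-request idea concrete, here is a toy model in plain Python. It is a sketch under stated assumptions, not the actual byte layout of COG, GeoParquet, or PMTiles: `TileEntry`, `TOY_INDEX`, and all offsets are hypothetical, and an in-memory byte string stands in for an object on S3.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TileEntry:
    """Hypothetical index record: where one tile lives inside the file."""
    offset: int  # position of the tile's first byte
    length: int  # tile size in bytes


def range_header(entry: TileEntry) -> dict:
    """Build the HTTP Range header for one tile (byte ranges are inclusive)."""
    last = entry.offset + entry.length - 1
    return {"Range": f"bytes={entry.offset}-{last}"}


def read_tile(blob: bytes, entry: TileEntry) -> bytes:
    """Stand-in for the object store: return only the requested bytes."""
    return blob[entry.offset : entry.offset + entry.length]


# Toy "file": a 16-byte header followed by two tiles of different sizes.
blob = b"H" * 16 + b"A" * 100 + b"B" * 60
TOY_INDEX = {
    (0, 0): TileEntry(offset=16, length=100),
    (0, 1): TileEntry(offset=116, length=60),
}

entry = TOY_INDEX[(0, 1)]
header = range_header(entry)
tile = read_tile(blob, entry)
print(header, len(tile), "of", len(blob), "bytes transferred")
```

The client only needs the index to turn "tile (0, 1)" into a byte range; everything else in the file stays on the server.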

Environment Configuration & Dependency Resolution

conda install -c conda-forge "rasterio=1.3.*" "geopandas=0.14.*" "pyarrow=15.*" "gdal=3.8.*"
pip install "pmtiles>=3.2"
# Remote reads need GDAL's virtual filesystem (/vsis3/, /vsicurl/), bundled with conda-forge GDAL

For authenticated cloud reads, set the GDAL environment knobs (AWS_*, GDAL_DISABLE_READDIR_ON_OPEN=EMPTY_DIR, CPL_VSIL_CURL_ALLOWED_EXTENSIONS=.tif) so GDAL issues efficient range requests instead of listing buckets. The same rasterio install covered in Raster Data Handling with Rasterio provides COG support out of the box.
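
One way to apply those settings from Python is to set them as environment variables before the first remote dataset is opened. The sketch below uses only the standard library; `GDAL_CLOUD_DEFAULTS` and `apply_gdal_cloud_defaults` are hypothetical names, the values are the ones quoted above, and credentials (`AWS_*`) are deliberately left to your usual secret handling.

```python
import os

# Values taken from the paragraph above; AWS_* credentials are not set here.
GDAL_CLOUD_DEFAULTS = {
    "GDAL_DISABLE_READDIR_ON_OPEN": "EMPTY_DIR",
    "CPL_VSIL_CURL_ALLOWED_EXTENSIONS": ".tif",
}


def apply_gdal_cloud_defaults(env=None) -> dict:
    """Set each option only if it is not already present.

    `env` defaults to os.environ; pass a plain dict to preview the result
    without touching the real process environment.
    Returns the options that were actually added.
    """
    target = os.environ if env is None else env
    applied = {}
    for key, value in GDAL_CLOUD_DEFAULTS.items():
        if key not in target:
            target[key] = value
            applied[key] = value
    return applied


# Preview on a dict that already overrides one option.
demo_env = {"GDAL_DISABLE_READDIR_ON_OPEN": "YES"}
applied = apply_gdal_cloud_defaults(demo_env)
print(applied)
```

Existing values win, so a deployment that already exports its own settings is left alone. Call the function with no argument early in the process, before any remote dataset is opened.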

Vectorized Operations & Core Workflow

The everyday workflow: store the analytical dataset as GeoParquet, query windows with DuckDB or GeoPandas, and derive COGs/PMTiles as needed. The windowed-raster recipe is in Windowed Reads from Cloud Optimized GeoTIFF.

import geopandas as gpd

# Write GeoParquet — preserves CRS, compresses well, supports row-group skipping
parcels = gpd.read_file("parcels.gpkg").to_crs(epsg=4326)
parcels.to_parquet("parcels.parquet")

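# Read the file back, then slice by bounding box in memory.
# Note: gpd.read_parquet goes through pyarrow (not pyogrio) and loads every row here;
# .cx filters afterwards, so nothing is pushed down to the file.
parcels_back = gpd.read_parquet("parcels.parquet")
subset = parcels_back.cx[7.6:7.8, 45.0:45.1]   # coordinate-based slice: x (lon) range, then y (lat) range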
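
Row-group skipping itself is just a bounding-box test against per-group statistics. The following is a minimal pure-Python sketch of that test, not the logic of any particular reader: `BBox`, `groups_to_read`, and the three group extents are hypothetical, and the query window reuses the coordinates from the slice above.

```python
from typing import List, NamedTuple


class BBox(NamedTuple):
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def intersects(a: BBox, b: BBox) -> bool:
    """True when the two boxes overlap or touch."""
    return (a.xmin <= b.xmax and b.xmin <= a.xmax
            and a.ymin <= b.ymax and b.ymin <= a.ymax)


def groups_to_read(group_bounds: List[BBox], query: BBox) -> List[int]:
    """Indexes of the row groups whose extent overlaps the query window."""
    return [i for i, bounds in enumerate(group_bounds)
            if intersects(bounds, query)]


# Hypothetical per-row-group extents, as a reader might find them in file metadata.
group_bounds = [
    BBox(7.0, 44.5, 7.5, 45.0),  # group 0: west of the window
    BBox(7.5, 44.9, 8.0, 45.2),  # group 1: overlaps the window
    BBox(8.0, 45.2, 8.5, 45.6),  # group 2: north-east of the window
]
query = BBox(7.6, 45.0, 7.8, 45.1)
selected = groups_to_read(group_bounds, query)
print(selected)
```

Only the selected groups need to be fetched and decoded. The benefit depends on features being spatially sorted when the file is written, so that each group covers a compact area.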

Geometry / Data Processing Details

GeoParquet keeps geometry as WKB and stores the CRS in the file's own metadata, so a round trip is lossless. Shapefile, by contrast, truncates field names to 10 characters and keeps the CRS in a separate .prj sidecar file that is easy to lose. This makes GeoParquet the right interchange format between processing stages; the trade-offs against legacy formats are detailed in GeoParquet vs Shapefile for Storage. For very large vector sources like global building or road datasets, streaming ingestion avoids ever holding the whole thing in memory; see Streaming Overture Maps Data with DuckDB.
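
For readers who have not looked inside WKB, the sketch below encodes and decodes a single 2D point with the standard library only. It illustrates the binary layout, not what GeoPandas does internally (real pipelines would use Shapely); `point_to_wkb` and `wkb_to_point` are hypothetical helper names.

```python
import struct

WKB_POINT = 1  # WKB geometry type code for a 2D Point


def point_to_wkb(x: float, y: float) -> bytes:
    """Encode a 2D point as little-endian WKB.

    Layout: 1-byte byte-order flag (1 = little-endian), 4-byte unsigned
    geometry type, then x and y as 8-byte doubles: 21 bytes in total.
    """
    return struct.pack("<BIdd", 1, WKB_POINT, x, y)


def wkb_to_point(buf: bytes) -> tuple:
    """Decode a 2D point from WKB, honouring the byte-order flag."""
    endian = "<" if buf[0] == 1 else ">"
    geom_type, x, y = struct.unpack(endian + "Idd", buf[1:21])
    if geom_type != WKB_POINT:
        raise ValueError(f"not a 2D point: geometry type {geom_type}")
    return x, y


wkb = point_to_wkb(7.7, 45.05)
print(len(wkb), wkb_to_point(wkb))
```

Coordinates are stored as full doubles, which is why the round trip loses nothing.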

CRS Alignment & Projection Pipeline

Cloud-native vector formats embed CRS metadata, so the discipline is to tag data correctly at write time and reproject deliberately, using Coordinate Systems with PyProj. COGs store their CRS in the GeoTIFF tags; windowed reads return data in the file's native CRS, so reproject the result, not the whole raster, to stay cheap.

import geopandas as gpd

fields = gpd.read_parquet("fields.parquet")
print(fields.crs)                       # EPSG:4326 — read from file metadata

# Reproject only the windowed subset you actually need
subset = fields.cx[10.0:10.2, 45.0:45.2].to_crs(epsg=32632)
subset["area_ha"] = subset.geometry.area / 1e4
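
The example hard-codes EPSG:32632, the UTM zone that covers longitudes 6°E to 12°E in the northern hemisphere. If the window moves, the zone has to move with it. A minimal sketch of the standard zone arithmetic follows; `utm_epsg` is a hypothetical helper, and it ignores the special zones around Norway and Svalbard.

```python
import math


def utm_epsg(lon: float, lat: float) -> int:
    """Return the WGS 84 / UTM EPSG code for a lon/lat in degrees.

    Simplified: uses the regular 6-degree zone grid and ignores the
    Norway/Svalbard exceptions.
    """
    if not -180.0 <= lon <= 180.0:
        raise ValueError("longitude out of range")
    if not -80.0 <= lat <= 84.0:
        raise ValueError("latitude outside the UTM domain")
    # Zones are numbered 1..60 eastward from 180 degrees west.
    zone = min(int(math.floor((lon + 180.0) / 6.0)) + 1, 60)
    base = 32600 if lat >= 0 else 32700  # 326xx = north, 327xx = south
    return base + zone


# Centre of the window used above: lon 10.0 to 10.2, lat 45.0 to 45.2
code = utm_epsg(10.1, 45.1)
print(code)
```

The result can be passed as `to_crs(epsg=code)` in place of the literal. Areas computed in UTM are a reasonable approximation within one zone; for windows that span several zones an equal-area projection is the safer choice.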

Production Export & Integration

Windows / Platform Edge Cases & Debugging