Reprojecting Large Datasets Without Memory Errors

Reading a multi-gigabyte layer with gpd.read_file() and calling .to_crs() on it holds the whole dataset in RAM, plus a reprojected copy, and frequently dies with a MemoryError. This guide reprojects datasets larger than memory by streaming them in batches, never holding more than one batch at a time. It is for anyone reprojecting national or continental vector layers on an ordinary workstation. It sits under Coordinate Reference System Transformations in Geospatial Data Ingestion & Processing Workflows.

Why This Approach / What Goes Wrong

gpd.read_file(...).to_crs(...) stacks up memory in three steps: the full source loads, a second, reprojected copy is built, and the writer buffers its output, so up to three copies of a huge dataset are alive at once. The fix is to stream: read a batch of features, reproject just that batch, append it to the output, release it, repeat. pyogrio can read one window of features at a time through the skip_features and max_features arguments of read_dataframe (or stream Arrow record batches through pyogrio.raw.open_arrow), and pyarrow's ParquetWriter can add each batch to a single GeoParquet file. The recurring correctness trap is the transformer: if you hand-build a pyproj.Transformer, build it once with always_xy=True and reuse it for every batch rather than re-resolving the CRS for every chunk; GeoPandas' .to_crs() already works in x/y order. Never assume the source has a CRS, either: many large public datasets ship without one.
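One way to drive the read, reproject, append loop is to split the feature count into fixed-size windows up front. The sketch below shows only that bookkeeping, using the standard library; the helper name batch_windows is hypothetical, and the (skip, count) pairs are meant to be passed as skip_features and max_features to pyogrio.read_dataframe.

```python
from typing import Iterator, Tuple


def batch_windows(total: int, size: int) -> Iterator[Tuple[int, int]]:
    """Yield (skip, count) pairs that cover `total` features in order."""
    if size <= 0:
        raise ValueError("size must be positive")
    for skip in range(0, total, size):
        # The last window is shorter when total is not a multiple of size
        yield skip, min(size, total - skip)


windows = list(batch_windows(1_000_000, 250_000))
print(len(windows), windows[-1])
# 4 (750000, 250000)
```

Computing the windows separately from the I/O makes it easy to log progress or resume from a given offset after a failure.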

Prerequisites

geopandas>=0.14
pyogrio>=0.7 (batched I/O)
pyarrow>=15 (GeoParquet output)

conda install -c conda-forge "geopandas=0.14.*" "pyogrio=0.7.*" "pyarrow=15.*"

Step-by-Step Implementation

1. Inspect the source CRS and feature count without loading geometry.

import pyogrio

info = pyogrio.read_info("national_parcels.fgb")
print("CRS:", info["crs"], "| features:", info["features"])
# CRS: EPSG:4326 | features: 41872330
assert info["crs"] is not None, "Source lacks a CRS — set it before reprojecting"

2. Stream batches, reproject each, and write incrementally to GeoParquet.

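import json
import pyarrow as pa
import pyarrow.parquet as pq
import pyogrio

SOURCE = "national_parcels.fgb"
TARGET_EPSG = 25832          # ETRS89 / UTM zone 32N, metres
BATCH = 250_000
OUT = "parcels_utm.parquet"

total = pyogrio.read_info(SOURCE)["features"]
writer, schema = None, None
for start in range(0, total, BATCH):
    # Read one window of features; batch_df is a GeoDataFrame in the source CRS
    batch_df = pyogrio.read_dataframe(SOURCE, skip_features=start, max_features=BATCH)
    reprojected = batch_df.to_crs(epsg=TARGET_EPSG)
    # Encode geometry as WKB so the batch can be written as a plain Arrow table
    table = pa.Table.from_pandas(reprojected.to_wkb(), schema=schema, preserve_index=False)
    if writer is None:
        # The first batch fixes the schema and the GeoParquet "geo" file metadata
        geom_col = reprojected.geometry.name
        geo = {"version": "1.0.0", "primary_column": geom_col, "columns": {geom_col: {
            "encoding": "WKB", "geometry_types": [], "crs": reprojected.crs.to_json_dict()}}}
        schema = table.schema.with_metadata(
            {**(table.schema.metadata or {}), b"geo": json.dumps(geo).encode()})
        writer = pq.ParquetWriter(OUT, schema)
    writer.write_table(table)            # appends this batch as new row groups
    del batch_df, reprojected, table     # release before the next batch
if writer is not None:
    writer.close()                       # writes the footer; required for a readable file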

3. If writing incrementally to a single file isn't an option, write one file per batch and treat the folder as a partitioned dataset.

import os
import pyogrio

os.makedirs("parcels_utm_parts", exist_ok=True)
total = pyogrio.read_info(SOURCE)["features"]
for i, start in enumerate(range(0, total, BATCH)):
    # Each window is read, reprojected and written on its own, then goes out of scope
    batch_df = pyogrio.read_dataframe(SOURCE, skip_features=start, max_features=BATCH)
    batch_df.to_crs(epsg=TARGET_EPSG).to_parquet(f"parcels_utm_parts/part_{i:05d}.parquet")
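With one file per batch, a crashed or skipped batch shows up as a gap in the part numbering. The following is a minimal sketch of a check for such gaps; the helper name missing_parts is hypothetical and it assumes the part_NNNNN.parquet naming used above.

```python
import re
from pathlib import Path
from typing import List


def missing_parts(folder: str, expected: int) -> List[int]:
    """Return the batch indices in range(expected) that have no part file."""
    pattern = re.compile(r"part_(\d+)\.parquet$")
    found = set()
    for path in Path(folder).iterdir():
        match = pattern.match(path.name)
        if match:
            found.add(int(match.group(1)))
    return [i for i in range(expected) if i not in found]
```

For the run above, the expected number of parts is math.ceil(total / BATCH), so missing_parts("parcels_utm_parts", math.ceil(total / BATCH)) should return an empty list. The check only confirms that every file exists, not that each one is complete.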

Verification

Confirm the output CRS is correct and no features were dropped, without reloading everything at once.

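import json
import pyarrow.parquet as pq
import pyogrio
from pyproj import CRS

expected = pyogrio.read_info("national_parcels.fgb")["features"]
md = pq.ParquetFile("parcels_utm.parquet").metadata      # reads the footer only
geo = json.loads(md.metadata[b"geo"])
out_crs = CRS.from_json_dict(geo["columns"][geo["primary_column"]]["crs"])
print(f"Output CRS: EPSG:{out_crs.to_epsg()} | features: {md.num_rows}")
# Output CRS: EPSG:25832 | features: 41872330
assert md.num_rows == expected, "Feature count changed: a batch was dropped"
assert out_crs.to_epsg() == 25832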

# Peak memory should stay near one batch, not the whole dataset
# (watch RSS during the run; it should plateau, not climb with feature count)
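To watch RSS from inside the script rather than from a system monitor, something like the following could be called every few batches. It is a sketch using only the standard library; the helper name peak_rss_mib is hypothetical, and the resource module is available on Unix-like systems only.

```python
import resource
import sys


def peak_rss_mib() -> float:
    """Peak resident set size of this process so far, in MiB (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


print(f"peak RSS: {peak_rss_mib():.0f} MiB")
```

Because the value is a high-water mark, it should level off after the first few batches. A figure that keeps rising with the batch index suggests something is retaining references to earlier batches.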

Edge Cases & Debugging

MemoryError persists. Lower BATCH; 250k polygon features can still be large if vertex-dense.
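CRS is None on the source. Call set_crs on each batch with the known EPSG code before calling to_crs; to_crs raises an error on a GeoDataFrame that has no CRS.
Axis-flipped output. The usual cause is a hand-built pyproj.Transformer created without always_xy=True. GeoPandas' .to_crs() handles axis order for you, so prefer it for each batch.
Slow throughput. Try use_arrow=True on read_dataframe, which is usually much faster than the default reader, then re-check that memory still plateaus. A larger BATCH also means fewer seeks into the source file.
Append unsupported. GeoDataFrame.to_parquet() always writes a whole file and has no append mode. Use the one-file-per-batch pattern (step 3); DuckDB and Dask can read the folder as one dataset.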
Mixed geometry types across batches. Keep the schema stable; coerce or split by geometry type before writing.
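For the first case in the list above, lowering BATCH by hand can be automated. The sketch below retries a window with half the batch size whenever a MemoryError is raised. The function name run_with_backoff and the process(skip, count) callback are hypothetical, and it assumes the callback fails before it has written anything for that window. It cannot help when the operating system kills the process outright instead of raising MemoryError.

```python
from typing import Callable


def run_with_backoff(
    process: Callable[[int, int], None],
    total: int,
    batch: int,
    min_batch: int = 1_000,
) -> int:
    """Process `total` features in windows, halving the window on MemoryError.

    `process(skip, count)` is expected to read, reproject and write one window.
    Returns the batch size in use when the run finished.
    """
    skip = 0
    while skip < total:
        count = min(batch, total - skip)
        try:
            process(skip, count)
        except MemoryError:
            if batch <= min_batch:
                raise  # already at the floor; give up rather than loop forever
            batch = max(min_batch, batch // 2)
            continue  # retry the same window with a smaller batch
        skip += count
    return batch
```

The returned batch size is a reasonable starting value for the next run on the same dataset and machine.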