Reprojecting Large Datasets Without Memory Errors
Calling .to_crs() on a multi-gigabyte GeoDataFrame loads the whole thing into RAM and frequently dies with a MemoryError. This guide reprojects datasets larger than memory by streaming them in chunks, never holding more than one batch at a time. It is for anyone reprojecting national or continental vector layers on a normal machine. It sits under Coordinate Reference System Transformations in Geospatial Data Ingestion & Processing Workflows.
Why This Approach / What Goes Wrong
gpd.read_file(...).to_crs(...) is a three-step memory spike: the full source loads, a second reprojected copy is built, and the writer buffers output, so up to three copies of a huge dataset can be resident at once. The fix is to stream: read a batch of features, reproject just that batch, append it to the output, release it, repeat. Pyogrio's record-batch reader and the GeoParquet/FlatGeobuf writers make this clean. The recurring correctness trap is the transformer: if you build a pyproj.Transformer by hand, create it once with always_xy=True and reuse it for every batch rather than re-resolving the CRS for each chunk. Also, never assume the source has a CRS; many large public datasets ship without one.
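The read, reproject, append, release loop can be sketched without any geospatial library. This is a minimal illustration of the streaming pattern only: `batched`, `stream_reproject`, and `shift` are hypothetical names, and `shift` is a stand-in for a real coordinate transform, not a reprojection.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, Tuple

Point = Tuple[float, float]


def batched(features: Iterable, size: int) -> Iterator[List]:
    """Yield successive lists holding at most `size` features."""
    it = iter(features)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def stream_reproject(
    features: Iterable[Point],
    transform: Callable[[float, float], Point],
    write: Callable[[List[Point]], None],
    size: int,
) -> int:
    """Transform one batch at a time, hand it to `write`, then let it go."""
    total = 0
    for chunk in batched(features, size):
        write([transform(x, y) for x, y in chunk])
        total += len(chunk)
    return total


def shift(x: float, y: float) -> Point:
    """Stand-in transform (a plain offset), NOT a real reprojection."""
    return (x + 1.0, y + 1.0)


# The source is a generator, so only one batch of 4 points exists at a time.
source = ((float(i), float(i)) for i in range(10))
out: List[Point] = []
count = stream_reproject(source, shift, out.extend, size=4)
print(count, "features written in batches of 4")
```

In the real pipeline the generator is the batch reader, `transform` is the reprojection, and `write` is the incremental writer; the point of the sketch is that nothing outside the loop body ever holds more than one batch.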
Prerequisites
- geopandas>=0.14
- pyogrio>=0.7 (batched I/O)
- pyarrow>=15 (GeoParquet output)
conda install -c conda-forge "geopandas=0.14.*" "pyogrio=0.7.*" "pyarrow=15.*"
Step-by-Step Implementation
1. Inspect the source CRS and feature count without loading geometry.
import pyogrio
info = pyogrio.read_info("national_parcels.fgb")
print("CRS:", info["crs"], "| features:", info["features"])
# CRS: EPSG:4326 | features: 41872330
assert info["crs"] is not None, "Source lacks a CRS — set it before reprojecting"
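With the feature count in hand, it is worth working out how many batches a run will produce before starting it. The sketch below is plain arithmetic; `plan_batches` is a hypothetical helper, and treating a negative count as "unknown" is an assumption about how the reader reports a count it could not obtain cheaply.

```python
import math
from typing import Tuple


def plan_batches(feature_count: int, batch_size: int) -> Tuple[int, int]:
    """Return (number_of_batches, size_of_last_batch)."""
    if feature_count < 0:
        # Assumption: a negative count means the driver could not report one
        raise ValueError("feature count unknown; cannot plan batches")
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    n_batches = math.ceil(feature_count / batch_size)
    last = feature_count - (n_batches - 1) * batch_size if n_batches else 0
    return n_batches, last


# Numbers from the example above: 41,872,330 features in batches of 250,000
n_batches, last = plan_batches(41_872_330, 250_000)
print(n_batches, last)  # 168 122330
```

For the example dataset that is 167 full batches plus a final batch of 122,330 features, which also gives a rough progress denominator for logging.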
2. Stream batches, reproject each, and write incrementally to GeoParquet.
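import json
import geopandas as gpd
import pyarrow as pa
import pyarrow.parquet as pq
from pyogrio.raw import open_arrow
from pyproj import CRS
SOURCE = "national_parcels.fgb"
OUTPUT = "parcels_utm.parquet"
TARGET_EPSG = 25832  # ETRS89 / UTM zone 32N, metres
BATCH = 250_000
# GeoDataFrame.to_parquet has no append mode, so keep one ParquetWriter open
# and attach the GeoParquet "geo" metadata (with the target CRS) to its schema.
geo = {"version": "1.0.0", "primary_column": "geometry", "columns": {"geometry": {
    "encoding": "WKB", "geometry_types": [], "crs": CRS.from_epsg(TARGET_EPSG).to_json_dict()}}}
writer = None
# On pyogrio >= 0.8, also pass use_pyarrow=True so the reader yields pyarrow batches
with open_arrow(SOURCE, batch_size=BATCH) as (meta, reader):
    geom_col = meta["geometry_name"] or "wkb_geometry"
    for batch in reader:
        df = batch.to_pandas()
        geometry = gpd.GeoSeries.from_wkb(df.pop(geom_col), crs=meta["crs"])  # source CRS
        reprojected = gpd.GeoDataFrame(df, geometry=geometry).to_crs(epsg=TARGET_EPSG)
        table = pa.Table.from_pandas(reprojected.to_wkb(), preserve_index=False)
        if writer is None:  # the first batch fixes the schema and creates the file
            writer = pq.ParquetWriter(OUTPUT, table.schema.with_metadata({"geo": json.dumps(geo)}))
        writer.write_table(table)
        del df, geometry, reprojected, table  # release before the next batch
if writer is not None:
    writer.close()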
3. If the writer doesn't support append, write one file per batch and treat the folder as a partitioned dataset.
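import os
import geopandas as gpd
from pyogrio.raw import open_arrow
os.makedirs("parcels_utm_parts", exist_ok=True)
# SOURCE, BATCH and TARGET_EPSG are the constants defined in step 2
with open_arrow(SOURCE, batch_size=BATCH) as (meta, reader):
    geom_col = meta["geometry_name"] or "wkb_geometry"
    for i, batch in enumerate(reader):
        df = batch.to_pandas()
        geometry = gpd.GeoSeries.from_wkb(df.pop(geom_col), crs=meta["crs"])
        part = gpd.GeoDataFrame(df, geometry=geometry).to_crs(epsg=TARGET_EPSG)
        part.to_parquet(f"parcels_utm_parts/part_{i:05d}.parquet")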
Verification
Confirm the output CRS is correct and no features were dropped, without reloading everything at once.
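import json
import pyarrow.parquet as pq
from pyproj import CRS
# Read only the Parquet footer; this avoids depending on GDAL's optional Parquet driver
pf = pq.ParquetFile("parcels_utm.parquet")
n_out = pf.metadata.num_rows
geo = json.loads(pf.schema_arrow.metadata[b"geo"])
out_crs = CRS.from_json_dict(geo["columns"][geo["primary_column"]]["crs"])
print(f"Output CRS: EPSG:{out_crs.to_epsg()} | features: {n_out}")
# Output CRS: EPSG:25832 | features: 41872330
assert n_out == 41_872_330, "Feature count changed: a batch was dropped"
assert out_crs.to_epsg() == 25832
# Peak memory should stay near one batch, not the whole dataset
# (watch RSS during the run; it should plateau, not climb with feature count)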
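The "plateau, not climb" check can be made concrete by sampling peak RSS after each batch. The following is a rough sketch using only the standard library (`resource` is Unix-only); `peak_rss_mib` and `has_plateaued` are hypothetical helpers, and the 50 MiB tolerance and 3-sample window are arbitrary assumptions to tune for your data.

```python
import resource
import sys
from typing import List


def peak_rss_mib() -> float:
    """Peak resident set size of this process in MiB (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def has_plateaued(samples: List[float], tolerance_mib: float = 50.0, window: int = 3) -> bool:
    """True when the last `window` samples span less than `tolerance_mib`."""
    if len(samples) < window:
        return False
    recent = samples[-window:]
    return max(recent) - min(recent) < tolerance_mib


# Record one sample per processed batch; here the "batches" are placeholders.
samples: List[float] = []
for _ in range(5):
    samples.append(peak_rss_mib())
print(f"peak RSS: {samples[-1]:.1f} MiB, plateaued: {has_plateaued(samples)}")
```

Calling `samples.append(peak_rss_mib())` at the end of each loop iteration in the streaming code is enough; if `has_plateaued` is still false after many batches, something is holding references to earlier batches.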
Edge Cases & Debugging
- MemoryError persists. Lower BATCH; 250k polygon features can still be large if vertex-dense.
- CRS is None on the source. Call set_crs on each batch with the known EPSG before to_crs.
- Axis-flipped output. Usually caused by a hand-built pyproj.Transformer without always_xy=True; GeoPandas .to_crs handles axis order for you, so prefer it per batch.
- Slow throughput. Use pyogrio's Arrow read path (for example use_arrow=True in read_dataframe); it is far faster than the legacy reader.
- Append unsupported. Use the one-file-per-batch pattern (step 3); DuckDB and Dask read the folder as one dataset.
- Mixed geometry types across batches. Keep the schema stable; coerce or split by geometry type before writing.
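For the mixed-geometry case, one way to keep each output stable is to route features by geometry type, promoting single-part types to their multi-part form so a Polygon and a MultiPolygon land in the same group. The sketch below uses plain tuples as stand-in records; `split_by_geometry_type` and the `PROMOTE` table are hypothetical, and with GeoPandas the type key would come from the batch's geom_type values.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical stand-in record: (geometry_type, payload)
Feature = Tuple[str, str]

# Assumed coercion table: single-part types map to their multi-part form
PROMOTE = {
    "Point": "MultiPoint",
    "LineString": "MultiLineString",
    "Polygon": "MultiPolygon",
}


def split_by_geometry_type(batch: Iterable[Feature]) -> Dict[str, List[Feature]]:
    """Group a batch by promoted geometry type, one group per output."""
    groups: Dict[str, List[Feature]] = defaultdict(list)
    for geom_type, payload in batch:
        groups[PROMOTE.get(geom_type, geom_type)].append((geom_type, payload))
    return dict(groups)


batch = [("Polygon", "a"), ("MultiPolygon", "b"), ("LineString", "c")]
groups = split_by_geometry_type(batch)
for target_type, features in sorted(groups.items()):
    print(target_type, len(features))
```

Each group would then go to its own writer or its own output folder, so no single file ever sees a schema change between batches.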