GeoPandas vs Fiona for Large Files
GeoPandas loads an entire layer into a DataFrame; Fiona streams it feature by feature. For files that fit in memory the first is far more convenient, but past that point Fiona's iterator is what keeps a process alive. This guide compares the two for large-file I/O and shows how to combine them. It is for anyone hitting MemoryError on a big read. It sits under GeoPandas DataFrames Explained in Mastering Core Geospatial Python Libraries.
Why This Approach / What Goes Wrong
gpd.read_file() materializes every feature, geometry, and attribute as a DataFrame — wonderful for vectorized analysis, fatal when the file is larger than RAM. Fiona (the same GDAL/OGR layer GeoPandas reads through) exposes features as a lazy iterator: you process one at a time, so memory stays flat regardless of file size. The right pattern for large files is rarely "all Fiona" or "all GeoPandas" — it is to stream-filter with Fiona down to the subset you care about, then hand that subset to GeoPandas for the vectorized work. The mistake is loading the whole file just to keep 2% of it. Modern GeoPandas can also push a bounding-box or attribute filter into the read via pyogrio, which closes much of the gap.
Prerequisites
geopandas>=0.14
fiona>=1.9
shapely>=2.0
conda install -c conda-forge "geopandas=0.14.*" "fiona=1.9.*" "shapely=2.0.*"
Step-by-Step Implementation
1. The convenient path (fits in memory): GeoPandas with a pushed-down filter.
import geopandas as gpd
# Read only features intersecting a bbox; the filter is applied inside GDAL during the read
aoi = (7.6, 45.0, 7.8, 45.1) # xmin, ymin, xmax, ymax in the file's CRS
city_parcels = gpd.read_file("national_parcels.fgb", bbox=aoi)
print(len(city_parcels), "features loaded")
2. The streaming path (exceeds memory): Fiona iterator, flat memory.
import fiona
from shapely.geometry import shape
# Keep only commercial parcels from a file too large to load
kept = []
with fiona.open("national_parcels.fgb") as src:
    src_crs = src.crs  # capture the CRS while the file is open
    for feature in src:  # one feature at a time
        if feature["properties"].get("use") == "commercial":
            kept.append(feature)
    # Report inside the with block: len(src) needs the open dataset
    print(f"Filtered {len(kept)} of {len(src)} features")
3. Hand the filtered survivors to GeoPandas for vectorized analysis.
import geopandas as gpd
commercial = gpd.GeoDataFrame.from_features(kept, crs=src_crs)
commercial = commercial.to_crs(commercial.estimate_utm_crs())
commercial["area_m2"] = commercial.geometry.area # vectorized, fast
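If the filtered survivors are themselves too numerous for one GeoDataFrame, one option is to hand them to GeoPandas in fixed-size batches. The sketch below is an illustration, not part of the original workflow: batched is a hypothetical helper name, and plain dicts stand in for Fiona features so the block runs without any data file.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield lists of at most `size` items without materializing the input."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Stand-in for a Fiona collection: any iterable of GeoJSON-like features works.
demo_features = (
    {"type": "Feature", "geometry": None, "properties": {"id": i}} for i in range(7)
)
batch_sizes = [len(batch) for batch in batched(demo_features, size=3)]
print(batch_sizes)  # [3, 3, 1]
```

With a real file, each batch could be passed to gpd.GeoDataFrame.from_features(batch, crs=src.crs) inside the with block, reduced to a summary, and then discarded, so only one batch is held in memory at a time.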
Verification
Confirm that a second streaming pass, which never loads the full file, counts the same number of matches as the GeoDataFrame holds, and that the CRS survived the hand-off.
import fiona
# Count matches by streaming (no full load) and compare to the kept set
with fiona.open("national_parcels.fgb") as src:
    streamed = sum(1 for f in src if f["properties"].get("use") == "commercial")
print("Streamed match count:", streamed) # Streamed match count: 18254
assert streamed == len(commercial), "Filter mismatch between Fiona and GeoDataFrame"
assert commercial.crs is not None
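The count check above says nothing about memory. A rough way to compare peak usage is the standard-library tracemalloc module, sketched here under stated assumptions: fake_features is a hypothetical stand-in for iterating a Fiona collection, and tracemalloc sees only Python-level allocations, so memory held inside GDAL is not counted.

```python
import tracemalloc
from typing import Callable, Iterator


def peak_bytes(task: Callable[[], object]) -> int:
    """Run `task` and return the peak traced Python allocation in bytes."""
    tracemalloc.start()
    try:
        task()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


def fake_features(n: int) -> Iterator[dict]:
    """Hypothetical stand-in for iterating a Fiona collection."""
    for i in range(n):
        yield {"properties": {"use": "commercial" if i % 50 == 0 else "other"}}


def count_streaming() -> int:
    # Each feature is discarded as soon as it has been tested.
    return sum(1 for f in fake_features(50_000) if f["properties"]["use"] == "commercial")


def count_after_full_load() -> int:
    # Every feature is held in a list before filtering.
    loaded = list(fake_features(50_000))
    return sum(1 for f in loaded if f["properties"]["use"] == "commercial")


stream_peak = peak_bytes(count_streaming)
load_peak = peak_bytes(count_after_full_load)
print(f"streaming peak: {stream_peak} B, full-load peak: {load_peak} B")
```

To apply this to the real file, replace the two counting functions with the streaming loop and a full load of the same file. Expect the streaming peak to stay roughly constant as the file grows, while the full-load peak scales with feature count.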
Edge Cases & Debugging
- MemoryError on read_file. The file exceeds RAM; switch to the Fiona stream or a bbox/where filtered read.
- Streaming is slow. Per-feature Python iteration is inherently slower than vectorized C; filter aggressively, then vectorize the remainder.
- CRS lost via from_features. Pass crs=src.crs explicitly when building the GeoDataFrame.
- bbox= not filtering. Check which engine the GeoPandas build uses: pyogrio is the default only from GeoPandas 1.0, so on 0.14 pass engine="pyogrio" explicitly (with pyogrio installed); the legacy Fiona engine may not support every pushdown.
- Attribute filter at read time. Prefer gpd.read_file(path, where="use='commercial'") (SQL-style) to filter in GDAL rather than in Python.
- Memory still climbs while streaming. You appended full feature dicts; keep only the fields you need, or write survivors straight to disk.
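For the last item, one way to keep only the fields you need is to slim each feature as it streams past. This is a minimal sketch, assuming features support the mapping-style access used in step 2; slim_features and its parameters are hypothetical names, and plain dicts stand in for Fiona features.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, Mapping, Sequence


def slim_features(
    features: Iterable[Mapping[str, Any]],
    keep_fields: Sequence[str],
    predicate: Callable[[Mapping[str, Any]], bool],
) -> Iterator[Dict[str, Any]]:
    """Yield matching features reduced to geometry plus the requested fields."""
    for feature in features:
        props = feature["properties"]
        if predicate(props):
            yield {
                "type": "Feature",
                "geometry": feature["geometry"],
                "properties": {name: props.get(name) for name in keep_fields},
            }


# Plain dicts stand in for Fiona features here.
demo = [
    {"geometry": None, "properties": {"use": "commercial", "owner": "A", "notes": "x" * 100}},
    {"geometry": None, "properties": {"use": "residential", "owner": "B", "notes": "y" * 100}},
]
slimmed = list(slim_features(demo, ["use", "owner"], lambda p: p.get("use") == "commercial"))
print(slimmed)
```

In step 2, the loop body could become kept = list(slim_features(src, ["use"], lambda p: p.get("use") == "commercial")) inside the with block, so bulky attributes never accumulate in the list.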