GeoParquet vs Shapefile for Storage
The Shapefile has been the default vector format for thirty years, and it shows: 10-character field names, a 2 GB ceiling, a CRS in a separate file that goes missing, and multiple sidecar files per dataset. GeoParquet fixes all of it. This guide compares the two for storage and interchange and shows how to migrate. It is for anyone choosing a working format for a Python pipeline. It sits under Cloud-Native Geospatial Formats in Geospatial Data Ingestion & Processing Workflows.
Why This Approach / What Goes Wrong
The Shapefile's limits cause real data loss. Field names are silently truncated to ten characters, so population_density becomes population and collides with an existing population field, while median_income becomes median_inc. The .prj file is optional and frequently absent, leaving data with no CRS; this is one of the most common ingestion bugs and is addressed in Coordinate Reference System Transformations. The .shp and .dbf files are each capped at 2 GB. GeoParquet stores everything in one compressed columnar file with full field names and the CRS embedded in the schema metadata, and its row-group layout lets readers that support spatial filtering skip row groups on windowed reads. The only reason to write a Shapefile today is a downstream tool that accepts nothing else.
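To make the truncation problem concrete, the sketch below groups column names by their first ten characters and reports any groups that would collide. The function name `find_truncation_collisions` is hypothetical, and the model is a naive cut at ten characters; real writers may resolve clashes by appending a suffix rather than overwriting, so treat this as a pre-flight check, not a description of any particular driver.

```python
from collections import defaultdict

DBF_NAME_LIMIT = 10  # dBase field names hold at most 10 characters


def find_truncation_collisions(names, limit=DBF_NAME_LIMIT):
    """Group full column names by the short name a naive truncation gives."""
    groups = defaultdict(list)
    for name in names:
        groups[name[:limit]].append(name)
    # Keep only short names claimed by more than one original column.
    return {short: full for short, full in groups.items() if len(full) > 1}


columns = ["population", "population_density", "median_income"]
print(find_truncation_collisions(columns))
# {'population': ['population', 'population_density']}
```

An empty result means the names survive truncation without clashing, although they may still be shortened.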
Prerequisites
- geopandas>=0.14
- pyarrow>=15 (GeoParquet engine)
conda install -c conda-forge "geopandas=0.14.*" "pyarrow=15.*"
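A quick way to confirm the environment meets these minimums before running the pipeline is sketched below. It uses only the standard library; the helper names (`parse_release`, `meets_minimum`, `check_requirements`) are hypothetical, and the version comparison is deliberately simple, ignoring pre-release tags.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_release(text):
    """Turn '0.14.3' into (0, 14, 3); stops at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """True if the installed release is at least the minimum release."""
    return parse_release(installed) >= parse_release(minimum)


def check_requirements(minimums):
    """Return {package: problem} for anything missing or too old."""
    problems = {}
    for package, minimum in minimums.items():
        try:
            found = version(package)
        except PackageNotFoundError:
            problems[package] = "not installed"
            continue
        if not meets_minimum(found, minimum):
            problems[package] = f"found {found}, need >= {minimum}"
    return problems


# An empty dict means both packages satisfy the prerequisites above.
print(check_requirements({"geopandas": "0.14", "pyarrow": "15"}))
```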
Step-by-Step Implementation
1. Read a legacy Shapefile and inspect the damage.
import geopandas as gpd
legacy = gpd.read_file("census_legacy.shp")
print(legacy.columns.tolist()) # ['populatio', 'median_inc', 'geometry'] — truncated
print(legacy.crs) # None if the .prj was missing
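Before trusting what read_file returns, it can help to check which sidecar files actually arrived with the dataset. The sketch below is one way to do that with pathlib; `missing_sidecars` and the `EXPECTED_SIDECARS` tuple are hypothetical names, and the check assumes lowercase extensions, which may not hold on case-sensitive filesystems.

```python
from pathlib import Path

# Sidecars most readers rely on; .cpg (attribute encoding) is optional.
EXPECTED_SIDECARS = (".shp", ".shx", ".dbf", ".prj", ".cpg")


def missing_sidecars(shp_path, expected=EXPECTED_SIDECARS):
    """List the expected extensions that have no file next to shp_path."""
    base = Path(shp_path)
    return [ext for ext in expected if not base.with_suffix(ext).exists()]


# A missing '.prj' here explains a CRS of None in the step above.
print(missing_sidecars("census_legacy.shp"))
```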
2. Repair names and CRS, then write GeoParquet.
legacy = legacy.rename(columns={"populatio": "population", "median_inc": "median_income"})
if legacy.crs is None:
    legacy = legacy.set_crs(epsg=25832) # assert the known source CRS
legacy.to_parquet("census.parquet") # full names + CRS now embedded
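To see where the CRS ends up, one option is to look at the "geo" entry that GeoParquet keeps in the Parquet schema metadata. The sketch below summarizes that entry; `summarize_geo_metadata` and `inspect_file` are hypothetical helpers, and the key names follow the GeoParquet specification as I understand it (a missing "crs" key means the default OGC:CRS84, an explicit null means unknown).

```python
import json


def summarize_geo_metadata(schema_metadata):
    """Summarize the GeoParquet 'geo' entry from Parquet schema metadata."""
    raw = (schema_metadata or {}).get(b"geo")
    if raw is None:
        raise ValueError("no 'geo' key: plain Parquet, not GeoParquet")
    geo = json.loads(raw)
    primary = geo["primary_column"]
    column = geo["columns"][primary]
    if "crs" not in column:
        crs_state = "default OGC:CRS84"  # the spec's fallback when omitted
    elif column["crs"] is None:
        crs_state = "unknown"  # explicitly recorded as undefined
    else:
        crs_state = "embedded"  # a PROJJSON object travels with the file
    return {
        "version": geo.get("version"),
        "primary_column": primary,
        "encoding": column.get("encoding"),
        "crs_state": crs_state,
    }


def inspect_file(path):
    """Read only the schema of a file on disk and summarize it."""
    import pyarrow.parquet as pq  # imported lazily; needed only for real files

    return summarize_geo_metadata(pq.read_schema(path).metadata)
```

Calling inspect_file("census.parquet") after the write above should report a crs_state of "embedded" if the CRS was set before writing.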
3. Compare on-disk size and round-trip fidelity.
import os
shp_bytes = sum(
    os.path.getsize(f"census_legacy{ext}")
    for ext in (".shp", ".shx", ".dbf", ".prj")
    if os.path.exists(f"census_legacy{ext}") # the .prj may be missing
)
pq_bytes = os.path.getsize("census.parquet")
print(f"Shapefile set: {shp_bytes/1e6:.1f} MB GeoParquet: {pq_bytes/1e6:.1f} MB")
# Shapefile set: 88.4 MB GeoParquet: 19.7 MB
4. Export Shapefile only when a legacy consumer demands it.
gpd.read_parquet("census.parquet").to_file("for_legacy_tool.shp") # accept the truncation
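If the export is unavoidable, choosing the short names yourself is usually safer than accepting whatever the writer produces. The sketch below builds a mapping from full names to unique names of at most ten characters; `make_dbf_safe_names` is a hypothetical helper, and the "_1", "_2" suffix scheme is just one possible convention. It treats names as case-sensitive, which some consumers may not.

```python
def make_dbf_safe_names(names, limit=10):
    """Map each full name to a unique name of at most `limit` characters."""
    mapping, used = {}, set()
    for name in names:
        candidate = name[:limit]
        counter = 1
        while candidate in used:
            # Shorten further to make room for a numeric suffix.
            suffix = f"_{counter}"
            candidate = name[: limit - len(suffix)] + suffix
            counter += 1
        used.add(candidate)
        mapping[name] = candidate
    return mapping


names = ["population", "population_density", "median_income"]
print(make_dbf_safe_names(names))
# {'population': 'population', 'population_density': 'populati_1', 'median_income': 'median_inc'}
```

Apply the mapping to the non-geometry columns with rename(columns=...) before calling to_file, and keep the inverted mapping alongside the export so the full names can be restored later.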
Verification
Confirm GeoParquet preserved the full schema and CRS that Shapefile lost.
import geopandas as gpd
restored = gpd.read_parquet("census.parquet")
assert "population" in restored.columns and "median_income" in restored.columns
assert restored.crs.to_epsg() == 25832
assert len(restored) == len(legacy)
print("Columns:", restored.columns.tolist())
print("CRS preserved:", restored.crs.to_epsg()) # CRS preserved: 25832
Edge Cases & Debugging
- CRS is None after reading a Shapefile. The .prj was missing; set_crs to the known source EPSG before anything else.
- Mangled column names. Shapefile truncation; rename explicitly, because the original names cannot be recovered automatically.
- to_parquet fails. pyarrow not installed or too old.
- Other tools can't read GeoParquet. Older GIS software predates it; export a Shapefile or GeoPackage for those, keep GeoParquet internally.
- Categorical/datetime columns differ after round trip. Parquet preserves dtypes that a Shapefile round trip coerces to strings. This is usually an improvement, but check downstream assumptions.
- Multi-layer needs. Shapefile is one layer per file; if you need many layers in one file, use GeoPackage, not Shapefile.