GeoParquet vs Shapefile for Storage

The Shapefile has been the default vector format for thirty years, and it shows: field names limited to 10 characters, a 2 GB ceiling per component file, a CRS stored in a separate file that goes missing, and multiple sidecar files per dataset. GeoParquet removes each of these limits. This guide compares the two formats for storage and interchange and shows how to migrate. It is for anyone choosing a working format for a Python pipeline. It sits under Cloud-Native Geospatial Formats in Geospatial Data Ingestion & Processing Workflows.

GeoParquet wins on integrity, size, and cloud-readiness; Shapefile persists only for backward compatibility.

Why This Approach / What Goes Wrong

The Shapefile's limits cause real data loss. Field names are silently truncated to ten characters, so population_density becomes population and collides with an existing population field; depending on the writer, the clash is then resolved with a generated name such as populati_1. The .prj file is optional and frequently absent, leaving data with no CRS, one of the most common ingestion bugs, addressed in Coordinate Reference System Transformations. The .shp and .dbf files are each capped at 2 GB. GeoParquet stores everything in one compressed columnar file with full field names and the CRS embedded in the file metadata, and its row-group layout lets readers skip data outside a requested window, provided the file carries the bounding-box statistics for it. The only reason to write a Shapefile today is a downstream tool that accepts nothing else.
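The truncation problem can be checked before any file is written. The following is a minimal sketch in plain Python, not part of geopandas: the function name find_truncation_collisions and the sample column list are assumptions made for illustration. It groups column names by their first ten characters and reports the groups that would clash in a .dbf.

```python
from collections import defaultdict

DBF_NAME_LIMIT = 10  # maximum field-name length in a Shapefile's .dbf


def find_truncation_collisions(columns, limit=DBF_NAME_LIMIT):
    """Group column names that share the same first `limit` characters."""
    groups = defaultdict(list)
    for name in columns:
        groups[name[:limit]].append(name)
    # Keep only the prefixes claimed by more than one column.
    return {short: names for short, names in groups.items() if len(names) > 1}


# Hypothetical column list for illustration.
columns = ["population", "population_density", "median_income"]
print(find_truncation_collisions(columns))
# {'population': ['population', 'population_density']}
```

An empty result means every name survives truncation as a distinct field, although names longer than ten characters are still shortened.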

Prerequisites

geopandas>=0.14
pyarrow>=15 (GeoParquet engine)

conda install -c conda-forge "geopandas>=0.14" "pyarrow>=15"

Step-by-Step Implementation

1. Read a legacy Shapefile and inspect the damage.

import geopandas as gpd

legacy = gpd.read_file("census_legacy.shp")
print(legacy.columns.tolist())      # e.g. ['populatio', 'median_inc', 'geometry']: shortened names
print(legacy.crs)                   # None if the .prj was missing

2. Repair names and CRS, then write GeoParquet.

legacy = legacy.rename(columns={"populatio": "population", "median_inc": "median_income"})
if legacy.crs is None:
    legacy = legacy.set_crs(epsg=25832)   # declare the known source CRS (labels only, no reprojection)

legacy.to_parquet("census.parquet")       # full names + CRS now embedded

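3. Compare on-disk size.

import os

# Sum only the sidecar files that exist: the .prj (and .cpg) may be absent,
# and os.path.getsize raises FileNotFoundError on a missing path.
shp_bytes = sum(
    os.path.getsize(f"census_legacy{ext}")
    for ext in (".shp", ".shx", ".dbf", ".prj", ".cpg")
    if os.path.exists(f"census_legacy{ext}")
)
pq_bytes = os.path.getsize("census.parquet")
print(f"Shapefile set: {shp_bytes/1e6:.1f} MB   GeoParquet: {pq_bytes/1e6:.1f} MB")
# Example output; actual sizes depend on the data:
# Shapefile set: 88.4 MB   GeoParquet: 19.7 MB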

4. Export Shapefile only when a legacy consumer demands it.

gpd.read_parquet("census.parquet").to_file("for_legacy_tool.shp")  # accept the truncation
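Rather than accepting whatever names the writer produces, the short names can be chosen up front and recorded. The sketch below is one possible approach, not a geopandas feature: make_short_names, the numeric-suffix scheme, and the sample columns are all assumptions for illustration. It builds a mapping from full names to unique names of at most ten characters, plus the reverse mapping for restoring the full names later.

```python
def make_short_names(columns, limit=10):
    """Map each long column name to a unique name of at most `limit` characters."""
    mapping, used = {}, set()
    for name in columns:
        short = name[:limit]
        counter = 1
        # On a clash, replace the tail with a numeric suffix until the name is free.
        while short in used:
            suffix = f"_{counter}"
            short = name[: limit - len(suffix)] + suffix
            counter += 1
        used.add(short)
        mapping[name] = short
    return mapping


# Hypothetical attribute columns (geometry is excluded: it is not a .dbf field).
to_short = make_short_names(["population", "population_density", "median_income"])
to_long = {short: long for long, short in to_short.items()}  # keep this to undo the renaming
print(to_short)
# {'population': 'population', 'population_density': 'populati_1', 'median_income': 'median_inc'}
```

Applying rename(columns=to_short) to the GeoDataFrame before to_file makes the export deterministic, and storing to_long next to the Shapefile lets a later import restore the original names.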

Verification

Confirm GeoParquet preserved the full schema and CRS that Shapefile lost.

import geopandas as gpd

restored = gpd.read_parquet("census.parquet")
assert "population" in restored.columns and "median_income" in restored.columns
assert restored.crs == legacy.crs                 # EPSG:25832 in this example
assert len(restored) == len(legacy)
print("Columns:", restored.columns.tolist())
print("CRS preserved:", restored.crs.to_epsg())   # CRS preserved: 25832
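To see where the CRS actually lives, the file's metadata can be inspected without loading any geometry. This sketch assumes the file follows the GeoParquet specification, which stores a JSON document under the geo key of the Parquet schema metadata; the helper names parse_geo_metadata and read_geo_metadata are hypothetical, while pyarrow.parquet.read_schema is the real pyarrow call used to fetch the schema.

```python
import json


def parse_geo_metadata(schema_metadata):
    """Decode the 'geo' entry of a Parquet schema's key-value metadata."""
    raw = schema_metadata.get(b"geo")
    if raw is None:
        raise ValueError("no 'geo' metadata key: not a GeoParquet file")
    geo = json.loads(raw.decode("utf-8"))
    primary = geo["primary_column"]
    column = geo["columns"][primary]
    return {
        "primary_column": primary,
        "encoding": column.get("encoding"),
        # Per the spec, a missing "crs" key means OGC:CRS84; an explicit null means unknown.
        "crs": column.get("crs", "OGC:CRS84"),
    }


def read_geo_metadata(path):
    """Read only the schema of `path` and summarise its GeoParquet metadata."""
    import pyarrow.parquet as pq  # imported lazily so the parser above has no dependencies

    return parse_geo_metadata(pq.read_schema(path).metadata or {})


# Usage, assuming census.parquet from step 2 exists:
# summary = read_geo_metadata("census.parquet")
# print(summary["primary_column"], summary["encoding"])
```

When a CRS is present, the crs value is a PROJJSON dictionary, which is why the format does not depend on a sidecar file.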

Edge Cases & Debugging

CRS is None after reading a Shapefile. The .prj was missing; set_crs to the known source EPSG before anything else.
Mangled column names. Shapefile truncation; rename explicitly — there is no way to recover the original names automatically.
to_parquet fails. Usually an ImportError because pyarrow is not installed or is too old; install or upgrade it as listed under Prerequisites.
Other tools can't read GeoParquet. Older GIS software predates it; export a Shapefile or GeoPackage for those, keep GeoParquet internally.
Categorical/datetime columns differ after round trip. Parquet preserves dtypes that Shapefile cannot represent: the .dbf date type has no time component, and categoricals come back as plain strings. This is usually an improvement, but check downstream assumptions.
Multi-layer needs. Shapefile is one layer per file, and so is GeoParquet; if you need many layers in one file, use GeoPackage.
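For the to_parquet failure above, a quick environment check can confirm whether the prerequisites are met before the pipeline runs. This is a rough sketch using only the standard library; version_tuple and check_minimum are hypothetical helpers, and the version parsing is deliberately simple (it compares leading numeric components only), so treat it as a sanity check, not a full version comparison.

```python
from importlib.metadata import PackageNotFoundError, version


def version_tuple(text):
    """Turn '15.0.2' into (15, 0, 2), ignoring any non-numeric tail such as 'rc1'."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def check_minimum(package, minimum):
    """Return a problem description, or None if `package` meets `minimum`."""
    try:
        installed = version(package)
    except PackageNotFoundError:
        return f"{package} is not installed"
    if version_tuple(installed) < version_tuple(minimum):
        return f"{package} {installed} is older than {minimum}"
    return None


# Minimum versions taken from the Prerequisites section.
for package, minimum in (("geopandas", "0.14"), ("pyarrow", "15")):
    problem = check_minimum(package, minimum)
    print(package, "OK" if problem is None else f"PROBLEM: {problem}")
```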