GeoParquet vs Shapefile for Storage

The Shapefile has been the default vector format for thirty years, and it shows: 10-character field names, a 2 GB ceiling, a CRS in a separate file that goes missing, and multiple sidecar files per dataset. GeoParquet fixes all of it. This guide compares the two for storage and interchange and shows how to migrate. It is for anyone choosing a working format for a Python pipeline. It sits under Cloud-Native Geospatial Formats in Geospatial Data Ingestion & Processing Workflows.

Figure: GeoParquet (modern) versus Shapefile (legacy).
GeoParquet: single columnar file; full-length field names; CRS embedded in schema; compressed, no size cap; row-group bbox skipping.
Shapefile: .shp, .shx, .dbf, .prj sidecars; names truncated to 10 characters; CRS in .prj, easily lost; 2 GB limit per .shp/.dbf; no internal spatial index.
Use GeoParquet for pipelines; export Shapefile only for legacy consumers that require it.
GeoParquet wins on integrity, size, and cloud-readiness; Shapefile persists only for backward compatibility.

Why This Approach / What Goes Wrong

The Shapefile's limits cause real data loss. Field names are silently truncated to ten characters, so population_density becomes population and collides with an existing population field. The .prj file is optional and frequently absent, leaving data with no CRS, one of the most common ingestion bugs, addressed in Coordinate Reference System Transformations. Each .shp and .dbf file is capped at 2 GB. GeoParquet stores everything in one compressed columnar file with full field names and the CRS embedded in the schema, and it supports row-group skipping for windowed reads. The only reason to write a Shapefile today is a downstream tool that accepts nothing else.
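The truncation collision described above can be checked before any Shapefile is written. The following is a minimal sketch in plain Python; the helper name truncation_collisions and the sample column list are illustrative assumptions, and real writers may rename colliding fields (for example with a numeric suffix) rather than overwrite them.

```python
from collections import defaultdict

DBF_NAME_LIMIT = 10  # dBASE field names hold at most 10 characters


def truncation_collisions(columns, limit=DBF_NAME_LIMIT):
    """Map each truncated name to the original names that would share it."""
    groups = defaultdict(list)
    for name in columns:
        groups[name[:limit]].append(name)
    # Keep only truncated names claimed by more than one original column.
    return {short: names for short, names in groups.items() if len(names) > 1}


# Hypothetical attribute columns (geometry is stored in the .shp, not the .dbf).
columns = ["population", "population_density", "median_income"]
print(truncation_collisions(columns))
# {'population': ['population', 'population_density']}
```

An empty result means every attribute name survives truncation without clashing, although names longer than ten characters are still shortened.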

Prerequisites

conda install -c conda-forge "geopandas=0.14.*" "pyarrow=15.*"

Step-by-Step Implementation

1. Read a legacy Shapefile and inspect the damage.

import geopandas as gpd

legacy = gpd.read_file("census_legacy.shp")
print(legacy.columns.tolist())      # ['population', 'median_inc', 'geometry']: median_income was cut to 10 characters
print(legacy.crs)                   # None if the .prj was missing

2. Repair names and CRS, then write GeoParquet.

legacy = legacy.rename(columns={"median_inc": "median_income"})  # restore the full name
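A missing .prj can also be caught before the file is opened. The sketch below uses only the standard library; the function name sidecar_report and the split into required and optional extensions are assumptions for illustration, and it looks for lower-case extensions only.

```python
from pathlib import Path

REQUIRED = (".shp", ".shx", ".dbf")   # geometry, index, attributes
OPTIONAL = (".prj", ".cpg")           # CRS and text encoding; often lost in transit


def sidecar_report(shp_path):
    """Return {extension: exists} for the files that accompany a .shp path."""
    shp = Path(shp_path)
    # with_suffix swaps only the final extension, so "data.v2.shp" -> "data.v2.prj".
    return {ext: shp.with_suffix(ext).exists() for ext in REQUIRED + OPTIONAL}


report = sidecar_report("census_legacy.shp")
missing = [ext for ext, present in report.items() if not present]
print("missing sidecars:", missing)
```

If ".prj" appears in the missing list, the CRS has to come from documentation or the data provider, as in the next step.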
if legacy.crs is None:
    legacy = legacy.set_crs(epsg=25832)   # assert the known source CRS

legacy.to_parquet("census.parquet")       # full names + CRS now embedded

3. Compare on-disk size and round-trip fidelity.

import os

# Sum only the sidecars that exist; the .prj may be missing (see step 1).
sidecars = [f"census_legacy{ext}" for ext in (".shp", ".shx", ".dbf", ".prj")]
shp_bytes = sum(os.path.getsize(path) for path in sidecars if os.path.exists(path))
pq_bytes = os.path.getsize("census.parquet")
print(f"Shapefile set: {shp_bytes/1e6:.1f} MB   GeoParquet: {pq_bytes/1e6:.1f} MB")
# Example output: Shapefile set: 88.4 MB   GeoParquet: 19.7 MB

4. Export Shapefile only when a legacy consumer demands it.

gpd.read_parquet("census.parquet").to_file("for_legacy_tool.shp")  # accept the truncation
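Rather than accepting whatever names the writer picks, one option is to choose the short names yourself and keep the mapping so the export can be reversed. This is a sketch under stated assumptions: shapefile_safe_names is a hypothetical helper, the "_1" suffix scheme is one arbitrary choice, and the geometry column should be left out of the input list.

```python
def shapefile_safe_names(columns, limit=10):
    """Build {original: short} with unique names no longer than `limit`."""
    mapping, used = {}, set()
    for name in columns:
        short = name[:limit]
        counter = 1
        # On a clash, shorten further and append a numeric suffix.
        while short in used:
            suffix = f"_{counter}"
            short = name[: limit - len(suffix)] + suffix
            counter += 1
        used.add(short)
        mapping[name] = short
    return mapping


print(shapefile_safe_names(["population", "population_density", "median_income"]))
# {'population': 'population', 'population_density': 'populati_1', 'median_income': 'median_inc'}
```

Passing the resulting mapping to rename(columns=...) before to_file makes the truncation explicit, and inverting the dictionary restores the full names if the Shapefile ever comes back.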

Verification

Confirm GeoParquet preserved the full schema and CRS that Shapefile lost.

import geopandas as gpd

restored = gpd.read_parquet("census.parquet")
assert "population" in restored.columns and "median_income" in restored.columns
assert restored.crs.to_epsg() == 25832
assert len(restored) == len(legacy)
print("Columns:", restored.columns.tolist())
print("CRS preserved:", restored.crs.to_epsg())   # CRS preserved: 25832
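The embedded CRS can also be inspected without GeoPandas, since GeoParquet keeps it as JSON under a "geo" key in the Parquet file metadata. The sketch below assumes pyarrow is installed and that the file follows the GeoParquet metadata layout (primary_column, columns, encoding, crs); both function names are illustrative, not part of any library.

```python
import json


def read_geo_metadata(path):
    """Return the raw 'geo' metadata bytes from a Parquet file, or None."""
    import pyarrow.parquet as pq  # lazy import: the parser below needs only json

    metadata = pq.read_schema(path).metadata or {}
    return metadata.get(b"geo")


def summarize_geo_metadata(raw):
    """Extract version, geometry column, encoding and CRS name from the 'geo' JSON."""
    if raw is None:
        raise ValueError("no 'geo' key: this is plain Parquet, not GeoParquet")
    geo = json.loads(raw)
    primary = geo["primary_column"]
    column = geo["columns"][primary]
    # "crs" is a PROJJSON object; per the spec a missing key implies OGC:CRS84
    # and an explicit null means the CRS is unknown.
    crs = column.get("crs")
    return {
        "version": geo.get("version"),
        "primary_column": primary,
        "encoding": column.get("encoding"),
        "crs_name": crs.get("name") if isinstance(crs, dict) else crs,
    }
```

Calling summarize_geo_metadata(read_geo_metadata("census.parquet")) on the file written in step 2 should report the geometry column, its encoding, and the name of the CRS asserted there.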

Edge Cases & Debugging