Automating Shapefile Cleanup with Python: CRS Enforcement & Topology Repair
Geospatial Data Ingestion & Processing Workflows that handle legacy spatial datasets frequently encounter corrupted geometries, missing CRS definitions, and null attributes. Automating shapefile cleanup with Python removes manual desktop GIS steps and keeps production data pipelines reproducible and version-controlled.
This guide targets a single, high-impact workflow: validate topologies, enforce explicit coordinate reference systems, and strip invalid records before downstream analysis. Ensure your ingestion layer correctly handles multi-part features first. Refer to our Shapefile & GeoJSON Parsing documentation for foundational I/O patterns.
Minimal Reproducible Code
import warnings

import geopandas as gpd
from shapely.validation import make_valid


def clean_shapefile(input_path: str, output_path: str, target_crs: str = "EPSG:4326") -> None:
    """
    Load a shapefile, enforce CRS, repair invalid geometries,
    drop nulls/empties, and write a clean output.
    """
    gdf = gpd.read_file(input_path)

    # 1. Explicit CRS validation & transformation
    # Assumes target_crs is an "EPSG:<code>" string.
    target_epsg = int(target_crs.split(":")[1])
    if gdf.crs is None:
        warnings.warn(
            "No CRS detected. Assigning target CRS directly. Verify spatial accuracy."
        )
        gdf = gdf.set_crs(target_crs)
    elif gdf.crs.to_epsg() != target_epsg:
        gdf = gdf.to_crs(target_crs)

    # 2. Drop null geometries first: make_valid raises on None
    gdf = gdf.dropna(subset=["geometry"])

    # 3. Topology validation & repair
    invalid_mask = ~gdf.geometry.is_valid
    if invalid_mask.any():
        print(f"Repairing {invalid_mask.sum()} invalid geometries...")
        gdf.loc[invalid_mask, "geometry"] = (
            gdf.loc[invalid_mask, "geometry"].apply(make_valid)
        )

    # 4. Drop empty features, including any left empty by the repair
    gdf = gdf[~gdf.geometry.is_empty]

    # 5. Export cleaned dataset
    gdf.to_file(output_path, driver="ESRI Shapefile")
    print(f"Cleanup complete. {len(gdf)} features saved to {output_path}")
Implementation Explanation
The script enforces deterministic behavior by validating gdf.crs upfront. Shapefiles frequently ship with missing .prj files or outdated datum definitions. When a CRS is present, to_crs() reprojects the coordinates into the target CRS, including any datum shift. The set_crs() fallback only assigns metadata when the projection file is absent; it does not reproject coordinates. Because the script assigns the target CRS in that case, the result is correct only if the data is already stored in that CRS, so use the fallback only when you are certain of the source CRS.
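To make the difference between set_crs() and to_crs() concrete, here is a minimal sketch. The sample point and the EPSG:3857 target are illustrative assumptions and are not part of the cleanup script.

```python
import geopandas as gpd
from shapely.geometry import Point

# Hypothetical sample: one longitude/latitude point with no CRS metadata,
# as if the .prj file were missing.
gdf = gpd.GeoDataFrame({"name": ["sample"]}, geometry=[Point(10.0, 50.0)])

# set_crs only attaches metadata; the stored coordinates are unchanged.
labelled = gdf.set_crs("EPSG:4326")
labelled_xy = (labelled.geometry.iloc[0].x, labelled.geometry.iloc[0].y)

# to_crs transforms the coordinates into the new reference system.
projected = labelled.to_crs("EPSG:3857")
projected_xy = (projected.geometry.iloc[0].x, projected.geometry.iloc[0].y)

print("after set_crs:", labelled_xy)
print("after to_crs: ", projected_xy)
```

If set_crs() is applied with the wrong CRS, every later to_crs() call will transform from that wrong starting point, which is why the script emits a warning on this path.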
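Topology repair leverages shapely.validation.make_valid. This function resolves validity errors such as self-intersecting rings and returns geometries that are already valid unchanged. A repaired geometry can come back as a different type, for example a MultiPolygon or a GeometryCollection, when the input has to be split to become valid.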
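As a small illustration of what the repair step does, the sketch below runs make_valid on a hypothetical "bow-tie" polygon whose ring crosses itself. The coordinates are made up for the example.

```python
from shapely.geometry import Polygon
from shapely.validation import explain_validity, make_valid

# Hypothetical bow-tie: the two diagonal edges cross at (1, 1).
bowtie = Polygon([(0, 0), (2, 2), (2, 0), (0, 2)])

print(bowtie.is_valid)            # the ring self-intersects
print(explain_validity(bowtie))   # human-readable reason from GEOS

repaired = make_valid(bowtie)
print(repaired.geom_type, repaired.is_valid)

# A geometry that is already valid is handed back without changes.
square = Polygon([(0, 0), (1, 0), (1, 1), (0, 1)])
unchanged = make_valid(square)
```

The bow-tie is split at the crossing point into two triangles, which is why the repaired geometry type differs from the input.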
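The dropna and is_empty filters remove records with null or empty geometries, so downstream spatial joins and web mapping renderers do not fail on them. Attribute columns are left untouched. The result is written with the ESRI Shapefile driver; since that format cannot store a GeometryCollection, a repair that produces one needs extra handling before export.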
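The two filters catch different cases, which the following sketch tries to show on a hypothetical three-row frame: a null geometry is a missing value, while an empty geometry is a real object with no coordinates.

```python
import geopandas as gpd
from shapely.geometry import Point, Polygon

# Hypothetical sample: a normal point, a null geometry, and an empty polygon.
gdf = gpd.GeoDataFrame(
    {"id": [1, 2, 3]},
    geometry=[Point(0, 0), None, Polygon()],
)

null_mask = gdf.geometry.isna()      # True only for the missing geometry
empty_mask = gdf.geometry.is_empty   # True only for the empty polygon

# Keep rows that are neither null nor empty.
kept = gdf[~(null_mask | empty_mask)]
print(kept["id"].tolist())
```

Applying only one of the two filters would let the other kind of record through, so the script uses both.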
Edge Cases & Debugging
- CRS Mismatch Warnings: If gdf.crs returns None, verify the .prj file exists alongside the .shp. Legacy exports often strip projection metadata during batch conversion.
- Geometry Fragmentation: make_valid() may split invalid polygons into MultiPolygon geometries. Add .explode(index_parts=True) post-cleanup if your pipeline requires strict single-part features.
- Large File Memory Limits: For datasets exceeding 500 MB, avoid loading the entire file into RAM. Process in chunks using fiona iterators or migrate to dask-geopandas for distributed execution.
- Silent Failures: Wrap make_valid in a try/except block when encountering GEOSException. Log failed row indices to a separate CSV for manual topology inspection.
- Shapefile Field Name Truncation: ESRI Shapefile limits field names to 10 characters. If your attribute table has longer names, they will be silently truncated on write, so rename columns before exporting.
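The silent-failure advice above could be sketched roughly as follows. The helper name repair_and_log, its parameters, and the CSV column names are hypothetical. The sketch catches shapely.errors.ShapelyError, which is assumed here to be the base class that GEOS errors derive from, and it writes the log to any text stream so the example runs without touching the disk.

```python
import csv
import io
from typing import Callable, TextIO

import geopandas as gpd
from shapely.errors import ShapelyError
from shapely.geometry import Polygon
from shapely.validation import make_valid


def repair_and_log(
    gdf: gpd.GeoDataFrame,
    log_stream: TextIO,
    repair: Callable = make_valid,
) -> gpd.GeoDataFrame:
    """Repair invalid geometries and log rows whose repair raises."""
    writer = csv.writer(log_stream)
    writer.writerow(["row_index", "error"])
    geoms = []
    for idx, geom in gdf.geometry.items():
        if geom is not None and not geom.is_valid:
            try:
                geom = repair(geom)
            except ShapelyError as exc:
                # Keep the original geometry and record the row for manual review.
                writer.writerow([idx, str(exc)])
        geoms.append(geom)
    out = gdf.copy()
    out[gdf.geometry.name] = gpd.GeoSeries(geoms, index=gdf.index, crs=gdf.crs)
    return out


# Hypothetical sample data: one valid square and one self-intersecting bow-tie.
square = Polygon([(0, 0), (1, 0), (1, 1), (0, 1)])
bowtie = Polygon([(0, 0), (2, 2), (2, 0), (0, 2)])
sample = gpd.GeoDataFrame({"id": [1, 2]}, geometry=[square, bowtie])

log = io.StringIO()
cleaned = repair_and_log(sample, log)
print(cleaned.geometry.is_valid.tolist())
```

In a real pipeline the stream would typically be an open file, for example open("failed_rows.csv", "w", newline=""), where the file name is again only a placeholder. Rows that fail to repair keep their original geometry, so a later validity check can still decide whether to drop them.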