Automating shapefile cleanup with Python: CRS Enforcement & Topology Repair

When ingesting legacy spatial datasets, Geospatial Data Ingestion & Processing Workflows frequently encounter corrupted geometries, missing CRS definitions, and attribute nulls. Automating shapefile cleanup with Python eliminates manual desktop GIS interventions. It ensures reproducible, version-controlled data pipelines for production environments.

This guide targets a single, high-impact workflow. We validate topologies, enforce explicit coordinate reference systems, and strip invalid records before downstream analysis. Ensure your ingestion layer correctly handles multi-part features first. Refer to our Shapefile & GeoJSON Parsing documentation for foundational I/O patterns.

Minimal Reproducible Code

import geopandas as gpd
from shapely.validation import make_valid
import warnings

def clean_shapefile(input_path: str, output_path: str, target_crs: str = "EPSG:4326") -> None:
 # Load dataset with explicit error handling
 gdf = gpd.read_file(input_path)

 # 1. Explicit CRS validation & transformation
 target_epsg = int(target_crs.split(":")[1])
 if gdf.crs is None:
 warnings.warn("No CRS detected. Assigning target CRS directly. Verify spatial accuracy.")
 gdf.set_crs(target_crs, inplace=True)
 elif gdf.crs.to_epsg() != target_epsg:
 gdf = gdf.to_crs(target_crs)

 # 2. Topology validation & repair
 invalid_mask = ~gdf.geometry.is_valid
 if invalid_mask.any():
 print(f"Repairing {invalid_mask.sum()} invalid geometries...")
 gdf.loc[invalid_mask, "geometry"] = gdf.loc[invalid_mask, "geometry"].apply(make_valid)

 # 3. Drop null geometries & empty features
 gdf = gdf.dropna(subset=["geometry"])
 gdf = gdf[~gdf.geometry.is_empty]

 # 4. Export cleaned dataset
 gdf.to_file(output_path, driver="ESRI Shapefile")
 print(f"Cleanup complete. Saved to {output_path}")

Implementation Explanation

The script enforces deterministic behavior by validating gdf.crs upfront. Shapefiles frequently ship with missing .prj files or outdated datum definitions. The to_crs() method handles datum shifts safely during coordinate transformation. The set_crs() fallback assigns metadata when projection files are absent.

Topology repair leverages shapely.validation.make_valid. This function resolves self-intersections and ring orientation errors. It operates without fragmenting valid polygon boundaries.

Finally, dropna and is_empty filters guarantee pipeline stability. Downstream spatial joins and web mapping renderers will not crash on null features. The output remains strictly compliant with ESRI Shapefile specifications.

Edge Cases & Debugging