Automating shapefile cleanup with Python: CRS Enforcement & Topology Repair
When ingesting legacy spatial datasets, Geospatial Data Ingestion & Processing Workflows frequently encounter corrupted geometries, missing CRS definitions, and attribute nulls. Automating shapefile cleanup with Python eliminates manual desktop GIS interventions. It ensures reproducible, version-controlled data pipelines for production environments.
This guide targets a single, high-impact workflow. We validate topologies, enforce explicit coordinate reference systems, and strip invalid records before downstream analysis. Ensure your ingestion layer correctly handles multi-part features first. Refer to our Shapefile & GeoJSON Parsing documentation for foundational I/O patterns.
Minimal Reproducible Code
import geopandas as gpd
from shapely.validation import make_valid
import warnings
def clean_shapefile(input_path: str, output_path: str, target_crs: str = "EPSG:4326") -> None:
# Load dataset with explicit error handling
gdf = gpd.read_file(input_path)
# 1. Explicit CRS validation & transformation
target_epsg = int(target_crs.split(":")[1])
if gdf.crs is None:
warnings.warn("No CRS detected. Assigning target CRS directly. Verify spatial accuracy.")
gdf.set_crs(target_crs, inplace=True)
elif gdf.crs.to_epsg() != target_epsg:
gdf = gdf.to_crs(target_crs)
# 2. Topology validation & repair
invalid_mask = ~gdf.geometry.is_valid
if invalid_mask.any():
print(f"Repairing {invalid_mask.sum()} invalid geometries...")
gdf.loc[invalid_mask, "geometry"] = gdf.loc[invalid_mask, "geometry"].apply(make_valid)
# 3. Drop null geometries & empty features
gdf = gdf.dropna(subset=["geometry"])
gdf = gdf[~gdf.geometry.is_empty]
# 4. Export cleaned dataset
gdf.to_file(output_path, driver="ESRI Shapefile")
print(f"Cleanup complete. Saved to {output_path}")
Implementation Explanation
The script enforces deterministic behavior by validating gdf.crs upfront. Shapefiles frequently ship with missing .prj files or outdated datum definitions. The to_crs() method handles datum shifts safely during coordinate transformation. The set_crs() fallback assigns metadata when projection files are absent.
Topology repair leverages shapely.validation.make_valid. This function resolves self-intersections and ring orientation errors. It operates without fragmenting valid polygon boundaries.
Finally, dropna and is_empty filters guarantee pipeline stability. Downstream spatial joins and web mapping renderers will not crash on null features. The output remains strictly compliant with ESRI Shapefile specifications.
Edge Cases & Debugging
- CRS Mismatch Warnings: If
gdf.crsreturnsNone, verify the.prjfile exists. Legacy exports often strip projection metadata during batch conversion. - Geometry Fragmentation:
make_valid()may split invalid polygons intoMultiPolygongeometries. Add.explode(index_parts=True)post-cleanup if your pipeline requires strict single-part features. - Large File Memory Limits: For datasets exceeding 500MB, avoid loading the entire file into RAM. Process in chunks using
fionaiterators or migrate todask-geopandasfor distributed execution. - Silent Failures: Wrap
make_validin atry/exceptblock when encounteringGEOSException. Log failed row indices to a separate CSV for manual topology inspection.