Automating Shapefile Cleanup with Python: CRS Enforcement & Topology Repair

When ingesting legacy spatial datasets, Geospatial Data Ingestion & Processing Workflows frequently encounter corrupted geometries, missing CRS definitions, and attribute nulls. Automating shapefile cleanup with Python eliminates manual desktop GIS interventions and ensures reproducible, version-controlled data pipelines for production environments.

This guide targets a single, high-impact workflow: validate topologies, enforce explicit coordinate reference systems, and strip invalid records before downstream analysis. Ensure your ingestion layer correctly handles multi-part features first. Refer to our Shapefile & GeoJSON Parsing documentation for foundational I/O patterns.

Minimal Reproducible Code

import geopandas as gpd
from shapely.validation import make_valid
import warnings

def clean_shapefile(input_path: str, output_path: str, target_crs: str = "EPSG:4326") -> None:
    """
    Load a shapefile, enforce CRS, repair invalid geometries,
    drop nulls/empties, and write a clean output.
    """
    gdf = gpd.read_file(input_path)

    # 1. Explicit CRS validation & transformation
    target_epsg = int(target_crs.split(":")[1])
    if gdf.crs is None:
        warnings.warn(
            "No CRS detected. Assigning target CRS directly. Verify spatial accuracy."
        )
        gdf = gdf.set_crs(target_crs)
    elif gdf.crs.to_epsg() != target_epsg:
        gdf = gdf.to_crs(target_crs)

    # 2. Topology validation & repair
    invalid_mask = ~gdf.geometry.is_valid
    if invalid_mask.any():
        print(f"Repairing {invalid_mask.sum()} invalid geometries...")
        gdf.loc[invalid_mask, "geometry"] = (
            gdf.loc[invalid_mask, "geometry"].apply(make_valid)
        )

    # 3. Drop null geometries & empty features
    gdf = gdf.dropna(subset=["geometry"])
    gdf = gdf[~gdf.geometry.is_empty]

    # 4. Export cleaned dataset
    gdf.to_file(output_path, driver="ESRI Shapefile")
    print(f"Cleanup complete. {len(gdf)} features saved to {output_path}")

Implementation Explanation

The script enforces deterministic behavior by validating gdf.crs upfront. Shapefiles frequently ship with missing .prj files or outdated datum definitions. The to_crs() method handles datum shifts safely during coordinate transformation. The set_crs() fallback assigns metadata when projection files are absent — it does not reproject coordinates, so only use it when you are certain of the source CRS.

Topology repair leverages shapely.validation.make_valid. This function resolves self-intersections and ring orientation errors without fragmenting valid polygon boundaries.

The dropna and is_empty filters guarantee pipeline stability. Downstream spatial joins and web mapping renderers will not crash on null features. The output remains strictly compliant with the ESRI Shapefile format.

Edge Cases & Debugging