DBSCAN vs HDBSCAN for Spatial Clustering

DBSCAN and HDBSCAN both find dense groups of points without being told how many clusters to expect, but they differ in one decisive way: DBSCAN uses a single fixed density threshold, while HDBSCAN adapts across densities. This guide compares them on spatial point data and shows when each is right. It is for anyone clustering incidents, sightings, or sensor hits.

Figure: DBSCAN versus HDBSCAN comparison. DBSCAN uses one fixed epsilon distance and is fast but struggles with varying density; HDBSCAN adapts to varying density, needs no epsilon, and ranks cluster stability.

DBSCAN (fixed density) | HDBSCAN (variable density)
Needs eps (metres) + min_samples | No eps; min_cluster_size only
One global density threshold | Adapts to local density
Fast, simple, well understood | Slower; stability scores
Mislabels mixed-density data | Handles mixed density well
Noise = below eps density | Noise + soft membership

Rule of thumb: uniform density and a known scale → DBSCAN; varying density or unknown eps → HDBSCAN.
The split is density: DBSCAN's single epsilon versus HDBSCAN's adaptive, parameter-light hierarchy.

Why This Approach / What Goes Wrong

DBSCAN groups points that have at least min_samples neighbors within distance eps; everything else is noise. It is fast and intuitive, but eps is a single global distance, so on data with both dense downtown clusters and sparse rural ones, no single eps fits: a small eps discards the rural points as noise, and a large eps smears the dense clusters together. HDBSCAN removes eps entirely, builds a hierarchy of clusters across density scales, and extracts the most stable ones, so it copes with varying density and needs only min_cluster_size. The universal mistake with both is clustering raw geographic coordinates: distances must be in real-world units, so project to a metric CRS first, or pass metric="haversine" with [latitude, longitude] in radians (in which case DBSCAN's eps is also an angle in radians). Never use raw degrees with Euclidean distance.
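
To make the degrees problem concrete, the following standard-library sketch measures the same 0.01 degree step in longitude at two latitudes and converts a 250 m radius into the radians that a haversine-based eps would need. The names haversine_m and EARTH_RADIUS_M are illustrative and not part of the original workflow, and the mean Earth radius used is an assumption; any standard value gives similar results.

```python
import math

EARTH_RADIUS_M = 6_371_008.8  # assumed mean Earth radius in metres

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

# The same 0.01 degree step in longitude covers very different ground distances.
step_equator = haversine_m(0.0, 10.0, 0.0, 10.01)   # about 1112 m
step_lat60 = haversine_m(60.0, 10.0, 60.0, 10.01)   # about 556 m, roughly half

# With metric="haversine", distances are angles, so a 250 m radius
# has to be expressed in radians before it is used as eps.
eps_rad = 250 / EARTH_RADIUS_M

print(round(step_equator), round(step_lat60), eps_rad)
```

Because Euclidean distance on degrees would treat both steps as 0.01 units, a single eps in degrees would mean a different ground radius at every latitude, which is why projecting first is usually the simpler route.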

Prerequisites

conda install -c conda-forge "geopandas=0.14.*" "scikit-learn=1.4.*" "numpy=1.26.*"

Step-by-Step Implementation

1. Load points and project to a metric CRS so distances are in metres.

import geopandas as gpd
import numpy as np

# crime_incidents: point events across a metro region
crime_incidents = gpd.read_file("crime_incidents.gpkg").to_crs(epsg=25832)
coords_m = np.column_stack([crime_incidents.geometry.x, crime_incidents.geometry.y])

2. DBSCAN with an explicit metric eps (e.g. 250 m, ≥5 points).

from sklearn.cluster import DBSCAN

db = DBSCAN(eps=250, min_samples=5).fit(coords_m)
crime_incidents["dbscan"] = db.labels_      # -1 = noise

3. HDBSCAN with no eps — only a minimum cluster size.

from sklearn.cluster import HDBSCAN

hdb = HDBSCAN(min_cluster_size=15).fit(coords_m)
crime_incidents["hdbscan"] = hdb.labels_     # -1 = noise
crime_incidents["hdbscan_prob"] = hdb.probabilities_  # membership strength

4. Summarize how many groups each algorithm found.

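def summarize(labels):
    # Work on a NumPy array: on a pandas Series, `-1 in labels` tests the index, not the values
    labels = np.asarray(labels)
    n_clusters = len(set(labels.tolist())) - (1 if -1 in labels else 0)
    n_noise = int((labels == -1).sum())
    return n_clusters, n_noise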

print("DBSCAN :", summarize(crime_incidents["dbscan"]))   # DBSCAN : (38, 5120)
print("HDBSCAN:", summarize(crime_incidents["hdbscan"]))  # HDBSCAN: (52, 3987)
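
One plausible reason DBSCAN reports more noise on mixed-density data is that sparse areas never reach the core-point threshold at an eps tuned for dense areas. The toy sketch below counts core points by hand on hypothetical coordinates in metres; core_point_mask is an illustrative helper, not scikit-learn's implementation, and it assumes min_samples counts the point itself.

```python
import math

def core_point_mask(points, eps, min_samples):
    """True for each point with at least min_samples points (itself included) within eps."""
    mask = []
    for p in points:
        neighbours = sum(1 for q in points if math.dist(p, q) <= eps)
        mask.append(neighbours >= min_samples)
    return mask

# Hypothetical toy data: a tight downtown block and a loose rural strip.
downtown = [(x * 20.0, y * 20.0) for x in range(3) for y in range(3)]  # 9 points, 20 m apart
rural = [(5000.0 + i * 400.0, 0.0) for i in range(6)]                  # 6 points, 400 m apart
points = downtown + rural

tight = core_point_mask(points, eps=250, min_samples=5)
loose = core_point_mask(points, eps=1000, min_samples=5)

print(tight[9:])  # [False, False, False, False, False, False]
print(loose[9:])  # [False, False, True, True, False, False]
```

At eps=250 the rural strip has no core points, so DBSCAN would treat it as noise. At eps=1000 the two middle rural points become core and the strip can form a cluster, but that radius is 50 times the downtown spacing, so separate dense blocks within a kilometre of each other would merge.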

Verification

Confirm clustering ran in metric space and that labels are sane.

assert crime_incidents.crs.is_projected, "Cluster in a metric CRS, not degrees"

# Every point is labelled (cluster id or -1 noise), none left unassigned
assert crime_incidents["dbscan"].notna().all()
assert crime_incidents["hdbscan"].notna().all()

# HDBSCAN noise points carry ~0 membership probability
noise_mask = crime_incidents["hdbscan"] == -1
assert (crime_incidents.loc[noise_mask, "hdbscan_prob"] < 1e-6).all()  # also passes when there is no noise
print("Both clusterings labelled all", len(crime_incidents), "points")
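
A further sanity check, sketched here with the standard library only, is to look at the noise share and the cluster-size distribution. The helper size_report and the toy labels are hypothetical; the only assumption carried over from the steps above is that -1 marks noise.

```python
from collections import Counter

def size_report(labels):
    """Return (noise_share, sizes), where sizes lists cluster sizes, largest first."""
    labels = [int(v) for v in labels]   # accepts a list, NumPy array, or pandas Series
    counts = Counter(labels)
    noise = counts.pop(-1, 0)           # -1 is the noise label
    sizes = sorted(counts.values(), reverse=True)
    noise_share = noise / len(labels) if labels else 0.0
    return noise_share, sizes

# Hypothetical labels for ten points: two clusters and three noise points.
toy_labels = [0, 0, 0, 0, 1, 1, 1, -1, -1, -1]
noise_share, sizes = size_report(toy_labels)
print(noise_share, sizes)  # 0.3 [4, 3]
```

On the real data this could be called as size_report(crime_incidents["hdbscan"]). A noise share close to 1.0, or a single cluster holding nearly every point, usually suggests that eps or min_cluster_size deserves another look.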

Edge Cases & Debugging