DBSCAN vs HDBSCAN for Spatial Clustering
DBSCAN and HDBSCAN both find dense groups of points without being told how many clusters to expect, but they differ in one decisive way: DBSCAN uses a single fixed density threshold, while HDBSCAN adapts across densities. This guide compares them on spatial point data and shows when each is right. It is for anyone clustering incidents, sightings, or sensor hits. It sits under Spatial Clustering Algorithms in Spatial Analysis & Advanced Query Techniques.
Why This Approach / What Goes Wrong
DBSCAN treats a point as a core point when at least min_samples points (itself included) lie within distance eps, grows clusters outward from connected core points, and labels whatever remains unreachable as noise. It is fast and intuitive, but eps is a single global distance, so on data with both dense downtown clusters and sparse rural ones, no single eps fits: a small eps throws the rural points into noise, and a large eps smears the dense clusters together. HDBSCAN removes eps entirely, builds a hierarchy of clusters across density scales, and extracts the most stable ones, so it copes with varying density and needs only min_cluster_size. The most common mistake with both is clustering raw geographic coordinates. Distances must reflect real ground distance, so project to a metric CRS first, or pass metric="haversine" with coordinates in radians. Never apply Euclidean distance to raw degrees.
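The single-eps problem can be made concrete with a small NumPy-only sketch. It does not run DBSCAN itself; it only applies the core-point test (at least min_samples points within eps, the point itself included) to two hypothetical regular grids, one dense and one sparse. The helper names `core_fraction` and `grid`, and the 50 m and 800 m spacings, are illustrative assumptions rather than anything from the dataset used in this guide.

```python
import numpy as np

def core_fraction(points, eps, min_samples):
    """Fraction of points that pass the core-point test for a given eps."""
    # Brute-force pairwise Euclidean distances; fine for a few hundred points.
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Neighbour count includes the point itself (distance 0).
    neighbours = (dist <= eps).sum(axis=1)
    return float((neighbours >= min_samples).mean())

def grid(spacing, n=10):
    """n x n square grid of points with the given spacing in metres."""
    axis = np.arange(n) * spacing
    xx, yy = np.meshgrid(axis, axis)
    return np.column_stack([xx.ravel(), yy.ravel()]).astype(float)

downtown = grid(spacing=50)   # hypothetical dense area: a point every 50 m
rural = grid(spacing=800)     # hypothetical sparse area: a point every 800 m

for eps in (250, 1000):
    print(
        f"eps={eps:>4} m  "
        f"downtown core fraction={core_fraction(downtown, eps, 5):.2f}  "
        f"rural core fraction={core_fraction(rural, eps, 5):.2f}"
    )
```

At 250 m no rural point has any neighbor but itself, so the whole sparse grid would be noise. At 1000 m the interior rural points qualify, but that radius is larger than the entire downtown grid, so every downtown neighborhood covers all of it and any finer structure there is lost.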
Prerequisites
- geopandas>=0.14
- scikit-learn>=1.4 (ships both DBSCAN and HDBSCAN)
- numpy>=1.26
conda install -c conda-forge "geopandas=0.14.*" "scikit-learn=1.4.*" "numpy=1.26.*"
Step-by-Step Implementation
1. Load points and project to a metric CRS so distances are in metres.
import geopandas as gpd
import numpy as np
# crime_incidents: point events across a metro region
crime_incidents = gpd.read_file("crime_incidents.gpkg").to_crs(epsg=25832)
coords_m = np.column_stack([crime_incidents.geometry.x, crime_incidents.geometry.y])
2. DBSCAN with an explicit metric eps (e.g. 250 m, ≥5 points).
from sklearn.cluster import DBSCAN
db = DBSCAN(eps=250, min_samples=5).fit(coords_m)
crime_incidents["dbscan"] = db.labels_ # -1 = noise
3. HDBSCAN with no eps — only a minimum cluster size.
from sklearn.cluster import HDBSCAN
hdb = HDBSCAN(min_cluster_size=15).fit(coords_m)
crime_incidents["hdbscan"] = hdb.labels_ # -1 = noise
crime_incidents["hdbscan_prob"] = hdb.probabilities_ # membership strength
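The membership strength can be used to keep only points that HDBSCAN assigns confidently. The sketch below is one possible way to do that with plain NumPy; the helper name `confident_members` and the 0.5 cutoff are hypothetical choices for illustration, not values recommended by the library.

```python
import numpy as np

def confident_members(labels, probabilities, min_prob=0.5):
    """Boolean mask: True for non-noise points with membership >= min_prob."""
    labels = np.asarray(labels)
    probabilities = np.asarray(probabilities)
    # Noise is labelled -1; min_prob is an assumed, tunable cutoff.
    return (labels != -1) & (probabilities >= min_prob)

# Tiny made-up example: two clusters, one noise point, one weak member.
example_labels = np.array([0, 0, 1, -1, 1])
example_probs = np.array([1.0, 0.3, 0.8, 0.0, 0.55])

mask = confident_members(example_labels, example_probs)
print(mask)
```

On the data from this guide, the same mask could be built from crime_incidents["hdbscan"] and crime_incidents["hdbscan_prob"] and used to filter the GeoDataFrame before mapping.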
4. Summarize how many groups each algorithm found.
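def summarize(labels):
    # Convert first: on a pandas Series, `-1 in labels` tests the index, not the values
    labels = np.asarray(labels)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = int((labels == -1).sum())
    return n_clusters, n_noise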
print("DBSCAN :", summarize(crime_incidents["dbscan"])) # DBSCAN : (38, 5120)
print("HDBSCAN:", summarize(crime_incidents["hdbscan"])) # HDBSCAN: (52, 3987)
Verification
Confirm clustering ran in metric space and that labels are sane.
assert crime_incidents.crs.is_projected, "Cluster in a metric CRS, not degrees"
# Every point is labelled (cluster id or -1 noise), none left unassigned
assert crime_incidents["dbscan"].notna().all()
assert crime_incidents["hdbscan"].notna().all()
# HDBSCAN noise points carry ~0 membership probability
noise_mask = crime_incidents["hdbscan"] == -1
assert (crime_incidents.loc[noise_mask, "hdbscan_prob"] < 1e-6).all()  # also passes when there is no noise
print("Both clusterings labelled all", len(crime_incidents), "points")
Edge Cases & Debugging
- Everything is one giant cluster. eps too large for DBSCAN; lower it, or switch to HDBSCAN if density varies.
- Everything is noise. eps too small or min_samples too high; relax them.
- Clusters make no geographic sense. You clustered in degrees; reproject to a metric CRS first.
- HDBSCAN slower than expected. It is heavier than DBSCAN; subsample for exploration, then run on the full set.
- Need lat/lon directly. Use metric="haversine" with coordinates in radians (np.radians), not Euclidean on degrees.
- Unstable cluster ids across runs. Cluster numbering is arbitrary; join on geometry/attributes, not on the raw label integer.
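For the lat/lon case in the list above, a minimal sketch of DBSCAN with the haversine metric might look like the following. Two details are easy to miss: scikit-learn's haversine metric expects columns in [latitude, longitude] order in radians, and eps is then an angle in radians, so a distance in metres has to be divided by the Earth's radius. The sample coordinates and the mean-radius constant are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius in metres (spherical approximation)

# Hypothetical sample: three points within roughly 150 m of each other,
# plus one point hundreds of kilometres away.
lat_lon_deg = np.array([
    [52.5200, 13.4050],
    [52.5205, 13.4055],
    [52.5195, 13.4045],
    [48.1370, 11.5750],
])

lat_lon_rad = np.radians(lat_lon_deg)  # [lat, lon] order, in radians
eps_rad = 250 / EARTH_RADIUS_M         # 250 m expressed as an angle in radians

labels = DBSCAN(
    eps=eps_rad,
    min_samples=3,
    metric="haversine",
    algorithm="ball_tree",  # BallTree supports the haversine metric
).fit(lat_lon_rad).labels_

print(labels)  # the three nearby points share a cluster; the distant one is noise (-1)
```

Haversine treats the Earth as a sphere, so for city-scale work a projected CRS as in step 1 is usually the simpler and more precise route; this variant is mainly useful when the data spans several projection zones.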