Chapter 5: Clustering for Unsupervised Pattern Discovery
Chapter Introduction
The four chapters before this one were all supervised: somebody told you what \(y\) was — a target return, a click-through rate, a regime label, a future state. Most of the data that actually arrives at a quantitative research desk has no such label. You have 3,000 stocks but no “ground truth” sector. You have a year of strategy returns from 200 PMs and no label saying “talented” or “untalented”. You have a million customer transaction histories and no master schema of behavioural types. Clustering is the discipline of finding structure in this kind of unlabelled data — partitioning observations into groups whose members resemble each other more than they resemble members of other groups.
Top hedge funds use clustering everywhere. Marcos López de Prado built his Hierarchical Risk Parity (HRP) portfolio construction around an agglomerative clustering of the correlation matrix, and the technique is now the default at several large asset managers. AQR clusters factor returns to find redundancy among “anomalies” reported in the academic literature. Two Sigma clusters alternative-data sources (satellite imagery features, credit-card aggregates) before any predictive modelling. Renaissance is widely believed to use density-based clustering to group recurring intraday price patterns. Citadel’s macro group clusters central-bank communications by topic. And outside finance the same machinery powers customer segmentation, gene expression analysis, anomaly detection in industrial IoT data, and recommender systems.
This chapter teaches five clustering algorithms — k-means, hierarchical agglomerative, DBSCAN, Gaussian mixture models, and spectral clustering — and one essential follow-up technique, cluster validation, that tells you whether the clusters you found are real or accidents. Each algorithm has different assumptions, different strengths, and a different failure mode, and a professional analyst picks the right one for the data instead of mechanically reaching for k-means. We close with a finance-specific application: the Hierarchical Risk Parity portfolio, where clustering and risk allocation come together.
As with the previous chapters, the example data is whatever illustrates the method most clearly. Synthetic 2-D blobs are easier to see than real returns, and we generate them inline. When a finance-specific demonstration is the cleanest illustration — clustering a 10-asset correlation matrix for HRP — we use it. None of the methods are finance-specific; all are used in finance because they generalise.
Table of Contents
- The Unsupervised Frame
- Similarity, Distance, and Standardisation
- K-Means — The Workhorse and Its Pitfalls
- Hierarchical Agglomerative Clustering and Dendrograms
- DBSCAN — Density-Based Clustering for Irregular Shapes
- Gaussian Mixture Models — Probabilistic Clustering
- Spectral Clustering — When the Geometry Is Wrong
- Validating Clusters — Silhouette, Gap, and Stability
- Hierarchical Risk Parity (HRP) — A Hedge-Fund Application
The Unsupervised Frame
Supervised learning answers “given \(x\), predict \(y\).” Unsupervised learning answers “given \(x\) alone, what structure is in it?” The two structural questions are:
- Clustering — discover discrete groups: customers, regimes, asset cohorts, document topics.
- Dimensionality reduction — discover a low-dimensional manifold the data lives near. Already met in Chapter 2 as PCA; we return to it in Chapter 6 with t-SNE and UMAP.
A clustering algorithm is a function \(C: \mathbb{R}^{n \times p} \to \{1, \dots, K\}^{n}\) that assigns each of \(n\) observations to one of \(K\) clusters. Without a ground-truth label, two questions immediately arise: how do you choose \(K\), and how do you know if the assignments are any good? Both questions get answers in §5.8 (validation). For now, the working principle is distance: observations in the same cluster should be close to each other; observations in different clusters should be far apart. Every algorithm below operationalises that principle differently.
Similarity, Distance, and Standardisation
All clustering rests on a notion of similarity between observations. The default is Euclidean distance:
\[ d(x, y) = \sqrt{\sum_{j=1}^{p} (x_j - y_j)^2}. \]
But other distances are appropriate in different settings:
- Manhattan (\(\ell_1\)) — robust to outliers; preferred when many features matter equally.
- Cosine — measures the angle between vectors, ignoring magnitude; the right metric for document similarity, sparse feature vectors, normalised return signatures.
- Mahalanobis — Euclidean after whitening by the covariance matrix; the right choice when features are on different scales and correlated. Already met in Chapter 1.
- Correlation distance — \(1 - \rho_{ij}\); the right choice for clustering asset return time series by co-movement, and the foundation of HRP.
Standardisation matters more than the choice of metric. If one column is in dollars (range: 0–10⁶) and another is in fractions (range: 0–1), Euclidean distance will be dominated by the dollar column. Standardise before clustering, or use a distance that is invariant to scale.
- `sklearn.preprocessing.StandardScaler()` — z-score: \(x \to (x - \mu) / \sigma\) per column.
- `sklearn.preprocessing.MinMaxScaler()` — affine to \([0, 1]\).
- `scipy.spatial.distance.pdist(X, metric='euclidean')` — pairwise distances; metrics include `'euclidean'`, `'cityblock'`, `'cosine'`, `'mahalanobis'`, `'correlation'`.
- `scipy.spatial.distance.squareform(d)` — convert a condensed distance vector into a square \(n \times n\) matrix.
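A minimal sketch of the scale problem and its fix, on tiny synthetic data (the column ranges mirror the dollars-vs-fractions example above; the variable names are ours):

```python
# Sketch: unstandardised features let the large-scale column dominate distances.
import numpy as np
from sklearn.preprocessing import StandardScaler
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
X = np.column_stack([
    rng.uniform(0, 1e6, size=5),   # dollar-scale column
    rng.uniform(0, 1, size=5),     # fractional column
])

d_raw = squareform(pdist(X, metric='euclidean'))
d_std = squareform(pdist(StandardScaler().fit_transform(X), metric='euclidean'))

# In d_raw the second column contributes essentially nothing to any distance;
# after z-scoring, both columns contribute on an equal footing.
print(d_raw.round(1))
print(d_std.round(2))
```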
K-Means — The Workhorse and Its Pitfalls
K-means partitions the data into \(K\) clusters by alternating two steps:
- Assignment — each point is assigned to the nearest cluster centre.
- Update — each centre is recomputed as the mean of points assigned to it.
The algorithm converges to a local optimum of the within-cluster sum of squares (WCSS): \[ \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2. \]
It is fast (linear in \(n\) per iteration), scalable, and the most-used clustering algorithm in the world. It is also wrong for many shapes. The implicit assumption is that each cluster is a roughly spherical, equal-sized blob. K-means fails when clusters are elongated, of very different sizes, or non-convex.
sklearn.cluster.KMeans
KMeans(n_clusters=K, n_init=10, random_state=0).fit(X) returns a fitted estimator. Key attributes: .labels_ (cluster assignment per point), .cluster_centers_ (the \(K\) centroids), .inertia_ (WCSS). For very large data, MiniBatchKMeans is the streaming variant. n_init=10 runs the algorithm from 10 random starts and keeps the best; never use n_init=1 on real data — k-means is prone to bad local optima.
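A minimal sketch of both canonical test cases, with the synthetic data generated inline as elsewhere in the chapter (the silhouette score used here is covered in the validation section):

```python
# Sketch: k-means on blobs (its home turf) vs. moons (its failure mode).
from sklearn.datasets import make_blobs, make_moons
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X_blobs, _ = make_blobs(n_samples=300, centers=3, random_state=0)
X_moons, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

for name, X, k in [("blobs", X_blobs, 3), ("moons", X_moons, 2)]:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(name, "WCSS:", round(km.inertia_, 1),
          "silhouette:", round(silhouette_score(X, km.labels_), 3))
```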
The blobs case is exactly what k-means is designed for and it nails the partition. The moons case shows the universal failure mode — k-means cuts the data with hyperplanes, so curved structure becomes invisible to it.
Choosing \(K\) — the elbow plot
WCSS strictly decreases with \(K\), so you cannot simply minimise it. Plot WCSS vs. \(K\) and look for an “elbow” where the marginal improvement drops sharply; the elbow is the visually chosen \(K\). The automated companions are the silhouette score and the gap statistic, both covered in the validation section.
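A minimal sketch of the plot, reusing the blob data from the previous example (the data has 3 true centres, so the kink should appear at \(K=3\)):

```python
# Sketch: elbow plot, WCSS vs. K.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
ks = range(1, 9)
wcss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, wcss, marker='o')
plt.xlabel('K')
plt.ylabel('WCSS')
plt.title('Elbow plot: look for the kink')
plt.show()
```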
Two finance-specific failure modes illustrate the point:
- Clustering assets by correlation — correlation clusters are typically non-convex in feature space, and k-means on raw returns will mix highly correlated assets across cluster boundaries.
- Clustering regime data where one regime (calm) dominates 90% of observations while another (crisis) is sparse — k-means will split the calm regime into multiple sub-clusters and merge the crisis with one of them.
Hierarchical Agglomerative Clustering and Dendrograms
Hierarchical agglomerative clustering builds a nested family of clusters bottom-up:
- Start with every point as its own cluster.
- At each step, merge the two closest clusters.
- Continue until everything is in a single cluster.
The output is a dendrogram — a tree of merges. Cutting the dendrogram at a chosen height yields a partition into \(K\) clusters for any \(K\) you like, without re-running the algorithm. This is the algorithm’s killer feature.
Three “linkage” choices determine what “closest” means at the merge step:
- Single linkage — distance between two clusters is the minimum pairwise distance. Produces “stringy” clusters; sensitive to noise.
- Complete linkage — distance is the maximum pairwise distance. Produces tight, ball-shaped clusters.
- Average linkage — mean pairwise distance. Compromise; default in many references.
- Ward linkage — minimises within-cluster variance at each merge; produces balanced, k-means-like partitions. Industry default for general use.
scipy.cluster.hierarchy
- `linkage(X, method='ward')` — returns a linkage matrix encoding the merges.
- `dendrogram(Z, color_threshold=h)` — plot the tree.
- `fcluster(Z, t=K, criterion='maxclust')` — extract a partition into `K` clusters.
- `sklearn.cluster.AgglomerativeClustering(n_clusters=K, linkage='ward')` — sklearn-style interface; lacks the dendrogram but pairs cleanly with pipelines.
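A minimal sketch tying the three scipy calls together on the same blob data used earlier:

```python
# Sketch: Ward linkage, a dendrogram, and a K=3 cut of the tree.
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

Z = linkage(X, method='ward')                     # one row per merge
labels = fcluster(Z, t=3, criterion='maxclust')   # partition into 3 clusters

dendrogram(Z, no_labels=True)
plt.ylabel('merge height')
plt.show()
```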
Notice the dendrogram’s height at each merge: large jumps indicate the merge combined genuinely dissimilar groups. The visual gap is itself a cluster-validity signal — if the height jumps are smooth, \(K\) is ill-defined and the data may not have natural clusters at all.
DBSCAN — Density-Based Clustering for Irregular Shapes
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) defines a cluster as a region of high point density connected through density-reachable neighbourhoods. Two parameters:
- `eps` — the neighbourhood radius.
- `min_samples` — the minimum number of points required to form a dense region.
A point is core if it has at least min_samples neighbours within eps. Two core points are in the same cluster if one is in the other’s eps-neighbourhood. Non-core points within eps of a core point are border points (assigned to that cluster). Points with neither property are labelled noise (-1).
The defining virtue: DBSCAN finds clusters of arbitrary shape and does not require you to specify \(K\). The defining vice: it is sensitive to eps and min_samples, and a single eps cannot accommodate clusters of very different densities.
DBSCAN handles the moons that broke k-means, and it also automatically flags outliers — there is no separate anomaly-detection step. This is precisely the property that makes DBSCAN attractive at quant funds: clustering and outlier detection in one pass.
sklearn.cluster.DBSCAN
DBSCAN(eps=0.5, min_samples=5).fit(X). After fitting, .labels_ contains cluster ids and -1 for noise. .core_sample_indices_ gives the indices of points classified as core. To pick eps, use the standard “k-distance” plot: sort the distance to each point’s min_samples-th neighbour and look for the knee.
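A minimal sketch on the moons data, including the k-distance plot described above (the choice eps=0.2 is our reading of the knee for this particular noise level, not a universal default):

```python
# Sketch: DBSCAN on moons, plus the k-distance plot used to pick eps.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# k-distance plot: sorted distance to each point's min_samples-th neighbour.
# n_neighbors is min_samples + 1 because the nearest "neighbour" is the point itself.
min_samples = 5
dist, _ = NearestNeighbors(n_neighbors=min_samples + 1).fit(X).kneighbors(X)
plt.plot(np.sort(dist[:, -1]))
plt.ylabel(f'distance to {min_samples}th neighbour')
plt.show()

db = DBSCAN(eps=0.2, min_samples=min_samples).fit(X)
n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
print('clusters:', n_clusters, '| noise points:', int((db.labels_ == -1).sum()))
```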
Gaussian Mixture Models — Probabilistic Clustering
A Gaussian mixture model (GMM) treats the data as drawn from a mixture of \(K\) Gaussian components: \[ p(x) = \sum_{k=1}^{K} \pi_k \, N(x \mid \mu_k, \Sigma_k), \qquad \sum_k \pi_k = 1. \] Each point has a probability of belonging to each cluster. Hard assignments are recovered by taking the most probable component, but the soft assignments are the point — they expose uncertainty about which cluster a borderline observation belongs to.
GMMs are fit by Expectation–Maximisation (EM):
- E-step — compute the posterior probability that each point belongs to each component, given current parameters.
- M-step — re-estimate \((\pi_k, \mu_k, \Sigma_k)\) as weighted averages over these probabilities.
- Repeat until the log-likelihood converges.
Two GMM properties matter for finance applications:
- They generalise k-means (which is GMM with isotropic, equal-variance components and hard assignments).
- The full \(\Sigma_k\) form captures elliptical, correlated clusters that k-means would split incorrectly. This is exactly what you want when clustering asset returns.
sklearn.mixture.GaussianMixture
GaussianMixture(n_components=K, covariance_type='full', random_state=0).fit(X). covariance_type choices: 'full', 'tied', 'diag', 'spherical' — pick based on the number of parameters you can afford to estimate. After fitting: .predict(X) for hard labels, .predict_proba(X) for soft assignments, .bic(X) and .aic(X) for model selection over \(K\).
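A minimal sketch of BIC-based model selection over \(K\), on synthetic blob data:

```python
# Sketch: choose the number of GMM components by BIC.
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)

for k in range(1, 7):
    gmm = GaussianMixture(n_components=k, covariance_type='full',
                          random_state=0).fit(X)
    print(k, 'BIC:', round(gmm.bic(X), 1))
# The K with the lowest BIC is the model-selection choice (here K=3).
```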
GMMs are also the workhorse for regime detection when you have an unlabelled mixture of “calm” and “turbulent” return regimes. Fit a 2-component GMM to past returns and read off the posterior probability of being in the high-volatility component each day — that probability is itself a tradeable signal, used at multiple risk-parity shops.
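A minimal sketch of that recipe, assuming simulated daily returns (the 90/10 calm/turbulent mixture below is ours; on a desk you would feed in real returns):

```python
# Sketch: 2-component GMM regime detection on simulated daily returns.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
turbulent = rng.random(2000) < 0.1      # 10% turbulent days
returns = np.where(turbulent,
                   rng.normal(0, 0.03, 2000),    # 3% vol regime
                   rng.normal(0, 0.008, 2000))   # 0.8% vol regime

gmm = GaussianMixture(n_components=2, random_state=0).fit(returns.reshape(-1, 1))
hi_vol = int(np.argmax(gmm.covariances_.ravel()))   # index of the turbulent component
p_crisis = gmm.predict_proba(returns.reshape(-1, 1))[:, hi_vol]
print('mean P(crisis) on turbulent days:', round(float(p_crisis[turbulent].mean()), 2))
```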
Two differences from k-means summarise the case for GMMs:
- GMM produces soft posterior probabilities of cluster membership; k-means hard-assigns each point to its nearest centre.
- GMM allows elliptical, correlated clusters via the full covariance \(\Sigma_k\); k-means assumes spherical, equal-variance clusters.
The combination makes GMM the right choice when borderline points carry information and when feature dimensions are correlated.
Spectral Clustering — When the Geometry Is Wrong
Spectral clustering reformulates the problem in graph terms:
- Build a similarity graph (e.g., \(w_{ij} = \exp(-\lVert x_i - x_j \rVert^2 / 2 \sigma^2)\) for nearby points, \(0\) otherwise).
- Compute the graph Laplacian \(L = D - W\).
- Embed the data into the first \(K\) eigenvectors of \(L\).
- Run k-means in that low-dimensional embedding.
The trick is that the eigenvector embedding flattens curved manifolds. Spectral clustering finds clusters that no Euclidean algorithm could, and it is the right tool when the data lives on a known graph (a network of assets, a friendship graph, a citation network).
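A minimal sketch on the moons data that defeated k-means earlier (the n_neighbors value is sklearn's default, made explicit here):

```python
# Sketch: spectral clustering recovers the two moons.
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# Build a k-nearest-neighbour similarity graph, embed via the Laplacian
# eigenvectors, then run k-means in the embedding; fit_predict does all three.
sc = SpectralClustering(n_clusters=2, affinity='nearest_neighbors',
                        n_neighbors=10, random_state=0)
labels = sc.fit_predict(X)
print(labels[:10])   # each moon comes out as one cluster
```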
In a financial setting, spectral clustering is often the right choice for clustering an affinity matrix derived from correlations (rather than a raw correlation matrix), or for community-detection in supply-chain networks and inter-bank exposure graphs.
Validating Clusters — Silhouette, Gap, and Stability
Clustering without validation is fortune-telling. Three numerical checks belong in every clustering report:
Silhouette score
For each point \(i\), let \(a(i)\) be the mean distance to other points in its own cluster and \(b(i)\) the mean distance to points in the nearest other cluster. The silhouette is \[ s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))} \in [-1, 1]. \] Close to \(1\): well-clustered. Around \(0\): on the boundary. Negative: probably assigned to the wrong cluster. The mean silhouette across the dataset is a single-number cluster quality score.
Gap statistic
Compare the observed within-cluster dispersion to the dispersion expected under a null (uniformly random) reference distribution. The optimal \(K\) is the one with the largest gap. This is the formal version of the elbow plot.
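A minimal sketch of the idea, with a uniform reference box fitted to the data's bounding ranges (B=10 reference draws for brevity; published versions use more and add a standard-error rule):

```python
# Sketch: a bare-bones gap statistic, observed log(WCSS) vs. a uniform null.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
lo, hi = X.min(axis=0), X.max(axis=0)
rng = np.random.default_rng(0)

def log_wcss(data, k):
    return np.log(KMeans(n_clusters=k, n_init=10, random_state=0).fit(data).inertia_)

for k in range(1, 7):
    ref = np.mean([log_wcss(rng.uniform(lo, hi, X.shape), k) for _ in range(10)])
    print(k, 'gap:', round(ref - log_wcss(X, k), 3))
# Pick the K with the largest gap.
```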
Cluster stability
Re-cluster a bootstrap resample of the data; measure how stable the assignments are. Stable assignments are real structure; unstable assignments are artefacts. This is the gold-standard but most-skipped check.
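A minimal sketch of the bootstrap check, measuring agreement with the adjusted Rand index (the 20-resample loop and k-means base clustering are illustrative choices):

```python
# Sketch: bootstrap stability via the adjusted Rand index between re-clusterings.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
base = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

rng = np.random.default_rng(0)
scores = []
for _ in range(20):
    idx = rng.choice(len(X), size=len(X), replace=True)   # bootstrap resample
    boot = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X[idx])
    # Compare assignments on the full data set under the re-fitted model.
    scores.append(adjusted_rand_score(base, boot.predict(X)))
print('mean ARI:', round(float(np.mean(scores)), 3))   # near 1 means stable clusters
```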
sklearn.metrics
- `silhouette_score(X, labels)` — mean silhouette.
- `silhouette_samples(X, labels)` — per-point silhouettes; plot a “silhouette diagram” to see which clusters are problematic.
- `calinski_harabasz_score`, `davies_bouldin_score` — alternative single-number cluster-quality metrics; useful sanity checks.
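A minimal sketch of how these metrics drive the choice of \(K\) in practice, on the same synthetic blobs as before:

```python
# Sketch: mean silhouette across candidate K, the automated companion to the elbow.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, 'mean silhouette:', round(silhouette_score(X, labels), 3))
# The K that maximises the mean silhouette (here 3) is the data-driven choice.
```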
Hierarchical Risk Parity (HRP) — A Hedge-Fund Application
López de Prado’s Hierarchical Risk Parity (2016) is the most famous direct application of clustering in finance. The idea: classical mean–variance optimisation gives nonsensical weights because it inverts a noisy covariance matrix. HRP avoids the inversion by:
- Computing the distance between every pair of assets as \(d_{ij} = \sqrt{(1 - \rho_{ij}) / 2}\) (a proper metric derived from correlation).
- Building a dendrogram with hierarchical agglomerative clustering.
- Allocating capital top-down through the dendrogram, inverse-variance weighting within each split.
The result is a portfolio that is risk-balanced across the natural cluster structure of the assets, robust to estimation error in the covariance matrix, and, in most published studies, delivers a higher out-of-sample Sharpe ratio than classical mean–variance.
The example below uses a simulated 10-asset return matrix to make every step transparent. In production you would use 50 or 500 real return series; the code structure is identical.
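A minimal sketch of the three steps on simulated two-block returns. The simulated data, the `cluster_var` helper, and the bisection loop are our own implementation of the recursive allocation, not a library call; single linkage follows the original paper.

```python
# Sketch: Hierarchical Risk Parity on a simulated 10-asset return matrix.
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, leaves_list
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)
# Simulate 10 assets in two correlated blocks of 5, plus idiosyncratic noise.
n_obs, n_assets = 500, 10
block = rng.standard_normal((n_obs, 2))
returns = pd.DataFrame(
    0.7 * block[:, np.repeat([0, 1], 5)] + 0.3 * rng.standard_normal((n_obs, n_assets)),
    columns=[f"A{i}" for i in range(n_assets)])

corr = returns.corr().values
cov = returns.cov().values

# Step 1: correlation distance, a proper metric.
dist = np.sqrt((1.0 - corr) / 2.0)

# Step 2: agglomerative clustering; leaf order quasi-diagonalises the covariance.
Z = linkage(squareform(dist, checks=False), method='single')
order = list(leaves_list(Z))

def cluster_var(cov, idx):
    """Variance of an inverse-variance-weighted sub-portfolio."""
    sub = cov[np.ix_(idx, idx)]
    ivp = 1.0 / np.diag(sub)
    ivp /= ivp.sum()
    return ivp @ sub @ ivp

# Step 3: recursive bisection down the sorted asset list.
weights = np.ones(n_assets)
stack = [order]
while stack:
    items = stack.pop()
    if len(items) <= 1:
        continue
    half = len(items) // 2
    left, right = items[:half], items[half:]
    v_left, v_right = cluster_var(cov, left), cluster_var(cov, right)
    alpha = 1.0 - v_left / (v_left + v_right)   # lower-variance side gets more capital
    weights[left] *= alpha
    weights[right] *= 1.0 - alpha
    stack += [left, right]

print(pd.Series(weights, index=returns.columns).round(4))   # sums to 1, all positive
```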
The HRP weights are smooth, positive, and reflect the dendrogram’s cluster structure. The minimum-variance weights, by contrast, often contain wild shorts that disappear as soon as you re-estimate the covariance matrix on a different sample. This stability — purchased by replacing one big matrix inversion with a sequence of cluster-local averages — is the entire reason HRP has taken over from classical mean–variance at risk-parity desks.
The classical mean–variance optimiser uses \(w \propto \Sigma^{-1}\mu\). The inverse of a noisy covariance matrix is extremely unstable — small estimation errors blow up into enormous weight swings, especially when assets are highly correlated. HRP replaces the global inversion with a sequence of inverse-variance allocations within each cluster split, so the operation is always on small, well-conditioned sub-problems. That is what makes HRP robust out-of-sample.
Chapter Wrap-up
You now have five clustering algorithms and the validation discipline to use them honestly:
| Algorithm | Shape assumption | Specify K? | Outlier-aware | Best for |
|---|---|---|---|---|
| K-means | spherical, equal-size | yes | no | clean blobby data, fast iteration |
| Hierarchical | none (any linkage) | choose afterwards | no | small-medium data, want the dendrogram |
| DBSCAN | density-connected | no | yes | irregular shapes, automatic noise filter |
| GMM | elliptical, can be correlated | yes | soft (via posteriors) | overlapping groups, regime detection |
| Spectral | graph-based, non-convex OK | yes | no | manifold structure, network data |
The single highest-leverage use of clustering in finance is the HRP allocation we just built. The second is regime detection with a 2- or 3-component GMM on returns or vol. The third is document/news clustering for sentiment pipelines. None of them are exotic; all are everyday infrastructure at the funds named at the start of the chapter.
Chapter 6 closes the book by stepping back from “which algorithm” to “what is the pattern recognition discipline” — the unifying methodology that says when to reach for clustering, when for classification, when for sequence modelling, and how to validate everything you find before betting money on it.