Chapter 5: Clustering for Unsupervised Pattern Discovery
Chapter Introduction
The four chapters before this one were all supervised: somebody told you what \(y\) was — a target return, a click-through rate, a regime label, a future state. Most of the data that actually arrives at a quantitative research desk has no such label. You have 3,000 stocks but no “ground truth” sector. You have a year of strategy returns from 200 PMs and no label saying “talented” or “untalented”. You have a million customer transaction histories and no master schema of behavioural types. Clustering is the discipline of finding structure in this kind of unlabelled data — partitioning observations into groups whose members resemble each other more than they resemble members of other groups.
Top hedge funds use clustering everywhere. Marcos López de Prado built his Hierarchical Risk Parity (HRP) portfolio construction around an agglomerative clustering of the correlation matrix, and the technique is now the default at several large asset managers. AQR clusters factor returns to find redundancy among “anomalies” reported in the academic literature. Two Sigma clusters alternative-data sources (satellite imagery features, credit-card aggregates) before any predictive modelling. Renaissance is widely believed to use density-based clustering to group recurring intraday price patterns. Citadel’s macro group clusters central-bank communications by topic. And outside finance the same machinery powers customer segmentation, gene expression analysis, anomaly detection in industrial IoT data, and recommender systems.
This chapter teaches five clustering algorithms — k-means, hierarchical agglomerative, DBSCAN, Gaussian mixture models, and spectral clustering — and one essential follow-up technique, cluster validation, that tells you whether the clusters you found are real or accidents. Each algorithm has different assumptions, different strengths, and a different failure mode, and a professional analyst picks the right one for the data instead of mechanically reaching for k-means. We close with a finance-specific application: the Hierarchical Risk Parity portfolio, where clustering and risk allocation come together.
As with the previous chapters, the example data is whatever illustrates the method most clearly. Synthetic 2-D blobs are easier to see than real returns, and we generate them inline. When a finance-specific demonstration is the cleanest illustration — clustering a 10-asset correlation matrix for HRP — we use it. None of the methods are finance-specific; all are used in finance because they generalise.
Table of Contents
- The Unsupervised Frame
- Similarity, Distance, and Standardisation
- K-Means — The Workhorse and Its Pitfalls
- Hierarchical Agglomerative Clustering and Dendrograms
- DBSCAN — Density-Based Clustering for Irregular Shapes
- Gaussian Mixture Models — Probabilistic Clustering
- Spectral Clustering — When the Geometry Is Wrong
- Validating Clusters — Silhouette, Gap, and Stability
- Hierarchical Risk Parity (HRP) — A Hedge-Fund Application
The Unsupervised Frame
Supervised learning answers “given \(x\), predict \(y\).” Unsupervised learning answers “given \(x\) alone, what structure is in it?” The two structural questions are:
- Clustering — discover discrete groups: customers, regimes, asset cohorts, document topics.
- Dimensionality reduction — discover a low-dimensional manifold the data lives near. Already met in Chapter 2 as PCA; we return to it in Chapter 6 with t-SNE and UMAP.
A clustering algorithm is a function \(C: \mathbb{R}^{n \times p} \to \{1, \dots, K\}^{n}\) that assigns each of \(n\) observations to one of \(K\) clusters. Without a ground-truth label, two questions immediately arise: how do you choose \(K\), and how do you know if the assignments are any good? Both questions get answers in §5.8 (validation). For now, the working principle is distance: observations in the same cluster should be close to each other; observations in different clusters should be far apart. Every algorithm below operationalises that principle differently.
Similarity, Distance, and Standardisation
All clustering rests on a notion of similarity between observations. The default is Euclidean distance:
\[ d(x, y) = \sqrt{\sum_{j=1}^{p} (x_j - y_j)^2}. \]
But other distances are appropriate in different settings:
- Manhattan (\(\ell_1\)) — robust to outliers; preferred when many features matter equally.
- Cosine — measures the angle between vectors, ignoring magnitude; the right metric for document similarity, sparse feature vectors, normalised return signatures.
- Mahalanobis — Euclidean after whitening by the covariance matrix; the right choice when features are on different scales and correlated. Already met in Chapter 1.
- Correlation distance — \(1 - \rho_{ij}\); the right choice for clustering asset return time series by co-movement, and the foundation of HRP.
Standardisation matters more than the choice of metric. If one column is in dollars (range: 0–10⁶) and another is in fractions (range: 0–1), Euclidean distance will be dominated by the dollar column. Standardise before clustering, or use a distance that is invariant to scale.
- `sklearn.preprocessing.StandardScaler()` — z-score: \(x \to (x - \mu) / \sigma\) per column.
- `sklearn.preprocessing.MinMaxScaler()` — affine to \([0, 1]\).
- `scipy.spatial.distance.pdist(X, metric='euclidean')` — pairwise distances; metrics include `'euclidean'`, `'cityblock'`, `'cosine'`, `'mahalanobis'`, `'correlation'`.
- `scipy.spatial.distance.squareform(d)` — convert a condensed distance vector into a square \(n \times n\) matrix.
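A minimal sketch of the scale problem and its fix, on tiny synthetic data (the column ranges mirror the dollars-vs-fractions example above; the variable names are ours):

```python
# Sketch: unstandardised features let the large-scale column dominate distances.
import numpy as np
from sklearn.preprocessing import StandardScaler
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
X = np.column_stack([
    rng.uniform(0, 1e6, size=5),   # dollar-scale column
    rng.uniform(0, 1, size=5),     # fractional column
])

d_raw = squareform(pdist(X, metric='euclidean'))
d_std = squareform(pdist(StandardScaler().fit_transform(X), metric='euclidean'))

# In d_raw the second column contributes essentially nothing to any distance;
# after z-scoring, both columns contribute on an equal footing.
print(d_raw.round(1))
print(d_std.round(2))
```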
K-Means — The Workhorse and Its Pitfalls
K-means partitions the data into \(K\) clusters by alternating two steps:
- Assignment — each point is assigned to the nearest cluster centre.
- Update — each centre is recomputed as the mean of points assigned to it.
The algorithm converges to a local optimum of the within-cluster sum of squares (WCSS): \[ \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2. \]
It is fast (linear in \(n\) per iteration), scalable, and the most-used clustering algorithm in the world. It is also wrong for many shapes. The implicit assumption is that each cluster is a roughly spherical, equal-sized blob. K-means fails when clusters are elongated, of very different sizes, or non-convex.
sklearn.cluster.KMeans
KMeans(n_clusters=K, n_init=10, random_state=0).fit(X) returns a fitted estimator. Key attributes: .labels_ (cluster assignment per point), .cluster_centers_ (the \(K\) centroids), .inertia_ (WCSS). For very large data, MiniBatchKMeans is the streaming variant. n_init=10 runs the algorithm from 10 random starts and keeps the best; never use n_init=1 on real data — k-means is prone to bad local optima.
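A minimal sketch of both canonical test cases, with the synthetic data generated inline as elsewhere in the chapter (the silhouette score used here is covered in the validation section):

```python
# Sketch: k-means on blobs (its home turf) vs. moons (its failure mode).
from sklearn.datasets import make_blobs, make_moons
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X_blobs, _ = make_blobs(n_samples=300, centers=3, random_state=0)
X_moons, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

for name, X, k in [("blobs", X_blobs, 3), ("moons", X_moons, 2)]:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(name, "WCSS:", round(km.inertia_, 1),
          "silhouette:", round(silhouette_score(X, km.labels_), 3))
```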
The blobs case is exactly what k-means is designed for and it nails the partition. The moons case shows the universal failure mode — k-means cuts the data with hyperplanes, so curved structure becomes invisible to it.
Choosing \(K\) — the elbow plot
WCSS strictly decreases with \(K\), so you cannot simply minimise it. Plot WCSS vs. \(K\) and look for an “elbow” where the marginal improvement drops sharply; the elbow is the visually chosen \(K\). The automated companions are the silhouette score and the gap statistic, both covered in the validation section.
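A minimal sketch of the plot, reusing the blob data from the previous example (the data has 3 true centres, so the kink should appear at \(K=3\)):

```python
# Sketch: elbow plot, WCSS vs. K.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
ks = range(1, 9)
wcss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, wcss, marker='o')
plt.xlabel('K')
plt.ylabel('WCSS')
plt.title('Elbow plot: look for the kink')
plt.show()
```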
Two finance-specific failure modes illustrate the point:
- Clustering assets by correlation — correlation clusters are typically non-convex in feature space, and k-means on raw returns will mix highly correlated assets across cluster boundaries.
- Clustering regime data where one regime (calm) dominates 90% of observations while another (crisis) is sparse — k-means will split the calm regime into multiple sub-clusters and merge the crisis with one of them.
Hierarchical Agglomerative Clustering and Dendrograms
Hierarchical agglomerative clustering builds a nested family of clusters bottom-up:
- Start with every point as its own cluster.
- At each step, merge the two closest clusters.
- Continue until everything is in a single cluster.
The output is a dendrogram — a tree of merges. Cutting the dendrogram at a chosen height yields a partition into \(K\) clusters for any \(K\) you like, without re-running the algorithm. This is the algorithm’s killer feature.
Three “linkage” choices determine what “closest” means at the merge step:
- Single linkage — distance between two clusters is the minimum pairwise distance. Produces “stringy” clusters; sensitive to noise.
- Complete linkage — distance is the maximum pairwise distance. Produces tight, ball-shaped clusters.
- Average linkage — mean pairwise distance. Compromise; default in many references.
- Ward linkage — minimises within-cluster variance at each merge; produces balanced, k-means-like partitions. Industry default for general use.
scipy.cluster.hierarchy
- `linkage(X, method='ward')` — returns a linkage matrix encoding the merges.
- `dendrogram(Z, color_threshold=h)` — plot the tree.
- `fcluster(Z, t=K, criterion='maxclust')` — extract a partition into `K` clusters.
- `sklearn.cluster.AgglomerativeClustering(n_clusters=K, linkage='ward')` — sklearn-style interface; lacks the dendrogram but pairs cleanly with pipelines.
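A minimal sketch tying the three scipy calls together on the same blob data used earlier:

```python
# Sketch: Ward linkage, a dendrogram, and a K=3 cut of the tree.
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

Z = linkage(X, method='ward')                     # one row per merge
labels = fcluster(Z, t=3, criterion='maxclust')   # partition into 3 clusters

dendrogram(Z, no_labels=True)
plt.ylabel('merge height')
plt.show()
```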
Notice the dendrogram’s height at each merge: large jumps indicate the merge combined genuinely dissimilar groups. The visual gap is itself a cluster-validity signal — if the height jumps are smooth, \(K\) is ill-defined and the data may not have natural clusters at all.
DBSCAN — Density-Based Clustering for Irregular Shapes
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) defines a cluster as a region of high point density connected through density-reachable neighbourhoods. Two parameters:
- `eps` — the neighbourhood radius.
- `min_samples` — the minimum number of points required to form a dense region.
A point is core if it has at least min_samples neighbours within eps. Two core points are in the same cluster if one is in the other’s eps-neighbourhood. Non-core points within eps of a core point are border points (assigned to that cluster). Points with neither property are labelled noise (-1).
The defining virtue: DBSCAN finds clusters of arbitrary shape and does not require you to specify \(K\). The defining vice: it is sensitive to eps and min_samples, and a single eps cannot accommodate clusters of very different densities.
DBSCAN handles the moons that broke k-means, and it also automatically flags outliers — there is no separate anomaly-detection step. This is precisely the property that makes DBSCAN attractive at quant funds: clustering and outlier detection in one pass.
sklearn.cluster.DBSCAN
DBSCAN(eps=0.5, min_samples=5).fit(X). After fitting, .labels_ contains cluster ids and -1 for noise. .core_sample_indices_ gives the indices of points classified as core. To pick eps, use the standard “k-distance” plot: sort the distance to each point’s min_samples-th neighbour and look for the knee.
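A minimal sketch on the moons data, including the k-distance plot described above (the choice eps=0.2 is our reading of the knee for this particular noise level, not a universal default):

```python
# Sketch: DBSCAN on moons, plus the k-distance plot used to pick eps.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# k-distance plot: sorted distance to each point's min_samples-th neighbour.
# n_neighbors is min_samples + 1 because the nearest "neighbour" is the point itself.
min_samples = 5
dist, _ = NearestNeighbors(n_neighbors=min_samples + 1).fit(X).kneighbors(X)
plt.plot(np.sort(dist[:, -1]))
plt.ylabel(f'distance to {min_samples}th neighbour')
plt.show()

db = DBSCAN(eps=0.2, min_samples=min_samples).fit(X)
n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
print('clusters:', n_clusters, '| noise points:', int((db.labels_ == -1).sum()))
```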
Gaussian Mixture Models — Probabilistic Clustering
A Gaussian mixture model (GMM) treats the data as drawn from a mixture of \(K\) Gaussian components: \[ p(x) = \sum_{k=1}^{K} \pi_k \, N(x \mid \mu_k, \Sigma_k), \qquad \sum_k \pi_k = 1. \] Each point has a probability of belonging to each cluster. Hard assignments are recovered by taking the most probable component, but the soft assignments are the point — they expose uncertainty about which cluster a borderline observation belongs to.
GMMs are fit by Expectation–Maximisation (EM):
- E-step — compute the posterior probability that each point belongs to each component, given current parameters.
- M-step — re-estimate \((\pi_k, \mu_k, \Sigma_k)\) as weighted averages over these probabilities.
- Repeat until the log-likelihood converges.
Two GMM properties matter for finance applications:
- They generalise k-means (which is GMM with isotropic, equal-variance components and hard assignments).
- The full \(\Sigma_k\) form captures elliptical, correlated clusters that k-means would split incorrectly. This is exactly what you want when clustering asset returns.
sklearn.mixture.GaussianMixture
GaussianMixture(n_components=K, covariance_type='full', random_state=0).fit(X). covariance_type choices: 'full', 'tied', 'diag', 'spherical' — pick based on the number of parameters you can afford to estimate. After fitting: .predict(X) for hard labels, .predict_proba(X) for soft assignments, .bic(X) and .aic(X) for model selection over \(K\).
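A minimal sketch of BIC-based model selection over \(K\), on synthetic blob data:

```python
# Sketch: choose the number of GMM components by BIC.
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)

for k in range(1, 7):
    gmm = GaussianMixture(n_components=k, covariance_type='full',
                          random_state=0).fit(X)
    print(k, 'BIC:', round(gmm.bic(X), 1))
# The K with the lowest BIC is the model-selection choice (here K=3).
```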
GMMs are also the workhorse for regime detection when you have an unlabelled mixture of “calm” and “turbulent” return regimes. Fit a 2-component GMM to past returns and read off the posterior probability of being in the high-volatility component each day — that probability is itself a tradeable signal, used at multiple risk-parity shops.
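A minimal sketch of that recipe, assuming simulated daily returns (the 90/10 calm/turbulent mixture below is ours; on a desk you would feed in real returns):

```python
# Sketch: 2-component GMM regime detection on simulated daily returns.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
turbulent = rng.random(2000) < 0.1      # 10% turbulent days
returns = np.where(turbulent,
                   rng.normal(0, 0.03, 2000),    # 3% vol regime
                   rng.normal(0, 0.008, 2000))   # 0.8% vol regime

gmm = GaussianMixture(n_components=2, random_state=0).fit(returns.reshape(-1, 1))
hi_vol = int(np.argmax(gmm.covariances_.ravel()))   # index of the turbulent component
p_crisis = gmm.predict_proba(returns.reshape(-1, 1))[:, hi_vol]
print('mean P(crisis) on turbulent days:', round(float(p_crisis[turbulent].mean()), 2))
```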
Two differences from k-means summarise the case for GMMs:
- GMM produces soft posterior probabilities of cluster membership; k-means hard-assigns each point to its nearest centre.
- GMM allows elliptical, correlated clusters via the full covariance \(\Sigma_k\); k-means assumes spherical, equal-variance clusters.
The combination makes GMM the right choice when borderline points carry information and when feature dimensions are correlated.
Spectral Clustering — When the Geometry Is Wrong
Spectral clustering reformulates the problem in graph terms:
- Build a similarity graph (e.g., \(w_{ij} = \exp(-\lVert x_i - x_j \rVert^2 / 2 \sigma^2)\) for nearby points, \(0\) otherwise).
- Compute the graph Laplacian \(L = D - W\).
- Embed the data into the first \(K\) eigenvectors of \(L\).
- Run k-means in that low-dimensional embedding.
The trick is that the eigenvector embedding flattens curved manifolds. Spectral clustering finds clusters that no Euclidean algorithm could, and it is the right tool when the data lives on a known graph (a network of assets, a friendship graph, a citation network).
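A minimal sketch on the moons data that defeated k-means earlier (the n_neighbors value is sklearn's default, made explicit here):

```python
# Sketch: spectral clustering recovers the two moons.
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# Build a k-nearest-neighbour similarity graph, embed via the Laplacian
# eigenvectors, then run k-means in the embedding; fit_predict does all three.
sc = SpectralClustering(n_clusters=2, affinity='nearest_neighbors',
                        n_neighbors=10, random_state=0)
labels = sc.fit_predict(X)
print(labels[:10])   # each moon comes out as one cluster
```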
In a financial setting, spectral clustering is often the right choice for clustering an affinity matrix derived from correlations (rather than a raw correlation matrix), or for community-detection in supply-chain networks and inter-bank exposure graphs.
Validating Clusters — Silhouette, Gap, and Stability
Clustering without validation is fortune-telling. Three numerical checks belong in every clustering report:
Silhouette score
For each point \(i\), let \(a(i)\) be the mean distance to other points in its own cluster and \(b(i)\) the mean distance to points in the nearest other cluster. The silhouette is \[ s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))} \in [-1, 1]. \] Close to \(1\): well-clustered. Around \(0\): on the boundary. Negative: probably assigned to the wrong cluster. The mean silhouette across the dataset is a single-number cluster quality score.
Gap statistic
Compare the observed within-cluster dispersion to the dispersion expected under a null (uniformly random) reference distribution. The optimal \(K\) is the one with the largest gap. This is the formal version of the elbow plot.
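A minimal sketch of the idea, with a uniform reference box fitted to the data's bounding ranges (B=10 reference draws for brevity; published versions use more and add a standard-error rule):

```python
# Sketch: a bare-bones gap statistic, observed log(WCSS) vs. a uniform null.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
lo, hi = X.min(axis=0), X.max(axis=0)
rng = np.random.default_rng(0)

def log_wcss(data, k):
    return np.log(KMeans(n_clusters=k, n_init=10, random_state=0).fit(data).inertia_)

for k in range(1, 7):
    ref = np.mean([log_wcss(rng.uniform(lo, hi, X.shape), k) for _ in range(10)])
    print(k, 'gap:', round(ref - log_wcss(X, k), 3))
# Pick the K with the largest gap.
```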
Cluster stability
Re-cluster a bootstrap resample of the data; measure how stable the assignments are. Stable assignments are real structure; unstable assignments are artefacts. This is the gold-standard but most-skipped check.
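A minimal sketch of the bootstrap check, measuring agreement with the adjusted Rand index (the 20-resample loop and k-means base clustering are illustrative choices):

```python
# Sketch: bootstrap stability via the adjusted Rand index between re-clusterings.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
base = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

rng = np.random.default_rng(0)
scores = []
for _ in range(20):
    idx = rng.choice(len(X), size=len(X), replace=True)   # bootstrap resample
    boot = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X[idx])
    # Compare assignments on the full data set under the re-fitted model.
    scores.append(adjusted_rand_score(base, boot.predict(X)))
print('mean ARI:', round(float(np.mean(scores)), 3))   # near 1 means stable clusters
```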
sklearn.metrics
- `silhouette_score(X, labels)` — mean silhouette.
- `silhouette_samples(X, labels)` — per-point silhouettes; plot a “silhouette diagram” to see which clusters are problematic.
- `calinski_harabasz_score`, `davies_bouldin_score` — alternative single-number cluster-quality metrics; useful sanity checks.
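A minimal sketch of how these metrics drive the choice of \(K\) in practice, on the same synthetic blobs as before:

```python
# Sketch: mean silhouette across candidate K, the automated companion to the elbow.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, 'mean silhouette:', round(silhouette_score(X, labels), 3))
# The K that maximises the mean silhouette (here 3) is the data-driven choice.
```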
Hierarchical Risk Parity (HRP) — A Hedge-Fund Application
López de Prado’s Hierarchical Risk Parity (2016) is the most famous direct application of clustering in finance. The idea: classical mean–variance optimisation gives nonsensical weights because it inverts a noisy covariance matrix. HRP avoids the inversion by:
- Computing the distance between every pair of assets as \(d_{ij} = \sqrt{(1 - \rho_{ij}) / 2}\) (a proper metric derived from correlation).
- Building a dendrogram with hierarchical agglomerative clustering.
- Allocating capital top-down through the dendrogram, inverse-variance weighting within each split.
The result is a portfolio that is risk-balanced across the natural cluster structure of the assets, robust to estimation error in the covariance matrix, and, in most published studies, delivers a higher out-of-sample Sharpe ratio than classical mean–variance.
The example below uses a simulated 10-asset return matrix to make every step transparent. In production you would use 50 or 500 real return series; the code structure is identical.
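A minimal sketch of the three steps on simulated two-block returns. The simulated data, the `cluster_var` helper, and the bisection loop are our own implementation of the recursive allocation, not a library call; single linkage follows the original paper.

```python
# Sketch: Hierarchical Risk Parity on a simulated 10-asset return matrix.
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, leaves_list
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)
# Simulate 10 assets in two correlated blocks of 5, plus idiosyncratic noise.
n_obs, n_assets = 500, 10
block = rng.standard_normal((n_obs, 2))
returns = pd.DataFrame(
    0.7 * block[:, np.repeat([0, 1], 5)] + 0.3 * rng.standard_normal((n_obs, n_assets)),
    columns=[f"A{i}" for i in range(n_assets)])

corr = returns.corr().values
cov = returns.cov().values

# Step 1: correlation distance, a proper metric.
dist = np.sqrt((1.0 - corr) / 2.0)

# Step 2: agglomerative clustering; leaf order quasi-diagonalises the covariance.
Z = linkage(squareform(dist, checks=False), method='single')
order = list(leaves_list(Z))

def cluster_var(cov, idx):
    """Variance of an inverse-variance-weighted sub-portfolio."""
    sub = cov[np.ix_(idx, idx)]
    ivp = 1.0 / np.diag(sub)
    ivp /= ivp.sum()
    return ivp @ sub @ ivp

# Step 3: recursive bisection down the sorted asset list.
weights = np.ones(n_assets)
stack = [order]
while stack:
    items = stack.pop()
    if len(items) <= 1:
        continue
    half = len(items) // 2
    left, right = items[:half], items[half:]
    v_left, v_right = cluster_var(cov, left), cluster_var(cov, right)
    alpha = 1.0 - v_left / (v_left + v_right)   # lower-variance side gets more capital
    weights[left] *= alpha
    weights[right] *= 1.0 - alpha
    stack += [left, right]

print(pd.Series(weights, index=returns.columns).round(4))   # sums to 1, all positive
```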
The HRP weights are smooth, positive, and reflect the dendrogram’s cluster structure. The minimum-variance weights, by contrast, often contain wild shorts that disappear as soon as you re-estimate the covariance matrix on a different sample. This stability — purchased by replacing one big matrix inversion with a sequence of cluster-local averages — is the entire reason HRP has taken over from classical mean–variance at risk-parity desks.
The classical mean–variance optimiser uses \(w \propto \Sigma^{-1}\mu\). The inverse of a noisy covariance matrix is extremely unstable — small estimation errors blow up into enormous weight swings, especially when assets are highly correlated. HRP replaces the global inversion with a sequence of inverse-variance allocations within each cluster split, so the operation is always on small, well-conditioned sub-problems. That is what makes HRP robust out-of-sample.
Chapter Wrap-up
You now have five clustering algorithms and the validation discipline to use them honestly:
| Algorithm | Shape assumption | Specify K? | Outlier-aware | Best for |
|---|---|---|---|---|
| K-means | spherical, equal-size | yes | no | clean blobby data, fast iteration |
| Hierarchical | none (any linkage) | choose afterwards | no | small-medium data, want the dendrogram |
| DBSCAN | density-connected | no | yes | irregular shapes, automatic noise filter |
| GMM | elliptical, can be correlated | yes | soft (via posteriors) | overlapping groups, regime detection |
| Spectral | graph-based, non-convex OK | yes | no | manifold structure, network data |
The single highest-leverage use of clustering in finance is the HRP allocation we just built. The second is regime detection with a 2- or 3-component GMM on returns or vol. The third is document/news clustering for sentiment pipelines. None of them are exotic; all are everyday infrastructure at the funds named at the start of the chapter.
Chapter 6 closes the book by stepping back from “which algorithm” to “what is the pattern recognition discipline” — the unifying methodology that says when to reach for clustering, when for classification, when for sequence modelling, and how to validate everything you find before betting money on it.