
Chapter 2: Statistical Predictive Models

Chapter Introduction

Almost every published result in empirical finance — every Fama–French factor, every accrual anomaly, every Sharpe-ratio-improving signal — sits inside a regression. Almost every machine-learning model used at Two Sigma, DE Shaw, and AQR is, under the hood, a regularised regression or a tree ensemble built on top of one. Regression is to quantitative research what arithmetic is to bookkeeping: not glamorous, not new, completely unavoidable.

What is new — and what separates a present-day quantitative researcher from one trained twenty years ago — is the discipline around regression. Renaissance Technologies famously screens hundreds of thousands of candidate predictors and systematically selects the few that survive cross-validation, regularisation, and out-of-sample checks. AQR’s recent papers spend more pages on the selection procedure than on the regressions themselves. DE Shaw is widely known for pioneering the use of LASSO inside trading pipelines back in the 2000s. The methods in this chapter — multiple regression, residual diagnostics, nonlinear transformations, best-subset and stepwise selection, \(k\)-fold cross-validation, Ridge / LASSO / Elastic Net, and Principal Component Analysis — are the building blocks of every modern systematic strategy.

We treat regression in this chapter as a supervised pattern-recognition problem. Given a matrix of candidate features \(X\) and a target \(y\), the question is not “what is the relationship” but “which features, in which functional form, generalise — produce a model whose error on new data is as small as possible?” That reframing turns every section in this chapter from a static computation into an act of model selection. By the end of the chapter you will have a defensible procedure for taking a wide table of candidate predictors and producing a parsimonious, regularised, cross-validated model — the same procedure used by professional research desks.

The examples in this chapter use whatever data illustrates the method most clearly. Wine quality, used-car prices, energy consumption, simulated factor returns — all of them are vehicles for the statistical machinery. None of the techniques are finance-specific; all of them are used in finance because they work on any tabular data with one numerical target.


Table of Contents

  1. The Supervised Pattern-Recognition Frame
  2. Multiple Linear Regression — Geometry and Estimation
  3. Reading the Regression Output and Diagnostics
  4. Nonlinear Transformations and Interactions
  5. Variable Selection — Best Subset, Forward, Backward, Stepwise
  6. Cross-Validation — The Honest Test
  7. Regularised Regression — Ridge, LASSO, Elastic Net
  8. PCA and the Idea of a Factor Model
  9. Pattern Recognition Beyond Linear Models — Trees and Ensembles

The Supervised Pattern-Recognition Frame

A supervised problem is one in which every example in the training set comes with a label: the answer the model is supposed to produce. Predicting tomorrow’s return given today’s features, predicting whether a customer churns given their transaction history, predicting house prices from neighbourhood characteristics — these are all the same shape of problem. You have a matrix \(X \in \mathbb{R}^{n \times p}\) of \(n\) examples by \(p\) features and a vector \(y \in \mathbb{R}^{n}\) of labels. You want a function \(\hat f(\cdot)\) such that \(\hat f(x_{\text{new}})\) is close to \(y_{\text{new}}\) on examples you have not yet seen.

The phrase “have not yet seen” is the entire point of this chapter. A model that fits the training data perfectly is worthless if its performance on future data is poor. The history of quantitative finance is a graveyard of backtests that explained 80% of the in-sample variance and 0% of the out-of-sample variance. Every method in this chapter exists to push that out-of-sample number up.

The bias–variance trade-off in one paragraph

A model that is too simple (low capacity) under-fits — it has high bias and low variance, missing patterns that are really there. A model that is too complex (high capacity) over-fits — it has low bias and high variance, memorising the noise. The optimal model is somewhere in the middle, and you find it by measuring out-of-sample error and tuning a complexity knob (number of variables, polynomial degree, regularisation strength). Cross-validation, regularisation, and PCA are the three knobs in this chapter.

Module reference — scikit-learn conventions

Every scikit-learn estimator follows the same API:

  • .fit(X, y) — learn parameters from data.
  • .predict(X_new) — return predictions.
  • .score(X, y) — built-in metric (R² for regressors, accuracy for classifiers).
  • Hyperparameters go into the constructor (e.g., Lasso(alpha=0.1)).
  • For pipelines / cross-validation, see sklearn.pipeline.Pipeline and sklearn.model_selection.cross_val_score.
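A minimal sketch of the pattern on simulated data (the features, target, and alpha value here are illustrative assumptions, not fixed by the text):

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))                      # 200 examples, 5 features
    y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=200)

    model = Lasso(alpha=0.1)                           # hyperparameter in the constructor
    model.fit(X, y)                                    # learn coefficients from data
    preds = model.predict(X[:5])                       # predictions for new rows
    print(model.score(X, y))                           # built-in R² for regressors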

Multiple Linear Regression — Geometry and Estimation

Multiple linear regression posits

\[ y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \varepsilon_i, \qquad \varepsilon_i \sim \text{i.i.d. with mean }0. \]

Stacked into matrix form: \(y = X\beta + \varepsilon\), where \(X\) has a leading column of ones (the intercept). Ordinary least squares (OLS) picks \(\hat\beta\) to minimise \(\lVert y - X\beta \rVert^2\). The closed-form solution is

\[ \hat\beta_{\text{OLS}} = (X^\top X)^{-1} X^\top y, \]

provided \(X^\top X\) is invertible. Geometrically, \(\hat y = X\hat\beta\) is the projection of \(y\) onto the column space of \(X\); the residual vector \(y - \hat y\) is orthogonal to every column of \(X\). That orthogonality is what makes the predictions and residuals so well behaved.
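The formula is easy to verify numerically; a minimal sketch on simulated data (in production code you would call np.linalg.lstsq or a library routine rather than forming \((X^\top X)^{-1}\) explicitly):

    import numpy as np

    rng = np.random.default_rng(1)
    n = 100
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])   # leading column of ones
    beta_true = np.array([0.5, 1.0, -2.0, 0.0])
    y = X @ beta_true + rng.normal(scale=0.3, size=n)

    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)       # (X'X)^{-1} X'y via a linear solve
    residual = y - X @ beta_hat
    print(X.T @ residual)                              # ~0: residuals orthogonal to every column of X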

Module reference — statsmodels.api.OLS

sm.OLS(y, X).fit() is the workhorse. Conventionally X includes an explicit intercept column added via sm.add_constant(X). The fitted object has .params, .bse (standard errors), .tvalues, .pvalues, .rsquared, .rsquared_adj, .summary(), .predict(), .resid, and .get_prediction() (for CIs and PIs on new data).
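A minimal simulated example in that spirit (the sample size, noise level, and true coefficients — the third of which is exactly zero, with a zero intercept — are illustrative assumptions):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(42)
    n = 500
    X = rng.normal(size=(n, 4))
    beta_true = np.array([1.5, -0.8, 0.0, 2.0])        # the third true coefficient is zero
    y = X @ beta_true + rng.normal(size=n)             # intercept is zero by construction

    res = sm.OLS(y, sm.add_constant(X)).fit()
    print(res.summary())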

Read the output: const is the estimated intercept (close to 0), and x1–x4 are the estimated slopes. Observe that the third slope (\(\beta_3 = 0\) in truth) is small with a large, non-significant \(p\)-value, while the rest are tightly estimated and significant.

What the coefficients mean

Each \(\hat\beta_j\) is the estimated change in \(y\) for a one-unit change in \(x_j\), holding the other \(x\) variables fixed. The clause “holding the others fixed” is what makes multiple regression different from a battery of one-variable regressions — and what makes it dangerous when the predictors are correlated (multi-collinearity).

Reading the Regression Output and Diagnostics

A fitted model is only trustworthy if it satisfies the classical assumptions, and an undisciplined analyst skips this step at her peril. The acronym for the assumptions is LINE: Linearity, Independence of errors, Normality of errors, Equal variance (homoskedasticity).

After fitting, four diagnostic plots reveal whether the assumptions hold:

  • Residuals vs. fitted should look like a flat noise cloud. A funnel shape signals heteroskedasticity; a curve signals missing nonlinearity.
  • QQ plot of residuals should fall on a 45° line. Heavy tails curve up at the top right and down at the bottom left.
  • Scale–location (sqrt of standardised residuals vs. fitted) — another heteroskedasticity check.
  • Leverage vs. residual — points with both high leverage and large residual exert disproportionate influence on the fit. These are the regression analogue of the outliers we hunted in Chapter 1.

Two fixes for heteroskedasticity are common: (a) transform \(y\) (log, square-root) so its variance stabilises; (b) use robust (“HC”) standard errors when reporting inference. Statsmodels does the latter with fit(cov_type='HC3').
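A sketch of the first two diagnostic plots, assuming res is the fitted statsmodels result from the OLS example above (the other two plots follow the same pattern):

    import matplotlib.pyplot as plt
    import statsmodels.api as sm

    fig, axes = plt.subplots(1, 2, figsize=(10, 4))

    # Residuals vs. fitted: should look like a flat noise cloud
    axes[0].scatter(res.fittedvalues, res.resid, s=10)
    axes[0].axhline(0, color='grey')
    axes[0].set_xlabel('fitted values')
    axes[0].set_ylabel('residuals')

    # QQ plot: residuals should hug the 45° line
    sm.qqplot(res.resid, line='45', fit=True, ax=axes[1])
    plt.tight_layout()
    plt.show()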

Leverage, influence, and Cook’s distance

A single observation can dominate a regression. Cook’s distance combines residual size and leverage into a single number; values above \(4/n\) or above 1 are conventionally inspected. This is the regression analogue of anomaly detection — it tells you which training points are unusually influential on the model itself.
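In statsmodels the influence measures hang off the fitted result; a sketch, again assuming the res object from the earlier example and using the 4/n rule of thumb:

    import numpy as np

    influence = res.get_influence()
    cooks_d = influence.cooks_distance[0]              # array of Cook's distances
    threshold = 4 / len(cooks_d)
    flagged = np.where(cooks_d > threshold)[0]
    print(len(flagged), "observations exceed 4/n:", flagged[:10])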

Nonlinear Transformations and Interactions

Linear regression is linear in the coefficients, not in the variables. You are free to introduce polynomial terms, log transforms, splines, and interactions and still use OLS.

When to use what

  • log(x) — when the effect of \(x\) on \(y\) is multiplicative (concentrations, prices, sizes). A log-linear model says “1% change in \(x\) produces \(\beta\)% change in \(y\).”
  • Polynomial terms (\(x^2\), \(x^3\)) — when the response curves smoothly. Use sparingly; a cubic on 100 points is already over-parameterised.
  • Splines (patsy-style bs(x, df=4)) — local polynomial pieces stitched together. The modern replacement for high-degree polynomials.
  • Interactions (\(x_1 \cdot x_2\)) — when the effect of \(x_1\) depends on the level of \(x_2\). Critical in factor models where a value signal works only in low-volatility regimes, etc.

Module reference — patsy formulas

patsy.dmatrices('y ~ x1 + x2 + x1:x2 + np.log(x3) + bs(x4, df=4)', data=df) builds the design matrix from a string formula. The : means interaction; * means main effects plus interaction; bs() calls a B-spline basis. Inside statsmodels you can pass the formula directly: smf.ols('y ~ x + I(x**2)', data=df).fit().

If the residuals-vs-fitted plot shows a U-shape — the signature of unmodelled second-order curvature — add a quadratic term in the offending predictor: change y ~ x to y ~ x + I(x**2).
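A sketch of the before-and-after on simulated data with genuine curvature (variable names and coefficient values are illustrative):

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(3)
    df = pd.DataFrame({'x': rng.uniform(-2, 2, 300)})
    df['y'] = 1.0 + 0.5 * df['x'] + 2.0 * df['x']**2 + rng.normal(scale=0.5, size=300)

    linear = smf.ols('y ~ x', data=df).fit()           # residuals vs. fitted show a U-shape
    quadratic = smf.ols('y ~ x + I(x**2)', data=df).fit()
    print(linear.rsquared, quadratic.rsquared)         # the quadratic fit absorbs the curvature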

Variable Selection — Best Subset, Forward, Backward, Stepwise

When you have \(p\) candidate predictors and reason to believe only some of them matter, you face a model-selection problem. The naïve answer — fit the full model and trust the \(p\)-values — fails badly when predictors are correlated or when \(p\) is large. The disciplined answer is to define a selection procedure and a selection criterion, run the procedure end to end, and report the model it lands on.

Selection criteria

  • R² — never use alone; always increases when you add a variable.
  • Adjusted R² — penalises extra variables; legitimate but weak.
  • AIC = \(-2\ln L + 2 k\) — penalises by twice the parameter count; the milder penalty, oriented toward predictive accuracy.
  • BIC = \(-2\ln L + k\ln n\) — penalises more aggressively as sample size grows; oriented toward selecting the sparsest adequate model.

When AIC and BIC point at the same model, you have confidence. When they disagree, BIC selects fewer variables.
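Statsmodels exposes both criteria on every fitted model, so comparing candidate specifications is direct; a minimal sketch on simulated data (the assumption that only the first predictor matters is made purely for illustration):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(5)
    n = 300
    X = rng.normal(size=(n, 3))
    y = 1.0 + 2.0 * X[:, 0] + rng.normal(size=n)       # only x1 matters in truth

    small = sm.OLS(y, sm.add_constant(X[:, :1])).fit() # x1 only
    large = sm.OLS(y, sm.add_constant(X)).fit()        # all three predictors
    print("small:", small.aic, small.bic)
    print("large:", large.aic, large.bic)              # typically both criteria prefer the small model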

Selection procedures

  • Best subset — fit every one of the \(2^p\) subsets, keep the best by chosen criterion. Exhaustive but exponential.
  • Forward — start with no variables; at each step add the one that most improves the criterion. Greedy, fast, often near-optimal.
  • Backward — start with all variables; at each step remove the least useful.
  • Stepwise — alternates forward and backward at each step.

For \(p \le 10\), run best subset. For larger \(p\), forward and stepwise are the workhorses. Modern practice prefers regularisation (next section) over stepwise — but stepwise is still ubiquitous in academic finance.
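A hand-rolled forward-selection sketch, greedy on BIC, on simulated data in which only x1, x3, and x5 carry true signal (the coefficient values and stopping rule are illustrative choices):

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    rng = np.random.default_rng(7)
    n, p = 300, 8
    X = pd.DataFrame(rng.normal(size=(n, p)),
                     columns=[f'x{j}' for j in range(1, p + 1)])
    y = 2.0 * X['x1'] - 1.5 * X['x3'] + 1.0 * X['x5'] + rng.normal(size=n)

    selected, remaining = [], list(X.columns)
    best_bic = sm.OLS(y, np.ones(n)).fit().bic         # intercept-only baseline
    while remaining:
        trial = {c: sm.OLS(y, sm.add_constant(X[selected + [c]])).fit().bic
                 for c in remaining}
        best_var, best_new = min(trial.items(), key=lambda kv: kv[1])
        if best_new >= best_bic:                       # no BIC improvement: stop
            break
        selected.append(best_var)
        remaining.remove(best_var)
        best_bic = best_new
    print(selected)                                    # typically recovers ['x1', 'x3', 'x5']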

The procedure should pick (x1, x3, x5) — the truly nonzero coefficients — without us telling it which ones to choose. That is the entire game.

Cross-Validation — The Honest Test

Adjusted R², AIC, and BIC are in-sample criteria. Cross-validation is the out-of-sample one. The standard recipe:

  1. Split the data into \(K\) roughly equal folds (typically \(K = 5\) or \(10\)).
  2. For each fold \(k\), train on the other \(K-1\) folds and predict on fold \(k\).
  3. Compute the prediction error on fold \(k\) (RMSE or \(R^2_{\text{OS}}\)).
  4. Average the \(K\) errors.

The averaged out-of-sample RMSE is what you would honestly report to a portfolio manager who is about to deploy your model. If your in-sample \(R^2\) is 0.7 but your cross-validated \(R^2\) is 0.05, the model is over-fit and not tradeable.

Module reference — sklearn.model_selection
  • train_test_split(X, y, test_size=0.2, random_state=0) — single holdout split.
  • KFold(n_splits=5, shuffle=True, random_state=0) — k-fold iterator.
  • cross_val_score(model, X, y, cv=5, scoring='neg_root_mean_squared_error') — one-liner.
  • TimeSeriesSplit(n_splits=5) — preserves time ordering. Always use this on time series; never shuffle.
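A sketch of the comparison the next sentence refers to (simulated data; the size of the gap depends on the noise level and feature count you choose):

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 10))
    y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=200)

    model = LinearRegression()
    in_sample = model.fit(X, y).score(X, y)                       # R² on the training data
    cv_r2 = cross_val_score(model, X, y, cv=5, scoring='r2')      # out-of-sample R², 5 folds
    print(f"in-sample R² = {in_sample:.3f},  CV R² = {cv_r2.mean():.3f}")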

A small gap between in-sample and CV R² means the model generalises. A large gap means over-fit.

KFold shuffles rows and lets the model train on future data and predict the past — a form of look-ahead bias that grossly inflates apparent performance. TimeSeriesSplit strictly trains on past data and tests on future data, preserving the causal direction of the world.
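A small sketch of the time-respecting split (twelve time-ordered observations, chosen only to make the fold boundaries visible):

    import numpy as np
    from sklearn.model_selection import TimeSeriesSplit

    X = np.arange(12).reshape(-1, 1)                   # 12 time-ordered observations
    for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X):
        print("train", train_idx, "-> test", test_idx)
    # train [0 1 2]             -> test [3 4 5]
    # train [0 1 2 3 4 5]       -> test [6 7 8]
    # train [0 1 2 3 4 5 6 7 8] -> test [9 10 11]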

Regularised Regression — Ridge, LASSO, Elastic Net

When predictors are many and possibly correlated, OLS coefficient estimates become noisy, with wide confidence intervals. The cure is regularisation: penalise large coefficients during fitting. The two canonical penalties:

\[ \hat\beta_{\text{Ridge}} = \arg\min_\beta\;\lVert y - X\beta\rVert^2 + \lambda\sum_j \beta_j^2, \] \[ \hat\beta_{\text{LASSO}} = \arg\min_\beta\;\lVert y - X\beta\rVert^2 + \lambda\sum_j |\beta_j|. \]

  • Ridge (\(\ell_2\)) — shrinks coefficients smoothly toward zero. Never exactly zero. Good when you believe many predictors contribute small amounts.
  • LASSO (\(\ell_1\)) — shrinks and sets some coefficients exactly to zero. Performs simultaneous variable selection and shrinkage. Good when you believe most predictors are irrelevant. This is the property that made LASSO the standard tool at modern quant funds — out of thousands of candidate signals, it automatically zeroes out the dross.
  • Elastic Net — a convex combination of the two penalties. Use when predictors come in correlated groups.

The penalty strength \(\lambda\) (called alpha in scikit-learn) is the complexity knob. Pick it by cross-validation.
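A sketch comparing the three fits on simulated data with 30 features of which only five carry signal (sample size, noise, and the two alpha values are illustrative; in practice you would choose them by cross-validation as just described):

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import LinearRegression, Ridge, Lasso
    from sklearn.preprocessing import StandardScaler

    X, y = make_regression(n_samples=100, n_features=30, n_informative=5,
                           noise=10.0, random_state=0)
    X = StandardScaler().fit_transform(X)

    ols = LinearRegression().fit(X, y)
    ridge = Ridge(alpha=10.0).fit(X, y)
    lasso = Lasso(alpha=5.0).fit(X, y)

    for name, m in [('OLS', ols), ('Ridge', ridge), ('LASSO', lasso)]:
        print(name, "nonzero coefficients:", int(np.sum(np.abs(m.coef_) > 1e-8)))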

Notice the pattern: OLS coefficients are noisy across all 30 features; Ridge shrinks everyone modestly; LASSO turns off most features and keeps the few real ones. The LASSO sparsity is automatic feature selection. This is the recipe at the centre of every modern quant fund’s signal pipeline.

Module reference — sklearn.linear_model
  • Ridge(alpha=...) / RidgeCV(alphas=[...]) — closed-form, very fast.
  • Lasso(alpha=...) / LassoCV(alphas=[...], cv=5) — coordinate descent.
  • ElasticNet(alpha=..., l1_ratio=...) / ElasticNetCV.
  • Always standardise predictors first (StandardScaler) — penalties are scale-sensitive.
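Chaining the scaler and the cross-validated alpha search into a single object is what Pipeline is for; a sketch on the same kind of simulated data as above:

    from sklearn.datasets import make_regression
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LassoCV

    X, y = make_regression(n_samples=100, n_features=30, n_informative=5,
                           noise=10.0, random_state=0)

    pipe = Pipeline([('scale', StandardScaler()),
                     ('lasso', LassoCV(cv=5))])        # alpha chosen by 5-fold CV
    pipe.fit(X, y)
    print("chosen alpha:", pipe.named_steps['lasso'].alpha_)
    print("nonzero coefficients:", int((pipe.named_steps['lasso'].coef_ != 0).sum()))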

With \(p\) close to \(n\), OLS coefficients have huge variance and many will be statistically indistinguishable from noise. LASSO simultaneously regularises (reducing variance) and sets uninformative coefficients to exactly zero — handing you back a parsimonious, interpretable model with stable out-of-sample performance. This is the everyday workflow for systematic signal selection.

PCA and the Idea of a Factor Model

Principal Component Analysis (PCA) finds the orthogonal directions in feature space along which the data varies most. The first principal component is the linear combination of inputs with maximum variance; the second is orthogonal to the first and has the next-most variance; and so on. PCA is at the heart of every factor model in finance.

In equity returns, the first principal component is virtually always the market: a roughly equally-weighted average of stock returns. The second and third components are interpreted (after rotation) as size and value, the famous Fama–French factors. The same machinery, applied to the term structure of interest rates, produces level, slope, and curvature — the three factors that explain >99% of yield-curve variation.

Mechanics

  1. Standardise the columns of \(X\).
  2. Compute the covariance matrix \(\Sigma = \frac{1}{n-1} X^\top X\).
  3. Eigendecompose \(\Sigma = V \Lambda V^\top\).
  4. The eigenvectors (columns of \(V\)) are the principal components; the eigenvalues are their variances.
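The four steps translate directly into numpy; a sketch on simulated data (sklearn's PCA, described next, performs the same computation):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 6))
    X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # 1. standardise the columns

    Sigma = X.T @ X / (X.shape[0] - 1)                 # 2. covariance matrix
    eigvals, eigvecs = np.linalg.eigh(Sigma)           # 3. eigendecomposition
    order = np.argsort(eigvals)[::-1]                  #    sort by variance, descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    scores = X @ eigvecs[:, :2]                        # 4. project onto the top two components
    print(eigvals / eigvals.sum())                     #    fraction of variance per component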
Module reference — sklearn.decomposition.PCA

PCA(n_components=k).fit_transform(X) returns the top-\(k\) projected scores. .explained_variance_ratio_ is the fraction of variance captured by each component. .components_ is the matrix of eigenvectors (loadings). Always standardise first if features are on different scales.
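A sketch on simulated returns driven by a single common factor (the number of assets, betas, and volatilities are illustrative assumptions) — this is the setting that produces the scree-plot shape described next:

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    n_days, n_assets = 1000, 20
    market = rng.normal(scale=0.01, size=n_days)                   # the common factor
    betas = rng.uniform(0.8, 1.2, size=n_assets)
    returns = np.outer(market, betas) + rng.normal(scale=0.005, size=(n_days, n_assets))

    pca = PCA(n_components=5).fit(StandardScaler().fit_transform(returns))
    print(pca.explained_variance_ratio_)               # the first component dominates the rest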

In the scree plot the first bar dwarfs the others — the data lives essentially on a 1-D manifold (the common factor). Once you see that picture, the entire APT / Fama–French / Risk-Premia literature is just systematic study of which factors deserve to be in the model.

Pattern Recognition Beyond Linear Models — Trees and Ensembles

LASSO and PCA are linear. Real markets, real customer behaviour, real biological data are full of nonlinear interactions. The modern toolkit complements linear regression with tree ensembles — random forests and gradient-boosted trees — which automatically discover nonlinear patterns and interactions without you having to specify them.

A decision tree splits the feature space into axis-aligned rectangles, predicting a constant within each rectangle. A single tree over-fits badly; the magic is in the ensemble:

  • Random Forest — fit many trees on bootstrapped samples and a random subset of features at each split, then average. Reduces variance dramatically.
  • Gradient Boosting — fit each new tree to the residuals of the ensemble so far. Reduces bias.

Tree ensembles are the workhorses at Two Sigma, AQR’s machine-learning teams, and basically every data-science consultancy on the planet. They handle missing values, mixed-type predictors, and nonlinearities without ceremony.

Module reference — sklearn.ensemble
  • RandomForestRegressor(n_estimators=200, max_features='sqrt').
  • GradientBoostingRegressor(n_estimators=300, learning_rate=0.05, max_depth=3).
  • HistGradientBoostingRegressor — faster, scales to large datasets.
  • .feature_importances_ after .fit() — the per-feature relative contribution to splits; the modern analogue of LASSO sparsity.
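A sketch of the comparison described next — simulated data containing a multiplicative interaction and a sinusoid (the functional form and model settings are illustrative):

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.uniform(-2, 2, size=(1000, 3))
    y = X[:, 0] * X[:, 1] + np.sin(3 * X[:, 2]) + rng.normal(scale=0.2, size=1000)

    models = {
        'linear':   LinearRegression(),
        'forest':   RandomForestRegressor(n_estimators=200, random_state=0),
        'boosting': GradientBoostingRegressor(n_estimators=300, learning_rate=0.05,
                                              max_depth=3, random_state=0),
    }
    for name, m in models.items():
        cv_r2 = cross_val_score(m, X, y, cv=5, scoring='r2').mean()
        print(f"{name:9s} CV R² = {cv_r2:.2f}")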

On data with a multiplicative interaction and a sinusoid, linear regression captures essentially nothing while the ensembles pick up most of the signal. The price you pay is interpretability — a forest of 200 trees is much harder to explain than a 5-coefficient regression. That trade-off is the central theme of modern quantitative model selection.

LASSO produces a sparse linear model whose remaining nonzero coefficients are directly interpretable; Random Forest’s feature_importances_ ranks features by their contribution to nonlinear splits across hundreds of trees but does not give you a closed-form equation.

Chapter Wrap-up

Multiple regression is the basic tool; the entire chapter has been about how to use it responsibly. The pipeline a professional analyst runs on a fresh predictive-modelling problem is now in your hands:

  1. Frame the problem as supervised. Pick the target. Pick the error metric.
  2. Standardise the predictors. Plot the marginal distributions; transform if needed (log, polynomial, spline).
  3. Diagnose the simple linear fit. Residuals, QQ plot, leverage.
  4. Select variables: best subset for small \(p\), regularisation (LASSO / Elastic Net) for large \(p\).
  5. Cross-validate — out-of-sample, time-respecting if temporal.
  6. Reduce dimension when predictors are many and correlated (PCA).
  7. Reach for nonlinear ensembles when residual diagnostics or domain knowledge demand it.
  8. Report in-sample vs. CV-sample performance side by side; never hide the gap.

In Chapter 3 we step back and ask: what would change if we admitted uncertainty about the parameters themselves and combined that with prior beliefs? The answer is Bayesian inference — and it reframes everything in this chapter.
