Chapter 3: Rethinking Statistics with Bayesian Methods
Chapter Introduction
The classical statistics of Chapters 1 and 2 treats parameters as fixed unknowns and data as random. A Bayesian inverts the relationship: parameters are themselves random variables with distributions reflecting belief, and the data is used to update those beliefs. Once you have the posterior distribution of the parameter, every downstream question — point estimate, interval, prediction, decision — becomes a question about that distribution.
For a long time Bayesian methods were viewed as philosophically interesting but computationally impractical. Markov Chain Monte Carlo (MCMC), developed at Los Alamos in the early 1950s for weapons research and rediscovered by statisticians in the 1980s, broke that limitation: any posterior, no matter how complicated, can be approximated by sampling. Today Bayesian machinery appears throughout systematic finance — Bridgewater builds Bayesian regime-switching factor models; AQR fits hierarchical Bayesian shrinkage estimators of expected returns; Renaissance is widely understood (without confirmed publications) to use Bayesian decision theory for position sizing. The reason is simple: financial data is short, fat-tailed, and noisy, and prior information from theory and from the cross-section is too valuable to throw away.
This chapter teaches the four Bayesian moves that matter most in applied quantitative work. First, the conjugate update — the closed-form posterior that you can do on the back of an envelope. Second, MCMC — the engine that lets you turn any prior + likelihood combination into a posterior with code. Third, robust regression with Student-t errors — the fix for the fat tails that break classical OLS inference. Fourth, Bayesian change-point and regime detection — the pattern-recognition machinery that quant funds use to know when the world has shifted.
We do all of this in raw NumPy and SciPy. PyMC and Stan are wonderful, but Pyodide can’t load them in the browser, and more importantly: writing your own Metropolis sampler in twenty lines is the single best way to understand what those libraries are doing for you. After this chapter you will be able to read every paper on Bayesian quantitative methods and know exactly which step is being automated.
Table of Contents
- The Bayesian Frame — Priors, Likelihoods, Posteriors
- The Conjugate Update — Closed-Form Bayes
- Markov Chain Monte Carlo (MCMC) by Hand
- Robust Regression for Fat Tails
- Bayesian Linear Regression
- Pattern Recognition: Bayesian Change-Point Detection
- Hierarchical Models — Pooling Across Groups
The Bayesian Frame — Priors, Likelihoods, Posteriors
Bayes’ theorem, in its statistical form:
\[ \underbrace{p(\theta \mid y)}_{\text{posterior}} \;=\; \frac{\overbrace{p(y \mid \theta)}^{\text{likelihood}}\;\overbrace{p(\theta)}^{\text{prior}}}{\underbrace{p(y)}_{\text{marginal evidence}}} \;\propto\; p(y \mid \theta)\, p(\theta). \]
The prior \(p(\theta)\) encodes what you believed about the parameter before seeing the data. The likelihood \(p(y \mid \theta)\) is the same likelihood that classical statistics uses. Their product, normalised, is the posterior — what you should believe about \(\theta\) after the data is in.
Why a Bayesian cares about the prior
A flat (improper uniform) prior makes the Bayesian and the classical analyses numerically coincide for most simple models. But in practice you almost always have some prior — the Sharpe ratio of a random equity strategy is not centred on 10, the mortality rate among 30-year-olds is not 50%, the correlation between two unrelated assets is not far from 0. Encoding that prior produces estimates that are more stable in small samples and more honest in large ones. The cost is that you have to choose the prior; the benefit is that you cannot accidentally pretend you have no prior beliefs.
Three Bayesian outputs and what they mean
- Posterior mean / median / mode — point estimates, in order of increasing robustness to skew.
- Credible interval — a \(1-\alpha\) credible interval is a region with \(1-\alpha\) posterior probability. Unlike a frequentist confidence interval, a 95% credible interval does mean “the parameter is in here with 95% probability.”
- Posterior predictive — for a new observation \(\tilde y\), the distribution \(p(\tilde y \mid y) = \int p(\tilde y \mid \theta) p(\theta \mid y) d\theta\). The predictive automatically accounts for parameter uncertainty.
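To make the integral concrete, here is a minimal Monte Carlo sketch. The Beta(10, 4) posterior and the batch of 20 new trials are assumed numbers (the Beta(10, 4) anticipates the worked example in the next section):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical posterior for a success rate (numbers assumed here).
posterior = stats.beta(10, 4)

# Monte Carlo approximation of the posterior predictive for a new
# batch of 20 trials: draw theta from the posterior, then y-tilde | theta.
theta_draws = posterior.rvs(size=50_000, random_state=rng)
y_new = rng.binomial(n=20, p=theta_draws)

# Wider than Binomial(20, posterior mean): the predictive carries
# parameter uncertainty on top of sampling noise.
print(y_new.mean(), y_new.std())
```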
The Conjugate Update — Closed-Form Bayes
When the prior and likelihood are a conjugate pair, the posterior has the same parametric form as the prior. You update by arithmetic, no integration needed. The two most useful pairs:
Beta–Binomial
Prior: \(\theta \sim \text{Beta}(a, b)\). Likelihood: \(y \sim \text{Binomial}(n, \theta)\) with \(k\) successes. Posterior: \[ \theta \mid y \sim \text{Beta}(a + k,\; b + n - k). \] \(a\) and \(b\) behave like “pseudo-counts”: a \(\text{Beta}(2, 2)\) prior is the equivalent of having observed two successes and two failures already.
Normal–Normal (known variance)
Prior: \(\mu \sim N(\mu_0, \tau_0^2)\). Likelihood: \(y_i \sim N(\mu, \sigma^2)\) with \(\sigma\) known and sample mean \(\bar y\) over \(n\) observations. Posterior: \[ \mu \mid y \sim N\!\left(\frac{\mu_0/\tau_0^2 + n\bar y / \sigma^2}{1/\tau_0^2 + n/\sigma^2},\; \left(1/\tau_0^2 + n/\sigma^2\right)^{-1}\right). \] The posterior mean is a precision-weighted average of the prior mean and the sample mean. Precision = 1 / variance.
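A sketch of the arithmetic, with all numbers assumed for illustration: a \(N(0.05, 0.02^2)\) prior on a strategy's monthly mean return, updated with 36 months of data at a known noise level.

```python
import numpy as np

# Assumed numbers for illustration: prior N(0.05, 0.02^2) on a
# strategy's monthly mean return; 36 observed months, known sigma.
mu0, tau0 = 0.05, 0.02           # prior mean and prior sd
sigma, n, ybar = 0.04, 36, 0.08  # known noise sd, sample size, sample mean

prior_prec = 1.0 / tau0**2       # precision = 1 / variance
data_prec = n / sigma**2

post_var = 1.0 / (prior_prec + data_prec)
post_mean = post_var * (prior_prec * mu0 + data_prec * ybar)

print(f"posterior: N({post_mean:.4f}, {np.sqrt(post_var):.4f}^2)")
```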
scipy.stats beta and norm
- `stats.beta(a, b)` — Beta distribution with shape parameters \(a\) and \(b\). `.ppf(0.025)` and `.ppf(0.975)` give the 95% credible interval directly.
- `stats.norm(loc, scale)` — Normal. `loc` is the mean, `scale` is the standard deviation (not variance).
- The conjugate-update arithmetic is just Python; no special library is needed.
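A minimal sketch of the Beta–Binomial update above, using a Beta(2, 2) prior and 8 successes in 10 trials; the 0.6 threshold in the last line is a made-up business threshold for illustration:

```python
from scipy import stats

# Prior: Beta(2, 2), i.e. two pseudo-successes and two pseudo-failures.
a, b = 2, 2
n, k = 10, 8                     # observed: 8 successes in 10 trials

# Conjugate update: posterior is Beta(a + k, b + n - k).
posterior = stats.beta(a + k, b + n - k)

print("posterior mean:", posterior.mean())                    # ~0.714
print("95% credible interval:",
      (posterior.ppf(0.025), posterior.ppf(0.975)))
# Posterior probability that the true rate exceeds a business
# threshold; the 0.6 here is a made-up threshold for illustration.
print("P(theta > 0.6 | data):", posterior.sf(0.6))
```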
The last line — the posterior probability that the true rate exceeds a business threshold — is the kind of statement a Bayesian analysis lets you make directly. A frequentist can compute a p-value for the same hypothesis but cannot interpret it as the probability that the hypothesis is true.
The posterior is Beta(2+8, 2+2) = Beta(10, 4), with mean 10/(10+4) ≈ 0.714. The 10 observations have pulled the prior mean of 0.5 most of the way toward the MLE of 0.8, but not all the way.
Markov Chain Monte Carlo (MCMC) by Hand
When the prior and likelihood are not conjugate, there is no closed form for the posterior. MCMC builds a Markov chain whose stationary distribution is the posterior; running the chain produces samples from the posterior. The most basic MCMC algorithm is Metropolis–Hastings:
- Start at any \(\theta_0\).
- Propose \(\theta^* = \theta_t + \epsilon\) where \(\epsilon\) is a small random step (e.g., \(N(0, \sigma_{\text{prop}}^2)\)).
- Compute the acceptance ratio \(\alpha = \min\!\left(1,\, \frac{p(\theta^* \mid y)}{p(\theta_t \mid y)}\right) = \min\!\left(1,\, \frac{p(y \mid \theta^*) p(\theta^*)}{p(y \mid \theta_t) p(\theta_t)}\right)\). (This form assumes a symmetric proposal — the plain Metropolis case; general Metropolis–Hastings adds a proposal-density correction.)
- With probability \(\alpha\), set \(\theta_{t+1} = \theta^*\); otherwise set \(\theta_{t+1} = \theta_t\).
- Discard a burn-in period; if the chain has converged (see the diagnostics below), the remaining samples are approximately draws from the posterior.
Everything in scientific Bayesian computation — Stan, PyMC, NumPyro — is an elaboration of this idea. The proposal kernel may be smarter (Hamiltonian Monte Carlo, NUTS), the chain may run in parallel, the diagnostics may be fancier, but the core is what follows.
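Here is that sampler in roughly twenty lines of NumPy and SciPy, a minimal sketch rather than production code. The synthetic fat-tailed data, the \(N(0, 5^2)\) prior, and the proposal scale of 0.2 are all assumptions chosen for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic fat-tailed data; the true location 0.5 is assumed here.
y = stats.t(df=4, loc=0.5, scale=1.0).rvs(size=200, random_state=rng)

def log_post(mu):
    # Non-conjugate model: Student-t likelihood, N(0, 5^2) prior on mu.
    return (stats.t.logpdf(y, df=4, loc=mu, scale=1.0).sum()
            + stats.norm.logpdf(mu, loc=0.0, scale=5.0))

n_iter, prop_sd = 10_000, 0.2
chain = np.empty(n_iter)
mu, lp = 0.0, log_post(0.0)
accepts = 0
for t in range(n_iter):
    mu_star = mu + rng.normal(0.0, prop_sd)      # symmetric proposal
    lp_star = log_post(mu_star)
    # Accept with probability min(1, posterior ratio), done in log space.
    if np.log(rng.uniform()) < lp_star - lp:
        mu, lp, accepts = mu_star, lp_star, accepts + 1
    chain[t] = mu

burn = 2_000                                     # discard burn-in
print("acceptance rate:", accepts / n_iter)
print("posterior mean:", chain[burn:].mean())
print("95% credible interval:", np.percentile(chain[burn:], [2.5, 97.5]))
```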
The trace plot is the most important diagnostic — it should look like a “fuzzy caterpillar” mixing around a stable mean. A trace that wanders, gets stuck, or trends has not converged. Acceptance rates between roughly 0.20 and 0.50 are healthy; outside that range, tune the proposal standard deviation.
When to use MCMC
Use MCMC when there is no conjugate prior, when you want full posterior distributions (not just point estimates), when you want to mix hierarchical structure with arbitrary likelihoods, or when you want to do honest model comparison via marginal likelihoods. Don’t use MCMC when a closed-form conjugate posterior exists — it’s a needless approximation.
Robust Regression for Fat Tails
OLS regression assumes the errors are Gaussian. When the errors are heavy-tailed — as is universally true for asset returns — a single extreme observation can dominate the fit, pull \(\hat\beta\) around, and break inference. The Bayesian fix is direct: replace the Gaussian likelihood with a Student-t likelihood.
The Student-t with low degrees of freedom \(\nu\) has heavy tails. As \(\nu \to \infty\) it converges to the Gaussian; at \(\nu = 4\) it has noticeably heavier tails; at \(\nu = 1\) it is the Cauchy. Fitting a regression with t-distributed errors automatically down-weights outliers — the negative log-likelihood grows quadratically only for small residuals, then only logarithmically in the tails.
OLS chases the outliers; the t-regression ignores them. This is why careful asset-pricing studies report both OLS and a heavy-tailed alternative — and treat a large gap between the two as a red flag.
scipy.optimize.minimize
`minimize(neg_loglik, x0, method='BFGS')` performs general-purpose unconstrained minimisation. The function being minimised takes the parameter vector as its first argument and returns a scalar (the negative log-likelihood). For high-dimensional problems, supply analytical gradients via the `jac=` argument.
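A minimal sketch of the t-likelihood fit, under assumed synthetic data and with \(\nu\) fixed at 4 rather than estimated; log-parametrising the scale keeps the problem unconstrained:

```python
import numpy as np
from scipy import stats, optimize

rng = np.random.default_rng(7)

# Synthetic regression with fat-tailed errors (assumed numbers).
n = 200
x = rng.normal(size=n)
y = 1.0 + 0.5 * x + stats.t(df=3).rvs(size=n, random_state=rng)

def neg_loglik(params):
    # params = (intercept, slope, log scale); nu fixed at 4 here for
    # simplicity -- it could be estimated as a fourth parameter.
    a, b, log_s = params
    resid = y - a - b * x
    return -stats.t.logpdf(resid, df=4, scale=np.exp(log_s)).sum()

# Log-parametrising the scale keeps the optimisation unconstrained.
res = optimize.minimize(neg_loglik, x0=np.zeros(3), method='BFGS')
a_hat, b_hat, log_s_hat = res.x
print("t-regression (intercept, slope, scale):",
      a_hat, b_hat, np.exp(log_s_hat))

# OLS comparison: the heavy-tailed fit is much less outlier-driven.
slope_ols, intercept_ols = np.polyfit(x, y, deg=1)
print("OLS (intercept, slope):", intercept_ols, slope_ols)
```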
Bayesian Linear Regression
Classical OLS gives a point estimate \(\hat\beta\) and a confidence ellipse around it. Bayesian linear regression gives the joint posterior distribution of \(\beta\) and the noise variance \(\sigma^2\). With a conjugate Normal-Inverse-Gamma prior the posterior is available in closed form, and it has a beautiful interpretation: the posterior precision is the prior precision plus the data precision, and the posterior mean is a precision-weighted average.
Practically, two things change when you go Bayesian on a regression:
- You can place informative priors on individual coefficients — particularly powerful in finance, where economic theory often tells you that \(\beta_{\text{market}}\) should be near 1.
- The posterior predictive distribution propagates parameter uncertainty into predictions. A naive OLS prediction interval ignores the uncertainty in \(\hat\beta\); the Bayesian predictive doesn’t.
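A minimal sketch of the conjugate update, simplified to a known noise standard deviation instead of the full Normal-Inverse-Gamma treatment; the 30 synthetic points, the prior settings, and the evaluation point are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

# 30 synthetic points; true coefficients, noise level, and prior
# settings are assumed numbers for the sketch.
n, sigma = 30, 0.5
x = rng.uniform(-2, 2, size=n)
X = np.column_stack([np.ones(n), x])      # intercept + slope design
y = X @ np.array([0.2, 1.0]) + rng.normal(0, sigma, size=n)

# Gaussian prior on (intercept, slope), slope centred at 1 in the
# spirit of the beta_market argument above; sigma treated as known.
mu0 = np.array([0.0, 1.0])
Sigma0 = np.diag([1.0, 0.25**2])

# Conjugate update: posterior precision = prior precision + data
# precision; posterior mean = precision-weighted average.
prec0 = np.linalg.inv(Sigma0)
prec_post = prec0 + X.T @ X / sigma**2
Sigma_post = np.linalg.inv(prec_post)
mu_post = Sigma_post @ (prec0 @ mu0 + X.T @ y / sigma**2)

# Posterior predictive at a new point carries noise AND parameter
# uncertainty: var = sigma^2 + x_star' Sigma_post x_star.
x_star = np.array([1.0, 1.5])
pred_sd = np.sqrt(sigma**2 + x_star @ Sigma_post @ x_star)

print("posterior mean (intercept, slope):", mu_post)
print("predictive sd at x = 1.5:", pred_sd)
```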
The posterior predictive band from this model is narrower than an OLS prediction interval on the same 30 points, because the slope prior has done useful work. As \(n\) grows the prior fades and Bayes converges to OLS; in small samples the prior is your friend.
Pattern Recognition: Bayesian Change-Point Detection
A change-point is a moment in a time series at which the data-generating distribution shifts: the volatility of a stock jumps after Lehman, a customer-acquisition cost regime changes after a new ad campaign, the failure rate of a manufacturing line creeps up after a part wears out. Detecting change-points is one of the cleanest pattern-recognition problems in statistics, and the Bayesian solution is elegant.
For a sequence \(y_1, \dots, y_T\), model a single change-point \(\tau\): \[ y_t \sim N(\mu_1, \sigma^2) \text{ for } t \le \tau, \qquad y_t \sim N(\mu_2, \sigma^2) \text{ for } t > \tau. \] Place a uniform prior on \(\tau \in \{1, \dots, T-1\}\) and flat priors on the means. The posterior over \(\tau\) is then proportional to the integrated likelihood of the data given the split — a one-dimensional grid sum.
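A minimal sketch of that grid sum. The series length, the shift size, and the true change-point at \(t = 70\) are assumed; \(\sigma\) is treated as known, where in practice you would plug in an estimate:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic series with a mean shift at t = 70 (assumed for the sketch).
T, tau_true = 120, 70
y = np.concatenate([rng.normal(0.0, 1.0, tau_true),
                    rng.normal(1.5, 1.0, T - tau_true)])
sigma = 1.0        # treated as known; in practice use a plug-in estimate

def log_seg_marglik(seg):
    # Gaussian likelihood with a flat prior on the segment mean,
    # integrated out in closed form (sigma known).
    n = len(seg)
    ss = np.sum((seg - seg.mean()) ** 2)
    return (-(n - 1) / 2 * np.log(2 * np.pi * sigma**2)
            - 0.5 * np.log(n) - ss / (2 * sigma**2))

# Uniform prior over tau: the posterior is proportional to the product
# of the two segments' integrated likelihoods -- a 1-D grid sum.
log_post = np.array([log_seg_marglik(y[:tau]) + log_seg_marglik(y[tau:])
                     for tau in range(1, T)])
post = np.exp(log_post - log_post.max())
post /= post.sum()

idx = np.argmax(post)
print("posterior mode of tau:", idx + 1)
print("posterior mass within +/-3 of mode:", post[max(idx - 3, 0):idx + 4].sum())
```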
The posterior places a sharp peak almost exactly at the true change-point. Multi-change-point extensions — Bayesian change-point analysis (BCP) and Bayesian Online Change-point Detection (BOCD) — generalise this idea to multiple shifts and to streaming data, and are the foundation of regime-detection pipelines at quant funds.
The shape of the posterior reveals how certain the change-point location is. A sharply peaked posterior means the regime shift is unambiguous; a broad or multi-modal posterior means several candidate change-points are roughly equally plausible — which is itself important information for any downstream decision (re-hedging, re-training, alerting).
Hierarchical Models — Pooling Across Groups
The single most important Bayesian idea for cross-sectional asset pricing is partial pooling. Suppose you have 500 stocks and you want to estimate each stock’s expected return. Two extremes:
- No pooling — estimate each stock’s mean separately. Noisy: each estimate uses only ~250 observations (roughly one year of daily returns).
- Complete pooling — assume all 500 stocks have the same mean. Biased: ignores cross-sectional differences.
The Bayesian middle ground is a hierarchical model: each stock’s mean is drawn from a common distribution whose parameters are themselves estimated.
\[ \mu_i \sim N(\mu_0, \tau^2), \qquad y_{it} \mid \mu_i \sim N(\mu_i, \sigma^2). \]
The posterior for each \(\mu_i\) becomes a shrinkage estimator: each stock’s mean is pulled toward the cross-sectional mean by an amount that depends on how noisy that stock’s individual data is. This is the statistical machinery behind the famous Black–Litterman model (1992), Jorion’s “Bayes–Stein” estimator (1986), and the hierarchical factor models used at AQR, Bridgewater, and BlackRock’s QIS group.
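A minimal empirical-Bayes sketch of the shrinkage arithmetic on simulated data. The group count, sample sizes, and variance components are assumptions, and \(\mu_0\) and \(\tau^2\) are set by plug-in moment estimates rather than full posterior inference:

```python
import numpy as np

rng = np.random.default_rng(11)

# Simulated cross-section (assumed sizes): 50 "stocks", 20 obs each.
G, n = 50, 20
tau, sigma = 0.5, 2.0               # between-group and within-group sd
mu_true = rng.normal(0.0, tau, G)   # each group's true mean
y = mu_true[:, None] + rng.normal(0.0, sigma, (G, n))

ybar = y.mean(axis=1)               # no-pooling estimates (per-group MLE)

# Empirical-Bayes plug-in: estimate mu0 and tau^2 from the
# cross-section, then shrink each group mean by its precision weight.
mu0 = ybar.mean()
tau2_hat = max(ybar.var() - sigma**2 / n, 1e-12)
w = tau2_hat / (tau2_hat + sigma**2 / n)   # weight on the group's own data
mu_shrunk = mu0 + w * (ybar - mu0)

rmse = lambda est: np.sqrt(np.mean((est - mu_true) ** 2))
print("RMSE no pooling:", rmse(ybar))
print("RMSE shrinkage :", rmse(mu_shrunk))
```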
The shrunk estimates are closer to the truth, on average, than the unbiased individual means. This is one of the most counter-intuitive results in statistics: a biased estimator can have lower mean-squared error than the unbiased MLE. It is the same Stein-shrinkage principle that underpins the Black–Litterman model.
Each group’s individual mean is noisy in small samples. Borrowing strength from the global mean trades a tiny amount of bias for a large reduction in variance, and prediction error is governed by the bias–variance trade-off. When within-group sample sizes are small relative to between-group spread, shrinkage almost always wins on RMSE.
Chapter Wrap-up
Bayesian inference is the most general statistical framework available, and four of its moves are non-negotiable for modern quantitative work:
- Conjugate updates — small models, fast iteration, no MCMC needed.
- MCMC — the engine that lets you fit any prior + likelihood combination; understand Metropolis–Hastings even if you call PyMC.
- Heavy-tailed (t-likelihood) regression — the fix for the fat tails that destroy classical inference on financial data.
- Hierarchical shrinkage — the cross-sectional partner of regularised regression; the statistical core of Black–Litterman and every modern multi-asset model.
Plus the pattern-recognition cousins: Bayesian change-point detection for regime shifts, and posterior predictive checks for honest validation.
In Chapter 4 we add the one thing this chapter assumed away: time. Autocorrelation, stationarity, volatility clustering, cointegration, and Markov-switching regimes — the dynamic backbone of every systematic trading strategy.