Chapter 1: Distributions, Tails, and Anomalies
Chapter Introduction
Every statistical method on Wall Street and at the world’s most successful quantitative funds — Renaissance Technologies, Two Sigma, DE Shaw, Citadel, Bridgewater — begins where this chapter begins: with the careful, almost paranoid, examination of a single variable and then of a pair of variables. Before any predictive model is fit, before any neural network is trained, before any backtest is run, somebody at one of those firms has stared at a distribution, asked whether its tails are heavier than a Gaussian would predict, decided whether a difference between two samples is a real signal or a coincidence, and quantified whether two series move together in a way that is not an accident of the data being shown.
Jim Simons, the founder of Renaissance, has said in public interviews that his fund’s edge does not come from any single brilliant idea; it comes from systematically looking for patterns that other people miss, and then testing those patterns to a level of statistical rigor that other people are not willing to apply. The “pattern hunt” is built out of four primitives, and they form the spine of this chapter: distributions, confidence intervals, hypothesis tests, and measures of association. Add to those a fifth — extreme value theory for the patterns that live in the tails — and a sixth — multiple-testing correction for the patterns that look real only because you searched too hard — and you have the entire statistical foundation that distinguishes professional quantitative research from amateur backtesting.
This chapter teaches all of it. The examples use whatever data illustrates the concept most clearly — sometimes financial returns, sometimes weather, sometimes synthetic data generated on the fly — because the statistical machinery is universal. A heavy tail in monthly rainfall obeys the same generalized extreme value distribution as a heavy tail in equity drawdowns. The Pearson correlation that lies to you about S&P sector co-movement lies in exactly the same way about ice-cream sales and shark attacks. Learning to reshape your statistical thinking — from “what is the mean” to “what is the shape, the tail, the relationship, and the noise floor” — is the single largest jump from being a person who knows statistics to being a person who can use them to find tradeable patterns.
A note on the word “reshaping”
By “reshaping statistics” we do not mean reshaping a DataFrame. We mean the disciplined act of seeing the world distributionally rather than averagely. Most introductory statistics courses train you to compute a mean and a standard deviation, run a t-test, and report a number. That training is fine for a tame world. It is dangerous for a world where outliers do most of the work, where one observation in a thousand drives the year’s P&L, and where the next observation may not look like any of the previous ones. The chapter asks you to internalize a distributional mindset: every number is a draw from a distribution, every comparison is a comparison of distributions, every “relationship” is a property of the joint distribution, and every honest statement about the world has uncertainty attached.
What you will be able to do by the end
You will be able to fit and visualize empirical and theoretical distributions in Python; build a confidence interval by simulation when no closed-form exists; run and correctly interpret one-sample, two-sample, and paired hypothesis tests; quantify linear, monotonic, and arbitrary nonlinear association between two variables; identify and model the tail of a distribution with GEV and Generalized Pareto fits; detect statistical outliers and anomalies using both classical (Mahalanobis, robust z-score) and modern (Isolation Forest) machinery; and apply Bonferroni and Benjamini–Hochberg corrections when you have searched many hypotheses at once. These are not toy skills — they are the daily, executable techniques of professional quantitative research desks.
Table of Contents
- Distributions — Empirical, Theoretical, and Kernel Density
- Confidence Intervals by Simulation (Bootstrap)
- Hypothesis Testing in the Real World
- Association — Linear, Monotonic, and Nonlinear
- Extreme Value Theory — Patterns in the Tails
- Anomaly Detection as Pattern Recognition
- Multiple Testing Correction
Distributions — Empirical, Theoretical, and Kernel Density
A distribution is the most honest summary of a variable you can produce. A mean throws away information; a standard deviation throws away even more; a histogram throws away very little; a fitted density throws away essentially nothing. The first habit of a serious analyst is to look at the shape of every variable before doing anything else with it.
There are three complementary ways to look at a univariate distribution: the empirical one (histogram or empirical CDF), the theoretical one (a parametric family like Normal, Student-t, or Generalized Extreme Value), and the smoothed empirical one (a kernel density estimate, or KDE). Hedge funds use all three: histograms for a first look, theoretical fits to extrapolate into tails they haven’t seen, and KDEs when they don’t yet want to commit to a parametric family.
Why the shape matters before any model
Suppose you are about to fit a linear regression. Linear regression treats the noise as Gaussian. If your dependent variable has a tail that is materially heavier than Gaussian — which is essentially always true for asset returns, insurance claims, server latencies, and many medical measurements — then your standard errors are wrong, your p-values are wrong, your confidence intervals are wrong, and any decision that depends on those numbers is therefore wrong. The only way to know is to look at the shape first.
scipy.stats distributions
SciPy exposes ~100 probability distributions under scipy.stats. Each one is an object with the same four core methods: .pdf(x) (density), .cdf(x) (cumulative probability), .ppf(q) (quantile / inverse CDF), and .rvs(size=n) (random samples). You parameterize via a loc (shift) and scale (stretch) plus distribution-specific shape parameters. Example: stats.norm(loc=0, scale=1).pdf(0) is the standard normal density at zero; stats.t(df=4).cdf(2) is the probability a t(4) variate is at most 2.
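As a minimal sketch of this workflow, the snippet below draws a synthetic heavy-tailed sample and overlays the three views discussed next: a histogram, a maximum-likelihood normal fit, and a KDE. The t(3) sample, seed, and bin count are illustrative choices, not the chapter's dataset.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(0)
data = rng.standard_t(df=3, size=2000)            # heavy-tailed sample standing in for returns

# Empirical view: histogram.  Theoretical view: normal fitted by maximum likelihood.
# Smoothed empirical view: Gaussian kernel density estimate.
loc, scale = stats.norm.fit(data)
grid = np.linspace(data.min(), data.max(), 400)
kde = stats.gaussian_kde(data)

plt.hist(data, bins=60, density=True, alpha=0.4, label="histogram (empirical)")
plt.plot(grid, stats.norm.pdf(grid, loc, scale), label="fitted normal (theoretical)")
plt.plot(grid, kde(grid), linestyle="--", label="KDE (smoothed empirical)")
plt.legend()
plt.show()
```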
The histogram is what the data says; the fitted theoretical pdf is what the theory says; the KDE is what a smoother, data-driven density says. When all three agree, you have a tame variable. When they disagree — when the empirical histogram sits consistently above the theoretical pdf in the tails — you have a fat-tailed variable, and the conventional Gaussian machinery will lead you astray.
The empirical CDF — the most honest one-line summary
The empirical CDF \(\hat F_n(x) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{X_i \le x\}\) assigns to every value \(x\) the fraction of the data that is at most \(x\). It is non-parametric (assumes nothing), monotone, and converges uniformly to the true CDF (Glivenko–Cantelli). When you cannot decide which theoretical family to fit, plot the empirical CDF — it never lies.
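A short sketch of the ECDF, computed by hand and plotted for a Gaussian and a t(3) sample so the tail difference described next is visible; the sample sizes and seed are arbitrary.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
samples = {"Normal(0,1)": rng.standard_normal(2000),
           "t(3)": rng.standard_t(df=3, size=2000)}

def ecdf(data):
    """Sorted values and the fraction of observations at or below each value."""
    x = np.sort(data)
    return x, np.arange(1, len(x) + 1) / len(x)

for label, data in samples.items():
    x, F = ecdf(data)
    plt.step(x, F, where="post", label=label)
plt.xlabel("x"); plt.ylabel(r"$\hat F_n(x)$"); plt.legend()
plt.show()
```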
Notice how the t(3) ECDF rises later from zero (its left tail extends further) and approaches one later (right tail extends further). The Gaussian saturates by \(\pm 3\); the t(3) is still climbing at \(\pm 6\). That difference is the tail risk.
Fitting a theoretical distribution by maximum likelihood
When you commit to a family — say, Student-t — SciPy will pick the parameters that maximize the likelihood of your data under that family.
.fit() on a SciPy distribution
stats.t.fit(data) returns the maximum-likelihood estimates of (df, loc, scale). stats.norm.fit(data) returns (loc, scale). The general pattern is that each distribution object has a .fit() classmethod that takes a 1-D array and returns the parameter tuple. You can hold a parameter fixed with the fscale= / floc= / fdf= keywords.
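A hedged sketch of the fit-and-compare step on synthetic heavy-tailed data; the log-likelihoods and AIC values it prints are what the next paragraph refers to. The generating distribution, scale, and sample size are arbitrary.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
data = rng.standard_t(df=3, size=5000) * 0.01          # synthetic heavy-tailed "returns"

# Maximum-likelihood fits
df_hat, loc_t, scale_t = stats.t.fit(data)
loc_n, scale_n = stats.norm.fit(data)

# Log-likelihood of the data under each fitted model
ll_t = stats.t.logpdf(data, df_hat, loc_t, scale_t).sum()
ll_norm = stats.norm.logpdf(data, loc_n, scale_n).sum()

# AIC = 2k - 2*logL penalizes the t for its extra (df) parameter: k = 3 vs. k = 2
aic_t, aic_norm = 2 * 3 - 2 * ll_t, 2 * 2 - 2 * ll_norm
print(f"logL t: {ll_t:.1f}   logL normal: {ll_norm:.1f}")
print(f"AIC  t: {aic_t:.1f}   AIC  normal: {aic_norm:.1f}   (lower is better)")
```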
The log-likelihood comparison above is the rigorous version of “looking at the QQ plot and saying the tails don’t fit.” In practice you would also report AIC (Akaike) or BIC (Bayesian information criterion), which penalize the t for its extra parameter.
The data has more mass in the extreme negatives than the Gaussian predicts — a heavier left tail. For asset returns this corresponds to crash risk being larger than Gaussian-based VaR would suggest.
Confidence Intervals by Simulation (Bootstrap)
The textbook confidence interval for a mean — \(\bar X \pm 1.96\, s / \sqrt n\) — only works when \(\bar X\) is approximately Gaussian. For most quantities a real analyst cares about (the median, a Sharpe ratio, a quantile, a regression coefficient under heteroskedasticity, a maximum drawdown) there is no clean closed form. The bootstrap, introduced by Bradley Efron in 1979, gives you a CI for any statistic by resampling the data itself.
The bootstrap recipe
- From your sample of size \(n\), draw \(n\) observations with replacement. Call the resample \(X^{*}\).
- Compute your statistic \(\hat\theta^{*} = T(X^{*})\).
- Repeat \(B\) times (usually \(B = 2{,}000\) or \(10{,}000\)).
- The \(2.5^{\text{th}}\) and \(97.5^{\text{th}}\) percentiles of the \(B\) replicates form a 95% confidence interval (percentile method).
The bootstrap works because the empirical distribution of the data is a good stand-in for the true distribution; resampling from it mimics drawing fresh samples from the population.
numpy resampling
rng.choice(arr, size=n, replace=True) draws n elements from arr with replacement. Pair it with np.random.default_rng(seed) for reproducibility. np.percentile(samples, [2.5, 97.5]) gives the percentile CI.
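A minimal percentile-bootstrap sketch built from exactly those two calls; the statistic here is the median, but any function of a resample can be dropped in. B, the seed, and the t(3) sample are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.standard_t(df=3, size=500)                    # a heavy-tailed sample

def bootstrap_ci(data, statistic, B=10_000, alpha=0.05, rng=None):
    """Percentile-method bootstrap CI for an arbitrary statistic."""
    rng = rng or np.random.default_rng()
    n = len(data)
    reps = np.empty(B)
    for b in range(B):
        resample = rng.choice(data, size=n, replace=True)   # draw n with replacement
        reps[b] = statistic(resample)
    return np.percentile(reps, [100 * alpha / 2, 100 * (1 - alpha / 2)])

print("median:", np.median(data))
print("95% bootstrap CI:", bootstrap_ci(data, np.median, rng=rng))
```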
A point estimate without a CI is professionally embarrassing. The bootstrap is the universal solvent — when in doubt, resample.
When the bootstrap fails
The bootstrap is not magic. It fails when the statistic depends on the order of the data (use a block bootstrap for time series — see Chapter 6), when the statistic is a sample maximum or minimum (the empirical distribution doesn’t extrapolate beyond what was seen), and when the sample size is so small that the empirical distribution is a poor approximation to the truth (a few dozen points).
Hypothesis Testing in the Real World
A hypothesis test is a structured way to decide whether what you see in the data is real or could have been produced by chance. It is the most-used and most-abused tool in applied statistics. Used well, it is the gate-keeper that stops you from trading a backtested strategy that is fake. Used badly, it is the engine of the replication crisis in science and of every blown-up alpha at every quant fund.
The logic — null, alternative, p-value
You set up a null hypothesis \(H_0\) that represents “nothing interesting is happening” (mean is zero, two groups have the same mean, correlation is zero). The alternative \(H_1\) is what you would like to claim (mean is positive, group A is different from group B, correlation is non-zero). You compute a test statistic \(T\) from the data whose distribution under \(H_0\) is known, and then a p-value — the probability under \(H_0\) of getting a test statistic at least as extreme as the one you observed. A small p-value (conventionally \(p < 0.05\)) is evidence against \(H_0\).
Three things to know that every textbook glosses over:
- The p-value is not the probability that \(H_0\) is true. It is \(P(\text{data this extreme} \mid H_0)\), not \(P(H_0 \mid \text{data})\).
- \(p < 0.05\) means roughly “if there were truly no effect, you’d see something this extreme one time in twenty.” If you run 20 different tests on noise, you should expect one to be significant.
- A non-significant result does not prove \(H_0\). Absence of evidence is not evidence of absence.
Common tests and when to use them
scipy.stats tests
Each test returns a (statistic, p_value) tuple (or a named result object).
- stats.ttest_1samp(x, popmean) — is the mean of x equal to popmean?
- stats.ttest_ind(x, y, equal_var=False) — Welch’s two-sample t-test (does NOT assume equal variances).
- stats.ttest_rel(x, y) — paired t-test on matched samples.
- stats.mannwhitneyu(x, y) — non-parametric alternative when the data is not Gaussian.
- stats.wilcoxon(x, y) — non-parametric paired test.
- stats.kstest(x, 'norm') — Kolmogorov–Smirnov goodness-of-fit.
- stats.shapiro(x) — Shapiro–Wilk test of normality.
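A short sketch running the parametric and non-parametric two-sample tests side by side on synthetic data; the means, variances, and sample sizes are made up for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
x = rng.normal(loc=0.00, scale=1.0, size=200)
y = rng.normal(loc=0.25, scale=1.5, size=150)            # different mean and variance

t_stat, p_t = stats.ttest_ind(x, y, equal_var=False)     # Welch's t-test
u_stat, p_u = stats.mannwhitneyu(x, y)                   # rank-based alternative
print(f"Welch t:        statistic={t_stat:.2f}, p={p_t:.4f}")
print(f"Mann-Whitney U: statistic={u_stat:.1f}, p={p_u:.4f}")
```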
The parametric (t) and non-parametric (Mann–Whitney) tests usually agree when the data is roughly Gaussian. When they disagree, trust the non-parametric one if your sample is small or visibly non-Gaussian.
Hedge-fund pitfall: data snooping
If you test 200 candidate “signals” at the 5% level, you should expect about 10 false positives by pure chance. Renaissance’s Robert Mercer is widely quoted as saying that the difference between Renaissance and other funds is not that they find better signals, but that they don’t get fooled by the false ones. We come back to multiple-testing correction in the final section of this chapter.
Suppose you test 50 candidate signals and 4 come back significant at the 5% level. Is that evidence of real alpha? Probably not. Under the null of no effect across all 50, you’d expect \(50 \times 0.05 = 2.5\) false positives by chance alone. Four “significant” results is well within what noise produces. You must apply a multiple-testing correction — see the Multiple Testing Correction section below.
Association — Linear, Monotonic, and Nonlinear
Two variables can be related in many shapes, and a single correlation coefficient summarises only the linear part of the relationship. Top quantitative funds maintain a hierarchy of association measures and apply them in escalating sequence: Pearson for linear, Spearman for monotonic, distance correlation or mutual information for arbitrary nonlinear dependence.
Pearson — the linear story
\(\rho_{XY} = \dfrac{\operatorname{Cov}(X, Y)}{\sigma_X \sigma_Y} \in [-1, 1].\)
Pearson’s \(\rho\) is the cosine of the angle between the centred \(X\) and \(Y\) vectors. It is exactly the right measure when the relationship is linear and the noise is roughly Gaussian. It is the wrong measure for an upside-down parabola (Pearson can be zero even when \(Y = -X^2\)), for monotone non-linear curves, and for relationships dominated by outliers.
Spearman — the rank story
Replace \(X\) and \(Y\) by their ranks, then compute Pearson on the ranks. The result, Spearman’s \(\rho_S\), is the linear correlation of the ranks. It is robust to outliers and detects any monotone relationship — straight, curved, or step-like.
Distance correlation and mutual information — the nonlinear story
Both measures equal zero if and only if \(X\) and \(Y\) are statistically independent. They detect U-shapes, V-shapes, sinusoids, and anything else a Pearson would miss. Renaissance-style pattern hunters compute all three on every candidate pair and flag the cases where Pearson is small but distance correlation is large — those are the hidden nonlinear patterns.
- np.corrcoef(x, y) — Pearson correlation matrix.
- stats.spearmanr(x, y) — Spearman.
- stats.kendalltau(x, y) — Kendall’s tau (another rank measure, slower but more interpretable for small samples).
- sklearn.feature_selection.mutual_info_regression(X, y) — estimated mutual information for continuous targets.
- For distance correlation use the dcor package (not in Pyodide by default; below we implement it in 6 lines of NumPy).
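The promised NumPy implementation, using the biased V-statistic estimator of Székely and Rizzo, applied to a quadratic relationship where Pearson and Spearman both fail; the sample size, noise level, and seed are arbitrary.

```python
import numpy as np
from scipy import stats

def distance_correlation(x, y):
    """Sample distance correlation between two 1-D arrays (V-statistic version)."""
    x, y = np.asarray(x, float).reshape(-1, 1), np.asarray(y, float).reshape(-1, 1)
    a, b = np.abs(x - x.T), np.abs(y - y.T)                    # pairwise distance matrices
    A = a - a.mean(0) - a.mean(1, keepdims=True) + a.mean()    # double centring
    B = b - b.mean(0) - b.mean(1, keepdims=True) + b.mean()
    dcov2 = (A * B).mean()                                     # squared distance covariance
    return np.sqrt(dcov2 / np.sqrt((A * A).mean() * (B * B).mean()))

rng = np.random.default_rng(3)
x = rng.uniform(-2, 2, 1000)
y = -x**2 + rng.normal(scale=0.3, size=1000)     # strong dependence, zero linear correlation

print("Pearson: ", np.corrcoef(x, y)[0, 1])      # ~ 0
print("Spearman:", stats.spearmanr(x, y)[0])     # ~ 0
print("dCor:    ", distance_correlation(x, y))   # clearly > 0
```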
Look at the quadratic case: Pearson is essentially zero, Spearman is close to zero, but distance correlation is large. A research desk that only checks Pearson would conclude these two variables are unrelated and would throw away the signal. The lesson is permanent: always compute at least one nonlinear association measure before concluding “no relationship.”
When Pearson is small but distance correlation is large, investigate the joint scatter plot: distance correlation is detecting a nonlinear pattern that Pearson cannot see. Consider a polynomial or spline transformation of one of the variables before fitting any linear model.
Extreme Value Theory — Patterns in the Tails
For most decisions that matter — bank capital requirements, dam heights, server-capacity planning, options-tail hedging — the right question is not “what is the mean” but “what does the worst-case look like.” Extreme Value Theory (EVT) is the branch of statistics built specifically for the tails of distributions, and it is used at every serious risk desk in the world.
EVT comes in two complementary flavours:
- Block maxima → Generalized Extreme Value (GEV) distribution. Partition the data into blocks (e.g., years, months), take the maximum from each block, and fit a GEV to those maxima. The Fisher–Tippett–Gnedenko theorem says that, regardless of the parent distribution (under mild conditions), the limiting distribution of normalised block maxima belongs to the GEV family.
- Peaks Over Threshold (POT) → Generalized Pareto Distribution (GPD). Pick a high threshold \(u\), keep only the exceedances \(X - u\) given \(X > u\), and fit a GPD. The Pickands–Balkema–de Haan theorem says these exceedances asymptotically follow a GPD.
The shape parameter \(\xi\) (often written \(k\) in software) determines the tail behaviour:
- \(\xi > 0\): heavy tail (Fréchet) — asset returns, insurance claims, internet traffic, earthquake magnitudes.
- \(\xi = 0\): light tail (Gumbel) — exponential decay, e.g., temperature extremes.
- \(\xi < 0\): bounded tail (Weibull) — phenomena with a hard cap, e.g., human lifespan.
scipy.stats extreme value distributions
- stats.genextreme(c) — GEV with shape c = -ξ (note SciPy’s sign convention is opposite the EVT literature).
- stats.genpareto(c) — Generalized Pareto with shape c = ξ.
- Both support .fit(), .pdf(), .cdf(), .ppf(), .rvs() like every other SciPy distribution.
The 100-year flood — using the fit to extrapolate
The point of EVT is not to describe what you saw; it is to extrapolate to what you have not yet seen. After fitting GEV, you can ask: what is the 99.9th percentile of the maximum? In risk-management language, a return level at frequency \(1/p\) is the value \(z_p\) such that \(P(\text{max} > z_p) = p\).
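A hedged sketch of the block-maxima workflow end to end: simulate a heavy-tailed parent, take block maxima, fit a GEV, and read off return levels. The block structure, parent distribution, and seed are invented for illustration; note SciPy's c = -ξ convention.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
# 50 "years" of 250 daily draws from a heavy-tailed parent; keep each block's maximum
daily = rng.standard_t(df=3, size=(50, 250))
block_maxima = daily.max(axis=1)

c, loc, scale = stats.genextreme.fit(block_maxima)   # SciPy shape c = -xi
xi = -c
print(f"xi = {xi:.3f}  (xi > 0 means a heavy Frechet tail)")

# Return level z_p with P(max > z_p) = p is the (1 - p) quantile of the fitted GEV
for p in (0.01, 0.001):
    z_p = stats.genextreme.ppf(1 - p, c, loc, scale)
    print(f"1-in-{int(1 / p):>4} return level: {z_p:.2f}")
```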
A risk manager would read those numbers as: “We expect to exceed this value once every X periods.” If you have annual blocks, the 1-in-1000 number is the 1000-year level — exactly the language used in dam engineering and bank stress tests.
ξ > 0 means a Fréchet (heavy) tail: the probability of extreme observations decays polynomially, not exponentially. Concretely, the 1-in-1000 event is much larger than a Gaussian-based estimate would suggest, and any risk metric (VaR, expected shortfall) computed under a Gaussian assumption will substantially under-state tail risk.
Anomaly Detection as Pattern Recognition
Once you can describe a distribution, you can recognise points that do not belong to it. Every quantitative fund runs an anomaly-detection layer over its incoming data: bad ticks, broken sensors, accidental decimal shifts, mislabelled corporate actions, and — most interestingly — real anomalies that mark a regime change. The same machinery is used at every credit-card fraud desk and every server-monitoring team.
Three families of detectors
- Distributional / Z-score — flag points whose standardised distance from the mean exceeds a threshold. Robust variants use the median and the median absolute deviation (MAD) instead of mean and SD, because the mean and SD themselves get corrupted by the outliers you are trying to detect.
- Mahalanobis distance — in higher dimensions, points are unusual when they are far from the centre of the joint distribution in covariance-weighted distance. Mahalanobis is the multivariate generalisation of the z-score.
- Model-free / Isolation Forest — an ensemble of random binary trees that isolates a point by random splits. Anomalies require very few splits to be isolated; normal points require many. Used heavily in modern fraud detection and at Two Sigma–style ML teams. A short sketch of the robust z-score and Isolation Forest follows this list.
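A minimal sketch of the first and third detectors on synthetic data: the MAD-based robust z-score and scikit-learn's IsolationForest. The planted outliers, thresholds, and contamination rate are illustrative choices.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(5)
x = rng.normal(size=1000)
x[:10] += 8                                   # plant ten gross outliers

# Robust z-score: median and MAD are not corrupted by the outliers being hunted
med = np.median(x)
mad = np.median(np.abs(x - med))
robust_z = 0.6745 * (x - med) / mad           # 0.6745 makes MAD comparable to the SD for Gaussian data
flag_robust = np.abs(robust_z) > 3.5

# Isolation Forest: anomalies are isolated by very few random splits
iso = IsolationForest(contamination=0.01, random_state=0).fit(x.reshape(-1, 1))
flag_iso = iso.predict(x.reshape(-1, 1)) == -1   # -1 marks anomalies

print(flag_robust.sum(), "robust z-score flags;", flag_iso.sum(), "Isolation Forest flags")
```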
Mahalanobis in three lines
For a multivariate Gaussian-ish cloud, Mahalanobis distance gives an exact statistical threshold via the chi-squared distribution: if \(X \sim N(\mu, \Sigma)\) in \(d\) dimensions, then \((X-\mu)^\top \Sigma^{-1} (X-\mu) \sim \chi^2_d\). Anything beyond the 99th percentile of \(\chi^2_d\) is unusual at the 1% level.
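A sketch of that computation on a synthetic 2-D Gaussian cloud; the core distance calculation really is three lines, and the 99th-percentile chi-squared cutoff supplies the threshold. The covariance matrix and sample size are made up.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
X = rng.multivariate_normal(mean=[0, 0], cov=[[1.0, 0.6], [0.6, 2.0]], size=2000)

# Mahalanobis squared distance in three lines
mu = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
d2 = np.einsum('ij,jk,ik->i', X - mu, cov_inv, X - mu)

# Under a d-dimensional Gaussian, d2 ~ chi^2_d; flag anything beyond the 99th percentile
threshold = stats.chi2.ppf(0.99, df=X.shape[1])
print("flagged:", int((d2 > threshold).sum()), "points beyond the 1% chi-squared threshold")
```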
Multiple Testing Correction
If you test one hypothesis at \(\alpha = 0.05\), you accept a 5% chance of being wrong when \(H_0\) is true. If you test a thousand hypotheses at \(\alpha = 0.05\), you accept on average 50 false positives when all the nulls are true. This is the data-snooping problem, and it is the silent killer of backtested trading strategies and published medical findings alike.
Two corrections worth memorising
Bonferroni — control of the family-wise error rate. With \(m\) tests, declare significance only when \(p_i < \alpha / m\). Strict and conservative: it controls the probability of even one false positive across the whole family. Used when a single false positive is unacceptable (e.g., flagging a critical bug, naming a single drug to take to a Phase 3 trial).
Benjamini–Hochberg (BH) — control of the false discovery rate. Order the p-values \(p_{(1)} \le p_{(2)} \le \dots \le p_{(m)}\). Find the largest \(k\) such that \(p_{(k)} \le (k/m)\,\alpha\). Reject the first \(k\) hypotheses. BH controls the expected proportion of false positives among rejections, not the probability of any false positive. It is much more powerful when you expect many true signals.
statsmodels.stats.multitest
multipletests(pvals, alpha=0.05, method='fdr_bh') returns a tuple (reject, p_corrected, alpha_sidak, alpha_bonf). Method names: 'bonferroni', 'holm', 'fdr_bh', 'fdr_by', 'sidak'. Always store the original p-values and the method choice — those numbers are the audit trail for any decision you make.
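A short sketch comparing naive, Bonferroni, and BH rejection counts on a synthetic mix of null and real signals, which is the three-way comparison discussed next; the split of 450 noise signals and 50 real ones, the effect size, and the seed are all invented.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(13)
# 450 pure-noise "signals" and 50 real ones with a modest effect, 100 observations each
null_p = [stats.ttest_1samp(rng.normal(0.0, 1, 100), 0).pvalue for _ in range(450)]
real_p = [stats.ttest_1samp(rng.normal(0.4, 1, 100), 0).pvalue for _ in range(50)]
pvals = np.array(null_p + real_p)

naive = (pvals < 0.05).sum()
bonf = multipletests(pvals, alpha=0.05, method='bonferroni')[0].sum()
bh = multipletests(pvals, alpha=0.05, method='fdr_bh')[0].sum()
print(f"naive: {naive}   Bonferroni: {bonf}   Benjamini-Hochberg: {bh}")
```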
The Bonferroni count is small and trustworthy; the BH count is larger and still trustworthy in the sense that most of the rejections are real. The naive count is what you would have reported before you knew about multiple-testing correction — and a substantial fraction of those are spurious. Renaissance-style discipline is to always report all three columns.
Suppose you screen 500 candidate signals and 35 come in with \(p < 0.05\). With Bonferroni you require \(p < 0.05/500 = 0.0001\) for any signal to count. Almost all of the 35 are above that cutoff and so are not significant after correction. The few that survive (often zero) are the only ones you should consider taking further — and even those need out-of-sample validation.
Chapter Wrap-up
Every method in this chapter answers a question that a hedge-fund analyst will be asked on the job:
- “What’s the shape of this variable?” → distributions, KDE, ECDF.
- “How sure are you of that number?” → bootstrap confidence intervals.
- “Is this difference real?” → hypothesis tests (parametric and non-parametric).
- “Are these two variables related — and if so, how?” → Pearson, Spearman, distance correlation.
- “How bad can it get?” → EVT, GEV, GPD, return levels.
- “Is this observation real or a glitch?” → robust z-score, Mahalanobis, Isolation Forest.
- “Did I search too hard?” → Bonferroni, Benjamini–Hochberg.
In Chapter 2 we move from single variable and pair of variables to many variables predicting one — the regression and variable-selection machinery on which most published quantitative-finance literature is built. In Chapter 3 we generalise to Bayesian inference, where the cost of ignoring prior information becomes explicit. In Chapter 4 we add time, autocorrelation, and volatility clustering — the dimensions that turn a static statistical model into a tradeable one. Chapters 5 and 6 then close the loop with clustering and a capstone on pattern recognition — the unsupervised and synthesising machinery that lets you discover structures without a label vector.
Pearson measures only the linear part of the relationship and can be zero even when X and Y are deterministically related; distance correlation is zero only when X and Y are statistically independent, so it detects arbitrary nonlinear dependence.
Use the closed-form CI when the statistic has a known distribution (mean of approximately Gaussian data, regression coefficient under classical assumptions). Use the bootstrap whenever the closed form is wrong (heavy tails, heteroskedasticity), unknown (Sharpe ratio, max drawdown, quantile), or for arbitrary functionals — it works on any statistic computable from the data.