Chapter 4: Time Series Models
Chapter Introduction
Almost every quantitative trading strategy that has ever made money has, at its core, exploited one of three time-series phenomena: autocorrelation in returns (momentum or mean reversion), autocorrelation in squared returns (volatility clustering), or long-run equilibria between assets (cointegration / pairs trading). The same three phenomena show up far outside finance — in temperature records, electricity demand, web traffic, and patient vital signs — and the same statistical machinery handles all of them. This chapter is the dynamic complement to the first three: where Chapter 1 looked at the shape of a variable and Chapter 2 looked at predictors in a static cross-section, Chapter 4 introduces time itself as a primary feature, with all the dependence and non-stationarity that brings.
The chapter is organised around the three workflows a quant researcher repeats on every new dataset. First, can the series be analysed as is, or does it need differencing? The stationarity question, answered with ADF and KPSS tests. Second, what is the autocorrelation structure of the levels and the volatility? This is where the ACF, PACF, ARIMA, and GARCH machinery lives. Third, when does a pattern in the past generalise to the future, and how should the model adapt when regimes change? That is the regime-detection / Markov-switching / change-point material that ends the chapter.
A hedge fund’s research workflow looks like a pipeline: load → resample → stationarity-check → diagnose autocorrelation → fit ARIMA → check residuals → fit GARCH on residuals → backtest. We walk the same path. The data is whatever illustrates the method most clearly — simulated AR(1) and GARCH(1,1) data we generate inline, plus an example with synthetic temperature and traffic data — but every method here is part of the everyday workflow at Renaissance, AQR’s time-series team, and DE Shaw’s macro group.
A note on why time series cannot be skipped: a static regression of return on yesterday’s return, repeated naively, will mis-state every standard error in your output because the residuals are autocorrelated. A naive backtest that shuffles dates will look profitable for reasons that have nothing to do with the strategy. Time series is the discipline that prevents both errors.
Table of Contents
- Time Series in Pandas — Indexing, Resampling, Lags
- Stationarity — ADF and KPSS Tests
- Autocorrelation Patterns — ACF and PACF
- ARIMA — Modelling the Conditional Mean
- Volatility Clustering — ARCH and GARCH
- Cointegration and Pairs Trading
- Regime Detection and Change-Points
- Putting It Together — A Simple Mean-Reversion Backtest
Time Series in Pandas — Indexing, Resampling, Lags
Time series in pandas live in Series or DataFrame objects whose index is a DatetimeIndex. Once the index is a proper datetime, three families of operations become trivial: lag/shift, rolling windows, and resampling to a different frequency. Every quantitative time-series workflow begins by getting these three operations right.
- `pd.date_range(start, periods=n, freq='B')` — business-day calendar (other useful frequencies: `'D'`, `'H'`, `'5min'`, `'M'`, `'Q'`, `'A'`).
- `s.shift(k)` — lag by `k` periods (NaNs at the start).
- `s.diff(k)` — first/`k`-th difference: `s - s.shift(k)`.
- `s.pct_change()` — percent change vs. previous: `s / s.shift(1) - 1`.
- `s.rolling(window=20).mean()` / `.std()` / `.apply(fn)` — moving statistics.
- `s.resample('M').last()` / `.sum()` / `.ohlc()` — change frequency.
- `s.asfreq('B')` — reindex to a regular calendar (does NOT aggregate, only re-labels).
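A minimal sketch of the three operations on a simulated business-day price series (the dates, seed, and window lengths are arbitrary choices for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
idx = pd.date_range('2020-01-01', periods=500, freq='B')   # business-day calendar
price = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.01, 500))), index=idx)

ret = price.pct_change()              # simple returns: p_t / p_{t-1} - 1
lag1 = ret.shift(1)                   # yesterday's return, aligned to today's row
vol20 = ret.rolling(window=20).std()  # 20-day rolling volatility
monthly = price.resample('M').last()  # month-end prices

print(pd.DataFrame({'ret': ret, 'lag1': lag1, 'vol20': vol20}).tail())
```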
The same three operations — lag, rolling, resample — recur in every subsequent section. Internalise them and the rest of the chapter is incremental.
Stationarity — ADF and KPSS Tests
A series is (weakly) stationary if its mean, variance, and autocovariances do not depend on time. Almost every classical time-series technique — ARMA, regression with lagged predictors, vector autoregression — assumes stationarity. Most economic and financial series in levels are not stationary: prices wander, GDP trends upward, interest rates have multi-decade swings. First differences (returns, GDP growth, yield changes) are usually stationary, which is why returns rather than prices are modelled in equity research.
Two complementary tests, applied together, give a clean diagnosis:
- Augmented Dickey–Fuller (ADF) — null: the series has a unit root (is non-stationary). Reject (low p) → stationary.
- KPSS — null: the series is stationary. Reject (low p) → non-stationary.
The combination is read as a 2×2 table:
| | ADF rejects | ADF does not reject |
|---|---|---|
| KPSS rejects | conflicting → investigate further | non-stationary, consistent |
| KPSS does not reject | stationary, consistent | conflicting → investigate further |
Conflicts usually mean the series is mean-reverting around a deterministic trend, which a more careful detrending step resolves.
statsmodels.tsa.stattools
- `adfuller(series)` — returns `(test_stat, p_value, lags_used, n_obs, critical_values, icbest)`.
- `kpss(series, regression='c')` — `'c'` for stationarity around a constant, `'ct'` for stationarity around a trend.
- For both, the first two elements of the return are what you usually print.
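A minimal sketch of the two tests side by side, on a simulated random walk and a simulated AR(1) (the coefficient φ = 0.7 and sample size are arbitrary choices):

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller, kpss

rng = np.random.default_rng(0)
n = 1000
rw = np.cumsum(rng.normal(size=n))           # random walk: has a unit root
ar1 = np.zeros(n)
for t in range(1, n):
    ar1[t] = 0.7 * ar1[t - 1] + rng.normal() # stationary AR(1), phi = 0.7

for name, series in [("random walk", rw), ("AR(1)", ar1)]:
    adf_p = adfuller(series)[1]
    kpss_p = kpss(series, regression="c")[1]  # may warn when p is off the lookup table
    print(f"{name}: ADF p = {adf_p:.3f}, KPSS p = {kpss_p:.3f}")
```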
For the random walk, ADF does not reject (high p) and KPSS rejects (low p) — consistent non-stationarity. For the AR(1), ADF rejects and KPSS does not — consistent stationarity. This is the diagnosis you want before fitting anything.
When ADF fails to reject the unit-root null, the series is non-stationary in levels. The standard response is to take first differences (i.e., model returns, not prices) and re-run ADF on the differenced series, which is almost always stationary.
Autocorrelation Patterns — ACF and PACF
The autocorrelation function (ACF) at lag \(k\) is the correlation between \(y_t\) and \(y_{t-k}\). The partial autocorrelation function (PACF) at lag \(k\) is the correlation between \(y_t\) and \(y_{t-k}\) after removing the influence of all intermediate lags. The pair forms the classic Box–Jenkins diagnostic for identifying ARMA orders:
| Pattern | ACF behaviour | PACF behaviour | Suggested model |
|---|---|---|---|
| AR(\(p\)) | Decays gradually | Cuts off after \(p\) | AR(\(p\)) |
| MA(\(q\)) | Cuts off after \(q\) | Decays gradually | MA(\(q\)) |
| ARMA(\(p, q\)) | Decays gradually | Decays gradually | ARMA(\(p, q\)) |
Most financial return series have ACF and PACF that are flat — returns are nearly uncorrelated. But the ACF of squared or absolute returns is strongly positive and slowly decaying: this is the volatility clustering signature that motivates the ARCH/GARCH section later in this chapter.
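As a quick visual check, here is a minimal sketch that plots the ACF and PACF of a simulated AR(1) (the coefficient φ = 0.7 is a hypothetical choice):

```python
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

rng = np.random.default_rng(0)
y = np.zeros(1000)
for t in range(1, 1000):
    y[t] = 0.7 * y[t - 1] + rng.normal()    # AR(1), phi = 0.7

fig, axes = plt.subplots(1, 2, figsize=(10, 3))
plot_acf(y, lags=20, ax=axes[0])            # expect gradual geometric decay
plot_pacf(y, lags=20, ax=axes[1])           # expect a single spike at lag 1
plt.tight_layout()
plt.show()
```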
The PACF picks out the AR(1) structure cleanly — a single spike at lag 1, nothing afterward. This is exactly the pattern you would see in pure mean-reversion of, say, the spread between two cointegrated stocks (see the cointegration section later in this chapter).
The Ljung–Box test
For a global test of “is there any autocorrelation up to lag \(k\) in this series?”, use the Ljung–Box statistic. Under the null of no autocorrelation, the test statistic is distributed \(\chi^2_k\). A small p-value is evidence of autocorrelation that you should model.
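A minimal sketch contrasting white noise with an autocorrelated series (the lag cutoff of 10 is an arbitrary choice for illustration):

```python
import numpy as np
from statsmodels.stats.diagnostic import acorr_ljungbox

rng = np.random.default_rng(0)
noise = rng.normal(size=500)                 # white noise: expect a large p-value
ar1 = np.zeros(500)
for t in range(1, 500):
    ar1[t] = 0.7 * ar1[t - 1] + rng.normal() # autocorrelated: expect p near zero

print(acorr_ljungbox(noise, lags=[10]))      # columns: lb_stat, lb_pvalue
print(acorr_ljungbox(ar1, lags=[10]))
```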
ARIMA — Modelling the Conditional Mean
ARIMA(\(p, d, q\)) combines:
- AR(\(p\)) — autoregressive on the past \(p\) values: \(y_t = \phi_1 y_{t-1} + \dots + \phi_p y_{t-p} + \varepsilon_t\).
- I(\(d\)) — integrated of order \(d\): model the \(d\)-th difference. \(d = 0\) for stationary series, \(d = 1\) for series whose first differences are stationary, etc.
- MA(\(q\)) — moving average on past errors: \(y_t = \varepsilon_t + \theta_1 \varepsilon_{t-1} + \dots + \theta_q \varepsilon_{t-q}\).
The model captures linear, short-memory dynamics in the conditional mean. It does not model volatility (that’s GARCH, next section). For most equity return series, the best-fitting ARIMA is just (0, 0, 0) — returns are unpredictable from past returns — but for many other time series (electricity load, inventory levels, fixed-income spreads) ARIMA fits beautifully.
statsmodels.tsa.arima.model.ARIMA
`ARIMA(series, order=(p, d, q)).fit()` returns a fitted result. Methods: `.summary()`, `.forecast(steps=k)`, `.get_forecast(steps=k).conf_int()`, `.aic`, `.bic`. For automatic order selection, the `auto_arima` function in the `pmdarima` package wraps the search, but it is not in Pyodide by default — you can implement the grid yourself in a dozen lines.
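A minimal sketch using a simulated AR(2) whose true coefficients match the values quoted below (the sample size and seed are arbitrary):

```python
import numpy as np
from statsmodels.tsa.arima_process import ArmaProcess
from statsmodels.tsa.arima.model import ARIMA

np.random.seed(0)

# Simulate an AR(2) with phi1 = 0.6, phi2 = -0.2.
# ArmaProcess uses the lag-polynomial sign convention: ar = [1, -phi1, -phi2].
y = ArmaProcess(ar=[1, -0.6, 0.2], ma=[1]).generate_sample(nsample=1000)

fit = ARIMA(y, order=(2, 0, 0)).fit()
print(fit.params)                # const, ar.L1, ar.L2, sigma2
print(fit.forecast(steps=10))    # decays toward the unconditional mean
```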
The estimated \(\phi\) coefficients should be close to 0.6 and -0.2. The forecast decays toward the unconditional mean as the horizon grows — a defining property of stationary ARMA models.
Residual diagnostics
After fitting, you must check that the residuals look like white noise. If they don’t, the model has missed structure. The Ljung–Box test on residuals is the standard automated check; the ACF of residuals is the visual check.
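Continuing the ARIMA sketch above (this snippet reuses the hypothetical `fit` object from that block, so it is not standalone):

```python
from statsmodels.stats.diagnostic import acorr_ljungbox
from statsmodels.graphics.tsaplots import plot_acf

# `fit` is the ARIMA result from the previous sketch
print(acorr_ljungbox(fit.resid, lags=[10]))  # large p-value => residuals look like white noise
plot_acf(fit.resid, lags=20)                 # visual check: no spikes outside the bands
```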
Why add GARCH when the best ARIMA for returns is (0, 0, 0)? Because the squared (or absolute) returns are strongly autocorrelated. ARIMA models the conditional mean; GARCH models the conditional variance. The combination — ARIMA-GARCH — is the workhorse for return-and-risk modelling at every quant fund.
Volatility Clustering — ARCH and GARCH
Plot daily equity returns; you will see clusters of large moves followed by clusters of calm. This is volatility clustering, and it is the single most robust empirical regularity in financial data. Engle’s 1982 ARCH and Bollerslev’s 1986 GARCH model it directly.
GARCH(1,1): \[ r_t = \mu + \varepsilon_t, \qquad \varepsilon_t = \sigma_t z_t,\; z_t \sim N(0, 1), \qquad \sigma_t^2 = \omega + \alpha \varepsilon_{t-1}^2 + \beta \sigma_{t-1}^2. \]
Three intuitions:
- \(\omega\) is the long-run baseline variance. Unconditional variance is \(\omega / (1 - \alpha - \beta)\), provided \(\alpha + \beta < 1\).
- \(\alpha\) controls how strongly yesterday’s shock spikes today’s variance — the “news” channel.
- \(\beta\) controls volatility persistence — how slowly variance decays back to baseline. Typical equity estimates have \(\alpha \approx 0.07\), \(\beta \approx 0.90\): a small reaction to news, a very long memory.
Without GARCH, value-at-risk is wrong, options pricing is wrong, and dynamic hedging is wrong. With GARCH, you have a model of the conditional volatility you can update each day.
`statsmodels.tsa.arima.model.ARIMA` does not include GARCH; the standard library for GARCH in Python is `arch` (Kevin Sheppard’s package). It is not in Pyodide by default, so in the cell below we implement a basic GARCH(1,1) maximum-likelihood fit in pure NumPy / SciPy — 25 lines and you understand it forever.
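A minimal sketch along those lines — a Gaussian quasi-maximum-likelihood fit of a zero-mean GARCH(1,1). The simulation parameters (ω = 0.05, α = 0.07, β = 0.90) are assumptions chosen to match the typical equity estimates quoted above:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Simulate GARCH(1,1) returns; true parameters are assumptions for illustration
T, omega0, alpha0, beta0 = 2000, 0.05, 0.07, 0.90
r = np.empty(T)
sig2 = omega0 / (1 - alpha0 - beta0)           # start at the unconditional variance
for t in range(T):
    r[t] = np.sqrt(sig2) * rng.standard_normal()
    sig2 = omega0 + alpha0 * r[t] ** 2 + beta0 * sig2

def neg_loglik(params, r):
    """Negative Gaussian log-likelihood of a zero-mean GARCH(1,1)."""
    omega, alpha, beta = params
    if omega <= 0 or alpha < 0 or beta < 0 or alpha + beta >= 1:
        return np.inf                           # enforce positivity and stationarity
    sig2 = np.empty(len(r))
    sig2[0] = r.var()                           # initialise at the sample variance
    for t in range(1, len(r)):
        sig2[t] = omega + alpha * r[t - 1] ** 2 + beta * sig2[t - 1]
    return 0.5 * np.sum(np.log(2 * np.pi) + np.log(sig2) + r ** 2 / sig2)

res = minimize(neg_loglik, x0=(0.1, 0.1, 0.8), args=(r,), method='Nelder-Mead')
omega_h, alpha_h, beta_h = res.x
print(f"omega = {omega_h:.3f}, alpha = {alpha_h:.3f}, beta = {beta_h:.3f}")

# Reconstruct the fitted conditional-volatility path and plot it under the returns
sig2_hat = np.empty(T)
sig2_hat[0] = r.var()
for t in range(1, T):
    sig2_hat[t] = omega_h + alpha_h * r[t - 1] ** 2 + beta_h * sig2_hat[t - 1]

fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True, figsize=(9, 5))
ax1.plot(r, lw=0.5)
ax1.set_title("simulated returns")
ax2.plot(np.sqrt(sig2_hat))
ax2.set_title("fitted conditional volatility")
plt.tight_layout()
plt.show()
```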
The fitted \(\hat\alpha\) and \(\hat\beta\) should recover the simulation’s true values, and the bottom plot — the time path of \(\hat\sigma_t\) — visibly tracks the clustering you can see in the returns. GARCH is, at its core, a clean pattern-recognition model: it learns the recurring pattern that big moves tend to be followed by big moves.
Extensions worth knowing
- EGARCH (Nelson 1991) — asymmetric volatility response: negative returns spike vol more than positive returns (“leverage effect”).
- GJR-GARCH — a simpler asymmetric model using an indicator on past negative shocks.
- t-GARCH — Student-t innovations to capture fat-tailed shocks even after volatility scaling.
- Multivariate (DCC-)GARCH — joint dynamics of variances and correlations across many assets. The plumbing of every modern portfolio-vol model.
Cointegration and Pairs Trading
Two non-stationary series \(y_t\) and \(x_t\) are cointegrated if there is a linear combination \(y_t - \beta x_t\) that is stationary. Economically: the two series wander individually but maintain a long-run equilibrium relationship. Cointegration is the statistical foundation of pairs trading — the canonical statistical-arbitrage strategy popularised at Morgan Stanley’s Black Box group in the 1980s and refined at DE Shaw and Renaissance throughout the 1990s.
The Engle–Granger two-step test:
- Regress \(y_t\) on \(x_t\) to estimate \(\hat\beta\) and recover residuals \(\hat u_t = y_t - \hat\beta x_t\).
- Run an ADF test on \(\hat u_t\). If you reject the unit-root null in the residuals, the pair is cointegrated.
A simple mean-reverting trading rule: open a long-short position when the spread \(\hat u_t\) deviates from zero by more than \(z\) standard deviations, close when it reverts. The expected holding period is short and the strategy is dollar-neutral.
- `statsmodels.tsa.stattools.coint(y, x)` — runs the Engle–Granger test in one call, returns `(t_statistic, p_value, critical_values)`.
- For more than two series, use `statsmodels.tsa.vector_ar.vecm.coint_johansen` (the Johansen test).
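A minimal sketch on a simulated cointegrated pair (the hedge ratio 0.8 and the noise scale are assumptions):

```python
import numpy as np
from statsmodels.tsa.stattools import coint

rng = np.random.default_rng(1)
n = 1000
x = np.cumsum(rng.normal(size=n))              # non-stationary random walk
y = 0.8 * x + rng.normal(size=n)               # cointegrated: y - 0.8x is stationary

t_stat, p_value, crit = coint(y, x)            # Engle–Granger in one call
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")  # low p => cointegrated

beta = np.polyfit(x, y, 1)[0]                  # step 1: hedge ratio by OLS
spread = y - beta * x                          # step 2: the tradeable spread
z = (spread - spread.mean()) / spread.std()    # z-score for a +/- 2 sigma entry rule
```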
The cointegration test rejects, the spread is stationary, and a simple \(\pm 2\sigma\) entry rule would trade the swings. The serious work — at firms like Citadel’s “Surveyor” team — is in finding cointegrating baskets of 5–20 stocks where the relationship is more stable than any pairwise one, and in managing the inevitable cointegration breaks.
Regime Detection and Change-Points
A single ARIMA-GARCH fit assumes the parameters are constant over time. Markets disagree: there are quiet regimes and turbulent regimes, trending regimes and mean-reverting regimes. Regime detection is the pattern-recognition layer on top of time-series modelling that says “the model that worked last month may not be the right one for this month.”
Markov-switching models
A Markov-switching model lets the parameters of an ARMA / GARCH process depend on a latent discrete state \(s_t \in \{1, \dots, K\}\) that follows a Markov chain with transition matrix \(P\). The state is unobserved; you infer it from the data via the Hamilton filter, which is just a forward pass of Bayes’ rule.
statsmodels.tsa.regime_switching
- `MarkovRegression(y, k_regimes=2, switching_variance=True)` — Hamilton’s seminal model.
- `MarkovAutoregression(y, k_regimes=2, order=1)` — Markov-switching AR.
- The fitted object exposes `.smoothed_marginal_probabilities` — the smoothed posterior probability of being in each regime at each time. That probability is the regime indicator.
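A minimal sketch on simulated two-regime data (the regime volatilities 0.5 and 2.0 and the transition probabilities are assumptions; note that statsmodels may label the regimes in either order):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(7)

# Simulate a sticky two-state volatility regime
n = 600
state = np.zeros(n, dtype=int)
for t in range(1, n):
    stay = 0.97 if state[t - 1] == 0 else 0.95   # probability of staying put
    state[t] = state[t - 1] if rng.random() < stay else 1 - state[t - 1]
y = pd.Series(rng.normal(0.0, np.where(state == 0, 0.5, 2.0)))

res = sm.tsa.MarkovRegression(y, k_regimes=2, switching_variance=True).fit()
print(res.summary())

# Posterior probability of regime 1 at each date (check which label is high-vol)
p_regime1 = res.smoothed_marginal_probabilities[1]
```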
The fitted posterior tracks the true regime closely — without ever having been told the regime sequence. Once you have this probability time-series, downstream uses are abundant: regime-conditional risk limits, regime-conditional trading rules, regime-conditional rebalancing schedules. Bridgewater’s “All-Weather” and AQR’s “Risk-Parity” products both incorporate regime-aware allocation logic of this flavour.
Online change-point detection
For streaming data, online change-point detectors (e.g., BOCD, the algorithm of Adams & MacKay 2007) update the posterior over “time since the last change-point” at each new observation. The same machinery is in production at Google, Netflix, and DataDog for anomaly detection. The Bayesian backbone is the change-point posterior we worked out in Chapter 3.
How often should a GARCH model be re-fitted? The maximum-likelihood estimator of GARCH parameters is high-variance in short samples; a long window stabilises it. The dynamics (recent variance, recent shocks) are already embedded in the recursive variance equation, so you don’t need to re-fit to track volatility — only to update the parameters every quarter or year. Re-fitting daily creates noisy parameter estimates that hurt out-of-sample performance.
Putting It Together — A Simple Mean-Reversion Backtest
The chapter closes with a worked example that uses every block we have built. The data is a single mean-reverting series — interpret it as the spread between two cointegrated assets, or the deviation of a temperature reading from its seasonal mean, or any pattern in any domain.
The recipe, sketched in code after the list:
- Verify the spread is stationary (ADF).
- Compute a rolling z-score using a 60-period window.
- Trade rule: short when \(z > 2\), long when \(z < -2\), close when \(|z| < 0.5\). Use yesterday’s signal to trade today’s return, never today’s signal — that is look-ahead bias.
- Cross-validate by splitting the series into halves and reporting in-sample and out-of-sample Sharpe.
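A minimal sketch of the whole recipe, with a simulated AR(1) spread standing in for the cointegration residual (the AR coefficient 0.97, the noise scale, and the 252-period annualisation are assumptions; the window and thresholds are the ones listed above):

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(3)

# Simulate a mean-reverting spread: AR(1), phi = 0.97
n = 2000
s = np.zeros(n)
for t in range(1, n):
    s[t] = 0.97 * s[t - 1] + rng.normal(scale=0.1)
s = pd.Series(s)

# Step 1: stationarity check
print("ADF p-value:", adfuller(s)[1])   # expect rejection: stationary spread

# Step 2: rolling z-score, 60-period window
z = (s - s.rolling(60).mean()) / s.rolling(60).std()

# Step 3: position from thresholds, held until the exit band
pos = pd.Series(np.nan, index=s.index)
pos[z > 2] = -1.0                       # short a stretched-high spread
pos[z < -2] = 1.0                       # long a stretched-low spread
pos[z.abs() < 0.5] = 0.0                # flat near equilibrium
pos = pos.ffill().fillna(0.0)

# Step 4: yesterday's position earns today's spread change (no look-ahead)
pnl = pos.shift(1) * s.diff()

# Step 5: in-sample vs out-of-sample Sharpe
half = n // 2
for name, chunk in [("in-sample", pnl.iloc[:half]), ("out-of-sample", pnl.iloc[half:])]:
    print(name, "Sharpe:", chunk.mean() / chunk.std() * np.sqrt(252))
```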
Two honest readings of this output:
- The in-sample Sharpe is the upper bound — what you would have seen if you had tuned the rule on this very data. The out-of-sample Sharpe is what an honest researcher would have reported. The gap between the two is the cost of optimism every quantitative researcher has to budget for.
- A real implementation would add transaction costs (each entry/exit costs slippage + commission), funding costs (financing the long-short position), and stop-losses (to control losses when the cointegration relationship breaks). Each of these layers eats Sharpe; what survives is your real edge.
The point is structural: this entire backtest used only the tools developed in the chapter — stationarity testing, rolling statistics, lag-aware positioning. There is no proprietary magic. The discipline is the magic.
Chapter Wrap-up
A working quantitative time-series toolkit fits in seven habits:
- Get the index right — `DatetimeIndex` and `freq` first, everything else later.
- Test for stationarity — ADF + KPSS, together, on every new series.
- Look at ACF and PACF of levels and of squared returns. Returns may be uncorrelated; volatility almost never is.
- Model the conditional mean with ARIMA. Check residuals with Ljung–Box.
- Model the conditional variance with GARCH (and an asymmetric variant if leverage matters).
- Test for cointegration between any two non-stationary series you suspect are related. The stationary spread is the trade.
- Watch for regime change — Markov-switching or online change-point detectors keep your model from being lulled into using yesterday’s parameters tomorrow.
Combined with the distribution, inference, regression, and Bayesian machinery of the previous three chapters, you now have the statistical kit that powers every serious systematic strategy. None of it is exotic. All of it is permanent. The edge in the modern industry is not in the methods — they are public — but in the discipline with which each step is executed, the honesty with which out-of-sample performance is reported, and the patience to keep iterating after the first ten candidate signals fail.
The two most commonly skipped steps are stationarity testing (skipped → fitting ARMA to non-stationary data gives spurious regressions and inflated R²) and look-ahead-bias control (skipped → backtests trade on today’s signal applied to today’s return, which produces fictitious P&L). Both are silent killers — your code runs, your numbers look fine, your strategy is fake.