The Backtest Is Not the Market

2026-05-05 · 13 min read

A trader who would never trust a physicist's fifty-year-old predictions about today's plasma will trust their own backtest's predictions about today's market. Why?

Because the backtest produced numbers, and numbers feel like science. They are not. They are interpretation, formalized into a shape that hides where the interpretation happened. This post is about that shape, and why options traders should stop mistaking it for the market.

The Hidden Interpretation

Backtesting does not remove judgment. It hides judgment inside choices that look mechanical but are not. Every backtest contains at least four interpretive decisions, and none of them are tested by the backtest itself:

Period selection. "We tested from January 2015 through December 2024." That window was chosen. It excludes 2008. It excludes the 1998 LTCM crisis. It includes one March 2020 episode and one 2022 rates regime break. The fingerprint of the test is set before any rule fires.
Rule definition. "Enter when IV rank is greater than 60 and DTE is between 30 and 45 and the put delta is below 0.20." Each of those thresholds is a guess. None of them is the idea you actually had.
Regime assumption. The implicit claim that the chosen window was meaningfully comparable to the present. That claim is the whole load-bearing column of the methodology, and it is never validated, only assumed.
Filter cascades. Every "and also exclude earnings weeks" or "skip Fed meeting days" is another decision about what the past supposedly contained.

The most important decision in any backtest is the one the backtest cannot test: why this past data should matter to the present. That decision is interpretive. It was always interpretive. The backtest just made it invisible.

Formalization Distorts the Idea Being Tested

Real trading thoughts are flexible. A real thought looks like this: "Vol is compressed and skew is oddly flat into a Fed week, dealers look short gamma, and the surface feels like it is mispricing event risk."

That sentence has at least five contextual variables, two of which (the dealer-positioning intuition, the "feel" of the surface) cannot be cleanly written into a rule engine. To run the backtest, you compress the thought into something like IV_rank < 30 AND skew_30d_25d < 0.05 AND days_to_FOMC < 7.

That is no longer the original idea. It is a caricature of the idea, optimized for tractability rather than fidelity. The interpretive nuance, the part that contained the actual edge, was deleted in the act of formalization. The thing you wanted to test no longer exists in the test.
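A minimal sketch of what that compression looks like once it compiles. The `Snapshot` type and its field names are hypothetical; only the three thresholds come from the rule quoted above. The point is what the type has no room for.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical daily snapshot: the only fields the rule engine can see."""
    iv_rank: float        # 0 to 100
    skew_30d_25d: float   # some 30-day, 25-delta skew measure (definition assumed)
    days_to_fomc: int


def entry_signal(s: Snapshot) -> bool:
    # The formalized rule from the text, with its thresholds unchanged.
    return s.iv_rank < 30 and s.skew_30d_25d < 0.05 and s.days_to_fomc < 7


# Parts of the original thought that have no field to live in.
DROPPED = ("dealer gamma positioning", "the feel of the surface")

today = Snapshot(iv_rank=22.0, skew_30d_25d=0.03, days_to_fomc=4)
signal = entry_signal(today)
```

The rule fires on `today`, and nothing in `DROPPED` had any say in it.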

And then the optimization loop arrives. Because rules now exist, parameters can be tweaked. Thresholds get tuned. Sharpe goes up. Win rate climbs. Conviction in the thesis increases, even though the thesis itself was abandoned three iterations ago. This is not a user error. It is the natural pull of the methodology.
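The pull can be illustrated with a toy search, sketched here under loud assumptions: the returns and the "signal" are independent random draws, so no edge exists by construction, and the Sharpe is a per-period ratio with a zero risk-free rate. All names are illustrative.

```python
import random
import statistics


def sharpe(returns):
    """Per-period Sharpe ratio (mean / stdev); zero risk-free rate assumed."""
    if len(returns) < 2:
        return 0.0
    sd = statistics.stdev(returns)
    return statistics.mean(returns) / sd if sd > 0 else 0.0


def tune_on_noise(n_days=500, n_thresholds=50, seed=7):
    rng = random.Random(seed)
    # Returns and signal are independent: any "edge" found is selection, not structure.
    rets = [rng.gauss(0.0, 0.01) for _ in range(n_days)]
    signal = [rng.random() for _ in range(n_days)]
    results = {}
    for k in range(n_thresholds):
        thr = k / n_thresholds
        # "Trade" only on days when the signal exceeds the threshold.
        traded = [r for r, s in zip(rets, signal) if s > thr]
        results[thr] = sharpe(traded)
    best_thr = max(results, key=results.get)
    return results, best_thr


results, best_thr = tune_on_noise()
best = results[best_thr]
median = statistics.median(results.values())
```

Comparing `best` with `median` shows how much of the headline figure comes from choosing the threshold after seeing the results. The tuned number is the one that ends up in the report.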

Numerical Authority Is the Dangerous Output

Before formalization the trader says: "I think this setup is favorable." After formalization the trader says: "This strategy has a 1.8 Sharpe and a 67.4 percent win rate, with a profit factor of 1.42 over the last nine years."

Same idea. Now it has the texture of science. The numbers are persuasive in a way the original thought never was. They imply rigor. They imply repeatability. They imply that the question "does this work?" has been answered in the affirmative by something more authoritative than gut.

It has not. The numbers describe the past performance of a caricature, conditioned on a chosen window and a chosen rule set, in a market that does not promise to repeat any of it. Precision is not the same as accuracy, and authoritative-looking precision is the most dangerous output a backtest can produce, because it converts an interpretive guess into a number that traders feel they have permission to size up against.
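One way to see the gap between precision and accuracy is to ask how wide the uncertainty around a quoted win rate is. The sketch below uses a Wilson score interval, and it rests on assumptions the text does not supply: a hypothetical count of 108 trades (about one a month for nine years) and independence between trades. It is also generous, because it treats the window and the rule set as given.

```python
import math


def wilson_interval(wins, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    p = wins / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Hypothetical trade count: roughly one trade a month for nine years.
n_trades = 108
wins = round(0.674 * n_trades)  # 73 wins, close to the quoted 67.4 percent

lo, hi = wilson_interval(wins, n_trades)
width = hi - lo
```

Under these assumptions the interval is well over ten percentage points wide. The decimal place in "67.4" describes the arithmetic, not the knowledge.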

Reflexivity Closes the Trap

George Soros's reflexivity is not a fringe idea. It is the central observation about markets that every quantitative methodology pretends not to know.

Any pattern, once observed, gets traded. Once traded, it changes. Once it changes, the backtest that found it stops working. This is not a defect of one specific test. It is a structural property of any market with sufficient participants and sufficient computing power. Edges discovered through historical search are self-erosion devices by design.

The implication is uncomfortable. The backtests that show the cleanest results are precisely the ones most likely to have already been mined to extinction by faster, better-capitalized firms. If a clean, scalable edge was discoverable in obvious public price data, assume it was discovered. If it remained tradable after discovery, it was traded. Survival in the historical record is not evidence of robustness. It is evidence that no one with capital has yet found a reason to compete it away.

Markets Do Not Repeat. Projections Do.

Two periods that look similar in a backtest's filter set can be embedded in completely different worlds. Consider two SPX moments, each filtered identically: IV rank near 35, the 30-delta put roughly 45 days out, realized vol percentile in the lower half. A backtest condition matches both.

Window A: term structure in healthy contango, skew curvature normal, dealer gamma positive, no event in the horizon, cross-asset correlations behaving. The market is genuinely calm.
Window B: same surface readings on the filter axes, but term structure has flattened, skew curvature is being suppressed by a short-vol carry trade running at extreme leverage, dealer positioning has gone gamma-negative, and the next FOMC is six trading days out. The market looks calm in projection. The market is not calm.

These two windows match on every dimension a backtest cares about. They are nothing alike. The "similarity" is in the projection, not in the market. Two shadows that line up do not imply two objects that line up.

This is the dimensional collapse problem. A real market state has thousands of dimensions: macroeconomic regime, monetary policy stance, liquidity microstructure, cross-asset positioning, options dealer hedging flows, retail concentration, geopolitical posture. A backtest condition has five or six. The collapse is the entire methodology, and the collapse is also where the entire failure mode lives. Anyone who lived through the February 2018 short-vol unwind remembers what happens when a "calm" filter readout sits on top of a fragile market state.
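Window A and Window B can be written down as a small sketch. The `MarketState` type, its string encodings, and the exact values are hypothetical; only the filter axes and the contrasts between the windows come from the description above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MarketState:
    # Axes the backtest filter reads.
    iv_rank: float
    put_delta: float
    dte: int
    rv_percentile: float
    # Axes the filter never reads (hypothetical encodings).
    term_structure: str
    dealer_gamma: str
    days_to_fomc: Optional[int]


def filter_projection(s: MarketState):
    """The shadow: only what the backtest condition can see."""
    return (round(s.iv_rank), s.put_delta, s.dte, s.rv_percentile < 50)


window_a = MarketState(35.0, 0.30, 45, 40.0, "contango", "positive", None)
window_b = MarketState(35.0, 0.30, 45, 40.0, "flat", "negative", 6)

same_projection = filter_projection(window_a) == filter_projection(window_b)
same_state = window_a == window_b
```

`same_projection` is true and `same_state` is false. The backtest only ever evaluates the first comparison.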

The Trader's Honest Workflow

Almost no one actually uses backtesting the way the textbooks describe it. The textbook claim is: scan a large hypothesis space, identify systematic edges, validate them with out-of-sample data, then deploy. The actual workflow is closer to this:

Form a thesis based on something the trader already wanted to believe.
Find the historical period that resembles it.
Run the backtest.
Accept the result if favorable; adjust the parameters and rerun if not.
Stop iterating once the numbers look respectable.
Treat the final result as confirmation of the original thesis.

Backtesting in the wild is not discovery. It is justification. It produces a numerical receipt for an idea the trader had already decided to trust. This is the use case backtesting is best suited to in practice, and also the least defensible.

You Do Not Need Backtesting to Learn From the Past

Looking at history is qualitative pattern recognition. It builds intuition, surfaces structural tendencies, exposes regime breaks. Backtesting is rigid rule encoding. It forces a fluid observation into a binary decision and then summarizes the result with an authoritative number.

Almost everything genuinely useful that people attribute to backtesting was learnable by looking at the data, not by running formalized simulations:

The volatility risk premium (implied vol biased above realized vol on average) is visible in any chart of 30-day IV minus 30-day realized vol. You do not need a strategy backtest to see it; you need the historical series itself.
The persistence of equity put skew is a feature of every equity surface ever quoted. Observable directly.
The 2008, 2020, and 2022 regime breaks each invalidated entire styles of strategy. The lesson is "options markets reprice catastrophically when the dealer hedging chain breaks." A Sharpe ratio attached to it adds nothing.
The 1987 crash, the 1998 LTCM unwind, the 2018 short-vol implosion, and the March 2020 dislocation are case studies, not statistical curiosities. One good post-mortem teaches more than ten thousand backtests over the same period.
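The first observation above needs only a subtraction, not a strategy. A minimal sketch, assuming close-to-close log returns, a 21-trading-day window as a stand-in for 30 calendar days, and 252 periods per year; the price series and the 20 percent implied vol are synthetic placeholders, not market data.

```python
import math


def realized_vol(closes, window=21, periods_per_year=252):
    """Annualized close-to-close realized vol over the trailing `window` log returns."""
    if len(closes) < window + 1:
        raise ValueError("need at least window + 1 closes")
    start = len(closes) - window
    rets = [math.log(closes[i] / closes[i - 1]) for i in range(start, len(closes))]
    mean = sum(rets) / window
    var = sum((r - mean) ** 2 for r in rets) / (window - 1)
    return math.sqrt(var * periods_per_year)


def vol_premium(implied_vol, closes, window=21):
    """Implied minus realized vol, both annualized and in the same units."""
    return implied_vol - realized_vol(closes, window)


# Synthetic closes that alternate by about 1 percent a day (placeholder data).
closes = [100.0 * math.exp(0.01 * (i % 2)) for i in range(40)]
rv = realized_vol(closes)
premium = vol_premium(0.20, closes)
```

Plotting `vol_premium` day by day over a real IV series and a real price series gives the chart described above. No entry rule, exit rule, or Sharpe ratio is involved.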

Where Backtesting Earns Its Keep, Honestly

To be clear: the argument is not that historical computation has no use. The argument is that its legitimate use is narrower than the industry pretends. Backtesting earns its keep in three places:

Falsification. Show that a strategy supposed to survive 2008, 2020, and 2022 does not. Use the methodology to break ideas, not validate them.
Stress testing. Once a thesis is already trusted on independent grounds, expose its failure mode. The output is "this is how this idea dies."
Forced clarification. Writing down "what does high IV mean" precisely enough to compile sometimes reveals the original idea was vague.

All three are about exposing weakness, not finding strength. They use backtesting as a microscope, not a discovery engine.
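The falsification use can be sketched as a check that returns only failures. The window boundaries, the drawdown-based loss test, and the P&L figures below are hypothetical choices for illustration; the text names the years, not the dates.

```python
from datetime import date

# Hypothetical stress-window boundaries; choosing them is itself an interpretive act.
STRESS_WINDOWS = {
    "2008": (date(2008, 9, 1), date(2008, 12, 31)),
    "2020": (date(2020, 2, 20), date(2020, 4, 30)),
    "2022": (date(2022, 1, 1), date(2022, 12, 31)),
}


def max_drawdown(pnl):
    """Largest peak-to-trough drop in cumulative P&L, as a non-negative number."""
    cum = peak = worst = 0.0
    for x in pnl:
        cum += x
        peak = max(peak, cum)
        worst = max(worst, peak - cum)
    return worst


def falsify(daily_pnl, max_tolerable_loss, windows=STRESS_WINDOWS):
    """Return the stress windows where drawdown exceeds the limit. Empty means 'not yet broken'."""
    failures = {}
    for name, (start, end) in windows.items():
        pnl = [v for d, v in sorted(daily_pnl.items()) if start <= d <= end]
        dd = max_drawdown(pnl)
        if dd > max_tolerable_loss:
            failures[name] = dd
    return failures


# Placeholder P&L, not real results.
daily_pnl = {
    date(2008, 10, 1): -5.0,
    date(2008, 10, 2): -7.0,
    date(2020, 3, 10): 1.0,
    date(2022, 6, 1): -1.0,
}
failures = falsify(daily_pnl, max_tolerable_loss=10.0)
```

The function has no success output by design. An empty result means the idea has not been broken by these windows, which is a weaker statement than saying it works.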

The Refined Position

Backtesting is not a source of truth. It is a conditional tool whose validity depends entirely on a regime-similarity assumption that is itself interpretive, not statistical. The math is correct. The application is the failure point. For most retail traders, most of the time, the application is so degraded that the practice produces worse outcomes than no analysis at all, because it manufactures conviction where none was warranted.

The strongest version of this argument is not "backtesting is useless." That version is easy to dismiss. The strongest version is this: backtesting fails not because the math is wrong but because formalizing a fluid, interpretive system into rigid rules destroys the very signal you were trying to measure.

History as Memory, Not as Authority

None of this means history is irrelevant. History is indispensable as memory, context, and case study. It is how traders build the intuition that lets them recognize regimes when they are happening, the structural tendencies that persist across decades, and the failure modes that recur in different costumes. The error begins when history is converted into a mechanical permission structure: a number that tells the trader they may believe.

Markets must be remembered, but they cannot be replayed. The distinction between historical memory and historical replay is the distinction between an experienced trader and a backtest report.

The Question That Comes Next

If formal backtesting is the wrong primary lens for options, what is the right one? The answer begins with a property of options that stocks do not have.

A stock's primary market object, unlike an option's, is a line. A stock, as a traded object, is mostly presented to the trader as a price through time, and that is why all stock analysis ends up, at root, as time-series analysis: moving averages, momentum, mean reversion, breakouts, regression to a trend. There is nothing else for the trader to read at the level the trader trades at. The line is the object.

Options are different. At any given tick, an options chain is not a number. It is a surface, with strikes on one axis and expirations on the other and implied volatility rising and falling across that grid like a topography. The surface has shape, slope, curvature, local distortions, and term-structure inflections that exist right now, in the present, available to be read directly without any regime-similarity assumption at all.

Most options backtests destroy that surface in the first step, by reducing it to a handful of scalars (IV rank, delta, DTE, premium) so the rule engine can operate on it. The surface vanishes into the projection. The trade is then made against the shadow.
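A toy version of that first step, as a sketch. Both surfaces are invented, keyed by hypothetical (moneyness, days to expiry) pairs, and the scalar kept is at-the-money IV at one expiry; real rule engines keep a few more scalars, but the mechanism is the same.

```python
# Hypothetical surfaces: {(moneyness, days_to_expiry): implied_vol}
surface_one = {
    (0.95, 30): 0.18, (1.00, 30): 0.15, (1.05, 30): 0.14,
    (0.95, 90): 0.19, (1.00, 90): 0.17, (1.05, 90): 0.16,
}
surface_two = {
    (0.95, 30): 0.16, (1.00, 30): 0.15, (1.05, 30): 0.15,
    (0.95, 90): 0.15, (1.00, 90): 0.15, (1.05, 90): 0.15,
}


def to_scalars(surface, dte=30):
    """What the rule engine keeps: ATM implied vol at a single expiry."""
    return {"atm_iv": surface[(1.00, dte)], "dte": dte}


def shape(surface):
    """What it discards: 30-day skew and the ATM term slope."""
    skew = surface[(0.95, 30)] - surface[(1.05, 30)]
    term_slope = surface[(1.00, 90)] - surface[(1.00, 30)]
    return round(skew, 4), round(term_slope, 4)


same_scalars = to_scalars(surface_one) == to_scalars(surface_two)
same_shape = shape(surface_one) == shape(surface_two)
```

Both surfaces reduce to the same scalars while their skew and term slope differ. The rule engine trades them as one state.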

That is the hinge of the next post. Options are not historical objects to be mined. They are present-tense surface instruments, and the right way to read them is the way physicists read fields, not the way bookkeepers read ledgers.

The backtest is not the market. It is a compressed memory of selected conditions, formalized into rules and summarized into numbers. The market is the living structure in front of you. For options traders, that structure is not a line. It is a surface. The next question is not what worked before. The next question is what the surface is saying now.