WOLFX Research · reports/gauntlet-scoreboard.md

The WOLFX Backtest Gauntlet — Live Scoreboard

WOLFX Research · Updated 2026-04-24

Every strategy WOLFX trades has cleared a walk-forward 70/15/15 backtest with hard gates. Every strategy WOLFX considers but rejects is published here — the rejections are how you know the filter is real.

Running score: 4 PASS / 26 rounds (15 %). One documented near-miss (Round 13). FIVE LIVE strategies gauntleted post-deployment, ALL FIVE rejected (Rounds 21, 22, 23, 24, 25). Live-strategy validation campaign COMPLETE — zero gauntlet-validated live alpha.

The passes

Round	Strategy	Test Sharpe	Shipped
7	Cross-Asset Futures Trend (ES/NQ/RTY/YM/6E/6J, 12-1 skip-month)	0.895	V167 · 2026-04-24 · whitepaper
8	VIX Contango Carry (short VXX / long VXZ, regime-gated)	1.41	V169 · 2026-04-24 · whitepaper
14	Overnight Drift Reversal (SPY MOC→MOO, 5d-intraday filter)	1.229	V174 · 2026-04-25 · whitepaper
19	Intraday VWAP Breakout (top-50 SP500, 50bp threshold, long+short)	1.83 test / 0.47 full	V182 · 2026-04-28 · whitepaper TBD · GUARDED PASS: slice trajectory train 0.40 → val -0.60 → test 1.83 is regime-flip-shaped, not steady-edge-shaped. PF clears the gate by 0.01 (1.21 vs 1.20). Canary at 0.5% NAV (not the spec'd 5%) + rolling-Sharpe kill-switch when 20-trade rolling Sharpe < 0 for 30 trades.

(Round 15 intentionally skipped — FOMC Pre-Announcement Drift only fires 8 times/year, can't clear the v5 ≥50-trade gate. Proposed as a sizing multiplier on Round 14, not standalone.)
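
The rolling-Sharpe kill-switch attached to Round 19 can be sketched as follows. This is a minimal illustration, not the production rule: it assumes the switch trips once the 20-trade rolling Sharpe has been negative for 30 consecutive trades, and that Sharpe is measured per trade as mean over sample standard deviation. The function names `rolling_sharpe` and `kill_switch_index` are hypothetical.

```python
from statistics import mean, stdev


def rolling_sharpe(returns, window=20):
    """Per-trade Sharpe over each trailing window; None until the window fills."""
    out = []
    for i in range(len(returns)):
        if i + 1 < window:
            out.append(None)
            continue
        chunk = returns[i + 1 - window : i + 1]
        m, sd = mean(chunk), stdev(chunk)
        if sd == 0:
            # Degenerate window: fall back to the sign of the mean.
            out.append(float("-inf") if m < 0 else float("inf") if m > 0 else 0.0)
        else:
            out.append(m / sd)
    return out


def kill_switch_index(returns, window=20, patience=30):
    """Index of the trade at which the switch trips, or None if it never does.

    Assumed reading of the rule: trip after `patience` consecutive trades
    on which the `window`-trade rolling Sharpe is below zero.
    """
    streak = 0
    for i, value in enumerate(rolling_sharpe(returns, window)):
        if value is not None and value < 0:
            streak += 1
            if streak >= patience:
                return i
        else:
            streak = 0
    return None
```

Under this reading the earliest possible trip is on the 49th trade (index 48), because the first 19 trades only fill the window.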

The passing strategies run in a 30-day paper-shadow canary via the V170 scheduler. The flag flips to live execution only after rolling Sharpe ≥ 0.5 with no monthly drawdown > 3 %.
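
A rough sketch of that promotion check is below. Several details are assumptions rather than WOLFX's actual definitions: Sharpe is annualised with 252 periods over the whole canary window, and "monthly drawdown" is read as the largest peak-to-trough loss inside any single calendar month. `ready_for_live` and its helpers are hypothetical names.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, stdev


def annualised_sharpe(daily_returns, periods_per_year=252):
    """Mean over sample stdev of daily returns, scaled by sqrt(periods)."""
    sd = stdev(daily_returns)
    return mean(daily_returns) / sd * sqrt(periods_per_year) if sd > 0 else 0.0


def worst_monthly_drawdown(dated_returns):
    """Largest peak-to-trough loss measured inside any one calendar month.

    `dated_returns` is a list of (datetime.date, daily_return) pairs.
    """
    by_month = defaultdict(list)
    for day, ret in dated_returns:
        by_month[(day.year, day.month)].append(ret)
    worst = 0.0
    for rets in by_month.values():
        equity = peak = 1.0  # each month starts from a fresh peak (assumption)
        for ret in rets:
            equity *= 1.0 + ret
            peak = max(peak, equity)
            worst = max(worst, 1.0 - equity / peak)
    return worst


def ready_for_live(dated_returns, min_sharpe=0.5, max_monthly_dd=0.03):
    """True only if both canary conditions hold."""
    rets = [ret for _, ret in dated_returns]
    return (
        annualised_sharpe(rets) >= min_sharpe
        and worst_monthly_drawdown(dated_returns) <= max_monthly_dd
    )
```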

The rejections

Round	Strategy	Test Sharpe	Verdict	Finding
1	Overnight Gap Continuation	-8.11	NO-GO	Signal evaporated when realistic fill costs applied. Infrastructure gap on premarket data.
2	Crypto Funding Arbitrage v1	3.15 IS / -5.68 OOS	NO-GO	Textbook overfit. Great in-sample, dead out-of-sample.
3	Crypto Funding v2 (long-only, z < -3)	7.22 (spurious)	NO-GO	Forensic finding: the claimed 71.4 % WR was implicitly bundled with a "price at 20-day low" filter. The filter, not the funding signal, was doing the work.
4	Cointegration Pairs Trading	-1.51	NO-GO	Mega-cap dispersion in 2025-2026 broke the cointegration assumptions that made this work in 2015.
5	Momentum + VIX Long/Short	-1.23	NO-GO	Benign-VIX regimes produce short-squeeze spikes that shred the short leg. MaxDD 25 %.
6	Momentum long-only (decomposed)	+0.88	NO-GO-drawdown	Sharpe ok, but MaxDD 12.85 % eats the risk budget.
—	PEAD — Post-Earnings Drift	-2.77	HARD NO-GO	THE SIGNAL HAS INVERTED. A positive earnings surprise now predicts -0.61 % forward return. Classical decades-old anomaly is now anti-signal.
9	G10 FX Trend+Carry	0.164	NO-GO	Strategy lost money over 9 years. AUD/USD + USD/JPY profitable legs couldn't offset EUR/USD, GBP/USD, CHF, etc.
10	Commodity Basis Carry	-1.58	NO-GO	Proxy rejected, not the underlying premium. Inverted-momentum-as-basis shorted the 2024-2026 gold/palladium rally. Real term-structure data required.
11	Treasury Curve Carry 2s10s (Yahoo futures ratio)	0.54 test / -1.08 train	NO-GO	Test slice looked fine (3 of 4 gates pass) but strategy lost 18 % over the full 10-year window. Only worked post-QE. A regime bet, not a carry premium.
12	Treasury Curve Carry 2s10s (FRED daily yields, IEF/SHY)	-3.34	NO-GO	Retested Round 11 with real yield data to falsify "maybe the proxy was the problem." Result: real yields were worse than the proxy. Full-window Sharpe -0.99, final NAV down 36 %. One trade alone (Oct 2023 short SHY at 3.78× weight) lost $44.8K when 2Y yields fell into the rate-cut cycle. The underlying signal — not the proxy — is wrong for the 2022-2026 hiking/cutting regime.
13	DXY Regime Switch (long-only, Variant B)	1.20 test / 0.61 full	NEAR-MISS / NO-GO	Sharpe 1.20, PF 2.95, MaxDD -1.47 %, full-window Sharpe positive — every substantive gate clears with margin. Fails only on trade count (2 vs gate 20). The 20-trade gate is miscalibrated for a regime classifier that fires ~3 times per test slice by design. Honest verdict under strict rules: NO-GO. Under the same gate-calibration argument that Round 7 (Trend) accepted, this would be a PASS — that's a calibration decision, not a statistical one. Flagged for re-evaluation if the trade-count gate is recalibrated per signal class.
16	RRP-Driven Treasury Carry Reversal (FRED RRPONTSYD → SHY)	-0.89 test / 0.24 full	NO-GO	Test slice fails 3 of 5 gates (Sharpe -0.89, PF 0.88, only 29 trades). Walk-forward: train -0.03, val +1.81, test -0.89 — textbook in-sample-fit / out-of-sample-collapse. The validation slice caught the late-2023 RRP drain wave during the Fed pivot; the test slice is in a post-RRP-trough regime where the facility sits near zero and meaningful drains stop happening. Two side findings: (1) the v5 proposal had a units bug — RRPONTSYD is in billions, not millions; harness corrected; (2) the premium, if it existed, has likely been arbitraged away in the four years since Copeland-Duffie-Yang published.
17	Speculator Crowding Reversal (CFTC COT, 12 commodities cross-sectional)	0.11 test / 0.13 full	NO-GO	Test PF 1.03 (gate 1.2). 444 legs — huge sample, so the near-zero Sharpe is statistically firm, not noise. Train slice ran -35% MaxDD during 2021-22 commodity supercycle when speculators stayed crowded long AND prices kept rising — exactly the regime Boons-Prado warn breaks the reversion. Recommendation was retest at 4-week hold (paper's documented 4-8w reversion window).
17b	Speculator Crowding Reversal — 4w / 6w / 8w hold retest	-0.37 / -0.79 / -0.39 test	PERMANENT NO-GO	Tested at every horizon the paper documents. ALL THREE produce negative test-slice Sharpe (PF 0.93, 0.86, 0.93 — all below 1.0). Sample is 408-432 legs each — not noise. Full-window Sharpes 0.31-0.73 are positive (train+val carried edge), but the held-out test slice (~late-2024 → Apr 2026) flipped sign. Strategy is genuinely dead at retail-data scale. Two cleanly-separated explanations: (a) premium decay since paper's 1986-2018 sample (RFS publication + COT-factor ETFs likely arbitraged); (b) Yahoo continuous-futures roll noise. Action identical regardless: permanent shelf. Five horizons tested (1w/2w/4w/6w/8w) — strategy gets removed from future v7 alpha pipelines.
18	HY-OAS-Gated Put Credit Spreads (Israelov-Klein 2024, SPY/QQQ/IWM weekly)	-0.14 test (BS-on-RV) / +0.63 test (with +20% IV-VRP uplift)	NO-GO formal · R18b pending data	Formal NO-GO at Black-Scholes-on-realized-vol pricing (PF 0.94, full-window Sharpe -0.39). But agent's sensitivity test under defensible IV-over-RV pricing (Bakshi-Kapadia 2003, Israelov 2017) flips ALL 5 gates to pass: Sharpe 0.63, PF 1.30, full-window 0.31. The "failure" is a modeling artifact — BS-on-RV erases the very variance-risk premium the strategy harvests by construction. Reversible (distinct from R17b permanent shelf). Two data blockers: FRED API key (free, lifts HY OAS cap from 3yr to 25yr), historical options chains (Polygon $199/mo or OptionAlpha $99 one-time). Promote to R18b once procured.
20	GP/A Quality Long-Short (Novy-Marx 2013, sector-neutral SP500 quintiles)	-2.12 test / -0.42 full-window	NO-GO	The factor has inverted in 2024-2026. Train +0.12 → val -0.57 → test -2.12 is monotonic DECAY — the literal opposite of the v7-hypothesised monotonic improvement. New v7 trajectory gate (gate #8, added because of R19 regime-flip lesson) caught this — train alone looked benign, but val and test progression was the textbook overfit/regime-decay shape. Mechanism: top-quintile names (mag-7 quality leaders) kept rallying through 2024-25 while the SHORT LEG (bottom quintile distressed industrials / capital-heavy energy) bounced harder in the 2024-26 reflation. Within-sector neutralisation didn't save it. Joins PEAD and Momentum+VIX in the "classical equity factor that has inverted" pile. Permanent shelf as a single-factor signal; agent recommends regime-filtered variant for v8 (only fade junk when long-junk underperforming long-quality 6mo). Coverage bias: 314/503 SP500 had clean fundamentals (financials excluded — banks don't report COGS).
21	sniper_mean_reversion (LIVE WORKHORSE validation)	-0.34 test / -0.27 full-window	NO-GO — production is regime luck	Walk-forward over 2,973 trades (vs production's 48) showed the live engine's main alpha source is statistically noise + regime survivorship. Train Sharpe -0.41, Val +0.53, Test -0.34 — classic lucky-middle-slice. Production +$3,677 from PF 3.05 reflects an Oct-2023 → Apr-2024 mean-reversion-friendly regime; the 10-year backtest finds 49% WR, PF 0.91, -30% NAV. The 1.5 ATR stop = 1.5 ATR target with 49% WR is structurally a money-loser even before costs. Concrete actions: tighten conf gate 0.50→0.75 (extreme-only path), add per-strategy kill-switch on rolling 30-trade PF < 1.5, change R:R from 1:1 to 1:2, prune universe per-ticker. V186 ships the kill-switch + tightened conf gate.
22	news_alpha (LIVE strategy validation)	-1.47 test / -0.72 full-window	NO-GO: 6-trade production sample is noise	Walk-forward over 1,123 events (earnings as news-proxy) shows the strategy carries no edge. Random-direction arm: 40.3% WR, PF 0.87, -18% NAV. Momentum-direction arm: 38.6% WR, PF 0.80, -28% NAV. Production "PF 99.90" / 100% WR was 6 coin flips landing heads in a row (p ≈ 1.6%). The 3% target / 2% stop asymmetry needs >40% WR, plus a sentiment-oracle margin above the 15bps cost; neither is testable from 6 live trades. The exit mechanics carry no inherent edge around news events. Pattern-matches the R21 sniper_mean_reversion finding: live strategies with tiny samples reflect regime + selection bias, not structural edge. V189 adds news_alpha kill-switch.
23	wolf_quantum_convergence (LIVE strategy validation)	+0.36 test / +0.38 full-window	NO-GO: regime-stable but edge too small after costs	Walk-forward over 2,471 trades shows the convergence mechanic produces a stable, weakly positive distribution: train +0.45, val +0.03, test +0.36, all same-sign. Different rejection class from R21/R22 (those were lucky-middle-slice patterns). PF 1.07 across the full window: real but tiny structural drift. A 46.5% win rate with a 2:1 R:R bracket would print PF ~1.7 if winners reached the full target, but the 5% target rarely hits within 5 days while losses cap at the -2.5% stop. Mean trade +$8 on $10K notional after 10bps round-trip, so statistical noise dominates. Critical caveat: 5 of 11 source weights (news/social/congress/insider/GDELT cascade) are non-replayable. If live edge exists, it lives entirely in those 5, and we cannot validate it. Production "PF 99.9" / 4 trades is a 0.465^4 = 4.7% probability outcome under the actual distribution. V190 ships kill-switch + tightens V143 quantum bypass conf gate from 0.80 → 0.95.
24	forex_trend (LIVE OANDA validation)	+0.33 test / -0.39 full-window	NO-GO: anti-edge	The worst rejection yet. Walk-forward over 1,754 trades, 21 instruments, 2018-2026: full-window total return -99.3%, Sharpe -0.39, 8 of 9 calendar periods negative. 2018 -54.6%, 2019 -60.5%, 2020 -13.5%, 2021 -55.4%, 2022 -69.5%, 2023 -44.3%, 2024 -52.4%, 2025 +115.2% (the lone winner, a single-regime tail event on USD weakness + gold blow-off), 2026 YTD -44.9%. Per-pair: only 6 of 21 instruments PF > 1.0; JPY-crosses + equity indices are toxic (JP225 PF 0.64, GBP_JPY 0.60). 42% WR at 1.6 R:R sits only just above the ~38.5% breakeven win rate; transaction costs push it under. ADX>20 floor admits whipsaw regimes. Recommendation: disable forex_trend immediately. V191 ships kill-switch (WOLFX_FOREX_TREND_ENABLED). The 2025 P&L that justified deployment was tail-luck on regime persistence.
25	forex_reversion (LIVE OANDA validation, 5/5 strategies tested)	+0.87 test / +0.32 full-window	NO-GO — short-vol blowup	The closest call of all 26 rounds. Walk-forward over 955 trades 2018-2026: passed 6 of 7 v7 gates including Sharpe 0.87, PF 1.42, +0.32 full-window. The single fail: Test MaxDD 37.4% (gate 15%), full-window MaxDD 76%. Cause: 2020 COVID year produced -61.3% over 169 trades — RSI<30 + lower-band signals kept firing into the vertical drop. Strategy is structurally short-volatility — sells tails, tails eventually get paid. Different rejection class: real per-pair edge does exist (EUR_GBP PF 3.55, EUR_JPY 1.72, XAG_USD 1.67, GBP_JPY 1.59). Mean reversion works on JPY crosses and silver, fails on gold/SPX/NZD/USD_CHF. Cannot be deployed without volatility regime filters + dynamic sizing — the live system has neither. MAJOR IMPLICATION: ALL FIVE live strategies gauntleted, ALL FIVE rejected. Zero gauntlet-validated live alpha. Recommendation per agent: halt live execution; convert engine to research/shadow harness until a strategy passes the gauntlet from scratch. The four shadow strategies (Trend, VIX Carry, Overnight Drift, VWAP Breakout) remain the only credible pipeline.
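
Several of the findings in Rounds 21 through 24 rest on short pieces of arithmetic that are easy to re-check. The helpers below are an illustrative sketch, not part of the WOLFX harness. They assume fixed target and stop sizes, independent trades, and simple year-over-year compounding, and all function names are hypothetical.

```python
def breakeven_win_rate(reward_to_risk):
    """Win rate at which expectancy is zero before costs."""
    return 1.0 / (1.0 + reward_to_risk)


def expectancy_in_r(win_rate, reward_to_risk):
    """Average profit per trade in units of risk (R), before costs."""
    return win_rate * reward_to_risk - (1.0 - win_rate)


def streak_probability(win_rate, n):
    """Chance of n straight wins if trades are independent."""
    return win_rate ** n


def compound(period_returns):
    """Total return from chaining per-period returns."""
    total = 1.0
    for ret in period_returns:
        total *= 1.0 + ret
    return total - 1.0


# Round 21: 1:1 bracket at 49% WR loses 0.02R per trade before costs.
r21_expectancy = expectancy_in_r(0.49, 1.0)

# Round 22: 3% target / 2% stop needs a 40% win rate to break even.
r22_breakeven = breakeven_win_rate(3.0 / 2.0)

# Rounds 22 and 23: probability of the tiny all-win production samples.
six_heads = streak_probability(0.5, 6)      # about 1.6%
four_wins = streak_probability(0.465, 4)    # about 4.7%

# Round 24: the yearly figures from the table chain to roughly -99.3%.
r24_total = compound(
    [-0.546, -0.605, -0.135, -0.554, -0.695, -0.443, -0.524, 1.152, -0.449]
)
```

The same `breakeven_win_rate` helper gives about 38.5% for a 1.6 R:R bracket, which is the comparison point for Round 24's 42% win rate.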

Methodology

Every round runs through the same walk-forward harness:
70 / 15 / 15 split: training, validation, test. Parameters are fit on training only.
Walk-forward monotonicity check: train ≤ validation ≤ test Sharpe. Monotonic improvement OOS is the single strongest credibility signal.
Full-window sanity: if a strategy needs a specific regime to print positive numbers, the test slice alone doesn't save it. Round 11 was rejected on this exact point.
Gate calibration: the Sharpe gate ranges from 0.30 to 0.50 depending on asset class. Trend in equities needs a higher Sharpe gate than VIX carry.
Cost model: slippage, commission, and execution realism baked in. Round 1 died specifically because we made the fill assumptions honest.
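
The split and the monotonicity check can be sketched in a few lines. This is a simplified illustration under stated assumptions: the series is already in chronological order, Sharpe is annualised with 252 periods, and the slices are cut by position rather than by date. The function names are hypothetical and do not come from the WOLFX harness.

```python
from math import sqrt
from statistics import mean, stdev


def split_70_15_15(series):
    """Chronological train / validation / test slices (no shuffling)."""
    n = len(series)
    train_end = n * 70 // 100
    val_end = n * 85 // 100
    return series[:train_end], series[train_end:val_end], series[val_end:]


def sharpe(returns, periods_per_year=252):
    """Annualised Sharpe with a zero risk-free rate (assumption)."""
    sd = stdev(returns)
    return mean(returns) / sd * sqrt(periods_per_year) if sd > 0 else 0.0


def is_monotonic(train_sharpe, val_sharpe, test_sharpe):
    """True when Sharpe does not fall from train to validation to test."""
    return train_sharpe <= val_sharpe <= test_sharpe


def slice_sharpes(returns):
    """Sharpe of each slice, in train / validation / test order."""
    return tuple(sharpe(part) for part in split_70_15_15(returns))
```

Fed Round 20's reported trajectory (train +0.12, val -0.57, test -2.12), `is_monotonic` returns False, which is the decay shape that round was rejected on.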

Meta-insights the swarm has learned

The 2015-2021 equity-factor playbook is upside-down in 2026. Six out of the first eight rejections were classical equity factors. PEAD inverting was the clincher. Mega-cap concentration, passive flows, and retail options gamma have rewired the single-name tape.
The first two passes (Rounds 7 and 8) are macro / commodity / vol with multi-decade academic lineage. The alpha research pivot to this territory produced two consecutive passes after six consecutive equity-factor rejections. That is not coincidence; that is structure.
Data proxies die in out-of-sample testing. Round 10 and Round 11 both had to use Yahoo-price proxies because the true signal requires curve / yield data. Both failed. The lesson: before we backtest another carry signal, procure the data.

Round 14 — first v5 round, first PASS under v5 constraints

Variant B (mean-reversion-filtered overnight long): test Sharpe 1.229, MaxDD -8.36 %, PF 1.25, 163 trades, full-window Sharpe 0.447. All five v5 gates cleared. Train/Val/Test Sharpe 0.51 / 1.35 / 1.23 — clean OOS pattern, no overfit signature. Variant A (always-on overnight long, no filter) NO-GO at PF 1.09 — confirms the filter is doing real work.
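
For readers who want to re-derive the gate metrics, here is a minimal sketch of profit factor, maximum drawdown, and a five-gate check. The thresholds are assumptions pieced together from figures quoted elsewhere in this scoreboard (PF 1.20, MaxDD 15 %, 50 trades, the top of the 0.30 to 0.50 Sharpe range, and a positive full-window Sharpe); the actual v5 gate definitions are not listed here and may differ. `Gates` and `failed_gates` are hypothetical names.

```python
from dataclasses import dataclass


def profit_factor(trade_pnls):
    """Gross profit divided by gross loss; infinite if nothing lost."""
    gains = sum(p for p in trade_pnls if p > 0)
    losses = -sum(p for p in trade_pnls if p < 0)
    return gains / losses if losses > 0 else float("inf")


def max_drawdown(equity_curve):
    """Largest peak-to-trough decline as a positive fraction of the peak."""
    peak = equity_curve[0]
    worst = 0.0
    for value in equity_curve:
        peak = max(peak, value)
        worst = max(worst, 1.0 - value / peak)
    return worst


@dataclass(frozen=True)
class Gates:
    # Every threshold here is an assumption, not a published v5 value.
    min_test_sharpe: float = 0.50
    min_profit_factor: float = 1.20
    max_drawdown: float = 0.15
    min_trades: int = 50
    min_full_window_sharpe: float = 0.0


def failed_gates(test_sharpe, pf, max_dd, trades, full_window_sharpe, gates=Gates()):
    """Names of the gates a result fails; an empty list means PASS."""
    checks = {
        "test_sharpe": test_sharpe >= gates.min_test_sharpe,
        "profit_factor": pf >= gates.min_profit_factor,
        "max_drawdown": max_dd <= gates.max_drawdown,
        "trades": trades >= gates.min_trades,
        "full_window_sharpe": full_window_sharpe > gates.min_full_window_sharpe,
    }
    return [name for name, ok in checks.items() if not ok]
```

With Variant B's reported figures, `failed_gates(1.229, 1.25, 0.0836, 163, 0.447)` returns an empty list. Swapping in Variant A's PF of 1.09 while holding the other inputs fixed (a hypothetical, since only its PF is reported) fails the profit-factor gate alone.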

The strategy fills the high-frequency gap that v4 was structurally unable to test. Diversifies cleanly from Trend (monthly futures momentum) and VIX Carry (monthly vol selling).

What's next

Alpha v4 final tally: 0 PASS / 4 tested — commodity basis (R10), treasury curve with and without proxy (R11, R12), DXY regime (R13 near-miss). The v4 research batch leaned too hard on macro / carry signals that either need decades of data (to separate premium from regime bet) or would never fire often enough to accumulate a test-slice sample in ten years.
Alpha Researcher v5 is now the next loop. Three hard constraints for v5, learned the hard way from v4:
1. Data source must be named and verified to expose the true signal before a harness is built. R10 and R11 both died because the available data only supported a proxy of the real signal.
2. Prefer signals that naturally fire ≥ 50 times per year — intraday, event-driven, or short-lookback technical. Monthly carry signals are fine in principle but the 20-trade-per-test-slice sample gate starves them.
3. Full-window Sharpe is a hard gate, set after Round 11. A strategy that only works in one half of the 2016-2026 window is a regime bet; no more "only works post-QE" passes.
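
Constraint 2 is a sample-size argument, and a back-of-envelope calculation shows its scale. The sketch below assumes a ten-year window, a 15 % test slice, and signals that fire evenly through time. These are simplifications for illustration, and the function names are hypothetical.

```python
def expected_test_trades(fires_per_year, window_years=10.0, test_fraction=0.15):
    """Rough trade count landing in the held-out test slice."""
    return fires_per_year * window_years * test_fraction


def min_fires_per_year(trade_gate, window_years=10.0, test_fraction=0.15):
    """Firing rate needed for the test slice to reach a trade-count gate."""
    return trade_gate / (window_years * test_fraction)


# A signal that fires 8 times a year (the FOMC case from Round 15)
# yields about 12 test-slice trades, well short of a 50-trade gate.
fomc_trades = expected_test_trades(8)

# A signal firing 50 times a year yields about 75.
frequent_trades = expected_test_trades(50)
```

On these assumptions, clearing a 50-trade gate takes roughly 33 fires per year, and even the older 20-trade gate takes about 13, which a signal that trades only a few times a year cannot reach.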

---

WOLFX publishes every signal and every realized fill. Past performance, including walk-forward backtest performance, is not predictive of future results. Every strategy here is informational — nothing is investment advice.
