The WOLFX Backtest Gauntlet — Live Scoreboard

WOLFX Research · Updated 2026-04-24

Every strategy WOLFX trades has cleared a walk-forward 70/15/15 backtest with hard gates. Every strategy WOLFX considers but rejects is published here — the rejections are how you know the filter is real.
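
As a rough illustration of the 70/15/15 walk-forward split named above, the sketch below cuts a return series chronologically and scores each slice. It is a minimal sketch under stated assumptions, not the WOLFX harness: the function names, the daily-return input, and the 252-day annualisation factor are all hypothetical.

```python
from math import sqrt
from statistics import mean, stdev
from typing import List, Sequence, Tuple


def split_70_15_15(returns: Sequence[float]) -> Tuple[List[float], List[float], List[float]]:
    """Chronological split: first 70% train, next 15% validation, last 15% test.

    No shuffling, so the test slice is always the most recent data.
    """
    r = list(returns)
    n = len(r)
    train_end = n * 70 // 100  # integer arithmetic avoids float rounding
    val_end = n * 85 // 100
    return r[:train_end], r[train_end:val_end], r[val_end:]


def annualized_sharpe(returns: Sequence[float], periods_per_year: int = 252) -> float:
    """Mean over sample standard deviation, scaled by sqrt(periods_per_year).

    Assumption: a zero risk-free rate and daily returns.
    Returns 0.0 when the slice is too short or has no variance.
    """
    if len(returns) < 2:
        return 0.0
    sd = stdev(returns)
    if sd == 0:
        return 0.0
    return mean(returns) / sd * sqrt(periods_per_year)
```

Computing `annualized_sharpe` on each of the three slices gives the train / val / test trajectory quoted throughout the tables below.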

Running score: 4 PASS / 26 rounds (15 %). One documented near-miss (Round 13). FIVE LIVE strategies gauntleted post-deployment, ALL FIVE rejected (Rounds 21, 22, 23, 24, 25). Live-strategy validation campaign COMPLETE — zero gauntlet-validated live alpha.

The passes

| Round | Strategy | Test Sharpe | Shipped |
|---|---|---|---|
| 7 | Cross-Asset Futures Trend (ES/NQ/RTY/YM/6E/6J, 12-1 skip-month) | 0.895 | V167 · 2026-04-24 · whitepaper |
| 8 | VIX Contango Carry (short VXX / long VXZ, regime-gated) | 1.41 | V169 · 2026-04-24 · whitepaper |
| 14 | Overnight Drift Reversal (SPY MOC→MOO, 5d-intraday filter) | 1.229 | V174 · 2026-04-25 · whitepaper |
| 19 | Intraday VWAP Breakout (top-50 SP500, 50bp threshold, long+short) | 1.83 test / 0.47 full | V182 · 2026-04-28 · whitepaper TBD · GUARDED PASS: slice trajectory train 0.40 → val -0.60 → test 1.83 is regime-flip-shaped, not steady-edge-shaped. PF clears gate by 1bp (1.21 vs 1.20). Canary at 0.5% NAV (not spec'd 5%) + rolling-Sharpe kill-switch when 20-trade rolling Sharpe < 0 for 30 trades. |

(Round 15 intentionally skipped — FOMC Pre-Announcement Drift only fires 8 times/year, can't clear the v5 ≥50-trade gate. Proposed as a sizing multiplier on Round 14, not standalone.)
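
The Round 19 guard in the table above (kill when the 20-trade rolling Sharpe is below zero for 30 trades) could be sketched roughly as follows. The class name and the reading of "for 30 trades" as 30 consecutive trade closes are assumptions; the report does not spell out the exact trigger.

```python
from collections import deque
from statistics import mean, stdev


class RollingSharpeKillSwitch:
    """Trips once the rolling per-trade Sharpe stays negative long enough.

    Assumption: "20-trade rolling Sharpe < 0 for 30 trades" means the Sharpe
    of the trailing 20 trades was negative at 30 consecutive trade closes.
    """

    def __init__(self, window: int = 20, patience: int = 30) -> None:
        self.window = window
        self.patience = patience
        self._pnl = deque(maxlen=window)  # trailing trade P&L
        self._negative_streak = 0
        self.tripped = False

    def on_trade(self, pnl: float) -> bool:
        """Record one closed trade; return True once the switch has tripped."""
        self._pnl.append(pnl)
        if len(self._pnl) == self.window:
            sd = stdev(self._pnl)
            sharpe = mean(self._pnl) / sd if sd > 0 else 0.0
            # Any non-negative reading resets the streak.
            self._negative_streak = self._negative_streak + 1 if sharpe < 0 else 0
            if self._negative_streak >= self.patience:
                self.tripped = True  # latched: stays tripped until manually reset
        return self.tripped
```

Under this reading the earliest possible trip is the 49th trade: 20 trades to fill the window, then 29 more negative readings.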

Both strategies are in 30-day paper-shadow canary via V170 scheduler. Flag flip to live execution happens only after rolling Sharpe ≥ 0.5 with no monthly drawdown > 3 %.
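
The promotion rule above (rolling Sharpe of at least 0.5 and no monthly drawdown above 3 %) can be expressed as a small check. This is a sketch only: the function names and the idea of passing one equity curve per calendar month are hypothetical, and how the V170 scheduler actually measures monthly drawdown is not stated in this report.

```python
from typing import Iterable, Sequence


def max_drawdown(equity: Sequence[float]) -> float:
    """Largest peak-to-trough decline, as a positive fraction of the peak."""
    peak = equity[0]
    worst = 0.0
    for value in equity:
        peak = max(peak, value)
        worst = max(worst, (peak - value) / peak)
    return worst


def ready_for_live(
    rolling_sharpe: float,
    monthly_equity_curves: Iterable[Sequence[float]],
    min_sharpe: float = 0.5,       # threshold from the text above
    max_monthly_dd: float = 0.03,  # 3 % monthly drawdown cap from the text above
) -> bool:
    """True only if both canary conditions hold.

    Assumption: drawdown is measured within each month's own equity curve.
    """
    if rolling_sharpe < min_sharpe:
        return False
    return all(max_drawdown(curve) <= max_monthly_dd for curve in monthly_equity_curves)
```

For example, `ready_for_live(0.6, [[100, 98, 101]])` passes (2 % drawdown), while a month that dips from 100 to 96 blocks the flag flip regardless of Sharpe.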

The rejections

| Round | Strategy | Test Sharpe | Verdict | Finding |
|---|---|---|---|---|
| 1 | Overnight Gap Continuation | -8.11 | NO-GO | Signal evaporated when realistic fill costs applied. Infrastructure gap on premarket data. |
| 2 | Crypto Funding Arbitrage v1 | 3.15 IS / -5.68 OOS | NO-GO | Textbook overfit. Great in-sample, dead out-of-sample. |
| 3 | Crypto Funding v2 (long-only, z < -3) | 7.22 (spurious) | NO-GO | Forensic finding: the claimed 71.4 % WR was implicitly bundled with a "price at 20-day low" filter. The filter, not the funding signal, was doing the work. |
| 4 | Cointegration Pairs Trading | -1.51 | NO-GO | Mega-cap dispersion in 2025-2026 broke the cointegration assumptions that made this work in 2015. |
| 5 | Momentum + VIX Long/Short | -1.23 | NO-GO | Benign-VIX regimes produce short-squeeze spikes that shred the short leg. MaxDD 25 %. |
| 6 | Momentum long-only (decomposed) | +0.88 | NO-GO-drawdown | Sharpe ok, but MaxDD 12.85 % eats the risk budget. |
| | PEAD (Post-Earnings Drift) | -2.77 | HARD NO-GO | THE SIGNAL HAS INVERTED. A positive earnings surprise now predicts -0.61 % forward return. Classical decades-old anomaly is now anti-signal. |
| 9 | G10 FX Trend+Carry | 0.164 | NO-GO | Strategy lost money over 9 years. AUD/USD + USD/JPY profitable legs couldn't offset EUR/USD, GBP/USD, CHF, etc. |
| 10 | Commodity Basis Carry | -1.58 | NO-GO | Proxy rejected, not the underlying premium. Inverted-momentum-as-basis shorted the 2024-2026 gold/palladium rally. Real term-structure data required. |
| 11 | Treasury Curve Carry 2s10s (Yahoo futures ratio) | 0.54 test / -1.08 train | NO-GO | Test slice looked fine (3 of 4 gates pass) but strategy lost 18 % over the full 10-year window. Only worked post-QE. A regime bet, not a carry premium. |
| 12 | Treasury Curve Carry 2s10s (FRED daily yields, IEF/SHY) | -3.34 | NO-GO | Retested Round 11 with real yield data to falsify "maybe the proxy was the problem." Result: real yields were worse than the proxy. Full-window Sharpe -0.99, final NAV down 36 %. One trade alone (Oct 2023 short SHY at 3.78× weight) lost $44.8K when 2Y yields fell into the rate-cut cycle. The underlying signal, not the proxy, is wrong for the 2022-2026 hiking/cutting regime. |
| 13 | DXY Regime Switch (long-only, Variant B) | 1.20 test / 0.61 full | NEAR-MISS / NO-GO | Sharpe 1.20, PF 2.95, MaxDD -1.47 %, full-window Sharpe positive: every substantive gate clears with margin. Fails only on trade count (2 vs gate 20). The 20-trade gate is miscalibrated for a regime classifier that fires ~3 times per test slice by design. Honest verdict under strict rules: NO-GO. Under the same gate-calibration argument that Round 7 (Trend) accepted, this would be a PASS; that's a calibration decision, not a statistical one. Flagged for re-evaluation if the trade-count gate is recalibrated per signal class. |
| 16 | RRP-Driven Treasury Carry Reversal (FRED RRPONTSYD → SHY) | -0.89 test / 0.24 full | NO-GO | Test slice fails 3 of 5 gates (Sharpe -0.89, PF 0.88, only 29 trades). Walk-forward: train -0.03, val +1.81, test -0.89, a textbook in-sample-fit / out-of-sample-collapse. The validation slice caught the late-2023 RRP drain wave during the Fed pivot; the test slice is in a post-RRP-trough regime where the facility sits near zero and meaningful drains stop happening. Two side findings: (1) the v5 proposal had a units bug: RRPONTSYD is in billions, not millions; harness corrected; (2) the premium, if it existed, has likely been arbitraged away in the four years since Copeland-Duffie-Yang published. |
| 17 | Speculator Crowding Reversal (CFTC COT, 12 commodities cross-sectional) | 0.11 test / 0.13 full | NO-GO | Test PF 1.03 (gate 1.2). 444 legs: a huge sample, so the near-zero Sharpe is statistically firm, not noise. Train slice ran -35% MaxDD during 2021-22 commodity supercycle when speculators stayed crowded long AND prices kept rising, exactly the regime Boons-Prado warn breaks the reversion. Recommendation was retest at 4-week hold (paper's documented 4-8w reversion window). |
| 17b | Speculator Crowding Reversal, 4w / 6w / 8w hold retest | -0.37 / -0.79 / -0.39 test | PERMANENT NO-GO | Tested at every horizon the paper documents. ALL THREE produce negative test-slice Sharpe (PF 0.93, 0.86, 0.93, all below 1.0). Sample is 408-432 legs each, not noise. Full-window Sharpes 0.31-0.73 are positive (train+val carried edge), but the held-out test slice (~late-2024 → Apr 2026) flipped sign. Strategy is genuinely dead at retail-data scale. Two cleanly-separated explanations: (a) premium decay since paper's 1986-2018 sample (RFS publication + COT-factor ETFs likely arbitraged); (b) Yahoo continuous-futures roll noise. Action identical regardless: permanent shelf. Five horizons tested (1w/2w/4w/6w/8w); strategy gets removed from future v7 alpha pipelines. |
| 18 | HY-OAS-Gated Put Credit Spreads (Israelov-Klein 2024, SPY/QQQ/IWM weekly) | -0.14 test (BS-on-RV) / +0.63 test (with +20% IV-VRP uplift) | NO-GO formal · R18b pending data | Formal NO-GO at Black-Scholes-on-realized-vol pricing (PF 0.94, full-window Sharpe -0.39). But agent's sensitivity test under defensible IV-over-RV pricing (Bakshi-Kapadia 2003, Israelov 2017) flips ALL 5 gates to pass: Sharpe 0.63, PF 1.30, full-window 0.31. The "failure" is a modeling artifact: BS-on-RV erases the very variance-risk premium the strategy harvests by construction. Reversible (distinct from R17b permanent shelf). Two data blockers: FRED API key (free, lifts HY OAS cap from 3yr to 25yr), historical options chains (Polygon $199/mo or OptionAlpha $99 one-time). Promote to R18b once procured. |
| 20 | GP/A Quality Long-Short (Novy-Marx 2013, sector-neutral SP500 quintiles) | -2.12 test / -0.42 full-window | NO-GO | The factor has inverted in 2024-2026. Train +0.12 → val -0.57 → test -2.12 is monotonic DECAY, the literal opposite of the v7-hypothesised monotonic improvement. New v7 trajectory gate (gate #8, added because of R19 regime-flip lesson) caught this: train alone looked benign, but val and test progression was the textbook overfit/regime-decay shape. Mechanism: top-quintile names (mag-7 quality leaders) kept rallying through 2024-25 while the SHORT LEG (bottom quintile distressed industrials / capital-heavy energy) bounced harder in the 2024-26 reflation. Within-sector neutralisation didn't save it. Joins PEAD and Momentum+VIX in the "classical equity factor that has inverted" pile. Permanent shelf as a single-factor signal; agent recommends regime-filtered variant for v8 (only fade junk when long-junk underperforming long-quality 6mo). Coverage bias: 314/503 SP500 had clean fundamentals (financials excluded; banks don't report COGS). |
| 21 | sniper_mean_reversion (LIVE WORKHORSE validation) | -0.34 test / -0.27 full-window | NO-GO: production is regime luck | Walk-forward over 2,973 trades (vs production's 48) showed the live engine's main alpha source is statistically noise + regime survivorship. Train Sharpe -0.41, Val +0.53, Test -0.34, a classic lucky-middle-slice. Production +$3,677 from PF 3.05 reflects an Oct-2023 → Apr-2024 mean-reversion-friendly regime; the 10-year backtest finds 49% WR, PF 0.91, -30% NAV. The 1.5 ATR stop = 1.5 ATR target with 49% WR is structurally a money-loser even before costs. Concrete actions: tighten conf gate 0.50→0.75 (extreme-only path), add per-strategy kill-switch on rolling 30-trade PF < 1.5, change R:R from 1:1 to 1:2, prune universe per-ticker. V186 ships the kill-switch + tightened conf gate. |
| 22 | news_alpha (LIVE strategy validation) | -1.47 test / -0.72 full-window | NO-GO: 6-trade production sample is noise | Walk-forward over 1,123 events (earnings as news-proxy) shows the strategy is unedge-bearing. Random-direction arm: 40.3% WR, PF 0.87, -18% NAV. Momentum-direction arm: 38.6% WR, PF 0.80, -28% NAV. Production "PF 99.90" / 100% WR was 6 coin flips heads in a row (p ≈ 1.6%). The 3% target / 2% stop asymmetry needs >40% WR + sentiment oracle margin > 15bps cost; both untestable from 6 live trades. The exit mechanics carry no inherent edge around news events. Pattern-matches R21 sniper_mean_reversion exposure: live strategies with tiny samples reflect regime + selection bias, not structural edge. V189 adds news_alpha kill-switch. |
| 23 | wolf_quantum_convergence (LIVE strategy validation) | +0.36 test / +0.38 full-window | NO-GO: regime-stable but edge-too-small after costs | Walk-forward over 2,471 trades shows the convergence mechanic produces a stable, weakly-positive distribution: train +0.45, val +0.03, test +0.36, all same-sign. Different rejection class from R21/R22 (those were lucky-middle-slice patterns). PF 1.07 across the full window: real but tiny structural drift. Win rate 46.5% with 2:1 R:R bracket should print PF ~1.7 if WR were even 50%; the 5% target rarely hits in 5 days while losses cap at -2.5% stops. Mean trade +$8 on $10K notional after 10bps round-trip; statistical noise dominates. Critical caveat: 5 of 11 source weights (news/social/congress/insider/GDELT cascade) are non-replayable. If live edge exists, it lives entirely in those 5, and we cannot validate it. Production "PF 99.9" / 4 trades is a 0.465^4 = 4.7% probability outcome under the actual distribution. V190 ships kill-switch + tightens V143 quantum bypass conf gate from 0.80 → 0.95. |
| 24 | forex_trend (LIVE OANDA validation) | +0.33 test / -0.39 full-window | NO-GO: anti-edge | The worst rejection yet. Walk-forward over 1,754 trades, 21 instruments, 2018-2026: full-window total return -99.3%, Sharpe -0.39, 7 of 9 years negative. 2018 -54.6%, 2019 -60.5%, 2020 -13.5%, 2021 -55.4%, 2022 -69.5%, 2023 -44.3%, 2024 -52.4%, 2025 +115.2% (the lone winner, a single-regime tail event on USD weakness + gold blow-off), 2026 YTD -44.9%. Per-pair: only 6 of 21 instruments PF > 1.0; JPY-crosses + equity indices are toxic (JP225 PF 0.64, GBP_JPY 0.60). 42% WR at 1.6 R:R is breakeven; transaction costs push it under. ADX>20 floor admits whipsaw regimes. Recommendation: disable forex_trend immediately. V191 ships kill-switch (WOLFX_FOREX_TREND_ENABLED). The 2025 P&L that justified deployment was tail-luck on regime persistence. |
| 25 | forex_reversion (LIVE OANDA validation, 5/5 strategies tested) | +0.87 test / +0.32 full-window | NO-GO: short-vol blowup | The closest call of all 26 rounds. Walk-forward over 955 trades 2018-2026: passed 6 of 7 v7 gates including Sharpe 0.87, PF 1.42, +0.32 full-window. The single fail: Test MaxDD 37.4% (gate 15%), full-window MaxDD 76%. Cause: 2020 COVID year produced -61.3% over 169 trades: RSI<30 + lower-band signals kept firing into the vertical drop. Strategy is structurally short-volatility: sells tails, tails eventually get paid. Different rejection class: real per-pair edge does exist (EUR_GBP PF 3.55, EUR_JPY 1.72, XAG_USD 1.67, GBP_JPY 1.59). Mean reversion works on JPY crosses and silver, fails on gold/SPX/NZD/USD_CHF. Cannot be deployed without volatility regime filters + dynamic sizing; the live system has neither. MAJOR IMPLICATION: ALL FIVE live strategies gauntleted, ALL FIVE rejected. Zero gauntlet-validated live alpha. Recommendation per agent: halt live execution; convert engine to research/shadow harness until a strategy passes the gauntlet from scratch. The four shadow strategies (Trend, VIX Carry, Overnight Drift, VWAP Breakout) remain the only credible pipeline. |
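
Several findings in the table above rest on simple win-rate arithmetic: the 1:1 bracket at 49 % WR in Round 21, the 3 % target / 2 % stop breakeven and the six-in-a-row probability in Round 22, and the four-in-a-row probability in Round 23. The sketch below reproduces those figures under a simplifying assumption that every win and every loss has a fixed size and that costs are ignored; the function names are hypothetical.

```python
def breakeven_win_rate(reward_to_risk: float) -> float:
    """Win rate at which expected P&L is zero for fixed-size wins and losses."""
    return 1.0 / (1.0 + reward_to_risk)


def gross_profit_factor(win_rate: float, reward_to_risk: float) -> float:
    """Gross wins over gross losses, before costs, for fixed-size outcomes."""
    return win_rate * reward_to_risk / (1.0 - win_rate)


def prob_all_wins(win_rate: float, n_trades: int) -> float:
    """Probability that n independent trades all win."""
    return win_rate ** n_trades


# Round 22: a 3 % target against a 2 % stop is a 1.5 reward-to-risk ratio.
r22_breakeven = breakeven_win_rate(3.0 / 2.0)    # 0.40, the ">40% WR" in the table
r22_six_heads = prob_all_wins(0.5, 6)            # 0.015625, about 1.6 %

# Round 21: equal stop and target at 49 % WR is below 1.0 before any costs.
r21_pf = gross_profit_factor(0.49, 1.0)          # about 0.96

# Round 23: four straight wins at the backtested 46.5 % win rate.
r23_four_wins = prob_all_wins(0.465, 4)          # about 0.047, i.e. 4.7 %
```

The point of the exercise is the same one the table makes: a perfect record over four or six trades is an unremarkable outcome under a no-edge distribution.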

Methodology

Meta-insights the swarm has learned

  1. The 2015-2021 equity-factor playbook is upside-down in 2026. Six out of the first eight rejections were classical equity factors. PEAD inverting was the clincher. Mega-cap concentration, passive flows, and retail options gamma have rewired the single-name tape.
  2. The first two passes (Rounds 7 and 8) are macro / commodity / vol strategies with multi-decade academic lineage. The alpha research pivot to this territory produced two consecutive passes after six consecutive equity-factor rejections. That is not coincidence; it is structure.
  3. Data proxies die in out-of-sample testing. Round 10 and Round 11 both had to use Yahoo-price proxies because the true signal requires curve / yield data. Both failed. The lesson: before we backtest another carry signal, procure the data.

Round 14 — first v5 round, first PASS under v5 constraints

Variant B (mean-reversion-filtered overnight long): test Sharpe 1.229, MaxDD -8.36 %, PF 1.25, 163 trades, full-window Sharpe 0.447. All five v5 gates cleared. Train/Val/Test Sharpe 0.51 / 1.35 / 1.23 — clean OOS pattern, no overfit signature. Variant A (always-on overnight long, no filter) NO-GO at PF 1.09 — confirms the filter is doing real work.
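
A gate check of the kind described above might look like the sketch below. Only some thresholds appear in this report (PF of at least 1.20, at least 50 trades under v5, positive full-window Sharpe); the test-Sharpe floor of 0.5 and the 10 % MaxDD cap used here are placeholder assumptions, as are all the names.

```python
from dataclasses import dataclass, replace
from typing import Dict


@dataclass(frozen=True)
class SliceMetrics:
    test_sharpe: float
    max_drawdown: float        # positive fraction, e.g. 0.0836 for -8.36 %
    profit_factor: float
    trades: int
    full_window_sharpe: float


def evaluate_gates(
    m: SliceMetrics,
    *,
    min_sharpe: float = 0.5,   # ASSUMPTION: not stated in the report
    max_dd: float = 0.10,      # ASSUMPTION: not stated for v5 in the report
    min_pf: float = 1.20,      # PF gate quoted in the tables
    min_trades: int = 50,      # v5 trade-count gate quoted for Round 15
) -> Dict[str, bool]:
    """Return a pass/fail flag per gate; a PASS needs every flag to be True."""
    return {
        "test_sharpe": m.test_sharpe >= min_sharpe,
        "max_drawdown": m.max_drawdown <= max_dd,
        "profit_factor": m.profit_factor >= min_pf,
        "trade_count": m.trades >= min_trades,
        "full_window_sharpe": m.full_window_sharpe > 0,
    }


# Variant B figures as quoted above.
variant_b = SliceMetrics(1.229, 0.0836, 1.25, 163, 0.447)
variant_b_gates = evaluate_gates(variant_b)

# Hypothetical: Variant B with only the PF swapped for Variant A's 1.09.
# Variant A's other metrics are not given, so this isolates the PF gate only.
pf_swapped_gates = evaluate_gates(replace(variant_b, profit_factor=1.09))
```

With these inputs every Variant B flag is True, while the PF-swapped copy fails the profit-factor gate alone.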

The strategy fills the high-frequency gap that v4 was structurally unable to test. Diversifies cleanly from Trend (monthly futures momentum) and VIX Carry (monthly vol selling).

What's next

---

WOLFX publishes every signal and every realized fill. Past performance, including walk-forward backtest performance, is not predictive of future results. Every strategy here is informational — nothing is investment advice.
