The real multi-agent pipeline is replayed over a PAST week using only the data available on each decision day, then graded against the prices that actually followed. Because the future is already known, there is no waiting.
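The replay idea above can be sketched as a loop that truncates each price series at the decision day before the decision function sees it, then grades the decision on later prices. This is a minimal illustration, not the pipeline's code: `replay`, `decide`, `momentum`, and the index-based 7-day horizon are hypothetical names and simplifications.

```python
from typing import Callable, Dict, List, Tuple

Prices = Dict[str, List[float]]  # ticker -> daily closes, list index = day number


def replay(
    prices: Prices,
    decision_days: List[int],
    decide: Callable[[Prices], Dict[str, str]],
    horizon: int = 7,
) -> List[Tuple[int, str, str, float]]:
    """Replay decisions point-in-time and grade each one on later prices."""
    graded = []
    for day in decision_days:
        # Truncate every series at the decision day: nothing after `day` is visible.
        visible = {t: series[: day + 1] for t, series in prices.items()}
        actions = decide(visible)
        for ticker, action in actions.items():
            series = prices[ticker]
            if day + horizon >= len(series):
                continue  # not enough future data to grade this decision
            forward_return = series[day + horizon] / series[day] - 1.0
            graded.append((day, ticker, action, forward_return))
    return graded


def momentum(visible: Prices) -> Dict[str, str]:
    """Toy rule (hypothetical): BUY if the last visible close is above the first."""
    return {t: "BUY" if s[-1] > s[0] else "SELL" for t, s in visible.items()}


if __name__ == "__main__":
    demo = {"AAA": [100.0 + i for i in range(20)]}
    for row in replay(demo, decision_days=[5, 10], decide=momentum):
        print(row)
```

Passing only the truncated `visible` dict to `decide` is what keeps future prices out of the decision; the full series is touched only in the grading step.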
decision support
vs NISA S&P (Japan resident, all costs)
verdict: NISA S&P WINS (by >0.5pp)
Should the remaining NISA quota go to this AI agent or to tsumitate (installment-plan) S&P 500? Compares the agent's mean return, net of US slippage, FX, JP broker fees, and the 20.315% capital gains tax, against SPY total return (tax-free inside NISA).
Research output, not financial advice. Numbers are model approximations on a small sample.
Sample is small — these numbers are observations, not proof. Do not change real allocation based on this panel until significance tier reaches MEANINGFUL (n ≥ 30) and verdict is consistent across regimes.
mean returns (n=3)
agent gross (raw equity): +0.50%
agent net of US slippage + JP costs (FX + broker): -1.10%
agent NISA-OUTSIDE (after 20.315% tax): -1.17%
agent NISA-INSIDE (tax-free): -1.10%
NISA S&P (SPY total return, tax-free): +1.27%
Δ inside NISA (agent − SPY): -2.37pp
Δ outside NISA (after tax): -2.44pp
assumptions
JP capital gains tax: 20.315%
FX round-trip: 30bps
JP broker round-trip: 99bps
benchmark: SPY total return (NISA-tax-free)
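The inside/outside NISA figures can be reproduced from the per-window agent returns if the 20.315% tax is applied window by window to gains only, with no loss offset. That tax treatment is an assumption (the panel does not state it), and the per-window inputs are taken from the "AI (levered)" column of the scenario table, whose mean matches the -1.10% net figure. A minimal sketch:

```python
JP_CAPITAL_GAINS_TAX = 0.20315  # from the assumptions panel


def after_tax(return_pct: float, tax_rate: float = JP_CAPITAL_GAINS_TAX) -> float:
    """Tax gains only; losses pass through unchanged (assumed: no loss offset)."""
    return return_pct * (1.0 - tax_rate) if return_pct > 0 else return_pct


def mean(values):
    return sum(values) / len(values)


# Per-window agent returns net of costs, in percent.
agent_net = [-2.28, 1.03, -2.04]
spy_mean = 1.27  # NISA S&P mean as displayed on the panel

inside = mean(agent_net)                           # tax-free inside NISA
outside = mean([after_tax(r) for r in agent_net])  # taxed outside NISA

print(f"inside NISA : {inside:+.2f}%  delta {inside - spy_mean:+.2f}pp")
print(f"outside NISA: {outside:+.2f}%  delta {outside - spy_mean:+.2f}pp")
```

Under these assumptions the sketch lands on the displayed -1.10% / -1.17% and the -2.37pp / -2.44pp deltas. Only the single winning window is taxed, which is why the tax drag is just 0.07pp.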
decision support — interactive
Scenario simulator: NISA S&P vs AI agent mix × leverage
Move the sliders to see how a mix of (NISA S&P 500) + (AI agent at chosen leverage, inside/outside NISA) would have played out across the same 3 backtest windows. Pure math from existing data — no LLM re-run.
Research output, not financial advice. Leverage math is window-level approximation; ruin = AI portion hit ≥100% drawdown.
Same small-sample caveat: do not change real NISA allocation based on this slider until significance tier reaches MEANINGFUL (n ≥ 30).
mix slider: 0% (all NISA) to 100% (all AI)
leverage slider: 1x to 4x
| window | regime | NISA S&P | AI (levered) | combined | Δ vs NISA | ruin? |
|---|---|---|---|---|---|---|
| 2026-03-17 → 2026-03-30 | DOWN | -5.22% | -2.28% | -5.22% | +0.00pp | — |
| 2026-04-06 → 2026-04-17 | UP | +6.99% | +1.03% | +6.99% | +0.00pp | — |
| 2026-05-06 → 2026-05-19 | SIDE | +2.06% | -2.04% | +2.06% | +0.00pp | — |
| mean (n=3) | | +1.27% | -1.10% | +1.27% | +0.00pp | 0/3 |
mean combined return: +1.27%
mean Δ vs NISA S&P: +0.00pp
worst window: +0.00pp
ruin (AI portion ≥100% DD): 0/3
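The slider math can be sketched as a weighted mix of the two window returns. The exact leverage formula is not given on the page; this sketch assumes the window return is multiplied by the leverage factor and floored at -100% (ruin). `Window`, `levered`, `simulate`, and `ai_share` are illustrative names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Window:
    label: str
    nisa_pct: float  # NISA S&P return for the window, percent
    ai_pct: float    # AI agent return at 1x for the window, percent


def levered(ai_pct: float, leverage: float) -> float:
    """Window-level approximation: scale the return, floor at -100% (ruin)."""
    return max(ai_pct * leverage, -100.0)


def simulate(windows: List[Window], ai_share: float, leverage: float):
    """Mix (1 - ai_share) of NISA S&P with ai_share of the levered agent."""
    rows = []
    for w in windows:
        ai = levered(w.ai_pct, leverage)
        combined = (1.0 - ai_share) * w.nisa_pct + ai_share * ai
        rows.append({
            "window": w.label,
            "combined": combined,
            "delta_vs_nisa": combined - w.nisa_pct,
            "ruin": ai <= -100.0,
        })
    return rows


WINDOWS = [
    Window("2026-03-17 → 2026-03-30", -5.22, -2.28),
    Window("2026-04-06 → 2026-04-17", 6.99, 1.03),
    Window("2026-05-06 → 2026-05-19", 2.06, -2.04),
]

if __name__ == "__main__":
    for row in simulate(WINDOWS, ai_share=0.5, leverage=2.0):
        print(row)
```

With `ai_share=0.0` the combined column equals the NISA S&P column and every delta is zero, which is the state shown in the table above.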
SCOPE: The backtest is a DEGRADED variant of the live agent, not a copy.
Analyst lanes: all 4 (bull/bear/technical/fundamental) still run, but fundamental gets ALL-NONE inputs (lane functionally dead); bull/bear see real technicals + null fundamentals + empty headlines (partially degraded); only technical_analyst is fully fed. The trader effectively decides on a TECHNICAL-HEAVY brief set.
Universe: 15 mega-cap tech (Mag7 + AI), highly correlated.
Fees model: ~5bps one-way slippage approximation.
Macro context: World Monitor (WM) is fed to bull/bear/trader point-in-time. Each decision day pulls WM's ?as_of= reconstruction (macro regime + 1d/1w asset bias rebuilt from daily history as of that day).
Approximations: news sentiment uses a VIX proxy (no historical news archive), and WM's 10-min horizon / sizing / signal-quality are omitted in as_of mode.
Results DO NOT generalize to the live agent or to broader markets.
Aggregate across LLM strategy windows (demo excluded)
sample tier: ANECDOTE (n<5) (n=4)
Sample is too small for statistical significance. These numbers are observations, NOT proof of edge or skill. Read them as 'in this handful of windows so far'.
Conviction calibration (pooled decisions across all LLM windows)
Mean t+7d forward price move grouped by conviction (0-1). Calibrated agent: BUY high should beat BUY low; SELL high should fall more than SELL low.
| decision | low (0-0.33) | mid (0.33-0.67) | high (0.67-1) |
|---|---|---|---|
| BUY | — (n=0) | +2.26% (n=8) | +6.55% (n=26) |
| SELL | — (n=0) | +10.99% (n=3) | +3.11% (n=51) |
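The grouping behind this table can be sketched as follows. The bucket edges come from the column labels; how a conviction of exactly 0.33 or 0.67 is assigned is an assumption (here it goes to the higher bucket), and the tuple layout of a decision is hypothetical.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple


def bucket(conviction: float) -> str:
    """Map a 0-1 conviction to low / mid / high (edge handling is assumed)."""
    if conviction < 0.33:
        return "low"
    if conviction < 0.67:
        return "mid"
    return "high"


def calibration(
    decisions: List[Tuple[str, float, float]]
) -> Dict[Tuple[str, str], Tuple[float, int]]:
    """decisions: (action, conviction, t+7d move in percent).

    Returns (action, bucket) -> (mean t+7d move, n).
    """
    groups = defaultdict(list)
    for action, conviction, fwd_7d_pct in decisions:
        groups[(action, bucket(conviction))].append(fwd_7d_pct)
    return {key: (mean(vals), len(vals)) for key, vals in groups.items()}


if __name__ == "__main__":
    # Made-up decisions, for illustration only.
    sample = [
        ("BUY", 0.70, 4.0),
        ("BUY", 0.90, 6.0),
        ("BUY", 0.50, 1.0),
        ("SELL", 0.80, -2.0),
    ]
    for key, (avg, n) in sorted(calibration(sample).items()):
        print(key, f"{avg:+.2f}% (n={n})")
```

Buckets with no decisions simply do not appear in the result, which corresponds to the (n=0) cells in the table.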
by regime (benchmark return)
| regime | n | α | hit | agent / bench |
|---|---|---|---|---|
| DOWN (bench < −3%) | 1 | +6.72pp | 100% | -0.86% / -7.59% |
| SIDEWAYS (±3%) | 0 | — | — | — / — |
| UP (bench > +3%) | 3 | -21.64pp | 0% | +3.23% / +24.86% |
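The regime split uses the ±3% benchmark thresholds from the row labels. In the sketch below, "hit" is assumed to mean a window in which the agent beat the benchmark, since the page does not define it; the function names are illustrative. The (agent, bench) pairs are the four LLM windows listed on this page.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def classify_regime(bench_pct: float, band: float = 3.0) -> str:
    """DOWN below -band, UP above +band, SIDEWAYS in between (percent)."""
    if bench_pct < -band:
        return "DOWN"
    if bench_pct > band:
        return "UP"
    return "SIDEWAYS"


def by_regime(windows: List[Tuple[float, float]]) -> Dict[str, dict]:
    """windows: (agent_pct, bench_pct) pairs, grouped by benchmark regime."""
    groups = defaultdict(list)
    for agent, bench in windows:
        groups[classify_regime(bench)].append((agent, bench))
    out = {}
    for regime, rows in groups.items():
        alphas = [a - b for a, b in rows]
        out[regime] = {
            "n": len(rows),
            "alpha_pp": sum(alphas) / len(alphas),
            # Assumption: a "hit" is a window where the agent beat the benchmark.
            "hit_rate": sum(a > 0 for a in alphas) / len(alphas),
        }
    return out


# (agent %, bench %) for the four LLM windows shown on this page.
WINDOWS = [(-0.86, -7.59), (3.01, 13.78), (-0.66, 4.36), (7.32, 56.45)]

if __name__ == "__main__":
    for regime, stats in by_regime(WINDOWS).items():
        print(regime, stats)
```

Because the inputs are the rounded values displayed on the page, the last digit can differ slightly from the table (for example +6.73pp against the displayed +6.72pp).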
phase 2 — pick quality only
Conviction filter sweep — BUY pick quality alone
best threshold (mean α): T=0 → -10.93pp
For each window, hypothetical return = Σ (BUY.target_weight × t+7d) for BUYs with conviction ≥ threshold T. Cash earns 0% on the rest. SELL/HOLD ignored. Answers: 'were the high-conviction BUYs actually good picks vs the bench?' — NOT a full portfolio replay.
Research output, not financial advice. Single-period model with strong simplifications.
Sample is small. Even if pick quality looks positive, it does NOT mean the full agent or a real portfolio would beat NISA S&P. Only the BUY picks are evaluated here, in isolation, for ~7 trading days.
| window | regime | bench | full agent return | full agent α | T=0 | T=0.33 | T=0.67 | T=0.8 |
|---|---|---|---|---|---|---|---|---|
| 2026-03-17 → 2026-03-30 | DOWN | -7.59% | -0.86% | +6.73pp | +8.47pp (n=1) | +8.47pp (n=1) | +8.47pp (n=1) | +7.59pp (n=0) |
| 2026-04-06 → 2026-04-17 | UP | +13.78% | +3.01% | -10.77pp | -0.26pp (n=15) | -0.26pp (n=15) | -0.79pp (n=14) | -13.78pp (n=0) |
| 2026-05-06 → 2026-05-19 | UP | +4.36% | -0.66% | -5.01pp | +4.51pp (n=10) | +4.51pp (n=10) | +1.72pp (n=5) | -4.36pp (n=0) |
| 2025-01-02 → 2026-01-06 | UP | +56.45% | +7.32% | -49.13pp | -56.45pp (n=0) | -56.45pp (n=0) | -56.45pp (n=0) | -56.45pp (n=0) |
| mean filtered α | | | | | -10.93pp | -10.93pp | -11.76pp | -16.75pp |
model assumptions
approximation: single-period BUY-picker sim (each kept BUY held ~7d, cash 0%, SELL/HOLD ignored). NOT a full broker replay — answers 'were the high-conviction BUYs actually good picks?' only.
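The single-period model described here can be sketched as follows: keep BUYs at or above the threshold, sum weight × t+7d move, treat the rest as cash at 0%, and subtract the benchmark. `Pick`, `filtered_alpha`, and `sweep` are hypothetical names, and the sample picks in the usage lines are invented for illustration.

```python
from typing import Dict, List, NamedTuple, Sequence, Tuple


class Pick(NamedTuple):
    action: str           # "BUY", "SELL" or "HOLD"
    conviction: float     # 0-1
    target_weight: float  # fraction of the portfolio, 0-1
    fwd_7d_pct: float     # price move 7 days after the decision, percent


def filtered_alpha(
    picks: Sequence[Pick], bench_pct: float, threshold: float
) -> Tuple[float, int]:
    """Alpha (pp) of the kept BUYs versus the benchmark, plus how many were kept."""
    kept = [p for p in picks if p.action == "BUY" and p.conviction >= threshold]
    # Unallocated weight sits in cash at 0%, so it adds nothing to the sum.
    hypothetical_return = sum(p.target_weight * p.fwd_7d_pct for p in kept)
    return hypothetical_return - bench_pct, len(kept)


def sweep(
    windows: List[Tuple[Sequence[Pick], float]],
    thresholds: Sequence[float] = (0.0, 0.33, 0.67, 0.8),
) -> Dict[float, float]:
    """Mean filtered alpha across windows for each threshold."""
    result = {}
    for t in thresholds:
        alphas = [filtered_alpha(picks, bench, t)[0] for picks, bench in windows]
        result[t] = sum(alphas) / len(alphas)
    return result


if __name__ == "__main__":
    picks = [
        Pick("BUY", 0.9, 0.5, 4.0),
        Pick("BUY", 0.5, 0.5, 2.0),
        Pick("SELL", 0.9, 0.5, -3.0),
    ]
    print(sweep([(picks, 1.0)]))
```

When no BUY survives the filter, the hypothetical return is 0% and alpha equals minus the benchmark. That is why the n=0 cells in the table read +7.59pp, -13.78pp, -4.36pp and -56.45pp.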
γ ablation
ablation: lanes_off
conclusion: SIGNAL NEUTRAL (|Δ| ≤ 0.5pp)
Same agent, same windows — but the ablated run no-ops every SELL order at the broker (agent reasoning unchanged). The α difference is the SELL signal's contribution. Sign convention: negative Δ = ablation hurt α = signal was valuable.
vanilla · mean α: -3.02pp
ablated · mean α: -2.94pp
Δ (ablated − vanilla): +0.08pp

| window | regime | bench | vanilla equity | ablated equity | vanilla α | ablated α | Δ α | blocked |
|---|---|---|---|---|---|---|---|---|
| 2026-03-17 → 2026-03-30 | DOWN | -7.59% | -0.86% | -0.94% | +6.72pp | +6.65pp | -0.07pp | 0 |
| 2026-04-06 → 2026-04-17 | UP | +13.78% | +3.01% | +3.94% | -10.77pp | -9.84pp | +0.92pp | 0 |
| 2026-05-06 → 2026-05-19 | UP | +4.36% | -0.66% | -1.27% | -5.01pp | -5.63pp | -0.61pp | 0 |
γ ablation
ablation: model_deepseek_r1_14b
conclusion: SIGNAL VALUABLE (mean α dropped without it)
Same agent, same windows — but the ablated run no-ops every SELL order at the broker (agent reasoning unchanged). The α difference is the SELL signal's contribution. Sign convention: negative Δ = ablation hurt α = signal was valuable.
vanilla · mean α: -5.01pp
ablated · mean α: -6.00pp
Δ (ablated − vanilla): -0.99pp

| window | regime | bench | vanilla equity | ablated equity | vanilla α | ablated α | Δ α | blocked |
|---|---|---|---|---|---|---|---|---|
| 2026-05-06 → 2026-05-19 | UP | +4.36% | -0.66% | -1.65% | -5.01pp | -6.00pp | -0.99pp | 0 |
γ ablation
ablation: model_qwen7b
conclusion: SIGNAL VALUABLE (mean α dropped without it)
Same agent, same windows — but the ablated run no-ops every SELL order at the broker (agent reasoning unchanged). The α difference is the SELL signal's contribution. Sign convention: negative Δ = ablation hurt α = signal was valuable.
vanilla · mean α: -5.01pp
ablated · mean α: -6.25pp
Δ (ablated − vanilla): -1.24pp

| window | regime | bench | vanilla equity | ablated equity | vanilla α | ablated α | Δ α | blocked |
|---|---|---|---|---|---|---|---|---|
| 2026-05-06 → 2026-05-19 | UP | +4.36% | -0.66% | -1.89% | -5.01pp | -6.25pp | -1.24pp | 0 |
γ ablation
SELL signal removed (vanilla vs --no-sell)
conclusion: SIGNAL VALUABLE (mean α dropped without it)
Same agent, same windows — but the ablated run no-ops every SELL order at the broker (agent reasoning unchanged). The α difference is the SELL signal's contribution. Sign convention: negative Δ = ablation hurt α = signal was valuable.
vanilla · mean α: -3.02pp
ablated · mean α: -4.14pp
Δ (ablated − vanilla): -1.13pp

| window | regime | bench | vanilla equity | ablated equity | vanilla α | ablated α | Δ α | blocked |
|---|---|---|---|---|---|---|---|---|
| 2026-03-17 → 2026-03-30 | DOWN | -7.59% | -0.86% | -4.29% | +6.72pp | +3.30pp | -3.42pp | 9 |
| 2026-04-06 → 2026-04-17 | UP | +13.78% | +3.01% | +3.82% | -10.77pp | -9.96pp | +0.80pp | 2 |
| 2026-05-06 → 2026-05-19 | UP | +4.36% | -0.66% | -1.42% | -5.01pp | -5.78pp | -0.76pp | 0 |
Total blocked: 11
γ ablation
ablation: picks_only
conclusion: SIGNAL HARMFUL (mean α improved without it)
Same agent, same windows — but the ablated run no-ops every SELL order at the broker (agent reasoning unchanged). The α difference is the SELL signal's contribution. Sign convention: negative Δ = ablation hurt α = signal was valuable.
vanilla · mean α: -3.02pp
ablated · mean α: -0.53pp
Δ (ablated − vanilla): +2.49pp

| window | regime | bench | vanilla equity | ablated equity | vanilla α | ablated α | Δ α | blocked |
|---|---|---|---|---|---|---|---|---|
| 2026-03-17 → 2026-03-30 | DOWN | -10.67% | -0.86% | -1.17% | +6.72pp | +9.51pp | +2.79pp | 2 |
| 2026-04-06 → 2026-04-17 | UP | +15.58% | +3.01% | +5.12% | -10.77pp | -10.46pp | +0.30pp | 2 |
| 2026-05-06 → 2026-05-19 | SIDEWAYS | -1.77% | -0.66% | -2.42% | -5.01pp | -0.65pp | +4.37pp | 2 |
Total blocked: 6
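The conclusion labels on the ablation panels follow from the sign convention: Δ = ablated mean α − vanilla mean α. A minimal sketch, assuming the 0.5pp neutral band from the "SIGNAL NEUTRAL" label also separates VALUABLE from HARMFUL (the page states the band only for the neutral case); `ablation_verdict` is an illustrative name.

```python
from statistics import mean
from typing import Sequence, Tuple

NEUTRAL_BAND_PP = 0.5  # from the "SIGNAL NEUTRAL (|Δ| ≤ 0.5pp)" label


def ablation_verdict(
    vanilla_alpha: Sequence[float],
    ablated_alpha: Sequence[float],
    band: float = NEUTRAL_BAND_PP,
) -> Tuple[float, str]:
    """Return (delta in pp, verdict) from per-window alphas."""
    delta = mean(ablated_alpha) - mean(vanilla_alpha)
    if abs(delta) <= band:
        verdict = "SIGNAL NEUTRAL"
    elif delta < 0:
        verdict = "SIGNAL VALUABLE"  # alpha dropped without the signal
    else:
        verdict = "SIGNAL HARMFUL"   # alpha improved without the signal
    return delta, verdict


if __name__ == "__main__":
    vanilla = [6.72, -10.77, -5.01]  # per-window vanilla alpha, pp
    runs = {
        "lanes_off": [6.65, -9.84, -5.63],
        "no-sell": [3.30, -9.96, -5.78],
        "picks_only": [9.51, -10.46, -0.65],
    }
    for name, ablated in runs.items():
        delta, verdict = ablation_verdict(vanilla, ablated)
        print(f"{name}: {delta:+.2f}pp {verdict}")
```

Fed with the per-window alphas from the three-window panels, this yields the same three labels as displayed; the deltas can differ from the page in the last digit because the inputs are already rounded.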
How to read it
Portfolio result: the agent runs as a paper trader with $100k starting cash; equity is marked to market each day. The dashed line is the equal-weight buy-and-hold benchmark over the same window.
α (alpha) = agent_return − benchmark_return, in percentage points. Positive = beat the basket; negative = left money on the table.
Each row of the matrix is the agent's decision (BUY / HOLD / SELL); each column is the average price move 1, 3, 7 days after.
No leakage: prices truncated to the decision day; fundamentals + news disabled (no clean historical-as-of feed).
A handful of windows is a handful of decisions — indicative only, before fees/slippage, NOT proof of edge.
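The α definition and the equal-weight buy-and-hold benchmark can be written out as below. This is a sketch under simple assumptions: the benchmark buys every ticker in equal dollar amounts on the first day and never rebalances, and the agent return is taken from the first and last points of its equity curve. Function names and the sample prices are illustrative.

```python
from typing import Dict, List


def equal_weight_buy_and_hold(prices: Dict[str, List[float]]) -> float:
    """Percent return of equal dollar positions bought on day 0 and held to the end."""
    returns = [series[-1] / series[0] - 1.0 for series in prices.values()]
    return 100.0 * sum(returns) / len(returns)


def equity_return(equity_curve: List[float]) -> float:
    """Percent return of a daily marked-to-market equity curve."""
    return 100.0 * (equity_curve[-1] / equity_curve[0] - 1.0)


def alpha_pp(agent_pct: float, bench_pct: float) -> float:
    """Alpha in percentage points: agent return minus benchmark return."""
    return agent_pct - bench_pct


if __name__ == "__main__":
    basket = {"AAA": [100.0, 110.0], "BBB": [50.0, 45.0]}  # made-up prices
    bench = equal_weight_buy_and_hold(basket)
    agent = equity_return([100_000.0, 99_140.0])  # $100k start, ends at $99,140
    print(f"agent {agent:+.2f}%  bench {bench:+.2f}%  alpha {alpha_pp(agent, bench):+.2f}pp")
```

As a check against the page, an agent return of -0.86% with a benchmark of -7.59% gives an α of about +6.73pp.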