Paper Review Wall II🧱|May 23
📰 Quant/Prediction/Macro/Algorithm Paper Review Wall — May 2026 Edition
A curated set of recent, practically oriented arXiv papers.
Each entry includes the core method, key findings, a practical takeaway, and a direct link:
1️⃣ Sequential Structure in Intraday Futures Data: LSTM vs Gradient Boosting on MNQ
Author Mathias Mesfin|Published arXiv 2026-05-18|arXiv: 2605.17724
🔗 https://arxiv.org/abs/2605.17724
📋 Core Method
Compares gradient boosting and LSTM architectures on 5-minute OHLCV bars of Micro E-Mini Nasdaq 100 futures (MNQ) for intraday directional prediction. 944 trading days from 2021–2025, evaluated under strict expanding-window walk-forward across three OOS periods with permutation testing. Target: whether session close exceeds 10:30 AM open by more than ten points.
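To make the evaluation scheme concrete, here is a minimal sketch of expanding-window walk-forward splitting. The 944-day count and the three OOS periods come from the summary above; the `min_train=500` starting window and the equal-sized test blocks are assumptions for illustration, not the paper's actual split dates.

```python
from typing import Iterator, Tuple


def expanding_window_splits(
    n_days: int, n_folds: int, min_train: int
) -> Iterator[Tuple[range, range]]:
    """Yield (train, test) index ranges.

    The training window always starts at day 0 and grows each fold;
    the test window is the block of days immediately after it.
    """
    test_size = (n_days - min_train) // n_folds
    for k in range(n_folds):
        train_end = min_train + k * test_size
        # The last fold absorbs any leftover days.
        test_end = n_days if k == n_folds - 1 else train_end + test_size
        yield range(0, train_end), range(train_end, test_end)


# min_train=500 is a hypothetical choice, not taken from the paper.
splits = list(expanding_window_splits(n_days=944, n_folds=3, min_train=500))
for train, test in splits:
    print(f"train days 0..{train.stop - 1}, test days {test.start}..{test.stop - 1}")
```

The property that matters is that every test block lies strictly after its training window, so no future bar leaks into fitting.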
🎯 Key Findings
• No configuration produces statistically significant accuracy above the 51.8% base rate
• Gradient boosting OOS accuracies: 50.00%–50.89%; LSTM: 50.59%
• Best GB permutation p=0.135; LSTM p=0.515 — neither significant
• Feature importance instability across walk-forward folds suggests noise fitting, not stable signal capture
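For readers who want to run the same kind of significance check, a label-permutation test on accuracy can be sketched as below. This is a generic version under assumed settings (`n_perm=2000`, full-sample label shuffling); the paper's exact permutation scheme is not described here.

```python
import numpy as np


def permutation_p_value(y_true, y_pred, n_perm=2000, seed=0):
    """One-sided permutation test: is accuracy higher than chance alignment?

    Shuffling the labels breaks any link between predictions and outcomes
    while keeping the class balance (the base rate) intact.
    """
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    observed = np.mean(y_true == y_pred)
    null = np.empty(n_perm)
    for i in range(n_perm):
        null[i] = np.mean(rng.permutation(y_true) == y_pred)
    # The +1 correction keeps the p-value strictly positive.
    p_value = (1 + np.sum(null >= observed)) / (1 + n_perm)
    return float(observed), float(p_value)


# Synthetic example: predictions that are pure noise relative to the labels.
rng = np.random.default_rng(1)
labels = rng.integers(0, 2, size=300)
noise_preds = rng.integers(0, 2, size=300)
acc, p = permutation_p_value(labels, noise_preds)
print(f"accuracy={acc:.3f}, permutation p={p:.3f}")
```

Because the null distribution is built from the same label mix, the test automatically accounts for an unbalanced base rate such as the 51.8% quoted above.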
💡 Practical Takeaway
A clean negative result: in this study, roughly four years (944 trading days) of single-instrument 5-minute OHLCV data proved insufficient for sequential ML-based intraday forecasting (including Kronos-inspired foundation-model architectures). It provides an explicit empirical lower bound on data-scale requirements. Before training your next sequential model on a small intraday dataset, check whether you’re walking into the same trap.
2️⃣ Deep Reinforcement Learning Framework for Diversified Portfolio Management Across Global Equity Markets
Authors Kamil Kashif, Robert Ślepaczuk (University of Warsaw)|Published arXiv 2026-05-17|arXiv: 2605.17307
🔗 https://arxiv.org/abs/2605.17307
📋 Core Method
Soft Actor-Critic (SAC) learning continuous portfolio weights within an MDP, with transaction costs, turnover penalties, and diversification constraints in the reward function. Five configurations compared across reward formulation × policy structure (flat vs hierarchical Dirichlet) × constraints × temporal encoder (LSTM vs Transformer). Walk-forward optimization across 16 OOS folds spanning 2003–2026 on Nasdaq-100, Nikkei 225, and Euro Stoxx 50.
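A reward of this shape can be sketched in a few lines. The functional form and every coefficient below (`cost_rate`, `turnover_penalty`, `concentration_penalty`, and the Herfindahl index as the diversification term) are assumptions chosen for illustration; the paper's actual reward formulations may differ.

```python
import numpy as np


def portfolio_reward(
    w_new,
    w_old,
    asset_returns,
    cost_rate=0.001,             # hypothetical proportional transaction cost
    turnover_penalty=0.01,       # hypothetical extra penalty on trading
    concentration_penalty=0.05,  # hypothetical penalty on concentration
):
    """One-step reward: portfolio return minus trading and concentration costs."""
    w_new = np.asarray(w_new, dtype=float)
    w_old = np.asarray(w_old, dtype=float)
    r = np.asarray(asset_returns, dtype=float)

    gross_return = float(w_new @ r)
    turnover = float(np.abs(w_new - w_old).sum())
    # Herfindahl index: 1/n for equal weights, 1.0 for a single-asset portfolio.
    hhi = float(np.sum(w_new ** 2))

    return (
        gross_return
        - cost_rate * turnover
        - turnover_penalty * turnover
        - concentration_penalty * hhi
    )


equal = np.full(4, 0.25)
returns = np.array([0.01, 0.02, -0.01, 0.0])
reward = portfolio_reward(equal, equal, returns)
print(f"reward for holding equal weights: {reward:.4f}")
```

Folding costs into the reward, rather than subtracting them after the fact, is what pushes the learned policy toward lower turnover.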
🎯 Key Findings
• RL strategies achieve competitive risk-adjusted performance only in Euro Stoxx 50 (statistically significant abnormal returns)
• No strategy achieves significant excess returns vs Buy-and-Hold under HAC-robust inference across all markets
• Regime analysis: RL adds most value during elevated uncertainty
• Ensemble aggregation across markets improves risk-adjusted performance, confirming geographic diversification
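HAC-robust inference is what separates this paper from Sharpe-only reporting, so here is a minimal sketch of a Newey-West t-statistic for the mean excess return over a benchmark. Newey-West with a Bartlett kernel is one common HAC estimator; the paper's specific estimator and lag choice are not stated here, and `lags=5` and the synthetic series are assumptions.

```python
import numpy as np


def newey_west_tstat(excess_returns, lags=5):
    """t-statistic of the mean using a Newey-West (Bartlett kernel) variance.

    Autocorrelated returns make the naive standard error too small;
    the weighted autocovariance terms correct for that.
    """
    x = np.asarray(excess_returns, dtype=float)
    n = x.size
    u = x - x.mean()
    long_run_var = np.sum(u * u) / n
    for k in range(1, lags + 1):
        weight = 1.0 - k / (lags + 1)          # Bartlett weights
        gamma_k = np.sum(u[k:] * u[:-k]) / n   # lag-k autocovariance
        long_run_var += 2.0 * weight * gamma_k
    standard_error = np.sqrt(long_run_var / n)
    return float(x.mean() / standard_error)


# Synthetic daily strategy-minus-benchmark returns (illustrative only).
rng = np.random.default_rng(0)
excess = rng.normal(loc=0.0002, scale=0.01, size=1000)
print(f"HAC t-stat: {newey_west_tstat(excess):.2f}")
```

With `lags=0` this collapses to the ordinary t-statistic, which is a quick sanity check on any implementation.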
💡 Practical Takeaway
An honest DRL portfolio study: under rigorous HAC inference, it doesn’t beat buy-and-hold. Far more credible than the typical “report Sharpe and move on” DRL paper. The SAC + Dirichlet policy still shows marginal edge in EU markets and high-vol regimes — useful as a regime-conditional framework prototype rather than a standalone alpha source.
3️⃣ The Anatomy of a Decentralized Prediction Market: Microstructure Evidence from the Polymarket Order Book
Author Philipp D. Dubach (Zurich)|Published arXiv 2026-04-27 (v2 2026-05-14)|arXiv: 2604.24366
🔗 https://arxiv.org/abs/2604.24366
📋 Core Method
A tick-level archive of Polymarket’s public WebSocket order-book feed over 52 days (30 billion events), joined to the authoritative on-chain OrderFilled trade records for ground-truth trade direction. Uses a pre-registered stratified panel of 600 markets.
🎯 Key Findings (8 stylized facts)
• Longshot spread premium: tail-probability contracts have wider effective spreads
• Depth concentration profile closer to uniform geometric grid than the top-of-book pattern often assumed
• Maker-wallet diversity broad but with a concentrated tail
• Median archive-ingestion delay <50ms with multi-second tail
• Self-counterparty wash share: median 1%, 22% upper tail (well below 25–70% benchmark for 2023 unregulated crypto venues)
• Depth decays near resolution (within-category slope 0.55 on log seconds-to-close, t=3.85)
• Trade direction inferred from the feed agrees with on-chain ground truth only ~59% of the time (panel mean 0.615, CI [0.58, 0.65]), barely above the 50% baseline and far below the ~80% that Lee-Ready achieves on equities
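To see what "agreement with ground truth" means operationally, here is a small sketch of the Lee-Ready rule (quote rule with a tick-rule fallback at the midpoint) plus an agreement-rate check. Lee-Ready is the equity benchmark named above; whether the paper's feed-level inference uses exactly this rule is an assumption, and all prices and labels below are made-up toy values in integer cents.

```python
import numpy as np


def lee_ready_sign(prices, bids, asks):
    """Classify each trade as buyer-initiated (+1) or seller-initiated (-1).

    Quote rule: above the midpoint is a buy, below is a sell.
    Tick rule fallback: at the midpoint, use the last price change.
    Prices are integer ticks, so comparisons are exact.
    """
    signs = np.zeros(len(prices), dtype=int)
    last_tick = 0  # direction of the most recent non-zero price change
    for i in range(len(prices)):
        if i > 0 and prices[i] != prices[i - 1]:
            last_tick = 1 if prices[i] > prices[i - 1] else -1
        twice_mid = bids[i] + asks[i]  # compare 2*price to bid+ask, no division
        if 2 * prices[i] > twice_mid:
            signs[i] = 1
        elif 2 * prices[i] < twice_mid:
            signs[i] = -1
        else:
            signs[i] = last_tick
    return signs


def agreement_rate(inferred, truth):
    """Share of trades where the inferred sign matches the ground truth."""
    return float(np.mean(np.asarray(inferred) == np.asarray(truth)))


# Toy data in cents; truth stands in for on-chain OrderFilled direction.
prices = [52, 50, 51, 51]
bids = [49, 49, 50, 50]
asks = [51, 51, 52, 52]
truth = [1, 1, 1, -1]

inferred = lee_ready_sign(prices, bids, asks)
print(inferred, agreement_rate(inferred, truth))
```

Running the same comparison per market against on-chain fills is a quick way to find out whether a feed-derived signed-flow signal is usable at all.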
💡 Practical Takeaway
For anyone running Polymarket / prediction-market algos: do not use feed-level trade direction inference. You must source direction from on-chain OrderFilled events. Otherwise the effective half-spread flips sign on 67% of markets and Kyle’s lambda flips on 60% — your microstructure signals are noise. Replication package: https://github.com/philippdubach/polymarket-microstructure
4️⃣ ForesightFlow: An Information Leakage Score Framework for Prediction Markets
Author Maksym Nechepurenko (Devnull FZCO)|Published arXiv 2026-05-01|arXiv: 2605.00493
🔗 https://arxiv.org/abs/2605.00493
📋 Core Method
A real-time detection framework for informed trading on Polymarket. Core construct: Information Leakage Score (ILS) — quantifies, for any resolved binary market, the fraction of the total information move that occurred before the first public news mention. Three components: (1) formal score backed by the Murphy decomposition of the Brier score; (2) resolution typology (event-resolved / deadline-resolved / unclassifiable); (3) classical microstructure measures (PIN, VPIN, Kyle’s lambda) adapted to bounded [0,1] binary markets.
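One plausible reading of the ILS definition above is a simple ratio: the price move completed before the first public news mention, divided by the total move from the starting price to the resolved outcome. The sketch below follows that reading only. The function name, arguments, and the choice of starting price are assumptions, not the paper's formal specification.

```python
def information_leakage_score(p_start: float, p_at_news: float, outcome: int) -> float:
    """Fraction of the total information move that happened before the news.

    p_start   : market price at the start of the observation window (assumed)
    p_at_news : market price just before the first public news mention
    outcome   : resolved value of the binary market, 0 or 1
    """
    total_move = outcome - p_start
    if total_move == 0:
        raise ValueError("no information move: start price equals the outcome")
    return (p_at_news - p_start) / total_move


# Hypothetical market: opens at 10 cents, trades at 55 cents before any news,
# then resolves YES. About half of the total move happened pre-news.
ils = information_leakage_score(p_start=0.10, p_at_news=0.55, outcome=1)
print(f"ILS = {ils:.2f}")
```

Under this reading, a score near 0 means the price only moved after the news, a score near 1 means the market had already priced the outcome beforehand, and a negative score means the pre-news move went the wrong way.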
🎯 Key Findings
• Documented informed-trading profits on Polymarket 2024–2026: ~$143M aggregate anomalous profit (Mitts & Ofir 2026 estimate); $40M arbitrage profit (Saguillo et al. 2025)
• In hours before the February 2026 U.S.–Israeli strike on Iran, 6 newly-created wallets bought YES at prices as low as 10¢, realizing ~$1.2M when the market resolved hours later
• Parallel cases: December 2025 Google “Year in Search,” Taylor Swift engagement, Venezuela operation, several OpenAI product launches
• Structural finding: documented insider cases are systematically deadline-resolved (“Will X happen by Y?”), motivating a deadline-ILS extension
💡 Practical Takeaway
Directly relevant for Polymarket alpha / risk monitoring. Provides an information-theoretic baseline for real-time (not post-hoc) leakage detection; the FFIC validation set is open-sourced as a community benchmark for insider-trading detectors. Also debunks the intuition that volume is a proxy for insider activity — the October 2024 Iran-strike case had peak volume of only ~$148K but is among the most clearly documented. Code: https://github.com/ForesightFlow
5️⃣ Large Language Model Agent in Financial Trading: A Survey
Authors Han Ding, Yinheng Li, Junhao Wang, Hang Chen, Doudou Guo, Yunbai Zhang (Columbia / NYU)|Published arXiv 2024-07, v2 updated 2026-03-01|arXiv: 2408.06361
🔗 https://arxiv.org/abs/2408.06361
📋 Core Method
Systematic taxonomy of LLM trading agents into four sub-types:
• News-Driven: stock-level news + macro updates injected into prompt context, LLM predicts next-day direction (LLMFactor, MarketSenseAI)
• Reflection-Driven: FinAgent, FinMem — embedded memory and reflection modules
• Debate-Driven: TradingGPT, HAD — multi-agent argumentation for decisions
• RL-Driven: SEP — RL with memorization + reflection refines LLM predictions
Also distinguishes LLM-as-Alpha-Miner (QuantAgent, AlphaGPT) vs LLM-as-Trader.
🎯 Key Findings
• Significant performance variation across agent architectures, but no fair cross-architecture benchmark exists
• Major challenges: look-ahead bias (pretraining data may leak future info), backtest reliability, prompt sensitivity
• LLM traders perform better in sentiment-heavy environments; remain fragile on quantitative numerical tasks
💡 Practical Takeaway
A required-reading map for anyone building LLM-based trading systems. Three takeaways help avoid common pitfalls: (1) never trust a backtest without checking that its test period falls after the model’s pretraining cutoff; (2) reflection / RL feedback loops help mitigate alpha decay; (3) multi-agent debate isn’t automatically better than single-agent, because prompt design dominates. The survey also works as a checklist for a fast literature scan.
6️⃣ ASRI: An Aggregated Systemic Risk Index for Cryptocurrency Markets
Authors Murad Farzulla, Andrew Maksakov (KCL / Dissensus AI)|Published arXiv 2026-02-04|arXiv: 2602.03874
🔗 https://arxiv.org/abs/2602.03874
📋 Core Method
The first composite systemic-risk index targeting DeFi-TradFi interconnection. Four weighted sub-indices:
• Stablecoin Concentration Risk (30%)
• DeFi Liquidity Risk (25%)
• Contagion Risk (25%)
• Regulatory Opacity Risk (20%)
Data: DeFi Llama, Federal Reserve FRED, on-chain analytics. Validated against four historical crises: Terra/Luna, Celsius/3AC, FTX, SVB.
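The four weights above are enough to sketch the aggregation step for a dashboard. This assumes the composite is a plain weighted sum of sub-indices already normalized to a common 0-100 scale; the key names and sample readings are hypothetical, and the paper's normalization of each sub-index is not reproduced here.

```python
# Weights as listed above; dictionary keys are hypothetical names.
ASRI_WEIGHTS = {
    "stablecoin_concentration": 0.30,
    "defi_liquidity": 0.25,
    "contagion": 0.25,
    "regulatory_opacity": 0.20,
}


def asri(sub_indices: dict) -> float:
    """Weighted sum of the four sub-indices (assumed to share a 0-100 scale)."""
    missing = set(ASRI_WEIGHTS) - set(sub_indices)
    if missing:
        raise KeyError(f"missing sub-indices: {sorted(missing)}")
    return sum(ASRI_WEIGHTS[name] * sub_indices[name] for name in ASRI_WEIGHTS)


# Made-up readings for illustration only.
sample = {
    "stablecoin_concentration": 60.0,
    "defi_liquidity": 40.0,
    "contagion": 50.0,
    "regulatory_opacity": 30.0,
}
score = asri(sample)
print(f"ASRI = {score:.1f}")
```

Keeping the weights in one named table makes it easy to rerun the index with alternative weightings as a sensitivity check.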
🎯 Key Findings
• Event study detects significant abnormal signals for all four crises (t-stats 5.47–32.64, p<0.01)
• Threshold-based detection achieves 30-day average lead time for 3 of 4 events
• Walk-forward validation: 4/4 OOS detection, 18-day average lead
• Chow structural-stability test p=0.993
• vs Diebold-Yilmaz connectedness: equivalent detection rate (75%) but higher precision (33.5% vs 22.4%)
• 2024–2025 OOS specificity testing: zero false positives; correctly classified $1.5B Bybit hack as non-systemic
💡 Practical Takeaway
Highly practical for crypto-macro / TradFi-DeFi joint risk monitoring. Transparent and reproducible weights — pluggable into a risk dashboard. SRISK and CoVaR can’t accommodate composability risk, flash-loan exposure, or tokenized RWA linkages, so they materially underperform ASRI in DeFi settings. Useful as a macro overlay signal for crypto allocation decisions.
7️⃣ Statistical Benchmarking of Transformer Models in Low Signal-to-Noise Time-Series Forecasting
Authors Cyril Garcia, Guillaume Remy|Published arXiv 2026-02|arXiv: 2602.09869
🔗 https://arxiv.org/abs/2602.09869
📋 Core Method
Targets the low-data financial regime — only a few years of daily observations. Uses synthetic processes with known temporal + cross-sectional dependency structure and varying signal-to-noise ratios, with bootstrapped experiments enabling direct OOS correlation evaluation against the ground-truth optimal predictor.
🎯 Key Findings
• Two-way attention transformers (alternating temporal and cross-sectional self-attention) outperform Lasso, gradient boosting, and MLP baselines across a wide SNR range, including low-SNR regimes
• Standard single-direction transformers do not beat baselines in low SNR
• In low-data regimes, cross-sectional attention is the critical ingredient — models that ignore cross-sectional information are necessarily weaker
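The two-way idea is easiest to see in tensor shapes. The NumPy sketch below alternates one temporal and one cross-sectional self-attention pass over a `(time, assets, features)` array. It is a single-head toy with random weights and residual connections only; the causal mask on the temporal pass, the dimensions, and the absence of layer norm and MLP sublayers are simplifying assumptions, not the paper's architecture.

```python
import numpy as np


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(x, wq, wk, wv, causal=False):
    """Single-head attention along axis 1 of x with shape (batch, seq, d)."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    if causal:
        seq = x.shape[1]
        future = np.triu(np.ones((seq, seq), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)  # hide future positions
    return softmax(scores) @ v


def two_way_block(x, params):
    """x has shape (T, N, D): time steps, assets, features."""
    # Temporal pass: each asset attends over its own history.
    xt = x.transpose(1, 0, 2)                      # (N, T, D)
    xt = xt + self_attention(xt, *params["temporal"], causal=True)
    x = xt.transpose(1, 0, 2)                      # back to (T, N, D)
    # Cross-sectional pass: at each time step, assets attend to each other.
    return x + self_attention(x, *params["cross"])


rng = np.random.default_rng(0)
T, N, D = 6, 3, 4  # toy sizes
params = {
    "temporal": tuple(rng.normal(scale=0.1, size=(D, D)) for _ in range(3)),
    "cross": tuple(rng.normal(scale=0.1, size=(D, D)) for _ in range(3)),
}
x = rng.normal(size=(T, N, D))
out = two_way_block(x, params)
print(out.shape)
```

The only difference between the two passes is which axis plays the role of the sequence, so an existing temporal transformer can be extended by transposing the input between layers.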
💡 Practical Takeaway
A clear architectural prescription for quant ML teams: in these synthetic low-SNR benchmarks, a transformer that attends only along the time dimension shows no edge over the baselines. Two-way attention (alternating temporal and cross-sectional) is the variant that does outperform, which makes it a promising architecture for multivariate equity / futures forecasting. Lasso and boosting baselines remain hard to beat, so transformers need cross-sectional attention to justify themselves.
8️⃣ Stablecoins as Dry Powder: A Copula-Based Risk Analysis of Cryptocurrency Markets
Authors Elliot Jones, Toshiko Matsui, William Knottenbelt (Imperial College London)|Published arXiv 2026-03|arXiv: 2603.23480
🔗 https://arxiv.org/abs/2603.23480
📋 Core Method
Uses copula-based methods to quantify transmission of volatility and activity from stablecoin to risk-crypto markets. Causality tested at daily / weekly / monthly horizons; distinguishes normal-regime vs tail-regime dependence structure.
🎯 Key Findings
• In-sample causality from stablecoin → BTC/ETH demonstrated across all three horizons
• Stablecoin supply / activity changes lead risk-asset price changes — consistent with the “dry powder” intuition
• Tail dependence (upper and lower copula tails) is significantly stronger than linear correlation suggests — in crisis episodes, stablecoin flows amplify rather than buffer
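Tail dependence can be eyeballed without fitting a parametric copula. The sketch below computes empirical lower and upper tail-dependence coefficients from rank-transformed data. It is a nonparametric stand-in for the paper's copula models, and the threshold `q=0.05` and the synthetic series are assumptions.

```python
import numpy as np


def rank_uniform(x):
    """Map observations to pseudo-uniform values in (0, 1) via their ranks."""
    x = np.asarray(x)
    ranks = np.argsort(np.argsort(x)) + 1
    return ranks / (len(x) + 1)


def empirical_tail_dependence(x, y, q=0.05):
    """Estimate P(V in tail | U in tail) for the lower and upper q-tails."""
    u, v = rank_uniform(x), rank_uniform(y)
    lower = np.mean((u <= q) & (v <= q)) / q
    upper = np.mean((u >= 1 - q) & (v >= 1 - q)) / q
    return float(lower), float(upper)


rng = np.random.default_rng(0)
a = rng.normal(size=1000)
b = rng.normal(size=1000)                  # independent of a
print("independent:", empirical_tail_dependence(a, b))
print("comonotone: ", empirical_tail_dependence(a, 2 * a + 1))
```

For independent series the estimate sits near `q`, while values close to 1 mean extremes almost always arrive together. Two series with the same Pearson correlation can differ sharply on this measure.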
💡 Practical Takeaway
Highly useful for crypto-macro / cross-asset traders. Stablecoin issuance / minting / redemption is an underrated leading indicator. The copula framework plugs directly into a risk dashboard and captures tail co-movement better than Pearson correlation. Combined with ASRI (paper #6), forms a complete DeFi macro monitoring suite.
9️⃣ Prediction Arena: Benchmarking AI Models on Real-World Prediction Markets
Authors Jaden Zhang, Gardenia Liu, Oliver Johansson, Hileamlak Yitayew, Kamryn Ohly, Grace Li (Arcada Labs / Harvard)|Published arXiv 2026-04|arXiv: 2604.07355
🔗 https://arxiv.org/abs/2604.07355
📋 Core Method
Six frontier LLMs each given a real $10,000 Kalshi account + concurrent Polymarket positions, running autonomously from 2026-01-12 to 2026-03-09 (57 days). Each model independently executed research → trade decisions → settlement. Accumulated 5,444 historical snapshots, 2,916 trades, 700 settled markets.
🎯 Key Findings
• Kalshi Cohort 1 final returns all negative (-16.0% to -30.8%)
• Striking cross-platform contrast: same models averaged -1.1% on Polymarket vs -22.6% on Kalshi
• grok-4-20-checkpoint achieved 71.4% settlement win rate on Polymarket — highest across any platform or cohort
• Initial prediction accuracy + ability to capitalize on correct predictions are the main drivers
• Research volume shows no correlation with outcomes — more research ≠ more profit
💡 Practical Takeaway
For engineers building AI agents for prediction markets, this is the most rigorous live benchmark to date. Core signal: prediction accuracy only matters when it translates into sizing/exit decisions — research depth alone is worthless. The Kalshi vs Polymarket performance gap shows platform-specific market structure (spread, liquidity, fees) materially affects agent performance. Any “universal AI trader” claim across venues should be re-validated platform by platform.
🔑 Synthesis (May 2026 Research Themes)
1. Negative results are gaining respect: paper #1 (intraday ML finds no significant signal in four years of MNQ bars), paper #2 (DRL fails to significantly beat buy-and-hold under HAC inference), and ForesightFlow’s pilot negative findings. Under tighter peer-review pressure, robust statistical testing is winning over flashy backtests.
2. Polymarket microstructure has become a new frontier: papers #3, #4, and #9 all center on it. On-chain data + insider-trading detection + AI-agent benchmarking are converging as the new research infrastructure — directly relevant for any team running prediction-market arbitrage.
3. Two-way attention is the direction for transformers in finance: paper #7 provides quantitative evidence, on synthetic low-SNR processes built to mimic financial data, that single-axis self-attention has no edge over the baselines, while alternating temporal and cross-sectional attention does.
4. Stablecoin systemic-risk tooling is maturing: papers #6 and #8 together provide a full stack — from macro index to cross-asset copula — with post-hoc validation on SVB 2023 and Terra 2022.
5. Key takeaway from the LLM-trader survey (#5): multi-agent debate is not automatically better than a single agent with reflection, and backtest rigor matters far more than architectural complexity.



