Paper Review Wall II🧱｜May 23

📰 Quant/Prediction/Macro/Algorithm Paper Review Wall — May 2026 Edition

Tiger Capital Research

May 23, 2026

A curated selection of recent, practically relevant arXiv papers.

Each entry includes the core method, key findings, practical takeaways, and a direct link:

1️⃣ Sequential Structure in Intraday Futures Data: LSTM vs Gradient Boosting on MNQ

Author Mathias Mesfin｜Published arXiv 2026-05-18｜arXiv: 2605.17724

🔗 https://arxiv.org/abs/2605.17724

📋 Core Method
Compares gradient boosting and LSTM architectures on 5-minute OHLCV bars of Micro E-Mini Nasdaq 100 futures (MNQ) for intraday directional prediction. 944 trading days from 2021–2025, evaluated under strict expanding-window walk-forward across three OOS periods with permutation testing. Target: whether session close exceeds 10:30 AM open by more than ten points.
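As a rough illustration of the expanding-window walk-forward protocol described above, the sketch below splits 944 trading days into three out-of-sample periods. The function name `expanding_window_folds` and the 500-day initial training window are assumptions made for this example; the paper's actual fold boundaries are not given in this summary.

```python
def expanding_window_folds(n_days, n_folds, min_train):
    """Yield (train, test) index ranges for an expanding-window walk-forward.

    Training always starts at day 0 and grows by one test block per fold,
    so every test block lies strictly after the data used to fit the model.
    """
    test_size = (n_days - min_train) // n_folds
    for k in range(n_folds):
        train_end = min_train + k * test_size
        # The last fold absorbs any leftover days.
        test_end = n_days if k == n_folds - 1 else train_end + test_size
        yield range(0, train_end), range(train_end, test_end)


# 944 days and 3 OOS periods come from the summary; min_train=500 is hypothetical.
folds = list(expanding_window_folds(n_days=944, n_folds=3, min_train=500))
for train, test in folds:
    print(f"train days [0, {train.stop})  test days [{test.start}, {test.stop})")
```

The key property is that no test day ever precedes a training day, which is what separates walk-forward evaluation from a shuffled train/test split.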

🎯 Key Findings

• No configuration produces statistically significant accuracy above the 51.8% base rate
• Gradient boosting OOS accuracies: 50.00%–50.89%; LSTM: 50.59%
• Best GB permutation p=0.135; LSTM p=0.515 — neither significant
• Feature importance instability across walk-forward folds suggests noise fitting, not stable signal capture
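To make the permutation p-values above concrete, here is a minimal sketch of a label-permutation test for classification accuracy. It runs on synthetic labels, not the paper's data; the 900-sample size, the 2,000 permutations, and the "skilled" model with roughly 60% accuracy are all assumptions chosen for illustration.

```python
import numpy as np


def permutation_p_value(y_true, y_pred, n_perm=2000, seed=0):
    """One-sided permutation p-value for classification accuracy.

    Shuffling the labels breaks any link between predictions and outcomes
    while preserving class balance, so the null distribution reflects
    the base rate automatically.
    """
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    observed = np.mean(y_true == y_pred)
    null = np.empty(n_perm)
    for i in range(n_perm):
        null[i] = np.mean(rng.permutation(y_true) == y_pred)
    # Add-one correction keeps the p-value strictly positive.
    p = (1 + np.sum(null >= observed)) / (n_perm + 1)
    return observed, p


rng = np.random.default_rng(42)
y = (rng.random(900) < 0.518).astype(int)   # synthetic labels near a 51.8% base rate
noise_pred = rng.integers(0, 2, size=900)   # predictions unrelated to the labels
acc_noise, p_noise = permutation_p_value(y, noise_pred)

# A synthetic model that flips the true label about 40% of the time.
flip = rng.random(900) < 0.40
skilled_pred = np.where(flip, 1 - y, y)
acc_skill, p_skill = permutation_p_value(y, skilled_pred)

print(f"noise model:   acc={acc_noise:.3f}  p={p_noise:.3f}")
print(f"skilled model: acc={acc_skill:.3f}  p={p_skill:.4f}")
```

A p-value like the paper's 0.135 means that about one shuffled labeling in seven scores at least as well as the real model, which is why the result is read as indistinguishable from noise.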

💡 Practical Takeaway
A clean negative result — four years of single-instrument 5-minute OHLCV data is empirically insufficient for sequential ML-based intraday forecasting (including Kronos-inspired foundation-model architectures). Provides an explicit empirical lower bound on data scale requirements. Before training your next sequential model on a small intraday dataset, check whether you’re walking into the same trap.

2️⃣ Deep Reinforcement Learning Framework for Diversified Portfolio Management Across Global Equity Markets

Authors Kamil Kashif, Robert Ślepaczuk (University of Warsaw)｜Published arXiv 2026-05-17｜arXiv: 2605.17307

🔗 https://arxiv.org/abs/2605.17307

📋 Core Method
Soft Actor-Critic (SAC) learning continuous portfolio weights within an MDP, with transaction costs, turnover penalties, and diversification constraints in the reward function. Five configurations compared across reward formulation × policy structure (flat vs hierarchical Dirichlet) × constraints × temporal encoder (LSTM vs Transformer). Walk-forward optimization across 16 OOS folds spanning 2003–2026 on Nasdaq-100, Nikkei 225, and Euro Stoxx 50.

🎯 Key Findings

• RL strategies achieve competitive risk-adjusted performance only in Euro Stoxx 50 (statistically significant abnormal returns)
• No strategy achieves significant excess returns vs Buy-and-Hold under HAC-robust inference across all markets
• Regime analysis: RL adds most value during elevated uncertainty
• Ensemble aggregation across markets improves risk-adjusted performance, confirming geographic diversification

💡 Practical Takeaway
An honest DRL portfolio study: under rigorous HAC inference, it doesn’t beat buy-and-hold. Far more credible than the typical “report Sharpe and move on” DRL paper. The SAC + Dirichlet policy still shows marginal edge in EU markets and high-vol regimes — useful as a regime-conditional framework prototype rather than a standalone alpha source.
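To show why HAC-robust inference is a tougher bar than a naive t-test, the sketch below computes a Newey-West (Bartlett kernel) t-statistic for a mean excess return in plain NumPy. It is not the authors' code: the AR(1) return series, the 10-lag truncation, and the function name `newey_west_tstat` are assumptions for illustration.

```python
import numpy as np


def newey_west_tstat(x, lags):
    """t-statistic for mean(x) == 0 using a Bartlett-kernel HAC variance."""
    x = np.asarray(x, dtype=float)
    n = x.size
    d = x - x.mean()
    # Long-run variance: gamma_0 + 2 * sum_k w_k * gamma_k
    lrv = d @ d / n
    for k in range(1, lags + 1):
        w = 1.0 - k / (lags + 1)          # Bartlett weight
        lrv += 2.0 * w * (d[k:] @ d[:-k]) / n
    return x.mean() / np.sqrt(lrv / n)


# Synthetic daily excess returns with positive autocorrelation (AR(1), phi = 0.5).
rng = np.random.default_rng(0)
eps = rng.normal(0.0, 0.01, 2000)
x = np.empty(2000)
x[0] = eps[0]
for t in range(1, 2000):
    x[t] = 0.5 * x[t - 1] + eps[t]
x += 0.0005                               # small hypothetical drift

t_naive = newey_west_tstat(x, lags=0)     # ignores autocorrelation
t_hac = newey_west_tstat(x, lags=10)      # accounts for it
print(f"naive t = {t_naive:.2f}, HAC t = {t_hac:.2f}")
```

With positively autocorrelated returns the HAC standard error is larger, so the t-statistic shrinks. A strategy that looks significant under the naive test can lose significance once serial dependence is accounted for.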

3️⃣ The Anatomy of a Decentralized Prediction Market: Microstructure Evidence from the Polymarket Order Book

Author Philipp D. Dubach (Zurich)｜Published arXiv 2026-04-27 (v2 2026-05-14)｜arXiv: 2604.24366

🔗 https://arxiv.org/abs/2604.24366

📋 Core Method
Tick-level archive of Polymarket’s public WebSocket order-book feed over 52 days (30 billion events), joined to authoritative on-chain OrderFilled trade record for ground-truth direction. Pre-registered stratified panel of 600 markets.

🎯 Key Findings (selected from the paper’s 8 stylized facts)

• Longshot spread premium: tail-probability contracts have wider effective spreads
• Depth concentration profile closer to uniform geometric grid than the top-of-book pattern often assumed
• Maker-wallet diversity broad but with a concentrated tail
• Median archive-ingestion delay <50ms with multi-second tail
• Self-counterparty wash share: median 1%, 22% upper tail (well below 25–70% benchmark for 2023 unregulated crypto venues)
• Depth decays near resolution (within-category slope 0.55 on log seconds-to-close, t=3.85)
• Trade direction inferred from the feed agrees with on-chain ground truth only ~59% of the time (panel mean 0.615, CI [0.58, 0.65]), barely above the 50% baseline and far below the ~80% that Lee-Ready achieves on equities

💡 Practical Takeaway
For anyone running Polymarket / prediction-market algos: do not use feed-level trade direction inference. You must source direction from on-chain OrderFilled events. Otherwise the effective half-spread flips sign on 67% of markets and Kyle’s lambda flips on 60% — your microstructure signals are noise. Replication package: https://github.com/philippdubach/polymarket-microstructure
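The sensitivity to trade direction can be illustrated with the standard effective half-spread, direction × (trade price − mid). The sketch below uses synthetic trades where every taker pays exactly one cent away from a 0.50 mid, then degrades the direction labels to a given accuracy. All numbers and the helper names are assumptions for illustration; this shows how mislabeling attenuates the measure, and it does not attempt to reproduce the paper's per-market sign flips.

```python
import numpy as np


def effective_half_spread(price, mid, direction):
    """Signed effective half-spread: +1 for buyer-initiated, -1 for seller-initiated."""
    return direction * (price - mid)


def mislabel(direction, accuracy, rng):
    """Flip each direction label with probability (1 - accuracy)."""
    wrong = rng.random(direction.size) >= accuracy
    return np.where(wrong, -direction, direction)


rng = np.random.default_rng(1)
n = 10_000
mid = np.full(n, 0.50)
true_dir = rng.choice([-1, 1], size=n)
price = mid + true_dir * 0.01             # takers trade one cent away from mid

mean_spread = {}
for acc in (1.0, 0.8, 0.6, 0.5):
    d_hat = mislabel(true_dir, acc, rng)
    mean_spread[acc] = effective_half_spread(price, mid, d_hat).mean()
    print(f"direction accuracy {acc:.0%}: mean half-spread = {mean_spread[acc]:+.4f}")
```

In this toy setup the expected measured half-spread is 0.01 × (2 × accuracy − 1), so at roughly 60% accuracy about four fifths of the true spread disappears, and on a market with few trades the estimate can easily land on the wrong side of zero.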

4️⃣ ForesightFlow: An Information Leakage Score Framework for Prediction Markets

Author Maksym Nechepurenko (Devnull FZCO)｜Published arXiv 2026-05-01｜arXiv: 2605.00493

🔗 https://arxiv.org/abs/2605.00493

📋 Core Method
A real-time detection framework for informed trading on Polymarket. Core construct: Information Leakage Score (ILS) — quantifies, for any resolved binary market, the fraction of the total information move that occurred before the first public news mention. Three components: (1) formal score backed by the Murphy decomposition of the Brier score; (2) resolution typology (event-resolved / deadline-resolved / unclassifiable); (3) classical microstructure measures (PIN, VPIN, Kyle’s lambda) adapted to bounded [0,1] binary markets.
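One simple way to read the ILS definition above is as a ratio of price moves. The sketch below is an assumed formalization, not the paper's exact formula: it divides the move from the opening price to the price at the first public news mention by the total move from the opening price to the resolved outcome (1 for YES, 0 for NO). The function and argument names are hypothetical.

```python
def information_leakage_score(p_open, p_at_news, outcome):
    """Fraction of the total price move that happened before the first news mention.

    p_open    : market price when the observation window starts (0..1)
    p_at_news : market price at the first public news mention (0..1)
    outcome   : resolved value, 1 for YES or 0 for NO
    """
    total_move = outcome - p_open
    if total_move == 0:
        raise ValueError("no information move: opening price equals the outcome")
    return (p_at_news - p_open) / total_move


# Hypothetical market: opens at 10 cents, trades at 70 cents when the news breaks,
# then resolves YES. Two thirds of the move happened before any public mention.
ils_yes = information_leakage_score(p_open=0.10, p_at_news=0.70, outcome=1)

# Mirror case for a market that resolves NO.
ils_no = information_leakage_score(p_open=0.90, p_at_news=0.30, outcome=0)
print(f"ILS (YES case) = {ils_yes:.3f}, ILS (NO case) = {ils_no:.3f}")
```

Under this reading, a score near 0 means the price only moved after the news, while a score near 1 means the market had already priced the outcome before anything was public.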

🎯 Key Findings

• Documented informed-trading profits on Polymarket 2024–2026: ~$143M aggregate anomalous profit (Mitts & Ofir 2026 estimate); $40M arbitrage profit (Saguillo et al. 2025)
• In hours before the February 2026 U.S.–Israeli strike on Iran, 6 newly-created wallets bought YES at prices as low as 10¢, realizing ~$1.2M when the market resolved hours later
• Parallel cases: December 2025 Google “Year in Search,” Taylor Swift engagement, Venezuela operation, several OpenAI product launches
• Structural finding: documented insider cases are systematically deadline-resolved (“Will X happen by Y?”), motivating a deadline-ILS extension

💡 Practical Takeaway
Directly relevant for Polymarket alpha / risk monitoring. Provides an information-theoretic baseline for real-time (not post-hoc) leakage detection; the FFIC validation set is open-sourced as a community benchmark for insider-trading detectors. Also debunks the intuition that volume is a proxy for insider activity — the October 2024 Iran-strike case had peak volume of only ~$148K but is among the most clearly documented. Code: https://github.com/ForesightFlow

5️⃣ Large Language Model Agent in Financial Trading: A Survey

Authors Han Ding, Yinheng Li, Junhao Wang, Hang Chen, Doudou Guo, Yunbai Zhang (Columbia / NYU)｜Published arXiv 2024-07, v2 updated 2026-03-01｜arXiv: 2408.06361

🔗 https://arxiv.org/abs/2408.06361

📋 Core Method
Systematic taxonomy of LLM trading agents into four sub-types:

• News-Driven: stock-level news + macro updates injected into prompt context, LLM predicts next-day direction (LLMFactor, MarketSenseAI)
• Reflection-Driven: FinAgent, FinMem — embedded memory and reflection modules
• Debate-Driven: TradingGPT, HAD — multi-agent argumentation for decisions
• RL-Driven: SEP — RL with memorization + reflection refines LLM predictions

Also distinguishes LLM-as-Alpha-Miner (QuantAgent, AlphaGPT) vs LLM-as-Trader.

🎯 Key Findings

• Significant performance variation across agent architectures, but no fair cross-architecture benchmark exists
• Major challenges: look-ahead bias (pretraining data may leak future info), backtest reliability, prompt sensitivity
• LLM traders perform better in sentiment-heavy environments; remain fragile on quantitative numerical tasks

💡 Practical Takeaway
A required-reading map for anyone building LLM-based trading systems. Key takeaways: (1) never trust a backtest without verifying the model’s pretraining cutoff; (2) reflection / RL feedback loops help mitigate alpha decay; (3) multi-agent debate isn’t automatically better than single-agent, and prompt design dominates. The survey also works as a quick literature-scan checklist.

6️⃣ ASRI: An Aggregated Systemic Risk Index for Cryptocurrency Markets

Authors Murad Farzulla, Andrew Maksakov (KCL / Dissensus AI)｜Published arXiv 2026-02-04｜arXiv: 2602.03874

🔗 https://arxiv.org/abs/2602.03874

📋 Core Method
The first composite systemic-risk index targeting DeFi-TradFi interconnection. Four weighted sub-indices:

• Stablecoin Concentration Risk (30%)
• DeFi Liquidity Risk (25%)
• Contagion Risk (25%)
• Regulatory Opacity Risk (20%)

Data: DeFi Llama, Federal Reserve FRED, on-chain analytics. Validated against four historical crises: Terra/Luna, Celsius/3AC, FTX, SVB.
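Since the weights are published, the composite can be sketched as a plain weighted sum. The example below assumes each sub-index has already been normalized to a common 0–100 scale, which is an assumption of this sketch; the dictionary keys, the function name `asri`, and the sample readings are hypothetical.

```python
# Weights as listed above; the key names are chosen for this example.
ASRI_WEIGHTS = {
    "stablecoin_concentration": 0.30,
    "defi_liquidity": 0.25,
    "contagion": 0.25,
    "regulatory_opacity": 0.20,
}


def asri(sub_indices):
    """Weighted sum of sub-indices, each assumed to be on a 0-100 scale."""
    missing = set(ASRI_WEIGHTS) - set(sub_indices)
    if missing:
        raise KeyError(f"missing sub-indices: {sorted(missing)}")
    return sum(ASRI_WEIGHTS[k] * sub_indices[k] for k in ASRI_WEIGHTS)


# Hypothetical readings for a single day.
reading = {
    "stablecoin_concentration": 80.0,
    "defi_liquidity": 40.0,
    "contagion": 60.0,
    "regulatory_opacity": 20.0,
}
score = asri(reading)
print(f"composite score = {score:.1f}")
```

Because the weights sum to one, the composite stays on the same scale as its inputs, which makes a fixed alert threshold straightforward to plug into a dashboard.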

🎯 Key Findings

• Event study detects significant abnormal signals for all four crises (t-stats 5.47–32.64, p<0.01)
• Threshold-based detection achieves 30-day average lead time for 3 of 4 events
• Walk-forward validation: 4/4 OOS detection, 18-day average lead
• Chow structural-stability test p=0.993
• vs Diebold-Yilmaz connectedness: equivalent detection rate (75%) but higher precision (33.5% vs 22.4%)
• 2024–2025 OOS specificity testing: zero false positives; correctly classified $1.5B Bybit hack as non-systemic

💡 Practical Takeaway
Highly practical for crypto-macro / TradFi-DeFi joint risk monitoring. Transparent and reproducible weights — pluggable into a risk dashboard. SRISK and CoVaR can’t accommodate composability risk, flash-loan exposure, or tokenized RWA linkages, so they materially underperform ASRI in DeFi settings. Useful as a macro overlay signal for crypto allocation decisions.

7️⃣ Statistical Benchmarking of Transformer Models in Low Signal-to-Noise Time-Series Forecasting

Authors Cyril Garcia, Guillaume Remy｜Published arXiv 2026-02｜arXiv: 2602.09869

🔗 https://arxiv.org/abs/2602.09869

📋 Core Method
Targets the low-data financial regime — only a few years of daily observations. Uses synthetic processes with known temporal + cross-sectional dependency structure and varying signal-to-noise ratios, with bootstrapped experiments enabling direct OOS correlation evaluation against the ground-truth optimal predictor.

🎯 Key Findings

• Two-way attention transformers (alternating temporal and cross-sectional self-attention) outperform Lasso, gradient boosting, and MLP baselines across a wide SNR range, including low-SNR regimes
• Standard single-direction transformers do not beat baselines in low SNR
• In low-data regimes, cross-sectional attention is the critical ingredient — models that ignore cross-sectional information are necessarily weaker

💡 Practical Takeaway
A clear architectural prescription for quant ML teams: in the paper’s synthetic benchmarks, a transformer that attends only along the time dimension shows no edge on low-SNR data. Two-way attention (alternating temporal and cross-sectional) is the most promising architecture tested for multivariate equity / futures forecasting. Lasso and boosting baselines remain hard to beat, so transformers need cross-sectional attention to justify themselves.
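For readers unfamiliar with the idea, here is a minimal NumPy sketch of one two-way attention block: temporal self-attention applied per asset, followed by cross-sectional self-attention applied per time step. It is a single-head toy with random weights and residual connections only. Layer normalization, feed-forward layers, causal masking, and training are all omitted, and none of the names or shapes come from the paper.

```python
import numpy as np


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product attention over the second-to-last axis.

    x has shape (..., L, D); attention mixes information across the L positions.
    """
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v


def two_way_block(x, params_time, params_cross):
    """x has shape (T, N, D): time steps, assets, features."""
    # Temporal attention: each asset attends over its own history.
    xt = np.swapaxes(x, 0, 1)                    # (N, T, D)
    xt = xt + self_attention(xt, *params_time)   # residual connection
    x = np.swapaxes(xt, 0, 1)                    # back to (T, N, D)
    # Cross-sectional attention: at each time step, assets attend to each other.
    return x + self_attention(x, *params_cross)


def init_params(rng, d):
    """Random query/key/value projection matrices (untrained, illustration only)."""
    return tuple(0.1 * rng.normal(size=(d, d)) for _ in range(3))


rng = np.random.default_rng(0)
T, N, D = 12, 5, 8
x = rng.normal(size=(T, N, D))
params_time, params_cross = init_params(rng, D), init_params(rng, D)
out = two_way_block(x, params_time, params_cross)
print(out.shape)
```

The cross-sectional step is what lets one asset's representation depend on the others at the same time step; a time-only transformer processes each asset in isolation.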

8️⃣ Stablecoins as Dry Powder: A Copula-Based Risk Analysis of Cryptocurrency Markets

Authors Elliot Jones, Toshiko Matsui, William Knottenbelt (Imperial College London)｜Published arXiv 2026-03｜arXiv: 2603.23480

🔗 https://arxiv.org/abs/2603.23480

📋 Core Method
Uses copula-based methods to quantify transmission of volatility and activity from stablecoin to risk-crypto markets. Causality tested at daily / weekly / monthly horizons; distinguishes normal-regime vs tail-regime dependence structure.

🎯 Key Findings

• In-sample causality from stablecoin → BTC/ETH demonstrated across all three horizons
• Stablecoin supply / activity changes lead risk-asset price changes — consistent with the “dry powder” intuition
• Tail dependence (upper and lower copula tails) is significantly stronger than linear correlation suggests — in crisis episodes, stablecoin flows amplify rather than buffer

💡 Practical Takeaway
Highly useful for crypto-macro / cross-asset traders. Stablecoin issuance / minting / redemption is an underrated leading indicator. The copula framework plugs directly into a risk dashboard and captures tail co-movement better than Pearson correlation. Combined with ASRI (paper #6), forms a complete DeFi macro monitoring suite.
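To illustrate why a copula view can reveal more than a correlation coefficient, the sketch below estimates empirical lower-tail dependence, P(U ≤ q, V ≤ q) / q, on two synthetic pairs with the same underlying correlation parameter: one Gaussian, one Student-t with 3 degrees of freedom. The choice of a t copula, the 1% threshold, and all function names are assumptions for illustration; the summary does not state which copula families the paper uses.

```python
import numpy as np


def pseudo_obs(x):
    """Rank-transform a sample to pseudo-observations in (0, 1)."""
    ranks = np.argsort(np.argsort(x)) + 1
    return ranks / (len(x) + 1)


def lower_tail_dependence(x, y, q=0.01):
    """Empirical estimate of P(U <= q, V <= q) / q."""
    u, v = pseudo_obs(x), pseudo_obs(y)
    return np.mean((u <= q) & (v <= q)) / q


def spearman(x, y):
    """Rank correlation, computed as the Pearson correlation of the ranks."""
    return np.corrcoef(pseudo_obs(x), pseudo_obs(y))[0, 1]


rng = np.random.default_rng(7)
n, rho, df = 200_000, 0.5, 3
cov = np.array([[1.0, rho], [rho, 1.0]])
z = rng.multivariate_normal(np.zeros(2), cov, size=n)
# Dividing both columns by the same chi-square draw turns the Gaussian
# pair into a Student-t pair, which has genuine tail dependence.
w = np.sqrt(rng.chisquare(df, size=n) / df)
t = z / w[:, None]

lam_gauss = lower_tail_dependence(z[:, 0], z[:, 1])
lam_t = lower_tail_dependence(t[:, 0], t[:, 1])
print(f"rank corr:  gaussian={spearman(z[:, 0], z[:, 1]):.2f}  t={spearman(t[:, 0], t[:, 1]):.2f}")
print(f"lower tail: gaussian={lam_gauss:.2f}  t={lam_t:.2f}")
```

The two pairs have similar rank correlation, yet the Student-t pair crashes together far more often. That gap is the kind of tail co-movement a correlation-only dashboard would miss.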

9️⃣ Prediction Arena: Benchmarking AI Models on Real-World Prediction Markets

Authors Jaden Zhang, Gardenia Liu, Oliver Johansson, Hileamlak Yitayew, Kamryn Ohly, Grace Li (Arcada Labs / Harvard)｜Published arXiv 2026-04｜arXiv: 2604.07355

🔗 https://arxiv.org/abs/2604.07355

📋 Core Method
Six frontier LLMs each given a real $10,000 Kalshi account + concurrent Polymarket positions, running autonomously from 2026-01-12 to 2026-03-09 (57 days). Each model independently executed research → trade decisions → settlement. Accumulated 5,444 historical snapshots, 2,916 trades, 700 settled markets.

🎯 Key Findings

• Kalshi Cohort 1 final returns all negative (-16.0% to -30.8%)
• Striking cross-platform contrast: same models averaged -1.1% on Polymarket vs -22.6% on Kalshi
• grok-4-20-checkpoint achieved 71.4% settlement win rate on Polymarket — highest across any platform or cohort
• Initial prediction accuracy + ability to capitalize on correct predictions are the main drivers
• Research volume shows no correlation with outcomes — more research ≠ more profit

💡 Practical Takeaway
For engineers building AI agents for prediction markets, this is the most rigorous live benchmark to date. Core signal: prediction accuracy only matters when it translates into sizing/exit decisions; research volume on its own did not predict outcomes. The Kalshi vs Polymarket performance gap suggests that platform-specific market structure (spread, liquidity, fees) materially affects agent performance. Any “universal AI trader” claim across venues should be re-validated platform by platform.

🔑 Synthesis (May 2026 Research Themes)

1. Negative results are gaining respect: paper #1 (sequential ML finds no significant intraday edge on four years of MNQ bars), paper #2 (DRL shows no significant excess return over buy-and-hold under HAC inference), and ForesightFlow’s pilot negative findings. Under tighter peer-review pressure, robust statistical testing is winning over flashy backtests.
2. Polymarket microstructure has become a new frontier: papers #3, #4, and #9 all center on it. On-chain data + insider-trading detection + AI-agent benchmarking are converging as the new research infrastructure — directly relevant for any team running prediction-market arbitrage.
3. Two-way attention is the direction for transformers in finance: paper #7 provides quantitative evidence that single-axis self-attention has no edge on low-SNR financial data — temporal × cross-sectional attention is required.
4. Stablecoin systemic-risk tooling is maturing: papers #6 and #8 together provide a full stack — from macro index to cross-asset copula — with post-hoc validation on SVB 2023 and Terra 2022.
5. Key takeaway from the LLM-trader survey (#5): multi-agent debate is not automatically better than a single agent with reflection, and backtest rigor matters far more than architectural complexity.
