Dual-Agent Autonomous Trading System
Technical Whitepaper v1.0 // January 2026

1. Abstract

POCKET is a dual-agent autonomous trading system that combines reinforcement learning with large language model research to trade prediction markets on Polymarket. The system operates two distinct agents: an RL Agent for high-frequency 15-minute crypto markets, and an Opus Agent for longer-term event-driven markets requiring real-world research.

Key Innovation

Cross-market state fusion: exploiting information lag between fast markets (Binance futures) and slow markets (Polymarket) through real-time multi-source data fusion.

~$50K   Training PnL
34,730  Training Trades
2,500%  Training ROI

2. System Architecture

2.1 Dual-Agent Design

The system runs two independent agents that complement each other:

Component        RL Agent                         Opus Agent
Market Type      15-min crypto binary markets     All Polymarket events
Time Horizon     15 minutes                       Hours to days
Decision Engine  PPO neural network               Claude AI (Anthropic)
Data Sources     Binance + Polymarket orderbook   Web search + market data
Trade Frequency  Multiple trades per hour         One scan every 30 minutes

2.2 Infrastructure

                         POCKET SYSTEM

┌──────────────┐   ┌──────────────┐   ┌──────────────────────┐
│   BINANCE    │   │  POLYMARKET  │   │     WEB SOURCES      │
│   FUTURES    │   │     CLOB     │   │  (News, Research)    │
└──────┬───────┘   └──────┬───────┘   └──────────┬───────────┘
       │                  │                      │
       └────────┬─────────┴──────────────────────┘
                │
      ┌─────────▼─────────┐
      │    DATA FUSION    │
      │   18-dim state    │
      └─────────┬─────────┘
                │
       ┌────────┴─────────┐
       ▼                  ▼
┌──────────────┐   ┌──────────────┐
│   RL AGENT   │   │  OPUS AGENT  │
│  (PyTorch)   │   │   (Claude)   │
│              │   │              │
│  LSTM+Attn   │   │  Research +  │
│  PPO v3.5    │   │  Reasoning   │
└──────┬───────┘   └──────┬───────┘
       │                  │
       └────────┬─────────┘
                │
       ┌────────▼────────┐
       │    EXECUTION    │
       │ Polymarket API  │
       └────────┬────────┘
                │
       ┌────────▼────────┐
       │    SUPABASE     │──────▶ VERCEL DASHBOARD
       │   (Real-time)   │        (Public Monitoring)
       └─────────────────┘

3. RL Agent: Technical Deep Dive

3.1 State Space (18 Dimensions)

The RL agent observes an 18-dimensional state fused from multiple real-time sources:

Category        Features                                                   Source
Momentum        returns_1m, returns_5m, returns_10m                        Binance Futures
Order Flow      ob_imbalance_l1, ob_imbalance_l5, trade_flow, cvd_accel    Binance Futures
Microstructure  spread_pct, trade_intensity, large_trade_flag              Polymarket CLOB
Volatility      vol_5m, vol_expansion                                      Combined
Position        has_position, position_side, position_pnl, time_remaining  Internal State
Regime          vol_regime, trend_regime                                   Derived
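The fusion step can be sketched as a fixed feature ordering mapped into a vector. This is illustrative only: the feature names follow the table above, but the snapshot dictionary and `build_state` helper are assumptions, not the production fusion code.

```python
import numpy as np

# Feature order mirrors the state-space table; any upstream fetch/compute
# logic is omitted here and assumed to populate `snapshot`.
STATE_FEATURES = [
    # Momentum (Binance Futures)
    "returns_1m", "returns_5m", "returns_10m",
    # Order flow (Binance Futures)
    "ob_imbalance_l1", "ob_imbalance_l5", "trade_flow", "cvd_accel",
    # Microstructure (Polymarket CLOB)
    "spread_pct", "trade_intensity", "large_trade_flag",
    # Volatility (combined)
    "vol_5m", "vol_expansion",
    # Position (internal state)
    "has_position", "position_side", "position_pnl", "time_remaining",
    # Regime (derived)
    "vol_regime", "trend_regime",
]

def build_state(snapshot: dict) -> np.ndarray:
    """Fuse one multi-source snapshot into the 18-dim observation."""
    state = np.array([snapshot.get(f, 0.0) for f in STATE_FEATURES],
                     dtype=np.float32)
    assert state.shape == (18,)
    return state
```

Missing features default to 0.0 so the observation shape is always stable for the network.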

3.2 Neural Network Architecture

The V3.5 architecture uses LSTM temporal encoding with cross-market attention:

LSTMTemporalEncoder:
    Input: (batch, seq_len=10, features=18)
    LSTM: 2 layers, hidden_dim=64, dropout=0.1
    Output: 64-dim temporal embedding

CrossMarketAttention:
    Multi-head attention (4 heads) across 4 markets
    Captures inter-market correlations

Actor Network:
    [temporal(64) + attention(64)] → 128 → LayerNorm → ReLU
    → 64 → LayerNorm → ReLU → 3 (softmax)

Critic Network:
    [temporal(64) + attention(64)] → 128 → LayerNorm → ReLU
    → 64 → LayerNorm → ReLU → 1 (value)
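A minimal PyTorch sketch consistent with the dimensions above. The exact wiring of the production V3.5 model is not shown in this paper, so the layer composition here (how the attention query is formed, how embeddings are concatenated) is an assumption.

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Illustrative LSTM + cross-market attention actor-critic (dims from 3.2)."""

    def __init__(self, n_features=18, hidden=64, n_actions=3):
        super().__init__()
        # 2-layer LSTM over the 10-step, 18-feature window
        self.lstm = nn.LSTM(n_features, hidden, num_layers=2,
                            dropout=0.1, batch_first=True)
        # 4-head attention across per-market embeddings
        self.attn = nn.MultiheadAttention(hidden, num_heads=4,
                                          batch_first=True)

        def head(out_dim):
            return nn.Sequential(
                nn.Linear(hidden * 2, 128), nn.LayerNorm(128), nn.ReLU(),
                nn.Linear(128, 64), nn.LayerNorm(64), nn.ReLU(),
                nn.Linear(64, out_dim))

        self.actor = head(n_actions)
        self.critic = head(1)

    def forward(self, seq, market_embs):
        # seq: (batch, 10, 18); market_embs: (batch, 4, 64)
        _, (h, _) = self.lstm(seq)
        temporal = h[-1]                     # (batch, 64) temporal embedding
        query = temporal.unsqueeze(1)        # attend from this market's state
        ctx, _ = self.attn(query, market_embs, market_embs)
        fused = torch.cat([temporal, ctx.squeeze(1)], dim=-1)
        probs = torch.softmax(self.actor(fused), dim=-1)  # 3-way policy
        value = self.critic(fused).squeeze(-1)            # state value
        return probs, value
```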

3.3 PPO Hyperparameters

Parameter               Value  Notes
Learning Rate (Actor)   1e-4   Conservative for stability
Learning Rate (Critic)  3e-4   Higher for faster value learning
Gamma (γ)               0.95   Short horizon (15-min markets)
GAE Lambda              0.95   Advantage estimation
Clip Epsilon            0.2    PPO clipping
Entropy Coefficient     0.03   Allows sparse policy
Buffer Size             256    Fast adaptation
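For reference, the clip epsilon and entropy coefficient from the table enter the standard PPO clipped surrogate loss. The function below is an illustrative sketch of that loss, not the system's training code.

```python
import torch

CLIP_EPS = 0.2   # Clip Epsilon from the table
ENT_COEF = 0.03  # Entropy Coefficient from the table

def ppo_actor_loss(new_logp, old_logp, advantages, entropy):
    """Standard PPO clipped surrogate loss with an entropy bonus."""
    ratio = torch.exp(new_logp - old_logp)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - CLIP_EPS, 1 + CLIP_EPS) * advantages
    # Maximize the clipped surrogate + entropy bonus -> minimize the negative
    return -(torch.min(unclipped, clipped).mean() + ENT_COEF * entropy.mean())
```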

3.4 Action Space

Action        Description
HOLD (0)      No action - wait for a better opportunity
BUY_UP (1)    Long YES token (bet price goes up)
BUY_DOWN (2)  Long NO token (bet price goes down)
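The action table maps directly onto an integer enum whose indices match the policy network's 3-way softmax output; the class name here is illustrative.

```python
from enum import IntEnum

class Action(IntEnum):
    """Discrete action space; indices match the actor's softmax output."""
    HOLD = 0      # wait for a better opportunity
    BUY_UP = 1    # long YES token
    BUY_DOWN = 2  # long NO token
```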

3.5 Reward Engineering

Key Breakthrough

Share-based PnL calculation that matches actual binary market economics:

shares = dollars / entry_price
pnl = (exit_price - entry_price) × shares

This amplifies returns from low-probability entries proportionally. Buy at 0.30 → 3.33 shares per dollar. Buy at 0.70 → 1.43 shares. Same price move, larger return at lower entries.
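The formula above can be checked with a small worked example (the helper function name is illustrative):

```python
def share_pnl(dollars: float, entry_price: float, exit_price: float) -> float:
    """Share-based PnL for a binary-market position (Section 3.5 formula)."""
    shares = dollars / entry_price
    return (exit_price - entry_price) * shares

# Same +0.10 price move, different entries:
low_entry  = share_pnl(100, 0.30, 0.40)  # 333.3 shares -> ~$33.33 profit
high_entry = share_pnl(100, 0.70, 0.80)  # 142.9 shares -> ~$14.29 profit
```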

4. Opus Agent: AI Research Engine

4.1 Overview

The Opus Agent uses Claude (Anthropic's LLM) to research and trade longer-term prediction markets. It scans all Polymarket markets, performs web research, estimates true probabilities, and identifies trading edges.

4.2 Research Pipeline

  1. Market Discovery: Scan Polymarket for liquid markets with reasonable time horizons
  2. Web Search: Gather real-time information from news, social media, and official sources
  3. AI Analysis: Claude analyzes market question, current odds, and web context
  4. Edge Calculation: Compare AI's probability estimate vs market price
  5. Execution: If edge > 8% and confidence > 60%, execute trade
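The edge-and-confidence gate in step 5 can be sketched as follows. Thresholds come from Section 4.3; the function itself and its signature are hypothetical, and the market discovery, web search, and Claude analysis steps are omitted.

```python
MIN_EDGE = 0.08        # Minimum Edge (8%)
MIN_CONFIDENCE = 0.60  # Minimum Confidence (60%)

def evaluate_market(market_price: float, ai_probability: float,
                    confidence: float):
    """Return a trade side if the AI estimate diverges enough from price."""
    edge = ai_probability - market_price  # signed edge on the YES outcome
    if confidence < MIN_CONFIDENCE or abs(edge) < MIN_EDGE:
        return None                       # no trade: edge or confidence too low
    return "BUY_YES" if edge > 0 else "BUY_NO"
```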

4.3 Trading Parameters

Parameter               Value
Minimum Edge            8%
Minimum Confidence      60%
Max Position Size       15% of bankroll
Min Time to Resolution  6 hours
Max Time to Resolution  30 days
Scan Interval           30 minutes

5. Training Results

5.1 Training Evolution

The RL agent evolved through five phases, each one fixing problems discovered in the previous:

Phase  Change                   Size  PnL     ROI
1      Shaped rewards (failed)  $5    $3.90   -
2      Sparse PnL only          $5    $10.93  55%
3      10x scale-up             $50   $23.10  12%
4      Share-based PnL          $500  $3,392  170%
5      LSTM + Attention         $500  ~$50K   2,500%

5.2 Key Insights

6. Risk Management

6.1 Exit Rules (RL Agent)

Rule           Value
Take Profit    15%
Stop Loss      10%
Time Stop      300 seconds
Min Hold Time  3 minutes
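The exit rules above can be sketched as a simple check. This is illustrative only: the precedence between the minimum hold time and the other rules is an assumption, as the paper does not specify how conflicts are resolved.

```python
TAKE_PROFIT = 0.15   # exit at +15%
STOP_LOSS = -0.10    # exit at -10%
TIME_STOP_S = 300    # hard time stop (300 seconds)
MIN_HOLD_S = 180     # minimum hold time (3 minutes)

def should_exit(pnl_pct: float, held_s: float) -> bool:
    """Apply Section 6.1 exit rules (min-hold-first ordering assumed)."""
    if held_s < MIN_HOLD_S:
        return False                        # assumed: min hold overrides TP/SL
    if held_s >= TIME_STOP_S:
        return True                         # time stop
    return pnl_pct >= TAKE_PROFIT or pnl_pct <= STOP_LOSS
```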

6.2 Position Sizing

7. Technology Stack

Component          Technology
RL Framework       PyTorch
LLM                Claude (Anthropic)
Execution          Polymarket CLOB API
Data Streaming     Binance WebSocket
Database           Supabase (PostgreSQL)
Dashboard          Vercel (static)
Real-time Updates  Supabase Realtime

8. Conclusion

POCKET demonstrates that combining reinforcement learning with large language model research creates a powerful autonomous trading system. The RL agent exploits short-term information lag in crypto prediction markets, while the Opus agent leverages AI reasoning for event-driven opportunities.

The system is fully autonomous, running 24/7 with real-time public monitoring. All trades are executed with real USDC on Polymarket, with performance transparently displayed on the live dashboard.

Live Dashboard

Monitor real-time performance at nostradopus.tech