Data Cleaning for Prediction Market Backtests

How to clean and prepare Polymarket historical order book data for accurate backtesting and strategy development.

Why Data Cleaning Matters

Raw historical data — even from high-quality sources like PolyHistorical — needs cleaning and preparation before use in backtesting. Dirty data leads to unreliable backtests, which leads to strategies that fail in live trading. Investing time in data quality pays dividends in strategy reliability.

Common Data Issues in Prediction Market Data

Issue                | Cause                                   | Impact on Backtest
Missing Snapshots    | Network latency, API downtime           | Gaps in price series, incorrect returns
Duplicate Timestamps | Retry logic, clock sync issues          | Double-counting signals, inflated volume
Outlier Prices       | Momentary thin books, fat-finger orders | False signals, unrealistic backtest returns
Stale Order Books    | Low activity periods                    | False liquidity signals, unrealistic fills
Timezone Issues      | Mixed UTC/local timestamps              | Misaligned signals and execution

Step 1: Handle Missing Data

Check for gaps in your timestamp series. PolyHistorical provides snapshots approximately every 300ms, so gaps longer than a few seconds indicate missing data.

  • Forward fill: Use the last known order book state for short gaps (< 5 seconds)
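  • Interpolation: Interpolate midpoint prices for moderate gaps (5 seconds to 1 minute); interpolated values depend on the first snapshot after the gap, so use them for analysis only, not for generating signals inside the gap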
  • Mark and skip: Flag periods with long gaps (> 1 minute) and exclude from backtest
  • Never backfill: Do not use future data to fill past gaps (look-ahead bias)
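The gap rules above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a definitive implementation: the function names, the `min_gap_ms` cutoff of one second, and the input layout (a sorted list of Unix-millisecond timestamps) are assumptions made for the example. Only the 5-second and 1-minute thresholds come from the list above.

```python
from bisect import bisect_right


def classify_gaps(timestamps_ms, min_gap_ms=1_000, short_ms=5_000, long_ms=60_000):
    """Label every gap between consecutive snapshots that exceeds min_gap_ms.

    timestamps_ms must be sorted ascending. min_gap_ms is an assumed cutoff
    for "this is a gap at all"; tune it to your snapshot cadence.
    """
    gaps = []
    for prev, curr in zip(timestamps_ms, timestamps_ms[1:]):
        delta = curr - prev
        if delta <= min_gap_ms:
            continue  # normal spacing, nothing to repair
        if delta < short_ms:
            action = "forward_fill"
        elif delta < long_ms:
            action = "interpolate"
        else:
            action = "skip"
        gaps.append((prev, curr, action))
    return gaps


def last_known(timestamps_ms, values, t_ms, max_age_ms=5_000):
    """Forward fill: return the latest value observed at or before t_ms.

    Returns None if nothing has been observed yet or if the latest
    observation is older than max_age_ms. Never looks past t_ms.
    """
    idx = bisect_right(timestamps_ms, t_ms) - 1
    if idx < 0:
        return None
    if t_ms - timestamps_ms[idx] > max_age_ms:
        return None
    return values[idx]
```

Because `last_known` only ever reads observations at or before the query time, it cannot introduce look-ahead bias, whatever grid of query times the backtest uses.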

Step 2: Remove Duplicates

Check for duplicate timestamps in your dataset. Collapse exact duplicates (same timestamp, identical order book) to a single copy. When snapshots share a timestamp but differ, keep the one with the most complete order book data (highest total depth).
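One way to apply this rule is sketched below. The snapshot layout (a dict with `ts`, `bids`, and `asks`, where each side is a list of `(price, size)` tuples) is a hypothetical schema chosen for the example, not the PolyHistorical response format, and the function names are illustrative.

```python
def total_depth(snapshot):
    """Sum of resting size across all bid and ask levels."""
    bid_depth = sum(size for _, size in snapshot["bids"])
    ask_depth = sum(size for _, size in snapshot["asks"])
    return bid_depth + ask_depth


def dedupe_snapshots(snapshots):
    """Keep one snapshot per timestamp, preferring the deepest book.

    The comparison is strict, so exact duplicates and ties keep the
    first copy seen. The result is sorted by timestamp.
    """
    best = {}
    for snap in snapshots:
        kept = best.get(snap["ts"])
        if kept is None or total_depth(snap) > total_depth(kept):
            best[snap["ts"]] = snap
    return [best[ts] for ts in sorted(best)]
```

Using total depth as the tiebreaker follows the text above; if your data carries a sequence number or receive time, that may be a better signal of which copy is authoritative.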

Step 3: Filter Outliers

Prediction market prices should be between 0 and 1. Additionally, filter for unrealistic price moves:

  • Remove snapshots where midpoint price is outside [0.01, 0.99] unless near resolution
  • Flag price changes that differ from the rolling mean change by more than 3 rolling standard deviations
  • Check that bid-ask spread is positive (best_ask > best_bid)
  • Verify that Up + Down complement prices approximately equal 1.00
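The checks in this step could look roughly like the sketch below. The function names, the issue labels, the complement tolerance of 0.02, and the `window` and `min_history` defaults are all assumptions for illustration; only the [0.01, 0.99] band and the 3-standard-deviation threshold come from the list above. The rolling statistics use past changes only, so the flagging itself does not look ahead.

```python
from collections import deque
from statistics import mean, stdev


def basic_price_checks(best_bid, best_ask, up_mid=None, down_mid=None,
                       near_resolution=False, tol=0.02):
    """Return a list of issue labels for one snapshot (empty list = clean)."""
    issues = []
    mid = (best_bid + best_ask) / 2
    if not (0.0 <= best_bid <= 1.0 and 0.0 <= best_ask <= 1.0):
        issues.append("price_out_of_range")
    if not near_resolution and not (0.01 <= mid <= 0.99):
        issues.append("extreme_mid")
    if best_ask <= best_bid:
        issues.append("non_positive_spread")
    if up_mid is not None and down_mid is not None:
        if abs(up_mid + down_mid - 1.0) > tol:
            issues.append("complement_mismatch")
    return issues


def flag_outlier_moves(mids, window=100, min_history=20, n_sigma=3.0):
    """Return indices i where mids[i] - mids[i-1] is an outlier.

    Each change is compared against the mean and standard deviation of
    the previous `window` changes. Nothing is flagged until at least
    `min_history` earlier changes are available.
    """
    flagged = []
    history = deque(maxlen=window)
    for i in range(1, len(mids)):
        change = mids[i] - mids[i - 1]
        if len(history) >= max(min_history, 2):
            mu = mean(history)
            sigma = stdev(history)
            if sigma > 0 and abs(change - mu) > n_sigma * sigma:
                flagged.append(i)
        history.append(change)
    return flagged
```

Flagged changes still enter the rolling history in this sketch, which widens the band right after a jump. Whether to exclude them is a design choice that depends on how you want the filter to behave around genuine repricing events.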

Step 4: Validate Order Book Integrity

Each order book snapshot should satisfy basic consistency rules:

  • Bid prices should be in descending order
  • Ask prices should be in ascending order
  • No negative volumes at any price level
  • Best bid must be strictly less than best ask
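As a sketch, the four rules translate into a single validator. It assumes each side of the book is a list of `(price, size)` tuples ordered best-first; that layout and the rule labels are hypothetical. Equal prices on adjacent levels are treated as an ordering violation here, which is an assumption rather than something the rules above state.

```python
def validate_book(bids, asks):
    """Return the labels of all violated consistency rules (empty = valid).

    bids and asks are lists of (price, size) tuples, best level first.
    """
    problems = []
    bid_prices = [price for price, _ in bids]
    ask_prices = [price for price, _ in asks]

    # Bids must fall strictly as we move away from the top of the book.
    if any(nxt >= prev for prev, nxt in zip(bid_prices, bid_prices[1:])):
        problems.append("bids_not_descending")

    # Asks must rise strictly as we move away from the top of the book.
    if any(nxt <= prev for prev, nxt in zip(ask_prices, ask_prices[1:])):
        problems.append("asks_not_ascending")

    if any(size < 0 for _, size in bids + asks):
        problems.append("negative_size")

    # A one-sided book cannot be crossed, so only compare when both exist.
    if bids and asks and bid_prices[0] >= ask_prices[0]:
        problems.append("crossed_book")

    return problems
```

Returning every violated rule, rather than stopping at the first, makes it easier to see whether a bad snapshot has one isolated defect or is corrupt throughout.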

Step 5: Standardize Timestamps

Convert all timestamps to UTC Unix milliseconds for consistency. This eliminates timezone confusion and makes time-range queries straightforward. PolyHistorical returns timestamps in UTC by default, but verify this in your data pipeline.
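A small conversion helper along these lines can enforce the convention. It is a sketch using only the Python standard library, and `to_unix_ms` is an illustrative name. It deliberately rejects timestamps without an explicit offset instead of guessing a timezone, and it uses integer arithmetic to avoid floating-point rounding at millisecond precision.

```python
from datetime import datetime, timedelta, timezone

_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_unix_ms(value):
    """Convert an aware datetime or ISO 8601 string to UTC Unix milliseconds.

    Strings are parsed with datetime.fromisoformat, so they need an
    explicit offset such as "+00:00". Naive values raise ValueError.
    """
    if isinstance(value, str):
        value = datetime.fromisoformat(value)
    if value.tzinfo is None or value.utcoffset() is None:
        raise ValueError("naive timestamp: attach an explicit timezone first")
    # Subtracting aware datetimes normalizes both to UTC.
    return (value - _EPOCH) // timedelta(milliseconds=1)


print(to_unix_ms("2024-01-01T00:00:00+00:00"))  # 1704067200000
print(to_unix_ms("2023-12-31T19:00:00-05:00"))  # 1704067200000
```

Both calls print the same value because they describe the same instant. Note that `datetime.fromisoformat` only accepts a trailing `Z` from Python 3.11 onward, so older interpreters need the `+00:00` form.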

Step 6: Build Validation Checks

Automate your data cleaning pipeline with validation checks that run before every backtest. Log warnings for any data anomalies so you can investigate before they corrupt your results. A few hours spent on data cleaning saves days of debugging false backtest results.
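A pre-backtest validation pass might be organized as a set of named checks whose findings are logged as warnings. The sketch below uses Python's standard `logging` module; the logger name, the check functions, and the snapshot fields (`ts`, `best_bid`, `best_ask`) are hypothetical and stand in for whatever checks your pipeline needs.

```python
import logging

logger = logging.getLogger("backtest.data_checks")


def check_strictly_increasing_ts(snapshots):
    """Report every timestamp that is not later than its predecessor."""
    ts = [snap["ts"] for snap in snapshots]
    return [f"timestamp {b} does not follow {a}"
            for a, b in zip(ts, ts[1:]) if b <= a]


def check_not_crossed(snapshots):
    """Report every snapshot whose best bid is at or above its best ask."""
    return [f"crossed book at ts={snap['ts']}"
            for snap in snapshots if snap["best_bid"] >= snap["best_ask"]]


def run_checks(snapshots, checks):
    """Run each named check, log a warning per finding, return a report.

    checks maps a check name to a function that takes the snapshots and
    returns a list of human-readable problem descriptions.
    """
    report = {}
    for name, check in checks.items():
        problems = check(snapshots)
        report[name] = problems
        for problem in problems:
            logger.warning("%s: %s", name, problem)
    return report
```

A backtest runner can call `run_checks` first and refuse to start if any list in the returned report is non-empty, or merely log and continue, depending on how strict the pipeline should be.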

PolyHistorical Data Quality

PolyHistorical maintains high data quality standards, but no data source is perfect. The cleaning steps above ensure your backtesting pipeline is robust regardless of upstream data quality. Start with the free tier to develop your cleaning pipeline, then scale to Pro at $11/month for production use.
