Polymarket Historical Data

Polymarket Parquet Data: Download Order Book History in Parquet Format

Download Polymarket historical order book data in Parquet format. PolyHistorical provides columnar Parquet exports for fast analysis of BTC, ETH, and SOL prediction markets.

Why Parquet for Polymarket Data?

Apache Parquet is a columnar storage format designed for efficient analytical queries. For Polymarket order book data — where you often query specific columns (prices, depths, timestamps) across millions of rows — Parquet offers massive advantages over CSV or JSON.

Parquet vs Other Formats

Feature	Parquet	CSV	JSON
File Size	5-10x smaller (compressed)	Baseline	2-3x larger than CSV
Read Speed	Column pruning — read only what you need	Must scan full rows	Must parse full objects
Type Safety	✓ Schema embedded	✗ Everything is strings	Partial (no decimals)
pandas Loading	pd.read_parquet() — fast	pd.read_csv() — slow for large files	pd.read_json() — slowest
Nested Data	✓ Supports nested order books	✗ Requires flattening	✓ Native
Spark / DuckDB	✓ Native support	Requires parsing	Requires parsing

Data Schema

PolyHistorical Parquet exports include the following columns per snapshot:

Column	Type	Description
time	timestamp	Snapshot capture time (UTC)
price_up	decimal	Last trade price for Up outcome
price_down	decimal	Last trade price for Down outcome
coin_price	decimal	BTC/ETH/SOL spot price
volume	decimal	Cumulative market volume
liquidity	decimal	Total available liquidity
orderbook_up	struct	Full L2 bids/asks for Up token
orderbook_down	struct	Full L2 bids/asks for Down token

Loading Parquet Data in Python

import pandas as pd

# Load Parquet file — only read the columns you need
df = pd.read_parquet(
    "btc-5m-snapshots.parquet",
    columns=["time", "price_up", "price_down", "coin_price"]
)
print(f"Loaded {len(df):,} snapshots")
print(df.head())

# Calculate spread
df["spread"] = df["price_up"] + df["price_down"] - 1
print(f"Mean spread: {df['spread'].mean():.4f}")

Using DuckDB for Fast Queries

import duckdb

# Query Parquet directly without loading into memory
result = duckdb.sql("""
    SELECT
        date_trunc('hour', time) AS hour,
        avg(price_up) AS avg_up,
        avg(price_down) AS avg_down,
        count(*) AS snapshots
    FROM 'btc-5m-snapshots.parquet'
    GROUP BY 1
    ORDER BY 1
""")
print(result.fetchdf())

How to Get Parquet Exports

API + conversion: Fetch snapshots via the PolyHistorical API and save as Parquet using df.to_parquet()
Bulk export: Use the bulk data export endpoint to download large datasets, then convert to Parquet for offline analysis

import requests
import pandas as pd

API_KEY = "your_api_key"
slug = "btc-5m-up-down-2026-04-27-1200"

resp = requests.get(
    f"https://api.polyhistorical.com/v1/markets/{slug}/snapshots",
    headers={"X-API-Key": API_KEY},
    params={"include_orderbook": "true"}
)
df = pd.DataFrame(resp.json()["data"])
df.to_parquet(f"{slug}.parquet", index=False)
print(f"Saved {len(df)} snapshots to {slug}.parquet")

When to Use Parquet

Large-scale backtesting: Analyzing thousands of markets with millions of snapshots
Repeated analysis: Load once, query many times — Parquet reads are 10-100x faster than CSV
Distributed processing: Spark, Dask, and other frameworks read Parquet natively
Storage efficiency: Archive months of order book history in a fraction of the CSV size