Simmer’s sim-venue, dry-run, and paper-trade modes all run live-forward — they test a strategy against today’s prices going forward. Backtesting is the missing historical leg: replay your skill against past prediction-market data to see how it would have performed before you commit real money.
Self-serve window download available in simmer-sdk >= 0.19.0. Backtesting ships as an optional extra — it pulls a few heavier dependencies (duckdb, fastapi, uvicorn) that most SDK users don’t need.

Install

pip install 'simmer-sdk[backtest]'
This adds the simmer command:
simmer backtest --help

Try it offline

The SDK bundles a tiny demo slice, so you can run a complete backtest with no data download and no network:
simmer backtest --demo
── backtest summary ─────────────────────────────────────────
  skill        backtest-demo-favorites@1.0.0
  window       2026-04-28 → 2026-05-05 @ 43200s
  pnl          -29.54   (final equity 970.46 on 1,000)
  hit rate     50.0%   (10 settled)
  max drawdown 5.8%
  activity     10 decisions · 10 trades · 10 markets · 15 ticks
  baselines    buy&hold YES -29.54 · random +269.34
  realism gaps no slippage, no market impact at size, no queue position, ...
  config_hash  4995db6204207cda
─────────────────────────────────────────────────────────────

Backtest your own skill

Point the CLI at a skill bundle and a window — the historical tape is fetched for you and cached, no data hunting required:
export SIMMER_API_KEY=sk_live_...   # the same key you trade with

simmer backtest ./my-skill \
    --entrypoint run.py \
    --t0 2026-03-01 --t1 2026-03-08 \
    --cadence 12h \
    --out report.json

# or give a duration instead of explicit dates:
simmer backtest ./my-skill --entrypoint run.py --window 30d
The first run for a window fetches a small slice (tens of MB) from Simmer’s tape service and caches it under ~/.simmer/tapes/; repeat runs of the same window read from the cache and skip the download. The fetch needs SIMMER_API_KEY set in the environment (the same key you use to trade); there’s no separate signup.
The engine runs your unmodified skill once per tick as a subprocess against a frozen, look-ahead-safe replay server. The server uses the same wire shapes as production, so anything that calls /api/sdk/* can be backtested. State files the skill writes (daily-spend counters, etc.) are sandboxed in a temp copy.
| Flag | Meaning |
| --- | --- |
| bundle | Path to the skill bundle directory (positional). |
| --entrypoint | Script filename inside the bundle to run each tick. |
| --t0 / --t1 | Window bounds (ISO, e.g. 2026-03-01). Required (or use --window). |
| --window | Window duration to fetch, e.g. 30d / 12h; alternative to --t0/--t1. |
| --max-markets | Cap on markets in a fetched slice (default 300, max 1000). |
| --min-volume | Minimum market volume to include (default 1000). |
| --cadence | Tick spacing: 15m / 12h / 30d / minutes (default 15m). |
| --balance | Starting balance (default 1000). |
| --tape | Use a local tape slice instead of fetching (BYO; see Getting a tape). |
| --args | Entrypoint CLI args, space-separated (default --live --quiet). |
| --out | Write the full report JSON here. |
| --demo | Run the bundled offline demo (no key, no tape, no network). |
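
To make the --cadence units concrete, here is a minimal sketch of how such a string maps to seconds. The helper `cadence_to_seconds` is hypothetical and is not the SDK's own parser; it assumes a bare number means minutes, which is one reading of the "minutes" option above.

```python
import re

# Hypothetical helper for illustration; not part of simmer-sdk.
_UNIT_SECONDS = {"m": 60, "h": 3600, "d": 86400}


def cadence_to_seconds(cadence: str) -> int:
    """Convert a cadence such as '15m', '12h', '30d' or '15' to seconds."""
    text = cadence.strip().lower()
    if text.isdigit():
        # Assumption: a bare number is a count of minutes.
        count, unit = int(text), "m"
    else:
        match = re.fullmatch(r"(\d+)([mhd])", text)
        if match is None:
            raise ValueError(f"unrecognised cadence: {cadence!r}")
        count, unit = int(match.group(1)), match.group(2)
    if count <= 0:
        raise ValueError("cadence must be positive")
    return count * _UNIT_SECONDS[unit]


print(cadence_to_seconds("12h"))  # 43200
```

A 12h cadence works out to 43200 seconds, the same figure the demo summary prints after the "@" on its window line.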

Programmatic API

from simmer_sdk.backtest import run_backtest

report = run_backtest(
    "./my-skill",
    entrypoint="run.py",
    # omit `tape=` to fetch + cache the window (uses SIMMER_API_KEY);
    # or pass tape="./slice" to use your own local slice.
    t0="2026-03-01", t1="2026-03-08",
    cadence="12h",
)
print(report["summary"]["pnl"], report["summary"]["hit_rate"])
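
One possible use of the programmatic API is comparing the same skill at several tick spacings. The sketch below assumes only what the call above shows: a runner that accepts the bundle path plus keyword arguments and returns a report with a "summary" key. `sweep_cadences` is a hypothetical helper, and the runner is passed in so the snippet can be read and exercised without the SDK installed.

```python
from typing import Any, Callable, Dict, Iterable

# Any callable shaped like run_backtest(bundle, **kwargs) -> report dict.
Runner = Callable[..., Dict[str, Any]]


def sweep_cadences(
    run: Runner,
    bundle: str,
    cadences: Iterable[str],
    **kwargs: Any,
) -> Dict[str, Dict[str, Any]]:
    """Run one backtest per cadence and collect each report's summary."""
    summaries: Dict[str, Dict[str, Any]] = {}
    for cadence in cadences:
        report = run(bundle, cadence=cadence, **kwargs)
        summaries[cadence] = report["summary"]
    return summaries


# Intended use (requires simmer-sdk[backtest] and SIMMER_API_KEY):
#   from simmer_sdk.backtest import run_backtest
#   results = sweep_cadences(
#       run_backtest, "./my-skill", ["15m", "12h"],
#       entrypoint="run.py", t0="2026-03-01", t1="2026-03-08",
#   )
#   for cadence, summary in results.items():
#       print(cadence, summary["pnl"], summary["hit_rate"])
```

Each cadence is a separate run with its own inputs, so the runs should not be expected to share a config_hash.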

Reading the report

The report (stdout summary + full JSON via --out) includes:
  • summary — pnl, hit rate, max drawdown, trades, decisions, settlements, ticks.
  • baselines — the same entries/notionals under buy-and-hold-YES and a seeded random side rule, so you can tell skill from luck.
  • decisions / fills / equity_curve — the full per-tick trace.
  • realism_gaps — what the model does not capture (see below).
  • reproducibility.config_hash — a deterministic hash of the run inputs. Same (bundle, tape, window, cadence, args) → same config_hash → identical results.
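
As a rough sketch of working with the file written by --out, the helpers below pull out the headline fields and compare two runs by config_hash. The key paths are assumptions taken from the names in the list above (`summary`, `realism_gaps`, `reproducibility.config_hash`) and from the programmatic example (`pnl`, `hit_rate`); the function names are hypothetical.

```python
import json
from pathlib import Path
from typing import Any, Dict


def load_report(path: str) -> Dict[str, Any]:
    """Read a report JSON file such as the one written by --out."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def headline(report: Dict[str, Any]) -> Dict[str, Any]:
    """Collect the fields most worth checking first."""
    summary = report["summary"]
    return {
        "pnl": summary["pnl"],
        "hit_rate": summary["hit_rate"],
        "config_hash": report["reproducibility"]["config_hash"],
        # Assumed to be a list of short strings; defaults to empty if absent.
        "realism_gaps": report.get("realism_gaps", []),
    }


def same_inputs(a: Dict[str, Any], b: Dict[str, Any]) -> bool:
    """True when two reports were produced from the same run inputs."""
    return (
        a["reproducibility"]["config_hash"]
        == b["reproducibility"]["config_hash"]
    )
```

Calling `same_inputs` on two reports before comparing their P&L is a cheap guard against accidentally comparing runs with different windows or args.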

What backtests do and don’t model

Backtests use trade-tape prices, not an order book. They measure decision quality — did the strategy pick the right side at the right time — not execution realism. Every report lists its realism_gaps: no slippage, no market impact at size, no queue position, no latency, no maker rebates. Treat a backtest as a filter for bad ideas, not a promise of live P&L.
A run is only trustworthy if it’s clean. bundle.clean == true means the skill executed successfully on every tick. A run with failed ticks under-reports the strategy (the skill didn’t actually run on those ticks), and the CLI exits non-zero.
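When using the programmatic API there is no exit code to check, so a script may want its own gate. This is a minimal sketch that assumes the flag appears in the report as `report["bundle"]["clean"]`, which is inferred from the `bundle.clean` notation rather than confirmed; `is_clean` and `require_clean` are hypothetical names.

```python
from typing import Any, Dict


def is_clean(report: Dict[str, Any]) -> bool:
    """True only when the report explicitly marks the run as clean."""
    bundle = report.get("bundle") or {}
    # A missing flag is treated as not clean (conservative assumption).
    return bundle.get("clean") is True


def require_clean(report: Dict[str, Any]) -> Dict[str, Any]:
    """Return the report unchanged, or raise if any tick failed."""
    if not is_clean(report):
        raise RuntimeError(
            "backtest run is not clean: the skill failed on at least one "
            "tick, so the results under-report the strategy"
        )
    return report
```

Wrapping the report in `require_clean(...)` before reading its summary keeps numbers from a partially failed run out of later comparisons.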

Getting a tape

Most users don’t need to — pass --t0/--t1 (or --window) and the slice is fetched and cached automatically (see above).
Data coverage currently ends ~2026-05-05. Pick a window inside that range; a window starting after it returns an error. (The dataset is a snapshot of public on-chain Polymarket history; a freshness updater is planned.)
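A window can be checked against the coverage date before any fetch is attempted. The sketch below hard-codes the approximate end date quoted above, so treat it as illustrative; `window_status` is a hypothetical helper, and how the service handles a window that starts inside coverage but ends after it is not stated here, so that case is only flagged.

```python
from datetime import date

# Approximate coverage end quoted in this section; may change over time.
COVERAGE_END = date(2026, 5, 5)


def window_status(t0: str, t1: str, coverage_end: date = COVERAGE_END) -> str:
    """Classify an ISO-dated window as 'inside', 'overruns' or 'outside'."""
    start, end = date.fromisoformat(t0), date.fromisoformat(t1)
    if end <= start:
        raise ValueError("t1 must be later than t0")
    if start > coverage_end:
        return "outside"   # documented to return an error when fetched
    if end > coverage_end:
        return "overruns"  # behaviour not documented; shorten the window
    return "inside"


print(window_status("2026-03-01", "2026-03-08"))  # inside
```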
Bring your own tape (--tape). If you’d rather supply your own data — a different window, your own source, or to work fully offline — point --tape at a local directory containing markets.parquet + quant.parquet. The public, MIT-licensed dataset and the toolkit to regenerate it live at SII-WANGZJ/Polymarket_data; --tape lets power users slice their own and skip the hosted fetch entirely.
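Before pointing --tape at a directory, it can help to confirm the two expected files are present. The sketch below checks only file names, not their schema, which is not described here; `missing_tape_files` is a hypothetical helper.

```python
from pathlib import Path
from typing import List

# File names a local tape directory is expected to contain.
REQUIRED_TAPE_FILES = ("markets.parquet", "quant.parquet")


def missing_tape_files(tape_dir: str) -> List[str]:
    """Return the required file names that are absent from tape_dir."""
    root = Path(tape_dir).expanduser()
    return [name for name in REQUIRED_TAPE_FILES if not (root / name).is_file()]
```

An empty list means both files exist; anything else names what still needs to be generated or copied into the directory.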

Graduation path

backtest (historical)  →  sim (instant fills, no spread)
  →  polymarket + live=False (real prices, spread modeled)
  →  polymarket live (real USDC)
See Trading Venues for the live-forward modes.