Backtesting

Simmer’s sim-venue, dry-run, and paper-trade modes all run live-forward — they test a strategy against today’s prices going forward. Backtesting is the missing historical leg: replay your skill against past prediction-market data to see how it would have performed before you commit real money.

Self-serve window download is available in simmer-sdk >= 0.19.0. Backtesting ships as an optional extra because it pulls in a few heavier dependencies (duckdb, fastapi, uvicorn) that most SDK users don’t need.

Install

pip install 'simmer-sdk[backtest]'

This adds the simmer command:

simmer backtest --help

Try it offline

The SDK bundles a tiny demo slice, so you can run a complete backtest with no data download and no network:

simmer backtest --demo

── backtest summary ─────────────────────────────────────────
  skill        backtest-demo-favorites@1.0.0
  window       2026-04-28 → 2026-05-05 @ 43200s
  pnl          -29.54   (final equity 970.46 on 1,000)
  hit rate     50.0%   (10 settled)
  max drawdown 5.8%
  activity     10 decisions · 10 trades · 10 markets · 15 ticks
  baselines    buy&hold YES -29.54 · random +269.34
  realism gaps no slippage, no market impact at size, no queue position, ...
  config_hash  4995db6204207cda
─────────────────────────────────────────────────────────────

Backtest your own skill

Point the CLI at a skill bundle and a window — the historical tape is fetched for you and cached, no data hunting required:

export SIMMER_API_KEY=sk_live_...   # the same key you trade with

simmer backtest ./my-skill \
    --entrypoint run.py \
    --t0 2026-03-01 --t1 2026-03-08 \
    --cadence 12h \
    --out report.json

# or give a duration instead of explicit dates:
simmer backtest ./my-skill --entrypoint run.py --window 30d

The first run for a window fetches a small slice (tens of MB) from Simmer’s tape service and caches it under ~/.simmer/tapes/; repeat runs of the same window load from that cache instead of downloading again. The fetch needs SIMMER_API_KEY set in the environment. It is the same key you use to trade, so there’s no separate signup.

The engine runs your unmodified skill once per tick as a subprocess against a frozen, look-ahead-safe replay server. The replay server uses the same wire shapes as production, so anything that calls /api/sdk/* can be backtested. State files the skill writes (daily-spend counters, etc.) are sandboxed in a temp copy.
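The per-tick subprocess model and the sandboxed state copy can be pictured with a minimal sketch. This is not the engine's implementation: `run_ticks_sandboxed` is a hypothetical helper, the replay server is left out entirely, and it assumes one temp copy is shared by all ticks so that counters persist within a run.

```python
import shutil
import subprocess
import sys
import tempfile
from pathlib import Path


def run_ticks_sandboxed(bundle, entrypoint, n_ticks, args=("--live", "--quiet")):
    """Run `entrypoint` n_ticks times inside a throwaway copy of `bundle`.

    Hypothetical helper. Returns (exit_codes, new_files), where new_files
    lists files the skill created in the sandbox that the original lacks.
    """
    bundle = Path(bundle)
    with tempfile.TemporaryDirectory() as tmp:
        sandbox = Path(tmp) / "bundle"
        shutil.copytree(bundle, sandbox)  # the skill only ever sees this copy

        exit_codes = []
        for _ in range(n_ticks):
            # One fresh subprocess per tick; state files persist in the sandbox.
            proc = subprocess.run(
                [sys.executable, entrypoint, *args],
                cwd=sandbox,
                capture_output=True,
            )
            exit_codes.append(proc.returncode)

        original = {p.relative_to(bundle).as_posix()
                    for p in bundle.rglob("*") if p.is_file()}
        after = {p.relative_to(sandbox).as_posix()
                 for p in sandbox.rglob("*") if p.is_file()}
        return exit_codes, sorted(after - original)
```

Because every write lands in the copy, the original bundle directory is unchanged after the run, and a non-zero entry in `exit_codes` marks a tick on which the skill failed.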

Flag	Meaning
`bundle`	Path to the skill bundle directory (positional).
`--entrypoint`	Script filename inside the bundle to run each tick.
`--t0` / `--t1`	Window bounds (ISO, e.g. `2026-03-01`). Required (or use `--window`).
`--window`	Window duration to fetch, e.g. `30d` / `12h` — alternative to `--t0`/`--t1`.
`--max-markets`	Cap on markets in a fetched slice (default `300`, max `1000`).
`--min-volume`	Minimum market volume to include (default `1000`).
`--cadence`	Tick spacing: `15m` / `12h` / `30d` / minutes (default `15m`).
`--balance`	Starting balance (default `1000`).
`--tape`	Use a local tape slice instead of fetching (BYO — see Getting a tape).
`--args`	Entrypoint CLI args, space-separated (default `--live --quiet`).
`--out`	Write the full report JSON here.
`--demo`	Run the bundled offline demo (no key, no tape, no network).
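
To make the `--cadence` values concrete, the sketch below converts them to seconds and lists the tick times for a window. It is not the SDK's parser: `parse_cadence` and `tick_times` are hypothetical names, a bare number is assumed to mean minutes, and both window endpoints are assumed to be inclusive, which is consistent with the demo's 7-day window at 43200s reporting 15 ticks.

```python
from datetime import datetime, timedelta

# Seconds per unit suffix accepted in this sketch.
_UNIT_SECONDS = {"m": 60, "h": 3600, "d": 86400}


def parse_cadence(text: str) -> int:
    """Convert '15m', '12h', '30d', or a bare number of minutes to seconds."""
    text = text.strip().lower()
    if text and text[-1] in _UNIT_SECONDS:
        return int(text[:-1]) * _UNIT_SECONDS[text[-1]]
    return int(text) * 60  # ASSUMPTION: a bare number means minutes


def tick_times(t0: str, t1: str, cadence: str) -> list[datetime]:
    """List tick timestamps from t0 to t1, assuming both ends are inclusive."""
    step = timedelta(seconds=parse_cadence(cadence))
    end = datetime.fromisoformat(t1)
    ticks = []
    t = datetime.fromisoformat(t0)
    while t <= end:
        ticks.append(t)
        t += step
    return ticks


print(parse_cadence("12h"))
print(len(tick_times("2026-04-28", "2026-05-05", "12h")))
```

A count like this is a quick way to estimate how many times the entrypoint will run before starting a long window at a fine cadence.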

Programmatic API

from simmer_sdk.backtest import run_backtest

report = run_backtest(
    "./my-skill",
    entrypoint="run.py",
    # omit `tape=` to fetch + cache the window (uses SIMMER_API_KEY);
    # or pass tape="./slice" to use your own local slice.
    t0="2026-03-01", t1="2026-03-08",
    cadence="12h",
)
print(report["summary"]["pnl"], report["summary"]["hit_rate"])

Reading the report

The report (stdout summary + full JSON via --out) includes:

summary — pnl, hit rate, max drawdown, trades, decisions, settlements, ticks.
baselines — the same entries/notionals under buy-and-hold-YES and a seeded random side rule, so you can tell skill from luck.
decisions / fills / equity_curve — the full per-tick trace.
realism_gaps — what the model does not capture (see below).
reproducibility.config_hash — a deterministic hash of the run inputs. Same (bundle, tape, window, cadence, args) → same config_hash → identical results.
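
One way to use the baselines is to subtract each baseline's P&L from the strategy's. The sketch below does that on a hand-built dict. `report["summary"]["pnl"]` follows the programmatic example above, but the shape of `baselines` (a mapping from baseline name to P&L) and the key names `buy_and_hold_yes` and `random` are assumptions, so check them against a real report before relying on this.

```python
def edge_over_baselines(report: dict) -> dict:
    """Return strategy pnl minus each baseline pnl, rounded to cents."""
    pnl = report["summary"]["pnl"]
    # ASSUMPTION: report["baselines"] maps a baseline name to its pnl.
    return {name: round(pnl - base, 2)
            for name, base in report["baselines"].items()}


# Figures taken from the bundled demo summary; the key names are hypothetical.
demo_report = {
    "summary": {"pnl": -29.54},
    "baselines": {"buy_and_hold_yes": -29.54, "random": 269.34},
}

for name, edge in edge_over_baselines(demo_report).items():
    print(f"{name}: {edge:+.2f}")
```

With the demo figures the edge is 0.00 against buy-and-hold YES and -298.88 against the seeded random rule, so on that slice the demo skill did no better than the simplest baseline.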

What backtests do and don’t model

Backtests use trade-tape prices, not an order book. They measure decision quality — did the strategy pick the right side at the right time — not execution realism. Every report lists its realism_gaps: no slippage, no market impact at size, no queue position, no latency, no maker rebates. Treat a backtest as a filter for bad ideas, not a promise of live P&L.

A run is only trustworthy if it’s clean — bundle.clean == true means the skill executed successfully on every tick. A run with failed ticks under-reports the strategy (the skill didn’t actually run on those ticks) and the CLI exits non-zero.
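
When consuming reports in a script, a small guard can stop an unclean run from being read as a real result. This is a sketch only: `require_clean` is a hypothetical helper, and it assumes `bundle.clean` is stored at `report["bundle"]["clean"]` in the report JSON.

```python
def require_clean(report: dict) -> dict:
    """Return the summary only if the skill ran successfully on every tick."""
    # ASSUMPTION: `bundle.clean` lives at report["bundle"]["clean"].
    if report.get("bundle", {}).get("clean") is not True:
        raise RuntimeError(
            "backtest had failed ticks; its numbers under-report the strategy"
        )
    return report["summary"]


# Hand-built example; a real report carries many more fields.
summary = require_clean({"bundle": {"clean": True}, "summary": {"pnl": -29.54}})
print(summary["pnl"])
```

Treating a missing `clean` field as a failure is a deliberate choice here: a report that cannot prove it is clean is handled the same way as one that is not.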

Getting a tape

Most users don’t need to — pass --t0/--t1 (or --window) and the slice is fetched and cached automatically (see above).

Data coverage currently ends ~2026-05-05. Pick a window inside that range; a window starting after it returns an error. (The dataset is a snapshot of public on-chain Polymarket history; a freshness updater is planned.)

Bring your own tape (--tape). If you’d rather supply your own data — a different window, your own source, or to work fully offline — point --tape at a local directory containing markets.parquet + quant.parquet. The public, MIT-licensed dataset and the toolkit to regenerate it live at SII-WANGZJ/Polymarket_data; --tape lets power users slice their own and skip the hosted fetch entirely.
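
Before pointing --tape at a directory, it can help to confirm that both expected files are present. The check below is a minimal sketch, not part of the SDK: `missing_tape_files` is a hypothetical helper, and it only tests that the files exist, not that their contents or schema are valid.

```python
from pathlib import Path

# File names stated in the docs for a local tape slice.
REQUIRED_TAPE_FILES = ("markets.parquet", "quant.parquet")


def missing_tape_files(tape_dir) -> list[str]:
    """Return the required tape files that are absent from `tape_dir`."""
    root = Path(tape_dir)
    return [name for name in REQUIRED_TAPE_FILES if not (root / name).is_file()]


missing = missing_tape_files("./slice")
if missing:
    print("tape directory is incomplete, missing:", ", ".join(missing))
```

An empty list means the directory has the two files a local slice needs.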

Graduation path

backtest (historical)  →  sim (instant fills, no spread)
  →  polymarket + live=False (real prices, spread modeled)
  →  polymarket live (real USDC)

See Trading Venues for the live-forward modes.
