MLB Desk Construction Spec

Status: IN PROGRESS
Started: April 6, 2026
Last Updated: April 9, 2026


1. Data Sources

Game-Day Markets (6 AM, 8 AM, 10 AM, 2 PM, 6 PM ET, plus a closing snapshot 1 min before game start)

| Source | What it provides | Database | Schedule |
| --- | --- | --- | --- |
| Kalshi game lines | ML, spread, total, F5, RFI, team total | kalshi-mlb-prices.db | :00 past hour |
| Pinnacle MLB odds | ML, run line, total (sharp anchor) | pinnacle-mlb.db | :02 past hour |
| Pinnacle F5 | First 5 innings ML + total (NOTE: scraper unreliable; F5 lines now derived from full-game + SP ratio) | pinnacle-mlb.db | :02 past hour |
| SBR Multi-Book | ML, spreads, totals from 6 US sportsbooks | sbr-mlb.db | :00 past hour |
| Pregame.com | Cash vs ticket %, line movement, RLM, steam | research-pipeline.db (filtered) | :00 past hour |

Futures Markets (adaptive schedule)

| Source | What it provides | Database | Schedule |
| --- | --- | --- | --- |
| Kalshi Futures | WS, teams in WS, AL/NL pennant, playoffs, best/worst record (7 series) | kalshi-mlb-futures.db | Weekly (Mon) → every 3 days from Jul 27 → daily from Sep 1 |
| Kalshi Win Totals | Per-team season win over/unders (30 teams) | kalshi-mlb-win-totals.db | Same as above |
| Kalshi Divisions | AL/NL division winners (6 divisions, 30 teams) | kalshi-mlb-divisions.db | Same as above |
| Pinnacle Futures | WS, AL/NL pennant (60 selections, sharp anchor) | pinnacle-mlb-futures.db | Same as above, fires at :05 |

Stats (9 AM ET daily — 3 staggered groups)

| Source | What it provides | Database |
| --- | --- | --- |
| MLB Stats API | Schedule, lineups, standings, transactions, pitchers | mlb-stats.db |
| MLB Stats API | SP game logs, SP baselines, bullpen usage/status/log, opener flags | mlb-pitching.db |
| MLB Stats API | Batter game logs, baselines, platoon splits, career stats, crosswalk | mlb-batting.db |
| Advanced computed metrics | xFIP, quality composite, bullpen index, ABS challenges | mlb-model.db |
| NWS API | Game-day weather + park factors | mlb-weather.db |
| Derived metrics | Team variance, fatigue, inning scoring, park factors, model daily | mlb-model.db |
| Baseball Reference | Team batting + pitching stats | mlb-batting.db / mlb-pitching.db |
| Baseball Savant | Pitcher stats, catcher framing | mlb-pitching.db |
| UmpScorecards | Umpire assignments + tendencies | mlb-stats.db |
| DRatings | Team power ratings + predictions | research-pipeline.db (filtered) |
| Dimers | Pythagorean win probabilities | research-pipeline.db (filtered) |
| GameSim | Monte Carlo simulation predictions | research-pipeline.db (filtered) |

Edge Detection (fires :10 past hour, after data collection)

| Source | What it provides | Database |
| --- | --- | --- |
| Edge Scanner | Kalshi vs Pinnacle mispricings: ML, spread, total, F5, team total | mlb-edges.db |
| F5 Probability Curves | SP-scaled spread/total/ML curves for First 5 Innings | mlb-edges.db |
| Implied Curves | Alt-line probability curves from Kalshi ladders | mlb-edges.db |
| Edge Summary | Per-day edge counts and averages | mlb-edges.db |

Player Props (separate desk — MLB Player Props)

| Source | What it provides | Database |
| --- | --- | --- |
| Kalshi MLB props | HR, hits, K, TB, HRR, season stats | kalshi-mlb-props.db |
| FanDuel/DK prop lines | Over/under lines via Odds API | mlb-prop-lines.db |
| Prop edge scanner | Book + model edges for props | mlb-prop-edges.db |
| Awards (future addition) | MVP, Cy Young, ROY, etc.; individual player futures | TBD |

2. Databases (16 files)

Game-Day Odds

Futures

Stats

Props (Player Props desk)

Edge Detection


3. Schedules

Game-Day Timeline (6 AM, 8 AM, 10 AM, 2 PM, 6 PM ET)

| Time | What fires |
| --- | --- |
| :00 | Kalshi MLB game prices, SBR multi-book, Pregame sharp money, MLB props |
| :02 | Pinnacle MLB full game + F5 |
| :10 | Edge scanner (reads fresh data from all sources) |
| Game start - 1 min | Closing snapshot (Kalshi + Pinnacle) |

Futures Adaptive Schedule

| Phase | Dates | Frequency | Time |
| --- | --- | --- | --- |
| Phase 1 | Now → Jul 26, 2026 | Every Monday | 6:00 AM (Kalshi), 6:03 (divisions), 6:05 (Pinnacle) |
| Phase 2 | Jul 27 → Aug 31 | Every 3 days | Same times |
| Phase 3 | Sep 1 → Sep 27 | Daily | Same times |
| Expired | After Sep 27 | Stops scanning | |
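The phase logic above can be sketched as a single date check. This is a minimal illustration, not the scheduler's actual code: the function name is hypothetical, and it assumes the Phase 2 "every 3 days" cadence is counted from Jul 27 itself.

```python
from datetime import date

# Phase boundaries and expiry copied from the schedule table.
PHASE_2_START = date(2026, 7, 27)
PHASE_3_START = date(2026, 9, 1)
EXPIRY = date(2026, 9, 27)


def should_scan_futures(day: date) -> bool:
    """Return True if the futures scrapers should fire on `day` (hypothetical helper)."""
    if day > EXPIRY:
        return False  # Expired: scanning stops
    if day >= PHASE_3_START:
        return True  # Phase 3: daily
    if day >= PHASE_2_START:
        # Phase 2: every 3 days, anchored on Jul 27 (assumption)
        return (day - PHASE_2_START).days % 3 == 0
    return day.weekday() == 0  # Phase 1: Mondays only (Monday == 0)
```

Keeping the boundaries as module-level constants means the expiry date lives in one place alongside the phase dates.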

Stats (9 AM ET daily — 3 staggered groups)

| Group | Time | What runs |
| --- | --- | --- |
| Group 1: Raw Data Pulls | 9:00 AM | External APIs + web scrapes: MLB Stats API, batter/SP game logs, bullpen, weather, Baseball Reference, Savant, umpires, ratings (Sagarin, DRatings, MoneyPuck, Dimers), NBA/NHL/NCAAB/Soccer/Golf/Motorsports/UFC scrapers, MLB props |
| Group 2: Baselines & Splits | 9:05 AM | Light computation on Group 1 data: batter baselines, player crosswalk, opener flags, model daily stats, SP baselines, platoon splits |
| Group 3: Derived Metrics | 9:10 AM | Heavy math needing Groups 1+2: NHL/NBA/MLB variance + fatigue + period scoring, MLB park factors + inning scoring + derived metrics, soccer metrics, player prop analytics, usage cascade, matchup adjustments, MLB props edge scanner + steam |

Each group runs sequentially within itself. Groups are staggered 5 minutes apart so raw data finishes before computation starts. If one group fails, the others still run.
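The failure-isolation behavior can be illustrated with a small runner. This is a sketch under assumptions: the names are hypothetical, the 5-minute stagger is left to the external scheduler, and a failing job is assumed to abort the rest of its own group.

```python
from typing import Callable, Dict, List, Tuple

Job = Callable[[], None]


def run_all(groups: List[Tuple[str, List[Job]]]) -> Dict[str, str]:
    """Run each group's jobs in order; a failure in one group never blocks the next."""
    status: Dict[str, str] = {}
    for name, jobs in groups:
        try:
            for job in jobs:  # sequential within the group
                job()
            status[name] = "ok"
        except Exception as exc:  # isolate the failure to this group
            status[name] = f"failed: {exc}"
    return status
```

Because the stagger is time-based rather than dependency-based, a later group may compute on stale inputs when an earlier one fails; the freshness thresholds in section 6 are what would surface that.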


4. Edge Detection

Game-Day Edges

The edge scanner compares Kalshi alt-line prices against Pinnacle sharp odds:

  1. Pinnacle provides the fair value (moneyline, spread, total)
  2. A probability curve is built from Pinnacle's line using Poisson (totals) or Normal (spreads) distribution
  3. Each Kalshi alt-line rung is compared: if Kalshi's price diverges from the model probability by more than the fee (7%), it's flagged as an edge
  4. Edges are tagged with scan_type, confidence, steam detection, and closing line value (CLV)
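Steps 2 and 3 can be sketched as below. The function names are hypothetical, and two modeling choices are assumptions rather than confirmed scanner behavior: Pinnacle's total is used directly as the Poisson mean, and the expected margin is supplied by the caller. The 7% fee and the full-game sigma of 4.0 come from this spec.

```python
import math
from typing import Optional

FEE = 0.07              # fee threshold from the spec
FULL_GAME_SIGMA = 4.0   # full-game spread sigma quoted in the F5 section


def poisson_over_prob(lam: float, line: float) -> float:
    """P(total runs > line) for a Poisson(lam) total; pushes on whole-number lines are ignored."""
    cdf = sum(math.exp(-lam) * lam**k / math.factorial(k)
              for k in range(math.floor(line) + 1))
    return 1.0 - cdf


def normal_cover_prob(expected_margin: float, line: float,
                      sigma: float = FULL_GAME_SIGMA) -> float:
    """P(margin > line) when margin ~ Normal(expected_margin, sigma)."""
    z = (line - expected_margin) / sigma
    return 0.5 * math.erfc(z / math.sqrt(2))


def flag_edge(model_prob: float, kalshi_price_cents: float,
              fee: float = FEE) -> Optional[float]:
    """Return the signed gap (model minus Kalshi) if it exceeds the fee, else None."""
    gap = model_prob - kalshi_price_cents / 100.0
    return gap if abs(gap) > fee else None
```

For example, a rung priced at 50 with a model probability of 0.60 is flagged with a gap of about 0.10, while a model probability of 0.55 is not.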

Team Totals

The edge scanner implies team totals from Pinnacle's game total + spread:
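A minimal sketch of one common way to split a game total and spread into two team totals. This is an illustration, not necessarily the scanner's own formula: it assumes the spread is quoted from the home side (negative when the home team is favored) and is treated as the expected run margin, which is a simplification for a fixed ±1.5 run line.

```python
from typing import Tuple


def implied_team_totals(game_total: float, home_spread: float) -> Tuple[float, float]:
    """Split a game total into (home, away) team totals.

    home_spread is negative when the home team is favored (assumed convention),
    so the favorite receives the larger share of the total.
    """
    home = (game_total - home_spread) / 2
    away = (game_total + home_spread) / 2
    return home, away
```

With a total of 8.5 and a home spread of -1.5, this gives 5.0 for the home team and 3.5 for the away team; the two always sum to the game total and differ by the spread.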

Implied Probability Curves

Built from Kalshi's alt-line ladders. Each game gets a curve showing the market-implied probability at each threshold. Per-team curves for spreads, game-level for totals. Pinned contracts (bid ≤5 and ask ≥90) are filtered out.
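A sketch of the curve construction with the pinned-contract filter. The row layout (`threshold`, `bid`, `ask` in cents) and the use of the bid/ask midpoint as the implied probability are assumptions for illustration; the spec states only the filter condition.

```python
from typing import Dict, List, Tuple


def implied_curve(ladder: List[Dict[str, float]]) -> List[Tuple[float, float]]:
    """Build [(threshold, implied_prob)] from a Kalshi alt-line ladder."""
    curve = []
    for rung in sorted(ladder, key=lambda r: r["threshold"]):
        if rung["bid"] <= 5 and rung["ask"] >= 90:
            continue  # pinned contract (dead market): skip
        mid_cents = (rung["bid"] + rung["ask"]) / 2  # midpoint is an assumption
        curve.append((rung["threshold"], mid_cents / 100))
    return curve
```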

F5 (First 5 Innings) Probability Curves

F5 lines are derived from Pinnacle's full-game line, not scraped directly (Pinnacle's F5 scraper is unreliable). The model engine provides per-game SP quality ratios to scale full-game lines to F5.

F5 Spreads: Normal distribution with sigma 2.7 (tighter than the full-game 4.0, since 5 innings carry less variance). The per-game ratio model_f5Spread / model_fullSpread is applied to Pinnacle's spread. Fallback: a flat 0.55 scale when no SP data is available.
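The spread scaling and its fallback can be sketched as follows. The function names are hypothetical, and treating the scaled spread as the mean of the margin distribution is an assumption; sigma 2.7 and the 0.55 fallback are from the spec.

```python
from statistics import NormalDist
from typing import Optional

F5_SIGMA = 2.7         # F5 spread sigma from the spec
FALLBACK_SCALE = 0.55  # flat scale used when no SP data is available


def f5_spread(pinnacle_spread: float,
              model_f5_spread: Optional[float] = None,
              model_full_spread: Optional[float] = None) -> float:
    """Scale Pinnacle's full-game spread to an F5 spread."""
    if model_f5_spread is None or not model_full_spread:
        ratio = FALLBACK_SCALE  # missing or zero model spread: use the flat scale
    else:
        ratio = model_f5_spread / model_full_spread
    return pinnacle_spread * ratio


def f5_cover_prob(expected_margin: float, line: float,
                  sigma: float = F5_SIGMA) -> float:
    """P(F5 margin > line) under Normal(expected_margin, sigma)."""
    return 1.0 - NormalDist(mu=expected_margin, sigma=sigma).cdf(line)
```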

F5 Totals: Poisson distribution. The per-game ratio model_f5Total / model_fullTotal is applied to Pinnacle's total, producing F5 totals in the ~4-5 run range (vs 7-9 for the full game). SP quality dominates: an ace matchup gets a lower ratio than a matchup of two bad starters.
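To illustrate how the SP ratio moves the F5 total, the sketch below scales two hypothetical games and compares their Poisson over probabilities. The function names and all input numbers are made up for illustration, and using the scaled total as the Poisson mean is an assumption.

```python
import math


def scale_total(pinnacle_total: float, model_f5_total: float,
                model_full_total: float) -> float:
    """Apply the per-game ratio model_f5Total / model_fullTotal to Pinnacle's total."""
    return pinnacle_total * model_f5_total / model_full_total


def poisson_over(lam: float, line: float) -> float:
    """P(F5 runs > line) for a Poisson(lam) total."""
    cdf = sum(math.exp(-lam) * lam**k / math.factorial(k)
              for k in range(math.floor(line) + 1))
    return 1.0 - cdf


# Hypothetical inputs: an ace matchup vs. a matchup of two weak starters.
ace_total = scale_total(7.0, 3.5, 7.0)   # ratio 0.50
weak_total = scale_total(9.0, 5.0, 9.0)  # ratio ~0.56
```

Here `poisson_over(ace_total, 4.5)` comes out lower than `poisson_over(weak_total, 4.5)`, which matches the direction described above.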

F5 Moneyline (3-way): Home win / Away win / Tie. The tie probability comes from a Poisson model, P(tie) = Σ P(Home=k) × P(Away=k) summed over k, using SP-scaled team lambdas. Pinnacle's 2-way ML is de-vigged to give win probabilities conditional on no tie, and each is then multiplied by (1 - P(tie)) to get the unconditional probability. Each game stores 3 rows in the database. The F5 ML edge scanner is deferred until historical tie-rate data validates the Poisson model (~2 weeks of settlements needed). Until then, the tie rate is tracked passively through Kalshi F5 TIE contract settlements.
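The 3-way construction can be sketched as below. The function names are hypothetical, and two details are assumptions: Pinnacle's prices are taken as decimal odds, and the vig is removed proportionally. The sum is truncated at 30 runs per team, beyond which the Poisson mass is negligible for F5 lambdas.

```python
import math
from typing import Tuple


def poisson_pmf(k: int, lam: float) -> float:
    return math.exp(-lam) * lam**k / math.factorial(k)


def tie_prob(lam_home: float, lam_away: float, k_max: int = 30) -> float:
    """P(tie) = sum over k of P(Home=k) * P(Away=k), with independent Poisson scores."""
    return sum(poisson_pmf(k, lam_home) * poisson_pmf(k, lam_away)
               for k in range(k_max + 1))


def devig_two_way(home_decimal: float, away_decimal: float) -> Tuple[float, float]:
    """Proportional de-vig of a 2-way moneyline (decimal odds assumed)."""
    p_home, p_away = 1 / home_decimal, 1 / away_decimal
    total = p_home + p_away
    return p_home / total, p_away / total


def three_way(home_decimal: float, away_decimal: float,
              lam_home: float, lam_away: float) -> Tuple[float, float, float]:
    """Return unconditional (P(home), P(away), P(tie)) for the F5 moneyline."""
    p_tie = tie_prob(lam_home, lam_away)
    cond_home, cond_away = devig_two_way(home_decimal, away_decimal)
    return cond_home * (1 - p_tie), cond_away * (1 - p_tie), p_tie
```

The three returned values sum to 1 by construction, which maps directly onto the 3 rows stored per game.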

All F5 curves are stored in mlb-edges.db, in the mlb_probability_curves table, with market_type = f5_spread / f5_total / f5_moneyline.

Dashboard views: /data-status/view/mlb-f5-spread-curves, /data-status/view/mlb-f5-total-curves, /data-status/view/mlb-f5-ml-curves.

F5 Edge Scanner

F5 spread and total edges are live. The edge scanner detects F5 tickers (Kalshi tickers containing "F5"), looks up precomputed SP-scaled probabilities from the mlb_probability_curves table, and compares them against Kalshi prices. If a precomputed curve is unavailable, it falls back to building the curve on the fly with sigma 2.7. F5 edges are tagged with sub_market = 'f5' in sports_edges.
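The lookup-then-fallback flow for F5 spreads can be sketched like this. The in-memory dict stands in for the mlb_probability_curves table, and the key layout, function names, and example ticker are hypothetical.

```python
from statistics import NormalDist
from typing import Dict, Tuple

F5_SIGMA = 2.7  # fallback sigma from the spec

CurveKey = Tuple[str, float]  # (game_id, threshold): hypothetical key layout


def is_f5_ticker(ticker: str) -> bool:
    """F5 markets are detected by 'F5' appearing in the Kalshi ticker."""
    return "F5" in ticker


def f5_model_prob(precomputed: Dict[CurveKey, float], game_id: str,
                  threshold: float, expected_margin: float) -> Tuple[float, str]:
    """Prefer the precomputed SP-scaled probability; otherwise build it on the fly."""
    key = (game_id, threshold)
    if key in precomputed:
        return precomputed[key], "precomputed"
    prob = 1.0 - NormalDist(mu=expected_margin, sigma=F5_SIGMA).cdf(threshold)
    return prob, "fallback"
```

Returning the source label alongside the probability makes it easy to see later which edges relied on the fallback curve.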

Dashboard views: /data-status/view/edge-scanner-mlb-f5-spreads, /data-status/view/edge-scanner-mlb-f5-totals.

Futures Edge Detection (not yet built)

Compare Kalshi futures prices against Pinnacle sharp odds. WS winner, pennant, division odds from Pinnacle serve as the anchor. Gap between Kalshi and Pinnacle = the edge.
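Since this component is not yet built, the following is only a sketch of how the gap could be computed. It assumes Pinnacle prices are decimal odds covering the full field of one market, removes the vig proportionally, and takes Kalshi prices in cents; all names are hypothetical.

```python
from typing import Dict


def devig_field(decimal_odds: Dict[str, float]) -> Dict[str, float]:
    """Proportionally de-vig a multi-outcome market (needs every selection in the field)."""
    raw = {team: 1 / odds for team, odds in decimal_odds.items()}
    total = sum(raw.values())
    return {team: p / total for team, p in raw.items()}


def futures_gaps(pinnacle_decimal: Dict[str, float],
                 kalshi_cents: Dict[str, float]) -> Dict[str, float]:
    """Signed gap (Pinnacle fair probability minus Kalshi price) per team."""
    fair = devig_field(pinnacle_decimal)
    return {team: fair[team] - kalshi_cents[team] / 100
            for team in fair if team in kalshi_cents}
```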


5. Dashboard

MLB Desk Groups

MLB Player Props Desk Groups


6. Freshness Rules

Freshness Tracker

Passive poller runs every 5 minutes. Checks MAX(timestamp) per table across all databases. No scraper cooperation needed.
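The passive check can be sketched with the standard sqlite3 module. The function name is hypothetical, and it assumes every tracked table exposes a column literally named `timestamp`; tables without one are skipped.

```python
import sqlite3
from typing import Dict, Optional


def latest_timestamps(conn: sqlite3.Connection,
                      column: str = "timestamp") -> Dict[str, Optional[str]]:
    """Return {table: MAX(column)} for every user table that has the column."""
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'")]
    latest: Dict[str, Optional[str]] = {}
    for table in tables:
        # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk)
        columns = [row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')]
        if column in columns:
            latest[table] = conn.execute(
                f'SELECT MAX("{column}") FROM "{table}"').fetchone()[0]
    return latest
```

Because it only reads `MAX(timestamp)`, the poller needs nothing from the scrapers beyond the rows they already write.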

Thresholds

| Source type | Yellow (stale) | Red (alert) |
| --- | --- | --- |
| Game-day odds (6/8/10/2/6) | 30 min after window | 90 min after window |
| Daily stats (9 AM) | 60 min after expected | 180 min after expected |
| Futures (weekly) | 2 days after expected | 7 days after expected |
| Edge scanner | 30 min after window | 90 min after window |
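The thresholds map naturally onto a small classifier. In this sketch the dictionary keys, the function name, and the "green" label are hypothetical, and "after window" is interpreted as time elapsed since the expected run time.

```python
from datetime import datetime, timedelta
from typing import Optional

# (yellow, red) delays per source type, taken from the thresholds table.
THRESHOLDS = {
    "game_day_odds": (timedelta(minutes=30), timedelta(minutes=90)),
    "daily_stats": (timedelta(minutes=60), timedelta(minutes=180)),
    "futures": (timedelta(days=2), timedelta(days=7)),
    "edge_scanner": (timedelta(minutes=30), timedelta(minutes=90)),
}


def freshness_status(source_type: str, expected_at: datetime,
                     last_seen: Optional[datetime], now: datetime) -> str:
    """Classify a source as green, yellow, or red."""
    if last_seen is not None and last_seen >= expected_at:
        return "green"  # data for the expected run has arrived
    yellow, red = THRESHOLDS[source_type]
    overdue = now - expected_at
    if overdue >= red:
        return "red"
    if overdue >= yellow:
        return "yellow"
    return "green"  # still inside the grace period
```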

Alerts

Telegram alert after 2 consecutive misses (10 min of staleness). Consolidated message with color-coded status. Dead-man's switch on the poller itself.
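The "2 consecutive misses" rule can be sketched as a per-source counter. The class name is hypothetical, and firing only on the poll that reaches the threshold (rather than on every later stale poll) is an assumption about how repeat alerts are suppressed.

```python
from collections import defaultdict
from typing import DefaultDict


class MissTracker:
    """Counts consecutive stale polls per source and signals when to alert."""

    def __init__(self, misses_to_alert: int = 2) -> None:
        self.misses_to_alert = misses_to_alert
        self.misses: DefaultDict[str, int] = defaultdict(int)

    def record(self, source: str, is_stale: bool) -> bool:
        """Record one poll result; return True if an alert should fire now."""
        if not is_stale:
            self.misses[source] = 0  # a fresh poll resets the streak
            return False
        self.misses[source] += 1
        return self.misses[source] == self.misses_to_alert
```

With a 5-minute poll interval, the second consecutive stale poll corresponds to the 10 minutes of staleness noted above.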


7. Data Quality Rules

  1. No live/in-game data — Only pre-game prices. Games that have started are skipped.
  2. No settled prices — Bid ≤5 or bid ≥95 filtered out for game-day data.
  3. Same-day only — Game-day database only stores prices for today's games.
  4. Scan window tagging — Every row tagged with scan_type (6am/8am/10am/2pm/6pm/close).
  5. Doubleheader handling — game_number column (1 or 2), G1/G2 detected from Pinnacle game_id.
  6. Pinned contract filter — Implied curves skip contracts where bid ≤5 and ask ≥90 (dead markets).
  7. Rate limiting — 500ms-2s delay between Kalshi API calls to avoid 429 errors.
  8. Closing snapshots — Fire 1 minute before game start, not at game start.
  9. Expiry dates — Futures databases have expiry_date column (2026-09-27). Scanning stops after expiry.
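Rules 1, 2, 3, and 8 can be expressed as a row filter plus a snapshot-time helper. The row layout and function names are hypothetical, and timestamps are assumed to be naive datetimes already in ET.

```python
from datetime import datetime, timedelta
from typing import Any, Dict


def keep_price_row(row: Dict[str, Any], now: datetime) -> bool:
    """Apply the pre-game, not-settled, and same-day rules to one price row."""
    if row["game_start"] <= now:
        return False  # rule 1: game already started
    if row["bid"] <= 5 or row["bid"] >= 95:
        return False  # rule 2: effectively settled price
    if row["game_start"].date() != now.date():
        return False  # rule 3: not one of today's games
    return True


def closing_snapshot_time(game_start: datetime) -> datetime:
    """Rule 8: the closing snapshot fires 1 minute before game start."""
    return game_start - timedelta(minutes=1)
```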

8. Isolation Checklist Status

| # | Item | Status |
| --- | --- | --- |
| 1 | Databases isolated | DONE: 16 .db files. Clean schemas: mlb_probability_curves (27 cols with settlement), mlb_edges (37 cols MLB-only). Cross-DB reads use ATTACH pattern. Pregame/DRatings/Dimers/GameSim still in shared db (non-critical). |
| 2 | Scraper queue independent | DONE: Split into 3 staggered groups at 9:00/9:05/9:10 AM ET |
| 3 | Recovery queue | DONE: MLB scrapers in recovery, ToDo desks removed |
| 4 | Freshness tracking | DONE: 24+ sources in passive poller |
| 5 | Dashboard views | DONE: All pointing to isolated .db files |
| 6 | Edge scanner | DONE: Reads Pinnacle from pinnacle-mlb.db, Kalshi from kalshi-mlb-prices.db, writes to mlb-edges.db. Curves built before each scan. |
| 7 | Scan windows | DONE: All tagged 6am/8am/10am/2pm/6pm/close |
| 8 | Data cleanliness | DONE: No live, no settled, same-day, pinned filter |
| 9 | Column formatting | DONE: Human-readable tickers, dates, exec prices |
| 10 | Filters | DONE: Column filters on all MLB views |
| 11 | No cross-desk dependencies | DONE: Stats split into 3 independent groups (9:00/9:05/9:10 AM) |
| 12 | Doubleheader handling | DONE: game_number, G1/G2 detection |
| 13 | Alerts | DONE: Freshness poller with Telegram alerts |

Document Control

This spec will be updated as the desk is built. Final version when all checklist items pass.

Source: ~/edgeclaw/docs/mlb-desk-construction-spec.md