This is the complete data collection specification for the NCAAB desk in the research pipeline. It covers college basketball games traded on two prediction markets (Kalshi and Polymarket). An AI builder should be able to read this and know exactly what data to collect, from where, how often, and why.
We trade on Kalshi and Polymarket, prediction markets where you buy and sell contracts priced from 0 to 100 cents. We find mispriced lines by comparing prediction market prices to Pinnacle, the sharpest traditional sportsbook, whose odds with the vig removed serve as our estimate of fair value. When Kalshi/Polymarket prices are wrong relative to that fair value, we buy the cheap side.
Alt lines are the biggest edge. Main spreads (e.g., Team -5.5) are watched by many traders and stay fairly priced. But alternate lines (e.g., Team -12.5, Team -1.5, Over 155.5) have fewer traders, less liquidity, and update slower — so they're more often mispriced. We pull EVERY line offered.
Pinnacle is NOT a trading venue. We never place bets on Pinnacle. We only use their odds to calculate fair values (by removing the vig). All actual trading happens on Kalshi and Polymarket.
NCAAB is 2-way only (no draw). Standard multiplicative de-vig applies. NCAAB has less data available and bigger variance than NBA — fewer games per team, more unpredictable outcomes, which means both bigger edges AND bigger risks.
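The multiplicative de-vig mentioned above can be sketched as follows. This is a minimal illustration, assuming Pinnacle prices arrive as decimal odds; the function names `devig_two_way` and `edge_cents` are hypothetical and not part of any existing pipeline code.

```python
def devig_two_way(decimal_a: float, decimal_b: float) -> tuple[float, float]:
    """Multiplicative de-vig for a 2-way market.

    Each raw implied probability is 1/odds; dividing by their sum
    removes the bookmaker margin so the two sides add up to 1.
    """
    if decimal_a <= 1.0 or decimal_b <= 1.0:
        raise ValueError("decimal odds must be greater than 1.0")
    raw_a = 1.0 / decimal_a
    raw_b = 1.0 / decimal_b
    total = raw_a + raw_b  # exceeds 1.0 by the vig
    return raw_a / total, raw_b / total


def edge_cents(fair_prob: float, ask_cents: float) -> float:
    """Fair value in cents minus the prediction-market ask price."""
    return fair_prob * 100.0 - ask_cents


# Example: Pinnacle quotes 1.87 / 2.05 on a spread.
fair_a, fair_b = devig_two_way(1.87, 2.05)  # roughly 0.523 and 0.477
# If the matching contract is offered at 49 cents, the edge is about 3.3 cents.
edge = edge_cents(fair_a, 49.0)
```

A positive `edge_cents` value means the contract is offered below the de-vigged fair value; how large an edge must be before trading is a threshold this document does not specify.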
Pull EVERY line offered for every game. Not just the main spread — every alternate spread, every alternate total, every player prop.
What to pull per market/line:
Two snapshots: Early (morning ~11 AM ET) and Closing (~10 min before game).
What to pull:
How to get Pinnacle data:
After each game completes, collect:
This feeds Brier score calculations that grade every AI in the pipeline over time.
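For reference, the Brier score is the mean squared difference between forecast probabilities and binary outcomes (lower is better). A minimal sketch, with a hypothetical function name and input layout:

```python
def brier_score(forecasts: list[float], outcomes: list[int]) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes.

    forecasts: predicted probability that the event happens (0.0 to 1.0)
    outcomes:  1 if the event happened, 0 otherwise
    """
    if not forecasts or len(forecasts) != len(outcomes):
        raise ValueError("forecasts and outcomes must be non-empty and equal length")
    squared_errors = [(p - o) ** 2 for p, o in zip(forecasts, outcomes)]
    return sum(squared_errors) / len(squared_errors)


# A forecaster who always says 50% scores 0.25 regardless of outcomes.
baseline = brier_score([0.5, 0.5], [1, 0])
```

The 0.25 baseline is a useful yardstick: any AI in the pipeline scoring above it is doing worse than a coin flip on the graded contracts.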
After each Kalshi snapshot, all alt lines for a game are grouped by market type (spread, total) and converted into a probability curve. If Kalshi offers spread alt lines at -3.5, -5.5, -7.5, -9.5, each price is an implied probability at that threshold.
What gets stored per curve: Sport, game key, home/away teams, market type (spread or total), snapshot type (early/closing), array of threshold values, array of implied probabilities, number of points, mean probability, curve slope.
DB Table: market_implied_curves (in research-pipeline.db)
Frequency: Runs automatically after every Kalshi snapshot (every 30 minutes).
Minimum: 3+ alt lines required to form a curve.
Why it creates edge: Compare entire distributions instead of individual lines. Especially valuable for NCAAB where tournament lines have lots of alt spreads.
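The curve construction described above might look like the sketch below. The spec does not define how "curve slope" is computed, so this assumes an ordinary least-squares slope of implied probability against threshold; the `ImpliedCurve` class and `build_curve` function are hypothetical names, and prices are assumed to be in cents.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class ImpliedCurve:
    thresholds: list[float]
    probabilities: list[float]
    n_points: int
    mean_probability: float
    slope: float  # assumption: least-squares slope, probability per point of line


def build_curve(lines: Sequence[tuple[float, float]]) -> Optional[ImpliedCurve]:
    """Build a curve from (threshold, price_in_cents) pairs.

    Returns None when fewer than 3 alt lines are available,
    or when all thresholds are identical (slope undefined).
    """
    if len(lines) < 3:
        return None
    ordered = sorted(lines)
    xs = [threshold for threshold, _ in ordered]
    ps = [cents / 100.0 for _, cents in ordered]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_p = sum(ps) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        return None
    sxy = sum((x - mean_x) * (p - mean_p) for x, p in zip(xs, ps))
    return ImpliedCurve(xs, ps, n, mean_p, sxy / sxx)


# Example with illustrative prices for alt spreads at -3.5, -5.5, -7.5, -9.5.
curve = build_curve([(-3.5, 62), (-5.5, 54), (-7.5, 46), (-9.5, 38)])
```

The remaining stored fields (sport, game key, teams, market type, snapshot type) are metadata that would be attached when the row is written to `market_implied_curves`.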
Every minute during a live game, take a complete snapshot of all prediction market prices:
Game State: Timestamp, current half, game clock, current score.
Kalshi + Polymarket: Every ML contract, every spread (main + all alts), every O/U (main + all alts), every player prop — price + volume + contracts available for each.
We do NOT track Pinnacle live. During games, we compare prediction market prices to our own model's fair value (calculated from game state + pre-game Pinnacle baseline).
FanDuel is the sharp book for player props (separate from Pinnacle which anchors game lines). Note: NCAAB prop coverage is thinner than NBA — fewer players have lines, fewer alt lines offered.
What to pull per prop: Player name, team, game, prop type (points, rebounds, assists), over/under line, over odds, under odds, all alt lines offered.
Frequency: Two snapshots — Early (~11 AM ET) and Closing (~30 min before game).
DraftKings Salaries: Player name, position, team, game, DK salary. Once daily when slates are published.
Ownership Projections: From 2-3 sources (RotoGrinders, FantasyLabs, etc.) via research pipeline.
| Category | Stats | Source |
|---|---|---|
| Efficiency | Adjusted offensive efficiency (AdjOE), adjusted defensive efficiency (AdjDE), adjusted tempo | Barttorvik/T-Rank (scrape) |
| Ratings | BPI, Barthag (power rating), WAB (wins above bubble) | ESPN API + Barttorvik |
| Four Factors | eFG% (shooting efficiency), ORB% (offensive rebounds), TOV% (turnovers), FTR (free throw rate), 3PT rate + defensive versions of all | Barttorvik |
| Situational | Overall/conference/home-away win%, 1st half vs 2nd half win% | ESPN + calculated |
Multiple independent models predict each game. We average them and track disagreement range (wide disagreement = more uncertainty = more potential for mispriced alt lines). NCAAB model disagreement tends to be wider than NBA because of less data per team, which means more edge opportunity on alt lines.
| Model | Source |
|---|---|
| T-Rank | barttorvik.com (scrape) |
| ESPN BPI | ESPN API (free) |
| DRatings | dratings.com (scrape) |
| GameSim | gamesim.com (scrape) |
| Dimers | dimers.com (scrape) |
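Averaging the models and tracking their disagreement range can be sketched as below. The input is assumed to be a mapping from model name to that model's predicted home margin; the function name and the sample numbers are illustrative only.

```python
from statistics import fmean


def model_consensus(predictions: dict[str, float]) -> dict[str, float]:
    """Average the model predictions and report the disagreement range.

    predictions: model name -> predicted value (e.g. home margin in points)
    """
    if not predictions:
        raise ValueError("at least one model prediction is required")
    values = list(predictions.values())
    return {
        "mean": fmean(values),
        "range": max(values) - min(values),  # wide range = more uncertainty
        "n_models": float(len(values)),
    }


# Illustrative margins, not real model output.
summary = model_consensus({
    "T-Rank": 6.0,
    "ESPN BPI": 4.5,
    "DRatings": 7.5,
    "GameSim": 5.0,
    "Dimers": 7.0,
})
# summary["mean"] is 6.0 and summary["range"] is 3.0
```

Dimers publishes win probabilities rather than margins, so in practice each model's output would need converting to a common unit before averaging; that conversion is outside this sketch.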
How teams perform in the final 5 minutes of close games (within 5 points). College basketball has even more variance here than NBA due to less experienced players, coaching timeouts carrying more weight, and bonus free throw situations.
| Stat | What It Measures |
|---|---|
| Clutch scoring differential | Points scored minus points allowed in final 5 min of close games |
| Clutch turnover rate | Turnovers per possession in clutch situations |
| Clutch foul rate | Fouls committed in clutch situations |
| Clutch FT% | Free throw shooting in clutch situations (especially important — college FT% is lower and more variable) |
| Clutch win rate | Win % when game is within 5 points with 5 min left |
Source: ESPN API play-by-play data (free). Filter for score margin <= 5 with <= 5:00 remaining.
Frequency: Once daily, calculated from rolling season game logs.
DB Table: ncaab_clutch_stats
Why it creates edge: Teams that choke late cover tight spreads but fail on fat alt lines. Kalshi prices alt lines off overall stats, not clutch tendencies. NCAAB free throw shooting in pressure situations is especially unreliable and creates exploitable patterns.
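The clutch filter (margin <= 5 with <= 5:00 remaining) and the clutch scoring differential could be computed along these lines. The play dictionary keys (`period`, `seconds_left`, `home_score`, `away_score`, `scoring_side`, `points`) are hypothetical and do not reflect the actual ESPN play-by-play schema. The sketch also assumes that scores are taken before each play and that overtime periods count as clutch time.

```python
CLUTCH_SECONDS = 300  # final 5:00 of the period
CLUTCH_MARGIN = 5     # score within 5 points


def is_clutch(period: int, seconds_left: int, home_score: int, away_score: int) -> bool:
    """True in the last 5:00 of the 2nd half (or any overtime) of a close game."""
    late = period >= 2 and seconds_left <= CLUTCH_SECONDS
    close = abs(home_score - away_score) <= CLUTCH_MARGIN
    return late and close


def clutch_differential(plays: list[dict], side: str) -> int:
    """Points scored minus points allowed by `side` during clutch plays."""
    scored = 0
    allowed = 0
    for play in plays:
        if not is_clutch(play["period"], play["seconds_left"],
                         play["home_score"], play["away_score"]):
            continue
        if play["scoring_side"] == side:
            scored += play["points"]
        else:
            allowed += play["points"]
    return scored - allowed
```

Clutch turnover rate, foul rate, and FT% would reuse `is_clutch` as the filter and change only what is counted.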
NCAAB teams play ~30-35 games per season (vs 82 in NBA). This means:
During the NCAA tournament, prediction market volume spikes and casual money floods in. This typically INCREASES mispricings because casual traders overreact to narratives (Cinderella stories, brand-name programs). Tournament games should get extra attention.
Track player prop line movement from FanDuel. Sharp money flows into props first because they're lower-limit markets. Team totals and alt spreads follow 30-60 minutes later. (Less applicable to NCAAB due to thinner prop markets, but still monitor star players on major games.)
Monitor for breaking injury/lineup news faster than prediction markets react. Beat reporters on X break scratches 10-15 minutes before Kalshi/Polymarket prices adjust.
Real-time monitoring of X for injury leaks, lineup news, and narrative shifts. Information flows: beat reporters -> sharp sportsbooks (30 sec) -> betting Twitter (5 min) -> prediction markets (10-15 min). We want the 30-second tier.
Track momentum indicators during live games: scoring runs (e.g., a 15-2 run), foul trouble for key players, pace variance vs expected, timeout patterns. If a game is running 15% faster pace than expected through the 1st half, alt overs are mispriced before Kalshi catches up.
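The pace comparison in the example above can be expressed as a simple ratio. This sketch assumes a possession count is available from play-by-play and that the pre-game expectation is an adjusted tempo in possessions per 40 minutes; the function name is hypothetical.

```python
def pace_deviation(possessions_so_far: float, minutes_elapsed: float,
                   expected_tempo_per_40: float) -> float:
    """Fractional difference between observed and expected pace.

    Returns 0.15 when the game is running 15% faster than expected,
    and a negative value when it is running slower.
    """
    if minutes_elapsed <= 0 or expected_tempo_per_40 <= 0:
        raise ValueError("minutes_elapsed and expected tempo must be positive")
    observed_per_40 = possessions_so_far * 40.0 / minutes_elapsed
    return observed_per_40 / expected_tempo_per_40 - 1.0


# At halftime (20 minutes), 39.1 possessions against an expected tempo of 68
# per 40 minutes is a pace about 15% above expectation.
deviation = pace_deviation(39.1, 20.0, 68.0)
```

Early in a game the estimate is noisy because few possessions have been played, so any alert threshold would likely need to depend on minutes elapsed.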
Full bid/ask depth from Kalshi WebSocket API and Polymarket blockchain data. Tells us if an edge is actually tradeable and enables market making.
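Whether an edge is tradeable depends on how many contracts can be bought at or below the price where the edge still exists. A minimal sketch, assuming the ask side of the book has already been normalized into (price in cents, size) pairs; this layout is hypothetical and is not the Kalshi or Polymarket wire format.

```python
from typing import Optional


def fillable_size(asks: list[tuple[int, int]],
                  limit_cents: int) -> tuple[int, Optional[float]]:
    """Walk the ask side from cheapest to most expensive.

    Returns (contracts available at or below limit_cents,
             average fill price in cents, or None if nothing is fillable).
    """
    total_size = 0
    total_cost = 0
    for price, size in sorted(asks):
        if price > limit_cents:
            break
        total_size += size
        total_cost += price * size
    average = total_cost / total_size if total_size else None
    return total_size, average


# With fair value near 49 cents, only asks at 48 or lower are worth taking.
size, avg_price = fillable_size([(47, 100), (48, 50), (50, 200)], 48)
# size is 150 contracts
```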
Calculate how volatile each team's scoring is. High-variance teams with positive skew are more likely to cover fat alt spreads. Calculated from existing game log data. Especially important in NCAAB where variance is naturally higher.
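One way to quantify scoring volatility and skew from game logs is the standard deviation and the third standardized moment. The sketch below uses population formulas for both; that choice and the function name are assumptions, since the spec does not name a specific estimator.

```python
from statistics import fmean, pstdev


def scoring_profile(points: list[float]) -> tuple[float, float, float]:
    """Return (mean, standard deviation, skewness) of a team's points per game.

    Skewness is the population third standardized moment: positive values
    mean occasional very high-scoring games pull the right tail out.
    """
    if len(points) < 3:
        raise ValueError("need at least 3 games")
    mu = fmean(points)
    sigma = pstdev(points)
    if sigma == 0:
        return mu, 0.0, 0.0
    skew = fmean([(x - mu) ** 3 for x in points]) / sigma ** 3
    return mu, sigma, skew


# Three ordinary games and one outburst produce positive skew.
mean_pts, sd_pts, skew_pts = scoring_profile([60, 60, 60, 100])
```

With only 30-35 games per season, skewness estimates are noisy, so they are probably better used as a ranking signal than as a precise input.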
Go beyond simple rest-day flags. Track conference travel patterns, mid-week vs weekend games, back-to-back-to-back weeks. College teams have less depth, so fatigue compounds faster.
Model the full distribution of each team's scoring margins. Teams with fat tails (frequent blowouts) are systematically underpriced on extreme alt lines. Pure math on existing game logs.
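A simple check for fat tails is to compare how often a team actually wins by at least some margin against what a normal distribution with the same mean and standard deviation would predict. This is one possible sketch, not the pipeline's defined method; the function name and sample margins are illustrative.

```python
from statistics import NormalDist, fmean, pstdev


def tail_comparison(margins: list[float], threshold: float) -> tuple[float, float]:
    """Return (empirical, normal-model) probability of margin >= threshold.

    An empirical value well above the normal value suggests a fat right
    tail, i.e. more blowouts than a bell curve would imply.
    """
    if len(margins) < 2:
        raise ValueError("need at least 2 games")
    empirical = sum(1 for m in margins if m >= threshold) / len(margins)
    mu = fmean(margins)
    sigma = pstdev(margins)
    if sigma == 0:
        # Degenerate case: every game had the same margin.
        return empirical, 1.0 if mu >= threshold else 0.0
    normal = 1.0 - NormalDist(mu, sigma).cdf(threshold)
    return empirical, normal


# Illustrative margins: mostly close games plus two blowouts.
emp, norm = tail_comparison([2, -3, 5, 1, -4, 25, 3, -2, 28, 0], 20)
# emp is 0.2, while the normal model gives roughly 0.09
```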
Break down when teams score their points by half. A team that dominates the 2nd half is more likely to cover late alt spreads. Also affects live trading.
| Source | What | Access |
|---|---|---|
| Kalshi API | ML, spread, O/U, all alt lines, player props, volume, order book (WebSocket) | API key (have it) |
| Polymarket | Event markets, prices, volume, on-chain order data | Free API |
| ESPN API | BPI, matchup predictor, standings, game times, live play-by-play | Free, no key |
| Odds API | Pinnacle, DK, FanDuel odds (fallback only) | API key (have it), rate-limited |
| Source | What |
|---|---|
| Pinnacle | Sharp lines — primary fair value source (pre-game only) |
| FanDuel | Player prop lines (thinner for NCAAB than NBA) |
| DraftKings | Opening/current lines |
| Barttorvik/T-Rank | AdjOE, AdjDE, Barthag, WAB, tempo, schedule, Four Factors |
| DRatings | Standard + inflation ratings |
| GameSim | Spread, total, win probability |
| Dimers | Win probabilities |
| Data Type | Frequency | Source | Notes |
|---|---|---|---|
| Team statistics | Once daily, ~12 PM ET weekdays / ~10 AM ET weekends | Barttorvik, ESPN API | Full refresh of all team stats |
| Model predictions | Once daily, ~12 PM ET weekdays / ~10 AM ET weekends | T-Rank, ESPN BPI, DRatings, GameSim, Dimers | All rating sites |
| Injury/lineup news | Continuous monitoring | X API, team feeds, ESPN | Speed is everything |
| Results | ~4 AM ET next day | ESPN API + Kalshi/Polymarket settlement | Final scores + all derived metrics |
Pinnacle: If scraped free, every 30-60 min. If using the Odds API fallback, 3 times only (8 AM, 12 PM, ~10 min before game).
DraftKings / FanDuel: Same schedule as Pinnacle.
Kalshi + Polymarket (FREE): Every 30 minutes from lines open until game time.
| Data Type | Frequency | Source |
|---|---|---|
| Prediction market prices | Every 1 minute | Kalshi + Polymarket only |
| Game state | Every 1 minute | ESPN API |
| Player prop movement | Every 1 minute | Kalshi + Polymarket |