NCAAB Desk — Data Collection Spec (Mar 26, 2026)

What This Document Is

This is the complete data collection specification for the NCAAB desk in the research pipeline. It covers college basketball games traded on two prediction markets (Kalshi and Polymarket). An AI builder should be able to read this and know exactly what data to collect, from where, how often, and why.

The Business Model (Read This First)

We trade on Kalshi and Polymarket — prediction markets where you buy/sell contracts priced 0-100 cents. We find mispriced lines by comparing prediction market prices to Pinnacle (the sharpest traditional sportsbook — their odds represent "true" fair value). When Kalshi/Polymarket prices are wrong relative to Pinnacle's fair value, we buy the cheap side.

Alt lines are the biggest edge. Main spreads (e.g., Team -5.5) are watched by many traders and stay fairly priced. Alternate lines (e.g., Team -12.5, Team -1.5, Over 155.5) have fewer traders, less liquidity, and slower price updates, so they are mispriced more often. We pull EVERY line offered.

Pinnacle is NOT a trading venue. We never place bets on Pinnacle. We only use their odds to calculate fair values (by removing the vig). All actual trading happens on Kalshi and Polymarket.

NCAAB is 2-way only (no draw). Standard multiplicative de-vig applies. NCAAB has less data available and bigger variance than NBA — fewer games per team, more unpredictable outcomes, which means both bigger edges AND bigger risks.
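As a rough illustration of the de-vig step, here is a minimal sketch of multiplicative de-vig on a 2-way market, followed by the edge against a prediction market ask. It assumes Pinnacle odds arrive in decimal format and ignores exchange fees. The function names and the sample prices are hypothetical, not part of the pipeline.

```python
def devig_two_way(odds_a: float, odds_b: float) -> tuple[float, float]:
    """Multiplicative de-vig for a 2-way market quoted in decimal odds."""
    raw_a = 1.0 / odds_a  # implied probability, vig still included
    raw_b = 1.0 / odds_b
    overround = raw_a + raw_b  # exceeds 1.0 by the book's margin
    # Scale both sides by the same factor so they sum to exactly 1.
    return raw_a / overround, raw_b / overround


def edge_cents(fair_prob: float, ask_cents: float) -> float:
    """Expected profit per contract, in cents, when buying at the ask (fees ignored)."""
    return fair_prob * 100.0 - ask_cents


# Hypothetical alt line: Pinnacle quotes 1.50 / 2.70, Kalshi asks 60 cents on side A.
fair_a, fair_b = devig_two_way(1.50, 2.70)
print(round(fair_a, 4), round(fair_b, 4), round(edge_cents(fair_a, 60), 2))
```

A positive edge_cents value means the contract is priced below the de-vigged fair value. A real threshold for acting on it would also need to account for fees and liquidity.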


SHARED DATA (collected for all sports)

Prediction Market Lines — Kalshi + Polymarket

Pull EVERY line offered for every game. Not just the main spread — every alternate spread, every alternate total, every player prop.

What to pull per market/line:

Two snapshots: Early (morning ~11 AM ET) and Closing (~10 min before game).

Sharp Book Fair Values — Pinnacle

What to pull:

How to get Pinnacle data:

Other Sportsbook Lines — DraftKings, FanDuel

Results & Grading

After each game completes, collect:

This feeds Brier score calculations that grade every AI in the pipeline over time.
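For reference, the Brier score is the mean squared error between forecast probabilities and 0/1 outcomes (lower is better). The sketch below assumes each graded prediction is stored as a probability plus a binary result. The function name is illustrative only.

```python
def brier_score(forecasts: list[float], outcomes: list[int]) -> float:
    """Mean squared difference between forecast probabilities and 0/1 outcomes."""
    if not forecasts or len(forecasts) != len(outcomes):
        raise ValueError("forecasts and outcomes must be non-empty and equal length")
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)


# Hypothetical graded picks: probability assigned to the side, and whether it hit.
print(brier_score([0.70, 0.55, 0.30], [1, 0, 0]))
```

A forecaster who always says 0.5 scores 0.25, so that is a useful baseline when comparing models.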

Market-Implied Probability Curves (extracted from Kalshi alt lines)

After each Kalshi snapshot, all alt lines for a game are grouped by market type (spread, total) and converted into a probability curve. If Kalshi offers spread alt lines at -3.5, -5.5, -7.5, -9.5, each price is an implied probability at that threshold.

What gets stored per curve: Sport, game key, home/away teams, market type (spread or total), snapshot type (early/closing), array of threshold values, array of implied probabilities, number of points, mean probability, curve slope.

DB Table: market_implied_curves (in research-pipeline.db)
Frequency: Runs automatically after every Kalshi snapshot (every 30 minutes).
Minimum: 3+ alt lines required to form a curve.
Why it creates edge: Compare entire distributions instead of individual lines. Especially valuable for NCAAB where tournament lines have lots of alt spreads.
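The curve construction can be sketched as below. This is a minimal illustration, not the pipeline's actual code: the ImpliedCurve fields mirror the stored values listed above, but the class and function names are hypothetical, and "curve slope" is assumed here to mean the least-squares slope of probability against threshold, which the spec does not define.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ImpliedCurve:
    market_type: str            # "spread" or "total"
    thresholds: list[float]     # sorted ascending
    implied_probs: list[float]  # contract price / 100
    n_points: int
    mean_prob: float
    slope: float                # assumed: least-squares slope, probability per point


def build_curve(market_type: str, lines: list[tuple[float, float]]) -> Optional[ImpliedCurve]:
    """lines holds (threshold, price_in_cents) pairs for one game and market type."""
    if len(lines) < 3:
        return None  # spec minimum: 3+ alt lines to form a curve
    pts = sorted(lines)
    xs = [t for t, _ in pts]
    ys = [cents / 100.0 for _, cents in pts]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0  # guard against identical thresholds
    return ImpliedCurve(market_type, xs, ys, n, mean_y, slope)


# Hypothetical Kalshi alt spreads for one game: (threshold, price in cents).
curve = build_curve("spread", [(-3.5, 60), (-5.5, 50), (-7.5, 40), (-9.5, 30)])
print(curve)
```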

Live In-Game Data (every 1 minute per game)

Every minute during a live game, take a complete snapshot of all prediction market prices:

Game State: Timestamp, current half, game clock, current score.

Kalshi + Polymarket: Every ML contract, every spread (main + all alts), every O/U (main + all alts), every player prop — price + volume + contracts available for each.

We do NOT track Pinnacle live. During games, we compare prediction market prices to our own model's fair value (calculated from game state + pre-game Pinnacle baseline).

Player Props Layer

FanDuel is the sharp book for player props (separate from Pinnacle which anchors game lines). Note: NCAAB prop coverage is thinner than NBA — fewer players have lines, fewer alt lines offered.

What to pull per prop: Player name, team, game, prop type (points, rebounds, assists), over/under line, over odds, under odds, all alt lines offered.

Frequency: Two snapshots — Early (~11 AM ET) and Closing (~30 min before game).

DFS Layer

DraftKings Salaries: Player name, position, team, game, DK salary. Collected once daily when slates are published.
Ownership Projections: From 2-3 sources (RotoGrinders, FantasyLabs, etc.) via research pipeline.


NCAAB-SPECIFIC DATA

Team Statistics

Category | Stats | Source
Efficiency | Adjusted offensive efficiency (AdjOE), adjusted defensive efficiency (AdjDE), adjusted tempo | Barttorvik/T-Rank (scrape)
Ratings | BPI, Barthag (power rating), WAB (wins above bubble) | ESPN API + Barttorvik
Four Factors | eFG% (shooting efficiency), ORB% (offensive rebounds), TOV% (turnovers), FTR (free throw rate), 3PT rate + defensive versions of all | Barttorvik
Situational | Overall/conference/home-away win%, 1st half vs 2nd half win% | ESPN + calculated

Ratings & Win Probability Models

Multiple independent models predict each game. We average them and track disagreement range (wide disagreement = more uncertainty = more potential for mispriced alt lines). NCAAB model disagreement tends to be wider than NBA because of less data per team, which means more edge opportunity on alt lines.
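The averaging and disagreement tracking amounts to a mean and a max-minus-min range across model win probabilities. A minimal sketch follows. The dictionary layout and the sample probabilities are made up for illustration.

```python
from statistics import mean


def consensus(win_probs: dict[str, float]) -> tuple[float, float]:
    """Return (average win probability, disagreement range) across models."""
    values = list(win_probs.values())
    if not values:
        raise ValueError("need at least one model prediction")
    return mean(values), max(values) - min(values)


# Hypothetical home-team win probabilities from each model for one game.
probs = {"T-Rank": 0.62, "ESPN BPI": 0.58, "DRatings": 0.66, "GameSim": 0.60, "Dimers": 0.64}
avg, spread = consensus(probs)
print(round(avg, 3), round(spread, 3))
```

A wide range flags games where the models disagree and alt lines deserve a closer look.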

Model | Source
T-Rank | barttorvik.com (scrape)
ESPN BPI | ESPN API (free)
DRatings | dratings.com (scrape)
GameSim | gamesim.com (scrape)
Dimers | dimers.com (scrape)

Clutch/Late-Game Execution

How teams perform in the final 5 minutes of close games (within 5 points). College basketball has even more variance here than NBA due to less experienced players, coaching timeouts carrying more weight, and bonus free throw situations.

Stat | What It Measures
Clutch scoring differential | Points scored minus points allowed in final 5 min of close games
Clutch turnover rate | Turnovers per possession in clutch situations
Clutch foul rate | Fouls committed in clutch situations
Clutch FT% | Free throw shooting in clutch situations (especially important: college FT% is lower and more variable)
Clutch win rate | Win % when game is within 5 points with 5 min left

Source: ESPN API play-by-play data (free). Filter for score margin <= 5 with <= 5:00 remaining.
Frequency: Once daily, calculated from rolling season game logs.
DB Table: ncaab_clutch_stats
Why it creates edge: Teams that choke late cover tight spreads but fail on fat alt lines. Kalshi prices alt lines off overall stats, not clutch tendencies. NCAAB free throw shooting in pressure situations is especially unreliable and creates exploitable patterns.
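The clutch filter (margin <= 5 with <= 5:00 remaining) can be sketched as follows for the scoring differential. The Play record is a hypothetical, simplified shape, not the ESPN play-by-play schema, and seconds_remaining is assumed to count down within the final period. A scoring change is attributed to clutch time when the game was within 5 points before the play and the play happened inside the last 5 minutes.

```python
from dataclasses import dataclass


@dataclass
class Play:
    seconds_remaining: int  # assumed: seconds left in the final period
    home_score: int         # score after the play
    away_score: int


def clutch_differential(plays: list[Play]) -> int:
    """Home points minus away points scored while the game was in a clutch state."""
    diff = 0
    for prev, cur in zip(plays, plays[1:]):
        close = abs(prev.home_score - prev.away_score) <= 5  # margin before the play
        late = cur.seconds_remaining <= 300                   # inside the final 5:00
        if close and late:
            diff += (cur.home_score - prev.home_score) - (cur.away_score - prev.away_score)
    return diff


# Hypothetical late-game sequence, ordered by time.
plays = [
    Play(400, 60, 58), Play(290, 62, 58), Play(200, 62, 61),
    Play(100, 65, 61), Play(50, 67, 61), Play(10, 69, 61),
]
print(clutch_differential(plays))
```

The last play is excluded because the margin had already grown past 5 points before it happened.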


NCAAB-SPECIFIC CONSIDERATIONS

Less Data, Bigger Variance

NCAAB teams play ~30-35 games per season (vs 82 in NBA). This means:

Tournament / March Madness

During the NCAA tournament, prediction market volume spikes and casual money floods in. This typically INCREASES mispricings because casual traders overreact to narratives (Cinderella stories, brand-name programs). Tournament games should get extra attention.


SHARED DATA GAPS (apply to all sports including NCAAB)

Gap #1: Player Prop Data as Leading Indicator

Track player prop line movement from FanDuel. Sharp money flows into props first because they're lower-limit markets. Team totals and alt spreads follow 30-60 minutes later. (Less applicable to NCAAB due to thinner prop markets, but still monitor star players on major games.)

Gap #2: Real-Time Injury & Lineup Speed

Monitor for breaking injury/lineup news faster than prediction markets react. Beat reporters on X break scratches 10-15 minutes before Kalshi/Polymarket prices adjust.

Gap #3: Social Media & News Sentiment

Real-time monitoring of X for injury leaks, lineup news, and narrative shifts. Information flows: beat reporters -> sharp sportsbooks (30 sec) -> betting Twitter (5 min) -> prediction markets (10-15 min). We want the 30-second tier.

Gap #4: In-Game Contextual Flow Data

Track momentum indicators during live games: scoring runs (e.g., a 15-2 run), foul trouble for key players, pace variance vs expected, timeout patterns. If a game is running 15% faster pace than expected through the 1st half, alt overs are mispriced before Kalshi catches up.
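The pace check in the example above reduces to projecting possessions over a full game and comparing against the pre-game expectation. The sketch below assumes a 40-minute regulation game and takes the possession count as an input. The function name, the 15% flag threshold as a default, and the sample numbers are illustrative.

```python
def pace_deviation(possessions_so_far: float, minutes_elapsed: float,
                   expected_tempo: float, game_minutes: float = 40.0) -> float:
    """Fractional deviation of projected full-game possessions from expected tempo."""
    if minutes_elapsed <= 0 or expected_tempo <= 0:
        raise ValueError("minutes_elapsed and expected_tempo must be positive")
    projected = possessions_so_far / minutes_elapsed * game_minutes
    return projected / expected_tempo - 1.0


def is_running_fast(deviation: float, threshold: float = 0.15) -> bool:
    """Flag games running at least `threshold` faster than expected."""
    return deviation >= threshold


# Hypothetical: 40 possessions at halftime against an expected tempo of 68 per 40 min.
dev = pace_deviation(40, 20.0, 68.0)
print(round(dev, 3), is_running_fast(dev))
```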

Gap #5: Order Book Depth & Liquidity

Full bid/ask depth from Kalshi WebSocket API and Polymarket blockchain data. Tells us if an edge is actually tradeable and enables market making.
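To show how depth answers the "is this edge tradeable" question, the sketch below walks the ask side of a book and totals the contracts available at or below a price cap. The (price, size) tuple layout is an assumption for illustration, not the Kalshi or Polymarket message format.

```python
def fillable_at_or_below(asks: list[tuple[int, int]], max_price_cents: int) -> tuple[int, float]:
    """Return (contracts available, average fill price in cents) up to the price cap.

    asks is a list of (price_cents, size) levels in any order.
    """
    contracts = 0
    cost = 0
    for price, size in sorted(asks):  # cheapest levels fill first
        if price > max_price_cents:
            break
        contracts += size
        cost += price * size
    avg_price = cost / contracts if contracts else 0.0
    return contracts, avg_price


# Hypothetical ask ladder; fair value says we can pay up to 45 cents.
book = [(47, 200), (42, 100), (44, 50)]
print(fillable_at_or_below(book, 45))
```

If the fillable size at the cap is tiny, the displayed edge is not worth much in practice.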

Gap #6: Team Variance & Skewness Metrics

Calculate how volatile each team's scoring is. High-variance teams with positive skew are more likely to cover fat alt spreads. Calculated from existing game log data. Especially important in NCAAB where variance is naturally higher.
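A minimal sketch of the volatility and skew calculation on a team's game-log scores is shown below. It uses population moments for simplicity. Whether the pipeline uses sample or population estimators is not specified, so treat that choice and the function names as assumptions.

```python
from statistics import mean, pstdev


def scoring_volatility(points: list[float]) -> float:
    """Population standard deviation of per-game points."""
    return pstdev(points)


def skewness(points: list[float]) -> float:
    """Population skewness: average cubed z-score. Positive means a longer upper tail."""
    m = mean(points)
    s = pstdev(points)
    if s == 0:
        return 0.0  # no variation, so no skew
    return sum(((x - m) / s) ** 3 for x in points) / len(points)


# Hypothetical game log with one big scoring outburst.
log = [61, 64, 66, 63, 62, 95]
print(round(scoring_volatility(log), 2), round(skewness(log), 2))
```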

Gap #7: Advanced Fatigue & Travel Metrics

Go beyond simple rest-day flags. Track conference travel patterns, mid-week vs weekend games, back-to-back-to-back weeks. College teams have less depth, so fatigue compounds faster.

Gap #10: Game Script Volatility / Fat Tails

Model the full distribution of each team's scoring margins. Teams with fat tails (frequent blowouts) are systematically underpriced on extreme alt lines. Pure math on existing game logs.
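One simple way to work with the full margin distribution is an empirical cover rate at each alt line plus excess kurtosis as a fat-tail indicator. The sketch below is one possible approach under those assumptions, not the pipeline's model. With only ~30-35 games per team, the empirical rate is noisy and would likely need smoothing.

```python
from statistics import mean, pstdev


def empirical_cover_rate(margins: list[float], win_by_more_than: float) -> float:
    """Share of past games where the team's margin exceeded the given number."""
    if not margins:
        raise ValueError("need at least one game")
    return sum(1 for m in margins if m > win_by_more_than) / len(margins)


def excess_kurtosis(margins: list[float]) -> float:
    """Population excess kurtosis. Values above 0 suggest fatter tails than a normal."""
    m = mean(margins)
    s = pstdev(margins)
    if s == 0:
        return 0.0
    return sum(((x - m) / s) ** 4 for x in margins) / len(margins) - 3.0


# Hypothetical scoring margins (team points minus opponent points).
margins = [-20, -3, 2, 5, 25, 30]
print(empirical_cover_rate(margins, 12.5), round(excess_kurtosis(margins), 2))
```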

Gap #11: Period-Specific Scoring Distribution

Break down when teams score their points by half. A team that dominates the 2nd half is more likely to cover late alt spreads. Also affects live trading.


DATA SOURCES SUMMARY

FREE APIs

Source | What | Access
Kalshi API | ML, spread, O/U, all alt lines, player props, volume, order book (WebSocket) | API key (have it)
Polymarket | Event markets, prices, volume, on-chain order data | Free API
ESPN API | BPI, matchup predictor, standings, game times, live play-by-play | Free, no key
Odds API | Pinnacle, DK, FanDuel odds (fallback only) | API key (have it), rate-limited

NEED SCRAPING

Source | What
Pinnacle | Sharp lines, the primary fair value source (pre-game only)
FanDuel | Player prop lines (thinner for NCAAB than NBA)
DraftKings | Opening/current lines
Barttorvik/T-Rank | AdjOE, AdjDE, Barthag, WAB, tempo, schedule, Four Factors
DRatings | Standard + inflation ratings
GameSim | Spread, total, win probability
Dimers | Win probabilities

CUSTOM CALCULATIONS


COLLECTION SCHEDULE

Data Type | Frequency | Source | Notes
Team statistics | Once daily, ~12 PM ET weekdays / ~10 AM ET weekends | Barttorvik, ESPN API | Full refresh of all team stats
Model predictions | Once daily, ~12 PM ET weekdays / ~10 AM ET weekends | T-Rank, ESPN BPI, DRatings, GameSim, Dimers | All rating sites
Injury/lineup news | Continuous monitoring | X API, team feeds, ESPN | Speed is everything
Results | ~4 AM ET next day | ESPN API + Kalshi/Polymarket settlement | Final scores + all derived metrics

Pre-Game Odds Schedule

Pinnacle: If scraped free: every 30-60 min. If Odds API fallback: 3 times only (8 AM, 12 PM, ~10 min before game).
DraftKings / FanDuel: Same schedule as Pinnacle.
Kalshi + Polymarket (FREE): Every 30 minutes from lines open until game time.

Live In-Game Schedule

Data Type | Frequency | Source
Prediction market prices | Every 1 minute | Kalshi + Polymarket only
Game state | Every 1 minute | ESPN API
Player prop movement | Every 1 minute | Kalshi + Polymarket

VPS Cron Schedule (all times ET)

Source: ~/.claude/projects/-home-ubuntu-edgeclaw/memory/ncaab-desk-data-inventory.md