Version: 1.0
Date: 2026-03-30
Status: AUTHORITATIVE (governs the system until explicitly superseded)
Panel: Opus 4.6 · Sonnet 4.6 · Gemini 3.1 Pro · Grok 4.2 Reasoning · gpt-oss-120b
Three blind AI analysts (Claude Sonnet 4.6, Gemini 3.1 Pro Preview, gpt-oss-120b) independently set sports betting lines from fundamentals only; they see no market prices. Each produces spread, total, and moneyline (ML) lines, the probability that its chosen side covers, a conviction score (1-5), supporting evidence, and an upset scenario.
Their output is compared against a mathematical edge scanner (Pinnacle de-vig probability curves) and Kalshi market prices. A separate Opus 4.6 instance makes the final verdict with position sizing.
Model Roster (locked):
No substitutions without a full panel ruling.
The Brier score B is the primary per-prediction accuracy metric:
B = (1/n) * SUM( (p_i - o_i)^2 )
Where p_i is the analyst's stated probability, o_i is the actual outcome (1 = correct, 0 = incorrect), and n is the number of predictions scored.
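As a minimal sketch of the formula above, the function below computes B from parallel lists of stated probabilities and 0/1 outcomes. The function name `brier_score` is illustrative and not part of the specification.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared difference between stated probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes) or len(probs) == 0:
        raise ValueError("probs and outcomes must be non-empty and of equal length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


# Always stating 0.50 scores 0.25 regardless of outcomes (the baseline benchmark).
print(brier_score([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 1]))  # 0.25
```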
Stratified Brier scores computed for every combination of:
For each analyst, bin all predictions into 10 equal-width buckets:
Bucket k: predictions where p_i in [(k-1)/10, k/10), k = 1..10 (bucket 10 also includes p_i = 1.0)
For each bucket:
conf_k = mean(p_i) for all i in bucket k
acc_k = mean(o_i) for all i in bucket k
n_k = count of predictions in bucket k
gap_k = conf_k - acc_k (positive = overconfident)
A bucket needs n_k >= 15 before it is used in ECE.
ECE = SUM_k [ (n_k / n) * |gap_k| ]
Sum over valid buckets only (n_k >= 15).
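The bucketing and ECE steps can be sketched as follows. Two points are assumptions rather than statements from this document: p_i = 1.0 is folded into bucket 10, and n in the weight n_k / n counts all predictions (not only those in valid buckets). The function names are illustrative.

```python
from typing import Dict, List, Sequence, Tuple

MIN_BUCKET_N = 15  # minimum n_k before a bucket counts toward ECE


def bucket_index(p: float) -> int:
    """Map p in [0, 1] to bucket k = 1..10; p = 1.0 goes to bucket 10 (assumption)."""
    return min(int(p * 10) + 1, 10)


def expected_calibration_error(
    probs: Sequence[float], outcomes: Sequence[int], min_n: int = MIN_BUCKET_N
) -> float:
    n = len(probs)  # assumption: n counts every prediction, valid bucket or not
    buckets: Dict[int, List[Tuple[float, int]]] = {}
    for p, o in zip(probs, outcomes):
        buckets.setdefault(bucket_index(p), []).append((p, o))

    ece = 0.0
    for rows in buckets.values():
        n_k = len(rows)
        if n_k < min_n:
            continue  # below-threshold buckets are skipped
        conf_k = sum(p for p, _ in rows) / n_k
        acc_k = sum(o for _, o in rows) / n_k
        ece += (n_k / n) * abs(conf_k - acc_k)
    return ece


# 20 predictions at 0.80 with 12 correct (gap 0.20), plus 5 in a bucket too small to count.
probs = [0.80] * 20 + [0.55] * 5
outcomes = [1] * 12 + [0] * 8 + [1, 0, 1, 0, 1]
print(round(expected_calibration_error(probs, outcomes), 3))  # 0.16
```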
| ECE | Classification |
|---|---|
| < 0.030 | Excellent |
| 0.030-0.050 | Good |
| 0.050-0.075 | Acceptable — monitor |
| 0.075-0.100 | Degraded — soft correction trigger |
| > 0.100 | Critical — hard correction trigger |
OLS regression through calibration points:
acc_k = alpha + beta * conf_k
Ideal: beta = 1.0, alpha = 0.0
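A minimal sketch of the slope fit, assuming an unweighted ordinary least squares fit over the (conf_k, acc_k) points; this document does not say whether buckets are weighted by n_k. The name `calibration_fit` is illustrative.

```python
from typing import Sequence, Tuple


def calibration_fit(conf: Sequence[float], acc: Sequence[float]) -> Tuple[float, float]:
    """Return (alpha, beta) for acc_k = alpha + beta * conf_k via unweighted OLS."""
    n = len(conf)
    if n < 2 or n != len(acc):
        raise ValueError("need at least two paired calibration points")
    mean_x = sum(conf) / n
    mean_y = sum(acc) / n
    sxx = sum((x - mean_x) ** 2 for x in conf)
    if sxx == 0:
        raise ValueError("conf values are all identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(conf, acc))
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    return alpha, beta


# Accuracy spreads less than confidence does, so the fitted slope is below 1.
alpha, beta = calibration_fit([0.55, 0.65, 0.75], [0.53, 0.60, 0.67])
print(round(alpha, 3), round(beta, 3))  # 0.145 0.7
```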
| beta | Interpretation |
|---|---|
| < 0.70 | Severely compressed |
| 0.70-0.90 | Moderate compression |
| 0.90-1.10 | Well-calibrated |
| > 1.10 | Over-spread |
For each conviction level c in {1, 2, 3, 4, 5}, compute:
B_c = Brier score of all predictions with conviction c
Expected pattern: B_c should strictly decrease as c increases.
| Conviction | Expected Max B | Hard Flag |
|---|---|---|
| 1 | 0.240 | > 0.250 |
| 2 | 0.230 | > 0.245 |
| 3 | 0.215 | > 0.235 |
| 4 | 0.195 | > 0.215 |
| 5 | 0.170 | > 0.200 |
If B_c > B_{c-1} (worse at higher conviction): trigger CONVICTION_INVERSION flag.
Conviction frequency distribution: Flag if any analyst issues > 40% of predictions at c=4-5 (inflation) or > 60% at c=1-2 (hedging).
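The inversion and frequency checks above can be sketched as below. CONVICTION_INVERSION is the flag named in this document; the flag names `CONVICTION_INFLATION` and `CONVICTION_HEDGING` and both function names are hypothetical labels chosen for the example.

```python
from typing import Dict, List, Sequence


def conviction_inversions(brier_by_conviction: Dict[int, float]) -> List[int]:
    """Return every conviction level c where B_c > B_{c-1} (a CONVICTION_INVERSION)."""
    b = brier_by_conviction
    return [c for c in range(2, 6) if c in b and (c - 1) in b and b[c] > b[c - 1]]


def conviction_distribution_flags(convictions: Sequence[int]) -> List[str]:
    """Flag inflation (> 40% at c=4-5) or hedging (> 60% at c=1-2)."""
    n = len(convictions)
    if n == 0:
        return []
    share_high = sum(1 for c in convictions if c >= 4) / n
    share_low = sum(1 for c in convictions if c <= 2) / n
    flags = []
    if share_high > 0.40:
        flags.append("CONVICTION_INFLATION")  # hypothetical flag name
    if share_low > 0.60:
        flags.append("CONVICTION_HEDGING")  # hypothetical flag name
    return flags


# Brier gets worse from c=2 to c=3, so level 3 is reported as an inversion.
print(conviction_inversions({1: 0.240, 2: 0.230, 3: 0.235, 4: 0.200, 5: 0.180}))  # [3]
```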
Per-game checks (not aggregate):
a. Spread-ML Consistency
implied_ML_from_spread = Phi(spread / sigma_sport)
Where sigma_sport = 12 (NFL), 7 (NBA/NCAAB), 1.4 (NHL goals).
Flag if: |analyst_ML_prob - implied_ML_from_spread| > 0.08
b. Probability Sum Check
p_home + p_away = 1.0 +/- 0.005
Hard reject if outside tolerance. Exclude from calibration.
c. Extreme Probability Gate
Any stated probability outside [0.05, 0.95] is flagged EXTREME_CLAIM and logged separately.
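The three per-game checks can be sketched in one function. The sign convention is an assumption: `home_margin` is the analyst's expected home margin of victory (positive when the home side is favored), so Phi(home_margin / sigma) is the implied home win probability. EXTREME_CLAIM comes from this document; `SPREAD_ML_MISMATCH` and `PROB_SUM_REJECT` are hypothetical flag names.

```python
import math
from typing import List

SIGMA = {"NFL": 12.0, "NBA": 7.0, "NCAAB": 7.0, "NHL": 1.4}


def normal_cdf(z: float) -> float:
    """Standard normal CDF, Phi(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def per_game_flags(sport: str, home_margin: float, p_home: float, p_away: float) -> List[str]:
    flags = []
    # a. Spread-ML consistency (home_margin sign convention is an assumption)
    implied_ml = normal_cdf(home_margin / SIGMA[sport])
    if abs(p_home - implied_ml) > 0.08:
        flags.append("SPREAD_ML_MISMATCH")  # hypothetical flag name
    # b. Probability sum check: hard reject outside 1.0 +/- 0.005
    if abs(p_home + p_away - 1.0) > 0.005:
        flags.append("PROB_SUM_REJECT")  # hypothetical flag name
    # c. Extreme probability gate
    if any(p < 0.05 or p > 0.95 for p in (p_home, p_away)):
        flags.append("EXTREME_CLAIM")
    return flags


# NFL home side favored by 3: Phi(0.25) is about 0.599, so a stated 0.70 is flagged.
print(per_game_flags("NFL", 3.0, 0.70, 0.30))  # ['SPREAD_ML_MISMATCH']
```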
| Scope | Minimum n |
|---|---|
| Analyst overall Brier | 50 |
| Per-sport Brier | 30 |
| Per-market-type Brier | 30 |
| Per-sport per-market | 25 |
| Conviction level bucket | 20 |
| Decile bucket (ECE) | 15 |
| Bias-specific computation | 20 |
| Cross-analyst correlation | 40 pairs |
Scores computed below these thresholds are labeled [PROVISIONAL] and never trigger automated corrections.
A digest is generated when ANY of:
compression_score = beta_slope - 1.0 (from Section 1.4)
If compression_score < -0.20:
"Your probability estimates for [SPORT] [MARKET] are compressed toward 50%. When fundamentals indicate a strong lean (p > 0.65 or p < 0.35), increase stated probability by approximately {magnitude}pp."
home_bias_delta = mean(gap_k | home favored) - mean(gap_k | away favored)
Trigger if |home_bias_delta| > 0.04 and n >= 20 per group.
"Your stated confidence when favoring the home team is {delta}pp higher than accuracy warrants. Reduce probability by ~{correction} when edge relies primarily on home advantage."
narrative_pull = mean(p | team streak >= 3) - mean(p | no streak)
Trigger if |narrative_pull| > 0.03 and n_streak >= 15.
"Your probabilities show a {magnitude}pp shift following 3+ game streaks. This is not supported by accuracy. Treat streak context as already priced into fundamentals."
Segment by rest-day differential (home advantage >= 1 day vs away advantage):
fatigue_effect = gap_A - gap_B
Trigger if |fatigue_effect| > 0.04 and n >= 15 per group.
"When home team has superior rest, your predictions show a {magnitude}pp calibration gap. Normalize fatigue adjustment by ~40%."
Tag predictions with injury_flag in {none, minor, major, multiple}.
For each flag level: B_f = Brier(predictions where flag = f)
Trigger if B_major > B_none + 0.025 or |gap_major - gap_none| > 0.06.
"Your accuracy drops by {delta} Brier points on games with major injuries. Apply 0.03 reduction to probability magnitude (toward 0.50) in injury-affected games."
Identify top 2 (analyst, sport, market) triples by lowest Brier (min n=30). Listed FIRST in every digest before corrections.
"STRENGTH: Your [SPORT] [MARKET] Brier of {B} is {delta} below Pinnacle benchmark. This is your most reliable market. Maintain current approach."
All bias corrections require:
If p-value <= 0.10 but |gap| < 0.03: log MARGINAL_SIGNAL, no correction.
If |gap| >= 0.03 but p-value > 0.10: log TRENDING, check next digest.
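The gating logic can be sketched as a small classifier. It assumes that a correction is applied only when both conditions hold (significance at 0.10 and an absolute gap of at least 0.03), which is inferred from the two fallback rules above and the parameter table. MARGINAL_SIGNAL and TRENDING come from this document; `APPLY_CORRECTION` and `NO_SIGNAL` are hypothetical labels.

```python
def classify_bias_signal(
    p_value: float, gap: float, alpha: float = 0.10, min_effect: float = 0.03
) -> str:
    """Classify a measured bias by statistical significance and effect size."""
    significant = p_value <= alpha
    material = abs(gap) >= min_effect
    if significant and material:
        return "APPLY_CORRECTION"  # hypothetical label: both gates passed
    if significant:
        return "MARGINAL_SIGNAL"  # significant but too small to act on
    if material:
        return "TRENDING"  # large enough but not yet significant
    return "NO_SIGNAL"  # hypothetical label: neither gate passed


print(classify_bias_signal(0.04, 0.05))  # APPLY_CORRECTION
print(classify_bias_signal(0.04, 0.01))  # MARGINAL_SIGNAL
print(classify_bias_signal(0.30, -0.05))  # TRENDING
```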
| Bias Type | Max correction magnitude |
|---|---|
| Compression (beta) | +/-0.15 probability shift |
| Home bias | +/-0.06 |
| Narrative | +/-0.05 |
| Fatigue | +/-0.04 |
| Injury | +/-0.04 |
| Combined total | +/-0.15 per prediction |
No more than 5 simultaneous active corrections per analyst.
CALIBRATION DIGEST v{version}
Analyst: {ANALYST_ID}
Period: {start_date} - {end_date}
Predictions included: {n}
== STRENGTH ZONES ==
[top performing areas]
== CALIBRATION SUMMARY ==
Sport | Market | n | Brier | ECE | beta | Status
== ACTIVE CORRECTIONS ==
Correction ID: {CORR-XXX}
Type: {BIAS_TYPE}
Market: {SPORT}/{MARKET}
Magnitude: {value}
Direction: {direction}
Confidence: p={value}
Instruction: [template text]
== FLAGGED CONCERNS ==
[inversions, extreme claims, etc.]
== DIGEST METADATA ==
Previous: {prev_id}
Added: {list}
Removed: {list}
| Phase | Schedule | Rationale |
|---|---|---|
| Baseline (weeks 1-4) | No digests | Build clean sample |
| Early (weeks 5-12) | Every 4 weeks | Thin sample still |
| Steady state (weeks 13+) | Every 3 weeks | Standard |
| Peak (playoffs) | Every 2 weeks | Higher volume |
| Trigger | Action | Latency |
|---|---|---|
| ECE > 0.075 | Soft: add to next digest | Next scheduled |
| ECE > 0.100 | Hard: emergency digest | 24 hours |
| CONVICTION_INVERSION n>=20 | Hard | 24 hours |
| Cross-analyst rho > 0.65 | Correlation digest | 48 hours |
| Provider model version update | Freeze corrections 21 days | Immediate |
| n crosses minimum threshold | First-valid digest | Next weekly scan |
Weeks 1-4: ZERO corrections. Clean baseline.
Format: {ANALYST_ID}-D{version}-{YYYYMMDD}
Example: SONNET-D001-20260501
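A short sketch of building a digest ID in this format. The three-digit zero padding of the version is inferred from the example above, and the function name `digest_id` is illustrative.

```python
from datetime import date


def digest_id(analyst_id: str, version: int, issued: date) -> str:
    """Format {ANALYST_ID}-D{version}-{YYYYMMDD}; version padded to 3 digits (inferred)."""
    return f"{analyst_id}-D{version:03d}-{issued:%Y%m%d}"


print(digest_id("SONNET", 1, date(2026, 5, 1)))  # SONNET-D001-20260501
```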
Each digest supersedes all prior digests of the same scope. Sport-specific corrections override global corrections for that sport.
rho(A, B) = Pearson correlation of {p_i^A} and {p_i^B}
Rolling 60-game window. All 3 pairs: (SONNET, GEMINI), (SONNET, GPT), (GEMINI, GPT).
System average: rho_avg = mean of 3 pairs.
| rho | Classification |
|---|---|
| < 0.40 | Healthy independence |
| 0.40-0.55 | Monitor |
| 0.55-0.65 | Concerning — soft alert |
| > 0.65 | Critical — trigger action |
D_eff = n / (1 + (n-1) * rho_avg), where n is the number of analysts
For n=3: D_eff = 3 / (1 + 2*rho_avg). Must be >= 2.0. Alert at 1.5.
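The pairwise correlation and effective-diversity formulas can be sketched as follows, using only the standard library. The function names are illustrative, and the inputs are assumed to be probability series aligned game by game over the rolling window.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation of two equal-length probability series."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two paired observations")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("a constant series has no defined correlation")
    return cov / (sx * sy)


def effective_diversity(rho_avg: float, n: int = 3) -> float:
    """D_eff = n / (1 + (n - 1) * rho_avg)."""
    return n / (1 + (n - 1) * rho_avg)


print(effective_diversity(0.25))  # 2.0
print(effective_diversity(0.50))  # 1.5
```

For three analysts, D_eff >= 2.0 corresponds to rho_avg <= 0.25, and the 1.5 alert level corresponds to rho_avg = 0.50.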
After each digest cycle:
delta_rho = rho_post(30 games) - rho_pre(30 games)
If delta_rho > +0.08 for any pair within 30 days: flag DIGEST_CONVERGENCE_SUSPECTED.
Rule: No two analysts may receive structurally identical corrections in the same digest cycle.
{
rho_SG, rho_SP, rho_GP, rho_avg, D_eff,
window_n, status (HEALTHY|MONITOR|ALERT|CRITICAL),
digest_convergence_flag
}
| Benchmark | Brier |
|---|---|
| Pinnacle de-vig (gold standard) | ~0.218-0.224 |
| Kalshi market-implied | ~0.220-0.228 |
| Baseline (always 0.50) | 0.250 |
| Target: beats Pinnacle | < 0.215 |
| Acceptable analyst | 0.220-0.235 |
| Weak analyst | 0.235-0.245 |
| Near or worse than baseline | > 0.245 |
Pinnacle Brier recomputed each season using closing line de-vigged probabilities.
| Metric | SONNET | GEMINI | GPT-OSS | Scanner | Pinnacle | Kalshi |
|---|---|---|---|---|---|---|
| Overall Brier | | | | | | |
| NBA SPREAD B | | | | | | |
| NBA TOTAL B | | | | | | |
| NBA ML B | | | | | | |
| ECE overall | | | | N/A | | |
Simple ensemble: arithmetic mean of 3 analyst probabilities.
Conviction-weighted:
p_ensemble = SUM(p_i * c_i^2) / SUM(c_i^2)
Log ENSEMBLE_UNDERPERFORM if weighted ensemble does not beat best individual over any 50-game window.
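Both ensemble rules can be sketched in a few lines. The function names are illustrative, and the sketch applies the formulas as written without any per-analyst weight cap.

```python
from typing import Sequence


def simple_ensemble(probs: Sequence[float]) -> float:
    """Arithmetic mean of the analyst probabilities."""
    return sum(probs) / len(probs)


def conviction_weighted_ensemble(probs: Sequence[float], convictions: Sequence[int]) -> float:
    """p_ensemble = SUM(p_i * c_i^2) / SUM(c_i^2)."""
    weights = [c ** 2 for c in convictions]
    return sum(p * w for p, w in zip(probs, weights)) / sum(weights)


probs = [0.60, 0.55, 0.70]
convictions = [2, 3, 5]  # weights 4, 9, 25
print(round(simple_ensemble(probs), 4))  # 0.6167
print(round(conviction_weighted_ensemble(probs, convictions), 4))  # 0.6539
```

Squaring the conviction pulls the weighted result toward the conviction-5 analyst, who carries 25 of the 38 weight units in this example.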
Bet rule: flat 1 unit when p_analyst > p_pinnacle + threshold
Thresholds tested: 0.03, 0.04, 0.05, 0.06, 0.07
ROI = (total_profit / total_wagered) * 100%
CI = ROI +/- 1.96 * sqrt(variance / n)
An analyst is "showing edge" if the lower bound of the 95% CI exceeds 0% at any threshold, with a minimum of 100 hypothetical bets.
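A minimal sketch of the hypothetical-bet backtest at a single threshold. Several details are assumptions: each bet carries a `decimal_odds` settlement price (this document does not say which price is used), the variance is the sample variance of per-bet returns, and the function and field names are illustrative.

```python
import math
from typing import Dict, Iterable, Optional, Tuple

# (p_analyst, p_pinnacle, decimal_odds, won); decimal_odds is a hypothetical input
Bet = Tuple[float, float, float, bool]


def backtest_roi(bets: Iterable[Bet], threshold: float, min_bets: int = 100) -> Optional[Dict]:
    # Flat 1 unit per qualifying bet: win pays odds - 1, loss costs 1.
    returns = [
        (odds - 1.0) if won else -1.0
        for p_a, p_p, odds, won in bets
        if p_a > p_p + threshold
    ]
    n = len(returns)
    if n < 2:
        return None  # not enough bets to estimate a variance
    mean = sum(returns) / n
    variance = sum((r - mean) ** 2 for r in returns) / (n - 1)
    half_width = 1.96 * math.sqrt(variance / n)
    return {
        "n": n,
        "roi_pct": mean * 100,
        "ci_low_pct": (mean - half_width) * 100,
        "ci_high_pct": (mean + half_width) * 100,
        "showing_edge": n >= min_bets and (mean - half_width) > 0,
    }


sample = [(0.60, 0.52, 2.0, True), (0.60, 0.52, 2.0, False), (0.55, 0.54, 2.0, True)]
result = backtest_roi(sample, threshold=0.05)
print(result["n"], result["roi_pct"], result["showing_edge"])  # 2 0.0 False
```

Running the same function once per tested threshold (0.03 to 0.07) gives the per-threshold comparison described above.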
SS = 1 - (B_analyst / B_baseline)
SS = 1 - (B_analyst / 0.250)
Target: SS > 0.08 (B < 0.230).
B_recent = Brier over most recent 40 games
B_overall = Brier over all games (or prior 120)
B_delta = B_recent - B_overall
| B_delta | Action |
|---|---|
| > 0.010 | SOFT_DEGRADATION |
| > 0.020 | HARD_DEGRADATION — manual review |
| > 0.030 | CRITICAL — pause corrections, re-baseline |
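The B_delta thresholds map to actions as in this sketch. The `OK` label for the no-action case and the function name are hypothetical.

```python
def degradation_status(b_recent: float, b_overall: float) -> str:
    """Classify B_delta = B_recent - B_overall against the thresholds in the table above."""
    b_delta = b_recent - b_overall
    if b_delta > 0.030:
        return "CRITICAL"
    if b_delta > 0.020:
        return "HARD_DEGRADATION"
    if b_delta > 0.010:
        return "SOFT_DEGRADATION"
    return "OK"  # hypothetical label for no action


print(degradation_status(0.245, 0.220))  # HARD_DEGRADATION
```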
S_t = max(0, S_{t-1} + (b_t - B_target - k))
k = 0.005 (slack), h = 5.0 (alert threshold)
Detects sustained ~0.01 Brier shift within ~30 games.
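The one-sided CUSUM recursion can be sketched as below, where b_t is taken to be the per-game squared error (p_t - o_t)^2 (an assumption) and the function name is illustrative. The sketch exposes h and k as parameters because their units matter: on raw Brier terms, a sustained shift of 0.01 above B_target adds only 0.01 - 0.005 = 0.005 per game, so the stated h = 5.0 would be reached only if b_t is rescaled (for example standardized) before accumulation.

```python
from typing import Optional, Sequence


def cusum_first_alert(
    brier_terms: Sequence[float], b_target: float, k: float = 0.005, h: float = 5.0
) -> Optional[int]:
    """Return the 1-based game index where S_t first exceeds h, or None if it never does."""
    s = 0.0
    for t, b_t in enumerate(brier_terms, start=1):
        s = max(0.0, s + (b_t - b_target - k))
        if s > h:
            return t
    return None


# Illustrative units only: a constant 0.015 excess accumulates about 0.01 per game,
# so an alert threshold of 0.045 is crossed on the fifth game.
print(cusum_first_alert([0.235] * 10, b_target=0.220, h=0.045))  # 5
```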
After each correction:
If gap_k reversed sign AND |gap_post| > |gap_pre| * 0.5:
flag OVERCORRECTION
If OVERCORRECTION flagged twice in 3 consecutive digests: permanently reduce correction cap by 50% for that bias type.
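A sketch of the overcorrection test and the cap-reduction rule. The function names are illustrative, and "twice in 3 consecutive digests" is read as at least two flags among the three most recent digests.

```python
from typing import Sequence


def is_overcorrection(gap_pre: float, gap_post: float) -> bool:
    """True if the gap reversed sign and the new gap exceeds half the old one in magnitude."""
    reversed_sign = gap_pre * gap_post < 0
    return reversed_sign and abs(gap_post) > 0.5 * abs(gap_pre)


def should_halve_cap(overcorrection_history: Sequence[bool]) -> bool:
    """True if OVERCORRECTION was flagged at least twice in the last 3 digests."""
    return sum(overcorrection_history[-3:]) >= 2


print(is_overcorrection(0.06, -0.04))  # True: sign flipped and 0.04 > 0.03
print(is_overcorrection(0.06, -0.02))  # False: sign flipped but 0.02 <= 0.03
print(should_halve_cap([True, False, True]))  # True
```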
Monitor total_correction_load = SUM(|correction_i|). If > 0.20: prune smallest corrections.
Weekly fingerprint: run 5 synthetic standardized games, track mean probability. If fingerprint shifts > 0.04 in one week without digest update: flag POSSIBLE_MODEL_VERSION_CHANGE.
On provider version update:
| Condition | Level | Action |
|---|---|---|
| ECE > 0.075 | WARN | Next digest |
| ECE > 0.100 | CRITICAL | Emergency digest |
| B_delta > 0.020 | WARN | Manual review |
| B_delta > 0.030 | CRITICAL | Pause, re-baseline |
| CONVICTION_INVERSION n>=20 | WARN | Include in digest |
| rho > 0.65 | CRITICAL | Correlation digest |
| D_eff < 1.5 | CRITICAL | Human review |
| OVERCORRECTION x2 | SYSTEMIC | Reduce cap |
| Fingerprint > 0.04 | WARN | Version check |
| Correction load > 0.20 | WARN | Prune corrections |
Replace candidate if ALL hold over 90+ days:
Meeting all 4 criteria triggers the replacement process. Meeting a single criterion triggers a review.
Phase 1, Shadow (30 games): New model predicts in parallel, not used in ensemble.
Phase 2, Partial Weight (30 games): New model at 0.5x weight alongside incumbent.
Phase 3, Promotion: Promote if B_new < B_old - 0.005 AND ECE_new < ECE_old AND p < 0.20.
Fresh 14-day baseline after promotion.
Triggers (need all 3):
Always add in pairs (4th + 5th together) to maintain odd-number ensemble.
| Day | Action |
|---|---|
| 0 | Log VERSION_EVENT, freeze corrections |
| 0-21 | Re-baseline, collect clean data |
| 21 | Assess: Brier differs by > 0.010? |
| 21+ | If yes: fresh digest. If no: resume existing. |
analyst_picks: Every prediction from every analyst. PERMANENT. Never deleted.
outcomes: Actual game results. PERMANENT.
calibration_snapshots: Computed metrics per period. PERMANENT.
digests: Full text of every calibration digest. PERMANENT.
corrections: Individual bias corrections. PERMANENT (status updated, never deleted).
correlation_log: Weekly independence metrics. PERMANENT.
model_version_events: Provider upgrades. PERMANENT.
audit_log: Append-only event log. PERMANENT.
fingerprint_snapshots: Model drift detection. 2-year rolling retention.
Corrections are metadata, not mutations. The raw stated probability from the analyst is ALWAYS preserved. Corrections are instructions in the system prompt, not transformations of output.
Every prediction row records digest_active_id (which corrections were in effect). Combined with the corrections table, you can reconstruct exactly what the system believed at any point in time.
Counterfactual analysis is supported: re-run ensemble calculations with any subset of corrections applied or omitted, because raw probabilities are always stored.
All events written to audit_log (append-only, never updated/deleted):
| Parameter | Value |
|---|---|
| Min n overall Brier | 50 |
| Min n per-sport | 30 |
| Min n per-market | 30 |
| Min n conviction bucket | 20 |
| Min n decile bucket | 15 |
| Min n bias computation | 20 |
| Min n correlation | 40 |
| ECE acceptable | 0.050 |
| ECE soft trigger | 0.075 |
| ECE hard trigger | 0.100 |
| Beta well-calibrated | 0.90-1.10 |
| Correction significance alpha | 0.10 |
| Min effect size | 0.03 |
| Max simultaneous corrections | 5 |
| Max total correction load | 0.15 |
| Correlation monitor threshold | 0.40 |
| Correlation critical | 0.65 |
| Min D_eff | 2.0 (alert 1.5) |
| Convergence delta_rho threshold | +0.08 |
| CUSUM h | 5.0 |
| CUSUM k | 0.005 |
| Hard degradation B_delta | 0.020 |
| Critical degradation B_delta | 0.030 |
| Re-baseline after version update | 21 days |
| A/B shadow phase n | 30 |
| A/B partial phase n | 30 |
| A/B promotion alpha | 0.20 |
| Max analyst ensemble weight | 0.40 |
| Baseline no-correction period | 4 weeks |
| Fingerprint drift alert | 0.04 |
| Conviction inflation flag | > 40% at c=4-5 |
| Conviction hedging flag | > 60% at c=1-2 |
| Home bias threshold | 0.04 |
| Narrative bias threshold | 0.03 |
| Fatigue effect threshold | 0.04 |
| Injury accuracy drop | 0.025 Brier |
| Emergency staging window | 12 hours |
| Standard staging window | 48 hours |
| Rollback availability | 30 days |
| Fingerprint retention | 2 years |
Amendments require:
When code and document conflict, the document wins until formally amended.
End of CAPS-001 v1.0