Calibration Specification — Blind Analyst Prediction System (CAPS-001)

Version: 1.0
Date: 2026-03-30
Status: AUTHORITATIVE (governs the system until explicitly superseded)
Panel: Opus 4.6 · Sonnet 4.6 · Gemini 3.1 Pro · Grok 4.2 Reasoning · gpt-oss-120b


0. System Overview

Three blind AI analysts (Claude Sonnet 4.6, Gemini 3.1 Pro Preview, gpt-oss-120b) independently set sports betting lines from fundamentals only. They see no market prices. Each produces spread, total, and moneyline (ML) lines, a probability for the side expected to cover, a conviction level (1-5), supporting evidence, and an upset scenario.

Their output is compared against a mathematical edge scanner (Pinnacle de-vig probability curves) and Kalshi market prices. A separate Opus 4.6 instance makes the final verdict with position sizing.

Model Roster (locked):

No substitutions without a full panel ruling.


1. Calibration Measurement

1.1 Brier Score

Primary per-prediction accuracy metric.

B = (1/n) * SUM( (p_i - o_i)^2 )

Where p_i is the analyst's stated probability and o_i is the actual outcome (1 = correct, 0 = incorrect).
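As a minimal sketch of the formula above (the function name `brier_score` is illustrative and not part of the spec), the score can be computed directly from paired probabilities and outcomes:

```python
from typing import Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared difference between stated probabilities and 0/1 outcomes."""
    if len(probs) == 0 or len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must be non-empty and equal in length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


# Always stating 0.50 scores 0.25 whatever the outcomes are.
print(brier_score([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 1]))  # 0.25
```

Lower is better; 0.25 is the score of an analyst who always states 0.50.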

Stratified Brier scores computed for every combination of:

1.2 Probability Decile Calibration Curve

For each analyst, bin all predictions into 10 equal-width buckets:

Bucket k: predictions where p_i in [(k-1)/10, k/10), k = 1..10 (bucket 10 also includes p_i = 1.0)

For each bucket:
  conf_k   = mean(p_i) for all i in bucket k
  acc_k    = mean(o_i) for all i in bucket k
  n_k      = count of predictions in bucket k
  gap_k    = conf_k - acc_k   (positive = overconfident)

A bucket needs n_k >= 15 before it is used in ECE.
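The bucketing rule can be sketched as follows. The function name `decile_buckets` and the dictionary layout are assumptions for illustration, and the sketch assumes p_i = 1.0 belongs in bucket 10, since a strictly half-open interval would leave it unassigned.

```python
from collections import defaultdict


def decile_buckets(probs, outcomes):
    """Group predictions into buckets k = 1..10 and return per-bucket statistics."""
    grouped = defaultdict(list)
    for p, o in zip(probs, outcomes):
        k = min(int(p * 10) + 1, 10)  # assumption: p = 1.0 is placed in bucket 10
        grouped[k].append((p, o))

    stats = {}
    for k in sorted(grouped):
        rows = grouped[k]
        n_k = len(rows)
        conf_k = sum(p for p, _ in rows) / n_k
        acc_k = sum(o for _, o in rows) / n_k
        # gap > 0 means stated confidence exceeded realized accuracy
        stats[k] = {"n": n_k, "conf": conf_k, "acc": acc_k, "gap": conf_k - acc_k}
    return stats
```

Buckets with no predictions are simply absent from the result, so a caller can apply the minimum-count rule by checking `stats[k]["n"]`.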

1.3 Expected Calibration Error (ECE)

ECE = SUM_k [ (n_k / n) * |gap_k| ]

Sum over valid buckets only (n_k >= 15).
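A minimal sketch of the ECE computation, assuming per-bucket statistics are already available as (n_k, conf_k, acc_k) tuples. Two points are assumptions rather than spec text: n is taken as the total count across valid buckets, so the weights sum to 1, and a value exactly on a boundary between two classification bands of this section is assigned to the lower band. The function names are illustrative.

```python
def ece(buckets, min_n=15):
    """buckets: iterable of (n_k, conf_k, acc_k); returns ECE or None if no valid bucket."""
    valid = [(n, conf, acc) for n, conf, acc in buckets if n >= min_n]
    total = sum(n for n, _, _ in valid)  # assumption: n counts valid buckets only
    if total == 0:
        return None
    return sum(n / total * abs(conf - acc) for n, conf, acc in valid)


def classify_ece(value):
    """Map an ECE value to the classification bands of this section."""
    if value < 0.030:
        return "Excellent"
    if value <= 0.050:
        return "Good"
    if value <= 0.075:
        return "Acceptable"
    if value <= 0.100:
        return "Degraded"
    return "Critical"
```

In the usage below, the third bucket has only 5 predictions and is ignored:

```python
value = ece([(20, 0.55, 0.50), (30, 0.65, 0.70), (5, 0.90, 0.20)])
print(classify_ece(value))
```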

ECE         | Classification
< 0.030     | Excellent
0.030-0.050 | Good
0.050-0.075 | Acceptable (monitor)
0.075-0.100 | Degraded (soft correction trigger)
> 0.100     | Critical (hard correction trigger)

1.4 Reliability Diagram Slope (Sharpness)

OLS regression through calibration points:

acc_k = alpha + beta * conf_k

Ideal: beta = 1.0, alpha = 0.0
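The regression can be sketched with closed-form ordinary least squares over the (conf_k, acc_k) calibration points. This is an unweighted fit, which is the plain reading of "OLS"; the function name `reliability_fit` is illustrative.

```python
def reliability_fit(conf, acc):
    """Fit acc = alpha + beta * conf by ordinary least squares; returns (alpha, beta)."""
    n = len(conf)
    if n < 2 or n != len(acc):
        raise ValueError("need at least two paired calibration points")
    mean_x = sum(conf) / n
    mean_y = sum(acc) / n
    sxx = sum((x - mean_x) ** 2 for x in conf)
    if sxx == 0:
        raise ValueError("conf values are all identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(conf, acc))
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    return alpha, beta


# Accuracy moves only half as fast as stated confidence: beta is about 0.5.
alpha, beta = reliability_fit([0.3, 0.5, 0.7], [0.4, 0.5, 0.6])
```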

beta      | Interpretation
< 0.70    | Severely compressed
0.70-0.90 | Moderate compression
0.90-1.10 | Well-calibrated
> 1.10    | Over-spread

1.5 Conviction Calibration

For each conviction level c in {1, 2, 3, 4, 5}, compute:

B_c = Brier score of all predictions with conviction c

Expected pattern: B_c should strictly decrease as c increases.

Conviction | Expected Max B | Hard Flag
1          | 0.240          | > 0.250
2          | 0.230          | > 0.245
3          | 0.215          | > 0.235
4          | 0.195          | > 0.215
5          | 0.170          | > 0.200

If B_c > B_{c-1} (worse at higher conviction): trigger CONVICTION_INVERSION flag.

Conviction frequency distribution: Flag if any analyst issues > 40% of predictions at c=4-5 (inflation) or > 60% at c=1-2 (hedging).
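The inversion and frequency checks can be sketched as below. The function names and the labels `INFLATION` and `HEDGING` are hypothetical (the spec only names the CONVICTION_INVERSION flag). When a conviction level has no predictions, the sketch compares adjacent levels that are present, which is an assumption.

```python
def conviction_inversions(brier_by_level):
    """Return levels c whose Brier is worse (higher) than the next lower level present."""
    levels = sorted(brier_by_level)
    return [
        c
        for prev, c in zip(levels, levels[1:])
        if brier_by_level[c] > brier_by_level[prev]
    ]


def conviction_frequency_flags(convictions):
    """Flag inflation (> 40% at c=4-5) or hedging (> 60% at c=1-2)."""
    n = len(convictions)
    high_share = sum(1 for c in convictions if c >= 4) / n
    low_share = sum(1 for c in convictions if c <= 2) / n
    flags = []
    if high_share > 0.40:
        flags.append("INFLATION")
    if low_share > 0.60:
        flags.append("HEDGING")
    return flags
```

A non-empty result from `conviction_inversions` corresponds to the CONVICTION_INVERSION flag for the listed levels.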

1.6 Internal Consistency Checks

Per-game checks (not aggregate):

a. Spread-ML Consistency

implied_ML_from_spread = Phi(spread / sigma_sport)

Where sigma_sport = 12 (NFL), 7 (NBA/NCAAB), 1.4 (NHL goals).

Flag if: |analyst_ML_prob - implied_ML_from_spread| > 0.08

b. Probability Sum Check

p_home + p_away = 1.0 +/- 0.005

Hard reject if outside tolerance. Exclude from calibration.

c. Extreme Probability Gate

Any stated probability outside [0.05, 0.95] is flagged EXTREME_CLAIM and logged separately.
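The three per-game checks can be sketched together. Several things here are assumptions for illustration: the spread is passed as the home team's expected margin (positive when the home side is favored), p_home doubles as the analyst's moneyline probability for the home side, and the labels `SPREAD_ML_MISMATCH` and `PROB_SUM_REJECT` are hypothetical (only EXTREME_CLAIM is named in the spec). Phi is the standard normal CDF, taken from the standard library's `statistics.NormalDist`.

```python
from statistics import NormalDist

SIGMA = {"NFL": 12.0, "NBA": 7.0, "NCAAB": 7.0, "NHL": 1.4}


def consistency_flags(sport, home_margin, p_home, p_away):
    """Run checks a, b, and c for one game and return the list of flags raised."""
    flags = []

    # a. Spread-ML consistency
    implied = NormalDist().cdf(home_margin / SIGMA[sport])
    if abs(p_home - implied) > 0.08:
        flags.append("SPREAD_ML_MISMATCH")

    # b. Probability sum check (hard reject outside tolerance)
    if abs(p_home + p_away - 1.0) > 0.005:
        flags.append("PROB_SUM_REJECT")

    # c. Extreme probability gate
    if any(p < 0.05 or p > 0.95 for p in (p_home, p_away)):
        flags.append("EXTREME_CLAIM")

    return flags


print(consistency_flags("NFL", 3.0, 0.60, 0.40))  # []
```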

1.7 Minimum Sample Size Requirements

Scope                     | Minimum n
Analyst overall Brier     | 50
Per-sport Brier           | 30
Per-market-type Brier     | 30
Per-sport per-market      | 25
Conviction level bucket   | 20
Decile bucket (ECE)       | 15
Bias-specific computation | 20
Cross-analyst correlation | 40 pairs

Below-threshold scores are labeled [PROVISIONAL] and never trigger automated corrections.


2. Calibration Digest Generation

2.1 Digest Trigger Conditions

A digest is generated when ANY of:

  1. Scheduled cadence has elapsed (Section 3)
  2. Any ECE exceeds 0.075 (soft trigger)
  3. Any ECE exceeds 0.100 (hard trigger — immediate, not waiting for schedule)
  4. CONVICTION_INVERSION flag with n >= 20
  5. Cross-analyst correlation rho > 0.65 (Section 4)
  6. Sample size crosses minimum threshold for first time

2.2 Bias Computation Algorithms

2.2.1 Compression Bias

compression_score = beta_slope - 1.0  (from Section 1.4)

If compression_score < -0.20:

"Your probability estimates for [SPORT] [MARKET] are compressed toward 50%. When fundamentals indicate a strong lean (p > 0.65 or p < 0.35), increase stated probability by approximately {magnitude}pp."

2.2.2 Home Court/Ice Bias

home_bias_delta = mean(gap_k | home favored) - mean(gap_k | away favored)

Trigger if |home_bias_delta| > 0.04 and n >= 20 per group.
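A sketch of the trigger check, assuming each group's gap is the mean stated probability minus the mean outcome within that group. The spec defines gap_k per decile bucket, so this group-level reading is an assumption, and the function names are illustrative.

```python
def group_gap(rows):
    """rows: list of (p, o) pairs; returns mean(p) - mean(o)."""
    n = len(rows)
    return sum(p for p, _ in rows) / n - sum(o for _, o in rows) / n


def home_bias(home_rows, away_rows, threshold=0.04, min_n=20):
    """Return (home_bias_delta, triggered); (None, False) if either group is too small."""
    if len(home_rows) < min_n or len(away_rows) < min_n:
        return None, False
    delta = group_gap(home_rows) - group_gap(away_rows)
    return delta, abs(delta) > threshold


# Home-favored picks: stated 0.65, hit 60%. Away-favored picks: stated 0.60, hit 60%.
home_rows = [(0.65, 1)] * 12 + [(0.65, 0)] * 8
away_rows = [(0.60, 1)] * 12 + [(0.60, 0)] * 8
delta, triggered = home_bias(home_rows, away_rows)
```

Here the delta is about 0.05, above the 0.04 threshold, so the trigger fires.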

"Your stated confidence when favoring the home team is {delta}pp higher than accuracy warrants. Reduce probability by ~{correction} when edge relies primarily on home advantage."

2.2.3 Narrative Bias (Recency/Streak)

narrative_pull = mean(p | team streak >= 3) - mean(p | no streak)

Trigger if |narrative_pull| > 0.03 and n_streak >= 15.

"Your probabilities show a {magnitude}pp shift following 3+ game streaks. This is not supported by accuracy. Treat streak context as already priced into fundamentals."

2.2.4 Fatigue Bias

Segment by rest-day differential: group A = home team has a rest advantage of >= 1 day, group B = away team has the rest advantage.

fatigue_effect = gap_A - gap_B

Trigger if |fatigue_effect| > 0.04 and n >= 15 per group.

"When home team has superior rest, your predictions show a {magnitude}pp calibration gap. Normalize fatigue adjustment by ~40%."

2.2.5 Injury Bias

Tag predictions with injury_flag in {none, minor, major, multiple}.

For each flag level: B_f = Brier(predictions where flag = f)

Trigger if B_major > B_none + 0.025 or |gap_major - gap_none| > 0.06.

"Your accuracy drops by {delta} Brier points on games with major injuries. Apply 0.03 reduction to probability magnitude (toward 0.50) in injury-affected games."

2.3 Strength Identification

Identify top 2 (analyst, sport, market) triples by lowest Brier (min n=30). Listed FIRST in every digest before corrections.

"STRENGTH: Your [SPORT] [MARKET] Brier of {B} is {delta} below Pinnacle benchmark. This is your most reliable market. Maintain current approach."

2.4 Statistical Significance for Corrections

All bias corrections require:

  1. Statistical significance: p <= 0.10
  2. Effect size: |gap| >= 0.03

If p <= 0.10 but |gap| < 0.03: log MARGINAL_SIGNAL, no correction.
If |gap| >= 0.03 but p > 0.10: log TRENDING, check next digest.
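The gating logic in this section can be sketched as a small decision function. It assumes a correction needs both p <= 0.10 and |gap| >= 0.03, which is how the two logged cases and the Section 9 values (significance alpha 0.10, minimum effect size 0.03) read. The labels `CORRECT` and `NO_ACTION` are hypothetical, and the spec does not say which statistical test produces p.

```python
def correction_decision(p_value, gap, alpha=0.10, min_effect=0.03):
    """Classify a candidate bias correction by significance and effect size."""
    significant = p_value <= alpha
    large_enough = abs(gap) >= min_effect
    if significant and large_enough:
        return "CORRECT"          # hypothetical label: issue the correction
    if significant:
        return "MARGINAL_SIGNAL"  # significant but too small to act on
    if large_enough:
        return "TRENDING"         # large but not yet significant; recheck next digest
    return "NO_ACTION"            # hypothetical label


print(correction_decision(0.05, 0.02))  # MARGINAL_SIGNAL
```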

2.5 Correction Caps

Bias Type          | Max correction magnitude
Compression (beta) | +/-0.15 probability shift
Home bias          | +/-0.06
Narrative          | +/-0.05
Fatigue            | +/-0.04
Injury             | +/-0.04
Combined total     | +/-0.15 per prediction

No more than 5 simultaneous active corrections per analyst.
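One way to enforce these caps before magnitudes are written into a digest is sketched below. The spec does not say how the combined limit is applied, so two choices here are assumptions: the combined total is read as the net signed shift, and an over-limit set is scaled down proportionally. The dictionary keys and function names are illustrative.

```python
CAPS = {
    "compression": 0.15,
    "home": 0.06,
    "narrative": 0.05,
    "fatigue": 0.04,
    "injury": 0.04,
}
COMBINED_CAP = 0.15


def clamp(value, limit):
    """Clamp value into [-limit, +limit]."""
    return max(-limit, min(limit, value))


def apply_caps(corrections):
    """corrections: dict of bias type -> signed shift. Returns capped shifts."""
    capped = {bias: clamp(shift, CAPS[bias]) for bias, shift in corrections.items()}
    total = sum(capped.values())
    if abs(total) > COMBINED_CAP:
        scale = COMBINED_CAP / abs(total)  # assumption: proportional scaling
        capped = {bias: shift * scale for bias, shift in capped.items()}
    return capped


print(apply_caps({"home": 0.10, "injury": -0.02}))  # {'home': 0.06, 'injury': -0.02}
```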

2.6 Digest Structure

CALIBRATION DIGEST v{version}
Analyst: {ANALYST_ID}
Period: {start_date} - {end_date}
Predictions included: {n}

== STRENGTH ZONES ==
[top performing areas]

== CALIBRATION SUMMARY ==
Sport | Market | n | Brier | ECE | beta | Status

== ACTIVE CORRECTIONS ==
Correction ID: {CORR-XXX}
Type: {BIAS_TYPE}
Market: {SPORT}/{MARKET}
Magnitude: {value}
Direction: {direction}
Confidence: p={value}
Instruction: [template text]

== FLAGGED CONCERNS ==
[inversions, extreme claims, etc.]

== DIGEST METADATA ==
Previous: {prev_id}
Added: {list}
Removed: {list}

3. Update Cadence and Triggers

3.1 Scheduled Cadence

Phase                    | Schedule      | Rationale
Baseline (weeks 1-4)     | No digests    | Build clean sample
Early (weeks 5-12)       | Every 4 weeks | Thin sample still
Steady state (weeks 13+) | Every 3 weeks | Standard
Peak (playoffs)          | Every 2 weeks | Higher volume

3.2 Event-Driven Triggers

Trigger                       | Action                     | Latency
ECE > 0.075                   | Soft: add to next digest   | Next scheduled
ECE > 0.100                   | Hard: emergency digest     | 24 hours
CONVICTION_INVERSION n>=20    | Hard                       | 24 hours
Cross-analyst rho > 0.65      | Correlation digest         | 48 hours
Provider model version update | Freeze corrections 21 days | Immediate
n crosses minimum threshold   | First-valid digest         | Next weekly scan

3.3 Baseline Period

Weeks 1-4: ZERO corrections. Clean baseline.

3.4 Human Review Process

  1. System generates raw digest draft
  2. Human reviews and marks APPROVED=TRUE
  3. 48-hour staging window (12 hours for emergencies)
  4. Previous digest available for rollback for 30 days

3.5 Digest Version Control

Format: {ANALYST_ID}-D{version}-{YYYYMMDD}
Example: SONNET-D001-20260501

Each digest supersedes all prior digests of the same scope. Sport-specific corrections override global corrections for that sport.
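A sketch of building and parsing digest IDs in this format. It assumes the version is zero-padded to at least three digits, as in the example SONNET-D001-20260501; the function names are illustrative.

```python
import re
from datetime import date, datetime

_DIGEST_RE = re.compile(r"^(?P<analyst>.+)-D(?P<version>\d+)-(?P<day>\d{8})$")


def format_digest_id(analyst_id, version, day):
    """Build {ANALYST_ID}-D{version}-{YYYYMMDD} from its parts."""
    return f"{analyst_id}-D{version:03d}-{day:%Y%m%d}"


def parse_digest_id(digest_id):
    """Split a digest ID into (analyst_id, version, date)."""
    match = _DIGEST_RE.match(digest_id)
    if match is None:
        raise ValueError(f"not a digest ID: {digest_id!r}")
    day = datetime.strptime(match["day"], "%Y%m%d").date()
    return match["analyst"], int(match["version"]), day


print(format_digest_id("SONNET", 1, date(2026, 5, 1)))  # SONNET-D001-20260501
```

Because the date sorts lexicographically, IDs for one analyst can be ordered as plain strings as long as the version padding is consistent.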


4. Cross-Analyst Correlation Monitoring

4.1 Pairwise Correlation

rho(A, B) = Pearson correlation of {p_i^A} and {p_i^B}

Rolling 60-game window. All 3 pairs: (SONNET, GEMINI), (SONNET, GPT), (GEMINI, GPT).

System average: rho_avg = mean of 3 pairs.
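The pairwise computation can be sketched in plain Python. It assumes each analyst's probabilities are stored in lists aligned by game (same games, same order), which the spec does not state; the function names are illustrative.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two paired observations")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def pairwise_rho(probs_by_analyst, window=60):
    """Return ({(A, B): rho}, rho_avg) over the most recent `window` games."""
    pairs = {}
    for a, b in combinations(sorted(probs_by_analyst), 2):
        pairs[(a, b)] = pearson(
            probs_by_analyst[a][-window:], probs_by_analyst[b][-window:]
        )
    rho_avg = sum(pairs.values()) / len(pairs)
    return pairs, rho_avg
```

With three analysts the dictionary holds exactly three pairs, keyed in alphabetical order of analyst ID.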

4.2 Correlation Thresholds

rho       | Classification
< 0.40    | Healthy independence
0.40-0.55 | Monitor
0.55-0.65 | Concerning (soft alert)
> 0.65    | Critical (trigger action)

4.3 Effective Diversity

D_eff = n / (1 + (n-1) * rho_avg)

Where n is the number of analysts. For n=3: D_eff = 3 / (1 + 2*rho_avg). Must be >= 2.0. Critical alert below 1.5.
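A short worked sketch of the formula (the function name is illustrative) shows which average correlations the two thresholds correspond to:

```python
def effective_diversity(rho_avg, n=3):
    """D_eff = n / (1 + (n - 1) * rho_avg)."""
    return n / (1 + (n - 1) * rho_avg)


print(effective_diversity(0.25))  # 2.0
print(effective_diversity(0.50))  # 1.5
```

For three analysts, D_eff = 2.0 is reached at rho_avg = 0.25 and D_eff = 1.5 at rho_avg = 0.50.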

4.4 Convergence-from-Digests Detection

After each digest cycle:

delta_rho = rho_post(30 games) - rho_pre(30 games)

If delta_rho > +0.08 for any pair within 30 days: flag DIGEST_CONVERGENCE_SUSPECTED.

Rule: No two analysts may receive structurally identical corrections in the same digest cycle.

4.5 Independence Health Report (weekly)

{
  rho_SG, rho_SP, rho_GP, rho_avg, D_eff,
  window_n, status (HEALTHY|MONITOR|ALERT|CRITICAL),
  digest_convergence_flag
}

5. Performance Benchmarks

5.1 Brier Score Targets

Benchmark                       | Brier
Pinnacle de-vig (gold standard) | ~0.218-0.224
Kalshi market-implied           | ~0.220-0.228
Baseline (always 0.50)          | 0.250
Target: beats Pinnacle          | < 0.215
Acceptable analyst              | 0.220-0.235
Weak analyst                    | 0.235-0.245
Worse than baseline             | > 0.245

Pinnacle Brier recomputed each season using closing line de-vigged probabilities.

5.2 Monthly Comparison Table

Metric          | SONNET | GEMINI | GPT-OSS | Scanner | Pinnacle | Kalshi
Overall Brier   |        |        |         |         |          |
NBA SPREAD B    |        |        |         |         |          |
NBA TOTAL B     |        |        |         |         |          |
NBA ML B        |        |        |         |         |          |
ECE overall     |        |        |         |  N/A    |          |

5.3 Ensemble Performance

Simple ensemble: arithmetic mean of 3 analyst probabilities.

Conviction-weighted:

p_ensemble = SUM(p_i * c_i^2) / SUM(c_i^2)

Log ENSEMBLE_UNDERPERFORM if weighted ensemble does not beat best individual over any 50-game window.
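Both ensembles can be sketched in a few lines (function names are illustrative). Section 9 lists a maximum analyst ensemble weight of 0.40; this sketch does not apply it, because the spec does not state how that cap is enforced.

```python
def simple_ensemble(probs):
    """Arithmetic mean of the analyst probabilities."""
    return sum(probs) / len(probs)


def conviction_weighted(probs, convictions):
    """p_ensemble = SUM(p_i * c_i^2) / SUM(c_i^2)."""
    weights = [c ** 2 for c in convictions]
    return sum(p * w for p, w in zip(probs, weights)) / sum(weights)


probs = [0.60, 0.55, 0.70]
convictions = [2, 1, 4]
print(simple_ensemble(probs))
print(conviction_weighted(probs, convictions))
```

Squaring the conviction makes the weighting steep: in the example the conviction-4 analyst carries 16 of the 21 weight units, so the weighted result sits much closer to 0.70 than the simple mean does.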

5.4 ROI Tracking (Hypothetical)

Bet rule: flat 1 unit when p_analyst > p_pinnacle + threshold
Thresholds tested: 0.03, 0.04, 0.05, 0.06, 0.07
ROI = (total_profit / total_wagered) * 100%
CI = ROI +/- 1.96 * sqrt(variance / n)

Analyst "showing edge" if lower bound of 95% CI > 0% at any threshold. Min 100 hypothetical bets.
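A sketch of the ROI and confidence-interval calculation for flat 1-unit stakes, where total_wagered equals the number of bets. It assumes "variance" means the sample variance of per-bet profit and that the interval is the normal approximation given above; the function names are illustrative.

```python
from math import sqrt


def roi_with_ci(profits):
    """profits: per-bet profit in units (e.g. +0.9 for a win, -1.0 for a loss).

    Returns (roi_pct, ci_lower_pct, ci_upper_pct).
    """
    n = len(profits)
    if n < 2:
        raise ValueError("need at least two bets")
    mean = sum(profits) / n  # total_profit / total_wagered for 1-unit stakes
    variance = sum((x - mean) ** 2 for x in profits) / (n - 1)
    half_width = 1.96 * sqrt(variance / n)
    return 100 * mean, 100 * (mean - half_width), 100 * (mean + half_width)


def shows_edge(profits, min_bets=100):
    """True if there are enough bets and the 95% CI lower bound is above 0%."""
    if len(profits) < min_bets:
        return False
    _, lower, _ = roi_with_ci(profits)
    return lower > 0
```

The check would be run once per tested threshold, on the bets selected at that threshold.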

5.5 Skill Score

SS = 1 - (B_analyst / B_baseline)
SS = 1 - (B_analyst / 0.250)

Target: SS > 0.08 (B < 0.230).


6. Degradation Detection

6.1 Rolling Window Brier

B_recent  = Brier over most recent 40 games
B_overall = Brier over all games (or prior 120)
B_delta   = B_recent - B_overall
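A sketch of the rolling comparison, taking per-game squared errors (p_i - o_i)^2 in chronological order and mapping B_delta to the action bands of this section. It assumes B_overall is computed over all games supplied; the label `OK` and the function name are hypothetical.

```python
from statistics import fmean


def degradation_level(sq_errors, recent=40):
    """Return (B_delta, level) from per-game squared errors, oldest first."""
    if len(sq_errors) < recent:
        raise ValueError("not enough games for the recent window")
    b_recent = fmean(sq_errors[-recent:])
    b_overall = fmean(sq_errors)  # assumption: all games, not the prior-120 variant
    b_delta = b_recent - b_overall
    if b_delta > 0.030:
        level = "CRITICAL"
    elif b_delta > 0.020:
        level = "HARD_DEGRADATION"
    elif b_delta > 0.010:
        level = "SOFT_DEGRADATION"
    else:
        level = "OK"  # hypothetical label for no degradation
    return b_delta, level
```

Because the recent games are included in B_overall, a sustained deterioration also raises the comparison value, so B_delta understates the true shift somewhat.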

B_delta | Action
> 0.010 | SOFT_DEGRADATION
> 0.020 | HARD_DEGRADATION (manual review)
> 0.030 | CRITICAL (pause corrections, re-baseline)

6.2 CUSUM Test

S_t = max(0, S_{t-1} + (b_t - B_target - k))
k = 0.005 (slack), h = 5.0 (alert threshold)

Detects sustained ~0.01 Brier shift within ~30 games.
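The one-sided CUSUM recursion can be sketched as below, with b_t taken as the per-game squared error. The spec gives h = 5.0 without stating its units; since the statistic grows by roughly (shift - k) per game, the sketch leaves h as a parameter on the same scale as b_t, and the usage example picks a smaller value purely for illustration. The function name is hypothetical.

```python
def cusum_first_alert(sq_errors, b_target, k=0.005, h=5.0):
    """Return the index of the first game where S_t exceeds h, or None."""
    s = 0.0
    for t, b in enumerate(sq_errors):
        # S_t = max(0, S_{t-1} + (b_t - B_target - k))
        s = max(0.0, s + (b - b_target - k))
        if s > h:
            return t
    return None


# Squared errors of 0.30 against a 0.22 target add 0.075 per game.
print(cusum_first_alert([0.30] * 10, b_target=0.22, h=0.5))  # 6
```

Games that score better than the target pull the statistic back toward zero, so isolated bad results do not accumulate into an alert.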

6.3 Overcorrection Detection

After each correction:

If gap_k reversed sign AND |gap_post| > |gap_pre| * 0.5:
  flag OVERCORRECTION

If OVERCORRECTION flagged twice in 3 consecutive digests: permanently reduce correction cap by 50% for that bias type.

Monitor total_correction_load = SUM(|correction_i|). If > 0.20: prune smallest corrections.
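The overcorrection test and the load-based pruning can be sketched as follows. Function names are illustrative, and pruning strictly by smallest magnitude until the load fits is the plain reading of the rule above rather than a stated algorithm.

```python
def is_overcorrection(gap_pre, gap_post):
    """True if the gap changed sign and is still more than half its old size."""
    reversed_sign = gap_pre * gap_post < 0
    return reversed_sign and abs(gap_post) > 0.5 * abs(gap_pre)


def prune_corrections(corrections, max_load=0.20):
    """Drop the smallest corrections until SUM(|correction_i|) <= max_load."""
    kept = sorted(corrections, key=abs, reverse=True)
    while kept and sum(abs(c) for c in kept) > max_load:
        kept.pop()  # the last element has the smallest magnitude
    return kept


print(is_overcorrection(0.06, -0.04))                # True
print(prune_corrections([0.10, -0.08, 0.05, 0.02]))  # [0.1, -0.08]
```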

6.4 Model Drift Detection

Weekly fingerprint: run 5 synthetic standardized games, track mean probability. If fingerprint shifts > 0.04 in one week without digest update: flag POSSIBLE_MODEL_VERSION_CHANGE.

On provider version update:

  1. Freeze corrections 21 days
  2. Recompute all metrics using post-update data only
  3. Issue fresh digest
  4. Pre-update data archived with PRE_VERSION tag

6.5 Alerting Thresholds Summary

Condition                  | Level    | Action
ECE > 0.075                | WARN     | Next digest
ECE > 0.100                | CRITICAL | Emergency digest
B_delta > 0.020            | WARN     | Manual review
B_delta > 0.030            | CRITICAL | Pause, re-baseline
CONVICTION_INVERSION n>=20 | WARN     | Include in digest
rho > 0.65                 | CRITICAL | Correlation digest
D_eff < 1.5                | CRITICAL | Human review
OVERCORRECTION x2          | SYSTEMIC | Reduce cap
Fingerprint > 0.04         | WARN     | Version check
Correction load > 0.20     | WARN     | Prune corrections

7. Long-Term Model Evolution

7.1 Replacement Criteria

An analyst becomes a replacement candidate if ALL of the following hold over 90+ days:

  1. Overall Brier > 0.240
  2. ECE > 0.065 on 2+ sports
  3. ROI 95% CI lower bound < -3%
  4. No improvement trend in 45 days

Meeting all 4 criteria triggers the replacement process. Meeting any single criterion triggers a review.

7.2 A/B Testing

Phase 1, Shadow (30 games): New model predicts in parallel, not used in ensemble.
Phase 2, Partial Weight (30 games): New model at 0.5x weight alongside incumbent.
Phase 3, Promotion: Promote if B_new < B_old - 0.005 AND ECE_new < ECE_old AND the improvement is significant at p < 0.20.

Fresh 14-day baseline after promotion.

7.3 Expansion from 3 to 5

Triggers (need all 3):

  1. D_eff < 1.8 consistently over 60 days
  2. All 3 analysts degrading in same market
  3. Volume > 150 predictions/week

Always add in pairs (4th + 5th together) to maintain odd-number ensemble.

7.4 Provider Version Upgrades

Day  | Action
0    | Log VERSION_EVENT, freeze corrections
0-21 | Re-baseline, collect clean data
21   | Assess: Brier differs by > 0.010?
21+  | If yes: fresh digest. If no: resume existing.

8. Data Retention and Audit Trail

8.1 Core Tables

analyst_picks: Every prediction from every analyst. PERMANENT. Never deleted.
outcomes: Actual game results. PERMANENT.
calibration_snapshots: Computed metrics per period. PERMANENT.
digests: Full text of every calibration digest. PERMANENT.
corrections: Individual bias corrections. PERMANENT (status updated, never deleted).
correlation_log: Weekly independence metrics. PERMANENT.
model_version_events: Provider upgrades. PERMANENT.
audit_log: Append-only event log. PERMANENT.
fingerprint_snapshots: Model drift detection. 2-year rolling retention.

8.2 Immutability Rule

Corrections are metadata, not mutations. The raw stated probability from the analyst is ALWAYS preserved. Corrections are instructions in the system prompt, not transformations of output.

8.3 Time-Travel Reconstruction

Every prediction row records digest_active_id (which corrections were in effect). Combined with the corrections table, you can reconstruct exactly what the system believed at any point in time.

Counterfactual analysis is supported: re-run ensemble calculations with any subset of corrections applied or omitted, because raw probabilities are always stored.

8.4 Audit Requirements

All events written to audit_log (append-only, never updated/deleted):


9. Operational Parameters — Quick Reference

Parameter                        | Value
Min n overall Brier              | 50
Min n per-sport                  | 30
Min n per-market                 | 30
Min n conviction bucket          | 20
Min n decile bucket              | 15
Min n bias computation           | 20
Min n correlation                | 40
ECE acceptable                   | 0.050
ECE soft trigger                 | 0.075
ECE hard trigger                 | 0.100
Beta well-calibrated             | 0.90-1.10
Correction significance alpha    | 0.10
Min effect size                  | 0.03
Max simultaneous corrections     | 5
Max total correction load        | 0.15
Correlation monitor threshold    | 0.40
Correlation critical             | 0.65
Min D_eff                        | 2.0 (alert 1.5)
Convergence delta_rho threshold  | +0.08
CUSUM h                          | 5.0
CUSUM k                          | 0.005
Hard degradation B_delta         | 0.020
Critical degradation B_delta     | 0.030
Re-baseline after version update | 21 days
A/B shadow phase n               | 30
A/B partial phase n              | 30
A/B promotion alpha              | 0.20
Max analyst ensemble weight      | 0.40
Baseline no-correction period    | 4 weeks
Fingerprint drift alert          | 0.04
Conviction inflation flag        | > 40% at c=4-5
Conviction hedging flag          | > 60% at c=1-2
Home bias threshold              | 0.04
Narrative bias threshold         | 0.03
Fatigue effect threshold         | 0.04
Injury accuracy drop             | 0.025 Brier
Emergency staging window         | 12 hours
Standard staging window          | 48 hours
Rollback availability            | 30 days
Fingerprint retention            | 2 years

10. Document Control

Amendments require:

  1. Human operator approval
  2. Audit log entry with entity_type = 'specification'
  3. New version number (semantic: major.minor)
  4. 48-hour review window

When code and document conflict, the document wins until formally amended.

End of CAPS-001 v1.0

Source: ~/edgeclaw/docs/CALIBRATION-SPEC.md