Calibration Specification — Blind Analyst Prediction System (CAPS-001)

Version: 1.0
Date: 2026-03-30
Status: AUTHORITATIVE (governs the system until explicitly superseded)
Panel: Opus 4.6 · Sonnet 4.6 · Gemini 3.1 Pro · Grok 4.2 Reasoning · gpt-oss-120b

0. System Overview

Three blind AI analysts (Claude Sonnet 4.6, Gemini 3.1 Pro Preview, gpt-oss-120b) independently set sports betting lines from fundamentals only; they see no market prices. Each produces spread, total, and moneyline (ML) lines, a probability for the side it expects to cover, a conviction score (1-5), supporting evidence, and an upset scenario.

Their output is compared against a mathematical edge scanner (Pinnacle de-vig probability curves) and Kalshi market prices. A separate Opus 4.6 instance makes the final verdict with position sizing.

Model Roster (locked):

Analyst 1: Claude Sonnet 4.6 (claude-sonnet-4-6) — Anthropic
Analyst 2: Gemini 3.1 Pro Preview (google/gemini-3.1-pro-preview) — Google/OpenRouter
Analyst 3: gpt-oss-120b (openai/gpt-oss-120b) — OpenAI/OpenRouter
Verdict: Opus 4.6 — CLI bridge (free)

No substitutions without a full panel ruling.

1. Calibration Measurement

1.1 Brier Score

Primary per-prediction accuracy metric.

B = (1/n) * SUM( (p_i - o_i)^2 )

Where p_i is the analyst's stated probability for the side it picked and o_i is the actual outcome for that side (1 = the pick was correct, 0 = it was incorrect).

Range: [0, 1]. Lower is better.
Baseline (always predict 0.50): B = 0.250
Pinnacle de-vig benchmark: ~0.218-0.224 (recomputed each season)
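The formula above can be sketched in a few lines of Python. This is a minimal illustration, not the system's implementation; the function name brier_score is hypothetical.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared difference between stated probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("probs and outcomes must be non-empty and equal length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


# Always predicting 0.50 gives 0.25 regardless of the outcomes,
# which is where the 0.250 baseline comes from.
baseline = brier_score([0.5, 0.5, 0.5, 0.5], [1, 0, 0, 1])
```

Each prediction at 0.50 contributes (0.5 - o)^2 = 0.25 whether o is 0 or 1, so the baseline does not depend on the results.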

Stratified Brier scores are computed for each of the following groupings:

Analyst x Market Type (spread/total/ML) x Sport
Analyst x Conviction bucket (c=1-2 low, c=3 mid, c=4-5 high)
Analyst x Home/Away (for spread)
Analyst x Game pace tier (for total)

1.2 Probability Decile Calibration Curve

For each analyst, bin all predictions into 10 equal-width buckets:

Bucket k: predictions where p_i in [(k-1)/10, k/10), k = 1..10 (bucket 10 also includes p_i = 1.0)

For each bucket:
  conf_k   = mean(p_i) for all i in bucket k
  acc_k    = mean(o_i) for all i in bucket k
  n_k      = count of predictions in bucket k
  gap_k    = conf_k - acc_k   (positive = overconfident)

A bucket must have n_k >= 15 before it is used in ECE.
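A possible sketch of the bucketing step follows. The function name decile_buckets and the dictionary layout are assumptions for illustration; folding p = 1.0 into bucket 10 is also an assumption about how the top edge is handled.

```python
from typing import Dict, List, Sequence, Tuple


def decile_buckets(probs: Sequence[float], outcomes: Sequence[int]) -> Dict[int, dict]:
    """Group predictions into 10 equal-width buckets and compute per-bucket stats."""
    rows_by_bucket: Dict[int, List[Tuple[float, int]]] = {}
    for p, o in zip(probs, outcomes):
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range: {p}")
        # Bucket k covers [(k-1)/10, k/10); p = 1.0 is folded into bucket 10 (assumption).
        k = min(int(p * 10) + 1, 10)
        rows_by_bucket.setdefault(k, []).append((p, o))

    stats: Dict[int, dict] = {}
    for k, rows in sorted(rows_by_bucket.items()):
        n_k = len(rows)
        conf_k = sum(p for p, _ in rows) / n_k
        acc_k = sum(o for _, o in rows) / n_k
        # Positive gap means stated confidence exceeded realized accuracy.
        stats[k] = {"n": n_k, "conf": conf_k, "acc": acc_k, "gap": conf_k - acc_k}
    return stats


example = decile_buckets([0.62, 0.68, 0.55, 1.0], [1, 0, 1, 1])
```

Empty buckets are simply absent from the result; the n_k >= 15 filter is left to the caller that computes ECE.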

1.3 Expected Calibration Error (ECE)

ECE = SUM_k [ (n_k / n) * |gap_k| ]

Sum over valid buckets only (n_k >= 15).

ECE	Classification
< 0.030	Excellent
0.030-0.050	Good
0.050-0.075	Acceptable — monitor
0.075-0.100	Degraded — soft correction trigger
> 0.100	Critical — hard correction trigger
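The ECE formula and the classification table can be sketched as below. The names are hypothetical, and one point is an assumption: the spec does not say whether n in n_k / n counts all predictions or only those in valid buckets, so this sketch normalizes by the valid-bucket total.

```python
from typing import Iterable, Optional


def expected_calibration_error(buckets: Iterable[dict], min_n: int = 15) -> Optional[float]:
    """Weighted mean absolute gap over buckets with at least min_n predictions."""
    valid = [b for b in buckets if b["n"] >= min_n]
    n = sum(b["n"] for b in valid)  # assumption: n counts valid buckets only
    if n == 0:
        return None  # nothing to score yet
    return sum(b["n"] / n * abs(b["gap"]) for b in valid)


def classify_ece(ece: float) -> str:
    """Map an ECE value to the classification table in this section."""
    if ece < 0.030:
        return "Excellent"
    if ece < 0.050:
        return "Good"
    if ece < 0.075:
        return "Acceptable"
    if ece <= 0.100:
        return "Degraded"
    return "Critical"


# The third bucket has n = 5 and is ignored.
ece = expected_calibration_error(
    [{"n": 20, "gap": 0.10}, {"n": 30, "gap": -0.05}, {"n": 5, "gap": 0.50}]
)
```

Here the result is (20/50) * 0.10 + (30/50) * 0.05 = 0.07, which falls in the "Acceptable" band.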

1.4 Reliability Diagram Slope (Sharpness)

OLS regression through calibration points:

acc_k = alpha + beta * conf_k

Ideal: beta = 1.0, alpha = 0.0

beta	Interpretation
< 0.70	Severely compressed
0.70-0.90	Moderate compression
0.90-1.10	Well-calibrated
> 1.10	Over-spread
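As an illustration, the regression can be computed with closed-form ordinary least squares. The function names are hypothetical, and the labels returned by classify_beta are copied from the table above rather than derived independently.

```python
from typing import Sequence, Tuple


def reliability_slope(conf: Sequence[float], acc: Sequence[float]) -> Tuple[float, float]:
    """Fit acc_k = alpha + beta * conf_k by ordinary least squares."""
    n = len(conf)
    if n != len(acc) or n < 2:
        raise ValueError("need at least two calibration points")
    mean_x = sum(conf) / n
    mean_y = sum(acc) / n
    sxx = sum((x - mean_x) ** 2 for x in conf)
    if sxx == 0:
        raise ValueError("all conf_k values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(conf, acc))
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    return alpha, beta


def classify_beta(beta: float) -> str:
    """Look up the interpretation label from the table in this section."""
    if beta < 0.70:
        return "Severely compressed"
    if beta < 0.90:
        return "Moderate compression"
    if beta <= 1.10:
        return "Well-calibrated"
    return "Over-spread"


alpha, beta = reliability_slope([0.1, 0.3, 0.5, 0.7, 0.9], [0.2, 0.35, 0.5, 0.65, 0.8])
```

This sketch weights every bucket equally; whether buckets should be weighted by n_k is not stated in the spec.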

1.5 Conviction Calibration

For each conviction level c in {1, 2, 3, 4, 5}, compute:

B_c = Brier score of all predictions with conviction c

Expected pattern: B_c should strictly decrease as c increases.

Conviction	Expected Max B	Hard Flag
1	0.240	> 0.250
2	0.230	> 0.245
3	0.215	> 0.235
4	0.195	> 0.215
5	0.170	> 0.200

If B_c > B_{c-1} (worse at higher conviction): trigger CONVICTION_INVERSION flag.

Conviction frequency distribution: Flag if any analyst issues > 40% of predictions at c=4-5 (inflation) or > 60% at c=1-2 (hedging).
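The inversion and frequency checks might look like the following sketch. The function names and the INFLATION / HEDGING labels are hypothetical; the n >= 20 gate on both levels being compared is an assumption based on the minimum sample size for conviction buckets.

```python
from collections import Counter
from typing import Dict, List, Sequence


def conviction_inversions(
    brier_by_c: Dict[int, float], n_by_c: Dict[int, int], min_n: int = 20
) -> List[int]:
    """Return conviction levels c where B_c > B_{c-1} (worse at higher conviction)."""
    flagged = []
    for c in range(2, 6):
        if c not in brier_by_c or c - 1 not in brier_by_c:
            continue
        # Assumption: both levels need min_n predictions before comparing.
        if n_by_c.get(c, 0) < min_n or n_by_c.get(c - 1, 0) < min_n:
            continue
        if brier_by_c[c] > brier_by_c[c - 1]:
            flagged.append(c)
    return flagged


def conviction_frequency_flags(convictions: Sequence[int]) -> List[str]:
    """Flag inflation (> 40% at c=4-5) or hedging (> 60% at c=1-2)."""
    if not convictions:
        return []
    counts = Counter(convictions)
    n = len(convictions)
    flags = []
    if (counts[4] + counts[5]) / n > 0.40:
        flags.append("INFLATION")  # hypothetical label
    if (counts[1] + counts[2]) / n > 0.60:
        flags.append("HEDGING")  # hypothetical label
    return flags


inversions = conviction_inversions(
    {1: 0.240, 2: 0.230, 3: 0.235, 4: 0.190, 5: 0.170},
    {1: 25, 2: 25, 3: 25, 4: 25, 5: 25},
)
```

In the example, conviction 3 scores 0.235 against 0.230 at conviction 2, so level 3 is the only inversion.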

1.6 Internal Consistency Checks

Per-game checks (not aggregate):

a. Spread-ML Consistency

implied_ML = Phi(spread / sigma_sport)

Where Phi is the standard normal CDF and sigma_sport = 12 (NFL), 7 (NBA/NCAAB), 1.4 (NHL goals).

Flag if: |analyst_ML_prob - implied_ML| > 0.08

b. Probability Sum Check

p_home + p_away = 1.0 +/- 0.005

Hard reject if outside tolerance. Exclude from calibration.

c. Extreme Probability Gate

Any stated probability outside [0.05, 0.95] is flagged EXTREME_CLAIM and logged separately.
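The three per-game checks can be sketched together. Several things here are assumptions: the function names, the PROB_SUM_REJECT and SPREAD_ML_MISMATCH labels, and the sign convention that the spread argument is the home team's expected margin of victory (positive when the home team is favored). Only EXTREME_CLAIM is a flag name defined by this spec.

```python
from statistics import NormalDist
from typing import List

# Sigma values per sport as listed in check (a).
SIGMA = {"NFL": 12.0, "NBA": 7.0, "NCAAB": 7.0, "NHL": 1.4}


def implied_ml_from_spread(home_margin: float, sport: str) -> float:
    """Win probability implied by an expected margin under a normal model."""
    return NormalDist().cdf(home_margin / SIGMA[sport])


def check_game(p_home: float, p_away: float, home_margin: float, sport: str) -> List[str]:
    """Run the per-game consistency checks and return any flags raised."""
    flags = []
    # (b) Probabilities must sum to 1.0 within +/- 0.005.
    if abs(p_home + p_away - 1.0) > 0.005:
        flags.append("PROB_SUM_REJECT")  # hypothetical label
    # (a) Stated ML probability vs. the probability implied by the spread.
    if abs(p_home - implied_ml_from_spread(home_margin, sport)) > 0.08:
        flags.append("SPREAD_ML_MISMATCH")  # hypothetical label
    # (c) Anything outside [0.05, 0.95] is logged separately.
    if any(p < 0.05 or p > 0.95 for p in (p_home, p_away)):
        flags.append("EXTREME_CLAIM")
    return flags


flags = check_game(p_home=0.60, p_away=0.40, home_margin=3.5, sport="NBA")
```

With a 3.5-point home margin and sigma = 7, the implied probability is Phi(0.5), roughly 0.69, so a stated 0.60 differs by more than 0.08 and is flagged.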

1.7 Minimum Sample Size Requirements

Scope	Minimum n
Analyst overall Brier	50
Per-sport Brier	30
Per-market-type Brier	30
Per-sport per-market	25
Conviction level bucket	20
Decile bucket (ECE)	15
Bias-specific computation	20
Cross-analyst correlation	40 pairs

Scores computed below these thresholds are labeled [PROVISIONAL] and never trigger automated corrections.

2. Calibration Digest Generation

2.1 Digest Trigger Conditions

A digest is generated when ANY of:

Scheduled cadence has elapsed (Section 3)
Any ECE exceeds 0.075 (soft trigger)
Any ECE exceeds 0.100 (hard trigger — immediate, not waiting for schedule)
CONVICTION_INVERSION flag with n >= 20
Cross-analyst correlation rho > 0.65 (Section 4)
Sample size crosses minimum threshold for first time

2.2 Bias Computation Algorithms

2.2.1 Compression Bias

compression_score = beta_slope - 1.0  (from Section 1.4)

If compression_score < -0.20:

"Your probability estimates for [SPORT] [MARKET] are compressed toward 50%. When fundamentals indicate a strong lean (p > 0.65 or p < 0.35), increase stated probability by approximately {magnitude}pp."

2.2.2 Home Court/Ice Bias

home_bias_delta = mean(gap_k | home favored) - mean(gap_k | away favored)

Trigger if |home_bias_delta| > 0.04 and n >= 20 per group.

"Your stated confidence when favoring the home team is {delta}pp higher than accuracy warrants. Reduce probability by ~{correction} when edge relies primarily on home advantage."

2.2.3 Narrative Bias (Recency/Streak)

narrative_pull = mean(p | team streak >= 3) - mean(p | no streak)

Trigger if |narrative_pull| > 0.03 and n_streak >= 15.

"Your probabilities show a {magnitude}pp shift following 3+ game streaks. This is not supported by accuracy. Treat streak context as already priced into fundamentals."

2.2.4 Fatigue Bias

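Segment predictions by rest-day differential into group A (home team has a rest advantage of >= 1 day) and group B (away team has the rest advantage):

fatigue_effect = gap_A - gap_B

Where gap_A and gap_B are the mean calibration gaps (confidence minus accuracy) of each group.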

Trigger if |fatigue_effect| > 0.04 and n >= 15 per group.

"When home team has superior rest, your predictions show a {magnitude}pp calibration gap. Normalize fatigue adjustment by ~40%."

2.2.5 Injury Bias

Tag predictions with injury_flag in {none, minor, major, multiple}.

For each flag level: B_f = Brier(predictions where flag = f)

Trigger if B_major > B_none + 0.025 or |gap_major - gap_none| > 0.06.

"Your accuracy drops by {delta} Brier points on games with major injuries. Apply 0.03 reduction to probability magnitude (toward 0.50) in injury-affected games."

2.3 Strength Identification

For each analyst, identify the top 2 (sport, market) combinations by lowest Brier (min n = 30). These are listed FIRST in every digest, before corrections.

"STRENGTH: Your [SPORT] [MARKET] Brier of {B} is {delta} below Pinnacle benchmark. This is your most reliable market. Maintain current approach."

2.4 Statistical Significance for Corrections

All bias corrections require:

n >= minimum for category (Section 1.7)
Two-sided t-test or binomial test p <= 0.10
Effect size: |gap| >= 0.03

If p <= 0.10 but |gap| < 0.03: log MARGINAL_SIGNAL, no correction.
If |gap| >= 0.03 but p > 0.10: log TRENDING, check next digest.
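The gating rules above reduce to a small decision function. This is a sketch only: the function name and the CORRECT and NONE return values are hypothetical, while PROVISIONAL, MARGINAL_SIGNAL, and TRENDING are the labels used in this spec. The p-value is assumed to come from whichever test (t-test or binomial) the caller ran.

```python
def correction_decision(
    n: int,
    min_n: int,
    p_value: float,
    gap: float,
    alpha: float = 0.10,
    min_effect: float = 0.03,
) -> str:
    """Decide whether a detected bias qualifies for a correction."""
    if n < min_n:
        return "PROVISIONAL"  # below Section 1.7 minimum: never auto-correct
    significant = p_value <= alpha
    large_enough = abs(gap) >= min_effect
    if significant and large_enough:
        return "CORRECT"  # hypothetical label: all three requirements met
    if significant:
        return "MARGINAL_SIGNAL"  # real but too small to act on
    if large_enough:
        return "TRENDING"  # big but not yet significant; recheck next digest
    return "NONE"  # hypothetical label


decision = correction_decision(n=45, min_n=20, p_value=0.04, gap=0.05)
```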

2.5 Correction Caps

Bias Type	Max correction magnitude
Compression (beta)	+/-0.15 probability shift
Home bias	+/-0.06
Narrative	+/-0.05
Fatigue	+/-0.04
Injury	+/-0.04
Combined total	+/-0.15 per prediction

No more than 5 simultaneous active corrections per analyst.
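One way to enforce the caps when computing the magnitudes that go into a digest is sketched below. The names are hypothetical. The spec does not say whether the combined cap applies to the signed sum or to the sum of absolute values; this sketch assumes the signed sum. The limit of 5 simultaneous corrections is not enforced here.

```python
from typing import Dict, Tuple

# Per-type caps from the table in this section.
CAPS = {
    "compression": 0.15,
    "home": 0.06,
    "narrative": 0.05,
    "fatigue": 0.04,
    "injury": 0.04,
}
COMBINED_CAP = 0.15


def clamp(value: float, limit: float) -> float:
    """Restrict value to the interval [-limit, +limit]."""
    return max(-limit, min(limit, value))


def apply_caps(corrections: Dict[str, float]) -> Tuple[Dict[str, float], float]:
    """Cap each correction by type, then cap the combined signed total."""
    capped = {bias: clamp(mag, CAPS[bias]) for bias, mag in corrections.items()}
    # Assumption: the combined cap is applied to the signed sum.
    total = clamp(sum(capped.values()), COMBINED_CAP)
    return capped, total


capped, total = apply_caps({"home": 0.10, "injury": -0.02})
```

In the example the home correction is cut from 0.10 to 0.06, and the net shift comes out at about 0.04.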

2.6 Digest Structure

CALIBRATION DIGEST v{version}
Analyst: {ANALYST_ID}
Period: {start_date} - {end_date}
Predictions included: {n}

== STRENGTH ZONES ==
[top performing areas]

== CALIBRATION SUMMARY ==
Sport | Market | n | Brier | ECE | beta | Status

== ACTIVE CORRECTIONS ==
Correction ID: {CORR-XXX}
Type: {BIAS_TYPE}
Market: {SPORT}/{MARKET}
Magnitude: {value}
Direction: {direction}
Confidence: p={value}
Instruction: [template text]

== FLAGGED CONCERNS ==
[inversions, extreme claims, etc.]

== DIGEST METADATA ==
Previous: {prev_id}
Added: {list}
Removed: {list}

3. Update Cadence and Triggers

3.1 Scheduled Cadence

Phase	Schedule	Rationale
Baseline (weeks 1-4)	No digests	Build clean sample
Early (weeks 5-12)	Every 4 weeks	Thin sample still
Steady state (weeks 13+)	Every 3 weeks	Standard
Peak (playoffs)	Every 2 weeks	Higher volume

3.2 Event-Driven Triggers

Trigger	Action	Latency
ECE > 0.075	Soft: add to next digest	Next scheduled
ECE > 0.100	Hard: emergency digest	24 hours
CONVICTION_INVERSION n>=20	Hard	24 hours
Cross-analyst rho > 0.65	Correlation digest	48 hours
Provider model version update	Freeze corrections 21 days	Immediate
n crosses minimum threshold	First-valid digest	Next weekly scan

3.3 Baseline Period

Weeks 1-4: ZERO corrections. Clean baseline.

All data collected and scored
NO instructions sent to analysts
Baseline data tagged BASELINE=TRUE, excluded from future bias calculations
Baseline period is permanent and irrevocable

3.4 Human Review Process

System generates raw digest draft
Human reviews and marks APPROVED=TRUE
48-hour staging window (12 hours for emergencies)
Previous digest available for rollback for 30 days

3.5 Digest Version Control

Format: {ANALYST_ID}-D{version}-{YYYYMMDD}
Example: SONNET-D001-20260501

Each digest supersedes all prior of same scope. Sport-specific corrections override global for that sport.
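A digest ID in this format could be built as follows. The helper name is hypothetical, and the three-digit zero padding of the version is inferred from the single example above.

```python
from datetime import date


def digest_id(analyst_id: str, version: int, issued: date) -> str:
    """Build an ID of the form {ANALYST_ID}-D{version}-{YYYYMMDD}."""
    # Three-digit zero padding is inferred from the example SONNET-D001-20260501.
    return f"{analyst_id}-D{version:03d}-{issued:%Y%m%d}"


example_id = digest_id("SONNET", 1, date(2026, 5, 1))
```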

4. Cross-Analyst Correlation Monitoring

4.1 Pairwise Correlation

rho(A, B) = Pearson correlation of {p_i^A} and {p_i^B}

Rolling 60-game window. All 3 pairs: (SONNET, GEMINI), (SONNET, GPT), (GEMINI, GPT).

System average: rho_avg = mean of 3 pairs.
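The pairwise computation can be sketched with the standard library only. The function names and the dictionary layout are assumptions; the sketch also assumes each analyst's list is aligned game by game, and it does not enforce the 40-pair minimum.

```python
from itertools import combinations
from math import sqrt
from typing import Dict, Sequence, Tuple


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation of two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined for a constant series")
    return sxy / sqrt(sxx * syy)


def pairwise_rho(
    probs: Dict[str, Sequence[float]], window: int = 60
) -> Tuple[Dict[Tuple[str, str], float], float]:
    """Correlate every analyst pair over the most recent `window` games."""
    rhos = {}
    for a, b in combinations(sorted(probs), 2):
        rhos[(a, b)] = pearson(list(probs[a])[-window:], list(probs[b])[-window:])
    rho_avg = sum(rhos.values()) / len(rhos)
    return rhos, rho_avg


rhos, rho_avg = pairwise_rho(
    {
        "SONNET": [0.50, 0.60, 0.70, 0.55],
        "GEMINI": [0.52, 0.61, 0.69, 0.56],
        "GPT": [0.70, 0.60, 0.50, 0.65],
    }
)
```

Pairs are keyed in alphabetical order, so the SONNET and GPT pair appears as ("GPT", "SONNET").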

4.2 Correlation Thresholds

rho	Classification
< 0.40	Healthy independence
0.40-0.55	Monitor
0.55-0.65	Concerning — soft alert
> 0.65	Critical — trigger action

4.3 Effective Diversity

D_eff = n / (1 + (n - 1) * rho_avg)

Where n is the number of analysts. For n = 3: D_eff = 3 / (1 + 2 * rho_avg). Must be >= 2.0. Alert if D_eff falls below 1.5.
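A short sketch makes the relationship between rho_avg and D_eff concrete. The function names and the status labels are hypothetical.

```python
def effective_diversity(rho_avg: float, n_analysts: int = 3) -> float:
    """Effective number of independent analysts given the average correlation."""
    denom = 1 + (n_analysts - 1) * rho_avg
    if denom <= 0:
        raise ValueError("rho_avg too negative for this formula")
    return n_analysts / denom


def diversity_status(d_eff: float) -> str:
    """Hypothetical labels for the two thresholds in this section."""
    if d_eff < 1.5:
        return "ALERT"
    if d_eff < 2.0:
        return "BELOW_MINIMUM"
    return "OK"


d_at_025 = effective_diversity(0.25)  # 3 / 1.5
d_at_050 = effective_diversity(0.50)  # 3 / 2.0
```

For three analysts, D_eff = 2.0 corresponds to rho_avg = 0.25 and D_eff = 1.5 corresponds to rho_avg = 0.50; fully uncorrelated analysts give D_eff = 3.0.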

4.4 Convergence-from-Digests Detection

After each digest cycle:

delta_rho = rho_post(30 games) - rho_pre(30 games)

If delta_rho > +0.08 for any pair within 30 days: flag DIGEST_CONVERGENCE_SUSPECTED.

Rule: No two analysts may receive structurally identical corrections in same digest cycle.

4.5 Independence Health Report (weekly)

{
  rho_SG, rho_SP, rho_GP, rho_avg, D_eff,
  window_n, status (HEALTHY|MONITOR|ALERT|CRITICAL),
  digest_convergence_flag
}

5. Performance Benchmarks

5.1 Brier Score Targets

Benchmark	Brier
Pinnacle de-vig (gold standard)	~0.218-0.224
Kalshi market-implied	~0.220-0.228
Baseline (always 0.50)	0.250
Target: beats Pinnacle	< 0.215
Acceptable analyst	0.220-0.235
Weak analyst	0.235-0.245
Near or worse than baseline	> 0.245

Pinnacle Brier recomputed each season using closing line de-vigged probabilities.

5.2 Monthly Comparison Table

Metric          | SONNET | GEMINI | GPT-OSS | Scanner | Pinnacle | Kalshi
Overall Brier   |        |        |         |         |          |
NBA SPREAD B    |        |        |         |         |          |
NBA TOTAL B     |        |        |         |         |          |
NBA ML B        |        |        |         |         |          |
ECE overall     |        |        |         |  N/A    |          |

5.3 Ensemble Performance

Simple ensemble: arithmetic mean of 3 analyst probabilities.

Conviction-weighted:

p_ensemble = SUM(p_i * c_i^2) / SUM(c_i^2)

Log ENSEMBLE_UNDERPERFORM if weighted ensemble does not beat best individual over any 50-game window.
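Both ensemble rules are small enough to show directly. The function names are hypothetical; the inputs are the three analysts' probabilities for the same side of the same game, with matching conviction scores.

```python
from typing import Sequence


def simple_ensemble(probs: Sequence[float]) -> float:
    """Arithmetic mean of the analyst probabilities."""
    return sum(probs) / len(probs)


def conviction_weighted_ensemble(probs: Sequence[float], convictions: Sequence[int]) -> float:
    """Weight each probability by the square of its conviction score."""
    if len(probs) != len(convictions) or not probs:
        raise ValueError("probs and convictions must be non-empty and equal length")
    weights = [c ** 2 for c in convictions]
    return sum(p * w for p, w in zip(probs, weights)) / sum(weights)


probs = [0.60, 0.55, 0.70]
convictions = [2, 3, 5]
p_simple = simple_ensemble(probs)
p_weighted = conviction_weighted_ensemble(probs, convictions)
```

With weights 4, 9, and 25, the weighted result is 24.85 / 38, about 0.654, compared with a simple mean of about 0.617. Squaring the conviction lets a single c=5 pick dominate two lower-conviction picks.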

5.4 ROI Tracking (Hypothetical)

Bet rule: flat 1 unit when p_analyst > p_pinnacle + threshold
Thresholds tested: 0.03, 0.04, 0.05, 0.06, 0.07
ROI = (total_profit / total_wagered) * 100%
CI = ROI +/- 1.96 * sqrt(variance / n)

Analyst "showing edge" if lower bound of 95% CI > 0% at any threshold. Min 100 hypothetical bets.
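The ROI and confidence interval calculation could be sketched as follows. The function names are hypothetical, and the sketch assumes that "variance" means the sample variance of per-bet returns expressed in percent of the 1-unit stake.

```python
from math import sqrt
from statistics import mean, variance
from typing import Sequence, Tuple


def roi_with_ci(profits: Sequence[float]) -> Tuple[float, float, float]:
    """Return (ROI %, CI lower, CI upper) for flat 1-unit bets.

    `profits` holds the per-bet profit in units (e.g. +0.91 or -1.0).
    """
    n = len(profits)
    returns_pct = [p * 100 for p in profits]
    roi = mean(returns_pct)
    # Assumption: sample variance of per-bet percentage returns.
    half_width = 1.96 * sqrt(variance(returns_pct) / n)
    return roi, roi - half_width, roi + half_width


def shows_edge(profits: Sequence[float], min_bets: int = 100) -> bool:
    """True if there are enough bets and the 95% CI lower bound is above 0%."""
    if len(profits) < min_bets:
        return False
    _, lower, _ = roi_with_ci(profits)
    return lower > 0.0


# 60 wins and 40 losses at even money, purely as illustrative input.
sample = [1.0] * 60 + [-1.0] * 40
roi, lower, upper = roi_with_ci(sample)
```

Because the rule passes if any of the five thresholds clears the bar, testing several thresholds raises the chance of a false positive; the sketch does not adjust for that.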

5.5 Skill Score

SS = 1 - (B_analyst / B_baseline)
SS = 1 - (B_analyst / 0.250)

Target: SS > 0.08 (B < 0.230).

6. Degradation Detection

6.1 Rolling Window Brier

B_recent  = Brier over most recent 40 games
B_overall = Brier over all games (or prior 120)
B_delta   = B_recent - B_overall

B_delta	Action
> 0.010	SOFT_DEGRADATION
> 0.020	HARD_DEGRADATION — manual review
> 0.030	CRITICAL — pause corrections, re-baseline
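A sketch of the rolling comparison follows. The names are hypothetical; the input is assumed to be a chronological list of per-prediction squared errors, and B_overall is taken over all of them rather than the prior 120.

```python
from typing import Optional, Sequence, Tuple


def classify_b_delta(b_recent: float, b_overall: float) -> Tuple[float, str]:
    """Apply the B_delta thresholds from the table in this section."""
    delta = b_recent - b_overall
    if delta > 0.030:
        return delta, "CRITICAL"
    if delta > 0.020:
        return delta, "HARD_DEGRADATION"
    if delta > 0.010:
        return delta, "SOFT_DEGRADATION"
    return delta, "OK"  # hypothetical label for no action


def rolling_degradation(
    sq_errors: Sequence[float], recent_n: int = 40
) -> Optional[Tuple[float, str]]:
    """Compare the most recent window against the full history."""
    if len(sq_errors) < recent_n:
        return None  # not enough games for a recent window
    recent = sq_errors[-recent_n:]
    b_recent = sum(recent) / len(recent)
    b_overall = sum(sq_errors) / len(sq_errors)  # assumption: all games
    return classify_b_delta(b_recent, b_overall)


# 80 older predictions averaging 0.20, then 40 recent ones averaging 0.26.
result = rolling_degradation([0.20] * 80 + [0.26] * 40)
```

Note that B_overall includes the recent window, which dampens B_delta: in the example the overall Brier is 0.22, so the delta is 0.04 rather than 0.06.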

6.2 CUSUM Test

S_t = max(0, S_{t-1} + (b_t - B_target - k))
b_t = (p_t - o_t)^2 (squared error of prediction t), S_0 = 0
k = 0.005 (slack), h = 5.0 (alert threshold); alert when S_t > h

Detects sustained ~0.01 Brier shift within ~30 games.
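The recursion is a one-sided CUSUM and can be sketched as below. The function name is hypothetical. The spec does not state the units of h relative to b_t; this sketch treats b_t, k, and h as raw squared-error units, which is an assumption, so both k and h are left as parameters.

```python
from typing import Optional, Sequence


def cusum_first_alert(
    sq_errors: Sequence[float], b_target: float, k: float = 0.005, h: float = 5.0
) -> Optional[int]:
    """Return the 1-based index of the first prediction where S_t > h, or None."""
    s = 0.0
    for t, b in enumerate(sq_errors, start=1):
        # Accumulate only the excess over target plus slack; never go below zero.
        s = max(0.0, s + (b - b_target - k))
        if s > h:
            return t
    return None


# On-target errors never accumulate: each step adds -k and is floored at 0.
no_alert = cusum_first_alert([0.22] * 50, b_target=0.22)

# Illustrative only: a small h so that a sustained excess trips quickly.
alert_at = cusum_first_alert([0.30] * 10, b_target=0.22, k=0.005, h=0.1)
```

In the second call each step adds 0.075, so the statistic passes 0.1 on the second prediction. How quickly a real shift is detected depends on the scale chosen for h.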

6.3 Overcorrection Detection

After each correction:

If gap_k reversed sign AND |gap_post| > |gap_pre| * 0.5:
  flag OVERCORRECTION

If OVERCORRECTION flagged twice in 3 consecutive digests: permanently reduce correction cap by 50% for that bias type.

Monitor total_correction_load = SUM(|correction_i|). If > 0.20: prune smallest corrections.
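The overcorrection test and the pruning rule can be sketched together. The names are hypothetical, and the pruning order (drop the smallest magnitude first until the load is at or below the limit) is one reading of "prune smallest corrections".

```python
from typing import Dict


def is_overcorrection(gap_pre: float, gap_post: float) -> bool:
    """True if the gap flipped sign and is still more than half its old size."""
    reversed_sign = gap_pre * gap_post < 0
    return reversed_sign and abs(gap_post) > 0.5 * abs(gap_pre)


def prune_corrections(corrections: Dict[str, float], max_load: float = 0.20) -> Dict[str, float]:
    """Drop the smallest-magnitude corrections until total load <= max_load."""
    # Largest magnitude first, so pop() removes the smallest remaining one.
    kept = sorted(corrections.items(), key=lambda item: abs(item[1]), reverse=True)
    while kept and sum(abs(mag) for _, mag in kept) > max_load:
        kept.pop()
    return dict(kept)


pruned = prune_corrections({"compression": 0.10, "home": -0.08, "injury": 0.05})
```

The example starts with a load of 0.23; removing the 0.05 injury correction brings it to 0.18, under the 0.20 limit.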

6.4 Model Drift Detection

Weekly fingerprint: run 5 synthetic standardized games, track mean probability. If fingerprint shifts > 0.04 in one week without digest update: flag POSSIBLE_MODEL_VERSION_CHANGE.

On provider version update:

Freeze corrections 21 days
Recompute all metrics using post-update data only
Issue fresh digest
Pre-update data archived with PRE_VERSION tag

6.5 Alerting Thresholds Summary

Condition	Level	Action
ECE > 0.075	WARN	Next digest
ECE > 0.100	CRITICAL	Emergency digest
B_delta > 0.020	WARN	Manual review
B_delta > 0.030	CRITICAL	Pause, re-baseline
CONVICTION_INVERSION n>=20	WARN	Include in digest
rho > 0.65	CRITICAL	Correlation digest
D_eff < 1.5	CRITICAL	Human review
OVERCORRECTION x2	SYSTEMIC	Reduce cap
Fingerprint > 0.04	WARN	Version check
Correction load > 0.20	WARN	Prune corrections

7. Long-Term Model Evolution

7.1 Replacement Criteria

An analyst becomes a replacement candidate if ALL of the following hold over 90+ days:

Overall Brier > 0.240
ECE > 0.065 on 2+ sports
ROI 95% CI lower bound < -3%
No improvement trend in 45 days

Meeting all 4 criteria triggers the replacement process. Meeting any single criterion triggers a review.

7.2 A/B Testing

Phase 1, Shadow (30 games): New model predicts in parallel, not used in ensemble.
Phase 2, Partial Weight (30 games): New model at 0.5x weight alongside incumbent.
Phase 3, Promotion: Promote if B_new < B_old - 0.005 AND ECE_new < ECE_old AND p < 0.20.

Fresh 14-day baseline after promotion.
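The Phase 3 promotion rule is a three-way conjunction, sketched here. The function name is hypothetical, and the spec does not name the statistical test behind p, so the p-value is taken as an input.

```python
def should_promote(
    b_new: float,
    b_old: float,
    ece_new: float,
    ece_old: float,
    p_value: float,
    min_brier_gain: float = 0.005,
    alpha: float = 0.20,
) -> bool:
    """Promote only if Brier improves by the margin, ECE improves, and p < alpha."""
    return (
        b_new < b_old - min_brier_gain
        and ece_new < ece_old
        and p_value < alpha
    )


# Brier improves by 0.010 (more than the 0.005 margin), ECE improves, p = 0.10.
promote = should_promote(b_new=0.220, b_old=0.230, ece_new=0.04, ece_old=0.05, p_value=0.10)
```

A Brier improvement of only 0.003 fails the first condition even when the other two pass.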

7.3 Expansion from 3 to 5

Triggers (need all 3):

D_eff < 1.8 consistently over 60 days
All 3 analysts degrading in same market
Volume > 150 predictions/week

Always add in pairs (4th + 5th together) to maintain odd-number ensemble.

7.4 Provider Version Upgrades

Day	Action
0	Log VERSION_EVENT, freeze corrections
0-21	Re-baseline, collect clean data
21	Assess: Brier differs by > 0.010?
21+	If yes: fresh digest. If no: resume existing.

8. Data Retention and Audit Trail

8.1 Core Tables

analyst_picks: Every prediction from every analyst. PERMANENT. Never deleted.
outcomes: Actual game results. PERMANENT.
calibration_snapshots: Computed metrics per period. PERMANENT.
digests: Full text of every calibration digest. PERMANENT.
corrections: Individual bias corrections. PERMANENT (status updated, never deleted).
correlation_log: Weekly independence metrics. PERMANENT.
model_version_events: Provider upgrades. PERMANENT.
audit_log: Append-only event log. PERMANENT.
fingerprint_snapshots: Model drift detection. 2-year rolling retention.

8.2 Immutability Rule

Corrections are metadata, not mutations. The raw stated probability from the analyst is ALWAYS preserved. Corrections are instructions in the system prompt, not transformations of output.

8.3 Time-Travel Reconstruction

Every prediction row records digest_active_id (which corrections were in effect). Combined with the corrections table, you can reconstruct exactly what the system believed at any point in time.

Counterfactual analysis is supported: re-run ensemble calculations with any subset of corrections applied or omitted, because raw probabilities are always stored.

8.4 Audit Requirements

All events written to audit_log (append-only, never updated/deleted):

Predictions inserted
Digests created/approved/deployed/retired
Corrections created/modified/retired
Manual exclusions
Human overrides
Model version events
Alerts triggered
Correction cap changes

9. Operational Parameters — Quick Reference

Parameter	Value
Min n overall Brier	50
Min n per-sport	30
Min n per-market	30
Min n conviction bucket	20
Min n decile bucket	15
Min n bias computation	20
Min n correlation	40
ECE acceptable	0.050
ECE soft trigger	0.075
ECE hard trigger	0.100
Beta well-calibrated	0.90-1.10
Correction significance alpha	0.10
Min effect size	0.03
Max simultaneous corrections	5
Max combined correction per prediction	0.15
Correction load prune threshold	0.20
Correlation monitor threshold	0.40
Correlation critical	0.65
Min D_eff	2.0 (alert 1.5)
Convergence delta_rho threshold	+0.08
CUSUM h	5.0
CUSUM k	0.005
Hard degradation B_delta	0.020
Critical degradation B_delta	0.030
Re-baseline after version update	21 days
A/B shadow phase n	30
A/B partial phase n	30
A/B promotion alpha	0.20
Max analyst ensemble weight	0.40
Baseline no-correction period	4 weeks
Fingerprint drift alert	0.04
Conviction inflation flag	> 40% at c=4-5
Conviction hedging flag	> 60% at c=1-2
Home bias threshold	0.04
Narrative bias threshold	0.03
Fatigue effect threshold	0.04
Injury accuracy drop	0.025 Brier
Emergency staging window	12 hours
Standard staging window	48 hours
Rollback availability	30 days
Fingerprint retention	2 years

10. Document Control

Amendments require:

Human operator approval
Audit log entry with entity_type = 'specification'
New version number (semantic: major.minor)
48-hour review window

When code and document conflict, the document wins until formally amended.

End of CAPS-001 v1.0

Source: ~/edgeclaw/docs/CALIBRATION-SPEC.md