Analyst Pipeline Specification — Prediction Phase

Version: 1.0 Date: 2026-04-06 Status: AUTHORITATIVE — Council Ruling (5/5 unanimous Option C) Applies to: All active desks (MLB, MLB Player Props, and all future desks)

Overview

The analyst stage is where AI models independently review fundamental data and generate predictions. This spec defines the architecture, models, data flow, and rules.

Council ruling: All 5 advisors (Opus, Sonnet, Gemini Pro, Grok 4.20, gpt-oss-120b) unanimously rejected the full council approach (Option B) for daily predictions. They unanimously recommended Option C — independent parallel analysts with mechanical combination in code.

Architecture — Option C

Three models run in parallel. Each receives the same fundamentals-only briefing. Each makes predictions independently. A code function (not AI) combines them. No AI touches the synthesis step.

DATA COLLECTION (scrapers, crons)
       |
BRIEFING GENERATOR (strips all prices/edges, fundamentals only)
       |
  +---------+---------+
  |         |         |
OPUS    GEMINI    GPT-OSS
(native) (API)    (API)
  |         |         |
  +---------+---------+
       |
MECHANICAL COMBINATION (code, weighted average)
       |
STORE (analyst_picks table, tagged UNANIMOUS/MAJORITY/SPLIT)
       |
SETTLEMENT (3 AM daily, Brier scores)

Panel Roster (LOCKED)

Seat	Model	Provider	Route	Notes
Analyst 1	Claude Opus 4.6 Max Reasoning	Anthropic	Claude Code cloud scheduled agent	Bills to subscription, NOT API. Max reasoning mode.
Analyst 2	Gemini 3.1 Pro Preview	Google	OpenRouter (`google/gemini-3.1-pro-preview`)	API call from analyst-runner.ts
Analyst 3	gpt-oss-120b	OpenAI	OpenRouter (`openai/gpt-oss-120b`)	API call from analyst-runner.ts

Rules:

Same 3 models used across ALL desks. No per-desk swapping.
No model may be substituted without a full council ruling.
Opus runs via Claude Code cloud scheduled tasks, not the Anthropic API.
All 3 models receive identical briefings. No per-model customization.
All go COLD every run — no memory, no context from prior runs.

Briefing — Fundamentals Only (Price Blackout)

Analysts NEVER see:

Kalshi prices or contract data
FanDuel / DraftKings / any sportsbook lines
Edge scanner output
Sharp money indicators
Steam alerts
Model anchor probabilities

Analysts DO see:

Team/player stats (season, rolling averages, EWMA)
Starting pitcher matchup (K rate, WHIP, handedness, game logs)
Batter splits (vs LHP/RHP, platoon data)
Park factors (HR-friendly, pitcher-friendly)
Weather (wind, temperature)
Lineups (confirmed starters)
Bullpen status (fatigue, availability)
Rest days, travel, schedule context
Injuries and transactions
Historical matchup data (H2H if sufficient sample)

Implementation: Build per-desk briefing generators (like NBA's analyst-briefing.ts) that pull from desk-specific data tables. Explicitly strip any price/edge data before sending to models.

Analyst Output Format

Each model returns structured JSON:

{
  "game_id": "...",
  "sport": "mlb",
  "market_type": "spread|total|moneyline|pitcher_strikeouts|batter_hits|...",
  "prediction": {
    "side": "over|under|home|away",
    "line": 5.5,
    "probability": 0.62
  },
  "conviction": 4,
  "reasoning": "One sentence explaining the key factor driving this prediction"
}

For player props, the market_type specifies the prop (e.g., pitcher_strikeouts, batter_hits, batter_total_bases, batter_home_runs, batter_hits_runs_rbis).

Mechanical Combination

A code function (TypeScript, not AI) combines the 3 predictions:

Weights:

Phase 1 (first 50 settled predictions per market type): Equal weights — 33/33/33.
Phase 2 (50+ settled): Brier-weighted. Better Brier score = higher weight. Recalculated monthly from the model_weights table.

Formula:

final_prediction = (w1 * pred1 + w2 * pred2 + w3 * pred3) / (w1 + w2 + w3)
final_conviction = average of individual convictions (no rounding)

No AI touches this step. It is arithmetic.

Consensus Tagging

After all 3 models respond, tag the prediction set:

Tag	Definition	What it means
`UNANIMOUS`	All 3 models predict the same direction	Strong signal
`MAJORITY`	2 models agree, 1 dissents	Moderate signal, dissenter logged
`SPLIT`	All 3 disagree or no clear majority	Weak/no signal

Stored fields:

consensus_tag: UNANIMOUS / MAJORITY / SPLIT
dissenter_model: Name of dissenting model (null if UNANIMOUS)
dissenter_prediction: What the dissenter predicted

Data Collection Phase (CURRENT MODE)

Log everything. Act on nothing.

We are NOT betting on these predictions yet. Every prediction from every model gets stored regardless of consensus or conviction. The purpose is to accumulate data so the system teaches itself what's actionable.

After months of data, the patterns will show:

What accuracy looks like at each consensus level
Which model to trust when they disagree
Which conviction levels are reliable per model per market type
When to act vs when to skip

The rules for what counts as actionable will be derived from the data, not guessed upfront.

No Peer Review on Daily Predictions

The full council process (advisory → peer review → synthesis) is NEVER used for daily prediction runs. This is a hard rule.

Why (council unanimous):

Peer review contaminates individual model signals
You cannot calibrate a model whose output was influenced by seeing other models' predictions
LLM chairmen are bad at mathematical synthesis — they average or pick the most persuasive argument
Clean individual data is non-negotiable for long-term calibration

Monthly Calibration Council

On the 1st of each month, the full 5-model council reviews the last 30 days:

Input (generated automatically):

Per-model Brier scores by market type
Conviction calibration curves (conviction 5 hit rate, conviction 4 hit rate, etc.)
Consensus accuracy (UNANIMOUS vs MAJORITY vs SPLIT hit rates)
Systematic biases identified ("Opus overestimates K props by 0.8 strikeouts")
Dissenter accuracy ("When Gemini dissents on totals, Gemini is right 41% of the time")

Output:

Prompt corrections for each model's next month of predictions
Weight adjustments for the combination function
Recommendations on which consensus levels to act on (once data collection phase ends)

The monthly council ruling gets saved to the dashboard.

XGBoost Meta-Learner Tracker

A simple counter monitors settled predictions per model per market type. When thresholds are crossed, it sends a Telegram alert.

Thresholds:

Game-level (spreads, totals, ML): 500 settled predictions per market type
Player props (per prop type): 200 settled predictions per prop type per model

Implementation: Daily cron queries analyst_picks WHERE actual_result IS NOT NULL, groups by model + market_type, counts. Compares against thresholds. First time a threshold is crossed, sends a one-time Telegram alert.

When triggered: Run a council to evaluate whether XGBoost meta-learner should replace simple weighted average. Do not auto-deploy — council reviews first.

Degraded Mode

If a model's API fails after 3 retries:

Run with the other 2 models
Log models_used field on the prediction set so degraded runs are identifiable
Weights auto-adjust (renormalize remaining 2 models' weights)
Never substitute a different model without a council ruling
If 2 models fail, run with 1 and tag as DEGRADED_SINGLE — data only, zero confidence

Opus Scheduling (Cloud Agent)

Opus 4.6 Max Reasoning runs as a Claude Code cloud scheduled agent:

Fires on the analyst schedule (desk-specific timing)
Reads the briefing from a file or database
Writes its prediction as structured JSON to a known location
The analyst runner checks for Opus's output, waits up to X minutes
If Opus doesn't respond in time, run with Gemini + gpt-oss only (degraded mode)
Bills to Claude subscription, NOT API

MLB-Specific Schedule

Time (ET)	What runs
11:00 AM	Data collection batch (stats, lineups, weather, bullpen, matchups)
12:00 PM	Briefing generator builds today's briefings from collected data
12:15 PM	Analyst pipeline fires — all 3 models run on all briefings
3:00 AM	Settlement — resolves outcomes, calculates Brier scores
1st of month	Monthly calibration council

Schedule may be adjusted based on lineup confirmation timing and game start times.

Database Schema

analyst_picks (per-prediction row)

id, date, sport, desk, game_id, market_type, player_name,
opus_prediction, opus_line, opus_probability, opus_conviction,
gemini_prediction, gemini_line, gemini_probability, gemini_conviction,
gptoss_prediction, gptoss_line, gptoss_probability, gptoss_conviction,
combined_prediction, combined_line, combined_probability, combined_conviction,
consensus_tag, dissenter_model, dissenter_prediction,
models_used, actual_result, actual_value,
opus_brier, gemini_brier, gptoss_brier, combined_brier,
settled_at, created_at

model_weights (per model per market type)

id, model_name, sport, market_type, weight, brier_score,
sample_size, last_updated

Peer Review Blind Spots (Documented for Future Reference)

The council's peer review phase identified these risks. They are NOT solved in v1 but must be monitored:

Correlated model failure on public narratives — When all models read the same injury news or media hype, their agreement might be fake consensus. Monitor for high-consensus predictions on heavily-reported events.
API reliability and silent model drift — Models may change behavior after provider updates without notice. Track prediction distribution shifts monthly.
Model monoculture — Independence requires different foundation models. Current roster (Claude/Google/OpenAI) satisfies this. Do not add a second model from the same provider.
A/B validation — Before acting on predictions (post data-collection phase), run a paper-trading validation period. Do not go live without at least 30 days of paper results.

Council Record

Date: 2026-04-06
Advisors: Opus 4.6, Sonnet 4.6, Gemini 3.1 Pro, Grok 4.20, gpt-oss-120b
Verdict: Unanimous Option C
Peer review strongest: Advisor D (Gemini) — identified LLM synthesis as anti-pattern for math
Critical blind spots found: Model monoculture, correlated narrative failure, API reliability, no A/B plan
Full council files: /home/ubuntu/edgeclaw/data/councils/analyst-pipeline/

Source: ~/edgeclaw/docs/analyst-pipeline-spec.md