Oracle Scores

How we measure forecast model accuracy — comparing predicted high/low temperatures against official ASOS observations to rank every model, every day.

What Is an Oracle Score?

An Oracle Score measures how accurate a forecast model's predicted high and low temperatures were compared to what actually happened. Every day, each model produces one or more forecast runs targeting a future date. Once that date passes, we compare every run's prediction against the official observed temperature.

The result is a per-model accuracy ranking — lower scores mean more accurate forecasts. Over a multi-day window (e.g., 7 days), you can see which models consistently perform best at a given station.

Ground Truth: ASOS Observations

All scores are measured against observed daily high and low temperatures from ASOS (Automated Surface Observing System) stations. These are the same stations that feed NWS official records. We pull the actual high and low in Fahrenheit for the station's local day.

If ASOS data is incomplete for a given day (missing high or low), that day is skipped entirely — no partial scores are recorded.
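In code, that completeness gate looks roughly like the sketch below. The record shape and field names are illustrative, not the actual schema.

```python
def asos_day_is_complete(obs: dict) -> bool:
    """Return True only if both the observed high and low are present.

    `obs` is a hypothetical record like {"high_f": 81.0, "low_f": 62.0}.
    Days missing either value are skipped; no partial score is recorded.
    """
    return obs.get("high_f") is not None and obs.get("low_f") is not None
```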

Three Scoring Modes

The Oracle Score widget offers three modes that answer different questions:

Overall

How accurate is each model across all of its forecast runs?

For a given scored date, we look at every forecast run that produced predictions for that day — regardless of when the run was created. A GFS run from 5 days out and a GFS run from 1 day out both contribute equally.

This gives you the model's overall reliability. A model that scores well here is consistently accurate at every forecast horizon, not just close-in.

Day-Ahead

How accurate was the prior day's forecast for the scored day?

For a given scored date, we only consider forecast runs that were fetched during the local day before the scored date. This is the forecast you would have been looking at when making a decision.

This is the most relevant mode for betting and trading. It answers: "If I trust this model's forecast the day before, how often is it right?"

Day-Of

How accurate were forecasts fetched during the scored day?

For a given scored date, we only consider forecast runs that were fetched during that same local day. The score is still settled after the full day of observations is available.

This mode is useful for same-day decisions and for comparing how models perform once short-range updates are available.
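All three modes differ only in which forecast runs qualify for a scored date. Here is a rough sketch of that selection, assuming each run records the station-local date it was fetched on (field names are illustrative, not the actual pipeline):

```python
from datetime import date, timedelta

def runs_for_mode(runs: list[dict], scored_date: date, mode: str) -> list[dict]:
    """Select the forecast runs that qualify for a scored date.

    `runs` is assumed to already contain only runs that produced predictions
    for scored_date, each shaped like {"fetched_on": date, ...}, where
    fetched_on is the station-local date the run was fetched.
    """
    if mode == "overall":
        # Every run that targeted the scored date, at any forecast horizon.
        return runs
    if mode == "day_ahead":
        # Only runs fetched during the local day before the scored date.
        return [r for r in runs if r["fetched_on"] == scored_date - timedelta(days=1)]
    if mode == "day_of":
        # Only runs fetched during the scored day itself.
        return [r for r in runs if r["fetched_on"] == scored_date]
    raise ValueError(f"unknown mode: {mode}")
```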

Which mode should I use?

Use Day-Ahead to decide which model to trust for your next bet. Use Day-Of for same-day checks once short-range runs are available. Use Overall to evaluate which models are fundamentally the best forecasters across all time horizons.

How Scores Are Calculated

Each model's daily score is computed in three steps:

  1. Extract predicted high/low per run. For each forecast run, we take the maximum and minimum of all hourly temperature predictions (temperature_2m_f) that fall within the station's local day window.
  2. Compare against observed. For each run, compute the signed difference: predicted - actual. Its absolute value is the run's error; the signed value is the run's bias.
  3. Average across runs. All qualifying runs for that model are averaged to produce the day's MAE (Mean Absolute Error) and bias.

When you select a multi-day range (e.g., 7 days), the displayed values are the averages of each scored day's MAE and bias across the window.
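Putting those steps together, a rough sketch of the per-day calculation and the multi-day averaging (data shapes are illustrative, not the actual pipeline):

```python
def score_day(runs: list[dict], observed_high: float, observed_low: float) -> dict:
    """Compute one model's MAE and bias for a single scored day.

    Each run is assumed to carry {"hourly_temps_f": [...]}, the temperature_2m_f
    predictions falling inside the station's local day window. Assumes at least
    one qualifying run; days with incomplete observations are skipped upstream.
    """
    high_errors, low_errors = [], []
    for run in runs:
        predicted_high = max(run["hourly_temps_f"])          # step 1: per-run high
        predicted_low = min(run["hourly_temps_f"])           #         and low
        high_errors.append(predicted_high - observed_high)   # step 2: signed difference
        low_errors.append(predicted_low - observed_low)

    n = len(high_errors)
    return {                                                  # step 3: average across runs
        "high_mae": sum(abs(e) for e in high_errors) / n,
        "high_bias": sum(high_errors) / n,
        "low_mae": sum(abs(e) for e in low_errors) / n,
        "low_bias": sum(low_errors) / n,
    }

def score_window(daily_scores: list[dict]) -> dict:
    """Average each scored day's MAE and bias across a multi-day window."""
    keys = ("high_mae", "high_bias", "low_mae", "low_bias")
    return {k: sum(d[k] for d in daily_scores) / len(daily_scores) for k in keys}
```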

Reading the Numbers

Bias (High / Low) — Primary

The average signed error in degrees Fahrenheit. Positive (+) means the model runs warm (over-predicts). Negative (-) means it runs cold (under-predicts).

This is the most actionable number. If GFS shows a +2° high bias and forecasts 82°F tomorrow, adjust down — expect closer to 80°F. A bot can apply this correction automatically.

Color coding: green within ±1°, amber within ±2.5°, red beyond ±2.5°.
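A rough sketch of that correction and the color bands (function names are illustrative):

```python
def bias_corrected(forecast_f: float, bias_f: float) -> float:
    """Subtract the model's bias from its raw forecast.

    Example: a +2.0 degree high bias and an 82 F forecast gives ~80 F.
    """
    return forecast_f - bias_f

def bias_color(bias_f: float) -> str:
    """Map a bias value to the widget's color coding."""
    magnitude = abs(bias_f)
    if magnitude <= 1.0:
        return "green"
    if magnitude <= 2.5:
        return "amber"
    return "red"
```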

MAE (High / Low) — Reliability

Mean Absolute Error — the average magnitude of the error regardless of direction. Lower is better. In the widget, you can rank by High MAE or Low MAE depending on which market you are trading.

MAE tells you how reliable a model is. A model with 0° bias but 5° MAE swings wildly in both directions — you can't correct for it. A model with +2° bias and 2° MAE is consistently warm — very reliable once you adjust.

Bias vs MAE for Bots

Use bias to adjust forecasts (subtract the bias from the model's prediction). Use High MAE when trading highs and Low MAE when trading lows. Both are available in the API response for programmatic access.
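For a bot, that boils down to something like the following sketch, reusing the illustrative score shape from above (the real API response fields may differ; see the API Reference):

```python
def pick_model_for_highs(scores_by_model: dict[str, dict]) -> str:
    """Rank models by High MAE (lower is better) and return the most reliable."""
    return min(scores_by_model, key=lambda m: scores_by_model[m]["high_mae"])

def adjusted_high_forecast(raw_high_f: float, scores_by_model: dict[str, dict], model: str) -> float:
    """Apply the chosen model's high bias to its raw high forecast."""
    return raw_high_f - scores_by_model[model]["high_bias"]
```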

Scoring Timeline

Oracle Scores are computed daily at 18:00 UTC for the previous day. This timing ensures ASOS observations have been finalized for the scored date.

Scores are cached for 6 hours per station. When new scores are computed, the cache is automatically invalidated.

Historical scores use your plan's lookback (e.g. 7 / 30 / 90 calendar days); Clanker can also request days=all for the full retained series. See your plan tier. Free users can view 1 day of scores for public models.

API Access

Oracle Scores are available via the REST API for programmatic access. See the API Reference for endpoint details, parameters, and response format.