
Calibration

What it is. Per-organization confidence calibration that learns from human review outcomes. Each customer's calibration map is fit on their own labeled decision history.

Why calibration matters

A model says "I'm 90% confident." Out of 100 such decisions, how many were actually correct? Without calibration, the true answer might be 60. With calibration, the number you see is the number you get: a probability you can trust as a probability.

Adjudon does not trust the model's self-report. We measure empirically.

Three numbers we report

| Metric | What it is | Range | Interpretation |
| --- | --- | --- | --- |
| Brier score | Mean squared error between predicted probability and binary outcome | 0–1 | Lower is better. Excellent < 0.10. |
| Expected Calibration Error (ECE) | Average gap between predicted confidence and empirical accuracy across 10 bins | 0–1 | Lower is better. Well-calibrated < 0.03. |
| Conformal coverage | Empirical fraction of outcomes within Vovk's prediction set at miscoverage α | 0–1 | Should match (1 − α). |
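
As a rough illustration of the three definitions above, here is a minimal Python sketch. It is not Adjudon's implementation: the function names are hypothetical, the ECE uses equal-width bins weighted by bin size (the convention in Guo et al., assumed here), and prediction sets are assumed to arrive as plain Python sets of labels.

```python
from typing import Sequence, Set


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between predicted probability and 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def expected_calibration_error(
    probs: Sequence[float], outcomes: Sequence[int], n_bins: int = 10
) -> float:
    """Bin-size-weighted average gap between confidence and accuracy."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        confidence = sum(p for p, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += (len(members) / total) * abs(confidence - accuracy)
    return ece


def conformal_coverage(prediction_sets: Sequence[Set[int]], outcomes: Sequence[int]) -> float:
    """Fraction of true outcomes that fall inside their prediction set."""
    hits = sum(1 for s, y in zip(prediction_sets, outcomes) if y in s)
    return hits / len(outcomes)
```

For example, with probabilities [0.9, 0.9, 0.1, 0.1] and outcomes [1, 0, 0, 0], the Brier score is about 0.21 and the ECE about 0.25, both well outside the "excellent" ranges in the table.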

You see these three numbers per agent in your Calibration Dashboard. They update nightly as new review outcomes flow in.

The reliability diagram

The reliability diagram is the canonical visualization in the calibration literature [Niculescu-Mizil & Caruana, ICML 2005]. Each bin shows the mean predicted confidence against the empirical accuracy of the decisions in that bin. A perfectly calibrated AI system has all bins on the diagonal.

accuracy
100% |                         ●  ← perfectly calibrated bin (on the diagonal)
 80% |                    ●
 60% |                 ●  ← bin below the diagonal: model overconfident here
 40% |          ●
 20% |     ●
  0% |__________________________
      0%         50%         100%
      mean predicted confidence

The Calibration Dashboard renders this diagram with 95% Wilson confidence intervals on each bin's accuracy.
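
For readers who want to reproduce the per-bin intervals, the following is a minimal sketch of the Wilson score interval [Wilson, 1927]. The function name and the choice of z ≈ 1.96 for a 95% interval are assumptions for illustration; how the Dashboard computes its intervals internally is not specified here.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.959964) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n == 0:
        return (0.0, 1.0)  # no data in the bin: the interval is uninformative
    p_hat = successes / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, center - half_width), min(1.0, center + half_width))


# A bin with 10 decisions, 8 of them correct.
low, high = wilson_interval(successes=8, n=10)
print(f"accuracy 0.80, 95% interval [{low:.3f}, {high:.3f}]")
```

Unlike the simple normal approximation, the Wilson interval stays inside [0, 1] and behaves sensibly for small bins, which is why it suits sparsely populated confidence bins.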

How it works

  1. Trace ingestion: Your agent's decision lands at POST /api/traces. The Confidence Engine computes a 12-signal score.
  2. Calibration map application: The score passes through your organization's per-(agent, decisionType) isotonic-regression calibration map (Phase 2 onward). The output is score_calibrated.
  3. Review outcome ingestion: When a human reviewer marks a decision correct or incorrect via POST /api/cpi/ingest (or the Review Queue UI), the outcome enters the calibration corpus.
  4. Nightly refit: At 03:30 UTC, the calibration map refits on every slice with ≥100 new outcomes since the last refit (configurable). The refit uses the Pool Adjacent Violators algorithm, which is fast, exact, and non-parametric.
  5. Hierarchical shrinkage: At apply time, the per-org map blends with the cross-org global prior:
    final = w_org × isotonic(rawScore) + (1 − w_org) × prior_mean
    w_org = n_org / (n_org + 500)
    At 50 outcomes, w_org = 50/550 ≈ 0.09, so the system pulls 91% from the prior; at 5000 outcomes, w_org = 5000/5500 ≈ 0.91, so 91% comes from your own data.
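
Steps 4 and 5 can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the production code: the names pav_fit, pav_apply, shrink_weight, and calibrated_score are hypothetical, the fitted map is applied as a step function (the actual interpolation rule is not documented here), an empty map falls back to the raw score as an identity calibration, and prior_mean is treated as a single number.

```python
from bisect import bisect_right
from typing import List, Sequence, Tuple


def pav_fit(scores: Sequence[float], outcomes: Sequence[int]) -> Tuple[List[float], List[float]]:
    """Pool Adjacent Violators: fit a non-decreasing map from score to accuracy."""
    pairs = sorted(zip(scores, outcomes))
    blocks: List[List[float]] = []  # each block is [sum of outcomes, count]
    for _, y in pairs:
        blocks.append([float(y), 1.0])
        # Merge backwards while the previous block's mean exceeds the last one's.
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    xs = [x for x, _ in pairs]
    ys: List[float] = []
    for total, count in blocks:
        ys.extend([total / count] * int(count))
    return xs, ys


def pav_apply(xs: Sequence[float], ys: Sequence[float], raw: float) -> float:
    """Step-function lookup; with no fitted points, return the raw score."""
    if not xs:
        return raw  # assumption: identity calibration when no map exists
    i = bisect_right(xs, raw) - 1
    return ys[max(i, 0)]


def shrink_weight(n_org: int, k: int = 500) -> float:
    """w_org = n_org / (n_org + 500), as in step 5."""
    return n_org / (n_org + k)


def calibrated_score(raw: float, xs, ys, n_org: int, prior_mean: float) -> float:
    """final = w_org * isotonic(raw) + (1 - w_org) * prior_mean."""
    w = shrink_weight(n_org)
    return w * pav_apply(xs, ys, raw) + (1 - w) * prior_mean


xs, ys = pav_fit([0.1, 0.2, 0.3, 0.4], [0, 1, 0, 1])
print(ys)                  # the middle two points are pooled to their mean
print(shrink_weight(50))   # weight on the org's own map at 50 outcomes
```

The constant 500 acts like a pseudo-count: an organization needs 500 outcomes before its own map and the prior carry equal weight.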

Cold start

For brand-new organizations with zero review outcomes, the system uses the cross-org global prior + identity calibration. The Calibration Dashboard shows "provisional" status until your corpus exceeds 50 outcomes.

Drift detection

Three triggers run nightly at 02:30 UTC:

  1. Kolmogorov-Smirnov test on score distribution — D > 0.10 fires
  2. ECE jump month-over-month ≥ 0.03
  3. Brier-score regression ≥ 15% relative

On any trigger: a CalibrationDriftAlert is persisted, the calibration.drift_detected webhook fires, and the Calibration Dashboard shows a banner. To clear the alert, acknowledge it or wait for the next clean refit.
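
The three trigger conditions can be written down as a short sketch. Several details are assumptions made for illustration: the KS test is taken to be a two-sample test between a reference window and the current window of scores, the previous and current ECE and Brier values are passed in precomputed, and the names ks_statistic and drift_triggers are hypothetical.

```python
from typing import List, Sequence


def ks_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Two-sample Kolmogorov-Smirnov D: largest gap between empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both samples past x so ties are handled together.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def drift_triggers(
    ref_scores: Sequence[float],
    cur_scores: Sequence[float],
    ece_prev: float,
    ece_cur: float,
    brier_prev: float,
    brier_cur: float,
) -> List[str]:
    """Return the names of the triggers that fire, using the thresholds above."""
    fired = []
    if ks_statistic(ref_scores, cur_scores) > 0.10:
        fired.append("ks")
    if ece_cur - ece_prev >= 0.03:
        fired.append("ece")
    if brier_prev > 0 and (brier_cur - brier_prev) / brier_prev >= 0.15:
        fired.append("brier")
    return fired


reference = [0.1, 0.2, 0.3]
current = [0.4, 0.5, 0.6]
print(drift_triggers(reference, current, 0.02, 0.06, 0.10, 0.12))
```

Note that the Brier trigger is relative: a move from 0.10 to 0.12 is a 20% regression and fires, while the same absolute change from 0.30 to 0.32 would not.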

Methodology references

  • Guo, C. et al. (2017). On Calibration of Modern Neural Networks. ICML.
  • Niculescu-Mizil, A. & Caruana, R. (2005). Predicting Good Probabilities with Supervised Learning. ICML.
  • Vovk, V. et al. (2005). Algorithmic Learning in a Random World. Springer.
  • Brier, G. W. (1950). Verification of forecasts expressed in terms of probability. Monthly Weather Review.
  • Wilson, E. B. (1927). Probable Inference, the Law of Succession, and Statistical Inference. Journal of the American Statistical Association.

For the full methodology defense, see the Adjudon white paper (in preparation, target FTML 2026).

Plan tier access

The full Calibration Dashboard, reliability diagram, Brier + ECE metrics, conformal coverage, drift detection, Article 13 IFU generator, and bias-stratified reports are gated to Scale plan and above (auditLog feature flag).

Plan-tier differentiation (Governance per-org maps · Enterprise per-agent sub-maps) is part of the methodology roadmap. Until that differentiation is enforced in code, all Scale+ customers receive the same per-org calibration treatment.

Adjudon does not currently offer a financially backed Brier-score guarantee. Such a guarantee would require third-party reinsurance underwriting, which is not in place. We document this honestly so procurement does not arrive expecting a refund mechanism that does not exist.

The conformal coverage badge SVG endpoint is publicly cacheable (no authentication needed on the URL), so it can be embedded as an <img> on your compliance or trust page.