Calibration
What it is. Per-organization confidence calibration that learns from human review outcomes. Each customer's calibration map is fit on their own labeled decision history.
Why calibration matters
A model says "I'm 90% confident." Out of 100 such decisions, how many were actually correct? Without calibration, that 90 might be 60. With calibration, the number you see is the number you get — a probability you can trust as a probability.
Adjudon does not trust the model's self-report. We measure empirically.
Three numbers we report
| Metric | What it is | Range | Lower / higher = better |
|---|---|---|---|
| Brier score | Mean squared error between predicted probability and binary outcome | 0–1 | Lower better. Excellent < 0.10. |
| Expected Calibration Error (ECE) | Average gap between predicted confidence and empirical accuracy across 10 bins | 0–1 | Lower better. Well-calibrated < 0.03. |
| Conformal coverage | Empirical fraction of outcomes within Vovk's prediction set at miscoverage α | 0–1 | Should match (1 − α). |
You see these three numbers per agent in your Calibration Dashboard. They update nightly as new review outcomes flow in.
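As a rough illustration of what the three metrics measure, here is a minimal sketch in plain Python. It is not Adjudon's implementation. The function names are hypothetical, `probs` is assumed to hold the predicted probability that each decision is correct, `outcomes` holds the 0/1 review result, and the ECE is assumed to use equal-width bins weighted by bin size.

```python
from typing import Sequence, Set


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between predicted probability and 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def expected_calibration_error(
    probs: Sequence[float], outcomes: Sequence[int], n_bins: int = 10
) -> float:
    """Bin-size-weighted mean of |mean confidence - accuracy| over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))
    total = len(probs)
    ece = 0.0
    for b in bins:
        if not b:
            continue  # empty bins contribute nothing
        mean_conf = sum(p for p, _ in b) / len(b)
        accuracy = sum(y for _, y in b) / len(b)
        ece += (len(b) / total) * abs(mean_conf - accuracy)
    return ece


def empirical_coverage(
    prediction_sets: Sequence[Set[int]], outcomes: Sequence[int]
) -> float:
    """Fraction of outcomes that fall inside their prediction set."""
    hits = sum(y in s for s, y in zip(prediction_sets, outcomes))
    return hits / len(outcomes)


# The "90 might be 60" case: ten decisions at 0.9 confidence, six correct.
probs = [0.9] * 10
outcomes = [1] * 6 + [0] * 4
print(brier_score(probs, outcomes))                 # about 0.33
print(expected_calibration_error(probs, outcomes))  # about 0.30
```

With a miscoverage target of α = 0.10, the coverage value returned by `empirical_coverage` would be expected to sit near 0.90.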
The reliability diagram
The reliability diagram is the canonical visualization in the calibration literature [Niculescu-Mizil & Caruana, ICML 2005]. Each bin plots the mean predicted confidence against the empirical accuracy of the decisions in that bin. A perfectly calibrated AI system has every bin on the diagonal.
[Figure: reliability diagram. x-axis: mean predicted confidence (0% to 100%); y-axis: accuracy (0% to 100%). A bin on the diagonal is perfectly calibrated; a bin below the diagonal means the model is overconfident there.]
The Calibration Dashboard renders this diagram with 95% Wilson confidence intervals on each bin's accuracy.
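For readers who want to reproduce the error bars, the Wilson score interval for a single bin can be sketched as follows. The function name and signature are illustrative, not part of any Adjudon API; `successes` is the number of correct decisions in the bin and `n` is the bin size.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.959964) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives 95%)."""
    if n == 0:
        return (0.0, 1.0)  # no data in the bin: the interval is uninformative
    p_hat = successes / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, center - half_width), min(1.0, center + half_width))


# A bin with 8 correct decisions out of 10:
low, high = wilson_interval(8, 10)
print(round(low, 3), round(high, 3))  # roughly 0.49 and 0.943
```

Unlike the simple normal approximation, the Wilson interval stays inside [0, 1] and behaves sensibly for small bins, which is why it suits sparsely populated bins.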
How it works
- Trace ingestion: Your agent's decision lands at POST /api/traces. The Confidence Engine computes a 12-signal score.
- Calibration map apply: The score passes through your organization's per-(agent, decisionType) isotonic-regression calibration map (Phase 2 onward). The output is score_calibrated.
- Review outcome ingestion: When a human reviewer marks a decision correct/incorrect via POST /api/cpi/ingest (or the Review Queue UI), the outcome enters the calibration corpus.
- Nightly refit: At 03:30 UTC the calibration map refits on all slices with ≥100 new outcomes since the last refit (configurable). The refit uses the Pool Adjacent Violators algorithm, which is fast, exact, and non-parametric.
- Hierarchical shrinkage: At apply time, the per-org map blends with the cross-org global prior:
  final = w_org × isotonic(rawScore) + (1 − w_org) × prior_mean
  w_org = n_org / (n_org + 500)
  At 50 outcomes the system pulls 91% from the prior; at 5000 outcomes 91% from your own data.
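The refit and shrinkage steps above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the production code: `fit_pav`, `apply_map`, `org_weight`, and `shrink` are hypothetical names, the fitted map is applied as a step function (a real system might interpolate instead), and an empty map falls back to the identity.

```python
from bisect import bisect_left
from itertools import groupby
from typing import List, Sequence, Tuple


def fit_pav(scores: Sequence[float], outcomes: Sequence[int]) -> List[Tuple[float, float]]:
    """Pool Adjacent Violators: returns (upper score, calibrated value) blocks
    whose calibrated values are non-decreasing in the score."""
    blocks = []  # each block is [sum of outcomes, count, largest score in block]
    pairs = sorted(zip(scores, outcomes))
    for score, group in groupby(pairs, key=lambda t: t[0]):
        ys = [y for _, y in group]  # pool tied scores before comparing blocks
        blocks.append([float(sum(ys)), len(ys), score])
        # Merge backwards while the previous block's mean exceeds the current one.
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            total, count, upper = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = upper
    return [(upper, total / count) for total, count, upper in blocks]


def apply_map(blocks: List[Tuple[float, float]], raw_score: float) -> float:
    """Look up the calibrated value for a raw score (step function)."""
    if not blocks:
        return raw_score  # assumption: no fitted map means identity calibration
    uppers = [upper for upper, _ in blocks]
    idx = min(bisect_left(uppers, raw_score), len(blocks) - 1)
    return blocks[idx][1]


def org_weight(n_org: int, k: int = 500) -> float:
    """w_org = n_org / (n_org + 500), as in the formula above."""
    return n_org / (n_org + k)


def shrink(calibrated: float, prior_mean: float, n_org: int) -> float:
    """Blend the per-org calibrated value with the cross-org prior mean."""
    w = org_weight(n_org)
    return w * calibrated + (1 - w) * prior_mean


cal_map = fit_pav([0.1, 0.2, 0.3, 0.4, 0.5, 0.6], [0, 1, 0, 1, 1, 1])
print(apply_map(cal_map, 0.25))               # 0.5: the 0.2 and 0.3 points were pooled
print(round(org_weight(50), 2), round(org_weight(5000), 2))  # 0.09 and 0.91
```

With zero outcomes, `org_weight` is 0 and `shrink` returns the prior mean unchanged, which matches the intent of falling back to the global prior for new organizations.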
Cold start
For brand-new organizations with zero review outcomes, the system uses the cross-org global prior + identity calibration. The Calibration Dashboard shows "provisional" status until your corpus exceeds 50 outcomes.
Drift detection
Three triggers run nightly at 02:30 UTC:
- Kolmogorov-Smirnov test on the score distribution: fires when D > 0.10
- ECE jump month-over-month: fires when the increase is ≥ 0.03
- Brier-score regression: fires when the relative increase is ≥ 15%
On any trigger, a CalibrationDriftAlert is persisted, the calibration.drift_detected webhook fires, and the Calibration Dashboard shows a banner. To clear the banner, acknowledge the alert or wait for the next clean refit.
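The three checks can be approximated with the sketch below. Several things here are assumptions rather than documented behavior: the Kolmogorov-Smirnov check is treated as a two-sample comparison between a baseline window and the current window of scores, the trigger labels (`"ks"`, `"ece_jump"`, `"brier_regression"`) are invented for illustration, and the function names are hypothetical.

```python
from bisect import bisect_right
from typing import List, Sequence


def ks_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Two-sample KS statistic D: the largest gap between the two empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    d = 0.0
    # The maximum gap between two step functions is reached at an observed value.
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        d = max(d, abs(cdf_a - cdf_b))
    return d


def drift_triggers(
    baseline_scores: Sequence[float],
    current_scores: Sequence[float],
    ece_prev: float,
    ece_now: float,
    brier_prev: float,
    brier_now: float,
) -> List[str]:
    """Return the labels of the triggers that fire, using the thresholds above."""
    fired = []
    if ks_statistic(baseline_scores, current_scores) > 0.10:
        fired.append("ks")
    if ece_now - ece_prev >= 0.03:
        fired.append("ece_jump")
    if brier_prev > 0 and (brier_now - brier_prev) / brier_prev >= 0.15:
        fired.append("brier_regression")
    return fired


baseline = [0.1, 0.2, 0.3, 0.4]
shifted = [0.6, 0.7, 0.8, 0.9]
print(drift_triggers(baseline, shifted, 0.02, 0.06, 0.10, 0.12))
# ['ks', 'ece_jump', 'brier_regression']
```

Only increases count toward the ECE and Brier triggers in this sketch, since both metrics are lower-is-better and an improvement should not raise an alert.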
Methodology references
- Guo, C. et al. (2017). On Calibration of Modern Neural Networks. ICML.
- Niculescu-Mizil, A. & Caruana, R. (2005). Predicting Good Probabilities with Supervised Learning. ICML.
- Vovk, V. et al. (2005). Algorithmic Learning in a Random World. Springer.
- Brier, G. W. (1950). Verification of forecasts expressed in terms of probability. Monthly Weather Review.
- Wilson, E. B. (1927). Probable Inference, the Law of Succession, and Statistical Inference. Journal of the American Statistical Association.
For the full methodology defense, see the Adjudon white paper (in preparation, target FTML 2026).
Plan tier access
The full Calibration Dashboard, reliability diagram, Brier + ECE metrics, conformal coverage, drift detection, Article 13 IFU generator, and bias-stratified reports are gated to the Scale plan and above (auditLog feature flag).
Plan-tier differentiation (Governance per-org maps · Enterprise per-agent sub-maps) is part of the methodology roadmap. Until that differentiation is enforced in code, all Scale+ customers receive the same per-org calibration treatment.
Adjudon does not currently offer a financially-backed Brier-score guarantee. Such a guarantee would require third-party reinsurance underwriting which is not in place — we document this honestly so procurement does not arrive expecting a refund mechanism that does not exist.
The conformal coverage badge SVG endpoint is publicly cacheable (no
authentication needed on the URL) so it can be embedded as <img> on
your compliance/trust page.
Related
- Reliability Diagrams — how to read the visualization the Calibration Dashboard renders
- Conformal Coverage — the Vovk-theorem-backed badge embedded on your compliance page
- Traces and Confidence — how the score this page calibrates is produced upstream