Reliability Diagrams
What it is. A reliability diagram is the canonical visualization of calibration: it plots empirical accuracy against predicted confidence, bin by bin. A perfectly calibrated AI system has every bin on the diagonal.
How to read one
Each point in the diagram is one bin (typically 10 bins from 0% to 100% predicted confidence). Within a bin:
- X-axis: mean predicted confidence of decisions in this bin
- Y-axis: empirical accuracy (fraction of decisions in the bin that turned out correct upon human review)
A perfectly calibrated model has all bins on the diagonal y = x. If your bin at "predicted 80%" has accuracy 60%, your model is overconfident in that range — it claims more confidence than is warranted.
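The binning described above can be sketched in a few lines of Python. This is a minimal illustration, not the dashboard's implementation: the names `Bin` and `reliability_bins` are hypothetical, and equal-width bins with empty bins skipped are assumptions.

```python
from typing import List, NamedTuple, Sequence


class Bin(NamedTuple):
    lower: float            # lower edge of the confidence bin
    upper: float            # upper edge of the confidence bin
    count: int              # number of reviewed decisions in the bin
    mean_confidence: float  # x-coordinate of the plotted point
    accuracy: float         # y-coordinate of the plotted point


def reliability_bins(confidences: Sequence[float],
                     outcomes: Sequence[bool],
                     n_bins: int = 10) -> List[Bin]:
    """Group decisions into equal-width confidence bins (hypothetical helper)."""
    if len(confidences) != len(outcomes):
        raise ValueError("confidences and outcomes must have the same length")

    conf_sums = [0.0] * n_bins
    correct_counts = [0] * n_bins
    counts = [0] * n_bins

    for p, correct in zip(confidences, outcomes):
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"confidence {p} is outside [0, 1]")
        # A confidence of exactly 1.0 belongs to the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        conf_sums[idx] += p
        correct_counts[idx] += 1 if correct else 0
        counts[idx] += 1

    bins = []
    for i in range(n_bins):
        if counts[i] == 0:
            continue  # assumption: empty bins are not plotted
        bins.append(Bin(lower=i / n_bins,
                        upper=(i + 1) / n_bins,
                        count=counts[i],
                        mean_confidence=conf_sums[i] / counts[i],
                        accuracy=correct_counts[i] / counts[i]))
    return bins
```

Calling `reliability_bins([0.82, 0.85, 0.88, 0.81, 0.84], [True, True, True, False, False])` yields a single bin covering 80% to 90% with mean confidence 0.84 and accuracy 0.60, which would plot below the diagonal.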
The Calibration Dashboard renders each bin's accuracy with a 95% Wilson confidence interval (Wilson 1927), the standard score interval for a binomial proportion. Unlike the Wald interval, which collapses to zero width when every decision in a bin was correct or every one was incorrect, the Wilson interval stays well-behaved in those bins.
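For reference, the Wilson score interval can be computed as follows. This is a sketch of the textbook formula; the function name `wilson_interval` is hypothetical, and z = 1.96 is the usual approximation for a 95% interval.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must be between 0 and n")

    p_hat = successes / n
    z2 = z * z
    denom = 1.0 + z2 / n
    center = (p_hat + z2 / (2 * n)) / denom
    half_width = (z / denom) * math.sqrt(p_hat * (1.0 - p_hat) / n + z2 / (4 * n * n))
    # Clamp to [0, 1] to absorb floating-point rounding at the extremes.
    return max(0.0, center - half_width), min(1.0, center + half_width)
```

For a bin where all 10 of 10 decisions were correct, the Wald interval has zero width, while the Wilson lower bound works out to n / (n + z²), about 0.72, so the plotted error bar still reflects how little evidence 10 decisions provide.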
Three patterns to recognize
[Figure: three reliability-diagram panels, each plotting bins (y: empirical accuracy, x: predicted confidence) against the diagonal. Panels: Well-calibrated; Overconfident (claims more than warranted); Underconfident (claims less than warranted).]
- Well-calibrated — bins lie on/near the diagonal. The model's "80%" really is 80%.
- Overconfident — bins below diagonal. The model says 80% but is right 60% of the time. Most common pattern in RLHF-tuned LLMs (per Sharma et al. 2023 sycophancy literature).
- Underconfident: bins above the diagonal. The model says 60% but is right 80% of the time. Less common, but it can happen when temperature scaling overcorrects (temperature set too high).
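One way to turn the three patterns into a per-bin label is to ask whether the bin's mean predicted confidence falls inside the confidence interval on its accuracy. The rule below is an illustrative assumption, not something the dashboard is documented to do, and `classify_bin` is a hypothetical name.

```python
def classify_bin(mean_confidence: float, ci_low: float, ci_high: float) -> str:
    """Label a bin by comparing its x-value to the interval on its accuracy.

    ci_low and ci_high are the bounds of the confidence interval on the bin's
    empirical accuracy (for example, a 95% Wilson interval).
    """
    if ci_low > ci_high:
        raise ValueError("ci_low must not exceed ci_high")
    if mean_confidence > ci_high:
        # The whole interval sits below the diagonal at this x-value.
        return "overconfident"
    if mean_confidence < ci_low:
        # The whole interval sits above the diagonal at this x-value.
        return "underconfident"
    return "consistent with calibration"
```

A bin with mean confidence 0.80 and an accuracy interval of (0.50, 0.69) is labeled overconfident, while the same interval around a mean confidence of 0.60 is treated as consistent with calibration. Using the interval rather than the point estimate avoids flagging bins whose deviation could be sampling noise.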
What to do when you see miscalibration
Adjudon's per-org calibration map corrects this automatically. Once your review corpus crosses ~500 labeled outcomes for a given (agent, decision-type) slice, the per-agent isotonic-regression sub-map is fit on your own data, with hierarchical-shrinkage weight w_org = n_org / (n_org + 500): at 500 outcomes you pull 50% from your own data, at 5,000 you pull ~91%. Below 500 the slice falls back to per-org or cross-org priors, so the dashboard never shows a calibration map that has been fit on statistically thin evidence.
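The shrinkage arithmetic can be checked directly. In this sketch the constant 500 comes from the formula above; the function names and the linear blend between an org-level calibrated score and a prior-level one are assumptions made for illustration, since the text does not specify how the two are combined.

```python
def org_weight(n_org: int, k: int = 500) -> float:
    """Shrinkage weight w_org = n_org / (n_org + k), with k = 500 as in the text."""
    if n_org < 0:
        raise ValueError("n_org must be non-negative")
    return n_org / (n_org + k)


def blend_scores(org_score: float, prior_score: float, n_org: int, k: int = 500) -> float:
    """Assumed linear blend: weight the org's own fit by w_org, the prior by 1 - w_org."""
    w = org_weight(n_org, k)
    return w * org_score + (1.0 - w) * prior_score


if __name__ == "__main__":
    for n in (0, 500, 5000):
        print(n, round(org_weight(n), 3))
```

With 500 labeled outcomes the weight is exactly 0.5, and with 5,000 it is 5000 / 5500, roughly 0.909, which matches the ~91% quoted above.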
The current dashboard shows the post-calibration reliability diagram (after isotonic correction is applied to the raw model output). A side-by-side pre-vs-post view is on the methodology roadmap; until that ships, the post-calibration diagram is the canonical view of "how well your calibrated scores match reality" — the diagonal-fit you see is what regulators and your audit team will see in the Article 13 IFU export.
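To make "isotonic correction" concrete, here is a generic pool-adjacent-violators sketch that fits a non-decreasing map from raw confidence to observed accuracy. It is not Adjudon's implementation: the function names are hypothetical, and the fitted map is applied as a step function for simplicity, whereas library implementations often interpolate between fitted points.

```python
from bisect import bisect_left
from typing import Dict, List, Sequence, Tuple


def fit_isotonic(scores: Sequence[float],
                 outcomes: Sequence[bool]) -> Tuple[List[float], List[float]]:
    """Fit a non-decreasing step function with pool-adjacent-violators.

    Returns (upper_edges, values): values[i] is the calibrated score for raw
    scores up to and including upper_edges[i].
    """
    if len(scores) != len(outcomes) or not scores:
        raise ValueError("need equal-length, non-empty inputs")

    # Pool identical raw scores first so ties share one fitted value.
    totals: Dict[float, Tuple[float, int]] = {}
    for x, y in zip(scores, outcomes):
        s, c = totals.get(x, (0.0, 0))
        totals[x] = (s + (1.0 if y else 0.0), c + 1)

    # Each block holds [sum of outcomes, count, largest raw score covered].
    blocks: List[List[float]] = []
    for x in sorted(totals):
        s, c = totals[x]
        blocks.append([s, c, x])
        # Merge backwards while the previous block's mean exceeds this one's.
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            s_last, c_last, hi_last = blocks.pop()
            blocks[-1][0] += s_last
            blocks[-1][1] += c_last
            blocks[-1][2] = hi_last

    upper_edges = [b[2] for b in blocks]
    values = [b[0] / b[1] for b in blocks]
    return upper_edges, values


def apply_isotonic(score: float, upper_edges: List[float], values: List[float]) -> float:
    """Look up the calibrated value; scores past the last edge use the last value."""
    idx = bisect_left(upper_edges, score)
    return values[min(idx, len(values) - 1)]
```

For raw scores 0.1 through 0.6 with outcomes 0, 1, 0, 1, 1, 1, the out-of-order pair at 0.2 and 0.3 is pooled into a single block with value 0.5, so the fitted map never decreases as raw confidence increases.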
Methodology references
- Niculescu-Mizil, A. & Caruana, R. (2005). Predicting Good Probabilities with Supervised Learning. ICML.
- Murphy, A. H. (1973). A New Vector Partition of the Probability Score. Journal of Applied Meteorology. (Brier-score decomposition into reliability, resolution, and uncertainty.)
- Wilson, E. B. (1927). Probable Inference, the Law of Succession, and Statistical Inference. Journal of the American Statistical Association. (Binomial-proportion confidence interval.)