Reliability Diagrams

What it is. A reliability diagram is the canonical visualization of calibration: it plots empirical accuracy against predicted confidence, bin by bin. A perfectly calibrated AI system has every bin on the diagonal.

How to read one

Each point in the diagram is one bin (typically 10 bins from 0% to 100% predicted confidence). Within a bin:

X-axis: mean predicted confidence of decisions in this bin
Y-axis: empirical accuracy (fraction of decisions in the bin that turned out correct upon human review)

A perfectly calibrated model has all bins on the diagonal y = x. If your bin at "predicted 80%" has accuracy 60%, your model is overconfident in that range — it claims more confidence than is warranted.
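The binning described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the dashboard's implementation: the function name `reliability_bins`, the choice of equal-width bins, and the sample data are all assumptions made for the example.

```python
def reliability_bins(confidences, correct, n_bins=10):
    """Group decisions into equal-width confidence bins (an assumed scheme).

    Returns one dict per non-empty bin with the mean predicted confidence
    (x-axis), the empirical accuracy (y-axis), and the bin size.
    """
    sums = [0.0] * n_bins   # sum of predicted confidence per bin
    hits = [0] * n_bins     # number of correct decisions per bin
    counts = [0] * n_bins   # number of decisions per bin
    for p, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 belongs to the last bin.
        i = min(int(p * n_bins), n_bins - 1)
        sums[i] += p
        hits[i] += 1 if ok else 0
        counts[i] += 1
    return [
        {
            "bin": i,
            "mean_confidence": sums[i] / counts[i],
            "accuracy": hits[i] / counts[i],
            "n": counts[i],
        }
        for i in range(n_bins)
        if counts[i] > 0
    ]


# Hypothetical review data: five decisions in the 80-90% bin, three correct.
confs = [0.81, 0.82, 0.85, 0.83, 0.84]
outcomes = [True, True, True, False, False]
bins = reliability_bins(confs, outcomes)
for b in bins:
    gap = b["accuracy"] - b["mean_confidence"]  # negative gap = overconfident
    print(b["bin"], round(b["mean_confidence"], 3), b["accuracy"], round(gap, 3))
```

A negative gap means the bin sits below the diagonal, which is the overconfident case described above.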

The Calibration Dashboard renders this with 95% Wilson confidence intervals on each bin's accuracy (Wilson 1927). The Wilson interval is a standard confidence interval for a binomial proportion, and it stays well behaved in bins where every decision was correct or every decision was incorrect, where the Wald interval collapses to zero width.
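To make the contrast with the Wald interval concrete, here is a minimal sketch of both formulas. The function names are hypothetical and z = 1.96 is the usual normal quantile for a 95% interval; how the dashboard computes its intervals internally is not specified here.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if n == 0:
        raise ValueError("an empty bin has no interval")
    p_hat = successes / n
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n))
    # Clamp to [0, 1] to absorb floating-point rounding at the edges.
    return max(0.0, center - half), min(1.0, center + half)


def wald_interval(successes, n, z=1.96):
    """Wald (normal-approximation) interval, shown only for comparison."""
    p_hat = successes / n
    half = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half


# A bin of 20 decisions that were all judged correct on review.
lo, hi = wilson_interval(20, 20)
wald_lo, wald_hi = wald_interval(20, 20)
print("Wilson:", round(lo, 3), round(hi, 3))
print("Wald:  ", wald_lo, wald_hi)
```

For the all-correct bin, the Wald interval has zero width at 100%, while the Wilson interval still reports a lower bound below 100%, reflecting that 20 decisions are limited evidence.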

Three patterns to recognize

       y                  y                  y
       |                  |                  |
       |    .   .         |  .               |        .
       |  .              .                   |       .
       |  ←───            |   .              |      .
       |     diagonal     |     ←───         |     ← diagonal
       |                  |       diagonal   | .
       |                  |                  |
       └──────── x        └──────── x        └──────── x

   Well-calibrated      Overconfident        Underconfident
                       (claims more         (claims less
                        than warranted)     than warranted)

Well-calibrated: bins lie on or near the diagonal. The model's "80%" really is 80%.
Overconfident: bins fall below the diagonal. The model says 80% but is right only 60% of the time. This is the most common pattern in RLHF-tuned LLMs (per the sycophancy literature, e.g. Sharma et al. 2023).
Underconfident: bins sit above the diagonal. The model says 60% but is right 80% of the time. This is less common, but it happens when temperature scaling is applied too aggressively, that is, with a temperature high enough to flatten the scores more than the data warrants.
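As a rough illustration of the three patterns, a bin can be labeled by the sign of its gap from the diagonal. The helper name `classify_bin` and the 0.05 tolerance are hypothetical choices for this sketch; in practice, checking whether the diagonal falls inside the bin's confidence interval is a more principled test than a fixed tolerance.

```python
def classify_bin(mean_confidence, accuracy, tolerance=0.05):
    """Label one bin by its position relative to the diagonal y = x.

    The tolerance is an arbitrary illustrative threshold, not a product setting.
    """
    gap = accuracy - mean_confidence
    if gap < -tolerance:
        return "overconfident"    # bin below the diagonal
    if gap > tolerance:
        return "underconfident"   # bin above the diagonal
    return "well-calibrated"      # bin on or near the diagonal


print(classify_bin(0.80, 0.60))
print(classify_bin(0.60, 0.80))
print(classify_bin(0.80, 0.79))
```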

What to do when you see miscalibration

Adjudon's per-org calibration map corrects this automatically. Once your review corpus crosses ~500 labeled outcomes for a given (agent, decision-type) slice, the per-agent isotonic-regression sub-map is fit on your own data with the hierarchical-shrinkage weight w_org = n_org / (n_org + 500). At 500 outcomes the weight on your own data is 50%; at 5,000 it is ~91%. Below 500 the slice falls back to per-org or cross-org priors, so the dashboard never shows a calibration map fit on statistically thin evidence.
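The weight formula is easy to check numerically. In the sketch below, `shrinkage_weight` implements the stated formula directly. `blended_score` is an assumption: it shows one common way such a weight is used, a convex combination of an org-level estimate and a prior estimate, but this section does not state how the two maps are actually combined.

```python
def shrinkage_weight(n_org, prior_strength=500):
    """w_org = n_org / (n_org + 500), as stated in the text."""
    return n_org / (n_org + prior_strength)


def blended_score(org_calibrated, prior_calibrated, n_org, prior_strength=500):
    """Hypothetical blend: a convex combination weighted by w_org.

    This combination rule is an assumption for illustration only.
    """
    w = shrinkage_weight(n_org, prior_strength)
    return w * org_calibrated + (1 - w) * prior_calibrated


for n in (500, 5000):
    print(n, round(shrinkage_weight(n), 3))
```

The loop reproduces the two figures quoted above: a weight of one half at 500 outcomes, and roughly 0.91 at 5,000.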

The current dashboard shows the post-calibration reliability diagram, drawn after isotonic correction is applied to the raw model output. A side-by-side pre-vs-post view is on the methodology roadmap. Until that ships, the post-calibration diagram is the canonical view of how well your calibrated scores match reality: the diagonal fit you see is what regulators and your audit team will see in the Article 13 IFU export.
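For readers unfamiliar with isotonic correction, the idea can be sketched with the standard pool-adjacent-violators (PAV) algorithm, which fits a non-decreasing step function from raw scores to observed outcomes. This is a generic textbook sketch under assumed names (`fit_isotonic`, `apply_isotonic`) and made-up data. It does not describe how Adjudon's calibration map is fit or stored.

```python
from itertools import groupby


def fit_isotonic(scores, outcomes):
    """Fit a non-decreasing step function with pool-adjacent-violators (PAV).

    Returns a list of (upper_score, calibrated_value) steps, sorted by score.
    """
    pairs = sorted(zip(scores, outcomes))
    blocks = []  # each block: [sum_of_outcomes, count, highest_score_in_block]
    # Decisions with identical scores are pooled up front.
    for score, group in groupby(pairs, key=lambda pair: pair[0]):
        ys = [y for _, y in group]
        blocks.append([float(sum(ys)), len(ys), score])
        # Pool with the previous block while monotonicity is violated.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count, upper = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = upper
    return [(upper, total / count) for total, count, upper in blocks]


def apply_isotonic(steps, score):
    """Look up the calibrated value for a raw score (no interpolation)."""
    for upper, value in steps:
        if score <= upper:
            return value
    return steps[-1][1]  # scores above the fitted range use the last step


# Hypothetical raw scores and review outcomes (1 = correct, 0 = incorrect).
raw_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
reviewed = [0, 1, 0, 1, 1, 1]
steps = fit_isotonic(raw_scores, reviewed)
print(steps)
print(apply_isotonic(steps, 0.25))
```

In this toy data the outcomes at scores 0.2 and 0.3 are out of order, so PAV pools them into a single step whose value is their average.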

Methodology references

Niculescu-Mizil, A. & Caruana, R. (2005). Predicting Good Probabilities with Supervised Learning. ICML.
Murphy, A. H. (1973). A New Vector Partition of the Probability Score. Journal of Applied Meteorology. (Brier decomposition into reliability, resolution, and uncertainty.)
Wilson, E. B. (1927). Probable Inference, the Law of Succession, and Statistical Inference. Journal of the American Statistical Association. (Binomial-proportion CI.)
