Policy Effectiveness
A competitor entering month 1 with the same algorithms cannot match Adjudon's precision-vs-data-volume curve at month 18 without an equivalent labelled-decision corpus.
The Effectiveness Service (Phase 2 Track D per LD-6) is the moat-compounding artifact. The same way credit-bureau, search-ranking, and antivirus-signature moats compound, per-customer policy effectiveness compounds over time as review outcomes accumulate.
Metrics
Each policy is scored over a sliding time-window via classical classification metrics:
- Precision = TP / (TP + FP) — of the decisions this policy made (block/flag/notify), what fraction did the human reviewer confirm?
- Recall = TP / (TP + FN) — of the decisions that SHOULD have been blocked/flagged, what fraction did this policy catch?
- F1 = 2·P·R / (P + R) — harmonic mean
- FPR = FP / (FP + TN) — false-positive rate
- Cohen's κ = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance (κ ≥ 0.6 = substantial)
The confusion matrix is built from PolicyReviewOutcome records — one record per human reviewer verdict on a policy decision, written by cpiFeedbackIngest (Phase 2 D.1 SDK extension).
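The formulas above can be sketched in a few lines of Python. This is a minimal illustration, not the Effectiveness Service's actual code: the function names `ratio` and `policy_metrics` are hypothetical, and returning `None` for any metric whose inputs are missing or whose denominator is zero is an assumption made here.

```python
from typing import Optional


def ratio(numerator: float, denominator: float) -> Optional[float]:
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def policy_metrics(tp: float, fp: float,
                   fn: Optional[float] = None,
                   tn: Optional[float] = None) -> dict:
    """Derive precision, recall, F1, FPR and Cohen's kappa from counts.

    fn / tn may be None when no broader review program supplies them;
    every metric that needs a missing count is reported as None.
    """
    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn) if fn is not None else None
    fpr = ratio(fp, fp + tn) if tn is not None else None

    # F1 is the harmonic mean of precision and recall.
    f1 = None
    if precision is not None and recall is not None and precision + recall > 0:
        f1 = 2 * precision * recall / (precision + recall)

    # Cohen's kappa compares observed agreement p_o with chance agreement p_e.
    kappa = None
    if fn is not None and tn is not None:
        total = tp + fp + fn + tn
        if total:
            p_o = (tp + tn) / total
            p_e = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total ** 2
            kappa = ratio(p_o - p_e, 1 - p_e)

    return {"precision": precision, "recall": recall, "f1": f1,
            "fpr": fpr, "kappa": kappa}
```

For example, `policy_metrics(40, 10, 5, 45)` gives a precision of 0.8 and a κ of about 0.7, while `policy_metrics(3, 1)` reports only precision because FN and TN are absent.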
Where ground truth comes from
When a human reviewer:
- Confirms an automated block → True Positive (TP)
- Reverses an automated block → False Positive (FP)
- Partially corrects an automated block → half-credit FP
False Negatives (FN) and True Negatives (TN) are NOT directly observable from policy-fired outcomes alone (we don't have "human reviewed every approved decision and confirmed it was right"). The caller can supply FN/TN counts from a broader review program when available.
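The mapping from reviewer verdicts to confusion-matrix counts might look like the sketch below. The verdict labels and the `tally_outcomes` name are hypothetical (the real PolicyReviewOutcome schema is not shown here), and "half-credit FP" is interpreted as adding 0.5 to FP and nothing to TP, which is an assumption.

```python
from typing import Iterable, Optional

# Hypothetical verdict labels; the real PolicyReviewOutcome fields may differ.
CONFIRMED = "confirmed"
REVERSED = "reversed"
PARTIALLY_CORRECTED = "partially_corrected"


def tally_outcomes(verdicts: Iterable[str],
                   fn: Optional[float] = None,
                   tn: Optional[float] = None) -> dict:
    """Count TP / FP from reviewer verdicts on decisions the policy fired on.

    FN and TN cannot be derived from policy-fired outcomes, so they are
    passed through unchanged: None unless the caller supplies them.
    """
    tp = 0.0
    fp = 0.0
    for verdict in verdicts:
        if verdict == CONFIRMED:
            tp += 1.0
        elif verdict == REVERSED:
            fp += 1.0
        elif verdict == PARTIALLY_CORRECTED:
            # Assumption: half credit means 0.5 FP and no TP credit.
            fp += 0.5
        else:
            raise ValueError(f"unknown verdict: {verdict!r}")
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn}
```

With two confirmations, one reversal and one partial correction, this yields TP = 2.0 and FP = 1.5, and leaves FN and TN unset.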
Per-condition Shapley attribution
For a policy with N ≤ 10 conditions, Adjudon computes exact Shapley values over the 2^N subsets (1024 evaluations max). For N > 10 the current spike refuses (refitShapley throws) — Phase 2 D.4 scaffolds KernelSHAP Monte-Carlo (Štrumbelj & Kononenko 2014) with target ε ≤ 0.01 at 10K samples; until then customers with N > 10 policy conditions see a "Shapley unavailable" badge in the dashboard. Most customer policies have N ≤ 6, so this is rare in practice.
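As an illustration of the exact computation, the sketch below enumerates all 2^N subsets once, caches the value of each, and then applies the standard Shapley weighting. It is not the refitShapley implementation: `exact_shapley` and the `value` callable are hypothetical names, and raising `ValueError` for N > 10 stands in for whatever error the spike actually throws.

```python
from itertools import combinations
from math import factorial
from typing import Callable, Dict, FrozenSet, Sequence

MAX_EXACT_CONDITIONS = 10  # 2**10 = 1024 value evaluations at most


def exact_shapley(conditions: Sequence[str],
                  value: Callable[[FrozenSet[str]], float]) -> Dict[str, float]:
    """Exact Shapley value of each condition under the value function."""
    n = len(conditions)
    if n > MAX_EXACT_CONDITIONS:
        raise ValueError(f"exact Shapley refused for N={n} > {MAX_EXACT_CONDITIONS}")

    # Evaluate the value function exactly once per subset (2**n calls).
    cache: Dict[FrozenSet[int], float] = {}
    for size in range(n + 1):
        for subset in combinations(range(n), size):
            names = frozenset(conditions[i] for i in subset)
            cache[frozenset(subset)] = value(names)

    shapley: Dict[str, float] = {}
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size: |S|! (n - |S| - 1)! / n!
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (cache[s | {i}] - cache[s])
        shapley[conditions[i]] = total
    return shapley
```

A quick sanity check is an additive value function, where each condition's Shapley value should equal its own weight, and the values always sum to the value of the full set minus the value of the empty set.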
Important — what Shapley measures here
Shapley values are over fire-rate, NOT over precision.
Review-queue outcomes give binary ground truth at the action level (the block was right or wrong). Condition-level ground truth is unknown: we don't know which specific condition was responsible for a reviewer's verdict.
So Shapley is framed as:
Average contribution of each condition to the policy's historical fire-rate over the last N evaluations.
This is computable from policy-version replay against historical traces (no human labels needed). It tells you which conditions are doing the work; it does NOT tell you which conditions are causing reviewer disagreement.
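A fire-rate value function of this kind could be sketched as follows. Several things here are assumptions for illustration: each replayed trace is represented as a mapping from condition name to whether that condition held, conditions are combined with AND, and a policy with no conditions is treated as never firing. The `fire_rate` name is hypothetical.

```python
from typing import AbstractSet, Mapping, Sequence


def fire_rate(traces: Sequence[Mapping[str, bool]],
              subset: AbstractSet[str]) -> float:
    """Fraction of historical traces on which every condition in subset held.

    Assumptions: AND semantics across conditions, and an empty subset
    (or an empty history) yields a fire-rate of 0.0.
    """
    if not subset or not traces:
        return 0.0
    fired = sum(1 for trace in traces if all(trace[name] for name in subset))
    return fired / len(traces)


# Four replayed traces for a hypothetical two-condition policy.
history = [
    {"a": True, "b": True},
    {"a": True, "b": False},
    {"a": True, "b": False},
    {"a": False, "b": True},
]
```

Here condition `a` alone would fire on 3 of 4 traces, `b` alone on 2 of 4, and the full policy on 1 of 4. No reviewer labels are involved at any point.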
Per Huang & Marques-Silva (The Inadequacy of Shapley Values for Explainability, arXiv:2302.08160), Shapley values can mis-rank features that are provably irrelevant. We therefore use Shapley as a first-order ranking signal across the policy population and never present per-decision Shapley values to customers as causal explanations.
Drift Detection (Phase 2 Track D.7)
Every week, the drift detector computes:
- Kolmogorov-Smirnov statistic on confidenceScore distribution: current window vs 30-day baseline. Threshold 0.15 (Cohen "medium effect").
- Chi-squared on categorical attributes (status, agentId).
- Fire-rate delta: |current_fire_rate - baseline_fire_rate|. Threshold 0.20.
When drift exceeds threshold, a policy.drift_detected webhook fires and the dashboard surfaces a red banner. Customers investigate: model drift? Data shift? Adversarial gaming?
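Two of the checks above, the Kolmogorov-Smirnov statistic and the fire-rate delta, can be sketched with the standard library alone. This is an illustration under assumptions, not the drift detector itself: the function names are hypothetical, "exceeds" is read as strictly greater than, and the chi-squared check is left out because no threshold for it is given.

```python
from bisect import bisect_right
from typing import Sequence

KS_THRESHOLD = 0.15         # threshold on the confidenceScore KS statistic
FIRE_RATE_THRESHOLD = 0.20  # threshold on |current - baseline| fire-rate


def ks_statistic(current: Sequence[float], baseline: Sequence[float]) -> float:
    """Two-sample KS statistic: the largest gap between the empirical CDFs."""
    if not current or not baseline:
        raise ValueError("both samples must be non-empty")
    cur, base = sorted(current), sorted(baseline)
    gap = 0.0
    # The maximum gap between two step functions occurs at a sample point.
    for x in cur + base:
        cdf_cur = bisect_right(cur, x) / len(cur)
        cdf_base = bisect_right(base, x) / len(base)
        gap = max(gap, abs(cdf_cur - cdf_base))
    return gap


def detect_drift(current_scores: Sequence[float],
                 baseline_scores: Sequence[float],
                 current_fire_rate: float,
                 baseline_fire_rate: float) -> dict:
    """Evaluate the KS and fire-rate checks against their thresholds."""
    ks = ks_statistic(current_scores, baseline_scores)
    delta = abs(current_fire_rate - baseline_fire_rate)
    drifted = ks > KS_THRESHOLD or delta > FIRE_RATE_THRESHOLD
    return {"ks": ks, "fire_rate_delta": delta, "drifted": drifted}
```

A caller would emit the policy.drift_detected webhook when `drifted` is true; that wiring is omitted here.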
Suggested refinements (Phase 2 Track D.6)
The Refinement Engine performs rule-search over a finite candidate set per policy:
- Numeric threshold ±5% and ±10% sweeps
- Operator boundary tweaks (less_than ↔ lte for inclusivity)
- Remove lowest-Shapley condition
- Add output_contains_pii AND-clause (defence-in-depth augmentation)
Each candidate is simulated against the org's recent history (via the policyReplayService counterfactual evaluator) to project ΔPrecision / ΔRecall / ΔF1. Candidates are ranked by F1 delta, and the customer reviews the top 10 suggestions in the Effectiveness Dashboard.
Cardinal Rule #6 compliance: this is rule-search over the customer's OWN labelled outcomes. No ML training. No cross-customer aggregation. The output is concrete proposed policy mutations the customer chooses to accept or reject.
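The condition-level part of the candidate search (threshold sweeps and boundary tweaks) and the ranking step can be sketched as below. The condition shape, the function names, and the `project_f1` callable (a stand-in for the policyReplayService simulation) are all hypothetical. The policy-level candidates, removing the lowest-Shapley condition and adding the output_contains_pii clause, are omitted for brevity.

```python
from typing import Callable, Dict, List, Tuple

Condition = Dict[str, object]  # assumed shape: field, operator, value

SWEEP_FACTORS = (0.90, 0.95, 1.05, 1.10)  # -10%, -5%, +5%, +10%
BOUNDARY_SWAPS = {"less_than": "lte", "lte": "less_than"}


def candidate_mutations(condition: Condition) -> List[Condition]:
    """Enumerate the finite set of single-condition mutations."""
    candidates: List[Condition] = []
    value = condition["value"]
    # Numeric thresholds get the four sweep variants (bools are excluded).
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        for factor in SWEEP_FACTORS:
            # Rounded to 6 decimals to hide floating-point noise.
            candidates.append({**condition, "value": round(value * factor, 6)})
    # Inclusivity tweak: swap the operator for its boundary counterpart.
    swapped = BOUNDARY_SWAPS.get(str(condition["operator"]))
    if swapped is not None:
        candidates.append({**condition, "operator": swapped})
    return candidates


def rank_by_f1_delta(candidates: List[Condition],
                     project_f1: Callable[[Condition], float],
                     baseline_f1: float,
                     top_n: int = 10) -> List[Tuple[float, Condition]]:
    """Score each candidate by projected F1 minus baseline, best first."""
    scored = [(project_f1(c) - baseline_f1, c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]
```

Nothing here fits a model: the search space is a fixed list of mutations, and the only data consulted is whatever the scoring callable replays from the customer's own history.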
Plan-tier gating
| Tier | Effectiveness behaviour |
|---|---|
| Sandbox / Scale | No effectiveness dashboard; cpiFeedbackIngest writes ReviewOutcome but no PolicyEffectivenessSnapshot refit |
| Governance | Per-policy precision/recall/F1/Cohen's κ snapshots; drift detector; manual suggested-refinement |
| Enterprise | + Per-condition Shapley + automated suggested-refinement + drift webhooks |
| Custom | + Cross-org anonymised benchmarking (opt-in, differential privacy ε/δ — Phase 3 deferred) |
What's in the dashboard
/dashboard/policies/effectiveness shows:
- One card per (policyId, policyVersionHash) tuple with at least one snapshot in the window
- Confusion-matrix counts (TP / FP / FN / TN) + derived metrics
- Cohen's κ as agreement-vs-chance signal
- Per-condition Shapley bar chart (when available)
- 30/90/365-day trend (Recharts time-series)
The 18-month moat math
- Competitor cold-start: Brier ~0.20–0.25 (uncalibrated heuristic blends per Niculescu-Mizil & Caruana 2005).
- Adjudon at month 18: Brier ~0.10–0.15 (range Guo et al. 2017 reports for temperature scaling on standard benchmarks).
- Gap requires 18 months of equivalent customer review data to close → structural data moat.
This is the moat number Munich Re's AI underwriting team uses when pricing Adjudon-evidenced AI liability coverage (per LD-11).
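For reference, the Brier score quoted in these ranges is the mean squared difference between a predicted probability and the binary outcome, so lower is better. The sketch below shows only the definition; which probability is scored against which outcome in Adjudon's own measurement is not specified here, and the `brier_score` name is hypothetical.

```python
from typing import Sequence


def brier_score(probabilities: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean of (p - y)^2 over paired predictions p and binary outcomes y."""
    if not probabilities or len(probabilities) != len(outcomes):
        raise ValueError("need two non-empty sequences of equal length")
    squared_errors = [(p - y) ** 2 for p, y in zip(probabilities, outcomes)]
    return sum(squared_errors) / len(squared_errors)
```

As a calibration point, always predicting 0.5 scores exactly 0.25 regardless of the outcomes, which is the top of the cold-start range quoted above.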
Cardinal Rules applied
- Rule #1: every PolicyReviewOutcome + PolicyEffectivenessSnapshot query filters organizationId FIRST via compound index.
- Rule #4: correctedDecision + reviewMetadata PII-scrubbed at write time via piiScrubber.scrubString / scrubPayload.
- Rule #5: PolicyReviewOutcome is append-only (mutation rejection middleware).
- Rule #6: derived from the customer's OWN labelled outcomes; never trained, never cross-customer.