Policy Effectiveness

A competitor entering month 1 with the same algorithms cannot match Adjudon's precision-vs-data-volume curve at month 18 without an equivalent labelled-decision corpus.

The Effectiveness Service (Phase 2 Track D per LD-6) is the moat-compounding artifact. Just as credit-bureau, search-ranking, and antivirus-signature moats compound, per-customer policy effectiveness compounds over time as review outcomes accumulate.

Metrics

Each policy is scored over a sliding time-window via classical classification metrics:

Precision = TP / (TP + FP) — of the decisions this policy made (block/flag/notify), what fraction did the human reviewer confirm?
Recall = TP / (TP + FN) — of the decisions that SHOULD have been blocked/flagged, what fraction did this policy catch?
F1 = 2·P·R / (P + R) — harmonic mean
FPR = FP / (FP + TN) — false-positive rate
Cohen's κ = (po - pe) / (1 - pe) — agreement vs chance (≥ 0.6 = substantial)
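The formulas above can be sketched as a single function over confusion-matrix counts. This is an illustrative sketch, not the Effectiveness Service's implementation: the `Confusion` class, the `metrics` function, and the convention of returning 0.0 on a zero denominator are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Confusion-matrix counts (floats, so fractional credit is allowed)."""
    tp: float
    fp: float
    fn: float
    tn: float


def safe_div(num: float, den: float) -> float:
    # Assumed convention: an undefined ratio is reported as 0.0.
    return num / den if den else 0.0


def metrics(c: Confusion) -> dict:
    total = c.tp + c.fp + c.fn + c.tn
    precision = safe_div(c.tp, c.tp + c.fp)
    recall = safe_div(c.tp, c.tp + c.fn)
    f1 = safe_div(2 * precision * recall, precision + recall)
    fpr = safe_div(c.fp, c.fp + c.tn)
    # Observed agreement: policy and reviewer agree on TP and TN.
    po = safe_div(c.tp + c.tn, total)
    # Chance agreement from the row and column marginals.
    pe = safe_div(
        (c.tp + c.fp) * (c.tp + c.fn) + (c.fn + c.tn) * (c.fp + c.tn),
        total * total,
    )
    kappa = safe_div(po - pe, 1 - pe)
    return {"precision": precision, "recall": recall, "f1": f1,
            "fpr": fpr, "kappa": kappa}
```

For example, counts of TP = 40, FP = 10, FN = 20, TN = 30 give precision 0.8, FPR 0.25, po = 0.7, pe = 0.5, and therefore κ = 0.4.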

The confusion matrix is built from PolicyReviewOutcome records — one record per human reviewer verdict on a policy decision, written by cpiFeedbackIngest (Phase 2 D.1 SDK extension).

Where ground truth comes from

When a human reviewer:

Confirms an automated block → True Positive (TP)
Reverses an automated block → False Positive (FP)
Partially corrects an automated block → half-credit FP

False Negatives (FN) and True Negatives (TN) are NOT directly observable from policy-fired outcomes alone, because reviewers do not see every approved decision and confirm it was right. The caller can supply FN/TN counts from a broader review program when available.
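A minimal sketch of how reviewer verdicts might be tallied into counts. The verdict labels and the `tally` function are hypothetical, and the real PolicyReviewOutcome schema may differ. The sketch also assumes one reading of "half-credit FP": a partial correction adds 0.5 to FP and the remaining 0.5 to TP, so every verdict carries a total weight of 1. Counting only the 0.5 FP would be an equally plausible reading.

```python
from collections import Counter
from typing import Iterable

# Hypothetical verdict labels, chosen for this example only.
CONFIRMED = "confirmed"
REVERSED = "reversed"
PARTIAL = "partially_corrected"


def tally(verdicts: Iterable[str], fn: float = 0.0, tn: float = 0.0) -> dict:
    """Turn reviewer verdicts on fired policies into confusion counts.

    fn and tn are caller-supplied, because policy-fired outcomes alone
    cannot reveal them.
    """
    counts = Counter(verdicts)
    unknown = set(counts) - {CONFIRMED, REVERSED, PARTIAL}
    if unknown:
        raise ValueError(f"unknown verdicts: {sorted(unknown)}")
    # Assumption: a partial correction splits its weight between TP and FP.
    tp = counts[CONFIRMED] + 0.5 * counts[PARTIAL]
    fp = counts[REVERSED] + 0.5 * counts[PARTIAL]
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn}
```

With two confirmations, one reversal, and one partial correction, this yields TP = 2.5 and FP = 1.5.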

Per-condition Shapley attribution

For a policy with N ≤ 10 conditions, Adjudon computes exact Shapley values over the 2^N subsets (at most 1024 subset evaluations). For N > 10 the current spike refuses (refitShapley throws). Phase 2 D.4 scaffolds KernelSHAP Monte-Carlo (Štrumbelj & Kononenko 2014) with target ε ≤ 0.01 at 10K samples; until then, customers with N > 10 policy conditions see a "Shapley unavailable" badge in the dashboard. Most customer policies have N ≤ 6, so this is rare in practice.
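Exact Shapley enumeration over all subsets can be sketched as follows. This is not the refitShapley code: `exact_shapley`, its signature, and the generic `value` callback are assumptions for illustration. The only detail taken from the text is the refusal above N = 10.

```python
from itertools import combinations
from math import factorial
from typing import Callable, FrozenSet, List

MAX_EXACT_CONDITIONS = 10  # from the text: exact enumeration stops at N = 10


def exact_shapley(n: int, value: Callable[[FrozenSet[int]], float]) -> List[float]:
    """Exact Shapley values for n conditions under value function `value`."""
    if n > MAX_EXACT_CONDITIONS:
        raise ValueError("exact Shapley refused for N > 10 conditions")
    players = range(n)
    # Evaluate v(S) once for each of the 2^n subsets.
    cache = {}
    for size in range(n + 1):
        for subset in combinations(players, size):
            s = frozenset(subset)
            cache[s] = value(s)
    phi = [0.0] * n
    n_fact = factorial(n)
    for i in players:
        others = [j for j in players if j != i]
        for size in range(n):
            # Shapley weight |S|! (n - |S| - 1)! / n! for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / n_fact
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi[i] += weight * (cache[s | {i}] - cache[s])
    return phi
```

As a sanity check, a value function that pays 1 only when conditions 0 and 1 are both present gives conditions 0 and 1 a value of 0.5 each and condition 2 a value of 0, and the values sum to v(all) - v(empty).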

Important — what Shapley measures here

Shapley values are over fire-rate, NOT over precision.

Review-queue outcomes give binary ground truth at the action level (the block was right or wrong). Condition-level ground truth is unknown: we don't know which specific condition was responsible for a reviewer's verdict.

So Shapley is framed as:

Average contribution of each condition to the policy's historical fire-rate over the last N evaluations.

This is computable from policy-version replay against historical traces (no human labels needed). It tells you which conditions are doing the work; it does NOT tell you which conditions are causing reviewer disagreement.
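A fire-rate value function over replayed traces might look like the sketch below. The trace layout (one row of per-condition booleans per historical evaluation), the AND semantics, and the convention that the empty subset fires on every trace are all assumptions for illustration; the policyReplayService may represent this differently.

```python
from typing import FrozenSet, Sequence


def fire_rate(subset: FrozenSet[int], traces: Sequence[Sequence[bool]]) -> float:
    """Fraction of traces on which a policy restricted to `subset` fires.

    Assumes the policy is a conjunction (AND) of its conditions, so the
    empty subset is vacuously true and fires on every trace.
    """
    if not traces:
        return 0.0
    fired = sum(1 for row in traces if all(row[i] for i in subset))
    return fired / len(traces)
```

Under these assumptions the fire-rate can only shrink as conditions are added, so each condition's Shapley value is zero or negative and measures how much that condition narrows what the policy fires on. No reviewer labels enter the computation.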

Per Huang & Marques-Silva (The Inadequacy of Shapley Values for Explainability, arXiv:2302.08160), Shapley values can mis-rank features that are provably irrelevant. We use Shapley as a first-order ranking signal across the policy population; we never present per-decision Shapley values to customers as causal explanations.

Drift Detection (Phase 2 Track D.7)

Every week, the drift detector computes:

Kolmogorov-Smirnov statistic on confidenceScore distribution: current window vs 30-day baseline. Threshold 0.15 (Cohen "medium effect").
Chi-squared on categorical attributes (status, agentId).
Fire-rate delta: |current_fire_rate - baseline_fire_rate|. Threshold 0.20.
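The Kolmogorov-Smirnov and fire-rate checks above can be sketched in pure Python. The function names and the returned dictionary are hypothetical; only the thresholds 0.15 and 0.20 come from the text. The chi-squared check on categorical attributes is omitted here.

```python
from bisect import bisect_right
from typing import Sequence

KS_THRESHOLD = 0.15         # from the text
FIRE_RATE_THRESHOLD = 0.20  # from the text


def ks_statistic(current: Sequence[float], baseline: Sequence[float]) -> float:
    """Two-sample KS statistic: max gap between the empirical CDFs."""
    if not current or not baseline:
        raise ValueError("both samples must be non-empty")
    a, b = sorted(current), sorted(baseline)
    d = 0.0
    # The supremum of |F_a - F_b| is attained at an observed data point.
    for x in a + b:
        fa = bisect_right(a, x) / len(a)
        fb = bisect_right(b, x) / len(b)
        d = max(d, abs(fa - fb))
    return d


def detect_drift(cur_scores, base_scores, cur_fire_rate, base_fire_rate) -> dict:
    ks = ks_statistic(cur_scores, base_scores)
    delta = abs(cur_fire_rate - base_fire_rate)
    drifted = ks > KS_THRESHOLD or delta > FIRE_RATE_THRESHOLD
    return {"ks": ks, "fire_rate_delta": delta, "drifted": drifted}
```

Identical score distributions give a KS statistic of 0, and fully separated ones give 1. In practice a library routine such as SciPy's `ks_2samp` would also return a p-value, which this sketch does not compute.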

When drift exceeds a threshold, a policy.drift_detected webhook fires and the dashboard surfaces a red banner. Customers then investigate the cause: model drift, data shift, or adversarial gaming.

Suggested refinements (Phase 2 Track D.6)

The Refinement Engine performs rule-search over a finite candidate set per policy:

Numeric threshold ±5% and ±10% sweeps
Operator boundary tweaks (less_than ↔ lte for inclusivity)
Remove lowest-Shapley condition
Add output_contains_pii AND-clause (defence-in-depth augmentation)

Each candidate is simulated against the org's recent history (via the policyReplayService counterfactual evaluator) to project ΔPrecision / ΔRecall / ΔF1. Candidates are ranked by F1-delta, and the customer reviews the top 10 suggestions in the Effectiveness Dashboard.
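The numeric-threshold sweep and F1-delta ranking can be sketched for a single-condition policy. Everything here is an assumption made for illustration: the policy is taken to fire when `score < threshold` (a less_than operator), the history is a list of `(score, should_fire)` pairs with a label for every row, and `rank_threshold_candidates` is a hypothetical name, not the Refinement Engine's API.

```python
from typing import List, Sequence, Tuple

SWEEP_FACTORS = (0.90, 0.95, 1.05, 1.10)  # the ±5% and ±10% sweeps


def f1_at(threshold: float, history: Sequence[Tuple[float, bool]]) -> float:
    """Replay the assumed less_than policy and return its F1 on the history."""
    tp = fp = fn = 0
    for score, should_fire in history:
        fired = score < threshold
        if fired and should_fire:
            tp += 1
        elif fired:
            fp += 1
        elif should_fire:
            fn += 1
    den = 2 * tp + fp + fn
    return 2 * tp / den if den else 0.0


def rank_threshold_candidates(
    baseline: float, history: Sequence[Tuple[float, bool]], top_k: int = 10
) -> List[Tuple[float, float]]:
    """Return (candidate_threshold, f1_delta) pairs, best delta first."""
    base_f1 = f1_at(baseline, history)
    scored = [
        (baseline * f, f1_at(baseline * f, history) - base_f1)
        for f in SWEEP_FACTORS
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

The output is a ranked list of concrete mutations, which matches the text's framing: nothing is trained, and the customer decides which suggestion to accept.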

Cardinal Rule #6 compliance: this is rule-search over the customer's OWN labelled outcomes. No ML training. No cross-customer aggregation. The output is concrete proposed policy mutations the customer chooses to accept or reject.

Plan-tier gating

Tier	Effectiveness behaviour
Sandbox / Scale	No effectiveness dashboard; cpiFeedbackIngest writes ReviewOutcome but no PolicyEffectivenessSnapshot refit
Governance	Per-policy precision/recall/F1/Cohen's κ snapshots; drift detector; manual suggested-refinement
Enterprise	+ Per-condition Shapley + automated suggested-refinement + drift webhooks
Custom	+ Cross-org anonymised benchmarking (opt-in, differential privacy ε/δ — Phase 3 deferred)

What's in the dashboard

/dashboard/policies/effectiveness shows:

One card per (policyId, policyVersionHash) tuple with at least one snapshot in the window
Confusion-matrix counts (TP / FP / FN / TN) + derived metrics
Cohen's κ as agreement-vs-chance signal
Per-condition Shapley bar chart (when available)
30/90/365-day trend (Recharts time-series)

The 18-month moat math

Competitor cold-start: Brier ~0.20–0.25 (uncalibrated heuristic blends per Niculescu-Mizil & Caruana 2005).
Adjudon at month 18: Brier ~0.10–0.15 (range Guo et al. 2017 reports for temperature scaling on standard benchmarks).
Gap requires 18 months of equivalent customer review data to close → structural data moat.

This is the moat number Munich Re's AI underwriting team uses when pricing Adjudon-evidenced AI liability coverage (per LD-11).
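For reference, the Brier score used in the comparison above is the mean squared error between predicted probabilities and binary outcomes. The sketch below is a generic definition and says nothing about how Adjudon computes its own figures; `brier_score` is a name chosen for this example.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean of (p - y)^2 over predictions p in [0, 1] and outcomes y in {0, 1}."""
    if not probs or len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must be non-empty and equal length")
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)
```

A forecaster that always predicts 0.5 scores exactly 0.25, and a perfect forecaster scores 0. Lower is better.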

Cardinal Rules applied

Rule #1: every PolicyReviewOutcome + PolicyEffectivenessSnapshot query filters organizationId FIRST via compound index.
Rule #4: correctedDecision + reviewMetadata PII-scrubbed at write time via piiScrubber.scrubString / scrubPayload.
Rule #5: PolicyReviewOutcome is append-only (mutation rejection middleware).
Rule #6: derived from the customer's OWN labelled outcomes; never trained, never cross-customer.
