Adding custom PII regex patterns

Goal

Scrub organisation-specific identifiers — internal employee IDs, customer account numbers, tax IDs in your jurisdiction, anything outside the always-on patterns — from trace payloads before they reach Adjudon. The result: the scrubbed text is what gets persisted, what gets read on the dashboard, what gets exported in audit bundles, and what flows into vector memory.

Status

The Adjudon server-side PII scrubber covers five fixed patterns today: email, phone, SSN, credit card, IBAN. There is no per-org custom-pattern API, so a requirement like "scrub our customer-account-number format" is handled by a client-side preprocessor, not by server-side configuration. The roadmap includes a per-org pattern surface; until it ships, this recipe is the supported path. The pattern below is what production customers run today.

You'll need

  • Any Adjudon plan
  • An adj_test_* agent API key
  • Python 3.9+ with adjudon installed

pip install adjudon
export ADJUDON_API_KEY="adj_test_..."

Code

custom_pii_scrubber.py
import os
import re
from adjudon import Adjudon

# ── Your organisation's custom patterns ────────────────────────────────
# Each entry: (compiled-regex, replacement-token)
CUSTOM_PATTERNS = [
    # Acme employee ID format: EMP-12345
    (re.compile(r"\bEMP-\d{5}\b"), "[REDACTED_EMP_ID]"),
    # Acme customer account number: AC-XXXXX-XXX
    (re.compile(r"\bAC-\d{5}-\d{3}\b"), "[REDACTED_ACCOUNT]"),
    # German tax ID (Steuer-ID): 11 digits, grouped 2-3-3-3 with optional spaces.
    # There is no prefix to anchor on, so this also matches other 11-digit runs.
    (re.compile(r"\b\d{2}\s?\d{3}\s?\d{3}\s?\d{3}\b"), "[REDACTED_TAX_ID]"),
    # Internal ticket reference: SUPP-2026-12345
    (re.compile(r"\bSUPP-\d{4}-\d{5}\b"), "[REDACTED_TICKET]"),
]

def scrub_custom(value):
    """Recursively apply CUSTOM_PATTERNS to strings inside any payload shape."""
    if isinstance(value, str):
        for pattern, token in CUSTOM_PATTERNS:
            value = pattern.sub(token, value)
        return value
    if isinstance(value, dict):
        return {k: scrub_custom(v) for k, v in value.items()}
    if isinstance(value, list):
        return [scrub_custom(item) for item in value]
    return value

# ── Adjudon client with both layers active ─────────────────────────────
adjudon = Adjudon(
    api_key=os.environ["ADJUDON_API_KEY"],
    agent_id="customer-support-bot",
    redact_pii=True,  # client-side: emails / phones / SSN / cc / IBAN
)

# ── Trace with the custom layer applied first ──────────────────────────
input_context = scrub_custom({
    "prompt": (
        "Customer EMP-87421 escalated SUPP-2026-04812 about account "
        "AC-12345-678. Their email is jane.doe@example.com."
    ),
})
output_decision = scrub_custom({
    "action": "Escalate to tier-2 support",
    "rationale": "Customer has VIP tier; routing per policy",
    "confidence": 0.91,
})

trace = adjudon.trace(
    input_context=input_context,
    output_decision=output_decision,
)

print(f"Trace: {trace.id} status={trace.status}")

Run it:

python custom_pii_scrubber.py
# → Trace: 65b1f2c4... status=approved

Open the trace in the Decision Log and the prompt reads:

Customer [REDACTED_EMP_ID] escalated [REDACTED_TICKET] about account
[REDACTED_ACCOUNT]. Their email is [REDACTED_EMAIL].

The custom tokens come from the client-side preprocessor. The [REDACTED_EMAIL] token comes from the Adjudon server-side scrubber that always runs.
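
Before sending anything to Adjudon, it can help to dry-run the custom layer locally. The sketch below repeats three of the recipe's patterns so it runs on its own; scrub_text is a hypothetical helper name, not part of the adjudon package, and no network call is made.

```python
import re

# Same patterns as the recipe above, repeated so this snippet runs standalone.
CUSTOM_PATTERNS = [
    (re.compile(r"\bEMP-\d{5}\b"), "[REDACTED_EMP_ID]"),
    (re.compile(r"\bAC-\d{5}-\d{3}\b"), "[REDACTED_ACCOUNT]"),
    (re.compile(r"\bSUPP-\d{4}-\d{5}\b"), "[REDACTED_TICKET]"),
]

def scrub_text(text: str) -> str:
    """Apply every custom pattern to a single string (hypothetical helper)."""
    for pattern, token in CUSTOM_PATTERNS:
        text = pattern.sub(token, text)
    return text

sample = "Customer EMP-87421 escalated SUPP-2026-04812 about account AC-12345-678."
scrubbed = scrub_text(sample)
print(scrubbed)
# → Customer [REDACTED_EMP_ID] escalated [REDACTED_TICKET] about account [REDACTED_ACCOUNT].
```

If the printed string still contains a raw identifier, the pattern under-matches and should be fixed before the payload leaves the process.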

The two-layer model

┌──────────────────────────────────────────────────────────┐
│ 1. Your client-side preprocessor                         │
│    - Org-specific patterns (EMP-IDs, account numbers,    │
│      tax IDs, ticket refs)                               │
│    - Replaces matches with [REDACTED_*] tokens           │
│    ──> payload still carries scrubbed strings            │
├──────────────────────────────────────────────────────────┤
│ 2. Adjudon server-side scrubber (always-on)              │
│    - EMAIL, PHONE, SSN, CREDIT_CARD, IBAN                │
│    - Replaces any of those patterns the client missed    │
│    ──> persisted payload is fully scrubbed               │
└──────────────────────────────────────────────────────────┘

The ordering matters. Client-side first means your patterns fire on the payload before Adjudon's library-grade scrubber, which keeps your custom tokens ([REDACTED_EMP_ID]) distinct from the platform tokens ([REDACTED_EMAIL]). An auditor reading the trace later can tell which scrubber matched what just from the token shape.

The server-side scrubber cannot be disabled. Even if your client-side layer matches a string and replaces it, the server-side patterns still run on the resulting payload as defence in depth, so your custom layer cannot accidentally exempt the five always-on patterns.
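
The two passes can be simulated locally to see how the tokens stay distinct. This is only an illustrative sketch: STANDIN_EMAIL is a deliberately simple placeholder regex, not Adjudon's actual server-side pattern, and client_layer and simulated_server_layer are hypothetical names.

```python
import re

# Layer 1: one custom pattern taken from the recipe.
EMP_ID = re.compile(r"\bEMP-\d{5}\b")

# Layer 2 stand-in: a simple email regex used ONLY for local illustration.
# It is an assumption, not the pattern Adjudon runs server-side.
STANDIN_EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")

def client_layer(text: str) -> str:
    """Org-specific scrubbing that runs in your process."""
    return EMP_ID.sub("[REDACTED_EMP_ID]", text)

def simulated_server_layer(text: str) -> str:
    """Local stand-in for the always-on pass that runs after yours."""
    return STANDIN_EMAIL.sub("[REDACTED_EMAIL]", text)

raw = "EMP-87421 wrote from jane.doe@example.com"
after_client = client_layer(raw)
after_both = simulated_server_layer(after_client)
print(after_client)  # email still present: the custom layer does not cover it
print(after_both)    # both identifiers replaced, each with its own token
```

The custom token passes through the second pass unchanged because it contains nothing the stand-in email pattern can match, so each token still shows which layer produced it.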

Why this matters under GDPR

The scrubbed trace is the trace of record. A GDPR Article 17 right-to-erasure request against a customer who showed up in five years of decision history can be satisfied without deleting any traces if the customer's identifiers were already scrubbed at ingestion time. The trace shells remain on the SHA-256 Hash Chain (chain integrity preserved), but the personal data cannot be recovered from them because it was never persisted in the first place. This is the "scrub at ingest, anchor the shell" posture documented at Audit & Security.

Custom patterns matter exactly because GDPR's "personal data" definition is broader than the five built-ins. An internal employee ID maps to one named person inside the operator's HR system; a customer account number maps to one named customer inside the operator's CRM. Those are personal data under Article 4(1) regardless of whether they look like PII to a generic library. Scrubbing them is not optional defence-in-depth; it is the core compliance posture.

Common false-positive traps

  • Trace IDs and order numbers. A regex like \d{10} catches every long digit run, including IDs that are not PII. Anchor on the format prefix (AC-, EMP-, SUPP-) to keep the match precise.
  • MongoDB ObjectIDs. 24-character hex strings show up inside trace metadata; a credit-card-style \d{13,16} pattern without boundary checks can falsely match a long digit run inside an ObjectID. Adjudon's built-in patterns already guard against this; if you write your own credit-card variant for non-Visa schemes, mirror the boundary checks (for example \b\d{13,16}\b).
  • Phone numbers in URLs. A pattern written to catch +49 30 1234 5678 also fires on https://example.com/12345678 if it is too loose. Keep the area-code and grouping hints in your regex.
  • Email addresses in JSON keys. The recursive scrubber walks { "jane.doe@example.com": "value" } and replaces the key. This is correct behaviour but can confuse a downstream consumer that expects to look up by literal email. Avoid using PII as a JSON key in trace payloads.
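
Two of these traps are easy to reproduce with Python's re module. In the sketch below the order number and the ObjectID-like value are made-up examples, and the credit-card-style patterns are illustrative rather than the ones Adjudon ships.

```python
import re

loose = re.compile(r"\d{10}")                 # any 10-digit run
anchored = re.compile(r"\bAC-\d{5}-\d{3}\b")  # the recipe's account format

order_ref = "Order 2026041512 shipped"        # made-up, non-PII order number
account_ref = "Refund issued to AC-12345-678"

loose_hits_order = bool(loose.search(order_ref))        # True: false positive
anchored_hits_order = bool(anchored.search(order_ref))  # False
anchored_hits_account = bool(anchored.search(account_ref))  # True

# Made-up ObjectID-like value: 24 hex characters ending in a 13-digit run.
object_id = "65b1f2c4a9e1234567890123"
unbounded_card = re.compile(r"\d{13,16}")
bounded_card = re.compile(r"\b\d{13,16}\b")

unbounded_hits_id = bool(unbounded_card.search(object_id))  # True: false positive
bounded_hits_id = bool(bounded_card.search(object_id))      # False

print(loose_hits_order, anchored_hits_order, anchored_hits_account)
print(unbounded_hits_id, bounded_hits_id)
```

The bounded variant rejects the ObjectID because \b only matches between a word character and a non-word character, and the digit run is preceded by a hex letter.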

Going further

  • Build the patterns from a config file. Hardcoding regex in the call site is fine for one-off recipes; production systems load CUSTOM_PATTERNS from a YAML or JSON config with a per-pattern severity tag for auditability.
  • Test your patterns. A custom regex that under-matches is silent data leakage. Maintain a unit-test suite of positive examples (strings that should get scrubbed) and negative examples (strings that should not) and run it on every config change.
  • Watch for false positives. A regex like \d{11} for a German tax ID will also match an Adjudon trace ID, an order number, and any other 11-digit run. Use word boundaries and format anchors (the EMP- prefix above is what makes the pattern safe).
  • OpenTelemetry path. When ingesting via the OTel exporter, the same client-side preprocessor approach applies — scrub before the span attribute is set, not after.
  • Audit log entries are also scrubbed. Every Operations Audit Log entry runs through the same server-side scrubber on persistence; your custom layer should also wrap any data that flows into audit-log details (operator-supplied notes, manual review comments) for symmetry.
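
The first two suggestions above can be combined in one small sketch: load the patterns from configuration, then check them against positive and negative examples. The config shape (name, regex, token, severity) and the helper names are assumptions made for illustration; the JSON is inlined as a string so the example runs standalone.

```python
import json
import re

# Hypothetical config shape, inlined for the sketch. In practice this would
# live in a JSON or YAML file under version control.
CONFIG_JSON = r"""
[
  {"name": "emp_id", "regex": "\\bEMP-\\d{5}\\b",
   "token": "[REDACTED_EMP_ID]", "severity": "high"},
  {"name": "account", "regex": "\\bAC-\\d{5}-\\d{3}\\b",
   "token": "[REDACTED_ACCOUNT]", "severity": "high"}
]
"""

def load_patterns(config_text):
    """Compile each configured regex into a (pattern, token) pair."""
    entries = json.loads(config_text)
    return [(re.compile(e["regex"]), e["token"]) for e in entries]

def scrub_text(text, patterns):
    """Apply every loaded pattern to one string."""
    for pattern, token in patterns:
        text = pattern.sub(token, text)
    return text

PATTERNS = load_patterns(CONFIG_JSON)

# Positive examples must change; negative examples must come back untouched.
SHOULD_SCRUB = ["Ping EMP-87421 today", "Account AC-12345-678 is frozen"]
SHOULD_NOT_SCRUB = ["EMP-123 is too short", "XAC-12345-678", "Order 2026041512"]

for s in SHOULD_SCRUB:
    assert scrub_text(s, PATTERNS) != s, f"under-match: {s!r}"
for s in SHOULD_NOT_SCRUB:
    assert scrub_text(s, PATTERNS) == s, f"false positive: {s!r}"
print("all pattern checks passed")
```

Running these checks on every config change catches both silent under-matching and new false positives before a pattern reaches production.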

See also

  • Audit & Security — the Cardinal Rule on PII scrubbing and the always-on posture
  • Traces API — the integration surface that consumes the scrubbed payload
  • Sub-Processors — what happens to the scrubbed payload downstream