
Tracing a multi-step agent

Goal

Trace one agent invocation that fans out into several tool calls — a refund classifier that looks up the order, checks the customer's tier, then computes the refund amount — as one auditable decision in Adjudon. The auditor sees the final decision and the chain of tool calls that produced it, all on the same trace row.

You'll need

  • An Adjudon Sandbox plan (or any tier above)
  • An adj_test_* agent API key
  • Python 3.9+ with adjudon installed
pip install adjudon
export ADJUDON_API_KEY="adj_test_..."

The pattern

Two valid shapes; pick the one that matches your agent architecture:

One trace per high-level decision. The agent emits a single trace at the end with the tool-call chain captured in outputDecision.toolCalls[]. This is the right shape when the tool calls are means-to-an-end: the auditable decision is the final one (refund granted: €24.99); the lookups along the way are evidence of how the decision was reached.

One trace per step, linked by conversationId. Each tool call gets its own trace with the same inputContext.conversationId. Use this when each intermediate step is itself a regulated decision (e.g. a content-moderation pipeline where every classifier output has separate compliance weight).

This recipe walks the first shape; the LangGraph recipe covers the second.
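
For contrast, the second shape can be sketched as plain payloads: two per-step trace bodies that share a conversationId. This is an illustration only (field names follow this recipe's conventions; the adjudon.trace() calls and the moderation-pipeline content are hypothetical), kept as bare dicts so the sketch stays self-contained:

```python
# Shape two: one trace per step, linked by a shared conversationId.
# Plain dict payloads only; in real code each would be passed to adjudon.trace().
CONVERSATION_ID = "conv-12345"

step_traces = [
    {
        "inputContext": {"conversationId": CONVERSATION_ID,
                         "prompt": "Classify content item #991"},
        "outputDecision": {"action": "flag_for_review", "confidence": 0.71},
    },
    {
        "inputContext": {"conversationId": CONVERSATION_ID,
                         "prompt": "Escalate flagged item #991"},
        "outputDecision": {"action": "escalate_to_human", "confidence": 0.88},
    },
]

# Every step carries the same conversationId, so the conversation view
# can stitch the per-step traces back into one thread.
assert all(t["inputContext"]["conversationId"] == CONVERSATION_ID
           for t in step_traces)
```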

Code

multi_step_refund.py
import os
from adjudon import Adjudon

adjudon = Adjudon(
    api_key=os.environ["ADJUDON_API_KEY"],
    agent_id="refund-classifier",
)

# ── Step 1: Look up the order (synthetic — replace with your real call) ──
def lookup_order(order_id: str) -> dict:
    return {"order_id": order_id, "amount": 24.99, "currency": "EUR"}

# ── Step 2: Check customer tier ────────────────────────────────────────
def get_customer_tier(customer_id: str) -> str:
    return "premium"

# ── Step 3: Compute the refund decision ────────────────────────────────
def decide_refund(order: dict, tier: str) -> dict:
    return {
        "action": "initiate_refund",
        "amount": order["amount"],
        "currency": order["currency"],
        "rationale": f"{tier} customer; full refund per policy",
        "confidence": 0.94,
    }

# ── Run the chain and trace it as one decision ──────────────────────────
order = lookup_order("ord-12345")
tier = get_customer_tier("cust-789")
decision = decide_refund(order, tier)

trace = adjudon.trace(
    input_context={
        "prompt": "Customer requested a refund for order #12345.",
        "conversationId": "conv-12345",
    },
    output_decision={
        **decision,
        "toolCalls": [
            {"tool": "lookup_order", "args": {"order_id": "ord-12345"}, "result": order},
            {"tool": "get_customer_tier", "args": {"customer_id": "cust-789"}, "result": tier},
        ],
    },
    metadata={
        "stepCount": 2,
        "agentVersion": "v1.4.2",
    },
)

print(f"Trace: {trace.id} status={trace.status}")
if trace.status == "blocked":
    raise SystemExit("Refund blocked by policy")

Run it:

python multi_step_refund.py
# → Trace: 65b1f2c4... status=approved

Why one trace, not three

A multi-step agent that emits three traces — one per tool call — gets you three rows in the audit log and one question in the regulator's review: which of these is the decision? The compliance answer is "the last one"; the operational answer is "all three"; both answers point at the ambiguity. Wrapping the chain into one trace with outputDecision.toolCalls[] collapses the ambiguity into a single row that says explicitly "this is the decision; here are the steps that produced it."

This shape also matches how the Confidence Engine wants to read the data: the agent's self-reported confidence is one signal in the triangulation, and asking "how confident is the final decision" is more meaningful than asking "how confident is each individual lookup." Tool calls that themselves carry confidence are evidence; the final decision is the verdict.

What just happened

One POST /api/v1/traces request hit Adjudon with the full chain recorded in outputDecision.toolCalls. The Confidence Engine read decision.confidence = 0.94, the Policy Engine evaluated the trace against your active policies, and the response came back as approved. On the dashboard's Decision Log, the trace renders with the chain expandable as nested rows — clicking the trace reveals each tool call, its arguments, and its result.

The conversationId ties this trace to any future trace from the same customer interaction; a follow-up "actually, send this one to a human" event lands in the same conversation view.

A regulator opening this trace under EU AI Act Article 13 transparency obligations sees the user prompt, the chosen action, the rationale, and every intermediate tool call that fed the rationale. There is no "but how did the model get here" gap to close out-of-band — the chain is in the audit row, anchored into the SHA-256 Hash Chain at append time, and verifiable months later with three commands.
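
The anchoring guarantee can be illustrated with a toy verifier. This is a sketch of the general hash-chain technique (each entry hashing its payload together with the previous entry's hash), not Adjudon's actual wire format or verification commands:

```python
import hashlib
import json

def entry_hash(payload: dict, prev_hash: str) -> str:
    # Hash the canonical JSON of the payload together with the previous hash.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + canonical).encode()).hexdigest()

def verify_chain(entries: list) -> bool:
    # Each entry stores {"payload": ..., "hash": ...}; genesis prev_hash is all zeros.
    prev = "0" * 64
    for entry in entries:
        if entry["hash"] != entry_hash(entry["payload"], prev):
            return False  # tampering anywhere breaks every later link
        prev = entry["hash"]
    return True

# Build a two-entry chain, then tamper with the first payload.
p1 = {"decision": "initiate_refund", "amount": 24.99}
p2 = {"decision": "close_ticket"}
h1 = entry_hash(p1, "0" * 64)
chain = [{"payload": p1, "hash": h1},
         {"payload": p2, "hash": entry_hash(p2, h1)}]

assert verify_chain(chain)
chain[0]["payload"]["amount"] = 99.99  # retroactive edit
assert not verify_chain(chain)         # verification fails months later, too
```

The point of the sketch: because each hash covers the previous one, an edit to any historical payload invalidates every subsequent link, so a verifier only needs the chain itself to detect tampering.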

If your trace volume is high enough that storing the full tool-call array on every trace would inflate storage, reduce the recorded chain to the calls that actually moved the decision needle — capturing every cache hit and lookup on every trace is rarely useful evidence. The schema accepts zero or more toolCalls; the auditor's question is "which calls explain this decision", not "every call your code made today."
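
One way to trim the chain is a plain filter over call names before the trace payload is built. A minimal sketch, assuming your code keeps an allow-list of decision-relevant tools (the relevant_tools set and the cache_get calls are illustrative, not part of the recipe above):

```python
# Hypothetical allow-list of the calls that actually moved the decision.
relevant_tools = {"lookup_order", "get_customer_tier"}

all_calls = [
    {"tool": "cache_get", "args": {"key": "cust-789"}, "result": None},
    {"tool": "lookup_order", "args": {"order_id": "ord-12345"},
     "result": {"amount": 24.99, "currency": "EUR"}},
    {"tool": "cache_get", "args": {"key": "ord-12345"}, "result": None},
    {"tool": "get_customer_tier", "args": {"customer_id": "cust-789"},
     "result": "premium"},
]

# Record only decision-relevant evidence; cache probes stay out of the audit row.
tool_calls = [c for c in all_calls if c["tool"] in relevant_tools]
assert [c["tool"] for c in tool_calls] == ["lookup_order", "get_customer_tier"]
```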

Going further

  • Replace the synthetic lookup_order and get_customer_tier with your real downstream calls. The trace shape does not change.
  • The schema enforces two validation limits: maxToolCalls = 50 entries per trace, and maxActionObjectBytes = 10 KiB of combined arguments and result payload per tool call. A call that would exceed the per-call limit should reference an external store instead (object-storage URI in args.objectRef).
  • For agents where each step is itself a regulated decision, switch to one-trace-per-step with shared conversationId — documented at LangGraph agents.
  • Confidence score lower than your policy's threshold? The Policy Engine returns flagged; the trace lands in the Review Queue for a human verdict before any downstream action runs.
  • Want to retry a failed downstream call without inflating trace count? The SDK's Idempotency-Key collapses payload-identical retries automatically; pass an explicit key for cross-retry correlation.
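
The retry behavior in the last bullet can be sketched with a key derived from the canonical payload. This is an assumption about how payload-identical retries might collapse to one key, not the SDK's actual derivation:

```python
import hashlib
import json

def idempotency_key(payload: dict) -> str:
    # Canonical JSON -> stable SHA-256 key: identical payloads, identical keys.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

first = {"action": "initiate_refund", "amount": 24.99}
retry = {"amount": 24.99, "action": "initiate_refund"}  # same payload, reordered keys
changed = {"action": "initiate_refund", "amount": 25.99}

assert idempotency_key(first) == idempotency_key(retry)    # retry collapses
assert idempotency_key(first) != idempotency_key(changed)  # real change, new key
```

Passing an explicit key instead of a derived one is useful when retries are not byte-identical (say, a refreshed timestamp in the payload) but should still correlate as one logical attempt.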