Tracing a multi-step agent
Goal
Trace one agent invocation that fans out into several tool calls — a refund classifier that looks up the order, checks the customer's tier, then computes the refund amount — as one auditable decision in Adjudon. The auditor sees the final decision and the chain of tool calls that produced it, all on the same trace row.
You'll need
- An Adjudon Sandbox plan (or any tier above)
- An adj_test_* agent API key
- Python 3.9+ with adjudon installed

pip install adjudon
export ADJUDON_API_KEY="adj_test_..."
The pattern
Two valid shapes; pick the one that matches your agent architecture:
- One trace per high-level decision. The agent emits a single trace at the end with the tool-call chain captured in outputDecision.toolCalls[]. This is the right shape when the tool calls are means to an end: the auditable decision is the final one (refund granted: €24.99); the lookups along the way are evidence of how the decision was reached.
- One trace per step, linked by conversationId. Each tool call gets its own trace with the same inputContext.conversationId. Use this when each intermediate step is itself a regulated decision (e.g. a content-moderation pipeline where every classifier output has separate compliance weight).
This recipe walks the first shape; the LangGraph recipe covers the second.
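For contrast, the second shape can be sketched as plain trace payloads. The field names follow this recipe's schema; the prompts and the per-step action names are hypothetical stand-ins:

```python
# Hypothetical sketch of shape two: one trace payload per step, all
# sharing the same inputContext.conversationId so the dashboard can
# group them into one conversation view.
conversation_id = "conv-12345"

step_payloads = [
    {
        "inputContext": {"prompt": "Look up order #12345", "conversationId": conversation_id},
        "outputDecision": {"action": "lookup_order", "confidence": 0.99},
    },
    {
        "inputContext": {"prompt": "Decide refund for order #12345", "conversationId": conversation_id},
        "outputDecision": {"action": "initiate_refund", "confidence": 0.94},
    },
]

# Each payload would be sent as its own adjudon.trace(...) call.
shared = {p["inputContext"]["conversationId"] for p in step_payloads}
```

Every step carries the same conversation id, which is what lets the conversation view stitch the rows back together.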
Code
import os

from adjudon import Adjudon

adjudon = Adjudon(
    api_key=os.environ["ADJUDON_API_KEY"],
    agent_id="refund-classifier",
)

# ── Step 1: Look up the order (synthetic — replace with your real call) ──
def lookup_order(order_id: str) -> dict:
    return {"order_id": order_id, "amount": 24.99, "currency": "EUR"}

# ── Step 2: Check customer tier ────────────────────────────────────────
def get_customer_tier(customer_id: str) -> str:
    return "premium"

# ── Step 3: Compute the refund decision ────────────────────────────────
def decide_refund(order: dict, tier: str) -> dict:
    return {
        "action": "initiate_refund",
        "amount": order["amount"],
        "currency": order["currency"],
        "rationale": f"{tier} customer; full refund per policy",
        "confidence": 0.94,
    }

# ── Run the chain and trace it as one decision ──────────────────────────
order = lookup_order("ord-12345")
tier = get_customer_tier("cust-789")
decision = decide_refund(order, tier)

trace = adjudon.trace(
    input_context={
        "prompt": "Customer requested a refund for order #12345.",
        "conversationId": "conv-12345",
    },
    output_decision={
        **decision,
        "toolCalls": [
            {"tool": "lookup_order", "args": {"order_id": "ord-12345"}, "result": order},
            {"tool": "get_customer_tier", "args": {"customer_id": "cust-789"}, "result": tier},
        ],
    },
    metadata={
        "stepCount": 2,
        "agentVersion": "v1.4.2",
    },
)

print(f"Trace: {trace.id} status={trace.status}")
if trace.status == "blocked":
    raise SystemExit("Refund blocked by policy")
Run it:
python multi_step_refund.py
# → Trace: 65b1f2c4... status=approved
Why one trace, not three
A multi-step agent that emits three traces — one per tool call — gets you three rows in the audit log and one question in the regulator's review: which of these is the decision? The compliance answer is "the last one"; the operational answer is "all three"; both answers point at the ambiguity. Wrapping the chain into one trace with outputDecision.toolCalls[] collapses the ambiguity into a single row that says explicitly "this is the decision; here are the steps that produced it."
This shape also matches how the Confidence Engine wants to read the data: the agent's self-reported confidence is one signal in the triangulation, and asking "how confident is the final decision" is more meaningful than asking "how confident is each individual lookup." Tool calls that themselves carry confidence are evidence; the final decision is the verdict.
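If individual tool calls do carry their own confidence, one way to keep the evidence/verdict split explicit is to record a per-call confidence alongside each toolCalls entry. The per-call confidence field here is a hypothetical extension for illustration, not a documented schema field; only the top-level decision confidence is read as the verdict signal:

```python
# Hypothetical: per-call confidence recorded as evidence inside each
# toolCalls entry; the top-level decision confidence is the verdict
# the Confidence Engine reads, per the paragraph above.
output_decision = {
    "action": "initiate_refund",
    "confidence": 0.94,  # the verdict-level signal
    "toolCalls": [
        {"tool": "lookup_order", "confidence": 0.99},       # evidence only (assumed field)
        {"tool": "get_customer_tier", "confidence": 0.97},  # evidence only (assumed field)
    ],
}

verdict = output_decision["confidence"]
evidence = [c["confidence"] for c in output_decision["toolCalls"]]
```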
What just happened
One POST /api/v1/traces request hit Adjudon with the full chain recorded in outputDecision.toolCalls. The Confidence Engine read decision.confidence = 0.94, the Policy Engine evaluated the trace against your active policies, and the response came back as approved. On the dashboard's Decision Log, the trace renders with the chain expandable as nested rows — clicking the trace reveals each tool call, its arguments, and its result.
The conversationId ties this trace to any future trace from the same customer interaction; a follow-up "actually, send this one to a human" event lands in the same conversation view.
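Such a follow-up payload might look like this; the prompt and the escalate_to_human action name are assumptions for illustration, while the field names follow this recipe:

```python
# Hypothetical follow-up trace payload reusing the refund trace's
# conversationId, so it lands in the same conversation view.
followup = {
    "inputContext": {
        "prompt": "Customer asked for a human to review the refund.",
        "conversationId": "conv-12345",  # same id as the refund trace
    },
    "outputDecision": {
        "action": "escalate_to_human",  # assumed action name
        "rationale": "explicit customer request",
        "confidence": 1.0,
    },
}
```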
A regulator opening this trace under EU AI Act Article 13 transparency obligations sees the user prompt, the chosen action, the rationale, and every intermediate tool call that fed the rationale. There is no "but how did the model get here" gap to close out-of-band — the chain is in the audit row, anchored into the SHA-256 Hash Chain at append time, and verifiable months later with three commands.
If your trace volume is high enough that storing the full tool-call array on every trace would inflate storage, reduce the recorded chain to the calls that actually moved the decision needle — capturing every cache hit and lookup on every trace is rarely useful evidence. The schema accepts zero or more toolCalls; the auditor's question is "which calls explain this decision", not "every call your code made today."
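One way to do that trimming before the trace call is a small filter; this is a sketch, and the cache_hit flag is an assumed marker — substitute whatever identifies non-evidential calls in your pipeline:

```python
# Keep only tool calls that moved the decision; drop cache hits and
# other calls that carry no evidential weight.
def decision_relevant(tool_calls):
    return [c for c in tool_calls if not c.get("cache_hit", False)]

all_calls = [
    {"tool": "lookup_order", "args": {"order_id": "ord-12345"}, "cache_hit": False},
    {"tool": "lookup_order", "args": {"order_id": "ord-12345"}, "cache_hit": True},  # repeat served from cache
    {"tool": "get_customer_tier", "args": {"customer_id": "cust-789"}, "cache_hit": False},
]

recorded = decision_relevant(all_calls)  # the cached repeat is dropped
```

The filtered list is what goes into outputDecision.toolCalls; the full log can still live in your own observability stack.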
Going further
- Replace the synthetic lookup_order and get_customer_tier with your real downstream calls. The trace shape does not change.
- If a single tool call exceeds 10 KiB of arguments or result payload, the schema clamps the trace at the validation rules maxToolCalls = 50 and maxActionObjectBytes = 10 KiB per tool. Larger payloads should reference an external store (object-storage URI in args.objectRef).
- For agents where each step is itself a regulated decision, switch to one-trace-per-step with a shared conversationId — documented at LangGraph agents.
- Confidence score lower than your policy's threshold? The Policy Engine returns flagged; the trace lands in the Review Queue for a human verdict before any downstream action runs.
- Want to retry a failed downstream call without inflating the trace count? The SDK's Idempotency-Key collapses payload-identical retries automatically; pass an explicit key for cross-retry correlation.