Tracing a LangGraph agent
Goal
Wire the adjudon-langchain callback handler to a LangGraph
state machine so every node transition — LLM call, tool
invocation, agent decision — produces an Adjudon trace
without touching the graph definition. The graph runs as it
always did; the audit trail emerges from the callback layer.
This recipe is the alternative shape to
Multi-step agents: instead of
collapsing the chain into one trace via
outputDecision.toolCalls[], each step gets its own trace
linked by inputContext.conversationId. Pick this shape when
each intermediate step is itself a regulated decision (a
content-moderation pipeline where every classifier output has
separate compliance weight, a multi-LLM ensemble where each
model's verdict is independently auditable).
You'll need
- An Adjudon Sandbox plan (or above)
- An adj_test_* agent API key
- Python 3.9+ with adjudon-langchain and langgraph
pip install adjudon-langchain langgraph langchain-openai
export ADJUDON_API_KEY="adj_test_..."
export OPENAI_API_KEY="sk-..."
What gets traced
The adjudon-langchain handler subscribes to four LangChain
callback events and emits one trace per event:
| LangChain event | Adjudon trace shape |
|---|---|
| on_llm_start + on_llm_end | One trace per LLM invocation; inputContext.prompt from the messages, outputDecision.action from the response text |
| on_tool_start + on_tool_end | One trace per tool call; outputDecision.toolCalls[] carries the tool name and arguments |
| on_agent_action | One trace per agent action; the action becomes outputDecision.action |
| on_llm_error / on_tool_error | One trace with metadata.error: true; the error type is captured but the message is scrubbed via Cardinal Rule #4 |
Every trace carries metadata.framework: 'langchain' and
metadata.langchainEvent — analytics queries can filter
by event type directly. The adapter never reaches into the
graph's state; it observes LangChain's callback bus and
emits side-band traces.
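For orientation, a tool-call trace assembled from the fields in the table might look roughly like the sketch below. The exact wire format belongs to the Traces API, so the nesting and the key names inside toolCalls[] are assumptions, and lookup_regulation is a made-up tool name used purely for illustration.

# Rough sketch of a tool-call trace. Field names come from the table above;
# the toolCalls[] entry keys and the tool name are illustrative assumptions.
tool_call_trace = {
    "agentId": "<your agent id>",
    "inputContext": {"conversationId": "<conversation id>"},
    "outputDecision": {
        "toolCalls": [
            {"name": "lookup_regulation", "arguments": {"article": "83"}},
        ],
    },
    "metadata": {"framework": "langchain", "langchainEvent": "tool_call"},
}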
Code
import os
import uuid
from typing import Annotated, TypedDict

from langchain_core.messages import BaseMessage, HumanMessage
from langchain_openai import ChatOpenAI
from langgraph.graph import StateGraph, END
from langgraph.graph.message import add_messages

from adjudon_langchain import AdjudonCallbackHandler

# One handler per agent; carry the conversationId across every node.
conversation_id = f"conv-{uuid.uuid4()}"
handler = AdjudonCallbackHandler(
    api_key=os.environ["ADJUDON_API_KEY"],
    agent_id="research-agent",
    metadata={"conversationId": conversation_id},
)

llm = ChatOpenAI(model="gpt-4o-mini", callbacks=[handler])

# ── Graph state ─────────────────────────────────────────────────────────
class State(TypedDict):
    messages: Annotated[list[BaseMessage], add_messages]

# ── Nodes ───────────────────────────────────────────────────────────────
def classify(state: State) -> State:
    response = llm.invoke([HumanMessage(content=
        f"Classify intent: {state['messages'][-1].content}"
    )])
    return {"messages": [response]}

def respond(state: State) -> State:
    response = llm.invoke(state["messages"])
    return {"messages": [response]}

# ── Wire the graph ──────────────────────────────────────────────────────
graph = (StateGraph(State)
    .add_node("classify", classify)
    .add_node("respond", respond)
    .add_edge("classify", "respond")
    .add_edge("respond", END)
    .set_entry_point("classify")
    .compile()
)

# ── Run ─────────────────────────────────────────────────────────────────
result = graph.invoke({
    "messages": [HumanMessage(content="What's the GDPR penalty ceiling?")],
}, config={"callbacks": [handler]})

print(f"Conversation: {conversation_id}")
print(result["messages"][-1].content)
Run it:
python langgraph_agent.py
# → Conversation: conv-<uuid>
# → The GDPR maximum administrative fine is 4% of global annual turnover or €20M, whichever is higher.
What just happened
The two LLM invocations — one inside classify, one inside
respond — each fired the LangChain on_llm_start /
on_llm_end callbacks. The Adjudon handler intercepted both and
emitted two POST /api/v1/traces calls in sequence, each
carrying:
- agentId: 'research-agent' (the constructor argument)
- inputContext.conversationId: '<the uuid>' (passed via the handler's metadata)
- inputContext.prompt: the LangChain message content
- outputDecision.action: the LLM's response text
- metadata.framework: 'langchain' (auto-set by the adapter)
- metadata.langchainEvent: 'llm_call'
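Put together, the first of those two POST bodies would look roughly like this. The exact schema is defined by the Traces API; treat this as a sketch of the fields listed above, not the literal payload.

# Sketch of the trace emitted for the classify node's LLM call.
{
    "agentId": "research-agent",
    "inputContext": {
        "conversationId": "conv-<uuid>",  # the value printed by the script
        "prompt": "Classify intent: What's the GDPR penalty ceiling?",
    },
    "outputDecision": {"action": "<the classifier's response text>"},
    "metadata": {"framework": "langchain", "langchainEvent": "llm_call"},
}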
Both traces share the same conversationId, so the dashboard's
Decision Log renders
them as a single thread. Click the conversation to see the two
LLM calls side-by-side, each with its own confidence score,
policy verdict, and rationale.
If a tool node had been part of the graph, the
on_tool_start / on_tool_end callbacks would have produced a
third trace under the same conversationId — the adapter
treats LLM calls and tool calls symmetrically.
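To see that in practice, a node like the following would be enough. lookup_regulation is a stand-in tool invented for this sketch and the conditional-edge wiring is elided, but ToolNode and bind_tools are standard LangGraph / LangChain pieces; no extra Adjudon wiring is needed.

# Sketch: a tool the respond step could call. The same handler (already on the
# LLM and on the invoke config) picks up on_tool_start / on_tool_end unchanged.
from langchain_core.tools import tool
from langgraph.prebuilt import ToolNode

@tool
def lookup_regulation(article: str) -> str:
    """Return the text of a regulation article (stubbed for illustration)."""
    return f"Article {article}: ..."

tool_llm = llm.bind_tools([lookup_regulation])   # replaces llm inside respond()
tool_node = ToolNode([lookup_regulation])        # added to the graph as a node, e.g. "tools"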
Why per-step traces
LangGraph encourages decomposition: each node is a discrete unit of work, often with its own model or its own retrieval source. Collapsing the graph into one trace would erase the structural information that makes the graph debuggable and auditable. Each node's confidence, policy verdict, and rationale belong on their own row.
The cost is trace volume: a graph with three nodes invoked once produces three traces, not one. Sandbox's 10,000 traces / month covers serious development; production graphs typically run on Scale or above where overage is metered, not blocked.
What an auditor sees
A regulator opening the conversation thread reads it like a
transcript: the user's original message at the top, then each
node's input prompt, output text, confidence score, and policy
verdict in the order the graph executed. There is no "what
happened in between the LLM calls" gap because the graph's
control flow is implicit in the trace timestamps and the
shared conversationId.
Each per-node trace is independently anchored into the SHA-256 Hash Chain, so a regulator re-running the chain verify command months later can confirm the conversation transcript is exactly the one that ran. If a single node was retried (transient failure, manual replay), the Idempotency layer collapses the identical retry into the same trace; a different replay with new state is a separate trace under the same conversation.
Going further
- Use sample_rate=0.1 on the handler in high-volume development to keep trace count bounded; production traffic should stay at the default 1.0 so every decision is audited (see the configuration sketch after this list).
- Set raise_on_block=True to convert a policy block verdict into an AdjudonBlockedException the graph's error edges catch directly — useful for routing blocked decisions to a human-review node.
- For async graphs, use AdjudonAsyncCallbackHandler from the same package; the trace emission is itself async-native and does not block the graph's tick.
- The adapter writes metadata.langchainEvent so analytics queries can filter by event type (llm_call, tool_call, agent_action) directly.
- Conversation-level review — rather than per-trace — is on the Reviews API roadmap; today the per-trace queue groups by conversationId for visual consolidation.
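A consolidated sketch of those options, assuming sample_rate and raise_on_block are constructor arguments on the handler as the bullets describe; check the Python SDK reference for the exact signatures and the exception's import path.

# Sketch only: parameter names taken from the bullets above.
import os
from adjudon_langchain import AdjudonAsyncCallbackHandler, AdjudonCallbackHandler

dev_handler = AdjudonCallbackHandler(
    api_key=os.environ["ADJUDON_API_KEY"],
    agent_id="research-agent",
    sample_rate=0.1,       # development only; keep the default 1.0 in production
    raise_on_block=True,   # a policy block verdict raises AdjudonBlockedException
)

async_handler = AdjudonAsyncCallbackHandler(  # for async graphs; emission does not block the tick
    api_key=os.environ["ADJUDON_API_KEY"],
    agent_id="research-agent",
)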
See also
- Multi-step agents — the alternative single-trace shape
- Python SDK — the underlying adapter package's home
- Traces & Confidence — how each per-node trace is scored