
Tracing a LangGraph agent

Goal

Wire the adjudon-langchain callback handler to a LangGraph state machine so every node transition — LLM call, tool invocation, agent decision — produces an Adjudon trace without touching the graph definition. The graph runs as it always did; the audit trail emerges from the callback layer.

This recipe is the alternative shape to Multi-step agents: instead of collapsing the chain into one trace via outputDecision.toolCalls[], each step gets its own trace linked by inputContext.conversationId. Pick this shape when each intermediate step is itself a regulated decision (a content-moderation pipeline where every classifier output has separate compliance weight, a multi-LLM ensemble where each model's verdict is independently auditable).

You'll need

  • An Adjudon Sandbox plan (or above)
  • An adj_test_* agent API key
  • Python 3.9+ with adjudon-langchain and langgraph

pip install adjudon-langchain langgraph langchain-openai
export ADJUDON_API_KEY="adj_test_..."
export OPENAI_API_KEY="sk-..."

What gets traced

The adjudon-langchain handler subscribes to four groups of LangChain callback events; each group maps to one Adjudon trace shape:

  • on_llm_start + on_llm_end: one trace per LLM invocation; inputContext.prompt from the messages, outputDecision.action from the response text
  • on_tool_start + on_tool_end: one trace per tool call; outputDecision.toolCalls[] carries the tool name and arguments
  • on_agent_action: one trace per agent action; the action becomes outputDecision.action
  • on_llm_error / on_tool_error: one trace with metadata.error: true; the error type is captured but the message is scrubbed via Cardinal Rule #4

Every trace carries metadata.framework: 'langchain' and metadata.langchainEvent — analytics queries can filter by event type directly. The adapter never reaches into the graph's state; it observes LangChain's callback bus and emits side-band traces.

Code

langgraph_agent.py
import os
import uuid
from typing import Annotated, TypedDict
from langchain_core.messages import BaseMessage, HumanMessage
from langchain_openai import ChatOpenAI
from langgraph.graph import StateGraph, END
from langgraph.graph.message import add_messages
from adjudon_langchain import AdjudonCallbackHandler

# One handler per agent; carry the conversationId across every node.
conversation_id = f"conv-{uuid.uuid4()}"

handler = AdjudonCallbackHandler(
    api_key=os.environ["ADJUDON_API_KEY"],
    agent_id="research-agent",
    metadata={"conversationId": conversation_id},
)

llm = ChatOpenAI(model="gpt-4o-mini", callbacks=[handler])

# ── Graph state ─────────────────────────────────────────────────────────
class State(TypedDict):
    messages: Annotated[list[BaseMessage], add_messages]

# ── Nodes ───────────────────────────────────────────────────────────────
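# Each llm.invoke below fires on_llm_start / on_llm_end, so every node run
# becomes its own trace under the shared conversationId.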
def classify(state: State) -> State:
    response = llm.invoke([HumanMessage(
        content=f"Classify intent: {state['messages'][-1].content}"
    )])
    return {"messages": [response]}

def respond(state: State) -> State:
    response = llm.invoke(state["messages"])
    return {"messages": [response]}

# ── Wire the graph ──────────────────────────────────────────────────────
graph = (StateGraph(State)
    .add_node("classify", classify)
    .add_node("respond", respond)
    .add_edge("classify", "respond")
    .add_edge("respond", END)
    .set_entry_point("classify")
    .compile()
)

# ── Run ─────────────────────────────────────────────────────────────────
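# Passing the handler in the run config lets it observe every node in this
# invocation without changing the graph definition.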
result = graph.invoke({
    "messages": [HumanMessage(content="What's the GDPR penalty ceiling?")],
}, config={"callbacks": [handler]})

print(f"Conversation: {conversation_id}")
print(result["messages"][-1].content)

Run it:

python langgraph_agent.py
# → Conversation: conv-<uuid>
# → The GDPR maximum administrative fine is 4% of global annual turnover or €20M, whichever is higher.

What just happened

The two LLM invocations — one inside classify, one inside respond — each fired the LangChain on_llm_start / on_llm_end callbacks. The Adjudon handler intercepted both and emitted two POST /api/v1/traces calls in sequence, each carrying:

  • agentId: 'research-agent' (the constructor argument)
  • inputContext.conversationId: '<the uuid>' (passed via the handler's metadata)
  • inputContext.prompt: the LangChain message content
  • outputDecision.action: the LLM's response text
  • metadata.framework: 'langchain' (auto-set by the adapter)
  • metadata.langchainEvent: 'llm_call'
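
Put together (the response text is abbreviated for illustration, since it depends on the model), the first of those two traces looks roughly like this:

{
    "agentId": "research-agent",
    "inputContext": {
        "conversationId": "conv-<uuid>",
        "prompt": "Classify intent: What's the GDPR penalty ceiling?",
    },
    "outputDecision": {
        "action": "...",                      # the classifier LLM's response text
    },
    "metadata": {
        "framework": "langchain",
        "langchainEvent": "llm_call",
    },
}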

Both traces share the same conversationId, so the dashboard's Decision Log renders them as a single thread. Click the conversation to see the two LLM calls side-by-side, each with its own confidence score, policy verdict, and rationale.

If a tool node had been part of the graph, the on_tool_start / on_tool_end callbacks would have produced a third trace under the same conversationId — the adapter treats LLM calls and tool calls symmetrically.
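
For concreteness, here is one way such a tool node could be wired, using LangGraph's prebuilt ToolNode and tools_condition; the lookup_regulation tool and the routing below are illustrative, not part of the recipe above.

from langchain_core.tools import tool
from langgraph.prebuilt import ToolNode, tools_condition

@tool
def lookup_regulation(article: str) -> str:
    """Return the text of a GDPR article (stubbed for illustration)."""
    return f"GDPR Article {article}: ..."

tool_llm = ChatOpenAI(model="gpt-4o-mini").bind_tools([lookup_regulation])

def plan(state: State) -> State:
    # If the model emits a tool call, tools_condition routes to the ToolNode;
    # its on_tool_start / on_tool_end callbacks become a trace of their own.
    return {"messages": [tool_llm.invoke(state["messages"])]}

tool_graph = (StateGraph(State)
    .add_node("plan", plan)
    .add_node("tools", ToolNode([lookup_regulation]))
    .add_conditional_edges("plan", tools_condition)
    .add_edge("tools", END)
    .set_entry_point("plan")
    .compile()
)

result = tool_graph.invoke({
    "messages": [HumanMessage(content="Which GDPR article caps administrative fines?")],
}, config={"callbacks": [handler]})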

Why per-step traces

LangGraph encourages decomposition: each node is a discrete unit of work, often with its own model or its own retrieval source. Collapsing the graph into one trace would erase the structural information that makes the graph debuggable and auditable. Each node's confidence, policy verdict, and rationale belong on their own row.

The cost is trace volume: a graph with three nodes invoked once produces three traces, not one. Sandbox's 10,000 traces / month covers serious development; production graphs typically run on Scale or above where overage is metered, not blocked.

What an auditor sees

A regulator opening the conversation thread reads it like a transcript: the user's original message at the top, then each node's input prompt, output text, confidence score, and policy verdict in the order the graph executed. There is no "what happened in between the LLM calls" gap because the graph's control flow is implicit in the trace timestamps and the shared conversationId.

Each per-node trace is independently anchored into the SHA-256 Hash Chain, so a regulator re-running the chain verify command months later can confirm the conversation transcript is exactly the one that ran. If a single node was retried (transient failure, manual replay), the Idempotency layer collapses the identical retry into the same trace; a different replay with new state is a separate trace under the same conversation.

Going further

  • Use sample_rate=0.1 on the handler in high-volume development to keep trace count bounded; production traffic should stay at the default 1.0 so every decision is audited.
  • Set raise_on_block=True to convert a policy block verdict into an AdjudonBlockedException that the graph's error edges catch directly — useful for routing blocked decisions to a human-review node (both options are sketched after this list).
  • For async graphs, use AdjudonAsyncCallbackHandler from the same package; the trace emission is itself async-native and does not block the graph's tick.
  • The adapter writes metadata.langchainEvent so analytics queries can filter by event type (llm_call, tool_call, agent_action) directly.
  • Conversation-level review — rather than per-trace — is on the Reviews API roadmap; today the per-trace queue groups by conversationId for visual consolidation.
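
A minimal sketch of those two options, building on the script above. It assumes AdjudonBlockedException is importable from the same package, and it uses an in-node try/except rather than dedicated error edges; the escalation message is purely illustrative.

from adjudon_langchain import AdjudonCallbackHandler, AdjudonBlockedException

dev_handler = AdjudonCallbackHandler(
    api_key=os.environ["ADJUDON_API_KEY"],
    agent_id="research-agent",
    sample_rate=0.1,        # development only; keep the default 1.0 in production
    raise_on_block=True,    # a policy block verdict surfaces as an exception
    metadata={"conversationId": conversation_id},
)

def respond_or_escalate(state: State) -> State:
    try:
        response = llm.invoke(state["messages"], config={"callbacks": [dev_handler]})
        return {"messages": [response]}
    except AdjudonBlockedException:
        # Route the blocked decision to a human-review step instead of replying.
        return {"messages": [HumanMessage(content="escalate: human review required")]}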

See also