Building Production Agents with LangGraph
Reliability and safety
Production agents fail. APIs time out, models return junk, loops run long. The goal is not to prevent every failure. It is to make every failure recoverable, bounded, and visible. The four problems here all have simple, reliable fixes.
An uncaught error crashes the run. An unbounded loop drains your budget. Raw model output sent straight to a tool is a common cause of crashes. An unclassified tool may be the one that deletes your database.
An uncaught error crashes the run
If a tool raises an error and nothing catches it, the whole run crashes. You lose progress since the last checkpoint, and the agent cannot route around the problem.
Wrap each tool call in try/except and return a result object instead of raising. The agent can then inspect the result and decide whether to retry, fall back, or stop:
from dataclasses import dataclass
from typing import Any, Union

@dataclass
class ToolSuccess:
    data: Any

@dataclass
class ToolError:
    kind: str        # "timeout", "rate_limit", "unknown"
    message: str
    retryable: bool  # is it worth trying again?

ToolResult = Union[ToolSuccess, ToolError]

def safe_web_search(query: str) -> ToolResult:
    try:
        return ToolSuccess(web_search_api(query, timeout=10))
    except TimeoutError:
        return ToolError("timeout", "search timed out", retryable=True)
    except Exception as e:
        return ToolError("unknown", str(e), retryable=False)
The retryable flag is the key part. Temporary failures (timeouts, rate limits) are worth retrying. Logic failures (bad input, missing permission) will fail the same way every time, so do not retry them.
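To make the role of the retryable flag concrete, here is a minimal sketch of a retry wrapper. The name call_with_retry, its parameters, and the backoff values are assumptions for illustration, not part of LangGraph or of the code above; the result types are repeated so the block runs on its own.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, Union


@dataclass
class ToolSuccess:
    data: Any


@dataclass
class ToolError:
    kind: str
    message: str
    retryable: bool


ToolResult = Union[ToolSuccess, ToolError]


def call_with_retry(
    tool: Callable[[], ToolResult],
    max_attempts: int = 3,          # assumed default, tune per tool
    base_delay: float = 0.5,        # assumed default, in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> ToolResult:
    """Call `tool` until it succeeds, fails permanently, or attempts run out."""
    result: ToolResult = ToolError("unknown", "tool was never called", retryable=False)
    for attempt in range(max_attempts):
        result = tool()
        if isinstance(result, ToolSuccess) or not result.retryable:
            return result  # success, or a failure that retrying cannot fix
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # exponential backoff between attempts
    return result  # last retryable error; the caller falls back or stops


# Hypothetical flaky tool: times out twice, then succeeds.
calls = {"n": 0}


def flaky_search() -> ToolResult:
    calls["n"] += 1
    if calls["n"] < 3:
        return ToolError("timeout", "search timed out", retryable=True)
    return ToolSuccess(["result"])


# Skip real sleeping in the demo by injecting a no-op sleep.
outcome = call_with_retry(flaky_search, sleep=lambda _seconds: None)
```

Because the wrapper itself is capped by max_attempts, a tool that keeps timing out still hands control back to the agent with a ToolError rather than retrying forever.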
A loop without limits drains your budget
An agent loop can run forever if its stop condition is never met. Every turn calls the model and maybe several tools, so a runaway loop is expensive and produces nothing.
Set a hard cap on turns and on tokens before the run starts. Check both before each turn:
MAX_TURNS = 10
TOKEN_BUDGET = 50_000

def safety_check(state: State) -> str:
    if state["turn"] >= MAX_TURNS:
        return "stop"
    if state["tokens"] >= TOKEN_BUDGET:
        return "stop"
    return "continue"

# route every turn through the check first
graph.add_conditional_edges("loop", safety_check, {
    "continue": "loop",
    "stop": "finish",
})
Start with limits that feel too low. Your logs will show when real tasks need more, and you can raise them with evidence.
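Raising limits with evidence is easier when the logs say which limit fired. The following is a plain-Python sketch of the same check, run without LangGraph so its behavior is easy to follow. The stop_reason field, run_turn, and run_loop are hypothetical names introduced here, and the per-turn token counts are made up for the demo.

```python
from typing import TypedDict

MAX_TURNS = 10
TOKEN_BUDGET = 50_000


class State(TypedDict):
    turn: int
    tokens: int
    stop_reason: str


def safety_check(state: State) -> str:
    """Same limits as the check above, but the return value names the limit that fired."""
    if state["turn"] >= MAX_TURNS:
        return "turn_limit"
    if state["tokens"] >= TOKEN_BUDGET:
        return "token_budget"
    return "continue"


def run_turn(state: State, tokens_used: int) -> State:
    """Stand-in for one model call plus its tool calls."""
    return {**state, "turn": state["turn"] + 1, "tokens": state["tokens"] + tokens_used}


def run_loop(tokens_per_turn: int) -> State:
    state: State = {"turn": 0, "tokens": 0, "stop_reason": ""}
    while True:
        verdict = safety_check(state)  # checked before every turn
        if verdict != "continue":
            return {**state, "stop_reason": verdict}
        state = run_turn(state, tokens_per_turn)


cheap = run_loop(tokens_per_turn=1_000)    # cheap turns: the turn cap fires first
costly = run_loop(tokens_per_turn=20_000)  # expensive turns: the token budget fires first
```

A run that ends with stop_reason set to "turn_limit" on a legitimate task is the kind of evidence that justifies raising MAX_TURNS.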
Raw model output is unvalidated input
The model returns plain text. If you pass that text straight to a tool, you are handing unchecked input to something that will act on it. Models sometimes return broken JSON, wrong field names, or out-of-range values.
Validate the output with Pydantic first. Treat anything that does not fit your schema as a failure, not as input:
from pydantic import BaseModel, Field, ValidationError
import json

class SearchDecision(BaseModel):
    query: str = Field(min_length=3, max_length=200)
    num_results: int = Field(ge=1, le=20)

def parse_decision(raw: str) -> SearchDecision | None:
    try:
        return SearchDecision(**json.loads(raw))  # checks every field
    except (json.JSONDecodeError, TypeError, ValidationError):
        # TypeError covers valid JSON that is not an object, e.g. a list or a number
        return None  # caller treats None as failure
Be strict in the schema. A loose schema gives false confidence: it passes, the agent runs, and the real failure shows up later where it is harder to trace.
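One way a caller might treat None as a failure is to re-ask the model a bounded number of times and give up after that. The sketch below is an assumption-laden illustration: ask_until_valid and ask_model are hypothetical names, and parse_query is a stdlib-only stand-in for a strict parser such as parse_decision, so the block runs without Pydantic.

```python
import json
from typing import Callable, Optional


def parse_query(raw: str) -> Optional[dict]:
    """Stdlib stand-in for a strict parser: returns None for anything off-schema."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    query = data.get("query")
    if not isinstance(query, str) or not 3 <= len(query) <= 200:
        return None
    return data


def ask_until_valid(
    ask_model: Callable[[str], str],
    prompt: str,
    parse: Callable[[str], Optional[dict]],
    max_attempts: int = 3,  # assumed default; keeps the re-ask loop bounded
) -> Optional[dict]:
    """Re-ask the model a bounded number of times; return None if it never complies."""
    for _ in range(max_attempts):
        parsed = parse(ask_model(prompt))
        if parsed is not None:
            return parsed
        # Tell the model what went wrong instead of repeating the same request.
        prompt += "\nYour last reply did not match the schema. Reply with JSON only."
    return None


# Hypothetical model: the first reply is prose, the second is valid JSON.
replies = iter(["Sure! Here is a search.", '{"query": "langgraph interrupts"}'])
decision = ask_until_valid(lambda _prompt: next(replies), "Pick a search query.", parse_query)
```

The invalid reply never reaches a tool. It only costs one more model call, and the attempt cap keeps that cost bounded.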
Not every action can be undone
Some tools only read. Some create or update records you can undo. Some send email, delete data, or post to other systems, and those cannot be taken back. A bug in that last kind is not a debugging session, it is an incident.
Label every tool by how reversible it is, and ask a human before any high-risk action. Treat unknown tools as the riskiest:
from enum import Enum
from langgraph.types import interrupt

class Risk(Enum):
    READ = "read"        # safe: search, read_file
    WRITE = "write"      # undoable: create_record, update_user
    DESTROY = "destroy"  # not undoable: delete_record, send_email

TOOL_RISK = {
    "web_search": Risk.READ,
    "update_user": Risk.WRITE,
    "delete_record": Risk.DESTROY,
    "send_email": Risk.DESTROY,
}

def execute_tool(name: str, args: dict, state):
    risk = TOOL_RISK.get(name, Risk.DESTROY)  # unknown = treat as risky
    if risk is Risk.DESTROY:
        if not interrupt({"action": name, "args": args}):  # ask a human
            return {**state, "error": "rejected by user", "done": True}
    return run_tool(name, args)
When unsure, mark a tool DESTROY. One extra confirmation is a minor annoyance. One accidental irreversible action is not.
For an emergency stop from monitoring, add a force_stop flag to state. Every node checks it first and passes straight through if it is set, so you can halt a run without waiting for the next interrupt().
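The force_stop check can be written once and applied to every node. The following is a minimal sketch in plain Python, assuming state is a dict-like object with an optional force_stop key. The decorator name respects_force_stop and the node search_node are hypothetical, and how monitoring sets the flag is outside this sketch.

```python
from functools import wraps
from typing import Callable

State = dict  # simplified for the sketch; a real graph would likely use a TypedDict


def respects_force_stop(node: Callable[[State], State]) -> Callable[[State], State]:
    """Wrap a node so it passes state through untouched when force_stop is set."""
    @wraps(node)
    def wrapper(state: State) -> State:
        if state.get("force_stop"):
            return state  # skip the node's work entirely
        return node(state)
    return wrapper


@respects_force_stop
def search_node(state: State) -> State:
    # Hypothetical node body; stands in for a model or tool call.
    return {**state, "turn": state.get("turn", 0) + 1}


running = search_node({"turn": 0})                     # flag absent: the node does its work
halted = search_node({"turn": 0, "force_stop": True})  # flag set: state comes back unchanged
```

Putting the check in a decorator means a newly added node cannot forget it, as long as every node is registered through the wrapper.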
Next, we look at observability: making every run searchable, every slow step visible, and every regression caught automatically.