Reliability and safety · NorthGradient

Production agents fail. APIs time out, models return junk, loops run long. The goal is not to prevent every failure. It is to make every failure recoverable, bounded, and visible. The four problems here all have simple, reliable fixes.

An uncaught error crashes the run. An unbounded loop drains your budget. Raw model output passed straight to a tool is unvalidated input. An unclassified tool might be the one that can delete your database.

An uncaught error crashes the run

If a tool raises an error and nothing catches it, the whole run crashes. You lose progress since the last checkpoint, and the agent cannot route around the problem.

Wrap each tool call in try/except and return a result object instead of raising. The agent can then inspect the result and decide whether to retry, fall back, or stop:

from dataclasses import dataclass
from typing import Any, Union

@dataclass
class ToolSuccess:
    data: Any

@dataclass
class ToolError:
    kind:      str    # "timeout", "rate_limit", "unknown"
    message:   str
    retryable: bool   # is it worth trying again?

ToolResult = Union[ToolSuccess, ToolError]

def safe_web_search(query: str) -> ToolResult:
    try:
        return ToolSuccess(web_search_api(query, timeout=10))
    except TimeoutError:
        return ToolError("timeout", "search timed out", retryable=True)
    except Exception as e:
        return ToolError("unknown", str(e), retryable=False)

The retryable flag is the key part. Temporary failures (timeouts, rate limits) are worth retrying. Logic failures (bad input, missing permission) will fail the same way every time, so do not retry them.
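One way to act on the retryable flag is a small retry helper with exponential backoff. This is a minimal sketch, not a definitive implementation: call_with_retry, max_attempts, base_delay, and the injectable sleep parameter are hypothetical names and defaults chosen for illustration, and the result types are repeated so the block runs on its own.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, Union


@dataclass
class ToolSuccess:
    data: Any


@dataclass
class ToolError:
    kind: str
    message: str
    retryable: bool


ToolResult = Union[ToolSuccess, ToolError]


def call_with_retry(
    tool: Callable[[], ToolResult],
    max_attempts: int = 3,                           # hypothetical default
    base_delay: float = 0.5,                         # hypothetical default, in seconds
    sleep: Callable[[float], None] = time.sleep,     # injectable so tests need not wait
) -> ToolResult:
    """Call `tool` until it succeeds, fails permanently, or attempts run out."""
    result: ToolResult = ToolError("unknown", "tool was never called", retryable=False)
    for attempt in range(max_attempts):
        result = tool()
        if isinstance(result, ToolSuccess):
            return result
        if not result.retryable:
            return result                            # logic failure: a retry would fail the same way
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)         # back off: 0.5s, then 1s, then 2s, ...
    return result                                    # last retryable error: caller falls back or stops
```

A call such as call_with_retry(lambda: safe_web_search("agent safety")) would retry timeouts up to three times and return any non-retryable error at once.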

A loop without limits drains your budget

An agent loop can run forever if its stop condition is never met. Every turn calls the model and maybe several tools, so a runaway loop is expensive and produces nothing.

Set hard caps on turns and tokens before the run starts, and check them before each turn:

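from typing import TypedDict

class State(TypedDict):
    turn:   int   # turns completed so far
    tokens: int   # tokens spent so far

MAX_TURNS    = 10
TOKEN_BUDGET = 50_000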

def safety_check(state: State) -> str:
    if state["turn"] >= MAX_TURNS:
        return "stop"
    if state["tokens"] >= TOKEN_BUDGET:
        return "stop"
    return "continue"

# graph: a LangGraph StateGraph that already has "loop" and "finish" nodes
# after each turn, run the check before starting the next one
graph.add_conditional_edges("loop", safety_check, {
    "continue": "loop",
    "stop":     "finish",
})

Start with limits that feel too low. Your logs will show when real tasks need more, and you can raise them with evidence.
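The same caps work without a graph framework. The sketch below is one possible shape, assuming a hypothetical step function that takes the state and returns the next one. It also logs why the run stopped, so the logs can show which limit real tasks hit. The names stop_reason, run_agent, and the done field are illustrative and not part of any library.

```python
import logging
from typing import Callable, Optional, Tuple, TypedDict

logger = logging.getLogger("agent")

MAX_TURNS = 10
TOKEN_BUDGET = 50_000


class State(TypedDict):
    turn: int      # turns completed so far
    tokens: int    # tokens spent so far
    done: bool     # set by the step when the task is finished


def stop_reason(state: State) -> Optional[str]:
    """Return why the loop should stop, or None to keep going."""
    if state["done"]:
        return "done"
    if state["turn"] >= MAX_TURNS:
        return "max_turns"
    if state["tokens"] >= TOKEN_BUDGET:
        return "token_budget"
    return None


def run_agent(step: Callable[[State], State]) -> Tuple[State, str]:
    """Run `step` until it reports done or a cap is reached."""
    state: State = {"turn": 0, "tokens": 0, "done": False}
    while (reason := stop_reason(state)) is None:   # checked before every turn
        state = step(state)
    logger.info("run stopped: %s after %d turns, %d tokens",
                reason, state["turn"], state["tokens"])
    return state, reason


def runaway_step(state: State) -> State:
    # Hypothetical step that never finishes and spends 1,000 tokens per turn.
    return {**state, "turn": state["turn"] + 1, "tokens": state["tokens"] + 1_000}


final, reason = run_agent(runaway_step)
print(reason, final["turn"], final["tokens"])   # max_turns 10 10000
```

If "max_turns" or "token_budget" shows up often for tasks that were close to finishing, that is the evidence for raising the limit.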

Raw model output is unvalidated input

The model returns plain text. If you pass that text straight to a tool, you are handing unchecked input to something that will act on it. Models sometimes return broken JSON, wrong field names, or out-of-range values.

Validate the output with Pydantic first. Treat anything that does not fit your schema as a failure, not as input:

from pydantic import BaseModel, Field, ValidationError
import json

class SearchDecision(BaseModel):
    query:       str = Field(min_length=3, max_length=200)
    num_results: int = Field(ge=1, le=20)

def parse_decision(raw: str) -> SearchDecision | None:
    try:
        return SearchDecision(**json.loads(raw))   # checks every field
    except (json.JSONDecodeError, TypeError, ValidationError):
        # TypeError: valid JSON that is not an object, e.g. a list
        return None                                # caller treats None as failure

Be strict in the schema. A loose schema gives false confidence: it passes, the agent runs, and the real failure shows up later where it is harder to trace.
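What the caller does with a None is a design choice. One option is to ask the model again a bounded number of times and then give up. The sketch below assumes a hypothetical ask_model callable (prompt in, raw text out) and takes the parser as a parameter, so a function like parse_decision can be passed in. The demo uses a stand-in parser built on json only, so the block runs without Pydantic.

```python
import json
from typing import Callable, Optional, TypeVar

T = TypeVar("T")

# Hypothetical wording; tune it for your model.
RETRY_NOTE = "\n\nYour previous reply did not match the schema. Reply with JSON only."


def decide_with_retries(
    ask_model: Callable[[str], str],        # hypothetical: prompt in, raw text out
    parse: Callable[[str], Optional[T]],    # returns None when the output is invalid
    prompt: str,
    max_attempts: int = 3,                  # hypothetical default
) -> Optional[T]:
    """Ask until the output parses or the attempts run out."""
    current = prompt
    for _ in range(max_attempts):
        decision = parse(ask_model(current))
        if decision is not None:
            return decision
        current = prompt + RETRY_NOTE       # tell the model why it is being asked again
    return None                             # still invalid: the caller stops or falls back


def parse_object(raw: str) -> Optional[dict]:
    # Stand-in parser for the demo: accepts any JSON object, rejects everything else.
    try:
        value = json.loads(raw)
    except json.JSONDecodeError:
        return None
    return value if isinstance(value, dict) else None


# Fake model: the first reply is chatter, the second is valid JSON.
replies = iter(["Sure! Here is the search:", '{"query": "agent safety", "num_results": 5}'])
decision = decide_with_retries(lambda prompt: next(replies), parse_object, "Pick a search.")
print(decision)   # {'query': 'agent safety', 'num_results': 5}
```

Keeping max_attempts small matters here for the same reason as the turn cap: each re-ask is another model call.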

Not every action can be undone

Some tools only read. Some create or update records, which you can undo. Some send email, delete data, or post to other systems, and those actions cannot be taken back. A bug in that last kind is not a debugging session; it is an incident.

Label every tool by how reversible it is, and ask a human before any high-risk action. Treat unknown tools as the riskiest:

from enum import Enum
from langgraph.types import interrupt

class Risk(Enum):
    READ    = "read"      # safe: search, read_file
    WRITE   = "write"     # undoable: create_record, update_user
    DESTROY = "destroy"   # not undoable: delete_record, send_email

TOOL_RISK = {
    "web_search":    Risk.READ,
    "update_user":   Risk.WRITE,
    "delete_record": Risk.DESTROY,
    "send_email":    Risk.DESTROY,
}

def execute_tool(name: str, args: dict, state):
    risk = TOOL_RISK.get(name, Risk.DESTROY)        # unknown = treat as risky
    if risk is Risk.DESTROY:
        if not interrupt({"action": name, "args": args}):   # pause and ask a human; returns their answer on resume
            return {**state, "error": "rejected by user", "done": True}
    return run_tool(name, args)

When unsure, mark a tool DESTROY. One extra confirmation is a minor annoyance. One accidental irreversible action is not.

For an emergency stop from monitoring, add a force_stop flag to state. Every node checks it first and passes straight through if it is set, so you can halt a run without waiting for the next interrupt().
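The force_stop check can live in one decorator so that no node forgets it. This is a framework-free sketch under stated assumptions: nodes are plain functions from state dict to state dict, and stoppable, plan, and act are hypothetical names. How monitoring sets the flag depends on your runtime and is not shown.

```python
from functools import wraps
from typing import Any, Callable, Dict

State = Dict[str, Any]
Node = Callable[[State], State]


def stoppable(node: Node) -> Node:
    """Wrap a node so it does nothing once state["force_stop"] is set."""
    @wraps(node)
    def wrapper(state: State) -> State:
        if state.get("force_stop"):
            return state                # pass straight through, unchanged
        return node(state)
    return wrapper


@stoppable
def plan(state: State) -> State:
    return {**state, "steps": state.get("steps", []) + ["plan"]}


@stoppable
def act(state: State) -> State:
    return {**state, "steps": state.get("steps", []) + ["act"]}


state = plan({"steps": []})
state = {**state, "force_stop": True}   # set from outside, e.g. by a monitoring hook
state = act(state)                      # skipped because the flag is set
print(state["steps"])                   # ['plan']
```

Because every wrapped node returns the state untouched, the run drains to its end without performing further actions.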

Next, we look at observability: making every run searchable, every slow step visible, and every regression caught automatically.