Observability and tracing · NorthGradient

If you cannot see what your agent is doing, you cannot debug it, improve it, or trust it. Without tracing, a slow run is just “the agent is slow.” With tracing, you can say “web_search averages 8 seconds and is 70% of the run.” That is the difference between guessing and fixing.

Set up observability before the first run and it is nearly free. Add it after 10,000 runs and you have already lost the history that would explain what went wrong.

Logs without IDs are useless

In production, many runs happen at once. A line that says “tool call failed” with no ID attached tells you nothing: which run, which task, which user?

At the start of every run, make two IDs and attach them to every log line. A run_id marks this one run. A thread_id groups runs that belong to the same task:

import uuid
from datetime import datetime, timezone

def new_run_context(task: str, user_id: str = "anon") -> dict:
    return {
        "run_id":    str(uuid.uuid4()),            # this run
        "thread_id": f"{user_id}-{task[:20]}",     # this task chain
        "started":   datetime.now(timezone.utc).isoformat(),
    }

import logging
log = logging.getLogger("agent")

def search_node(state):
    ctx = state["run_context"]
    log.info("search start", extra=ctx)
    result = web_search(state["query"])
    log.info("search done", extra={**ctx, "count": len(result)})
    return {**state, "results": result}

Put run_id and thread_id in your state from day one. Adding them later, across existing logs, is much more work.
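
One detail that is easy to miss: Python's logging module attaches values passed through extra= to the log record, but a format string that does not name those fields will not print them. The sketch below is one way to make the IDs visible, assuming you want one JSON object per line. ContextJsonFormatter, configure_agent_logging, and CONTEXT_FIELDS are hypothetical names, and the field list mirrors new_run_context above.

```python
import json
import logging

# Hypothetical helper: these field names mirror new_run_context() above.
CONTEXT_FIELDS = ("run_id", "thread_id", "started")

class ContextJsonFormatter(logging.Formatter):
    """Render each record as one JSON object, including the run context."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level":   record.levelname,
            "logger":  record.name,
            "message": record.getMessage(),
        }
        # Values passed via extra= become attributes on the record.
        for field in CONTEXT_FIELDS:
            if hasattr(record, field):
                payload[field] = getattr(record, field)
        return json.dumps(payload)

def configure_agent_logging() -> logging.Logger:
    """Attach the JSON formatter to the "agent" logger exactly once."""
    logger = logging.getLogger("agent")
    if not logger.handlers:                  # avoid duplicate handlers
        handler = logging.StreamHandler()
        handler.setFormatter(ContextJsonFormatter())
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

Call configure_agent_logging() once at startup; after that, every log.info(..., extra=ctx) line carries its run_id and thread_id.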

Plain text logs cannot be searched

A line like “search_node completed” tells you the node ran. It does not tell you how long it took, and you cannot filter, total, or compare it across runs.

Emit a small structured event when each node finishes:

import time, json
from functools import wraps

def traced_node(node):
    @wraps(node)
    def wrapper(state: dict) -> dict:
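        start  = time.perf_counter()             # monotonic clock for durations
        result = node(state)
        # run_id lives in the run_context built by new_run_context()
        run_id = state.get("run_context", {}).get("run_id")
        emit_event(json.dumps({
            "node":       node.__name__,
            "run_id":     run_id,
            "latency_ms": round((time.perf_counter() - start) * 1000, 1),
        }))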
        return result
    return wrapper

@traced_node
def search_node(state): ...

Add token count and cost to the event later if you need them. Even with just name and latency, a chart of average latency per node shows your slow steps at a glance.
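
To show how little it takes to get from events to that chart, here is a minimal sketch that averages latency per node. It assumes the events arrive as JSON strings with the node and latency_ms keys used above; average_latency_by_node is a hypothetical helper, not part of any library.

```python
import json
from collections import defaultdict

def average_latency_by_node(event_lines):
    """Return {node_name: mean latency_ms} from JSON event strings."""
    totals = defaultdict(lambda: [0.0, 0])   # node -> [sum_ms, count]
    for line in event_lines:
        event = json.loads(line)
        bucket = totals[event["node"]]
        bucket[0] += event["latency_ms"]
        bucket[1] += 1
    return {node: total / count for node, (total, count) in totals.items()}
```

Feed the result to whatever charting or dashboard tool you already use; the aggregation itself does not depend on it.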

Most tracing misses the real bottleneck

Standard tracing records how long the model took. But tool calls (web search, database queries) are often slower than the model. Trace only the model and you are blind to what actually makes the agent slow.

Trace tools with the same idea:

import time
from functools import wraps

def traced_tool(tool):
    @wraps(tool)
    def wrapper(*args, **kwargs):
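        start, ok = time.perf_counter(), True    # monotonic clock for durations
        try:
            return tool(*args, **kwargs)
        except Exception:
            ok = False
            raise
        finally:
            # Runs on success and on failure, so failed calls are traced too.
            emit_tool_trace({
                "tool":       tool.__name__,
                "latency_ms": round((time.perf_counter() - start) * 1000, 1),
                "ok":         ok,
            })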
    return wrapper

@traced_tool
def web_search(query: str) -> list:
    return real_search_api(query)

Sort tools by average latency. The slowest few are where caching and optimization pay off.
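
A sketch of that ranking, assuming the traces are dicts shaped like the ones traced_tool emits (tool, latency_ms, ok). rank_tools is a hypothetical helper. Besides the average, it reports each tool's share of total tool time and its failure rate, since a tool that is called often can dominate a run without being the slowest per call.

```python
from collections import defaultdict

def rank_tools(traces):
    """Summarize tool traces, sorted by average latency (slowest first)."""
    stats = defaultdict(lambda: {"calls": 0, "total_ms": 0.0, "failures": 0})
    for trace in traces:
        s = stats[trace["tool"]]
        s["calls"] += 1
        s["total_ms"] += trace["latency_ms"]
        s["failures"] += 0 if trace["ok"] else 1

    # Guard against division by zero when every latency is 0.
    grand_total = sum(s["total_ms"] for s in stats.values()) or 1.0
    rows = [
        {
            "tool":         name,
            "avg_ms":       s["total_ms"] / s["calls"],
            "share":        s["total_ms"] / grand_total,   # fraction of tool time
            "failure_rate": s["failures"] / s["calls"],
        }
        for name, s in stats.items()
    ]
    return sorted(rows, key=lambda row: row["avg_ms"], reverse=True)
```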

Bugs you cannot reproduce cannot be fixed

When an agent fails in production with no record, you are guessing. You do not know the inputs, the steps, or the state at the moment it broke.

A tracing backend stores every run so you can search and compare them. LangSmith is the usual choice for LangGraph. Turning it on takes a few minutes; the cost of skipping it is the next bug you cannot reproduce:

# .env (set before the first run)
# LANGSMITH_TRACING=true
# LANGSMITH_API_KEY=...
# LANGSMITH_PROJECT=my-agent

from dotenv import load_dotenv
load_dotenv()

# With tracing on, LangGraph sends every node call to LangSmith,
# with inputs, outputs, latency, and token counts, and no extra code.
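
Tracing that is silently off looks the same as tracing that is on until you need a trace. A small startup check can catch that. The sketch below only inspects the environment variables named in the .env snippet above; tracing_problems is a hypothetical helper, and treating "true" as the only enabling value is an assumption taken from that snippet.

```python
import os

def tracing_problems(env=None) -> list:
    """Return human-readable problems with the tracing config (empty if none)."""
    env = os.environ if env is None else env
    problems = []
    # Assumption: "true" is the enabling value, as in the .env snippet.
    if env.get("LANGSMITH_TRACING", "").lower() != "true":
        problems.append("LANGSMITH_TRACING is not 'true'")
    if not env.get("LANGSMITH_API_KEY"):
        problems.append("LANGSMITH_API_KEY is missing")
    return problems
```

Call it right after load_dotenv() and log or raise on a non-empty result, depending on how strict you want startup to be.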

Regressions are invisible without tests

Every new tool or path can break something that used to work. Without tests over existing behavior, you find out when a user complains.

Keep a list of example inputs and what the output should contain, and run it before every deploy:

EVAL_CASES = [
    {"input": {"task": "Capital of France?"}, "contains": "Paris"},
    {"input": {"task": "What is 847 * 293?"}, "contains": "248171"},
]

def run_evals(app, cases) -> float:
    passed = sum(
        case["contains"] in app.invoke(case["input"])["result"]
        for case in cases
    )
    return passed / len(cases)        # pass rate, 0 to 1

Run this in CI. Block any deploy that drops the pass rate by more than a few points.
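
One way to turn that rule into a CI step is to compare the new pass rate against a stored baseline and exit non-zero when it falls too far. This is a sketch under stated assumptions: deploy_allowed and gate_exit_code are hypothetical names, the 0.03 default stands in for “a few points,” and where the baseline comes from (a file, the last release) is left to you.

```python
def deploy_allowed(pass_rate: float, baseline: float, max_drop: float = 0.03) -> bool:
    """True if pass_rate is no more than max_drop below the baseline.

    Both rates are fractions between 0 and 1, as returned by run_evals.
    """
    return pass_rate >= baseline - max_drop

def gate_exit_code(pass_rate: float, baseline: float, max_drop: float = 0.03) -> int:
    """Exit code for a CI step: 0 lets the deploy through, 1 blocks it."""
    if deploy_allowed(pass_rate, baseline, max_drop):
        return 0
    print(f"pass rate {pass_rate:.2f} fell below baseline {baseline:.2f}")
    return 1
```

In a CI script this would end with something like sys.exit(gate_exit_code(run_evals(app, EVAL_CASES), baseline)), where baseline is the pass rate recorded at the last good deploy.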

Next, the daily habits that keep the codebase readable months from now: type annotations, versioned prompts, caching, file layout, and smoke tests.