NorthGradient

Building Production Agents with LangGraph · 7 min read

Memory and state management

Memory bugs are quiet. They rarely raise errors: the agent just misbehaves, and the cause is hard to see. An agent that ran 200 steps and lost half its results partway through gives you no traceback to start from. This lesson covers five fixes that prevent that.

Lost results, lost progress, bloated context, mixed-up retrieval, and broken saved state are all memory problems with simple fixes.

Parallel branches overwrite each other

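Without a reducer, a state field holds a single value, and each write replaces the previous one. When parallel branches write the same list, one branch’s results are lost: silently if the writes land in different steps, or with an InvalidUpdateError in recent LangGraph versions if they land in the same step.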

The fix is a reducer: a merge rule attached to the field with typing.Annotated. LangGraph applies it instead of overwriting:

import operator
from typing import TypedDict, Annotated

class AgentState(TypedDict):
    task:    str
    # operator.add concatenates the two lists instead of
    # letting one branch overwrite the other
    results: Annotated[list, operator.add]

# Branch 1 returns {"results": ["A"]}, branch 2 returns {"results": ["B"]}
# Result after merge: {"results": ["A", "B"]}
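To make the merge rule concrete, here is a minimal plain-Python sketch of how a reducer declared with Annotated can be read back from the schema and applied to an update. It simulates the idea only and is not LangGraph's implementation; apply_update is a hypothetical helper introduced for illustration.

```python
import operator
from typing import Annotated, TypedDict, get_args, get_origin, get_type_hints


class AgentState(TypedDict):
    task: str
    results: Annotated[list, operator.add]


def apply_update(schema: type, state: dict, update: dict) -> dict:
    """Merge one node's update into state (hypothetical helper).

    Fields annotated with a reducer are merged with it;
    every other field is simply overwritten.
    """
    hints = get_type_hints(schema, include_extras=True)
    merged = dict(state)
    for key, value in update.items():
        hint = hints.get(key)
        if key in merged and get_origin(hint) is Annotated:
            reducer = get_args(hint)[1]   # the metadata after the base type
            merged[key] = reducer(merged[key], value)
        else:
            merged[key] = value
    return merged


state = {"task": "demo", "results": []}
state = apply_update(AgentState, state, {"results": ["A"]})   # branch 1
state = apply_update(AgentState, state, {"results": ["B"]})   # branch 2
print(state["results"])   # ['A', 'B']
```

The task field has no reducer, so a second write to it would replace the first, which is exactly the behavior the reducer avoids for results.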

For a different merge rule, pass your own function:

def merge_unique(a: list, b: list) -> list:
    # merge, drop duplicates, and keep first-seen order
    # (a plain set() would return items in an unpredictable order)
    return list(dict.fromkeys(a + b))

class DeduplicatedState(TypedDict):
    results: Annotated[list, merge_unique]
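A reducer like the one above needs hashable items. When results are dict records, one option is to merge on a key instead. The sketch below assumes each record carries an "id" field; merge_by_id, RecordState, and the field names are hypothetical and only illustrate the pattern.

```python
from typing import Annotated, TypedDict


def merge_by_id(current: list[dict], new: list[dict]) -> list[dict]:
    """Combine two lists of records, keeping one record per "id".

    If both lists contain the same id, the record from `new` wins,
    but it keeps the position where the id was first seen.
    """
    by_id: dict = {}
    for record in current + new:
        by_id[record["id"]] = record
    return list(by_id.values())


class RecordState(TypedDict):
    results: Annotated[list[dict], merge_by_id]


branch_1 = [{"id": 1, "score": 0.2}, {"id": 2, "score": 0.9}]
branch_2 = [{"id": 2, "score": 0.95}, {"id": 3, "score": 0.4}]
merged = merge_by_id(branch_1, branch_2)
print([r["id"] for r in merged])   # [1, 2, 3]
```

Whether the newer or the older record should win on a clash depends on the task, so treat that choice as part of the reducer's contract.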

Any list that two parallel nodes might write to needs a reducer.

Crashes lose progress without checkpoints

A long run that crashes without checkpoints starts over from scratch. APIs time out and servers restart, so this will happen.

LangGraph saves a checkpoint after each step when you attach a checkpointer. The thread_id is the resume key: invoke the graph again with the same thread_id (passing None as the input to resume an interrupted run) and it continues from the last checkpoint:

from langgraph.checkpoint.memory import MemorySaver

memory = MemorySaver()                       # in-memory only; use PostgresSaver in production
app = graph.compile(checkpointer=memory)     # graph: your StateGraph, built elsewhere

config = {"configurable": {"thread_id": "task-001"}}
result = app.invoke({"task": "Analyze Q4 data", "result": ""}, config)

snapshot = app.get_state(config)             # state at the last checkpoint
print(snapshot.values)

Use a clear thread_id like "user-123-task-456" so you can find any run later.
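The resume-by-thread_id idea can be sketched without any framework. The toy runner below saves a checkpoint after each step and, on the next call with the same thread_id, picks up where it stopped. It is an illustration of the mechanism under simplifying assumptions (a linear list of steps, an in-memory dict as the store), not how LangGraph is implemented; run, checkpoints, load, and analyze are hypothetical names.

```python
from typing import Callable

Step = Callable[[dict], dict]

# thread_id -> {"next_step": int, "state": dict}
checkpoints: dict[str, dict] = {}


def run(thread_id: str, steps: list[Step], initial: dict) -> dict:
    """Run steps in order, saving a checkpoint after each one.

    If a checkpoint already exists for thread_id, start from it
    and ignore `initial`.
    """
    saved = checkpoints.get(thread_id, {"next_step": 0, "state": initial})
    state = dict(saved["state"])
    for i in range(saved["next_step"], len(steps)):
        state = steps[i](state)   # may raise; earlier checkpoints survive
        checkpoints[thread_id] = {"next_step": i + 1, "state": dict(state)}
    return state


calls = {"load": 0, "analyze": 0}


def load(state: dict) -> dict:
    calls["load"] += 1
    return {**state, "rows": 100}


def analyze(state: dict) -> dict:
    calls["analyze"] += 1
    if calls["analyze"] == 1:
        raise TimeoutError("simulated API timeout")
    return {**state, "result": f"analyzed {state['rows']} rows"}


thread_id = "user-123-task-456"
try:
    run(thread_id, [load, analyze], {"task": "Analyze Q4 data"})
except TimeoutError:
    pass   # the checkpoint written after `load` is still there

final = run(thread_id, [load, analyze], {"task": "Analyze Q4 data"})
print(final["result"])   # analyzed 100 rows
print(calls["load"])     # 1
```

The second call skips load entirely because the checkpoint already records that step as done, which is the property that makes long runs survivable.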

Context grows without compaction

Every message in history is sent to the model on every step. After 50 turns you pay for 50 messages per call, most of them no longer relevant. Models also tend to do worse on very long context: earlier text is more likely to be overlooked.

Compaction replaces the oldest messages with a short summary and keeps only the recent ones:

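from typing import TypedDict

class State(TypedDict):
    messages: list[str]   # plain strings, no reducer on this field
    summary:  str

def should_compact(state: State) -> str:
    return "compact" if len(state["messages"]) > 20 else "continue"

def compact_history(state: State) -> State:
    old = state["messages"][:-5]      # everything except the 5 most recent
    # call_llm is your own model wrapper (prompt string in, text out)
    summary = call_llm("Summarize:\n" + "\n".join(old))
    return {**state, "messages": state["messages"][-5:], "summary": summary}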
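One detail worth sketching: when compaction runs more than once, the previous summary should feed into the next one, otherwise everything it covered is forgotten. The version below is a minimal sketch that takes the summarizer as a parameter so it can run with a stand-in; ChatState, compact, fake_summarize, and KEEP_RECENT are hypothetical names, and a real summarizer would call your model.

```python
from typing import Callable, TypedDict

KEEP_RECENT = 5   # messages kept verbatim after compaction


class ChatState(TypedDict):
    messages: list[str]
    summary: str


def compact(state: ChatState, summarize: Callable[[str], str]) -> ChatState:
    """Fold all but the most recent messages into the running summary."""
    if len(state["messages"]) <= KEEP_RECENT:
        return state
    old = state["messages"][:-KEEP_RECENT]
    recent = state["messages"][-KEEP_RECENT:]
    # Include the previous summary so earlier compactions are not lost
    parts = ([state["summary"]] if state["summary"] else []) + old
    return {"messages": recent, "summary": summarize("\n".join(parts))}


def fake_summarize(text: str) -> str:
    """Stand-in for a model call: reports how many lines it was given."""
    return f"[summary of {len(text.splitlines())} lines]"


state: ChatState = {"messages": [f"msg {i}" for i in range(22)], "summary": ""}
state = compact(state, fake_summarize)
print(len(state["messages"]), state["summary"])   # 5 [summary of 17 lines]
```

Slicing with [:-KEEP_RECENT] and [-KEEP_RECENT:] guarantees that every message ends up either in the summary input or in the kept tail, with nothing dropped in between.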

Watch average tokens per run in your logs. When it climbs, compact sooner.
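A rolling average is enough to spot that climb. The sketch below is one possible way to track it; TokenMonitor and estimate_tokens are hypothetical, and the four-characters-per-token estimate is a rough assumption that should be replaced with your model's tokenizer or the usage numbers your provider returns.

```python
from collections import deque


def estimate_tokens(text: str) -> int:
    """Very rough estimate, assuming about 4 characters per token."""
    return max(1, len(text) // 4)


class TokenMonitor:
    """Tracks estimated tokens per run over a sliding window of runs."""

    def __init__(self, window: int = 50, limit: float = 4000.0):
        self.runs: deque = deque(maxlen=window)
        self.limit = limit

    def record(self, messages: list[str]) -> None:
        self.runs.append(sum(estimate_tokens(m) for m in messages))

    @property
    def average(self) -> float:
        return sum(self.runs) / len(self.runs) if self.runs else 0.0

    def should_compact_sooner(self) -> bool:
        return self.average > self.limit


monitor = TokenMonitor(window=3, limit=100)
monitor.record(["a" * 400])              # about 100 tokens
print(monitor.should_compact_sooner())   # False
monitor.record(["a" * 800])              # about 200 tokens
print(monitor.average)                   # 150.0
print(monitor.should_compact_sooner())   # True
```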

Mixing memory tiers makes retrieval slow

There are three kinds of memory, each with a different lifespan: what the agent is doing now, what happened in past sessions, and durable facts about the world. Put them all in one store and every lookup has to search data that does not matter to the current step.

Keep them separate:

# TIER 1: short-term. The current task, in the model's context window.
class AgentState(TypedDict):
    current_task: str
    recent_msgs:  list   # last few messages only

# TIER 2: episodic. Past sessions, saved as checkpoints.
# LangGraph stores and retrieves these by thread_id.
past_run = app.get_state({"configurable": {"thread_id": "session-042"}})

# TIER 3: long-term. Durable facts in a vector store, searched by
# similarity. Pull in only what is relevant to this task.
# (vector_store and call_llm stand in for your own retrieval client and model wrapper.)
def reason(state: AgentState) -> AgentState:
    facts  = vector_store.search(state["current_task"], k=3)
    answer = call_llm(f"Facts: {facts}\nTask: {state['current_task']}")
    # keep only the last few messages so short-term memory stays short
    recent = (state["recent_msgs"] + [answer])[-5:]
    return {**state, "recent_msgs": recent}

Start with short-term only. Add checkpoints when runs take more than a few minutes. Add long-term only when facts must survive across sessions.
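To show what "searched by similarity" means for the long-term tier, here is a toy store that ranks facts by cosine similarity over word counts. It is a stand-in for illustration only: a real vector store would use learned embeddings, and ToyVectorStore, embed, and cosine are hypothetical names rather than any library's API.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: a bag of lowercase words."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class ToyVectorStore:
    def __init__(self) -> None:
        self._items: list[tuple[str, Counter]] = []

    def add(self, fact: str) -> None:
        self._items.append((fact, embed(fact)))

    def search(self, query: str, k: int = 3) -> list[str]:
        """Return the k stored facts most similar to the query."""
        q = embed(query)
        ranked = sorted(self._items, key=lambda item: cosine(q, item[1]), reverse=True)
        return [fact for fact, _ in ranked[:k]]


store = ToyVectorStore()
store.add("q4 revenue grew 12 percent")
store.add("the office is closed on fridays")
store.add("q4 churn fell in the enterprise segment")

print(store.search("analyze q4 revenue", k=2))
# ['q4 revenue grew 12 percent', 'q4 churn fell in the enterprise segment']
```

Only the top k facts reach the prompt, so the unrelated office fact never costs tokens for this task.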

Schema changes corrupt saved state

Treat your state schema like a database schema. Add a field, rename one, or change a type, and old saved state no longer matches. The agent loads it, reads a field that is missing or has the wrong type, and misbehaves with no error.

A schema_version field plus a migration function prevents this:

from typing import TypedDict

class AgentStateV2(TypedDict):
    schema_version: int
    task:           str
    result:         str
    confidence:     float   # new in v2

def migrate_state(old: dict) -> dict:
    if old.get("schema_version", 1) == 1:
        return {**old, "schema_version": 2, "confidence": 0.5}  # safe default
    return old

Every time you change the schema, bump schema_version and write a migration. Five minutes now saves hours of debugging later.
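Once there are several versions, migrations can be chained so that state saved at any older version is upgraded one step at a time. The sketch below assumes a hypothetical v3 that renames result to answer; that change, the MIGRATIONS registry, and migrate_to_current are illustrative and not part of the lesson's schema.

```python
from typing import Callable

CURRENT_VERSION = 3


def v1_to_v2(state: dict) -> dict:
    # v2 added "confidence"; 0.5 is a neutral default for old runs
    return {**state, "schema_version": 2, "confidence": 0.5}


def v2_to_v3(state: dict) -> dict:
    # hypothetical v3 change: "result" is renamed to "answer"
    rest = {k: v for k, v in state.items() if k != "result"}
    return {**rest, "schema_version": 3, "answer": state.get("result", "")}


# version N -> function that upgrades N to N + 1
MIGRATIONS: dict[int, Callable[[dict], dict]] = {1: v1_to_v2, 2: v2_to_v3}


def migrate_to_current(state: dict) -> dict:
    """Apply migrations in order until the state is at CURRENT_VERSION."""
    version = state.get("schema_version", 1)   # unversioned state counts as v1
    while version < CURRENT_VERSION:
        state = MIGRATIONS[version](state)
        version = state["schema_version"]
    if version != CURRENT_VERSION:
        raise ValueError(f"state has unknown schema_version {version}")
    return state


old = {"task": "Analyze Q4 data", "result": "done"}
print(migrate_to_current(old))
# {'task': 'Analyze Q4 data', 'schema_version': 3, 'confidence': 0.5, 'answer': 'done'}
```

Running this at load time, before any node reads the state, turns a silent mismatch into either a clean upgrade or an explicit ValueError.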

Next, we look at how to make an agent learn from its own failures: scoring runs, storing what went wrong, and checking that a change helps before you keep it.