Self-improvement patterns · NorthGradient

A self-improving agent follows a simple cycle: run a task, score the result, store what went wrong, propose a change, and check that the change helped before keeping it. Each section here is one step of that cycle.

Without an outside judge, stored failures, a human gate, protected safety rules, and before-and-after checks, “self-improvement” is just an agent changing itself at random.
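As a rough sketch, one pass of that cycle might look like the function below. Every callable it takes (run, score, store_failure, propose, approve, change_helped) is a hypothetical placeholder for a piece described later, and the 0.7 cutoff is only an illustrative default.

```python
from typing import Callable, Optional

def improvement_cycle(
    task: str,
    run: Callable[[str], str],                        # runs the agent, returns its output
    score: Callable[[str, str], float],               # outside judge: (task, output) -> 0..1
    store_failure: Callable[[str, str, float], None], # keeps what went wrong
    propose: Callable[[], str],                       # drafts a change from stored failures
    approve: Callable[[str], bool],                   # human gate
    change_helped: Callable[[str], bool],             # before-and-after check
    threshold: float = 0.7,
) -> Optional[str]:
    """Run one pass of the cycle; return the change that was kept, or None."""
    output = run(task)
    result = score(task, output)
    if result >= threshold:
        return None                      # nothing went wrong, nothing to change
    store_failure(task, output, result)
    change = propose()
    # keep the change only if a human approved it and the check says it helped
    if approve(change) and change_helped(change):
        return change
    return None
```

Passing the pieces in as arguments keeps the loop itself small and lets each piece be swapped or tested on its own.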

An agent cannot judge itself

An agent that scores its own work will overrate it. It has no outside reference, and its scoring prompt carries the same blind spots as its task prompt.

Use a separate evaluator: its only job is to read a finished run and return a score. It runs after the agent, never during:

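def evaluate_run(task: str, output: str) -> dict:
    """Score a finished run. call_llm and parse_json are your own helpers:
    one sends a prompt to the evaluator model, the other parses its JSON reply."""
    prompt = (
        f"Task: {task}\n"
        f"Output: {output}\n\n"
        "Score the output from 0 to 1 and list any problems. "
        'Reply as JSON: {"score": 0.8, "problems": ["..."]}'
    )
    return parse_json(call_llm(prompt))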

# run the evaluator after the agent finishes
def run_and_evaluate(task: str) -> tuple[dict, dict]:
    result = executor_agent.invoke({"task": task})
    return result, evaluate_run(task, result["output"])

Start with a yes/no rubric: did the agent finish the task? Add detail once the basic loop works.
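A yes/no rubric could be sketched as below. The function names, the "finished" key, and the choice to treat an unreadable reply as a failure are all assumptions made for illustration; the result keeps the same score/problems shape as evaluate_run so the rest of the loop does not need to change.

```python
import json

def build_binary_prompt(task: str, output: str) -> str:
    """Ask the evaluator a single yes/no question about the run."""
    return (
        f"Task: {task}\n"
        f"Output: {output}\n\n"
        "Did the output finish the task? "
        'Reply as JSON: {"finished": true, "problems": ["..."]}'
    )

def parse_binary_verdict(reply: str) -> dict:
    """Map a yes/no verdict onto a {"score", "problems"} dict."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        data = None
    if not isinstance(data, dict):
        # an unreadable verdict counts as a failure, not a silent pass
        return {"score": 0.0, "problems": ["evaluator reply was not a JSON object"]}
    problems = data.get("problems") or []
    if not isinstance(problems, list):
        problems = [str(problems)]
    finished = data.get("finished") is True
    return {"score": 1.0 if finished else 0.0, "problems": problems}
```

With only 0.0 and 1.0 as possible scores, any cutoff between them gives the same pass/fail split, so a threshold chosen now still works when the rubric later becomes graded.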

Successes teach you little; failures teach you most

Most systems log successes and drop failures. For a self-improving agent that is backwards. A success confirms the agent works on one input. A failure tells you which tools are flaky, which prompts are unclear, and which tasks it cannot handle yet.

Store every failed run with enough context to learn from it:

from datetime import datetime, timezone

def run_with_failure_logging(task: str, db) -> tuple[dict, dict]:
    result, score = run_and_evaluate(task)
    if score["score"] < 0.7:                  # treat low scores as failures
        db.failures.insert({
            "task":      task,
            "output":    result["output"],
            "score":     score["score"],
            "problems":  score["problems"],
            "trace":     result["trace"],      # what the agent did, step by step
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })
    return result, score                       # hand both back to the caller

Once you have 20 or so failures, group them by their problems. You will find a few recurring patterns. Those are your top priorities.
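One way to do the grouping is a plain count, sketched below. It assumes each stored failure has a "problems" list of short labels, as in the record above; free-text problems rarely match word for word, so in practice you may need the evaluator to pick from a fixed label set. The function name is hypothetical.

```python
from collections import Counter

def top_failure_patterns(failures: list[dict], n: int = 3) -> list[tuple[str, int]]:
    """Return the n most common problems as (problem, number of runs)."""
    counts = Counter()
    for failure in failures:
        # normalize case and whitespace, and count each problem once per run
        labels = {p.strip().lower() for p in failure.get("problems", [])}
        for label in sorted(labels):
            counts[label] += 1
    return counts.most_common(n)

failures = [
    {"problems": ["Tool timeout", "missing citation"]},
    {"problems": ["tool timeout "]},
    {"problems": ["tool timeout", "tool timeout"]},
    {"problems": ["missing citation"]},
    {"problems": ["vague answer"]},
]
print(top_failure_patterns(failures, n=2))
# [('tool timeout', 3), ('missing citation', 2)]
```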

Unsupervised self-modification compounds errors

An agent that rewrites its own prompts without review can drift. A bad change shifts behavior, which shifts the next score, which drives more changes off a corrupted signal. The errors stack up.

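LangGraph’s interrupt() pauses the run and waits for a human. The graph must be compiled with a checkpointer so the paused state can be saved and resumed later. Use interrupt() as a required gate before any self-change takes effect: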

from langgraph.types import interrupt

def propose_change(state):
    suggestion = call_llm(
        f"Given these failures: {get_recent_failures()}, "
        "how should I change my prompt?"
    )
    return {**state, "proposed_change": suggestion}

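def request_approval(state):
    # pauses here until a human resumes the run;
    # interrupt() then returns whatever value the run was resumed with
    approved = interrupt({
        "message":  "Approve this prompt change?",
        "proposed": state["proposed_change"],
    })
    return {**state, "approved": approved}

def apply_change(state):
    # apply only on an explicit True, so a truthy reply such as "no" cannot slip through
    if state["approved"] is True:
        save_prompt(state["proposed_change"])
    return state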

Even if you trust the agent, keep the gate and auto-approve it from a monitoring script. It gives you an audit log of every change, which is reason enough to keep it.
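The audit log itself can be as simple as one JSON line per decision. The sketch below is an assumption about how a monitoring script might record its approvals; record_decision, auto_approve, and the field names are hypothetical, and the value auto_approve returns is what the script would hand back when it resumes the paused run.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def record_decision(log_path: Path, proposed: str, approved: bool, reviewer: str) -> dict:
    """Append one approval decision to a JSON Lines audit log."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "proposed":  proposed,
        "approved":  approved,
        "reviewer":  reviewer,      # a person's name, or "auto-approver"
    }
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")
    return entry

def auto_approve(log_path: Path, proposed: str) -> bool:
    """Approve without a human, but still leave a record of the change."""
    record_decision(log_path, proposed, approved=True, reviewer="auto-approver")
    return True
```

Because human and automatic approvals go through the same record_decision call, switching the gate back to manual review later does not change the log format.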

Safety rules must be out of the agent’s reach

The agent may rewrite its task prompts. That is the point. But if safety rules (never make things up, always cite sources, never take irreversible actions) sit in the same file, the agent can rewrite those too.

Split them into two files. The agent can write task prompts. It can only read the rules:

import json
from pathlib import Path

# behavior_rules.json : pinned, changed only by code review
# task_prompts.json   : the self-improvement cycle may rewrite this

def build_system_prompt(task_name: str) -> str:
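    # read both files as UTF-8 so the prompt does not depend on the platform default
    rules   = json.loads(Path("behavior_rules.json").read_text(encoding="utf-8"))["rules"]
    prompts = json.loads(Path("task_prompts.json").read_text(encoding="utf-8"))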
    rule_lines = "\n".join(f"- {r}" for r in rules)
    return f"RULES (never break):\n{rule_lines}\n\nTASK:\n{prompts[task_name]}"

Keep behavior_rules.json in version control and require review to change it, like any other code.
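Code review protects the file in the repository, but not the copy on disk while the agent runs. One possible extra guard, sketched here under assumptions, is to give the self-improvement code a single write function that only touches the task prompts and refuses to run if the rules file no longer matches a digest recorded at review time. The function names and the pinned-digest approach are illustrative, not part of any library.

```python
import hashlib
import json
from pathlib import Path

def file_digest(path: Path) -> str:
    """SHA-256 of a file's bytes, used to detect any edit to the rules."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def save_task_prompt(
    task_name: str,
    new_prompt: str,
    prompts_path: Path,
    rules_path: Path,
    pinned_rules_digest: str,     # recorded when the rules were last reviewed
) -> None:
    """Write one task prompt; never writes to the rules file."""
    if file_digest(rules_path) != pinned_rules_digest:
        raise RuntimeError("behavior rules differ from the reviewed version")
    prompts = json.loads(prompts_path.read_text(encoding="utf-8"))
    prompts[task_name] = new_prompt
    prompts_path.write_text(json.dumps(prompts, indent=2), encoding="utf-8")
```

File permissions or a read-only mount would give stronger protection; the digest check only detects tampering, it does not prevent it.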

Applying a change without measuring it is guessing

A new prompt might help some tasks and hurt others. Without a before-and-after number you cannot tell, and you might ship a regression.

Measure on a fixed set of tasks before and after. If the success rate drops by more than a small tolerance (2 percentage points here), roll back:

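def success_rate(test_cases) -> float:
    if not test_cases:
        raise ValueError("test_cases must not be empty")
    scores = [run_and_evaluate(task)[1]["score"] for task in test_cases]
    # >= 0.7 mirrors the failure cutoff used when logging (score < 0.7)
    return sum(s >= 0.7 for s in scores) / len(scores)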

def apply_improvement_safely(test_cases, new_prompt):
    before = success_rate(test_cases)
    apply_prompt_change(new_prompt)
    after = success_rate(test_cases)
    if after < before - 0.02:          # worse by more than 2 points
        rollback_prompt_change()
        print(f"Rolled back: {before:.0%} -> {after:.0%}")
    else:
        print(f"Kept change: {before:.0%} -> {after:.0%}")

Build a set of 20 to 50 real tasks before you write any self-improvement code. You need a baseline before you can measure a change.
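Because a new prompt can help some tasks and hurt others, an unchanged success rate can still hide a regression. A per-task comparison makes that visible. The sketch below assumes you keep each task's score from the before and after runs in a dict keyed by task name; find_regressions is a hypothetical helper, and a task missing from the after run is treated as failed.

```python
def find_regressions(
    before: dict[str, float],
    after: dict[str, float],
    threshold: float = 0.7,
) -> list[str]:
    """Tasks that passed before the change and fail after it."""
    return sorted(
        task
        for task, old_score in before.items()
        if old_score >= threshold and after.get(task, 0.0) < threshold
    )

before = {"summarize": 0.9, "extract": 0.5, "classify": 0.8}
after  = {"summarize": 0.6, "extract": 0.9, "classify": 0.8}

# two of three tasks pass both times, yet one task got worse
print(find_regressions(before, after))
# ['summarize']
```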

Next, we look at what happens when things go wrong: making tool failures recoverable, loops bounded, model output safe to act on, and risky actions reviewed first.