
Recursive Patterns for Scaling Agent Work

Three patterns from the Recursive Language Models paper that changed how I orchestrate multi-agent systems — and you don't need a research lab to use them.

Technology · AI · Agentic AI · Research · Patterns · Multi-Agent

Most AI agent scaffolds hit a wall somewhere between “impressive demo” and “actually useful at scale.” The context window fills up. Output gets truncated. Sub-agent coordination becomes a brittle mess of manual batching. These aren’t implementation bugs — they’re architectural limitations baked into how we think about agent systems.

The Recursive Language Models paper (Zhang, Kraska, Khattab — ICML 2025) offers a way out. It presents patterns that let LLMs process 10+ million tokens, generate unbounded output, and delegate work programmatically rather than manually. The results are striking: RLM(GPT-5) scores 91.3% on tasks where vanilla GPT-5 scores 0% (can’t even fit the input).

You don’t need their research infrastructure to use these ideas. This post extracts the core patterns and shows how to apply them to practical multi-agent systems.

Three Flaws of Standard Agent Scaffolds

The paper identifies three fundamental problems with how most agent systems are built:

1. Putting Full Data into Context

Standard approach: “Here’s the data, analyze it.”

# Problem: data goes into context
response = agent.run(
    prompt=f"Analyze this document: {document_text}",  # 500K tokens
)
# Result: context overflow, truncation, or refusal

When data exceeds the context window, the agent fails. Even when the data technically fits, long contexts degrade reasoning quality: the model is trying to "remember" everything while also thinking.

2. Generating Output Autoregressively

Standard output is bounded by context length. An agent can’t generate a 100-page report if its context window is 32K tokens. Even models with larger contexts hit practical limits — generation slows, coherence degrades.
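The way around this limit is to stream output to storage instead of generating it in one pass. A minimal sketch, assuming hypothetical names (`write_report`, `generate_section`): each call produces one bounded section, and the report accumulates on disk rather than in a single generation.

```python
import os
import tempfile

def write_report(section_titles, generate_section, out_path):
    """Build an arbitrarily long report from bounded per-section calls.

    generate_section is a stand-in for a model call; each call only has
    to produce one section, so total report length is not capped by the
    context window.
    """
    with open(out_path, "w") as f:
        for title in section_titles:
            f.write(f"## {title}\n")
            f.write(generate_section(title) + "\n\n")
    return out_path

# Toy stand-in generator; a real system would call a model here.
report_path = write_report(
    ["Introduction", "Methods", "Results"],
    lambda title: f"(generated text for {title})",
    os.path.join(tempfile.gettempdir(), "rlm_report.md"),
)
```

The report can grow to any length because no single call ever has to hold more than one section.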

3. Verbalizing Sub-calls Instead of Programming Them

When an agent needs to delegate work, the standard pattern is verbalized delegation:

# Verbalized: the coordinator spells out every sub-call by hand
agent.delegate("Process articles 1-10")
agent.delegate("Process articles 11-20")
agent.delegate("Process articles 21-30")
# ... coordinator's context fills with results

This pattern caps out fast. The coordinator has to specify each batch, track each result, and handle each error, so its effort grows linearly with the number of batches. It doesn't scale.

Pattern 1: Data as Environment

The RLM approach inverts the relationship between agent and data:

Standard: data sits in context, and the agent processes it all at once.
RLM: data sits in the environment, and the agent writes code to access slices.

# RLM pattern: data as environment
def analyze_document(doc_path):
    """Agent writes code like this instead of receiving full text."""
    relevant = []
    for chunk in read_chunks(doc_path, chunk_size=1000):
        findings = analyze_chunk(chunk)
        if findings.relevant:
            relevant.append(findings)

    return aggregate(relevant)

The agent doesn’t hold the entire document in context. It interacts with the document through tool calls — reading chunks, querying sections, writing partial results to storage.

In practice, this means manifest-based dispatch:

# Bad: send file contents
task = {
    "articles": [read(f) for f in paths],  # Context overflow
    "instruction": "Edit each article"
}

# Good: send file paths
task = {
    "manifest": [{"path": p, "status": "draft"} for p in paths],
    "instruction": "Read each file from manifest, edit, save"
}

The agent reads what it needs, when it needs it. Context usage scales with task complexity, not input size.
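Here is a minimal, runnable version of the chunked-access idea. The helper names (`read_chunks`, `scan_document`) are mine, not the paper's, and the predicate stands in for whatever relevance check an agent would run per chunk.

```python
import os
import tempfile

def read_chunks(path, chunk_size=1000):
    """Yield fixed-size slices so only one chunk is ever 'in context'."""
    with open(path) as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield chunk

def scan_document(path, is_relevant, chunk_size=1000):
    """Walk the file chunk by chunk and keep only relevant findings.

    Memory cost is O(chunk_size), not O(file size). A needle straddling
    a chunk boundary would need overlapping chunks; this sketch skips that.
    """
    return [c for c in read_chunks(path, chunk_size) if is_relevant(c)]

# Usage: find a needle in a document larger than any single slice.
doc_path = os.path.join(tempfile.gettempdir(), "rlm_demo_doc.txt")
with open(doc_path, "w") as f:
    f.write("x" * 2500 + "NEEDLE" + "x" * 100)
hits = scan_document(doc_path, lambda c: "NEEDLE" in c)
```

Swap the lambda for a model call and the same loop handles documents of any size.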

Pattern 2: Symbolic Recursion

The key insight: agents can write programs that spawn sub-calls programmatically, not just verbalize each call individually.

Verbalized delegation:

Agent: "I'll process batch 1, then batch 2, then batch 3..."
// Each batch requires coordinator attention
// Results flow back, filling coordinator context

Programmatic delegation:

# Agent writes a loop
for article in manifest:
    result = process_single(article)
    results_file.append(result)

return {"status": "complete", "results_path": results_file}

The difference: verbalized delegation tops out at a handful of sub-calls (effectively O(1)), while programmatic delegation reaches O(N) or even O(N²) sub-calls from a single instruction. For a task requiring 100 sub-operations, verbalized delegation needs 100 coordinator interactions. Programmatic delegation needs one: "write the loop."

This is why I dispatch coding agents to write processing scripts rather than manually batching work. The script handles iteration, error recovery, and progress tracking. The coordinator monitors completion, not individual items.
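A runnable sketch of such a script, with `process_single` as a stand-in for a sub-agent call (function and file names are illustrative, not from the paper):

```python
import json
import os
import tempfile

def process_manifest(manifest, process_single, results_path):
    """One instruction ('run this loop') replaces N verbalized delegations.

    The loop owns iteration, error capture, and progress; the coordinator
    gets back only a summary plus a pointer to the full results on disk.
    """
    results, errors = [], 0
    for item in manifest:
        try:
            results.append({"item": item, "output": process_single(item)})
        except Exception as exc:
            errors += 1
            results.append({"item": item, "error": str(exc)})
    with open(results_path, "w") as f:
        json.dump(results, f)
    return {"status": "complete", "processed": len(manifest),
            "errors": errors, "results_path": results_path}

summary = process_manifest(
    ["alpha", "beta", "gamma"],
    lambda item: item.upper(),          # stand-in for a sub-agent call
    os.path.join(tempfile.gettempdir(), "rlm_results.json"),
)
```

One failed item increments a counter instead of derailing the coordinator; the error detail waits on disk.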

Pattern 3: Metadata-Only History

When sub-agents return results to a coordinator, what should flow back?

Standard: Full output enters coordinator context

# Result from sub-agent
{
    "article_1": "Full 2000-word edited text...",
    "article_2": "Full 2000-word edited text...",
    "reasoning": "Detailed explanation of every change..."
}
# Coordinator context fills with sub-agent verbosity

RLM: Only metadata enters context, full output goes to storage

# Structured return
{
    "status": "complete",
    "processed": 10,
    "changed": 7,
    "errors": 1,
    "details_path": "/logs/batch-2026-02-24.json"
}
# Coordinator context stays clean

The coordinator makes decisions on summaries. It can drill into details via file reads when needed, but details don’t pollute context by default.

This pattern is essential for long orchestration chains. Without it, the coordinator’s context fills with sub-agent output until it can’t reason clearly anymore.
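A small sketch of the pattern, assuming a sub-agent that returns a list of per-item records (the wrapper name and record fields are illustrative):

```python
import json
import os
import tempfile

def summarize_to_coordinator(per_item_results, details_path):
    """Write full sub-agent output to storage; return only metadata."""
    with open(details_path, "w") as f:
        json.dump(per_item_results, f)          # details live on disk
    return {
        "status": "complete",
        "processed": len(per_item_results),
        "changed": sum(1 for r in per_item_results if r.get("changed")),
        "errors": sum(1 for r in per_item_results if r.get("error")),
        "details_path": details_path,           # drill in only when needed
    }

# Verbose per-item output that should NOT enter coordinator context.
raw = [{"id": i, "changed": i % 3 == 0, "text": "full edited text..."}
       for i in range(10)]
summary = summarize_to_coordinator(
    raw, os.path.join(tempfile.gettempdir(), "batch_details.json"))
```

The summary dict is a few dozen tokens regardless of how verbose the per-item output gets.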

Results From the Paper

The paper demonstrates these patterns on extreme tasks:

| Task | Base model | With RLM | Why |
| --- | --- | --- | --- |
| 10M token QA | 0% | 91.3% | Input can't fit without RLM |
| Dense reasoning | 44% | 56.5% | Every line matters, can't summarize |
| Quadratic pairs | 0.1% | 58.0% | N² comparisons, only loops can do it |

The “10M token QA” result is particularly striking. Vanilla GPT-5 literally cannot attempt the task — the input doesn’t fit. RLM(GPT-5) scores 91.3% by treating the input as environment and processing it through code.

You Don’t Need the Library

These patterns don’t require special infrastructure. They’re architectural decisions about:

  1. Where data lives: In context vs in environment (files, databases, APIs)
  2. How delegation happens: Verbalized one-by-one vs programmatic loops
  3. What returns from sub-agents: Full output vs structured summaries

Implementation is convention, not framework:

# Manifest-based dispatch (Pattern 1)
def dispatch_manifest(agent, manifest_path, instructions):
    return agent.run(f"Process items in {manifest_path}. {instructions}")

# Programmatic delegation (Pattern 2)
def dispatch_with_loop(agent, items, process_fn_template):
    script = agent.run(f"Write a Python script that: {process_fn_template}")
    return exec_script(script, items)

# Structured returns (Pattern 3)
def collect_results(agent_return, needs_detail=False):
    summary = agent_return["summary"]  # For coordinator
    if needs_detail:
        summary["details"] = read(agent_return["details_path"])  # Only when needed
    return summary

Ten lines of convention, not ten thousand lines of library.

When Standard Approaches Win

RLM patterns add orchestration overhead. Sometimes that overhead isn’t worth it:

Small inputs that fit in context: Just use the model directly. No need for manifest dispatch when the data is 1000 tokens.

One-shot tasks: A single agent call with no delegation doesn’t benefit from orchestration patterns.

Tasks requiring holistic judgment: Some tasks can’t be decomposed. If understanding requires seeing everything at once, chunking doesn’t help.

Low N tasks: Processing 5 articles doesn’t need programmatic loops. The overhead of writing a script exceeds the cost of manual batching.

The rule of thumb: if N > 10 or input > context_window, RLM patterns probably help. Otherwise, keep it simple.
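That rule of thumb fits in a one-line helper. The threshold values below just restate the heuristic from this post; they are not from the paper.

```python
def use_rlm_patterns(n_items, input_tokens,
                     context_window=32_000, n_threshold=10):
    """Heuristic: orchestration pays off past either the item-count
    or the context-window limit; below both, keep it simple."""
    return n_items > n_threshold or input_tokens > context_window

small_job = use_rlm_patterns(n_items=5, input_tokens=1_000)      # simple path
big_batch = use_rlm_patterns(n_items=100, input_tokens=1_000)    # RLM patterns
huge_doc = use_rlm_patterns(n_items=1, input_tokens=500_000)     # RLM patterns
```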

Applying the Patterns

For a practical multi-agent content pipeline:

  1. Research stage: Scout receives topic list (manifest), searches each incrementally, writes findings to files, returns summary.

  2. Writing stage: Writer receives manifest of topics + research paths, reads research per-topic, writes articles to files, returns completion counts.

  3. Editing stage: Editor receives manifest of draft paths, reads each, applies edits, writes changes, returns changed/unchanged/error counts.

  4. Publishing stage: Publisher receives manifest of approved paths, formats and deploys each, returns deployment status.

At each stage, the coordinator sees summaries. Full content lives in files. Processing loops are scripts, not verbalized batches. Context stays clean across the entire pipeline.
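As one concrete stage, the editing step can be sketched like this; `edit_fn` stands in for the editor agent and returns `(new_text, changed)` (names and manifest shape are my assumptions, matching the dispatch example earlier).

```python
import os
import tempfile

def editing_stage(manifest, edit_fn):
    """Read each draft from the manifest, edit in place, return counts only."""
    counts = {"changed": 0, "unchanged": 0, "errors": 0}
    for entry in manifest:
        try:
            with open(entry["path"]) as f:
                text = f.read()
            new_text, changed = edit_fn(text)
            if changed:
                with open(entry["path"], "w") as f:
                    f.write(new_text)
                counts["changed"] += 1
            else:
                counts["unchanged"] += 1
        except OSError:
            counts["errors"] += 1
    return counts

# Usage: two drafts, one needing a whitespace fix.
drafts_dir = tempfile.mkdtemp()
paths = []
for i, text in enumerate(["needs  a  fix", "already clean"]):
    p = os.path.join(drafts_dir, f"draft_{i}.md")
    with open(p, "w") as f:
        f.write(text)
    paths.append(p)

manifest = [{"path": p, "status": "draft"} for p in paths]
counts = editing_stage(manifest, lambda t: (" ".join(t.split()), "  " in t))
```

The coordinator sees three integers; the edited prose stays on disk.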

Building scalable agent systems. More patterns at skillpacks.dev.
