OpenClaw Part 3: How My AI Agent's Memory Evolved Over Two Months
From flat files to tiered memory architecture — how two months of running AI agents taught me that memory isn't a feature, it's infrastructure. Includes multi-agent sharing, self-healing systems, and automated DR.
In Part 2, I built a memory system with three-tier backups and a twenty-minute disaster recovery target. That system worked — right up until it didn’t scale.
Two months later, the original MEMORY.md file had become a monster. Hundreds of lines, mixing everything from user preferences to daily task logs to lessons learned from failures. Finding anything useful meant scrolling through a wall of text. Worse, I’d started running multiple specialized agents sharing the same workspace. The flat file approach was actively working against me.
This post covers what I learned scaling from one agent with a simple memory file to eight agents with a structured memory architecture, automated maintenance, and self-healing systems.
The Problem with Flat Memory
The original design was simple: one MEMORY.md file containing everything the agent needed to remember. Each session, the agent would read it, do its work, and append anything worth keeping.
This fell apart in three ways:
Token inflation. As the file grew, more tokens were consumed just loading context. Beyond a certain size, the agent started forgetting things that were technically in the file — they’d been pushed out of the attention window by sheer volume.
Search friction. Looking for “what did we decide about backups?” required reading through unrelated notes about preferences, project status, and random observations. No structure meant no efficient retrieval.
Multi-agent conflicts. When I added specialized agents (one for security, one for content, one for code review), they were all writing to the same file. Merge conflicts, duplicated information, contradictory entries.
The flat file had become a liability. I needed architecture.
Tiered Memory Architecture
The solution was decomposing the monolithic file into purpose-specific directories. Each type of information gets its own home:
Facts Directory
The facts/ directory contains atomic knowledge files — things that are definitively true and rarely change:
- Agent profiles — what each specialized agent does, its boundaries, its communication style
- Lessons learned — patterns that failed, approaches that worked
- Preferences — user settings, formatting choices, working hours
- Task patterns — recurring workflows and their optimal execution
Each fact lives in its own file. Adding a new lesson doesn’t require editing a massive document — just create a new file. This also enables selective loading: a code review agent doesn’t need the content calendar preferences.
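To make selective loading concrete, here's a minimal Python sketch. The `role--topic.md` naming convention is a hypothetical illustration, not the post's actual layout:

```python
from pathlib import Path

def load_facts(facts_dir: str, roles: set) -> list:
    """Load only the fact files relevant to an agent's roles.

    Assumes a hypothetical 'role--topic.md' naming convention,
    e.g. code--retry-patterns.md; 'all--' files go to every agent.
    """
    loaded = []
    for path in sorted(Path(facts_dir).glob("*.md")):
        role = path.name.split("--", 1)[0]
        if role in roles or role == "all":
            loaded.append(path.read_text())
    return loaded
```

With this, the code review agent asks for `{"code"}` and never pays tokens for the content calendar.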
Daily Notes
Daily session logs moved to memory/YYYY-MM-DD.md files. Raw observations, decisions, errors encountered. This is the episodic layer — useful for context, but not curated.
Keeping dailies separate means:
- Easy date-based retrieval (“what happened last Tuesday?”)
- Automatic cleanup — files older than thirty days get compressed
- No pollution of long-term memory with transient details
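Because the date is in the filename, selecting dailies for cleanup doesn't even require stat calls. A minimal sketch (the thirty-day default and `-{agent}` suffix handling are taken from the scheme above):

```python
from datetime import date, timedelta
from pathlib import Path
from typing import Optional

def dailies_to_archive(memory_dir: str, keep_days: int = 30,
                       today: Optional[date] = None) -> list:
    """Return daily note files older than keep_days, judged by the
    YYYY-MM-DD prefix of the filename rather than filesystem mtime."""
    today = today or date.today()
    cutoff = today - timedelta(days=keep_days)
    old = []
    for path in Path(memory_dir).glob("*-*-*.md"):
        try:
            day = date.fromisoformat(path.stem[:10])
        except ValueError:
            continue  # not a daily note
        if day < cutoff:
            old.append(path)
    return sorted(old)
```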
Feedback System
Perhaps the most valuable addition: a shared feedback.md file that captures rules learned from mistakes.
When an agent makes an error, it doesn’t just fix the immediate problem. It adds a rule to the feedback file explaining what went wrong and how to prevent it. Every agent reads this file at session start.
The effect is institutional knowledge. Agent A makes a mistake on Monday. Agent B inherits that lesson on Tuesday. No one repeats the same error twice.
Example entries:
```markdown
## Rate Limiting
- Never send more than 5 API calls per minute to external services
- Learned: 2026-02-15 (caused 429 errors during batch processing)

## Timezone Handling
- Always confirm user timezone before scheduling
- Learned: 2026-02-23 (sent reminder at 3am user time)

## Large File Handling
- Stream files over 10MB instead of loading into memory
- Learned: 2026-03-08 (caused OOM crash)
```
Specialized Tracking
Certain information types warrant their own files:
- holds.md — tasks paused pending external input (“waiting for API key from vendor”)
- friction.md — recurring annoyances worth fixing (“this manual step takes too long”)
- predictions.md — guesses made that should be validated later
These aren’t core memory, but they prevent things from falling through cracks.
Multi-Agent Memory Sharing
With eight specialized agents running in the same workspace, memory becomes a shared resource. The architecture needed to handle concurrent access without conflicts.
Agent Specialization
Each agent has a defined role:
| Agent | Focus | Memory Access |
|---|---|---|
| Main | General assistance, coordination | Full read/write |
| Security | Vulnerability scanning, hardening | Read facts, write security logs |
| Content | Writing, editing, publishing | Read preferences, write drafts |
| Research | Information gathering, summarization | Read all, write findings |
| Code | Development, review, debugging | Read patterns, write technical notes |
| Monitor | Health checks, alerting | Read thresholds, write metrics |
| Backup | DR testing, restoration | Full read, write backup logs |
| Cron | Scheduled task management | Read schedules, write execution logs |
Inbox System
When agents need to communicate asynchronously, they drop messages in inbox/. Each agent checks its inbox at session start and clears handled messages.
```markdown
## inbox/code-agent.md

From: security-agent
Date: 2026-03-28
Subject: Dependency vulnerability

Found CVE-2026-1234 in package foo-lib.
Priority: High
Action needed: Upgrade to version 2.3.1 or later
```
This prevents agents from blocking on each other while ensuring important information gets delivered.
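The producer-consumer mechanics are small enough to sketch in a few lines of Python (file names and the one-file-per-recipient layout follow the example above; everything else is an assumption):

```python
from pathlib import Path

def send_message(inbox_dir: str, recipient: str, body: str) -> None:
    """Producer: append a message to the recipient's inbox file."""
    inbox = Path(inbox_dir) / f"{recipient}.md"
    with inbox.open("a") as f:
        f.write(body.rstrip() + "\n\n")

def drain_inbox(inbox_dir: str, agent: str) -> str:
    """Consumer: read all pending messages, then clear the file."""
    inbox = Path(inbox_dir) / f"{agent}.md"
    if not inbox.exists():
        return ""
    messages = inbox.read_text()
    inbox.unlink()  # delete after processing, per the producer-consumer rule
    return messages
```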
Conflict Resolution
With multiple writers, conflicts are inevitable. The rules are simple:
- Facts are append-only. New information creates new files, never overwrites existing facts.
- Dailies are agent-namespaced. Each agent writes to memory/YYYY-MM-DD-{agent}.md.
- Feedback is append-only with dedup. Before adding a rule, check if it already exists.
- Inbox is producer-consumer. Writer creates, reader deletes after processing.
These constraints eliminate most conflicts without requiring locking mechanisms.
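The append-with-dedup rule for feedback, for instance, reduces to a few lines. A minimal sketch, assuming one rule per line (the real file groups rules under headings):

```python
from pathlib import Path

def append_feedback_rule(feedback_file: str, rule: str) -> bool:
    """Append a rule to feedback.md unless an identical line exists.
    Returns True if the rule was added."""
    path = Path(feedback_file)
    existing = path.read_text().splitlines() if path.exists() else []
    if rule.strip() in (line.strip() for line in existing):
        return False  # duplicate: stay append-only, but deduplicated
    with path.open("a") as f:
        f.write(rule.strip() + "\n")
    return True
```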
Automated Maintenance
Memory without maintenance decays. Old information becomes stale. Files accumulate. Performance degrades. I built several automated systems to keep things healthy.
Drift Monitor
A weekly script compares current configuration against a known-good baseline:
```bash
# Pseudo-code for drift detection
baseline_hash=$(cat baseline-config.hash)
current_hash=$(compute_config_hash)

if [ "$baseline_hash" != "$current_hash" ]; then
  notify "Configuration drift detected"
  generate_diff_report
fi
```
Drift isn’t always bad — sometimes configuration should change. But unintentional drift is a leading cause of “it worked last week” failures. The monitor surfaces changes so I can decide if they’re intentional.
Idempotency Guard
Automation scripts sometimes run twice (cron hiccups, manual re-runs, retry logic). Non-idempotent operations cause problems: duplicate emails, double database entries, repeated notifications.
The idempotency guard wraps sensitive operations:
```python
import os
import time
from typing import Callable, Optional

LOCK_TTL_SECONDS = 3600  # treat locks older than an hour as expired

def with_idempotency(operation_id: str, fn: Callable) -> Optional[object]:
    lock_file = f"/tmp/idem-{operation_id}.lock"
    if os.path.exists(lock_file):
        if time.time() - os.path.getmtime(lock_file) < LOCK_TTL_SECONDS:
            return None  # already ran recently, skip
        os.remove(lock_file)  # stale lock, clear it
    result = fn()  # if fn() raises, no lock is created, so a retry can run
    with open(lock_file, "w"):
        pass  # touch the lock; its mtime starts the TTL clock
    return result
```
This is defense in depth. Individual operations should be idempotent anyway, but the wrapper catches cases where they’re not.
Cron Self-Healing
Scheduled tasks fail. Networks have blips. Services restart. The question is what happens next.
Traditional cron failure handling: log the error, maybe send an alert, require manual intervention. This doesn’t scale when you have dozens of scheduled tasks.
The auto-triage system handles common failure patterns automatically:
Timeout failures get automatic timeout bumps. If a backup job timed out at 5 minutes, the next run gets 10 minutes. The system tracks whether the bump helped.
Transient failures (network errors, rate limits, temporary service outages) get automatic retry with exponential backoff. Three consecutive transient failures trigger an alert.
Persistent failures (auth errors, missing dependencies) get disabled and escalated. No point retrying something that won’t work.
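The three rules above can be sketched as a single dispatch function. This is an illustrative reduction, assuming the failure kind comes from an upstream log classifier (which is out of scope here):

```python
def triage(failure_kind: str, consecutive_failures: int, timeout_s: int) -> dict:
    """Map a cron failure to a remediation, mirroring the rules above.

    failure_kind is one of 'timeout' | 'transient' | 'persistent',
    assumed to be produced by an upstream log classifier.
    """
    if failure_kind == "timeout":
        return {"action": "retry", "timeout_s": timeout_s * 2}  # bump the budget
    if failure_kind == "transient":
        if consecutive_failures >= 3:
            return {"action": "alert"}  # three strikes on transients
        # exponential backoff: 60s, 120s, ...
        return {"action": "retry", "delay_s": 60 * 2 ** (consecutive_failures - 1)}
    return {"action": "disable_and_escalate"}  # no point retrying auth failures
```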
The deeper analysis script (cron-investigate.sh) handles edge cases: prompt bugs that cause parsing errors, double-send detection, resource exhaustion patterns. It runs weekly and produces a report of anything that needs human attention.
Memory Compression
Daily files older than thirty days undergo compression. The raw content gets summarized into key facts, which are merged into facts/compressed-daily-{month}.md. The original file is then archived.
This prevents unbounded growth while preserving important information. A month’s worth of daily logs might be 50KB; the compressed version is typically under 2KB.
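A minimal sketch of the merge step, with one loud caveat: the "keep bullet lines" extraction below is a crude stand-in for summarization, which the real system would delegate to the agent itself:

```python
from pathlib import Path

def compress_daily(daily_path: str, facts_dir: str) -> Path:
    """Fold a daily note into the monthly compressed-daily file.

    Extraction is a placeholder (bullet lines only); real
    summarization would be done by an agent, not string filtering.
    """
    daily = Path(daily_path)
    month = daily.stem[:7]  # YYYY-MM from YYYY-MM-DD.md
    target = Path(facts_dir) / f"compressed-daily-{month}.md"
    key_lines = [l for l in daily.read_text().splitlines() if l.startswith("- ")]
    with target.open("a") as f:
        f.write(f"## {daily.stem}\n" + "\n".join(key_lines) + "\n")
    return target
```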
Disaster Recovery Evolution
Part 2’s DR system was solid but manual. You had to run the restore script and walk through prompts. It worked, but it required human presence.
Two months later, the DR system has evolved:
Automated Testing
A weekly cron job runs a non-destructive DR test:
- Verify backup freshness — Most recent backup should be <24 hours old
- Test archive integrity — Uncompress and validate checksums
- Simulate restore — Extract to a temporary directory, verify file counts
- Check credential separation — Confirm secrets aren’t in main backup
- Validate restoration order — Ensure no circular dependencies
The test produces a pass/fail report. Failures generate alerts. I haven’t manually tested DR in weeks because the automated tests catch problems first.
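The freshness check (step 1) is representative of how small each test is. A sketch, assuming archives are `.tar.gz` files in a single backup directory (the actual layout may differ):

```python
import time
from pathlib import Path
from typing import Optional

def backup_is_fresh(backup_dir: str, max_age_hours: float = 24,
                    now: Optional[float] = None) -> bool:
    """Step 1 of the weekly DR test: the newest archive must be
    younger than max_age_hours."""
    now = time.time() if now is None else now
    archives = list(Path(backup_dir).glob("*.tar.gz"))
    if not archives:
        return False  # no backups at all is an automatic failure
    newest = max(p.stat().st_mtime for p in archives)
    return (now - newest) < max_age_hours * 3600
```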
Security Verification
Post-restore security is a separate concern from data restoration. A dedicated script verifies:
- SSH key fingerprints match expected values
- No unexpected ports are listening
- Firewall rules match baseline
- Service accounts have correct permissions
- No world-readable sensitive files
This runs automatically after any restore and can be triggered manually for peace of mind.
Integrity and Provenance
Every backup now includes:
- integrity-hashes.json — SHA256 hashes of critical files
- provenance-log.jsonl — Chain of custody: who backed up, when, from what state
If a restore produces unexpected hashes, something changed between backup and restore. The provenance log helps identify where.
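Generating the hash manifest is the simple half of this. A minimal sketch of writing integrity-hashes.json (the list of critical files is whatever your backup config names):

```python
import hashlib
import json
from pathlib import Path

def write_integrity_hashes(backup_dir: str, critical_files: list) -> Path:
    """Record SHA256 hashes of critical files as integrity-hashes.json.
    A post-restore check recomputes and compares against this manifest."""
    hashes = {}
    for name in critical_files:
        data = (Path(backup_dir) / name).read_bytes()
        hashes[name] = hashlib.sha256(data).hexdigest()
    out = Path(backup_dir) / "integrity-hashes.json"
    out.write_text(json.dumps(hashes, indent=2, sort_keys=True))
    return out
```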
The DR Runbook
The informal “here’s how to restore” notes became a formal DR-RUNBOOK.md with step-by-step procedures for six scenarios:
- Full server loss — Start from fresh VPS, restore everything
- Data corruption — Restore files while preserving infrastructure
- Credential compromise — Rotate secrets, restore from known-good state
- Partial failure — Restore specific components
- Region failure — Bring up in alternate region
- Gradual degradation — Identify and fix root cause before restore
Each scenario has prerequisites, procedures, verification steps, and rollback instructions. Written for someone with no prior context — because during an actual incident, you’re not thinking clearly.
The Heartbeat Protocol
Proactive health checking prevents problems from becoming emergencies. Every few hours, a heartbeat job runs:
```
┌─────────────────────────────────────┐
│ HEARTBEAT CHECK                     │
├─────────────────────────────────────┤
│ ✓ Disk usage: 42% (threshold: 80%)  │
│ ✓ Gateway: responding               │
│ ✓ Last backup: 3 hours ago          │
│ ✓ Cron health: 47/47 jobs OK        │
│ ✓ Stale sessions: 0                 │
│ ✓ Memory size: 12KB (limit: 50KB)   │
└─────────────────────────────────────┘
```
The key principle: auto-fix before reporting. If disk usage is high, clean temp files before alerting. If a session is stale, terminate it before flagging. Only surface problems that need human judgment.
This dramatically reduces alert fatigue. The alerts I do receive are genuinely important.
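The auto-fix-before-reporting pattern boils down to one control flow, sketched here in Python with hypothetical check/fix callables:

```python
from typing import Callable, Optional

def run_check(check: Callable[[], bool],
              fix: Optional[Callable[[], None]] = None) -> str:
    """Auto-fix before reporting: try the cheap remediation first,
    and only alert if the check still fails afterwards."""
    if check():
        return "ok"
    if fix is not None:
        fix()
        if check():
            return "auto-fixed"  # resolved silently, no alert sent
    return "alert"  # needs human judgment
```

A disk-usage check, for example, would pass a `fix` that cleans temp files; only a failure that survives the cleanup reaches a human.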
Feedback Loops: Learning from Failures
The feedback system deserves its own section because it’s fundamentally changed how the agents operate.
The Three-Strike Rule
When a pattern fails twice, automated prevention kicks in. Three failures trigger a complete strategy change.
Example: An agent kept trying to process large files in memory, causing crashes. After two OOM errors, the feedback system added a rule to stream large files. The pattern stopped.
A different example: Retry logic was hammering a rate-limited API. First failure: backoff. Second failure: longer backoff. Third failure: strategy change — switch to batch processing with scheduled windows.
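The escalation ladder itself is just a per-pattern counter. A minimal sketch (the action names are illustrative labels, not the system's actual API):

```python
from collections import Counter

class StrikeTracker:
    """Escalate as the same failure pattern repeats:
    1st strike -> retry, 2nd -> add a prevention rule,
    3rd -> change strategy entirely."""

    def __init__(self):
        self.strikes = Counter()

    def record(self, pattern: str) -> str:
        self.strikes[pattern] += 1
        n = self.strikes[pattern]
        if n == 1:
            return "retry"
        if n == 2:
            return "add-feedback-rule"
        return "change-strategy"
```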
Cross-Agent Learning
Because all agents read the feedback file, lessons propagate automatically:
- Security agent discovers that a specific auth pattern causes lockouts
- Adds rule to feedback.md
- Next day, code agent reads the rule
- Code agent avoids the problematic pattern before encountering it
This is institutional knowledge without meetings. The system learns as a whole.
Feedback Hygiene
Not all feedback is valuable forever. Quarterly, I review the feedback file:
- Obsolete rules get archived (the underlying issue was fixed)
- Overly specific rules get generalized
- Conflicting rules get reconciled
- Frequently triggered rules might indicate deeper problems
The file stays focused on active, relevant guidance.
What I’d Do Differently
If I were starting over:
Start with structure. The flat MEMORY.md file was a mistake. Even with one agent, separate directories for facts, dailies, and feedback would have saved refactoring time.
Namespace from day one. Multi-agent support was an afterthought. Building in agent namespacing from the start would have prevented merge conflicts.
Automate DR testing immediately. Manual DR tests get skipped. Automated tests run reliably. The investment pays off quickly.
Size limits early. Without limits, memory files grow until they cause problems. Set limits upfront: max file sizes, max total memory, automatic archival thresholds.
Treat memory as infrastructure. Memory isn’t a feature — it’s infrastructure. Give it the same attention you’d give databases, networking, or compute. It fails in the same ways and needs the same operational discipline.
Conclusion
Two months transformed a flat file into an architecture. Single-agent memory became multi-agent coordination. Manual maintenance became automated self-healing. Ad-hoc DR became tested runbooks.
The core insight: AI agent memory is a distributed systems problem. Consistency, availability, partition tolerance — the same tradeoffs apply. Treat it that way and familiar patterns emerge. Ignore it and you get what I started with: a monster file that nobody can navigate.
The system isn’t done. Future work includes semantic search integration (querying by meaning, not just keywords), automatic fact extraction from daily notes, and multi-region memory replication. But the foundation is solid enough that these additions feel like extensions rather than rewrites.
If you’re running AI agents and thinking about memory, start with structure. Your future self will thank you.
This is Part 3 of an ongoing series on running AI agents in production. Part 2 covers the initial backup and disaster recovery setup. Part 1 introduces the multi-agent architecture.