OpenClaw Part 3: How My AI Agent's Memory Evolved Over Two Months
From flat files to tiered memory architecture — how two months of running AI agents taught me that memory isn't a feature, it's infrastructure. Includes multi-agent sharing, self-healing systems, and automated DR.
In Part 2, I built a memory system with three-tier backups and a twenty-minute disaster recovery target. That system worked — right up until it didn’t scale.
Two months later, the original MEMORY.md file had become a monster. Hundreds of lines, mixing everything from user preferences to daily task logs to lessons learned from failures. Finding anything useful meant scrolling through a wall of text. Worse, I’d started running multiple specialized agents sharing the same workspace. The flat file approach was actively working against me.
This post covers what I learned scaling from one agent with a simple memory file to eight agents with a structured memory architecture, automated maintenance, and self-healing systems.
The Problem with Flat Memory
The original design was simple: one MEMORY.md file containing everything the agent needed to remember. Each session, the agent would read it, do its work, and append anything worth keeping.
This fell apart in three ways:
Token inflation. As the file grew, more tokens were consumed just loading context. Beyond a certain size, the agent started forgetting things that were technically in the file — they’d been pushed out of the attention window by sheer volume.
Search friction. Looking for “what did we decide about backups?” required reading through unrelated notes about preferences, project status, and random observations. No structure meant no efficient retrieval.
Multi-agent conflicts. When I added specialized agents (one for security, one for content, one for code review), they were all writing to the same file. Merge conflicts, duplicated information, contradictory entries.
The flat file had become a liability. I needed architecture.
Tiered Memory Architecture
The solution was decomposing the monolithic file into purpose-specific directories. Each type of information gets its own home:
Facts Directory
The facts/ directory contains atomic knowledge files — things that are definitively true and rarely change:
- Agent profiles — what each specialized agent does, its boundaries, its communication style
- Lessons learned — patterns that failed, approaches that worked
- Preferences — user settings, formatting choices, working hours
- Task patterns — recurring workflows and their optimal execution
Each fact lives in its own file. Adding a new lesson doesn’t require editing a massive document — just create a new file. This also enables selective loading: a code review agent doesn’t need the content calendar preferences.
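To make selective loading concrete, here's a minimal Python sketch. The `role--topic.md` naming convention is a hypothetical illustration, not the post's actual layout:

```python
from pathlib import Path

def load_facts(facts_dir: str, roles: set) -> list:
    """Load only the fact files relevant to an agent's roles.

    Assumes a hypothetical 'role--topic.md' naming convention,
    e.g. code--retry-patterns.md; 'all--' files go to every agent.
    """
    loaded = []
    for path in sorted(Path(facts_dir).glob("*.md")):
        role = path.name.split("--", 1)[0]
        if role in roles or role == "all":
            loaded.append(path.read_text())
    return loaded
```

With this, the code review agent asks for `{"code"}` and never pays tokens for the content calendar.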
Daily Notes
Daily session logs moved to memory/YYYY-MM-DD.md files. Raw observations, decisions, errors encountered. This is the episodic layer — useful for context, but not curated.
Keeping dailies separate means:
- Easy date-based retrieval (“what happened last Tuesday?”)
- Automatic cleanup — files older than thirty days get compressed
- No pollution of long-term memory with transient details
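Because the date is in the filename, selecting dailies for cleanup doesn't even require stat calls. A minimal sketch (the thirty-day default and `-{agent}` suffix handling are taken from the scheme above):

```python
from datetime import date, timedelta
from pathlib import Path
from typing import Optional

def dailies_to_archive(memory_dir: str, keep_days: int = 30,
                       today: Optional[date] = None) -> list:
    """Return daily note files older than keep_days, judged by the
    YYYY-MM-DD prefix of the filename rather than filesystem mtime."""
    today = today or date.today()
    cutoff = today - timedelta(days=keep_days)
    old = []
    for path in Path(memory_dir).glob("*-*-*.md"):
        try:
            day = date.fromisoformat(path.stem[:10])
        except ValueError:
            continue  # not a daily note
        if day < cutoff:
            old.append(path)
    return sorted(old)
```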
Feedback System
Perhaps the most valuable addition: a shared feedback.md file that captures rules learned from mistakes.
When an agent makes an error, it doesn’t just fix the immediate problem. It adds a rule to the feedback file explaining what went wrong and how to prevent it. Every agent reads this file at session start.
The effect is institutional knowledge. Agent A makes a mistake on Monday. Agent B inherits that lesson on Tuesday. No one repeats the same error twice.
Example entries:
```markdown
## Rate Limiting
- Never send more than 5 API calls per minute to external services
- Learned: 2026-02-15 (caused 429 errors during batch processing)

## Timezone Handling
- Always confirm user timezone before scheduling
- Learned: 2026-02-23 (sent reminder at 3am user time)

## Large File Handling
- Stream files over 10MB instead of loading into memory
- Learned: 2026-03-08 (caused OOM crash)
```
Specialized Tracking
Certain information types warrant their own files:
- holds.md — tasks paused pending external input (“waiting for API key from vendor”)
- friction.md — recurring annoyances worth fixing (“this manual step takes too long”)
- predictions.md — guesses made that should be validated later
These aren’t core memory, but they prevent things from falling through cracks.
Multi-Agent Memory Sharing
With eight specialized agents running in the same workspace, memory becomes a shared resource. The architecture needed to handle concurrent access without conflicts.
Agent Specialization
Each agent has a defined role:
| Agent | Focus | Memory Access |
|---|---|---|
| Main | General assistance, coordination | Full read/write |
| Security | Vulnerability scanning, hardening | Read facts, write security logs |
| Content | Writing, editing, publishing | Read preferences, write drafts |
| Research | Information gathering, summarization | Read all, write findings |
| Code | Development, review, debugging | Read patterns, write technical notes |
| Monitor | Health checks, alerting | Read thresholds, write metrics |
| Backup | DR testing, restoration | Full read, write backup logs |
| Cron | Scheduled task management | Read schedules, write execution logs |
Inbox System
When agents need to communicate asynchronously, they drop messages in inbox/. Each agent checks its inbox at session start and clears handled messages.
```markdown
## inbox/code-agent.md

From: security-agent
Date: 2026-03-28
Subject: Dependency vulnerability

Found CVE-2026-1234 in package foo-lib.
Priority: High
Action needed: Upgrade to version 2.3.1 or later
```
This prevents agents from blocking on each other while ensuring important information gets delivered.
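The producer-consumer mechanics are small enough to sketch in a few lines of Python (file names and the one-file-per-recipient layout follow the example above; everything else is an assumption):

```python
from pathlib import Path

def send_message(inbox_dir: str, recipient: str, body: str) -> None:
    """Producer: append a message to the recipient's inbox file."""
    inbox = Path(inbox_dir) / f"{recipient}.md"
    with inbox.open("a") as f:
        f.write(body.rstrip() + "\n\n")

def drain_inbox(inbox_dir: str, agent: str) -> str:
    """Consumer: read all pending messages, then clear the file."""
    inbox = Path(inbox_dir) / f"{agent}.md"
    if not inbox.exists():
        return ""
    messages = inbox.read_text()
    inbox.unlink()  # delete after processing, per the producer-consumer rule
    return messages
```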
Conflict Resolution
With multiple writers, conflicts are inevitable. The rules are simple:
- Facts are append-only. New information creates new files, never overwrites existing facts.
- Dailies are agent-namespaced. Each agent writes to memory/YYYY-MM-DD-{agent}.md.
- Feedback is append-only with dedup. Before adding a rule, check if it already exists.
- Inbox is producer-consumer. Writer creates, reader deletes after processing.
These constraints eliminate most conflicts without requiring locking mechanisms.
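The append-with-dedup rule for feedback, for instance, reduces to a few lines. A minimal sketch, assuming one rule per line (the real file groups rules under headings):

```python
from pathlib import Path

def append_feedback_rule(feedback_file: str, rule: str) -> bool:
    """Append a rule to feedback.md unless an identical line exists.
    Returns True if the rule was added."""
    path = Path(feedback_file)
    existing = path.read_text().splitlines() if path.exists() else []
    if rule.strip() in (line.strip() for line in existing):
        return False  # duplicate: stay append-only, but deduplicated
    with path.open("a") as f:
        f.write(rule.strip() + "\n")
    return True
```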
Automated Maintenance
Memory without maintenance decays. Old information becomes stale. Files accumulate. Performance degrades. I built several automated systems to keep things healthy.
Drift Monitor
A weekly script compares current configuration against a known-good baseline:
```bash
# Pseudo-code for drift detection
baseline_hash=$(cat baseline-config.hash)
current_hash=$(compute_config_hash)

if [ "$baseline_hash" != "$current_hash" ]; then
  notify "Configuration drift detected"
  generate_diff_report
fi
```
Drift isn’t always bad — sometimes configuration should change. But unintentional drift is a leading cause of “it worked last week” failures. The monitor surfaces changes so I can decide if they’re intentional.
Idempotency Guard
Automation scripts sometimes run twice (cron hiccups, manual re-runs, retry logic). Non-idempotent operations cause problems: duplicate emails, double database entries, repeated notifications.
The idempotency guard wraps sensitive operations:
```python
import os
import time
from typing import Callable, Optional

LOCK_TTL_SECONDS = 3600  # treat locks older than an hour as expired

def with_idempotency(operation_id: str, fn: Callable) -> Optional[object]:
    lock_file = f"/tmp/idem-{operation_id}.lock"
    if os.path.exists(lock_file):
        if time.time() - os.path.getmtime(lock_file) < LOCK_TTL_SECONDS:
            return None  # already ran recently, skip
        os.remove(lock_file)  # stale lock, clear it
    result = fn()  # if fn() raises, no lock is created, so a retry can run
    with open(lock_file, "w"):
        pass  # touch the lock; its mtime starts the TTL clock
    return result
```
This is defense in depth. Individual operations should be idempotent anyway, but the wrapper catches cases where they’re not.
Cron Self-Healing
Scheduled tasks fail. Networks have blips. Services restart. The question is what happens next.
Traditional cron failure handling: log the error, maybe send an alert, require manual intervention. This doesn’t scale when you have dozens of scheduled tasks.
The auto-triage system handles common failure patterns automatically:
Timeout failures get automatic timeout bumps. If a backup job timed out at 5 minutes, the next run gets 10 minutes. The system tracks whether the bump helped.
Transient failures (network errors, rate limits, temporary service outages) get automatic retry with exponential backoff. Three consecutive transient failures trigger an alert.
Persistent failures (auth errors, missing dependencies) get disabled and escalated. No point retrying something that won’t work.
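The three rules above can be sketched as a single dispatch function. This is an illustrative reduction, assuming the failure kind comes from an upstream log classifier (which is out of scope here):

```python
def triage(failure_kind: str, consecutive_failures: int, timeout_s: int) -> dict:
    """Map a cron failure to a remediation, mirroring the rules above.

    failure_kind is one of 'timeout' | 'transient' | 'persistent',
    assumed to be produced by an upstream log classifier.
    """
    if failure_kind == "timeout":
        return {"action": "retry", "timeout_s": timeout_s * 2}  # bump the budget
    if failure_kind == "transient":
        if consecutive_failures >= 3:
            return {"action": "alert"}  # three strikes on transients
        # exponential backoff: 60s, 120s, ...
        return {"action": "retry", "delay_s": 60 * 2 ** (consecutive_failures - 1)}
    return {"action": "disable_and_escalate"}  # no point retrying auth failures
```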
The deeper analysis script (cron-investigate.sh) handles edge cases: prompt bugs that cause parsing errors, double-send detection, resource exhaustion patterns. It runs weekly and produces a report of anything that needs human attention.
Memory Compression
Daily files older than thirty days undergo compression. The raw content gets summarized into key facts, which are merged into facts/compressed-daily-{month}.md. The original file is then archived.
This prevents unbounded growth while preserving important information. A month’s worth of daily logs might be 50KB; the compressed version is typically under 2KB.
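A minimal sketch of the merge step, with one loud caveat: the "keep bullet lines" extraction below is a crude stand-in for summarization, which the real system would delegate to the agent itself:

```python
from pathlib import Path

def compress_daily(daily_path: str, facts_dir: str) -> Path:
    """Fold a daily note into the monthly compressed-daily file.

    Extraction is a placeholder (bullet lines only); real
    summarization would be done by an agent, not string filtering.
    """
    daily = Path(daily_path)
    month = daily.stem[:7]  # YYYY-MM from YYYY-MM-DD.md
    target = Path(facts_dir) / f"compressed-daily-{month}.md"
    key_lines = [l for l in daily.read_text().splitlines() if l.startswith("- ")]
    with target.open("a") as f:
        f.write(f"## {daily.stem}\n" + "\n".join(key_lines) + "\n")
    return target
```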
Disaster Recovery Evolution
Part 2’s DR system was solid but manual. You had to run the restore script and walk through prompts. It worked, but it required human presence.
Two months later, the DR system has evolved:
Automated Testing
A weekly cron job runs a non-destructive DR test:
- Verify backup freshness — Most recent backup should be <24 hours old
- Test archive integrity — Uncompress and validate checksums
- Simulate restore — Extract to a temporary directory, verify file counts
- Check credential separation — Confirm secrets aren’t in main backup
- Validate restoration order — Ensure no circular dependencies
The test produces a pass/fail report. Failures generate alerts. I haven’t manually tested DR in weeks because the automated tests catch problems first.
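The freshness check (step 1) is representative of how small each test is. A sketch, assuming archives are `.tar.gz` files in a single backup directory (the actual layout may differ):

```python
import time
from pathlib import Path
from typing import Optional

def backup_is_fresh(backup_dir: str, max_age_hours: float = 24,
                    now: Optional[float] = None) -> bool:
    """Step 1 of the weekly DR test: the newest archive must be
    younger than max_age_hours."""
    now = time.time() if now is None else now
    archives = list(Path(backup_dir).glob("*.tar.gz"))
    if not archives:
        return False  # no backups at all is an automatic failure
    newest = max(p.stat().st_mtime for p in archives)
    return (now - newest) < max_age_hours * 3600
```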
Security Verification
Post-restore security is a separate concern from data restoration. A dedicated script verifies:
- SSH key fingerprints match expected values
- No unexpected ports are listening
- Firewall rules match baseline
- Service accounts have correct permissions
- No world-readable sensitive files
This runs automatically after any restore and can be triggered manually for peace of mind.
Integrity and Provenance
Every backup now includes:
- integrity-hashes.json — SHA256 hashes of critical files
- provenance-log.jsonl — Chain of custody: who backed up, when, from what state
If a restore produces unexpected hashes, something changed between backup and restore. The provenance log helps identify where.
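Generating the hash manifest is the simple half of this. A minimal sketch of writing integrity-hashes.json (the list of critical files is whatever your backup config names):

```python
import hashlib
import json
from pathlib import Path

def write_integrity_hashes(backup_dir: str, critical_files: list) -> Path:
    """Record SHA256 hashes of critical files as integrity-hashes.json.
    A post-restore check recomputes and compares against this manifest."""
    hashes = {}
    for name in critical_files:
        data = (Path(backup_dir) / name).read_bytes()
        hashes[name] = hashlib.sha256(data).hexdigest()
    out = Path(backup_dir) / "integrity-hashes.json"
    out.write_text(json.dumps(hashes, indent=2, sort_keys=True))
    return out
```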
The DR Runbook
The informal “here’s how to restore” notes became a formal DR-RUNBOOK.md with step-by-step procedures for six scenarios:
- Full server loss — Start from fresh VPS, restore everything
- Data corruption — Restore files while preserving infrastructure
- Credential compromise — Rotate secrets, restore from known-good state
- Partial failure — Restore specific components
- Region failure — Bring up in alternate region
- Gradual degradation — Identify and fix root cause before restore
Each scenario has prerequisites, procedures, verification steps, and rollback instructions. Written for someone with no prior context — because during an actual incident, you’re not thinking clearly.
The Heartbeat Protocol
Proactive health checking prevents problems from becoming emergencies. Every few hours, a heartbeat job runs:
```
┌─────────────────────────────────────┐
│ HEARTBEAT CHECK                     │
├─────────────────────────────────────┤
│ ✓ Disk usage: 42% (threshold: 80%)  │
│ ✓ Gateway: responding               │
│ ✓ Last backup: 3 hours ago          │
│ ✓ Cron health: 47/47 jobs OK        │
│ ✓ Stale sessions: 0                 │
│ ✓ Memory size: 12KB (limit: 50KB)   │
└─────────────────────────────────────┘
```
The key principle: auto-fix before reporting. If disk usage is high, clean temp files before alerting. If a session is stale, terminate it before flagging. Only surface problems that need human judgment.
This dramatically reduces alert fatigue. The alerts I do receive are genuinely important.
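The auto-fix-before-reporting pattern boils down to one control flow, sketched here in Python with hypothetical check/fix callables:

```python
from typing import Callable, Optional

def run_check(check: Callable[[], bool],
              fix: Optional[Callable[[], None]] = None) -> str:
    """Auto-fix before reporting: try the cheap remediation first,
    and only alert if the check still fails afterwards."""
    if check():
        return "ok"
    if fix is not None:
        fix()
        if check():
            return "auto-fixed"  # resolved silently, no alert sent
    return "alert"  # needs human judgment
```

A disk-usage check, for example, would pass a `fix` that cleans temp files; only a failure that survives the cleanup reaches a human.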
Feedback Loops: Learning from Failures
The feedback system deserves its own section because it’s fundamentally changed how the agents operate.
The Three-Strike Rule
When a pattern fails twice, automated prevention kicks in. Three failures trigger a complete strategy change.
Example: An agent kept trying to process large files in memory, causing crashes. After two OOM errors, the feedback system added a rule to stream large files. The pattern stopped.
A different example: Retry logic was hammering a rate-limited API. First failure: backoff. Second failure: longer backoff. Third failure: strategy change — switch to batch processing with scheduled windows.
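The escalation ladder itself is just a per-pattern counter. A minimal sketch (the action names are illustrative labels, not the system's actual API):

```python
from collections import Counter

class StrikeTracker:
    """Escalate as the same failure pattern repeats:
    1st strike -> retry, 2nd -> add a prevention rule,
    3rd -> change strategy entirely."""

    def __init__(self):
        self.strikes = Counter()

    def record(self, pattern: str) -> str:
        self.strikes[pattern] += 1
        n = self.strikes[pattern]
        if n == 1:
            return "retry"
        if n == 2:
            return "add-feedback-rule"
        return "change-strategy"
```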
Cross-Agent Learning
Because all agents read the feedback file, lessons propagate automatically:
- Security agent discovers that a specific auth pattern causes lockouts
- Adds rule to feedback.md
- Next day, code agent reads the rule
- Code agent avoids the problematic pattern before encountering it
This is institutional knowledge without meetings. The system learns as a whole.
Feedback Hygiene
Not all feedback is valuable forever. Quarterly, I review the feedback file:
- Obsolete rules get archived (the underlying issue was fixed)
- Overly specific rules get generalized
- Conflicting rules get reconciled
- Frequently triggered rules might indicate deeper problems
The file stays focused on active, relevant guidance.
What I’d Do Differently
If I were starting over:
Start with structure. The flat MEMORY.md file was a mistake. Even with one agent, separate directories for facts, dailies, and feedback would have saved refactoring time.
Namespace from day one. Multi-agent support was an afterthought. Building in agent namespacing from the start would have prevented merge conflicts.
Automate DR testing immediately. Manual DR tests get skipped. Automated tests run reliably. The investment pays off quickly.
Size limits early. Without limits, memory files grow until they cause problems. Set limits upfront: max file sizes, max total memory, automatic archival thresholds.
Treat memory as infrastructure. Memory isn’t a feature — it’s infrastructure. Give it the same attention you’d give databases, networking, or compute. It fails in the same ways and needs the same operational discipline.
Conclusion
Two months transformed a flat file into an architecture. Single-agent memory became multi-agent coordination. Manual maintenance became automated self-healing. Ad-hoc DR became tested runbooks.
The core insight: AI agent memory is a distributed systems problem. Consistency, availability, partition tolerance — the same tradeoffs apply. Treat it that way and familiar patterns emerge. Ignore it and you get what I started with: a monster file that nobody can navigate.
The system isn’t done. Future work includes semantic search integration (querying by meaning, not just keywords), automatic fact extraction from daily notes, and multi-region memory replication. But the foundation is solid enough that these additions feel like extensions rather than rewrites.
If you’re running AI agents and thinking about memory, start with structure. Your future self will thank you.
This is Part 3 of an ongoing series on running AI agents in production. Part 2 covers the initial backup and disaster recovery setup. Part 1 introduces the multi-agent architecture.