Writing / April 2026 / 19 min read

Agentic Memory

How AI agents persist, structure, and manage memory — three philosophical camps, six analytical axes, and twenty-one systems mapped.

What is it?

Agentic memory is the design space concerned with how AI agents persist, structure, retrieve, and reason about information across sessions and over time. It is distinct from the broader question of AI memory (which includes retrieval-augmented generation, context windows, and fine-tuning) because it specifically addresses agents that act autonomously, maintain ongoing relationships with users or environments, and must decide what to remember, what to forget, and what to conclude.

As of early 2026, over twenty distinct systems occupy this space, each making fundamentally different commitments about what memory is. These commitments cluster around three primary concerns — storage, reasoning, and dynamics — that every system must address. Most systems lead with one concern as their primary philosophical commitment, but the most architecturally interesting ones make substantive commitments across two or all three.

Storage-first systems treat memory as organized information. Intelligence lives in the structure (layers, scoping rules, loading conditions) and retrieval mechanisms. Memory is passive until consulted. Examples: z.ai’s layered .md architecture (organization → project → user → local → role-specific memory), Claude Code’s auto-memory (curated markdown files with frontmatter), Graphify (knowledge graphs from mixed-media corpora), Microsoft GraphRAG (hierarchical community summaries — maximum derivation at build time, zero dynamism at runtime).

Reasoning-first systems treat memory not as what was said but as what can be concluded from what was said. Storage is a side effect of reasoning. The key innovation is front-loading inference — whether at write time, build time, or through continuous reasoning loops. Examples: Honcho’s Neuromancer models (atomic conclusions with formal certainty levels), AgentMemory’s four-tier consolidation pipeline (observations → episodes → semantic patterns → procedures), Zep’s bi-temporal knowledge graph engine (Graphiti), Cognee’s ontology-grounded ECL pipeline with feedback-driven optimization.

Dynamics-first systems treat memory as alive. Memories decay, strengthen through co-activation, associate spontaneously, and surface into attention based on relevance rather than explicit query. The interesting question is not what’s stored but what’s active. Examples: MuninnDB (cognitive database using ACT-R decay and Hebbian learning), Supermemory (knowledge graph with time-based forgetting, contradiction resolution, and noise filtering), Mem0 (fact extraction with conflict-driven ADD/UPDATE/DELETE/NOOP operations), SleepGate (trained forgetting gates at the KV cache level with entropy-triggered consolidation).

The most architecturally novel systems span multiple camps. Letta (formerly MemGPT) inverts the typical architecture by making the agent its own memory manager: there is no external memory pipeline, just the agent reasoning about what to persist through visible tool calls, using an OS-inspired metaphor in which the context window is RAM and external storage is disk. Hindsight (Vectorize.io) makes substantive commitments across all three concerns: formal epistemic types (storage), belief revision via its CARA reasoning agent (reasoning), and defeasible opinions with configurable skepticism (dynamics). A-MEM (NeurIPS 2025) implements retroactive recontextualization: inserting a new memory changes the contextual descriptions of existing ones, making the memory network genuinely self-organizing.

Stored memory vs. accessed memory

A boundary worth drawing: most systems in this survey implement stored memory — information the system extracts, persists, and manages over time. Google Gemini’s Personal Intelligence takes a different approach: it performs live retrieval from the user’s Google ecosystem (Gmail, Photos, Maps) at query time, synthesizing answers from external services without persisting anything. Whether this counts as “memory” depends on your definition. This note focuses on stored memory — systems that maintain and manage their own persistent state — while acknowledging that accessed memory (on-demand retrieval from live external sources) is an adjacent and increasingly important pattern.

Key axes of variation

The two primary axes

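The design space is organized by two independent axes: how deeply a system reasons about what it stores, and how dynamic the memory lifecycle is. Plotting the systems on these axes reveals structural clusters and a notable gap.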

The top-right quadrant — deep reasoning combined with dynamic lifecycle — was empty when this note was first written. Three systems now contest it: Hindsight (formal belief networks with defeasible opinions and CARA reasoning, 91.4% on LongMemEval), ReMem (MDP-formulated memory edited during inference), and A-MEM (self-organizing Zettelkasten with retroactive recontextualization). But no system yet fully combines formal probabilistic reasoning (Bayesian belief updating) with continuous biological-style lifecycle (trained forgetting gates, sleep consolidation). The synthesis of Hindsight’s epistemic structure with SleepGate’s biological mechanisms would describe the ideal occupant that doesn’t yet exist.

Abstraction levels

Not all systems operate at the same layer, and comparing across layers can mislead:

Level	What it means	Systems
Architecture	Operates on model internals (KV cache, attention) — requires model-level integration	SleepGate, EM-LLM
Application	Operates on extracted semantic content (facts, triples, notes) — composable and debuggable	Mem0, Honcho, Zep, Hindsight, A-MEM, Cognee, MuninnDB, Supermemory, AgentMemory, Graphify, Rowboat, ReMem, Memary
Framework	Bound to a specific agent orchestration framework	LangMem (LangChain), CrewAI, LlamaIndex
Platform	Opaque to users, deployed at scale — shapes behavior for millions	ChatGPT Memory, Google Gemini, Claude Code

Architecture-level systems (SleepGate, EM-LLM) cannot be bolted onto existing agents. Application-level systems are the composable middle. Platform-level systems make implicit bets that users cannot inspect, even though those bets shape their daily experience.

What is the core unit?

Each system’s fundamental commitment shows up in what it considers the atomic unit of memory:

System	Core unit
z.ai / Claude Code	Markdown file (instruction or learned fact)
Graphify	Node + edge with confidence classification
Rowboat	Obsidian note with wikilinks
Supermemory	Memory chunk with typed relationships
Honcho	Atomic conclusion with certainty level
AgentMemory	Observation that consolidates upward through tiers
MuninnDB	Engram (row with confidence, decay score, associations)
Mem0	Distilled fact with 4-operation mutation logic (ADD/UPDATE/DELETE/NOOP)
Letta	Memory block (named, editable text in context window)
Zep	Temporally-anchored triple with bi-temporal metadata (event time + ingestion time)
Hindsight	Four parallel typed networks: world facts, agent experiences, entity summaries, evolving beliefs
Cognee	Ontology-aligned triple with feedback-derived edge weights
LangMem	Multi-type: profiles (facts), experiences (trajectories), rules (agent instructions)
A-MEM	Zettelkasten note with auto-generated links, keywords, and retroactive context updates
ReMem	Experience triple: intent (task embedding), experience (trajectory), utility (learned Q-value)
SleepGate	KV cache entry with temporal metadata and semantic signature
EM-LLM	Episodic event bounded by surprise-triggered segmentation (Bayesian change-point detection)
Microsoft GraphRAG	Community summary (hierarchically clustered entity relationship summaries)

The choice of unit implies a theory of what’s worth preserving. A system built on markdown files preserves human-readable documents. A system built on atomic conclusions preserves reasoning traces. A system built on engrams preserves activation patterns. A system built on Zettelkasten notes preserves a self-organizing web of associations. You get the memory system your ontology implies.

Hindsight’s four parallel networks are the most philosophically committed: by maintaining separate stores for facts, experiences, entities, and opinions, it makes an explicit claim that these are epistemically different kinds of knowledge that shouldn’t be collapsed into a single representation.

When does reasoning happen?

The timing of reasoning — the moment when raw data becomes structured understanding — has expanded beyond the patterns visible in early systems:

At write time: Honcho runs its Neuromancer model when messages arrive. Mem0 extracts and deduplicates facts against the existing store. Zep’s Graphiti engine performs entity extraction, relationship typing, and community detection on every ingestion. Cognee’s ECL pipeline builds knowledge graph triples. A-MEM generates note attributes and links, then retroactively updates existing memories to reflect new connections. The advantage: by retrieval time, low-level messages have already been distilled. The cost: compute at every write.

Deferred consolidation: AgentMemory stores raw observations first, then consolidates upward through tiers via LLM processing. SleepGate runs consolidation during entropy-triggered “sleep micro-cycles.” Cognee’s Memify pipeline runs background optimization. This amortizes reasoning cost but means recent memories may not yet be fully processed.

At query time: MuninnDB computes relevance scores (decay, Hebbian boost, content match) at read time through its ACTIVATE pipeline. ReMem blends semantic similarity with learned utility scores. The memory itself is relatively static; the intelligence is in the retrieval scoring.

At build time: Graphify, Rowboat, and Microsoft GraphRAG reason once during ingestion/graph construction, then serve the pre-built structure. GraphRAG represents the extreme: maximum reasoning at build time, zero reasoning at query time, zero lifecycle.

Continuously / agent-driven: Letta’s heartbeat mechanism chains multiple memory operations before returning control. ReMem’s think-act-memory-refine loop interleaves task reasoning with memory editing mid-inference. These systems don’t wait for external triggers — the agent autonomously decides when to engage with memory.

Threshold-triggered: The Generative Agents prototype (Park et al., 2023) triggers reflection when accumulated importance scores across recent observations exceed a threshold (~150). The trigger is neither periodic nor event-driven; it is driven by accumulated significance. Most dynamics-first systems in this survey trace design decisions back to this work.
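The accumulated-significance trigger can be sketched in a few lines. This is an illustration only: the class name, the reset-to-zero behavior after a reflection, and the fixed importance values are assumptions of the sketch, not details taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class ReflectionTrigger:
    """Fires when summed importance of recent observations passes a threshold."""

    threshold: float = 150.0   # approximate value quoted in the text
    accumulated: float = 0.0

    def observe(self, importance: float) -> bool:
        """Record one observation; return True if reflection should run now."""
        self.accumulated += importance
        if self.accumulated > self.threshold:
            self.accumulated = 0.0  # assumption: the counter resets after reflecting
            return True
        return False


# Twenty observations, each scored 8 on a 1-10 importance scale.
trigger = ReflectionTrigger()
fired = [trigger.observe(8) for _ in range(20)]
```

With these inputs the running total first passes 150 on the nineteenth observation, so a burst of important events triggers reflection sooner than a long run of trivial ones.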

Sleep/idle time: SleepGate’s micro-cycles activate when internal entropy exceeds a threshold or on periodic timers. It is the only mechanism in the survey that dedicates compute to “thinking about memory” during downtime, directly mirroring biological sleep-dependent memory consolidation.

Never (instructions only): z.ai’s layered .md approach stores human-written rules and agent-learned facts. No autonomous reasoning occurs; the LLM interprets them at load time.

Does forgetting exist?

Most systems only accumulate. The ones that don’t reveal a spectrum from passive to active:

Passive decay — memories fade through disuse:

MuninnDB implements continuous analog decay via the ACT-R formula. Memories fade below retrieval threshold but are never deleted; co-activation (Hebbian learning) can rescue old memories. AgentMemory uses exponential decay with reinforcement: salience * exp(-lambda * deltaT) + reinforcementBoost. SleepGate trains a differentiable forgetting gate that assigns survival probability based on recency, access frequency, and semantic novelty.
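To make the two decay styles concrete, here is a small sketch. The first function transcribes the AgentMemory expression quoted above; the second is the textbook ACT-R base-level activation, ln(sum of t_j^-d) over the times since each past access. Whether MuninnDB uses exactly this form, and the units of time and the value of d, are assumptions of the sketch.

```python
import math


def exponential_salience(salience, decay_rate, elapsed, reinforcement_boost=0.0):
    """salience * exp(-lambda * deltaT) + reinforcementBoost, as quoted in the text."""
    return salience * math.exp(-decay_rate * elapsed) + reinforcement_boost


def actr_base_level(ages, d=0.5):
    """Textbook ACT-R base-level activation.

    ages: time elapsed since each past access (all must be > 0).
    d:    decay exponent; 0.5 is the conventional default (assumed here).
    """
    if not ages or any(age <= 0 for age in ages):
        raise ValueError("ages must be a non-empty list of positive numbers")
    return math.log(sum(age ** -d for age in ages))


# One access long ago versus the same access plus a recent one.
stale = actr_base_level([100.0])
refreshed = actr_base_level([100.0, 1.0])
```

In the ACT-R form nothing is ever deleted: activation only drifts lower with age, and each new access adds a term that lifts it again, which matches the "fade but never delete" behavior described above.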

Reactive forgetting — contradictions or state changes trigger updates:

Mem0’s four-operation logic compares incoming facts with stored facts and executes explicit deletions on contradiction. Supermemory uses three categorical mechanisms: time-based expiry, contradiction resolution, and noise filtering. Honcho consolidates — redundant or contradictory conclusions are reconciled, not deleted.
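The four-operation pattern can be sketched as a decision function over an existing store. Everything here is a stand-in: the Fact shape, the rule that an explicit negation counts as a contradiction, and the rule-based decide function are assumptions of this sketch, and a real system would more plausibly delegate the comparison to a model.

```python
from dataclasses import dataclass
from enum import Enum


class Op(Enum):
    ADD = "ADD"
    UPDATE = "UPDATE"
    DELETE = "DELETE"
    NOOP = "NOOP"


@dataclass(frozen=True)
class Fact:
    subject: str
    attribute: str
    value: str
    negated: bool = False  # hypothetical flag: "this is no longer true"


def decide(store: dict, fact: Fact) -> Op:
    """Pick one of the four operations for an incoming fact."""
    current = store.get((fact.subject, fact.attribute))
    if fact.negated:
        # A negation only matters if it contradicts what is stored.
        return Op.DELETE if current == fact.value else Op.NOOP
    if current is None:
        return Op.ADD
    if current == fact.value:
        return Op.NOOP
    return Op.UPDATE


def apply(store: dict, fact: Fact) -> Op:
    """Decide, mutate the store accordingly, and report what happened."""
    op = decide(store, fact)
    key = (fact.subject, fact.attribute)
    if op in (Op.ADD, Op.UPDATE):
        store[key] = fact.value
    elif op is Op.DELETE:
        del store[key]
    return op


store = {}
history = [
    apply(store, Fact("user", "city", "Paris")),
    apply(store, Fact("user", "city", "Paris")),
    apply(store, Fact("user", "city", "Berlin")),
    apply(store, Fact("user", "city", "Berlin", negated=True)),
]
```

The point of the pattern is that forgetting is never scheduled; it happens only as a side effect of reconciling a new fact against the store.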

Temporal bounding — facts bounded in time, not removed:

Zep marks outdated facts with a valid_to timestamp rather than deleting them. Querying the graph at time T returns only facts valid at T. The full history is preserved, and Zep is the only system in the survey with provably time-bounded fact retrieval.

Active curation — the system reasons about what to keep:

ReMem prunes and reorganizes memory mid-inference as part of its Refine step — forgetting is part of the reasoning process, not a separate lifecycle. A-MEM’s retroactive recontextualization changes the meaning of existing memories rather than removing them. Cognee’s Memify uses user feedback scores (-5 to +5) to directly influence edge weights and survival probability. LangMem’s PromptOptimizer rewrites the agent’s own instructions from accumulated memory — the most radical form of curation, where what changes is not the memory store but the agent itself.

Resource-pressure eviction (not principled forgetting):

Letta’s context compilation summarizes and moves content to cold storage when the FIFO queue fills. EM-LLM displaces older episodes under capacity pressure. These are memory management under resource constraints, not epistemic decisions about what matters.
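A minimal sketch of eviction under capacity pressure, in the spirit of the FIFO behavior described above. The class, the evict-half policy, and the naive string summarizer are all hypothetical; an actual agent framework would summarize with a model and use its own storage layer.

```python
from collections import deque


def naive_summarize(previous, evicted):
    """Stand-in for an LLM summarizer: keeps a short prefix of each message."""
    parts = ([previous] if previous else []) + [m[:20] for m in evicted]
    return " / ".join(parts)


class FifoContext:
    """In-context message queue that spills its oldest entries to cold storage."""

    def __init__(self, capacity, summarize=naive_summarize):
        self.capacity = capacity
        self.summarize = summarize
        self.queue = deque()
        self.summary = ""        # rolling summary that stays in context
        self.cold_storage = []   # full originals, retrievable later

    def append(self, message):
        self.queue.append(message)
        if len(self.queue) > self.capacity:
            self._evict()

    def _evict(self):
        # Assumption: evict the oldest half in one pass rather than one at a time.
        count = max(1, len(self.queue) // 2)
        evicted = [self.queue.popleft() for _ in range(count)]
        self.cold_storage.extend(evicted)
        self.summary = self.summarize(self.summary, evicted)


ctx = FifoContext(capacity=4)
for i in range(6):
    ctx.append(f"msg {i}")
```

Note that nothing in this loop asks whether a message matters. Age alone decides what leaves the context, which is why the text classifies this as resource management rather than principled forgetting.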

Never/manual only: z.ai, Claude Code, Graphify, Rowboat, CrewAI, LlamaIndex.

How is certainty handled?

The treatment of confidence ranges from formal rigor to complete absence:

Approach	Systems
Formal epistemic levels — explicit → deductive → inductive → abductive, with scaffolding constraints	Honcho
Configurable epistemic parameters — skepticism / literalism / empathy scales (1-5) governing how evidence is weighed	Hindsight
Bayesian posterior — 0-1 confidence per entry, updated as evidence changes	MuninnDB
Tiered confidence — EXTRACTED (1.0), INFERRED (0.4-0.9), AMBIGUOUS (0.1-0.3)	Graphify
Learned utility — Q-values from RL training; memories that lead to task success are trusted more	ReMem
Relationship-type encoding — certainty implicit in whether something updates, extends, or derives from another	Supermemory
Importance scoring — LLM-assigned integer (1-10), structural but not formally probabilistic	AgentMemory, CrewAI
Implicit / not at all	Most other systems

Honcho’s strict scaffolding constraint — conclusions can only build upward from higher-certainty premises — remains the most formally rigorous approach. Hindsight’s configurable parameters take a different path: rather than scoring individual facts, they govern the system’s overall epistemic disposition.
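One possible reading of the scaffolding constraint, sketched as a validity check: a conclusion may only rest on premises that are at least as certain as the conclusion itself. The enum ordering follows the four levels listed in the table; the function name and the exact rule (including the treatment of explicit facts as premise-free) are assumptions of this sketch, not Honcho's documented behavior.

```python
from enum import IntEnum


class Certainty(IntEnum):
    # Lower number = higher certainty, following the order given in the text.
    EXPLICIT = 0
    DEDUCTIVE = 1
    INDUCTIVE = 2
    ABDUCTIVE = 3


def is_well_scaffolded(conclusion, premises):
    """Return True if every premise is at least as certain as the conclusion."""
    if conclusion == Certainty.EXPLICIT:
        # Explicit statements are observed directly, so they cite no premises.
        return len(premises) == 0
    return len(premises) > 0 and all(p <= conclusion for p in premises)


ok = is_well_scaffolded(Certainty.DEDUCTIVE, [Certainty.EXPLICIT])
rejected = is_well_scaffolded(Certainty.DEDUCTIVE, [Certainty.ABDUCTIVE])
```

Under this reading a deduction cannot be stacked on a guess, so uncertainty can only stay level or grow as conclusions are chained.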

Temporal awareness

A dimension that separates systems more cleanly than most:

Zep’s bi-temporal model is the most complete implementation: each fact carries both an event timestamp (when it happened in the world) and an ingestion timestamp (when it was recorded), enabling queries like “what did we believe about X as of last Tuesday?” Hindsight’s dual timestamps (occurrence time + mention time) enable similar reasoning about when beliefs were formed versus when they were last referenced.
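A bi-temporal query of the "what did we believe as of last Tuesday?" kind can be sketched with plain records. The field names (valid_from, valid_to, recorded_at, closed_at) are hypothetical and are not Graphiti's schema; the sketch only shows why two time dimensions are needed.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class TemporalFact:
    triple: tuple                          # (subject, predicate, object)
    valid_from: datetime                   # event time: when it became true
    recorded_at: datetime                  # ingestion time: when we learned it
    valid_to: Optional[datetime] = None    # event time: when it stopped being true
    closed_at: Optional[datetime] = None   # ingestion time of that invalidation


def believed(facts, world_time, known_by):
    """Facts held true at world_time, using only what was recorded by known_by."""
    result = []
    for fact in facts:
        if fact.recorded_at > known_by:
            continue  # not yet ingested at that point
        # An invalidation only counts if it had been recorded by known_by.
        seen_close = fact.closed_at is not None and fact.closed_at <= known_by
        end = fact.valid_to if seen_close else None
        if fact.valid_from <= world_time and (end is None or world_time < end):
            result.append(fact)
    return result


paris = TemporalFact(("Alice", "lives_in", "Paris"),
                     valid_from=datetime(2026, 1, 1), recorded_at=datetime(2026, 1, 2),
                     valid_to=datetime(2026, 3, 1), closed_at=datetime(2026, 3, 5))
berlin = TemporalFact(("Alice", "lives_in", "Berlin"),
                      valid_from=datetime(2026, 3, 1), recorded_at=datetime(2026, 3, 5))

before_update = believed([paris, berlin], datetime(2026, 3, 3), known_by=datetime(2026, 3, 4))
after_update = believed([paris, berlin], datetime(2026, 3, 3), known_by=datetime(2026, 3, 6))
```

Both queries ask about the same moment in the world (3 March); they differ only in how much the system had ingested, which is the distinction a single timestamp cannot express.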

Reflexive capacity

How self-aware is the system about its own memory? This was invisible with eight systems but becomes clear at twenty:

A-MEM’s retroactive recontextualization is the only mechanism that explicitly rewrites the contextual descriptions of existing memories when new ones arrive, which is the closest any system comes to tracking how understanding changes over time. LangMem’s PromptOptimizer goes further in a different direction: memory doesn’t just inform retrieval, it rewrites the agent’s own behavioral instructions.
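The shape of retroactive recontextualization can be sketched as an insert that also touches its neighbors. This is a toy under stated assumptions: keyword overlap stands in for embedding similarity, and a string-appending function stands in for the model call that would rewrite a note's context. None of the names come from the A-MEM code.

```python
from dataclasses import dataclass, field


@dataclass
class Note:
    id: str
    text: str
    keywords: set
    context: str
    links: set = field(default_factory=set)


def append_context(existing, incoming):
    """Stand-in for an LLM rewrite of the existing note's context description."""
    return f"{existing.context}; related to: {incoming.text}"


def insert_note(network, new, rewrite_context=append_context, min_overlap=1):
    """Add a note, link it to related notes, and update their context in place."""
    for other in network.values():
        if len(other.keywords & new.keywords) >= min_overlap:
            other.links.add(new.id)
            new.links.add(other.id)
            # The retroactive step: an OLD note changes because a NEW one arrived.
            other.context = rewrite_context(other, new)
    network[new.id] = new


network = {}
insert_note(network, Note("a", "Agents need decay", {"memory", "decay"}, "notes on forgetting"))
insert_note(network, Note("b", "Sourdough starter", {"cooking"}, "kitchen notes"))
insert_note(network, Note("c", "Graphs store memory", {"memory", "graph"}, "notes on storage"))
```

After the third insert, note "a" carries a different context than it was written with, while the unrelated note "b" is untouched.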

What is the relationship between observer and observed?

These systems take fundamentally different epistemic stances about whose reality the memory represents. Two independent choices — does the system assume one shared reality or allow per-perspective realities, and is the memory transparent or opaque to the user — produce a revealing map:

Most systems occupy the top-left: one shared pool of memory, fully inspectable. The interesting outliers are:

Honcho’s perspective-taking is architecturally unique: different peers hold different representations of the same person based only on what they’ve directly observed. Memory is per-relationship, not per-user — a commitment no other system in the survey replicates. Top-right: perspectival but transparent.

Hindsight’s configurable epistemic parameters (skepticism, literalism, empathy) determine how the system relates to incoming information, not just what it stores. Two instances with different parameter settings will develop different memories from the same input. Top-right: the observer’s disposition shapes the observed record, but the mechanism is visible.

ChatGPT Memory’s four-bucket architecture makes an implicit split: some “memory” (model-set context) is never surfaced as explicit entries but shapes behavior — the observer’s relationship to the user is partially opaque even to the user. Bottom-left: shared reality, partially hidden.

The bottom-right quadrant — per-perspective reality that is also opaque — is empty. It would describe a system where different agents hold different views of the same user and neither the user nor the other agents can inspect those views. Whether that’s a gap worth filling or a design to avoid is an open question.

Where the gaps remain

The proliferation of approaches confirms that the field has not converged on what agent memory should be. This is not a maturity problem that will resolve as implementations improve. The systems disagree at the level of philosophy. But the gap landscape has shifted:

The empty quadrant (deep reasoning + dynamic lifecycle): Partially filled. Hindsight, ReMem, and A-MEM each approach it from different angles. The remaining gap is narrower: no system yet combines formal probabilistic reasoning (Bayesian belief updating) with continuous biological-style lifecycle (trained forgetting, sleep consolidation, Hebbian reinforcement). The synthesis would look like Hindsight’s epistemic structure running on SleepGate’s biological mechanisms.

Reflexive Context (how understanding changes over time): Partially addressed. A-MEM’s retroactive recontextualization is the only mechanism that explicitly rewrites the contextual descriptions of existing memories when new ones arrive. Zep’s bi-temporal model tracks when facts were believed but not how the agent’s understanding of them evolved. This gap remains substantially open.

Deliberation as Infrastructure (holding open questions): Remains open. No surveyed system represents unresolved tension or uncertainty as a first-class memory type. Hindsight’s belief/opinion layer comes closest — opinions are explicitly modeled as defeasible — but they are resolved toward a current position rather than held as genuinely open. A system that stores “I am not yet sure whether X” as a retrievable, actionable memory object does not yet exist.

Per-relationship memory: Only Honcho. No new system implements memory indexed by relationship identity rather than user identity.

Multi-agent shared memory: Partially addressed by CrewAI’s shared namespaces and Letta’s shared memory blocks. No production system implements memory consistency protocols for multi-agent shared state.

Sleep/idle consolidation at the application layer: SleepGate addresses this at the KV cache level. Cognee’s Memify background pipeline is the closest application-layer analogue. No system fully implements biological-style consolidation (replay, pruning, abstraction) over semantic memory during idle periods.

Adjacent territory: memory in agent stacks

Most people encounter persistent memory primarily through the default system built into whatever agent stack they use — CrewAI, LangGraph, AutoGen, LlamaIndex, or custom architectures. These framework-level memory subsystems tend to be shallower than dedicated memory products (they make engineering tradeoffs rather than philosophical commitments), but their aggregate impact is large because they set the default for every agent built on the framework.

Mapping agent systems through the lens of their memory approach (how does CrewAI’s role-scoped memory differ from AutoGen’s conversational teachability, or from LangGraph’s state machine persistence?) is a valuable analysis but a distinct one from this note. The foundational prototype for most dynamics-first designs is the Generative Agents architecture (Park et al., 2023), whose tripartite retrieval scoring (a weighted combination of recency, importance, and relevance) and threshold-triggered reflection influenced most subsequent designs.
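A minimal sketch of combining the three retrieval signals, here as a weighted sum. The equal weights, the per-hour decay factor, the division of importance by 10, and the function names are illustrative assumptions of this sketch rather than a transcription of the paper's implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieval_score(hours_since_access, importance, memory_vec, query_vec,
                    weights=(1.0, 1.0, 1.0), decay=0.995):
    """Weighted sum of recency, importance, and relevance, each roughly in [0, 1]."""
    recency = decay ** hours_since_access   # exponential decay since last access
    importance_norm = importance / 10.0     # assumes a 1-10 importance scale
    relevance = cosine(memory_vec, query_vec)
    w_rec, w_imp, w_rel = weights
    return w_rec * recency + w_imp * importance_norm + w_rel * relevance


query = [1.0, 0.0]
fresh = retrieval_score(0, 5, [1.0, 0.0], query)
stale = retrieval_score(100, 5, [1.0, 0.0], query)
```

Because the signals are summed rather than gated, a memory that is old but highly important can still outrank one that is recent but trivial.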

Systems surveyed

System	Camp	Type	Source
z.ai DevPack	Storage-first	Product	Memory Mechanism
Claude Code auto-memory	Storage-first	Platform	Built-in to Claude Code CLI
Graphify	Storage-first (graph)	Library	github.com/safishamsi/graphify
Microsoft GraphRAG	Storage→Reasoning (build-time only)	Framework	github.com/microsoft/graphrag
Rowboat	Reasoning-first (LLM extraction)	Library	github.com/rowboatlabs/rowboat
Supermemory	Hybrid (storage + reasoning + dynamics)	Product	supermemory.ai
Honcho / Neuromancer	Reasoning-first (formal logic)	Product	docs.honcho.dev
AgentMemory	Reasoning-first (tiered consolidation)	Library	github.com/rohitg00/agentmemory
MuninnDB	Dynamics-first (cognitive)	Product	muninndb.com
Mem0	Dynamics-first (conflict-driven)	Product + Library	mem0.ai
Letta	Reasoning/Dynamics hybrid (agent-as-manager)	Framework + Service	letta.com
Zep / Graphiti	Reasoning/Dynamics hybrid (bi-temporal)	Product	getzep.com
Hindsight	All three (belief networks + CARA)	Library	vectorize.io/hindsight
Cognee	Reasoning-first (ontology + feedback)	Library + Service	github.com/topoteretes/cognee
LangMem	Storage-first (procedural rewrite)	Framework SDK	langchain-ai.github.io/langmem
A-MEM	Dynamics/Reasoning hybrid (Zettelkasten)	Research (NeurIPS 2025)	arXiv:2502.12110
ReMem / Evo-Memory	Dynamics/Reasoning hybrid (MDP)	Research (NeurIPS 2025)	arXiv:2511.20857
SleepGate	Dynamics-first (KV-level biological)	Research	arXiv:2603.14517
EM-LLM	Dynamics/Storage hybrid (episodic)	Research	arXiv:2407.09450
ChatGPT Memory	Storage-first (platform-scale)	Platform	openai.com/chatgpt
Google Gemini	Split (stored profile + accessed PI)	Platform	gemini.google.com