Agent Memory: System Characterization and Implications for Long-Horizon Workloads · Yasmine Omri
2026-06-06 · A faithful, transcript-grounded reading by PodLens
Source paper: https://arxiv.org/pdf/2606.06448
Agent Memory · System-level Characterization · Prefill Overhead · Memory Construction · Freshness and Latency · Energy Overhead
What This Paper Is About
This paper presents the first system-level characterization of Agent Memory Systems for Large Language Models (LLMs). As LLM agents are increasingly deployed in long-horizon tasks that require continuous reasoning, they need to persistently store, retrieve, and update their own memories across multiple sessions. Although various agent memory system designs exist, their system-level behaviors and computational overheads have remained uncharacterized. The authors, Yasmine Omri, Ziyu Gan, et al., establish a systematic taxonomy, build a stage-aware profiling harness, and conduct a system-level evaluation of ten representative memory systems on the MemoryAgentBench and MemoryArena benchmarks. The study finds that, for systems that use an LLM to build memory, construction overhead dominates the agent's lifecycle and competes for compute with latency-sensitive question-answering (QA) serving. Finally, the paper proposes ten system recommendations covering agent memory serving architecture, scheduling, and system selection.
Paper Skeleton
- Research Background and Problem Definition: The state accumulated by LLM agents in long-horizon tasks far exceeds the upper limit of what a single inference context can hold. The traditional approach of keeping the entire history in-context faces three core limitations: limited context budgets, quadratic growth in Prefill overhead, and degraded inference fidelity in long sequences (the U-curve effect). External memory systems decouple storage capacity from context length by persisting state to external databases. (1. Introduction · "Realizing this at scale requires agents to")
- Core Taxonomy: The paper proposes a taxonomy that classifies agent memory systems into four paradigms:
- Paradigm I: Long-context memory: Performs no memory construction and directly passes the complete interaction history as a prompt to the model. (2.2. Taxonomy of Agent Memory Paradigms · "performs no memory construction and stores no")
- Paradigm II: Flat RAG memory: Applies a deterministic chunking and indexing pipeline (such as BM25 or EmbedRAG) without calling an LLM for construction, supporting lexical or dense retrieval, and is append-only. (2.2. Taxonomy of Agent Memory Paradigms · "applies a deterministic indexing pipeline")
- Paradigm III: Structure-augmented RAG memory: Uses an LLM as a fixed extractor to pull facts, summaries, entities, or relation triples out of the interaction stream. It is divided into append-only systems (such as GraphRAG, HippoRAG v2) and consolidating systems (such as Mem0, SimpleMem), which update existing records through ADD/UPDATE/DELETE operations. (2.2. Taxonomy of Agent Memory Paradigms · "These systems use an LLM as a fixed extractor")
- Paradigm IV: Agentic control flow: Exposes memory operations as tools or actions to the agent's LLM decision loop, allowing the LLM to autonomously control memory reading, writing, and modification (such as A-Mem, Letta, MIRIX). (2.2. Taxonomy of Agent Memory Paradigms · "Memory access is an action selected by")
- System-level Overhead and Behavioral Characterization:
- Construction Overhead Dominates: In systems that use LLMs for memory processing, memory construction overhead dominates the energy consumption of the entire lifecycle, far exceeding the energy consumption of the query phase. (4.2. Construction Dominates the Agent Lifecycle · "exceeds total query-phase energy across 300")
- Prefill and Embedding Bottlenecks: Memory construction is a high-input, low-output task. Its LLM invocation overhead is almost entirely concentrated in the Prefill phase, with Decode accounting for only a tiny fraction. Furthermore, the embedding traffic generated by different paradigms exhibits contrasting characteristics of bimodal batching versus serialized writes. (4.3. Construction Is an Overwhelmingly Embedding and Prefill-dominated Workload · "it repeatedly reads long chunks or windows and emits compact")
- Construction Model Downward Compatibility: Most systems can save cost by using a smaller LLM during the construction phase. However, for systems that require strict adherence to JSON schemas or tool-calling syntax (such as MIRIX), a weaker model leads to memory store corruption and complete failure. (4.4. Construction-LLM Choice Is Agent memory system-Constrained · "LLM downscaling is a cost lever")
- Research Limitations: The authors acknowledge that the current study is limited to single-node agents. The consistency and coordination requirements of distributed storage in multi-node and multi-agent deployments, as well as the storage and retrieval of multimodal memory (combining images, audio, etc.), remain unresolved challenges. (5. Discussion & Conclusion · "stores that single-node")
Core Arguments List
- The Prefill overhead of full-history context scales quadratically as history accumulates, and there is a risk of losing information in the middle.
- Anchor: 1. Introduction · "prefill costs scale"
- Type: Fact
- External memory systems overcome the system-level limitations of long-context processing by decoupling context length from storage capacity.
- Anchor: 1. Introduction · "decoupling capacity"
- Type: Claim
- For LLM-mediated memory systems, memory construction accounts for the bulk of the energy consumed over the agent's lifecycle, exceeding the energy of the query phase.
- Anchor: 4.2. Construction Dominates the Agent Lifecycle · "exceeds total query-phase energy across 300"
- Type: Fact
- Agent memory construction is inherently a read-heavy, write-light workload dominated by Prefill and Embedding.
- Anchor: 4.3. Construction Is an Overwhelmingly Embedding and Prefill-dominated Workload · "it repeatedly reads long chunks or windows and emits compact"
- Type: Fact
- During parallel serving, the massive Prefill throughput of construction tasks occupies KV-cache headroom and directly competes for resources with low-latency QA queries.
- Anchor: 4.3. Construction Is an Overwhelmingly Embedding and Prefill-dominated Workload · "a large construction prefill job occupies KV-cache headroom and stalls the batch scheduler precisely when a latency-sensitive query arrives."
- Type: Prediction
- Downscaling the LLM during the construction phase is a viable cost-control lever, but how far it can go is bounded by the output format the memory system requires (for example, strict JSON schemas or tool-calling syntax).
- Anchor: 4.4. Construction-LLM Choice Is Agent memory system-Constrained · "LLM downscaling is a cost lever"
- Type: Fact
- No single memory system is optimal across construction overhead, query latency, and task accuracy simultaneously.
- Anchor: 4.5. The Construction–Serve–Accuracy Frontier · "No agent memory system is optimal across"
- Type: Fact
- Under asynchronous scheduling, slow-construction memory systems serve stale memory to the agent because recent writes have not yet been committed; waiting for those writes restores freshness but adds latency, which is the "freshness-latency" tradeoff.
- Anchor: 4.6. Inter-Session Construction Creates a Freshness–Latency Tradeoff · "Under asynchronous scheduling, slow-construction agent memory systems serve"
- Type: Fact
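The freshness-latency argument above can be illustrated with a toy timeline. This is only a sketch, not the paper's profiling harness: the `Session` fields, the single construction worker, and all timings are hypothetical. An asynchronous policy answers at once from whatever has been committed, while a synchronous policy waits for pending construction to finish.

```python
from dataclasses import dataclass

@dataclass
class Session:
    end_time: float       # when the session's raw data becomes available
    build_seconds: float  # hypothetical construction time for this session

def commit_times(sessions):
    """Commit time of each session when one worker builds them in order."""
    times, worker_free = [], 0.0
    for s in sessions:
        start = max(worker_free, s.end_time)
        worker_free = start + s.build_seconds
        times.append(worker_free)
    return times

def serve_async(sessions, query_time):
    """Answer immediately; return (added_wait, number_of_stale_sessions)."""
    commits = commit_times(sessions)
    stale = sum(1 for s, c in zip(sessions, commits)
                if s.end_time <= query_time < c)
    return 0.0, stale

def serve_sync(sessions, query_time):
    """Block until every session that ended before the query is committed."""
    commits = commit_times(sessions)
    pending = [c for s, c in zip(sessions, commits) if s.end_time <= query_time]
    wait = max(0.0, max(pending, default=query_time) - query_time)
    return wait, 0

sessions = [Session(0.0, 30.0), Session(10.0, 30.0), Session(20.0, 30.0)]
print(serve_async(sessions, query_time=25.0))  # no wait, but stale sessions
print(serve_sync(sessions, query_time=25.0))   # fresh, but the query waits
```

With these made-up numbers the asynchronous query sees none of the three finished sessions, while the synchronous query waits for the whole backlog; slower construction widens both gaps.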
Plain English Explanation
We can think of agent memory as a leap from "static document retrieval" to "dynamic, mutable state management." In the past, agents either crammed hundreds of thousands of words of conversation history into the LLM every single time, or they could only search through a set of rigid, never-changing documents, as in traditional RAG. Both approaches come with huge costs: the former gets more and more expensive the longer you chat (Prefill overhead scales quadratically), and the model can lose track of information in the middle of the context; the latter gives the agent no way to record user preferences or to correct old knowledge as new interactions occur.
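A back-of-the-envelope count shows what "quadratic" means here. The sketch below counts only the query-key pairs a causal attention layer scores during Prefill; it ignores the linear-cost parts of the model and any prefix caching, and both token counts are hypothetical rather than figures from the paper.

```python
def causal_attention_pairs(n_tokens: int) -> int:
    """Query-key pairs scored when prefilling n tokens under a causal mask."""
    return n_tokens * (n_tokens + 1) // 2

full_history = 100_000  # hypothetical: whole interaction history in-context
retrieved = 2_000       # hypothetical: a few retrieved memories plus the question

ratio = causal_attention_pairs(full_history) / causal_attention_pairs(retrieved)
print(f"context is {full_history // retrieved}x longer, "
      f"attention work is ~{ratio:.0f}x larger")
```

Under these assumptions, a context that is 50 times longer costs on the order of 50 squared times more attention work, which is the gap external memory tries to avoid paying on every query.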
Agent memory systems were born to resolve this contradiction. They store memories outside the LLM and only "fish out" the most relevant ones when needed, saving a massive amount of GPU computation costs.
However, there is no free lunch. The core finding of this paper is that agent memory systems actually shift the cost from "query time (Read Path)" to "recording time (Write Path)." For example, in a system like Mem0, every time a user says something, it has to call an LLM in the background to distill and refine that sentence into atomic facts, and even perform deduplication and merging with the existing memory store via ADD, UPDATE, or DELETE operations. This "memory construction" process runs quietly in the background, consuming an astonishing amount of power and time—sometimes dozens of times more than the actual Q&A itself.
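To make the ADD / UPDATE / DELETE idea concrete, here is a toy consolidating store. It is not Mem0's API: `decide_operation` is a hand-written rule standing in for the LLM call, the `(subject, attribute)` key scheme is hypothetical, and the `NOOP` outcome is an assumption added so that duplicates have somewhere to go.

```python
from typing import Optional

# Memory store keyed by (subject, attribute); values are the current fact.
Store = dict[tuple[str, str], str]

def decide_operation(store: Store, key: tuple[str, str], value: Optional[str]) -> str:
    """Stand-in for the LLM's consolidation decision (hypothetical rule)."""
    if value is None:
        return "DELETE" if key in store else "NOOP"
    if key not in store:
        return "ADD"
    return "NOOP" if store[key] == value else "UPDATE"

def consolidate(store: Store, key: tuple[str, str], value: Optional[str]) -> str:
    """Apply one extracted fact to the store and report which operation ran."""
    op = decide_operation(store, key, value)
    if op in ("ADD", "UPDATE"):
        store[key] = value
    elif op == "DELETE":
        del store[key]
    return op

store: Store = {}
ops = [
    consolidate(store, ("user", "city"), "Paris"),   # new fact
    consolidate(store, ("user", "city"), "Berlin"),  # contradicts the old fact
    consolidate(store, ("user", "city"), "Berlin"),  # duplicate
    consolidate(store, ("user", "city"), None),      # retraction
]
print(ops, store)
```

In a real consolidating system the decision step is where the background LLM cost is paid: each incoming fact has to be compared against existing memory before anything is written.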
Furthermore, because "memorizing" requires the LLM to repeatedly read a long context and emit just a few short lines of core facts and relationships, it is a high-throughput task that leans heavily on the "Prefill" phase on GPUs. If this background memory construction shares a GPU cluster with foreground chat Q&A, where users are waiting on a latency-sensitive first-token response, the two compete for resources and the user's waiting time can grow substantially.
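The contention described above can be sketched with a deliberately crude model: one server that runs jobs first-in, first-out without preemption. Real serving engines use continuous batching and are far more sophisticated, so treat this only as an illustration of head-of-line blocking; the job names and durations are hypothetical.

```python
def query_latency(jobs, query_name):
    """Completion time minus arrival time for one job on a FIFO,
    non-preemptive server.

    jobs: list of (name, arrival_seconds, service_seconds), sorted by arrival.
    """
    server_free = 0.0
    for name, arrival, service in jobs:
        start = max(server_free, arrival)
        server_free = start + service
        if name == query_name:
            return server_free - arrival
    raise KeyError(query_name)

# Hypothetical numbers: a 20 s construction prefill batch, a 0.5 s user query.
shared = [("construction", 0.0, 20.0), ("query", 1.0, 0.5)]
isolated = [("query", 1.0, 0.5)]  # construction runs on a different pool

print(query_latency(shared, "query"))    # query waits behind construction
print(query_latency(isolated, "query"))  # query is served on arrival
```

Separating the two workloads, or letting queries preempt construction, removes the wait in this toy model; which option is appropriate in practice depends on the serving stack.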
Finally, the paper tells us that choosing a memory paradigm is a multi-dimensional trade-off. If your agent ingests relatively little new conversation but has to answer many questions over that history (a query-heavy workload), then a paradigm that pays its overhead in the construction phase to make each query extremely lightweight is highly cost-effective. But if your agent frequently ingests large amounts of real-time data while users only ask questions occasionally, spending heavily to construct fine-grained memories in real time is largely wasted.
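This trade-off can be written as a simple break-even calculation. The cost model below (a fixed construction cost per ingested unit plus a fixed cost per query) is an assumption made for illustration, and every number is in made-up units; none of it comes from the paper's measurements.

```python
import math

def lifecycle_cost(build_per_ingest, cost_per_query, n_ingests, n_queries):
    """Total cost = construction paid per ingested unit + cost paid per query."""
    return build_per_ingest * n_ingests + cost_per_query * n_queries

def break_even_queries(build_a, query_a, build_b, query_b, n_ingests):
    """Smallest query count at which system B (costlier build, cheaper query)
    is no more expensive than system A. Returns None if B never catches up."""
    if query_a <= query_b:
        return None
    extra_build = (build_b - build_a) * n_ingests
    return max(0, math.ceil(extra_build / (query_a - query_b)))

# Hypothetical units: A = long context (no construction), B = LLM-built memory.
n = break_even_queries(build_a=0.0, query_a=10.0,
                       build_b=50.0, query_b=1.0, n_ingests=90)
print(n)
```

Below the break-even query count the construction-heavy system never recoups what it spent building memory; above it, the cheaper queries pay the construction back.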
Glossary
- Prefill: The initial phase where an LLM processes the input prompt and builds the internal KV Cache state. When agents construct memory, they need to read long texts, and their overhead is almost entirely concentrated here. (1. Introduction · "prefill costs scale")
- Decode: The generation phase where an LLM performs autoregressive computation word-by-word to generate a response. (4.3. Construction Is an Overwhelmingly Embedding and Prefill-dominated Workload · "it repeatedly reads long chunks or windows and emits compact")
- KV Cache: The temporary state of processed context stored in GPU memory to avoid redundant computations. When the context is extremely long, the KV Cache imposes immense memory pressure. (1. Introduction · "reasoning and recall fidelity degrade significantly in long sequences")
- Memory Ingestion: The first stage of agent memory, which determines whether to use a single turn of dialogue, fixed chunks, or a complete session as the basic unit for processing memory. (2.1. Agent Memory Execution Pipeline · "systems can be decomposed into seven stages")
- Flat RAG: An append-only retrieval paradigm that stores raw text directly into a database via chunking, vectorization, or term frequency statistics, without LLM extraction or rewriting. (2.2. Taxonomy of Agent Memory Paradigms · "applies a deterministic indexing pipeline")
- Consolidating Memory: A memory mechanism that not only extracts facts but also dynamically deduplicates, modifies, and deletes/updates persisted old memories based on new interactions. (2.2. Taxonomy of Agent Memory Paradigms · "These systems use an LLM as a fixed extractor")
- Agentic Control Flow: A memory mechanism that exposes memory operations (such as writing notes or searching archives) directly as tools to the LLM, allowing the model to autonomously control and decide when to read and write. (2.2. Taxonomy of Agent Memory Paradigms · "Memory access is an action selected by")
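As a concrete picture of the Flat RAG entry above, the sketch below chunks text deterministically and appends it to an inverted index with no LLM involved. The class and function names are hypothetical, and the scoring is plain term-count overlap, a simplification and not BM25.

```python
from collections import Counter, defaultdict

def chunk_words(text: str, size: int) -> list[str]:
    """Split text into consecutive, non-overlapping chunks of `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

class FlatIndex:
    """Append-only inverted index: term -> {chunk_id: term count}."""

    def __init__(self):
        self.chunks: list[str] = []
        self.postings: dict[str, dict[int, int]] = defaultdict(dict)

    def append(self, chunk: str) -> None:
        chunk_id = len(self.chunks)
        self.chunks.append(chunk)
        for term, count in Counter(chunk.lower().split()).items():
            self.postings[term][chunk_id] = count

    def search(self, query: str, k: int = 1) -> list[str]:
        scores: Counter = Counter()
        for term in set(query.lower().split()):
            for chunk_id, count in self.postings.get(term, {}).items():
                scores[chunk_id] += count
        return [self.chunks[i] for i, _ in scores.most_common(k)]

index = FlatIndex()
for c in chunk_words("alice likes green tea bob prefers black coffee", size=4):
    index.append(c)
print(index.search("who likes tea"))
```

Nothing is ever rewritten or deleted here, which is what makes the paradigm cheap to construct and also why it cannot correct outdated facts on its own.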
Before and After This Paper
- Before This Paper: When optimizing long dialogues for agents, the entire LLM community often defaulted to "long context windows" as the ultimate answer—as long as the model supported a million tokens, they would mindlessly cram everything in. When evaluating memory systems, everyone also focused solely on downstream QA accuracy, with almost no one measuring how much power they actually burned in the background or how long they stalled the GPUs. (1. Introduction · "Realizing this at scale requires agents to")
- After This Paper: This study confronts the long-context default with hard data, pointing out its quadratic Prefill overhead and memory resource constraints. More importantly, it maps out a three-dimensional "construction-query-accuracy" frontier, showing that no memory system is best on all three axes, and provides concrete system metrics and architecture recommendations for deploying agent memory at scale. (4.5. The Construction–Serve–Accuracy Frontier · "No agent memory system is optimal across")
Most Worth-Reading Sections
- Section 1 on the Three Physical Limitations of Long Contexts: (1. Introduction · "reasoning and recall fidelity degrade significantly in long sequences")
- Why it's worth reading: This section clearly and logically explains why we must use external memory systems. Starting from three physical system bottlenecks—the maximum capacity limit, the quadratic growth of Prefill overhead, and the "Lost in the Middle" limitation where LLMs lose information in the middle of long texts—it directly derives the necessity of external memory.
- Section 4.2 on the Discovery of Construction Phase Energy Dominance: (4.2. Construction Dominates the Agent Lifecycle · "exceeds total query-phase energy across 300")
- Why it's worth reading: Using fine-grained energy statistics, the paper reveals a shocking phenomenon: in LLM-mediated memory systems, the energy burned to construct memory dominates by a landslide, far exceeding Q&A retrieval. This completely shifts the conventional mindset of "only focusing on query serving overhead."
- Section 4.3 on the Analysis of Memory Construction Computational Characteristics: (4.3. Construction Is an Overwhelmingly Embedding and Prefill-dominated Workload · "it repeatedly reads long chunks or windows and emits compact")
- Why it's worth reading: This section dissects the computational nature of memory construction in detail: it is a high-input, low-output workload that is "heavy on Prefill, almost zero on Decode." Engineers who want to co-locate construction tasks and real-time dialogue serving on the same hardware cluster can use it as a reference for scheduling decisions.
Resonances with past episodes
- Complement→ The Era of Experience: Reinforcement Learning Beyond Human Data · David Silver
David Silver proposed that future agents will exist in long-term, uninterrupted streams of experience for continuous learning; this study adds quantitative, system-level support, indicating that when an LLM-mediated memory system digests such streams, memory construction dominates the energy consumed over the agent's lifecycle.
This paper: 4.2. Construction Dominates the Agent Lifecycle · "exceeds total query-phase energy across 300" For LLM-mediated memory systems, the energy consumed by memory construction dominates the vast majority of the agent's lifecycle.
Related: Streams · "An experiential agent can continue to learn throughout a lifetime" Agents in the experiential era will exist in long-term, uninterrupted streams of experience, rather than brief, single interaction episodes.
- Corroboration→ How an AI Chip Works from the Bottom Up · Reiner Pope
The memory construction workload is heavily skewed toward the Prefill phase (the work done before the first output token) and vector embedding. At its core it repeatedly reads long contexts and performs dense matrix multiplications, which matches the design strength of systolic arrays: raising the computation-to-communication ratio through local weight reuse.
This paper: 4.3. Construction Is an Overwhelmingly Embedding and Prefill-dominated Workload · "it repeatedly reads long chunks or windows and emits compact" Agent memory construction is inherently a read-heavy, write-light workload dominated by Prefill and Embedding.
Related: [00:25:37 - 00:26:40], [00:29:55 - 00:30:22] Systolic arrays (the technology behind Nvidia's Tensor Cores and Google's TPUs) solve the data movement bottleneck by creating a large, dedicated hardware unit for matrix multiplication. This design improves the computation-to-communication ratio by storing the weight matrix locally and reusing it for many input vectors.
This is one source-grounded reading, not a replacement for the original. Every point is anchored to its source, so you can check it yourself — and corrections are welcome.