Agent Memory: System Characterization and Implications for Long-Horizon Workloads · Yasmine Omri

2026-06-06 · A faithful, transcript-grounded reading by PodLens

Source paper: https://arxiv.org/pdf/2606.06448

Agent Memory · System-level Characterization · Prefill Overhead · Memory Construction · Freshness and Latency · Energy Overhead

What This Paper Is About

This paper presents the first system-level characterization of agent memory systems for large language models (LLMs). As LLM agents are increasingly deployed on long-horizon tasks that require sustained reasoning, they need to persistently store, retrieve, and update their own memories across sessions. Many agent memory designs exist, but their system-level behavior and computational overhead had not been characterized. The authors, Yasmine Omri, Ziyu Gan, et al., establish a systematic taxonomy, build a stage-aware profiling harness, and evaluate ten representative memory systems on the MemoryAgentBench and MemoryArena benchmarks. The study finds that memory construction accounts for most of the cost over the agent's lifecycle and competes for compute with latency-sensitive question-answering (QA) serving. The paper closes with ten system recommendations covering agent memory serving architecture, scheduling, and system selection.

Paper Skeleton

Core Arguments List

  1. The Prefill overhead of full-history context scales quadratically as history accumulates, and information in the middle of the context risks being lost. - Anchor: 1. Introduction · "prefill costs scale" - Type: Fact
  2. External memory systems overcome the system-level limitations of long-context processing by decoupling context length from storage capacity. - Anchor: 1. Introduction · "decoupling capacity" - Type: Claim
  3. For LLM-mediated memory systems, the energy consumed by memory construction dominates the agent's lifecycle. - Anchor: 4.2. Construction Dominates the Agent Lifecycle · "exceeds total query-phase energy across 300" - Type: Fact
  4. Agent memory construction is inherently a read-heavy, write-light workload dominated by Prefill and Embedding. - Anchor: 4.3. Construction Is an Overwhelmingly Embedding and Prefill-dominated Workload · "it repeatedly reads long chunks or windows and emits compact" - Type: Fact
  5. During parallel serving, large construction Prefill jobs occupy KV-cache headroom and compete directly with low-latency QA queries for resources. - Anchor: 4.3. Construction Is an Overwhelmingly Embedding and Prefill-dominated Workload · "a large construction prefill job occupies KV-cache headroom and stalls the batch scheduler precisely when a latency-sensitive query arrives." - Type: Prediction
  6. Downscaling the LLM during the construction phase is a viable cost-control lever, but how far the model can shrink is bounded by the output format the memory algorithm requires. - Anchor: 4.4. Construction-LLM Choice Is Agent memory system-Constrained · "LLM downscaling is a cost lever" - Type: Fact
  7. No single memory system is optimal across construction overhead, query latency, and task accuracy simultaneously. - Anchor: 4.5. The Construction–Serve–Accuracy Frontier · "No agent memory system is optimal across" - Type: Fact
  8. Under asynchronous scheduling, slow-construction memory systems serve stale memory to the agent because recent writes have not yet been committed, which creates a freshness-latency tradeoff. - Anchor: 4.6. Inter-Session Construction Creates a Freshness–Latency Tradeoff · "Under asynchronous scheduling, slow-construction agent memory systems serve" - Type: Fact
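
Argument 8 can be made concrete with a small model. The sketch below is an illustration, not the paper's harness: it assumes a single background worker that processes writes one at a time, a fixed construction time per write, and hypothetical function names. A write counts as stale at query time if it has been issued but not yet committed.

```python
def commit_times(write_times: list[float], construction_s: float) -> list[float]:
    """Commit time of each write when one background worker builds them in order."""
    commits = []
    free_at = 0.0  # time at which the worker finishes its current write
    for issued in sorted(write_times):
        free_at = max(free_at, issued) + construction_s
        commits.append(free_at)
    return commits


def stale_writes(write_times: list[float], construction_s: float, query_time: float) -> int:
    """Number of writes issued by query_time that the query cannot see yet."""
    issued = sorted(write_times)
    commits = commit_times(issued, construction_s)
    return sum(1 for w, c in zip(issued, commits) if w <= query_time < c)


if __name__ == "__main__":
    writes = [0.0, 1.0, 2.0]
    for construction_s in (0.5, 4.0):
        print(construction_s, stale_writes(writes, construction_s, query_time=3.0))
```

In this toy model a fast construction step leaves nothing stale by the time the query arrives, while a slow one leaves every recent write invisible. Blocking the query until the writes commit would restore freshness at the price of latency.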

Plain English Explanation

We can think of agent memory as a leap from "static document retrieval" to "dynamic, mutable state management." Previously, agents either crammed hundreds of thousands of words of conversation history into the LLM on every turn, or, like traditional RAG, could only search a fixed set of documents that never changes. Both approaches are costly: the former gets more expensive the longer the conversation runs (Prefill overhead scales quadratically), and the model can lose information in the middle of the context; the latter cannot record user preferences or correct old knowledge as new interactions occur.
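
To get a feel for the quadratic growth, here is a rough sketch. It assumes that the cost of one Prefill is proportional to the square of the context length (only the attention term is modeled), that every turn adds the same number of tokens, and that nothing is reused between turns. The function names and token counts are illustrative and do not come from the paper; the cost of retrieval itself is left out.

```python
def prefill_cost(context_tokens: int) -> int:
    """Relative cost of prefilling a context, keeping only the quadratic term."""
    return context_tokens ** 2


def full_history_cost(turns: int, tokens_per_turn: int) -> int:
    """Total relative cost when every turn re-reads the whole history so far."""
    return sum(prefill_cost(t * tokens_per_turn) for t in range(1, turns + 1))


def retrieval_cost(turns: int, retrieved_tokens: int) -> int:
    """Total relative cost when every turn reads a fixed-size retrieved context."""
    return turns * prefill_cost(retrieved_tokens)


if __name__ == "__main__":
    for turns in (10, 100, 1000):
        full = full_history_cost(turns, tokens_per_turn=200)
        bounded = retrieval_cost(turns, retrieved_tokens=2000)
        print(f"{turns:>5} turns: full-history / bounded = {full / bounded:.1f}x")
```

Under these assumptions the bounded context costs the same on every turn, while the full-history cost per turn keeps growing, so the gap widens as the conversation gets longer.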

Agent memory systems were born to resolve this contradiction. They store memories outside the LLM and only "fish out" the most relevant ones when needed, which saves a large amount of GPU computation.

However, there is no free lunch. The core finding of this paper is that agent memory systems shift the cost from "query time (Read Path)" to "recording time (Write Path)." For example, in a system like Mem0, every time a user says something, the system calls an LLM in the background to distill that utterance into atomic facts, and then deduplicates and merges them with the existing memory store via ADD, UPDATE, or DELETE operations. This "memory construction" process runs quietly in the background and consumes a surprising amount of energy and time, sometimes dozens of times more than the Q&A itself.
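
The write path described above can be sketched in a few lines. This is a toy illustration, not Mem0's API: `Fact`, `MemoryStore`, and `extract_facts` are hypothetical names, the LLM extraction call is replaced by a trivial `key=value` parser, and the NOOP outcome is an addition of this sketch for facts that change nothing.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Fact:
    key: str               # hypothetical identity of the fact, e.g. "user.city"
    value: Optional[str]   # None means the fact is being retracted


def extract_facts(utterance: str) -> list[Fact]:
    """Stand-in for the LLM extraction call: parses 'key=value' pairs; 'key=' retracts."""
    facts = []
    for part in utterance.split(";"):
        if "=" in part:
            key, value = part.split("=", 1)
            facts.append(Fact(key.strip(), value.strip() or None))
    return facts


@dataclass
class MemoryStore:
    facts: dict[str, str] = field(default_factory=dict)

    def apply(self, fact: Fact) -> str:
        """Reconcile one extracted fact with the store and return the operation taken."""
        if fact.value is None:
            removed = self.facts.pop(fact.key, None)
            return "DELETE" if removed is not None else "NOOP"
        if fact.key not in self.facts:
            self.facts[fact.key] = fact.value
            return "ADD"
        if self.facts[fact.key] != fact.value:
            self.facts[fact.key] = fact.value
            return "UPDATE"
        return "NOOP"  # duplicate of what is already stored


if __name__ == "__main__":
    store = MemoryStore()
    for utterance in ("user.city=Paris; user.diet=vegan", "user.city=Lyon", "user.diet="):
        print([store.apply(f) for f in extract_facts(utterance)])
```

In a real system both the extraction and the reconciliation decision would typically be LLM calls, which is where the construction cost discussed in this section comes from.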

Furthermore, "memorizing" requires the LLM to repeatedly read a long context and emit only a few short lines of core relationships, which makes it a high-throughput, Prefill-heavy workload on the GPU. If this background memory construction shares a GPU cluster with the foreground chat Q&A, where users are waiting for the first token of a latency-sensitive response, the two compete for resources and the user's waiting time can multiply.
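
A deliberately simplified model shows the stall. The sketch below assumes one server that runs jobs to completion in arrival order, which is much cruder than a real batch scheduler; the `Job` type and the timings are hypothetical. It compares query latency when a construction job shares the server with the same queries running alone.

```python
from dataclasses import dataclass


@dataclass
class Job:
    kind: str         # "construction" or "query"
    arrival_s: float  # time the job arrives
    service_s: float  # time the job needs on the server


def fifo_latencies(jobs: list[Job]) -> dict[str, list[float]]:
    """Run jobs on one non-preemptive server in arrival order; return latencies per kind."""
    clock = 0.0
    latencies: dict[str, list[float]] = {}
    for job in sorted(jobs, key=lambda j: j.arrival_s):
        start = max(clock, job.arrival_s)  # wait if the server is still busy
        clock = start + job.service_s
        latencies.setdefault(job.kind, []).append(clock - job.arrival_s)
    return latencies


if __name__ == "__main__":
    shared = [
        Job("construction", 0.0, 8.0),
        Job("query", 1.0, 0.5),
        Job("query", 2.0, 0.5),
    ]
    print("shared:  ", fifo_latencies(shared)["query"])
    print("isolated:", fifo_latencies(shared[1:])["query"])
```

In the shared case each query waits for the whole construction job before it starts, so its latency is dominated by work it has nothing to do with.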

Finally, the paper shows that choosing a memory paradigm is a multi-dimensional trade-off. If your agent ingests little new conversation but queries that history over and over (a high-frequency query workload), a paradigm that pays its overhead in the construction phase to keep queries lightweight is cost-effective. But if your agent frequently ingests large amounts of real-time data while users ask questions only occasionally, paying to construct fine-grained memories in real time is wasteful.
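
This trade-off can be written as a break-even calculation. The sketch below assumes a linear cost model (a fixed cost per ingested item plus a fixed cost per query); `CostProfile` and all the numbers are made up for illustration and are not measurements from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostProfile:
    construct_per_item: float  # cost to ingest one item, in arbitrary units
    serve_per_query: float     # cost to answer one query, in the same units

    def total(self, items: int, queries: int) -> float:
        return self.construct_per_item * items + self.serve_per_query * queries


def break_even_queries_per_item(heavy: CostProfile, light: CostProfile) -> float:
    """Queries per ingested item above which the construction-heavy profile is cheaper."""
    extra_construction = heavy.construct_per_item - light.construct_per_item
    saved_per_query = light.serve_per_query - heavy.serve_per_query
    if saved_per_query <= 0:
        return float("inf")  # heavy construction never pays for itself
    return max(0.0, extra_construction / saved_per_query)


if __name__ == "__main__":
    heavy = CostProfile(construct_per_item=50.0, serve_per_query=1.0)
    light = CostProfile(construct_per_item=5.0, serve_per_query=10.0)
    print("break-even queries per item:", break_even_queries_per_item(heavy, light))
    for items, queries in ((10, 20), (10, 100)):
        print(items, queries, heavy.total(items, queries), light.total(items, queries))
```

Below the break-even ratio the cheap-to-build profile wins; above it the expensive construction is amortized over enough queries to pay off.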

A faithful reading and plain-language retelling of the paper, generated by PodLens.

This is one source-grounded reading, not a replacement for the original. Every point is anchored to its source, so you can check it yourself — and corrections are welcome.