WorldMemArena: Evaluating Multimodal Agent Memory Through Action-World Interaction
Abstract
Multimodal large language models deployed as long-horizon agents need memory systems that track evolving environments and manage information across multiple sessions; WorldMemArena, a new benchmark, reveals the limitations of current approaches.
Multimodal large language models are increasingly deployed as long-horizon agents, where memory must do more than recall: it must track an evolving world, revise what has gone stale, and surface the right evidence at decision time. Existing benchmarks measure recall over static dialogue, collapse memory into a single end-of-task accuracy, and reduce visual observations to captions, leaving us unable to localize failures to writing, maintenance, retrieval, or use. The rise of agent harnesses that author their own memory sharpens this gap, since we have no principled way to compare hand-designed pipelines with self-managing alternatives. To close these gaps, we formulate multimodal agent memory as an Action-World Interaction Loop with an observable four-stage lifecycle, and instantiate it in WorldMemArena: 400 multi-session multimodal tasks spanning Lifelong Evolution (evolving personal and task states) and Agentic Execution (memory from real observations, actions, and feedback), annotated with gold memory points, updates, distractors, and evidence chains for stage-level diagnosis. This enables the first head-to-head comparison of long-context, manually designed (RAG and external memory systems), and harness-based memory agents. Results show that: (1) better memory writing and storage do not guarantee better performance; (2) multimodal memory still struggles to fully use visual evidence; (3) systems are unstable across domains and degrade on realistic agentic trajectories; and (4) harness memory is more flexible but remains costly and less reliable.
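To make the idea of stage-level diagnosis concrete, here is a minimal sketch of how a single task could be scored separately for writing, maintenance, retrieval, and use. Everything in it is an assumption made for illustration: the `GoldAnnotation` and `AgentTrace` classes, their field names, the matching of memory items by string ID, and the five ratios are hypothetical and are not the benchmark's actual schema or metrics.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class GoldAnnotation:
    """Hypothetical gold annotation for one multi-session task."""
    memory_points: set[str]    # items worth writing to memory
    updates: dict[str, str]    # stale item id -> id of the item that replaces it
    distractors: set[str]      # plausible but irrelevant items
    evidence_chain: list[str]  # items needed to answer the final question
    answer: str


@dataclass
class AgentTrace:
    """Hypothetical record of what a memory system did on that task."""
    written: set[str]          # items written at any point
    store_at_query: set[str]   # store contents at decision time
    retrieved: list[str]       # items surfaced for the final question
    answer: str


def ratio(hits: int, total: int) -> float:
    # Report 0.0 when the denominator is empty (a simplifying assumption).
    return hits / total if total else 0.0


def diagnose(gold: GoldAnnotation, trace: AgentTrace) -> dict[str, float]:
    """Score each lifecycle stage separately instead of one end-of-task number."""
    # Maintenance: an update counts only if the stale item is gone
    # and its replacement is present at decision time.
    resolved = sum(
        1
        for stale, fresh in gold.updates.items()
        if stale not in trace.store_at_query and fresh in trace.store_at_query
    )
    retrieved = set(trace.retrieved)
    evidence = set(gold.evidence_chain)
    return {
        "write_recall": ratio(len(gold.memory_points & trace.written), len(gold.memory_points)),
        "update_accuracy": ratio(resolved, len(gold.updates)),
        "evidence_recall": ratio(len(evidence & retrieved), len(evidence)),
        "distractor_rate": ratio(len(gold.distractors & retrieved), len(retrieved)),
        "answer_correct": float(trace.answer.strip().lower() == gold.answer.strip().lower()),
    }


# Toy example: most gold items were written, but the stale item "m1" was never
# removed, only half of the evidence was retrieved, and the answer is wrong.
gold = GoldAnnotation(
    memory_points={"m1", "m2", "m3", "m4"},
    updates={"m1": "m3"},
    distractors={"d1"},
    evidence_chain=["m2", "m3"],
    answer="blue",
)
trace = AgentTrace(
    written={"m1", "m2", "m3", "d1"},
    store_at_query={"m1", "m2", "m3", "d1"},
    retrieved=["m2", "d1"],
    answer="red",
)
scores = diagnose(gold, trace)
print(scores)
```

In this toy trace the write score is fairly high while the maintenance and use scores are zero, which is the kind of gap a single end-of-task accuracy would hide.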
Community
WorldMemArena is a new benchmark evaluating the multimodal memory of long-horizon agents using a four-stage Action-World Interaction Loop and multi-session tasks for detailed performance diagnostics.
Get this paper in your agent:
hf papers read 2605.29341
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash