Latent Preference Modeling for Cross-Session Personalized Tool Calling
Abstract
Personalized tool calling in LLM-based agents improves with memory-augmented methods that capture the reasoning behind user choices rather than the choices alone, at minimal token overhead.
Users often omit essential details in their requests to LLM-based agents, resulting in under-specified inputs for tool use. This poses a fundamental challenge for tool-augmented agents, as API execution typically requires complete arguments, highlighting the need for personalized tool calling. To study this problem, we introduce MPT, a benchmark comprising 265 multi-session dialogues that cover three challenges: Preference Recall, Preference Induction, and Preference Transfer. We also propose PRefine, a test-time memory-augmented method that represents user preferences as evolving hypotheses. Through a generate–verify–refine loop, it extracts reusable constraints from history and improves tool-calling accuracy while using only 1.24% of the tokens required by full-history prompting. These results indicate that robust personalization in agentic systems depends on memory that captures the reasons behind user choices, not just the choices themselves.
Community
LLM agents are increasingly expected to call APIs on behalf of users, but real users rarely spell out every argument they want — they just say "book a flight for my trip" and expect the agent to know they always fly economy. We argue this isn't a memory retrieval problem but a memory abstraction problem: the agent has to figure out which past choices reflect reusable preferences and which were just one-off decisions.
We introduce MPT, a benchmark of 265 multi-session dialogues testing three reasoning types — Preference Recall, Induction, and Transfer — and PRefine, a test-time method that maintains latent preferences as evolving hypotheses through a generate–verify–refine loop. PRefine improves tool-calling accuracy across 8 LLMs while using only 1.24% of the tokens required by full-history prompting. The takeaway: robust personalization depends on capturing the reasons behind user choices, not just the choices themselves.
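The generate–verify–refine loop described above can be sketched in miniature. This is an illustrative assumption of how such a loop might work, not the paper's actual implementation: all names (`PreferenceHypothesis`, the session format, the support/conflict thresholds) are hypothetical, and in PRefine the generate and refine steps would be driven by an LLM rather than the simple heuristics shown here.

```python
# Minimal sketch of a generate-verify-refine loop over latent preference
# hypotheses. Sessions are modeled as dicts of observed tool-call arguments.
from dataclasses import dataclass

@dataclass
class PreferenceHypothesis:
    slot: str           # tool-call argument the constraint fills, e.g. "cabin_class"
    value: str          # hypothesized preferred value, e.g. "economy"
    support: int = 0    # past sessions consistent with the hypothesis
    conflicts: int = 0  # past sessions contradicting it

def generate(history):
    """Propose one hypothesis per slot from the most recently observed choice."""
    latest = {}
    for session in history:
        for slot, value in session.items():
            latest[slot] = value
    return [PreferenceHypothesis(slot, value) for slot, value in latest.items()]

def verify(hypotheses, history):
    """Count supporting vs. conflicting evidence across past sessions."""
    for h in hypotheses:
        for session in history:
            if h.slot in session:
                if session[h.slot] == h.value:
                    h.support += 1
                else:
                    h.conflicts += 1
    return hypotheses

def refine(hypotheses):
    """Keep hypotheses that look like stable preferences, discard one-offs."""
    return [h for h in hypotheses if h.support >= 2 and h.conflicts == 0]

# Toy history: the user always flew economy, but seat choice varied.
history = [
    {"cabin_class": "economy", "seat": "aisle"},
    {"cabin_class": "economy", "seat": "window"},
    {"cabin_class": "economy"},
]
stable = refine(verify(generate(history), history))
print([(h.slot, h.value) for h in stable])  # [('cabin_class', 'economy')]
```

The point of the sketch is the abstraction step: "economy" survives as a reusable constraint because it recurs without contradiction, while the seat choice is rejected as a one-off decision rather than a preference.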
This is an automated message from the Librarian Bot: the following similar papers were recommended by the Semantic Scholar API.
- PERMA: Benchmarking Personalized Memory Agents via Event-Driven Preference and Realistic Task Environments (2026)
- User Preference Modeling for Conversational LLM Agents: Weak Rewards from Retrieval-Augmented Interaction (2026)
- MemoryCD: Benchmarking Long-Context User Memory of LLM Agents for Lifelong Cross-Domain Personalization (2026)
- BenchPreS: A Benchmark for Context-Aware Personalized Preference Selectivity of Persistent-Memory LLMs (2026)
- AgentSelect: Benchmark for Narrative Query-to-Agent Recommendation (2026)
- Shopping Companion: A Memory-Augmented LLM Agent for Real-World E-Commerce Tasks (2026)
- Learning to Rewrite Tool Descriptions for Reliable LLM-Agent Tool Use (2026)