Agents Explore but Agents Ignore: LLMs Lack Environmental Curiosity
Abstract
LLM-based agents fail to exploit discovered unexpected information despite recognizing it, indicating a lack of environmental curiosity that depends on tools, compute, and training data distribution.
LLM-based agents are assumed to integrate environmental observations into their reasoning: discovering highly relevant but unexpected information should naturally lead a model to exploit its own discoveries. We show that this assumption is false for current LLM-based agents, which struggle to reflect on or react to unexpected information. Across three benchmarks (Terminal-Bench, SWE-Bench, AppWorld), we inject complete task solutions into the agent environments to deliberately expose a task's solution to a model. While agents discover these solutions on Terminal-Bench in 79-81% of runs, they interact with (i.e., exploit) them in only 37-50% of cases. This gap is starkest in AppWorld: agents see documentation stating that a command "returns the complete solution to this task" in over 90% of attempts but exploit this in fewer than 7% of trials. We show that agents lack what we call environmental curiosity: the capability to recognize and investigate unexpected but relevant observations in response to environmental stimuli. We identify three main factors influencing environmental curiosity: available tools in the agent scaffold, test-time compute, and training data distribution. Our findings show that configurations that maximize curiosity also achieve the best performance on the unmodified benchmarks. Yet even jointly optimized agents still ignore discovered solutions in the majority of trials: current agents use the environment to fetch expected information, but not to revise their strategy or maximally exploit useful stimuli.
LLM agents are assumed to integrate environmental observations into their reasoning. It turns out they don't.
We inject complete solutions into agent environments as a file or API endpoint. Agents discover them in almost every run and ignore them almost every time. Starkest example: on AppWorld, gpt-oss-120b sees a CLI command documented as "returns the complete solution to this task" in 97.54% of runs and calls it in 0.53% of runs. The same pattern holds for GLM-4.7 and other models, across Terminal-Bench, SWE-Bench, and AppWorld.
We call this missing capability environmental curiosity: the ability to recognize and investigate unexpected but relevant observations. It matters because agents operating in novel environments need to catch subtle, unexpected, but highly relevant information to succeed, not just execute memorized patterns. And we find that configurations that maximize environmental curiosity also achieve the best performance on the unmodified benchmarks.
Agents Lack Environmental Curiosity
We propose two metrics to measure environmental curiosity: discovery@k (whether the agent surfaces relevant information) and interaction@k (whether the agent acts on it). A large gap between the two persists across models and benchmarks.
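The two metrics can be sketched as follows. This is a minimal illustration, not the paper's evaluation code: the per-trial record fields (`discovered`, `interacted`) and the aggregation over tasks are assumptions about the setup.

```python
# Hypothetical sketch of discovery@k and interaction@k.
# Each task is a list of trial records with boolean outcome fields.

def at_k(trials, k, field):
    """1.0 if any of the first k trials has `field` set, else 0.0."""
    return float(any(t[field] for t in trials[:k]))

def benchmark_metrics(tasks, k):
    """Average discovery@k and interaction@k across tasks."""
    n = len(tasks)
    discovery = sum(at_k(t, k, "discovered") for t in tasks) / n
    interaction = sum(at_k(t, k, "interacted") for t in tasks) / n
    return discovery, interaction

# Example: two tasks, k=2. The first task interacts with what it finds;
# the second discovers the solution on trial 2 but never acts on it,
# producing the discovery/interaction gap described above.
tasks = [
    [{"discovered": True, "interacted": True},
     {"discovered": True, "interacted": False}],
    [{"discovered": False, "interacted": False},
     {"discovered": True, "interacted": False}],
]
print(benchmark_metrics(tasks, k=2))  # -> (1.0, 0.5)
```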
Three test-time factors shape environmental curiosity
Tool availability. Adding str_replace_editor (SWE-agent's default editing tool) alongside bash increases pass@1 but consistently reduces interaction with discovered solutions. Agents default to learned tool-specific patterns rather than examining their environment.
Reasoning budget. Increasing gpt-oss-120b from low to high reasoning triples interaction@1. This is not an artifact of better discovery, since discovery is consistently high: the probability of interaction given discovery rises from 17.65% (low) to 45.69% (high).
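The conditional quantity above is just interactions restricted to the runs that discovered the solution. A small sketch, with made-up run records rather than the paper's data:

```python
# Illustrative: P(interaction | discovery) over a set of runs.
def interaction_given_discovery(runs):
    """Among runs that discovered the injected solution,
    what fraction also acted on it?"""
    discovered = [r for r in runs if r["discovered"]]
    if not discovered:
        return 0.0
    return sum(r["interacted"] for r in discovered) / len(discovered)

# 4 of 5 runs discover the solution; only 1 of those interacts -> 0.25
runs = [
    {"discovered": True,  "interacted": True},
    {"discovered": True,  "interacted": False},
    {"discovered": True,  "interacted": False},
    {"discovered": True,  "interacted": False},
    {"discovered": False, "interacted": False},
]
print(interaction_given_discovery(runs))  # -> 0.25
```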
Prompting. Explicit instructions to explore the environment improve both interaction and pass@1. The prompt that maximizes interaction is also the best-performing prompt on the unmodified benchmark.
Narrow fine-tuning suppresses curiosity
We fine-tune the same base model on three task distributions and compare. Narrow in-distribution training reduces curiosity: on AppWorld w/ solution, AppWorld-SFT achieves higher pass@1 than the broader T-Bench-SFT (44.2 vs 34.5) but lower interaction@10 (26.9 vs 41.5). Narrow training compresses the solution space the agent explores. And curiosity does not transfer across domains: on each solution-injected benchmark, the respective in-domain model achieves higher interaction rates and better pass@10 scaling than the out-of-domain one. The same pattern appears on the original, unmodified benchmarks: narrow wins at pass@1, broader wins at pass@k.
Discussion
Current agents run the ReAct loop:
Action → Observation → Reasoning → Next Action
Environmental curiosity requires reflecting on whether observations fit the agent's current model of the environment:
Action → Observation → Reasoning and reflecting on observations → Next Action
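The two loops above can be contrasted in a toy sketch. Everything here is a stand-in (the observation stream, the "expected" set modeling the agent's beliefs, and the action strings), not the paper's implementation:

```python
# Toy contrast: plain ReAct vs. a reflective loop that checks whether each
# observation fits the agent's current model of the environment.

def run_agent(observations, expected, reflective=False):
    """Step through a fixed observation stream; return the actions taken."""
    actions = []
    for obs in observations:
        if reflective and obs not in expected:
            # Reflection step: an unexpected observation triggers
            # investigation instead of continuing the memorized plan.
            actions.append(f"investigate:{obs}")
        else:
            actions.append(f"continue:{obs}")
    return actions

stream = ["file_list", "SOLUTION.txt", "tests_pass"]
beliefs = {"file_list", "tests_pass"}

# Plain loop ignores the injected solution; reflective loop flags it.
print(run_agent(stream, beliefs))
# -> ['continue:file_list', 'continue:SOLUTION.txt', 'continue:tests_pass']
print(run_agent(stream, beliefs, reflective=True))
# -> ['continue:file_list', 'investigate:SOLUTION.txt', 'continue:tests_pass']
```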
Even with all test-time factors jointly optimized, agents ignore discovered solutions in the majority of trials. The gap is not just a matter of inference-time configuration; it is inherent to how LLMs are trained. We see three main open questions:
- Does post-training suppress environmental curiosity that pre-training may produce, or does it never emerge? Measuring this in base models is hard because curiosity can only be observed through agentic behavior.
- We tried three SFT setups to teach the reflective loop (curious first turns via rejection sampling, mid-trajectory file removal, masked adversarial turns). None worked. Training for environmental curiosity is an open problem.
- Outcome-driven metrics like pass@k reward rigid plan execution the same as adaptive reasoning. Process-oriented metrics that assess whether agents ground reasoning in observations are a necessary complement.
📜 https://arxiv.org/abs/2604.17609
Work by Cohere ❤️