
Exploration and Exploitation Errors Are Measurable for Language Model Agents

Jaden Park¹,* · Jungtaek Kim¹,* · Jongwon Jeong¹,* · Robert D. Nowak¹ · Kangwook Lee²,³ · Yong Jae Lee¹

¹University of Wisconsin–Madison · ²KRAFTON · ³Ludo Robotics

*Equal contribution


Overview

Figure 1. Each episode combines a partially observed 2D grid map with an unknown symbolic task DAG so that exploration and exploitation behavior can be measured directly from trajectories.

Sample Trajectories

All examples below use exploration=high, task_dag=hard, and seed=2.

Claude Opus 4.6

Claude Haiku 4.5

Gemini 3.1 Pro

Gemini 3.1 Flash Lite

GPT-5.4

GPT-4.1 Nano

Abstract

Language model agents are increasingly used in complex open-ended settings, from coding and workflow automation to physical interaction. In these domains, good performance requires both exploring the problem space and exploiting the information gathered so far. Yet most evaluations still collapse these behaviors into a single success number.

We introduce a controllable environment built from partially observable 2D grid maps and unknown symbolic task DAGs, together with a policy-agnostic metric that separates exploration and exploitation errors directly from action trajectories. This allows us to ask not only whether an agent succeeds, but also whether failure came from poor discovery, poor use of discovered information, or both.

Across frontier models, we find that low exploration error is a strong predictor of success, that agents with similar success rates can still behave very differently, and that prompt design together with lightweight harness engineering can significantly reshape failure modes.

Key Contributions

Policy-Agnostic Error Measurement

The framework measures exploration and exploitation errors from trajectories alone, without access to hidden policies or a single hand-authored reference strategy.

Controllable Environments

Each task combines a partially observed grid map with a symbolic task DAG, and the generator controls both the exploration demand and the task DAG difficulty.

Suppression of Semantic Shortcuts

Symbolic task nodes isolate whether the agent is reasoning from interaction history rather than relying on pretrained semantic priors.

Behavioral Diagnostics Beyond Success Rate

The environment surfaces distinct failure modes across models, prompts, and harnesses, making it useful for agent analysis rather than only ranking.

Method

The framework is built to distinguish failure to discover useful information from failure to use information that has already been discovered. The environment, the error metric, and the evaluation protocol are all designed around that separation.

Illustration of metric edge cases including shortcut exploitation, exploration detours, and symmetric gain traps.

Figure 2. The metric stays strategy-agnostic while still detecting structurally redundant behavior in no-progress segments.

Environment Design

The agent moves through a partially observed 2D grid. Visiting a cell reveals legal moves and any symbolic task node at that location. Discovered nodes reveal local DAG structure but not the locations of related nodes, so progress requires both discovery and later revisitation.
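The observation rule described above can be sketched in a few lines of Python. This is a toy illustration only, not the repository's implementation: the class name `GridEpisode`, its fields, and the observation keys are all hypothetical, and the real environment may represent maps and DAGs differently.

```python
from dataclasses import dataclass, field

Cell = tuple[int, int]
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


@dataclass
class GridEpisode:
    """Toy partially observed grid. Every name here is hypothetical."""
    free_cells: set[Cell]            # walkable cells, hidden from the agent
    nodes: dict[Cell, str]           # hidden location -> symbolic node label
    parents: dict[str, list[str]]    # node label -> prerequisite labels (DAG edges)
    position: Cell
    visited: set[Cell] = field(default_factory=set)

    def legal_moves(self) -> dict[str, Cell]:
        r, c = self.position
        return {name: (r + dr, c + dc) for name, (dr, dc) in MOVES.items()
                if (r + dr, c + dc) in self.free_cells}

    def observe(self) -> dict:
        """Return what a visit reveals: legal moves and any node at this cell."""
        self.visited.add(self.position)
        label = self.nodes.get(self.position)
        return {
            "position": self.position,
            "legal_moves": sorted(self.legal_moves()),
            "node": label,
            # Local DAG structure only: prerequisite labels, never coordinates.
            "prerequisites": self.parents.get(label, []) if label else [],
        }

    def step(self, move: str) -> dict:
        targets = self.legal_moves()
        if move not in targets:
            raise ValueError(f"illegal move: {move}")
        self.position = targets[move]
        return self.observe()


episode = GridEpisode(
    free_cells={(0, 0), (0, 1), (1, 1)},
    nodes={(1, 1): "B"},
    parents={"B": ["A"]},  # "A" sits somewhere the agent has not seen yet
    position=(0, 0),
)
first = episode.observe()
second = episode.step("right")
third = episode.step("down")
print(third)
```

In this toy episode the agent learns that node B depends on node A but receives no hint about where A is, which is the situation that forces further exploration and a later return visit.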

Exploration and Exploitation Metric

At each timestep the environment constructs a target set of productive destinations: unseen frontier cells, actionable discovered nodes, or both. The metric combines a gain test with a stale-score over no-progress trajectories so it can allow benign backtracking while penalizing repeated unproductive wandering.
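To make the target set and gain test concrete, here is a minimal sketch under a strong simplifying assumption: a move counts as a gain if it strictly shortens the known shortest-path distance to at least one target. The paper's actual gain test and its stale-score may be defined differently, and the stale-score is not modeled here. The function names `bfs_distances`, `target_set`, and `has_gain` are hypothetical.

```python
from collections import deque

Cell = tuple[int, int]


def bfs_distances(adjacency: dict[Cell, list[Cell]], start: Cell) -> dict[Cell, int]:
    """Shortest-path step counts from start over the known part of the map."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        for nxt in adjacency.get(cell, []):
            if nxt not in dist:
                dist[nxt] = dist[cell] + 1
                queue.append(nxt)
    return dist


def target_set(frontier: set[Cell], actionable_nodes: set[Cell]) -> set[Cell]:
    """Productive destinations: unseen frontier cells, actionable nodes, or both."""
    return frontier | actionable_nodes


def has_gain(adjacency: dict[Cell, list[Cell]], targets: set[Cell],
             prev: Cell, curr: Cell) -> bool:
    """Assumed gain test: the move brings the agent closer to some target."""
    before = bfs_distances(adjacency, prev)
    after = bfs_distances(adjacency, curr)
    return any(t in before and t in after and after[t] < before[t] for t in targets)


# A four-cell corridor: (0,0) - (0,1) - (0,2) - (0,3)
corridor = {
    (0, 0): [(0, 1)],
    (0, 1): [(0, 0), (0, 2)],
    (0, 2): [(0, 1), (0, 3)],
    (0, 3): [(0, 2)],
}
only_frontier = target_set({(0, 3)}, set())
frontier_and_node = target_set({(0, 3)}, {(0, 0)})

print(has_gain(corridor, only_frontier, (0, 1), (0, 2)))      # toward the frontier
print(has_gain(corridor, only_frontier, (0, 1), (0, 0)))      # away from the only target
print(has_gain(corridor, frontier_and_node, (0, 1), (0, 0)))  # toward an actionable node
```

The third call shows why the target set matters: the same backward step is unproductive when only the frontier is a target, but productive once an actionable node lies in that direction.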

Evaluation Protocol

The paper evaluates 13 frontier language models, multiple prompt variants, and explicit harness engineering. The environment reports success, exploration error, and exploitation error, yielding a more informative behavioral profile than success alone.

Experimental Results

The main empirical picture is that successful agents first need to discover the right states. Across the evaluated models, exploration error is far more predictive of success than exploitation error, even though both are measurable and both can be improved through intervention.

Success rate versus exploration error across frontier models.

Success rate and exploration error show a strong negative relationship (R² = 0.947), indicating that poor exploration strongly limits success.
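For readers who want to recompute a fit of this kind from their own runs, the sketch below shows how a least-squares slope and R² are obtained for one predictor. It assumes a simple linear fit, which may not match the paper's exact procedure, and the numbers are made-up placeholders rather than results from the paper.

```python
def linear_fit_r2(xs: list[float], ys: list[float]) -> tuple[float, float]:
    """Return (slope, R^2) of an ordinary least-squares line y = a + b * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, 1.0 - ss_res / ss_tot


# Placeholder values for illustration only, NOT measurements from the paper.
exploration_error = [0.05, 0.10, 0.20, 0.30]
success_rate = [0.95, 0.80, 0.55, 0.20]

slope, r2 = linear_fit_r2(exploration_error, success_rate)
print(f"slope={slope:.3f}, R^2={r2:.3f}")
```

A negative slope with R² close to 1 corresponds to the pattern reported above: higher exploration error goes with lower success, with little scatter around the line.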

Success rate versus exploitation error across frontier models.

Exploitation error alone is much less predictive of success, because agents that fail to discover key states can still appear locally efficient on what they already know.

Low exploration error is a strong predictor of success.

The strongest trend in the paper is the link between effective exploration and end-task completion. If an agent does not discover actionable information, it rarely gets the chance to exploit well.

Similar success rates can hide different strategies.

Models with comparable aggregate performance can still differ substantially in when they keep exploring versus when they commit to already discovered goals.

Prompting and harness design reshape failure modes.

Exploration-focused prompts reduce exploration error, exploitation-focused prompts reduce exploitation error, and structured memory summaries substantially improve success for several models.

Qualitative side-by-side trajectory comparison across two models.

Different models reveal noticeably different exploration and exploitation behaviors even on the same environment episode.

Prompt Steering and Harness Engineering

For GPT-4.1, an exploration-focused prompt raises success from 63% to 80% and lowers exploration error from 0.123 to 0.099. Explicit structured memory summaries raise GPT-4.1 from 63.0% to 92.6% success and Gemini 3.1 Flash Lite from 51.9% to 88.9%.
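The size of these changes is easier to compare when expressed as percentage-point gains and relative changes. The short calculation below uses only the figures quoted above; the helper names are illustrative.

```python
def point_gain(before_pct: float, after_pct: float) -> float:
    """Absolute change in percentage points, rounded to one decimal."""
    return round(after_pct - before_pct, 1)


def relative_change(before: float, after: float) -> float:
    """Change relative to the starting value (negative means a reduction)."""
    return (after - before) / before


prompt_gain_gpt41 = point_gain(63.0, 80.0)           # exploration-focused prompt
memory_gain_gpt41 = point_gain(63.0, 92.6)           # structured memory summaries
memory_gain_flash_lite = point_gain(51.9, 88.9)      # structured memory summaries
explore_error_change = relative_change(0.123, 0.099)

print(f"GPT-4.1, prompt:              +{prompt_gain_gpt41} points")
print(f"GPT-4.1, memory:              +{memory_gain_gpt41} points")
print(f"Gemini 3.1 Flash Lite, memory: +{memory_gain_flash_lite} points")
print(f"GPT-4.1 exploration error:     {explore_error_change:.1%}")
```

On these figures, the memory harness adds 29.6 points for GPT-4.1 and 37.0 points for Gemini 3.1 Flash Lite, while the exploration-focused prompt adds 17.0 points and cuts GPT-4.1's exploration error by roughly a fifth.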

Left to right: GPT-4.1 exploration error, exploitation error, and success rate across prompt types.

Error curves as exploration demand changes.

Increasing exploration demand clearly increases exploration error.

Increasing symbolic task complexity shifts the balance between exploration and exploitation burden.

Quickstart

The repository includes environment generation, simulation, analysis scripts, visualization utilities, and tests for reproducing the environment. More details can be found in the GitHub repository.

pip install -e .

bash scripts/setup_llm_keys.sh
source ~/.symbolic_environment_llm_keys

symbolic-environment simulate \
  --exploration-level high \
  --task-dag-difficulty hard \
  --seed 0 \
  --agent llm_agent \
  --llm-provider azure \
  --llm-model gpt-4.1 \
  --llm-prompt-set reasoning-exploration \
  --gif
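To repeat the run above over several seeds, one option is to build the same command in Python. The sketch below uses only the flags and values shown in the command above and prints the commands instead of running them; the helper name `simulate_command` is hypothetical, and any flag values other than those shown above are not confirmed here and should be checked against the repository.

```python
import shlex


def simulate_command(seed: int) -> list[str]:
    """Build the quickstart command for one seed, using only the flags shown above."""
    return [
        "symbolic-environment", "simulate",
        "--exploration-level", "high",
        "--task-dag-difficulty", "hard",
        "--seed", str(seed),
        "--agent", "llm_agent",
        "--llm-provider", "azure",
        "--llm-model", "gpt-4.1",
        "--llm-prompt-set", "reasoning-exploration",
        "--gif",
    ]


commands = [simulate_command(seed) for seed in range(3)]
for argv in commands:
    # Print a copy-pasteable shell line; nothing is executed here.
    print(shlex.join(argv))
```

Passing each list to `subprocess.run(argv, check=True)` would execute the runs, assuming the package is installed and the API keys have been sourced as described above.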

Citation

@article{park2026exploration,
    title={Exploration and Exploitation Errors Are Measurable for Language Model Agents},
    author={Jaden Park and Jungtaek Kim and Jongwon Jeong and Robert D. Nowak and Kangwook Lee and Yong Jae Lee},
    journal={arXiv preprint arXiv:2604.13151},
    year={2026}
}