LLMs Know When to Use Tools But Fail to Act on It
LLMs don't fail at tool use because they can't recognize when they need help — they fail because they don't act on that recognition. A new study puts the mismatch rate at up to 54%, and traces the breakdown to a single transition: cognition to action.
Explanation
When an AI agent decides whether to answer a question itself or call an external tool (like a calculator or search engine), you'd assume the main challenge is knowing which situation you're in. Turns out, that's not the bottleneck.
Researchers tested four large language models on arithmetic and factual question-answering tasks, measuring how often models should use a tool (based on whether they actually get the answer right without one) versus how often they do. The mismatch is striking: 26.5–54% on math tasks, 30.8–41.8% on factual QA. Depending on the model and task, between roughly a quarter and a half of the time the model's behavior doesn't match what its own capability profile demands.
The key insight comes from probing the models' internal states. The researchers split tool use into two stages: cognition (does the model internally "believe" a tool is needed?) and execution (does it actually call one?). Both signals are detectable in the model's hidden layers, but in the late layers that directly drive the next-token output, the two signals point in nearly unrelated (orthogonal) directions rather than the same one. The model knows, but doesn't do.
Most of the mismatch lives in that cognition-to-action gap, not in faulty self-assessment. The model's internal read of the situation is often correct; something breaks in the translation to behavior.
Why does this matter today? Because the entire agentic AI stack — from coding assistants to autonomous research tools — assumes that if you give a model access to tools and good judgment, it will use them appropriately. This research suggests the failure mode isn't judgment; it's a structural disconnect in how internal states become outputs. Fixing it likely requires targeted interventions at the late-layer, action-generation stage, not just better training data or prompting.
The paper introduces a model-adaptive definition of tool necessity: rather than labeling a query as tool-requiring in the abstract, necessity is defined relative to each model's empirical solve rate without tools. This is a meaningful methodological upgrade over prior work that treated necessity as model-agnostic — a query that GPT-4 can handle cold may genuinely require retrieval for a smaller model, and conflating the two inflates apparent competence.
With this framing, the authors benchmark four LLMs on arithmetic and factual QA, finding behavioral mismatches of 26.5–54.0% and 30.8–41.8% respectively. These aren't edge cases: at a quarter to a half of queries, mismatch is a routine outcome.
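The model-adaptive definition and the mismatch rate can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the `QueryRecord` structure, the repeated no-tool attempts, and especially the 0.5 solve-rate threshold, which this summary does not specify and which may differ from the paper's actual criterion.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class QueryRecord:
    # Hypothetical record: outcomes of repeated attempts WITHOUT a tool,
    # plus whether the model called a tool when one was available.
    no_tool_correct: List[bool]
    called_tool: bool


def solve_rate(record: QueryRecord) -> float:
    """Fraction of no-tool attempts the model answered correctly."""
    return sum(record.no_tool_correct) / len(record.no_tool_correct)


def needs_tool(record: QueryRecord, threshold: float = 0.5) -> bool:
    """Model-adaptive necessity: a tool is 'needed' when this model's own
    no-tool solve rate falls below the threshold (0.5 is an assumed value)."""
    return solve_rate(record) < threshold


def mismatch_rate(records: List[QueryRecord], threshold: float = 0.5) -> float:
    """Share of queries where observed behavior disagrees with necessity."""
    mismatches = sum(needs_tool(r, threshold) != r.called_tool for r in records)
    return mismatches / len(records)


records = [
    QueryRecord([True, True, True, True], called_tool=False),      # solvable, no call: match
    QueryRecord([False, False, True, False], called_tool=False),   # needs tool, no call: mismatch
    QueryRecord([False, False, False, False], called_tool=True),   # needs tool, call: match
    QueryRecord([True, True, True, False], called_tool=True),      # solvable, call: mismatch
]
print(mismatch_rate(records))  # 0.5
```

Note that a mismatch counts in both directions: skipping a tool the model needs, and calling one it does not need. The same query can land on different sides of the threshold for different models, which is the point of the model-adaptive framing.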
The mechanistic diagnosis is the paper's sharpest contribution. Tool use is decomposed into a cognition stage (internal belief about necessity, probed from hidden states) and an execution stage (observed tool-call behavior). Linear probes recover both signals with meaningful accuracy, confirming they're encoded in the residual stream. The problem: in the late-layer, last-token regime — the computational locus that determines next-token generation — the probe directions for cognition and execution become nearly orthogonal. The model's internal necessity signal is effectively decoupled from the action-generation pathway.
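To make "nearly orthogonal" concrete, here is a small sketch that compares two probe directions by cosine similarity. The vectors are made-up three-dimensional stand-ins, not real probe weights; in practice each direction would be the weight vector of a linear probe trained on hidden states, and the paper's exact comparison metric is assumed here rather than confirmed.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two direction vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def angle_degrees(u, v):
    """Angle between two directions, clamped against rounding error."""
    cos = max(-1.0, min(1.0, cosine_similarity(u, v)))
    return math.degrees(math.acos(cos))


# Hypothetical probe directions (illustrative values only).
cognition_dir = [1.0, 0.0, 0.2]
execution_aligned = [0.9, 0.1, 0.2]     # what a coupled model might look like
execution_orthogonal = [0.0, 1.0, 0.0]  # the decoupled case described above
execution_opposite = [-1.0, 0.0, -0.2]  # anti-aligned, shown for contrast

for name, direction in [
    ("aligned", execution_aligned),
    ("orthogonal", execution_orthogonal),
    ("opposite", execution_opposite),
]:
    print(name, round(cosine_similarity(cognition_dir, direction), 3))
```

A cosine near 1 would mean the necessity signal and the action signal share a direction, near -1 that they oppose each other, and near 0 that a change along one barely registers along the other. The reported finding corresponds to the near-0 case.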
Trajectory analysis across the two-stage process confirms that most mismatch originates at the cognition-to-action transition, not in cognition itself. Models are not primarily miscalibrated about their own limitations; they're failing to route that calibration into behavior.
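One way to picture this attribution is to label each mismatched query by the stage at which it went wrong. The sketch below is an assumed simplification, not the paper's procedure: `Trace` and its fields are hypothetical, and `believes_tool` stands in for a binarized probe readout of the cognition stage.

```python
from collections import Counter
from typing import NamedTuple


class Trace(NamedTuple):
    needs_tool: bool     # model-adaptive ground truth
    believes_tool: bool  # assumed probe readout of internal belief (cognition)
    calls_tool: bool     # observed behavior (execution)


def classify(trace: Trace) -> str:
    """Attribute a mismatch to the stage where it first appears."""
    if trace.calls_tool == trace.needs_tool:
        return "match"  # includes cases where a wrong belief was not acted on
    if trace.believes_tool == trace.needs_tool:
        return "transition_error"  # belief was right, action diverged from it
    return "cognition_error"       # belief was wrong, action followed it


traces = [
    Trace(needs_tool=True, believes_tool=True, calls_tool=False),
    Trace(needs_tool=True, believes_tool=True, calls_tool=True),
    Trace(needs_tool=False, believes_tool=False, calls_tool=True),
    Trace(needs_tool=True, believes_tool=False, calls_tool=False),
    Trace(needs_tool=False, believes_tool=False, calls_tool=False),
]

counts = Counter(classify(t) for t in traces)
mismatches = counts["transition_error"] + counts["cognition_error"]
print(counts["transition_error"] / mismatches)  # share of mismatch at the transition
```

In this toy data two of the three mismatches are transition errors. The paper's claim is that real models show the same pattern: the larger share of mismatch sits at the belief-to-action step.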
Open questions the paper raises but doesn't fully close: What architectural or training dynamics cause the late-layer orthogonality? Is this a consequence of RLHF-style fine-tuning suppressing tool calls in favor of fluent direct answers? Would targeted representation engineering or fine-tuning on the action stage close the gap without degrading cognition? The linear decodability of both probes suggests the information is there — the intervention surface is the projection, not the encoding.
For practitioners building agentic pipelines, the implication is concrete: tool-call reliability cannot be fixed by prompt engineering alone if the failure is structural at the representation level. Watch for follow-up work on steering vectors or late-layer fine-tuning as the likely next move.
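For readers unfamiliar with steering vectors, the basic mechanics can be sketched in a few lines. This is a generic illustration of the technique, not an intervention the paper proposes or tests: the hidden state, the direction, and the strength `alpha` are all made-up values, and a real application would modify activations inside a model's forward pass.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def project(hidden, direction):
    """Scalar projection of a hidden state onto a direction."""
    d = unit(direction)
    return sum(h * x for h, x in zip(hidden, d))


def steer(hidden, direction, alpha):
    """Shift a hidden state by alpha along the unit direction."""
    d = unit(direction)
    return [h + alpha * x for h, x in zip(hidden, d)]


hidden = [0.5, -1.0, 2.0]        # hypothetical late-layer hidden state
execution_dir = [0.0, 3.0, 4.0]  # hypothetical "call a tool" direction

steered = steer(hidden, execution_dir, alpha=2.0)
print(project(hidden, execution_dir), project(steered, execution_dir))
```

Because the direction is normalized, the projection onto it rises by exactly `alpha`, while components orthogonal to it are untouched. Whether such a shift would actually change tool-call behavior without side effects is the open question.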
Reality meter
Why this score?
LLMs internally recognize when external tools are needed but systematically fail to translate that recognition into tool-call actions, with mismatch rates of up to 54% — a structural 'knowing-doing gap' concentrated at the cognition-to-action transition.
- Behavioral mismatch between model-adaptive tool necessity and observed tool-call behavior ranges from 26.5–54.0% on arithmetic tasks and 30.8–41.8% on factual QA across four tested models.
- Both cognition (internal belief about necessity) and execution (actual tool-call behavior) signals are linearly decodable from LLM hidden states, confirming they are encoded in the model's representations.
- In the late-layer, last-token regime that drives next-token generation, the probe directions for cognition and execution become nearly orthogonal — mechanistically explaining the decoupling.
- Trajectory analysis shows the majority of mismatch is concentrated in the cognition-to-action transition, not in the cognition stage itself.
- Tool necessity is defined model-adaptively based on each model's empirical solve rate without tools, distinguishing this work from prior model-agnostic annotation approaches.
- The study covers only arithmetic and factual QA datasets; generalization to more open-ended or multi-step agentic tasks is undemonstrated.
- Only four models are tested; mismatch rates vary substantially across them (26.5–54%), and the paper does not fully explain what drives that variance.
- Linear probe decodability confirms the signals exist but does not establish that they are causally relevant to behavior — correlation between probe direction and action gap needs stronger causal validation.
The core quantitative claims (mismatch rates, probe orthogonality) are grounded in empirical measurements across multiple models and datasets, with a clear mechanistic decomposition — not just a behavioral observation.
The paper makes no overclaims; it explicitly scopes findings to the tested tasks and frames the knowing-doing gap as a diagnosis requiring further intervention work, not a solved problem.
The finding directly challenges the assumption underlying agentic AI system design — that better judgment is the fix — and points to a specific, actionable failure locus (late-layer action generation), making it practically relevant to anyone building tool-augmented LLM pipelines.
- 1 source on file
- Avg trust 90/100
Glossary
- linear probes
- Machine learning classifiers trained on hidden neural network states to detect and measure whether specific information (like a model's internal beliefs) is encoded in those states. They work by finding linear directions in the network's internal representations that correlate with the target signal.
- residual stream
- The main information pathway running through a transformer neural network, where data flows and accumulates across layers. It's the central channel through which information is processed and transformed as it moves through the model.
- RLHF (Reinforcement Learning from Human Feedback)
- A training technique that fine-tunes language models using human preferences as reward signals, steering the model toward outputs humans find more helpful, harmless, and honest. It's commonly used to align model behavior with desired outcomes.
- representation engineering
- A technique for modifying how information is encoded within a neural network's internal states to change the model's behavior, without retraining the entire model. It involves directly manipulating the learned representations to steer outputs in desired directions.
- steering vectors
- Computed directions in a neural network's representation space that, when applied to the model's internal states, reliably shift its behavior toward specific outcomes. They act as a control mechanism for guiding model outputs without full retraining.
- orthogonal
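- In the context of neural networks, two directions are orthogonal when their dot product is zero, so moving along one leaves the projection onto the other unchanged. Orthogonal is not the same as opposite. When the probe directions for cognition and execution become nearly orthogonal, the model's internal necessity signal has little linear overlap with its action-generation signal, which is what "decoupled" means here.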
Prediction
Will a targeted late-layer intervention (e.g., representation steering or stage-specific fine-tuning) reduce the cognition-to-action mismatch in LLM tool use below 15% within 18 months of this paper's publication?