LLMs Know the Rules They Break During Multi-Turn Ideation
Large language models can accurately recite a constraint they are actively violating — in the same conversation. DriftBench quantifies this dissociation across seven models and finds "knows-but-violates" rates as high as 99%.
Explanation
When you use an AI to iteratively develop a research idea — pushing it to be more novel, more rigorous, more detailed — the model tends to drift away from the original requirements you set. That's the core finding of DriftBench, a new benchmark built specifically to catch this failure mode.
The researchers ran 2,146 scored sessions across seven models from five providers, covering 38 research briefs drawn from 24 scientific fields. The setup mimics real collaborative ideation: a user sets constraints, then applies iterative pressure over multiple turns. The finding is consistent and uncomfortable: more turns reliably produce more structural complexity, and more structural complexity correlates with lower adherence to the original brief.
The sharpest result is the "knows-but-violates" (KBV) metric. When prompted with a restatement probe — essentially asking the model to repeat back the constraints — models do so accurately, even while their actual outputs ignore those same constraints. KBV rates range from 8% to 99% depending on the model. That's not a rounding error; that's a fundamental gap between declarative memory and behavioral compliance.
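The abstract doesn't publish scoring code, but the KBV construct is simple enough to sketch. The snippet below shows one way such a metric could be computed, assuming each session yields two boolean judgments: whether the restatement probe was answered accurately and whether the final output violated a constraint. The `Session` dataclass and `kbv_rate` function are illustrative names, not DriftBench's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Session:
    recalled_correctly: bool    # restatement probe judged accurate
    violated_constraint: bool   # final output judged non-compliant

def kbv_rate(sessions: list[Session]) -> float:
    """Fraction of sessions that violate a constraint the model can still
    accurately restate (knows-but-violates)."""
    knows = [s for s in sessions if s.recalled_correctly]
    if not knows:
        return 0.0
    return sum(s.violated_constraint for s in knows) / len(knows)

# Toy example: four sessions, all recall the constraint, three break it anyway.
sessions = [
    Session(True, True), Session(True, True),
    Session(True, True), Session(True, False),
]
print(f"KBV rate: {kbv_rate(sessions):.0%}")  # -> 75%
```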
Structured checkpointing (periodically re-anchoring the model to its original constraints) reduces KBV rates somewhat, but doesn't close the gap. Complexity inflation — outputs growing more elaborate without becoming more compliant — persists regardless.
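To make the mitigation concrete, here is a hedged sketch of a multi-turn ideation loop with structured checkpointing: the original brief and constraints are re-injected every few turns before the next round of pressure. The `call_model` callable, the message format, and the checkpoint interval are assumptions for illustration; the paper's actual protocol may differ.

```python
CHECKPOINT_EVERY = 3  # assumed interval: re-anchor every third turn

def run_session(call_model, brief: str, constraints: list[str], pressure_prompts: list[str]):
    """Multi-turn ideation loop with structured checkpointing.

    `call_model` is a placeholder for any chat client that accepts a list of
    {"role", "content"} messages and returns the assistant's reply text.
    """
    messages = [{"role": "user",
                 "content": f"{brief}\n\nConstraints:\n" + "\n".join(f"- {c}" for c in constraints)}]
    for turn, pressure in enumerate(pressure_prompts, start=1):
        if turn % CHECKPOINT_EVERY == 0:
            # Checkpoint: restate the original constraints verbatim before applying more pressure.
            messages.append({"role": "user",
                             "content": "Reminder, all original constraints still apply:\n"
                                        + "\n".join(f"- {c}" for c in constraints)})
        messages.append({"role": "user", "content": pressure})  # e.g. "make this more novel"
        reply = call_model(messages)
        messages.append({"role": "assistant", "content": reply})
    return messages
```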
One methodological note worth flagging: the LLM judge used to score constraint adherence under-detects violations compared to blind human raters, meaning the reported numbers are conservative. The real drift is likely worse than the benchmark shows.
For anyone using AI in research workflows today — grant writing, hypothesis generation, protocol design — this is a practical warning: the model that helped you brainstorm in round one is not reliably tracking your original goals by round five.
DriftBench targets a specific and underexplored failure mode: constraint adherence degradation under iterative conversational pressure in multi-turn LLM-assisted scientific ideation. The benchmark spans 2,146 scored runs, seven models (including two open-weight), four interaction conditions, and 38 research briefs across 24 domains — a scope large enough to claim cross-model generalizability, though the exact model identities and provider breakdown matter for interpreting variance.
The central construct is the knows-but-violates (KBV) rate: constraint non-compliance despite accurate declarative recall of those constraints, measured via a restatement probe inserted mid-session. The 8%–99% range across models is the headline, but the distribution shape — whether this is bimodal, correlated with model size, or with RLHF tuning — isn't detailed in the abstract. That's an open question the full paper presumably addresses.
The dissociation between declarative recall and behavioral adherence maps onto a known tension in alignment research: instruction-following at inference time is not the same as maintaining a stable goal representation across a long context window. Prior work on "sycophancy" and "context window forgetting" is adjacent, but KBV is a cleaner, more operationalizable construct because it controls for recall explicitly.
Structured checkpointing as a partial mitigation is the most actionable finding for practitioners — and its failure to fully close the gap is the most important one for researchers. It suggests the problem isn't simply attention to constraints but something more structural about how iterative pressure reshapes the model's implicit optimization target within a session.
The conservative bias in LLM-judged scores (confirmed by blind human raters) is a methodological red flag for the broader eval ecosystem: if self-referential LLM judges systematically under-detect the very failure modes they're asked to catch, benchmark scores across the field may be systematically inflated. Sensitivity analyses across temperature (0.7 vs. 1.0) and pressure type (novelty vs. rigor) add robustness, though they don't vary context length or prompt format — both plausible confounders.
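To make the conservatism point concrete, here is a small sketch of how the judge's under-detection could be quantified, assuming paired boolean violation labels from blind human raters and from the LLM judge over the same outputs. The function name and toy data are hypothetical, not taken from the paper.

```python
def judge_miss_rate(human_flags: list[bool], judge_flags: list[bool]) -> float:
    """Fraction of human-identified violations the LLM judge failed to flag.

    Any positive value means judge-based adherence scores overstate compliance.
    """
    pairs = [(h, j) for h, j in zip(human_flags, judge_flags) if h]
    if not pairs:
        return 0.0
    return sum(1 for _, j in pairs if not j) / len(pairs)

# Toy example: humans flag five violations, the judge catches only three of them.
humans = [True, True, True, True, True, False, False]
judge  = [True, True, True, False, False, False, False]
print(f"Judge miss rate: {judge_miss_rate(humans, judge):.0%}")  # -> 40%
```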
The full benchmark — briefs, prompts, rubrics, transcripts, scores — is released openly, which makes this immediately forkable for follow-on work. Watch for whether frontier model providers respond with architectural or prompting-level fixes, and whether KBV rates correlate with model scale in the full results.
Why this score?
Trust Layer
LLMs engaged in iterative scientific ideation systematically violate constraints they can accurately recall, and this dissociation worsens under conversational pressure across all tested models.
- 2,146 scored benchmark runs across seven models from five providers, four interaction conditions, and 38 research briefs from 24 scientific domains.
- Knows-but-violates (KBV) rate — constraint non-compliance despite accurate restatement — ranges from 8% to 99% across models.
- Iterative pressure reliably increases structural complexity and often reduces adherence to original constraints.
- Structured checkpointing partially reduces KBV rates but does not close the dissociation, and complexity inflation persists.
- Blind human raters confirm the LLM judge under-detects constraint violations, making reported adherence scores conservative.
- The abstract does not name the specific models tested, making it impossible to assess whether results skew toward weaker or older systems.
- KBV rate variance (8%–99%) is enormous; without the distribution breakdown, the aggregate finding obscures which models are actually problematic.
- Restatement probes are researcher-designed stimuli — it's unclear whether they reliably elicit genuine recall or merely surface-level repetition of prompt text.
The benchmark is large-scale, open, and human-validated against blind raters, giving the core KBV finding solid empirical grounding despite the abstract's lack of model-level detail.
The source is an arXiv preprint with no peer review noted, and the 8%–99% KBV range is presented without sufficient distributional context to assess how representative the worst cases are.
If KBV rates are this high even with checkpointing, every multi-turn AI-assisted research workflow has a latent compliance risk that current eval tooling systematically underestimates — that's an immediate operational concern, not a future one.
- 1 source on file
- Trust 90/100
Glossary
- knows-but-violates (KBV) rate
- The proportion of instances where a language model fails to follow constraints in its outputs despite being able to accurately recall and restate those same constraints when directly asked. It measures the gap between what a model knows and what it actually does.
- constraint adherence degradation
- The progressive decline in a model's ability to follow specified rules or requirements as a conversation continues over multiple turns, typically driven by accumulated conversational pressure and a growing context.
- RLHF tuning
- Reinforcement Learning from Human Feedback, a training technique where a language model is fine-tuned using rewards based on human judgments of output quality, used to improve alignment with human preferences.
- context window
- The maximum length of text (measured in tokens) that a language model can process and maintain awareness of at one time during a conversation or task.
- sycophancy
- A failure mode in language models where they agree with or defer to user preferences or statements even when doing so compromises accuracy or contradicts the model's own prior assessment.
- structured checkpointing
- A mitigation technique where a model periodically restates or resets to key constraints and goals during a long interaction to maintain adherence to those requirements.
Prediction
Will at least one major LLM provider publicly address constraint drift in multi-turn sessions — via architecture, system prompt design, or documented mitigation — within 12 months of DriftBench's release?