Alice System Learns Game Rules From Interaction Alone, No Labels Needed
An AI agent just learned to build executable world models of a deliberately mislabeled puzzle game — without rule descriptions, rewards, or any trustworthy language to lean on. That's not a benchmark trick; it's a direct attack on the core brittleness of LLM-based planning.
Explanation
Most AI planning systems cheat a little: they rely on the names of things to guess how those things behave. Call a wall "wall" and the model already half-knows it blocks movement. Strip that away — rename every rule and property with random unrelated words — and most systems collapse.
That's exactly the trap set by "Baba in Wonderland," a modified version of the puzzle game Baba Is You where the simulator logic is preserved but all the meaningful labels are replaced with nonsense. It's a clean test of whether a system is actually learning dynamics or just pattern-matching on vocabulary.
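The relabeling idea can be pictured with a toy sketch. This is not the benchmark's code: the property names, their effects, the nonsense vocabulary, and the `scramble_labels` helper are all assumptions made up for illustration. The point it shows is that the behaviour attached to each label stays the same while the label itself stops being informative.

```python
import random

# Hypothetical property effects; the names and behaviours are illustrative only.
EFFECTS = {
    "stop": lambda obj: {**obj, "blocks": True},
    "push": lambda obj: {**obj, "movable": True},
    "win": lambda obj: {**obj, "goal": True},
}

# Hypothetical replacement vocabulary with no semantic link to the effects.
NONSENSE = ["teapot", "velvet", "quorum", "lantern", "gravel"]


def scramble_labels(effects, vocabulary, seed=0):
    """Return (relabelled_effects, mapping) with every label swapped for an unrelated word."""
    rng = random.Random(seed)
    new_names = rng.sample(vocabulary, k=len(effects))
    mapping = dict(zip(effects, new_names))
    relabelled = {mapping[name]: fn for name, fn in effects.items()}
    return relabelled, mapping


wonderland, mapping = scramble_labels(EFFECTS, NONSENSE)
rock = {"name": "rock"}

# The behaviour once called "push" is unchanged under its new, meaningless name.
assert wonderland[mapping["push"]](rock) == EFFECTS["push"](rock)
```

An agent that guesses behaviour from the word now starts from a wrong or empty prior, while an agent that watches what actually happens is unaffected.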
Alice, the system introduced in this paper, is built to survive that trap. It works in a closed loop: propose a candidate rule update, test it against past and new transitions, and treat any contradiction not as failure but as information. When a new rule explains a fresh transition but breaks an old one, Alice reads that conflict as evidence that two distinct dynamics were being lumped together. It then splits them into separate hypothesis classes and steers future exploration toward transitions that are underrepresented in the current model.
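The propose-and-test step can be sketched as follows. This is a minimal illustration under stated assumptions, not Alice's implementation: the `Transition` layout, the one-dimensional toy world, and the `evaluate_update` function are hypothetical. A candidate program is "accepted" if it explains the new transition and keeps every old one, "rejected" if it fails the new transition, and flagged as a "conflict" if it explains the new transition but breaks transitions the current program already explained.

```python
from typing import Callable, List, NamedTuple, Tuple


class Transition(NamedTuple):
    state: tuple   # (agent x, label of the cell ahead); a toy encoding
    action: str
    outcome: int   # agent x after the action


Program = Callable[[tuple, str], int]


def explains(program: Program, t: Transition) -> bool:
    """True if the program predicts the observed outcome of t."""
    return program(t.state, t.action) == t.outcome


def evaluate_update(current: Program, candidate: Program,
                    history: List[Transition],
                    new_t: Transition) -> Tuple[str, List[Transition]]:
    """Classify a candidate update and return any previously explained transitions it breaks."""
    if not explains(candidate, new_t):
        return "rejected", []
    broken = [t for t in history
              if explains(current, t) and not explains(candidate, t)]
    return ("conflict", broken) if broken else ("accepted", [])


def always_move(state, action):
    return state[0] + 1


def never_move(state, action):
    return state[0]


def blocked_by_teapot(state, action):
    return state[0] if state[1] == "teapot" else state[0] + 1


history = [Transition((0, "empty"), "right", 1),
           Transition((3, "empty"), "right", 4)]
new_t = Transition((1, "teapot"), "right", 1)  # the agent did not move

status_a, broken_a = evaluate_update(always_move, never_move, history, new_t)
status_b, broken_b = evaluate_update(always_move, blocked_by_teapot, history, new_t)
```

Here `never_move` explains the blocked step but breaks both earlier moves, so it comes back as a conflict together with the broken transitions. `blocked_by_teapot` distinguishes the two situations and is accepted.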
The result is an agent that progressively sharpens its internal program of the world through interaction evidence alone — no reward signal, no rule descriptions, no semantic shortcuts.
Experiments on Baba in Wonderland show that Alice substantially outperforms baselines at learning executable world models under prior misalignment. Ablations indicate that both the conflict-based class refinement and the class-aware exploration strategy are load-bearing: removing either one degrades performance.
Why care now? Executable world models — programs an agent can run, inspect, and plan with — are increasingly seen as the missing layer between raw LLM reasoning and reliable autonomous behavior. Alice's approach suggests that the path to robust models runs through structured contradiction, not better priors. Watch whether this transfers beyond grid-world puzzles to environments with continuous or stochastic dynamics.
The core problem Alice addresses is prior misalignment in online world-model induction: the agent's lexical priors (e.g., what a token named "push" implies) are actively misleading, so any system that bootstraps dynamics from surface semantics will induce systematically wrong transition laws. Baba in Wonderland operationalizes this by preserving the full Baba Is You simulator while replacing the semantically meaningful rule-property labels with unrelated words, a surgical intervention that separates semantic leakage from genuine dynamics learning.
Alice's mechanism is a closed-loop hypothesis refinement engine. The key insight is treating preservation conflicts — cases where a candidate update explains a new transition but invalidates previously explained ones — as structural signal rather than noise. This is a form of online discrimination: conflicts reveal that the current program has conflated two or more distinct state-dependent dynamics under a single rule. Alice responds by splitting the conflated class into finer hypothesis classes, each paired with compact, class-stratified counterexamples that constrain future update candidates.
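One way to picture the split is the sketch below. The article does not say how Alice chooses where to split a class, so the feature test used here is a stand-in assumption, and `Transition`, `find_splitting_feature`, and `split_class` are hypothetical names. The sketch looks for a state feature whose value separates the new transition from every transition the candidate broke, partitions the class on that feature, and keeps a small number of counterexamples per sub-class.

```python
from collections import defaultdict
from typing import NamedTuple


class Transition(NamedTuple):
    state: dict    # toy state features, e.g. {"x": 0, "ahead": "empty"}
    action: str
    outcome: int   # agent x after the action


def find_splitting_feature(new_t, broken):
    """Return a feature whose value separates new_t from every broken transition, or None."""
    for feature in sorted(new_t.state):
        if all(t.state.get(feature) != new_t.state[feature] for t in broken):
            return feature
    return None


def split_class(members, feature, max_examples=1):
    """Partition one hypothesis class by `feature`, keeping a few counterexamples per sub-class."""
    groups = defaultdict(list)
    for t in members:
        groups[t.state[feature]].append(t)
    return {value: ts[:max_examples] for value, ts in groups.items()}


broken = [Transition({"x": 0, "ahead": "empty"}, "right", 1),
          Transition({"x": 1, "ahead": "empty"}, "right", 2)]
new_t = Transition({"x": 1, "ahead": "teapot"}, "right", 1)

feature = find_splitting_feature(new_t, broken)
classes = split_class(broken + [new_t], feature)
```

In this toy case the separating feature is `"ahead"`, and the single "move right" class becomes two classes, one for empty cells and one for the relabelled blocking object. Each keeps one stored counterexample that any later candidate update would have to respect.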
The exploration side is equally deliberate. Rather than uniform or curiosity-driven frontier sampling, Alice biases toward transitions that are novel and underrepresented relative to the current program's coverage — a targeted strategy to surface the evidence most likely to resolve remaining ambiguities. This is reminiscent of active learning's version-space reduction, applied online without a fixed hypothesis space.
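A minimal sketch of coverage-biased action selection follows. The inverse-count score is an assumption chosen for simplicity; the article does not give Alice's actual scoring rule, and `novelty_score`, `pick_action`, and the class labels are hypothetical.

```python
from collections import Counter


def novelty_score(class_key, coverage):
    """Higher for hypothesis classes with fewer observed transitions."""
    return 1.0 / (1 + coverage[class_key])


def pick_action(candidates, coverage):
    """candidates maps each action to the class its transition is predicted to fall in."""
    # Sorting first makes tie-breaking deterministic (alphabetical by action).
    return max(sorted(candidates),
               key=lambda action: novelty_score(candidates[action], coverage))


# Number of transitions seen so far per hypothesis class (toy values).
coverage = Counter({"empty": 12, "teapot": 3})
candidates = {"left": "empty", "right": "teapot", "up": "velvet"}

chosen = pick_action(candidates, coverage)
```

With these toy counts the agent picks `"up"`, because the `"velvet"` class has never been observed. Uniform sampling would instead keep revisiting the already well-covered `"empty"` class about a third of the time.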
Evaluation is on a single domain (Baba in Wonderland), which is both a strength (clean ground truth, reproducible) and a limitation (grid-world, discrete, deterministic). The ablation structure is credible: removing either class refinement or class-aware exploration degrades performance, supporting the claim that both components are necessary.
Open questions the paper leaves on the table: how does Alice scale when the hypothesis space is large or the dynamics are stochastic? Does the conflict-detection mechanism remain tractable as program complexity grows? And critically — does "substantially improves" translate to full rule recovery, or just better partial coverage? The abstract doesn't quantify the gap, which matters for assessing how close this is to a deployable planning substrate.
The falsifier to watch: if Alice's gains evaporate on domains with continuous state spaces or noisy transitions, the approach may be fundamentally tied to the clean discrete structure of rule-based puzzle games.
Trust Layer
A closed-loop system called Alice can induce correct executable world models from interaction evidence alone, without rule descriptions, reward signals, or reliable lexical priors, by treating preservation conflicts as structural signal for dynamics refinement.
- Alice is evaluated on Baba in Wonderland, a variant of Baba Is You that preserves simulator dynamics while replacing semantically meaningful rule-property labels with unrelated words — explicitly designed to break lexical prior reliance.
- Alice treats failed candidate updates (those that explain new transitions but invalidate previously explained ones) as evidence that distinct dynamics have been conflated, triggering class refinement rather than simple rejection.
- Class refinement produces compact, class-stratified preservation counterexamples that constrain future update candidates and guide exploration toward underrepresented transitions.
- Experiments show Alice 'substantially improves' executable world-model learning under prior misalignment compared to baselines.
- Ablations confirm both class refinement and class-aware exploration are individually necessary — removing either degrades performance.
- The abstract reports 'substantial improvement' without quantifying the performance gap — the magnitude of the result cannot be assessed from the source alone.
- Evaluation is confined to a single domain (a discrete, deterministic grid-world puzzle game); generalizability to stochastic or continuous environments is undemonstrated.
- No mention of computational cost or scalability as program complexity or hypothesis space size grows.
The experimental setup is concrete and reproducible (a named, well-defined benchmark), the mechanism is described with sufficient specificity, and ablations support the causal claims — this is a credible empirical result, not a demo.
The source is an arXiv abstract with no quantified numbers on the key result, making 'substantially improves' unverifiable from the excerpt; the single-domain scope limits how broadly the claim can be read.
Executable world models are a genuine bottleneck for reliable autonomous planning, and a method that works without semantic priors addresses a real fragility — but impact is currently bounded by the gap between discrete puzzle games and real-world deployment targets.
- 1 source on file
- Avg trust 90/100
Glossary
- world-model induction: The process of learning a model of how the environment works by observing transitions between states. In this context, it refers to an agent learning the rules and dynamics of a game or system from experience.
- lexical priors: Initial assumptions or biases about what words or tokens mean based on their surface-level semantics. These can mislead learning systems when the actual meaning differs from what the word suggests.
- semantic leakage: The problem where an agent's learning is corrupted by relying on the surface meaning of words or labels rather than discovering the true underlying dynamics of the system.
- hypothesis refinement: A learning process where candidate explanations or rules are iteratively improved by testing them against new observations and splitting overly broad hypotheses into more specific ones.
- version-space reduction: An active learning technique that strategically selects examples to eliminate candidate hypotheses, progressively narrowing down the set of possible correct solutions.
- stochastic: Involving randomness or probability; systems where the same input can produce different outputs due to chance rather than following deterministic rules.
Prediction
Will Alice or a direct successor demonstrate executable world-model learning via conflict-based refinement in a continuous or stochastic environment within 18 months?