Brick-Composer Trains MLLMs to Assemble Physical Objects Step by Step
AI that can read a design and build it from physical parts has been a fantasy — Brick-Composer makes it measurably less so, lifting assembly success from under 1% to ~15% per step, with a single 8B model correctly handling 42% of steps end-to-end.
Explanation
The core problem: multimodal large language models (MLLMs — AI systems that process both images and text) are surprisingly bad at the kind of spatial reasoning needed to assemble real objects from parts. They can describe a LEGO-style brick, but ask them to pick the right one from a lineup and place it precisely, and they fall apart.
To measure exactly how badly, the researchers built BC-Bench, the first benchmark designed to test MLLMs on diverse brick assembly. The task is framed as a sequence of decisions: at each step, the model must (1) identify the correct brick from candidates, and (2) predict where and how to place it. Both subtasks have to be right for the step to count.
Baseline results were brutal. State-of-the-art MLLMs achieved less than 1% strict step-level success, meaning fewer than one step in a hundred had both the brick choice and the placement right.
Brick-Composer addresses this with three training signals layered on top of an existing MLLM (Qwen-3-8B). "Human Design Sparks" feed the model rich construction demonstrations that encode how parts relate to each other. "World Feedback" grounds the model's predictions in what actually happens visually and physically when a brick is placed. "Synthetic Experience" generates additional training data beyond real object designs, so the model isn't bottlenecked by dataset size.
The results: brick selection accuracy more than triples, pose estimation errors drop substantially, and strict step success climbs from sub-1% to ~15%. On full object assembly, the trained model gets 42% of steps right — not production-ready, but a genuine proof of concept that targeted, physically grounded training can unlock spatial assembly skills in a general-purpose language model.
The gap between 42% step accuracy and a complete, correct build is still large — errors compound across steps. What to watch: whether this approach scales to more complex geometries, and whether the benchmark holds up as a meaningful proxy for real-world robotic assembly.
The paper frames brick assembly as a sequential decision-making problem with two coupled subtasks per step: categorical brick selection (from a candidate set) and 6-DoF pose estimation. Both must succeed simultaneously for a step to register as correct under the strict metric — which is why baseline MLLM performance collapses to sub-1% despite reasonable per-subtask intuitions.
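The strict metric described above can be sketched in a few lines. This is a minimal illustration, not BC-Bench's actual scoring code: the `Placement` type, the per-axis Euler-angle comparison, and the tolerances `pos_tol` and `rot_tol_deg` are all assumptions made for the example.

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Placement:
    """Hypothetical record of one assembly step (not the benchmark's schema)."""
    brick_id: str                             # brick chosen from the candidate set
    position: Tuple[float, float, float]      # x, y, z
    rotation_deg: Tuple[float, float, float]  # roll, pitch, yaw in degrees


def angle_diff_deg(a: float, b: float) -> float:
    """Smallest absolute difference between two angles, in degrees."""
    d = (a - b) % 360.0
    return min(d, 360.0 - d)


def strict_step_success(pred: Placement, truth: Placement,
                        pos_tol: float = 0.5,
                        rot_tol_deg: float = 5.0) -> bool:
    """A step counts only if BOTH subtasks succeed.

    The tolerances are illustrative assumptions; the source does not
    state the thresholds the benchmark uses.
    """
    # Subtask 1: categorical brick selection.
    if pred.brick_id != truth.brick_id:
        return False
    # Subtask 2: 6-DoF pose (3 translation + 3 rotation components).
    pos_err = math.dist(pred.position, truth.position)
    rot_err = max(angle_diff_deg(p, t)
                  for p, t in zip(pred.rotation_deg, truth.rotation_deg))
    return pos_err <= pos_tol and rot_err <= rot_tol_deg


truth = Placement("2x4", (0.0, 0.0, 0.0), (0.0, 0.0, 90.0))
near = Placement("2x4", (0.1, 0.0, 0.0), (0.0, 0.0, 92.0))
wrong_brick = Placement("2x2", (0.0, 0.0, 0.0), (0.0, 0.0, 90.0))
print(strict_step_success(near, truth), strict_step_success(wrong_brick, truth))
```

Because the two checks are combined with a logical AND, a perfect pose on the wrong brick scores the same as a complete miss, which is the coupling the strict metric is meant to capture.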
BC-Bench is the methodological anchor here. It's the first benchmark targeting MLLMs specifically on diverse (non-uniform) brick types, which matters because prior assembly work has largely assumed constrained part sets or relied on programmatic solvers rather than vision-language models. The benchmark's existence is independently useful regardless of Brick-Composer's results.
The three-signal training framework is the core contribution. Human Design Sparks are affordance-rich demonstrations — essentially teaching the model construction intent, not just geometry. World Feedback is a physically grounded reward signal: the model sees the visual and physical consequences of its predicted placements, closing the loop between prediction and outcome. Synthetic Experience addresses the data bottleneck by generating novel object designs, decoupling benchmark scale from real-world design corpora. Together these signals are applied to Qwen-3-8B, a publicly available 8B-parameter multimodal model.
Quantitative outcomes: >3× improvement in brick selection accuracy, substantial reduction in pose estimation error (magnitude not precisely quoted in the abstract), and step-level success rising from <1% to ~15%. Full-object step accuracy reaches 42% — a figure that sounds modest but represents a qualitative regime shift from "essentially random" to "meaningfully guided."
Open questions the paper likely doesn't fully resolve: how error compounds across a full assembly sequence (if step errors are roughly independent, 42% per-step accuracy implies near-zero full-object completion for anything beyond a few steps), whether World Feedback generalizes to out-of-distribution geometries, and how the benchmark's difficulty distribution maps to real robotic manipulation constraints (gripper tolerances, occlusion, physical compliance). The absence of a robot-in-the-loop evaluation is the most obvious gap: sim-to-real transfer for pose estimation at brick-level precision is non-trivial. Still, as a pure vision-language capability study, the delta is hard to dismiss.
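The compounding-error concern can be made concrete with a back-of-the-envelope calculation. The sketch below assumes every step succeeds independently with the same probability, which is a simplification the source does not confirm (real errors are likely correlated, and an early mistake may make later steps harder). The function name `full_build_probability` is hypothetical.

```python
def full_build_probability(step_accuracy: float, num_steps: int) -> float:
    """Chance that every step of a build is correct.

    Assumes independent, identically likely step outcomes, so the
    probability is simply step_accuracy raised to num_steps.
    """
    return step_accuracy ** num_steps


# 0.42 is the per-step figure quoted in the article.
for n in (1, 3, 5, 10, 20):
    print(f"{n:>2} steps: {full_build_probability(0.42, n):.6f}")
```

Under this assumption a 3-step build completes about 7% of the time and a 5-step build about 1.3% of the time, which is why a per-step figure on its own says little about whole-object success.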
Reality meter
Multimodal LLMs can acquire meaningful brick assembly skills — selection and pose estimation — through physically grounded training, as demonstrated by a >3× accuracy gain and a step-success jump from <1% to ~15%.
- BC-Bench is introduced as the first benchmark for evaluating MLLMs on assembly with diverse (non-uniform) bricks.
- Baseline state-of-the-art MLLMs achieve less than 1% strict step-level assembly success on BC-Bench.
- Brick-Composer improves brick selection accuracy by over three times compared to baseline MLLMs.
- Strict step-level assembly success rises from less than 1% to around 15% after Brick-Composer training.
- A Qwen-3-8B model trained with Brick-Composer correctly completes approximately 42% of steps for a full object assembly.
- If step errors are roughly independent, 42% per-step accuracy implies near-zero full-object completion for multi-step assemblies due to compounding errors, so the headline number flatters the practical capability.
- No robot-in-the-loop evaluation is described; sim-to-real transfer for brick-level pose estimation precision remains an open and non-trivial gap.
- Pose estimation error reduction is described as 'substantial' without a precise magnitude quoted in the abstract, making independent calibration of the improvement difficult.
Results are grounded in a concrete benchmark with quantified before/after metrics on a real model (Qwen-3-8B); the sub-1% baseline is a credible sanity check, not a strawman.
The abstract is measured — it explicitly calls current MLLMs 'far from reliable builders' and frames 42% step accuracy as a first step, not a solved problem.
A >3× selection gain and a step-success improvement of more than 15× (from under 1% to ~15%) on a new benchmark point to a genuine capability gain, but the gap to practical robotic assembly remains large and unaddressed in this work.
- 1 source on file
- Avg trust 90/100
Glossary
- 6-DoF pose estimation
- The task of determining an object's complete 3D position and orientation in space, where DoF stands for degrees of freedom (three for position, three for rotation). In brick assembly, this means predicting exactly where and how each brick should be placed.
- MLLM
- Multimodal Large Language Model — an AI system that processes and reasons about both text and visual information (images) together. MLLMs can understand images and answer questions about them in natural language.
- World Feedback
- A training signal that shows the model the actual visual and physical consequences of its predicted actions, allowing it to learn from the real or simulated outcomes of its placement decisions rather than just from static examples.
- Affordance-rich demonstrations
- Training examples that teach not just the geometric or visual properties of objects, but also the underlying intent and purpose behind how they should be used or assembled — in this case, showing construction intent rather than just shape information.
- Sim-to-real transfer
- The challenge of taking a model trained in simulation (virtual environments) and making it work reliably in the real physical world, where factors like friction, sensor noise, and material properties differ from the simulation.