MARB studies

Each MARB study is a self-contained experiment on the same task, the same authored part kit, and the same automated CADCLAW grader. Only the driver, or the short note we give it, changes. This page is the running log, so every result has one place to live.

A note on terms. Buildability metrics (did the model export a loadable STEP, and how many part instances did it place) describe what a model built. They are not the grade, which scores how correctly each part is located, oriented, and gapped against the answer key. Grading is a separate step. Studies that report buildability say so plainly.

| Study | Driver | Date | Runs | Headline | Status |
|---|---|---|---|---|---|
| Sighted local cells | qwen3-vl 32B + Nemotron 3 Nano Omni (vision, goal image in-loop) | 2026-06-12 | 15 | Seeing the goal made it worse: the sighted 32B places ~15 parts at 873 mm GAP vs the blind 80B text model's ~90 parts at 272 mm. Nemotron Omni: 0/5 exports. | Published |
| Fable 5 effort sweep | Claude Fable 5 (CadQuery) | 2026-06-11 | 4 | One model, four effort settings up to a multi-agent ultra run. Effort is not monotonic, but ultra cut GAP to 3.0 mm and set the board-best relative position. | Published |
| Local open-weight anchor | qwen3-coder-next (80B, on one local box) | 2026-05-30 | 30 | Recursive prompt study, now graded. One CAD command took buildability from 1 in 5 to 5 in 5, but the graded result is parts placed 100 to 400 mm off: grouped, not jointed. | Published |
| First results, frontier models | Claude (Fusion, CadQuery), OpenAI Codex (CadQuery) | 2026-05-26 | 3 | Each placed about 100 authored parts from one photo. Scores 12 to 15 out of 100. None buildable yet. | Published |

Study: the Claude Fable 5 effort sweep

Driver: Claude Fable 5, reasoning effort low / medium / high, plus an ultra run (multi-agent: build, 4-agent adversarial audit, targeted fixes). CAD tool: CadQuery. Kit and grader identical to the first results. Blind runs, one per setting. Positional grades (MARB v0.9).

A new frontier model arrived, and it exposes a knob the first results could not: reasoning effort. Same model, same task, same kit; only the effort setting changes. Does paying for more thinking buy a better machine?

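| Effort | GAP median | ORIENT aligned | POS relative median | Wall-clock | Billed tokens |
|---|---|---|---|---|---|
| Ultra (multi-agent) | 3.0 mm | 47% | 30.4 mm | 83 min | 23.0M |
| Medium | 6.5 mm | 59% | 48.5 mm | 39 min | 4.7M |
| Low | 7.0 mm | 53% | 68.0 mm | 38 min | 4.2M |
| High | 7.0 mm | 49% | 38.1 mm | 45 min | 18.5M |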

Token bills were recovered post-run from the Claude Code session transcripts (per-message usage summed; the recovery script ships in the repo). High effort billed 3.9× medium for worse GAP and orientation: the non-monotonic curve has a price tag.
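The recovery step amounts to summing per-message usage over a transcript. What follows is a minimal sketch of that idea, not the script that ships in the repo. It assumes a JSONL transcript in which some lines carry a `message.usage` object with integer token fields; the field names and the sample records are hypothetical.

```python
import json
from collections import Counter


def sum_usage(lines):
    """Sum every integer field found under message.usage, line by line.

    Assumption: one JSON object per line, usage nested under "message".
    Lines without a usage object are skipped.
    """
    totals = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        message = record.get("message")
        usage = message.get("usage") if isinstance(message, dict) else None
        if not isinstance(usage, dict):
            continue
        for field, value in usage.items():
            # bool is a subclass of int in Python, so exclude it explicitly.
            if isinstance(value, int) and not isinstance(value, bool):
                totals[field] += value
    return dict(totals)


# Hypothetical three-line transcript, for illustration only.
sample = [
    '{"type": "assistant", "message": {"usage": {"input_tokens": 1200, "output_tokens": 300}}}',
    '{"type": "user", "message": {"content": "ok"}}',
    '{"type": "assistant", "message": {"usage": {"input_tokens": 800, "output_tokens": 150}}}',
]

totals = sum_usage(sample)
billed = sum(totals.values())
print(totals, billed)
```

Summing every integer field keeps the sketch independent of which token categories a given transcript reports; a real script would likely name the billed categories explicitly.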

Within the single-agent sweep the answer is not monotonic. Medium effort beat both its cheaper and its more expensive siblings on interface gaps and orientation; high effort bought better relative position and nothing else; low drifted most on position. Turning the same knob further does not help.
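The non-monotonic reading can be checked directly against the sweep table. The sketch below copies the low, medium, and high rows from that table and asks, per metric, whether each step up in effort is at least as good as the one before. The helper name and the dictionary keys are illustrative, not part of the grader.

```python
# Single-agent settings in increasing effort order; values copied from the sweep table.
sweep = [
    ("low",    {"gap_mm": 7.0, "orient_pct": 53, "pos_mm": 68.0, "tokens_m": 4.2}),
    ("medium", {"gap_mm": 6.5, "orient_pct": 59, "pos_mm": 48.5, "tokens_m": 4.7}),
    ("high",   {"gap_mm": 7.0, "orient_pct": 49, "pos_mm": 38.1, "tokens_m": 18.5}),
]


def improves_monotonically(values, lower_is_better=True):
    """True if every step is at least as good as the previous one."""
    pairs = list(zip(values, values[1:]))
    if lower_is_better:
        return all(later <= earlier for earlier, later in pairs)
    return all(later >= earlier for earlier, later in pairs)


gap = [row["gap_mm"] for _, row in sweep]
orient = [row["orient_pct"] for _, row in sweep]
pos = [row["pos_mm"] for _, row in sweep]

gap_monotonic = improves_monotonically(gap)                               # lower GAP is better
orient_monotonic = improves_monotonically(orient, lower_is_better=False)  # higher ORIENT is better
pos_monotonic = improves_monotonically(pos)                               # lower POS is better

# Token cost of high effort relative to medium.
high_vs_medium = round(sweep[2][1]["tokens_m"] / sweep[1][1]["tokens_m"], 1)

print(gap_monotonic, orient_monotonic, pos_monotonic, high_vs_medium)
```

Only relative position improves at every step; GAP and orientation both peak at medium, which is the sense in which the sweep is not monotonic.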

Changing the harness does. In the ultra run, the same model orchestrates subagents that probe every kit part, audit the v1 export adversarially (collision booleans, per-constraint arithmetic, visual fidelity against the reference images, BOM counts), then apply targeted fixes. That run cut GAP to 3.0 mm and set the best relative position on the entire board, 30.4 mm, ahead of every Opus and Codex run. The audit caught faults invisible at render scale: a motor shaft 4.9 mm short of its pinion, belts cutting 3.4 mm into an idler, a pulley installed hub-backwards. The price was roughly double the wall-clock and a 23.0M-token bill (about 497K of it from the audit subagents), within 4% of what Opus CadQuery billed for its nine self-review attempts.

On the overall board, the Fable 5 runs slot between Claude Opus 4.7 and GPT-5 Codex on the primary GAP metric, with ultra at rank 3. The pattern echoes the local-anchor study below from the other end of the capability curve: undirected budget, whether turns for a local model or a reasoning-effort setting for a frontier one, does not convert into a better assembly — structured verification does. The full board lives on the benchmark page; single runs each, so treat the deltas as directional until we add seeds. The recap article pulls every finding to date together, including what we can and cannot say yet about token use.

Study: the sighted local cells

Drivers: qwen3-vl:32b (vision, 8 turns n=5 + 12 turns n=2 preliminary) and nemotron3:33b Omni (vision, n=5), goal image inlined on turn 1, lean-v5 guidance. Same kit, grader, and harness as every other cell. Token capture native. Graded MARB v0.9.

Every model so far built the machine from a text brief plus reference images it had to ask for. These cells hand a vision model the goal image directly, in-loop, on turn one. The obvious hypothesis: seeing the target helps. It does not.

| Cell | Buildable | Parts placed (median) | GAP median | ORIENT | POS rel. |
|---|---|---|---|---|---|
| Blind text 80B (qwen3-coder-next, best cohort, n=9) | 9/10 | ~62 | 272 ± 149 mm | 12% | 118 ± 47 mm |
| Sighted 32B (qwen3-vl, 8 turns, n=5) | 5/5 | 12–17 | 873 ± 174 mm | 0% | 1005 ± 613 mm |
| Sighted 32B, 12 turns (n=2, preliminary) | 2/2 graded | 16–20 | 335 ± 138 mm | 5% | 260 ± 249 mm |
| Sighted Nemotron 3 Nano Omni (n=5) | 0/5 | | | | |
Figure: goal vs blind text build vs sighted vision build, one camera.

Study: the local open-weight anchor

Driver: qwen3-coder-next:q4_K_M (80B total, 3B active), text only, on a single local AI supercomputer. CAD tool: CadQuery 2.7.0. Kit: v1.1. Blind run, no internet, no memory of past work. 40 runs: 30 in six cohorts of five, plus five extension seeds for each of the two graded cohorts. Buildability metrics plus positional grades (MARB v0.9; graded cohorts now n = 9 and n = 8 loadable of 10 attempts each).

The cloud models in the first results are large and expensive. This study asks the honest floor question: how does a strong coding model that a small shop could own and run for free, on one machine, do on the same task? And once it is running, what short note actually helps it?

We ran it thirty times and changed only the brief operational note before each batch of five. Every note stayed inside the fairness wall: it clarified the CAD tool or the task, and never revealed the reference design.

| Cohort | Note given | Turn budget | Buildable file | Parts placed (median) |
|---|---|---|---|---|
| A | None (control) | 8 | 1 of 5 | 15 |
| B | Correct CAD export idiom | 8 | 5 of 5 | 98 |
| C | B, plus a build-volume clarification | 8 | 3 of 5 | 28 |
| D | B, plus design-goal requests | 8 | 2 of 5 | 24 |
| E | Same as D | 14 | 4 of 5 | 30 |
| F | Lean note, the export idiom sharpened | 8 | 5 of 5 | 84 |

Target is about 100 placed part instances. The kit holds authored STEP parts; the model places each one and exports a single STEP file.

What the recursion taught us

The grades: placed, not jointed

Updated 2026-06-12 with five extension seeds per cohort: the original "5 of 5 buildable" did not survive more seeds — combined buildability is 9/10 (mechanics v2) and 8/10 (lean v5), and the spreads widened. That is what seeds are for; single-cohort rates flatter. The graded aggregates below use all loadable seeds.
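One way to see why a 5 of 5 rate flatters is to put an interval around it. The sketch below uses the Wilson score interval, a standard choice for small-sample proportions; it is offered as an illustration and is not part of the MARB grader or the published analysis. The function name and the 95% level (z = 1.96) are assumptions.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, centre - half), min(1.0, centre + half)


# Buildability counts quoted in this study.
original = wilson_interval(5, 5)        # either cohort before extension seeds
mechanics_v2 = wilson_interval(9, 10)   # after extension seeds
lean_v5 = wilson_interval(8, 10)        # after extension seeds

for label, (low, high) in [("5 of 5", original), ("9 of 10", mechanics_v2), ("8 of 10", lean_v5)]:
    print(f"{label}: {low:.2f} to {high:.2f}")
```

Under these assumptions the 5 of 5 interval already reaches down to a little under 60%, so the later 9 of 10 and 8 of 10 results sit comfortably inside what five seeds could support.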

The two cohorts that export reliably have now been graded with the same GAP, POS, and ORIENT rubric as the frontier track, so the numbers are directly comparable. Read them with the buildability table above: the model places roughly the right number of parts at roughly the right scale, and the grades quantify how far those parts sit from a machine.

| Cohort | GAP median | ORIENT aligned | POS relative median |
|---|---|---|---|
| Mechanics v2 (n=9 of 10) | 272 ± 149 mm | 12% | 118 ± 47 mm |
| Lean v5 (n=8 of 10) | 341 ± 133 mm | 20% | 233 ± 139 mm |
| Frontier range, for scale | 0.0 to 7.8 mm | 47 to 69% | 38 to 68 mm |
Figure (three panels): the goal machine next to the two best local builds, which place similar parts loosely rather than as a jointed frame. Right inventory, right scale, wrong assembly.

What this study does not yet answer

Method, harness, grades, and the full run catalogue are in the open-source repository. Correctness grading uses the same CADCLAW gates as every other MARB run.