Mechanical Assembly Readiness Benchmark

AI can assemble machines in CAD. MARB is the missing yardstick for whether the result is correct and buildable: a tool-independent, automated benchmark graded by CADCLAW and mapped onto the readiness scales industry already trusts (TRL, MRL, IRL).

Benchmark, grades, and write-ups are open (papers & data). White paper review copies: [email protected]

Sunnyday Technologies
L0–L7 · Capability ladder
0 · Alignment failures (L1)
95/100 · Black-box geometry grade
~100 · Parts placed & verified

The gap

AI-CAD benchmarking is an active field. There are strong public benchmarks for generating geometry (ABC, DeepCAD, Fusion 360 Gallery, CADBench, Text2CAD-Bench), classifying parts (MCB), and predicting how two parts join (JoinABLe, AutoMate). None grades the ability to produce a functional mechanical assembly.

No public benchmark asks whether a complete, multi-part machine an AI assembled is correct and buildable: right parts, no collisions, nothing floating, every interface aligned across the whole system, judged automatically, independent of the authoring tool, and tied to a readiness level. That is the niche MARB fills.

Honest scope. Joint-prediction datasets score part pairs; generation benchmarks score single-part fidelity; none map to TRL/MRL/IRL. Prior-art survey current as of 2026-05.

The ladder

The benchmark is a ladder. Each rung is defined by the next thing an automated verifier must be able to prove: you can only benchmark what you can verify.

L0 · Component

One part is exactly as specified.

L1 · Assemble the kit (today)

Parts placed, aligned, no collisions, nothing floating. The Model T.

L2 · Constraint-robust

Re-solves correctly when parameters change.

L3 · Mechanically valid

Full-travel kinematics + load-holding, measured.

L4 · Engineering change

Re-design to a new requirement, no regressions.

L5 · Design from intent

Pick & arrange parts from functional goals. Tesla.

L6 · Optimize & invent

Provably beat the human baseline, multi-physics.

L7 · Autonomous loop

Design → build → measure → certify → self-improve.

Today the bar is L1. L6–L7 are not achievable by anyone yet; that gap is the measure of the road.

Capability × Readiness

The most useful artifact MARB offers: every rung mapped onto TRL (Technology Readiness), MRL (Manufacturing Readiness), and the integration-focused IRL, so anyone who already speaks readiness levels can place an AI result instantly.

The CADCLAW capability ladder (L0–L7) mapped to TRL, MRL, and IRL evidence bands

CADCLAW capability ladder ↔ TRL / MRL / IRL: indicative evidence bands, not equivalence.

Two axes, one bridge. TRL/MRL/IRL measure a machine's readiness, assigned by an authority. The ladder measures the AI's capability. CADCLAW is the bridge: it auto-generates the integration evidence those gates consume. The tightest native fit is IRL: CADCLAW's checks (interference, alignment, floating) are integration-readiness evidence.

What it does not claim. CADCLAW supports a readiness assessment; it does not assign a TRL. Upper bands also need operational and production data that design-time verification cannot supply alone.

How the benchmark runs

One task, any tool, one grader. Every AI workflow gets the same prompt + kit and is judged on its exported STEP by the same automated, tool-independent gates; the score is the only thing that differs.

MARB pipeline: inputs to AI driver to exported STEP to CADCLAW gates to MARB score and readiness to human review

The MARB pipeline: inputs → AI driver → exported STEP → CADCLAW gates → MARB score + readiness → human review, with a read-fix-rerun loop.

The metrics

Every run yields the same small, defined set of numbers. They come in two kinds that are never mixed: artifact quality (is the build correct?) and effort (what did it cost?).

Gates

Black-box, on the exported STEP: Inventory (right parts) · Interference (no overlaps) · Floating (all connected) · Orientation (correct pose, v-next).
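As a rough illustration of what the first three gates check, here is a minimal sketch. It assumes each placed part has already been reduced to an axis-aligned bounding box, which is a deliberate simplification: the real gates run on the exported STEP geometry, and their algorithms are not described here. Every name below (`Box`, `inventory_gate`, `interference_gate`, `floating_gate`) is hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Box:
    """A placed part reduced to an axis-aligned bounding box (an assumption)."""
    name: str
    kind: str
    lo: tuple  # (x, y, z) minimum corner, mm
    hi: tuple  # (x, y, z) maximum corner, mm


def overlap_volume(a, b):
    """Volume shared by two boxes; 0.0 if they only touch or are apart."""
    vol = 1.0
    for axis in range(3):
        extent = min(a.hi[axis], b.hi[axis]) - max(a.lo[axis], b.lo[axis])
        if extent <= 0:
            return 0.0
        vol *= extent
    return vol


def touches(a, b, tol=0.01):
    """True if the boxes overlap or come within `tol` mm on every axis."""
    return all(
        a.lo[i] <= b.hi[i] + tol and b.lo[i] <= a.hi[i] + tol for i in range(3)
    )


def inventory_gate(parts, kit):
    """Right parts: the multiset of placed kinds must equal the kit."""
    return Counter(p.kind for p in parts) == Counter(kit)


def interference_gate(parts):
    """No overlaps: list every clipping pair with its overlap volume (mm^3)."""
    clips = []
    for a, b in combinations(parts, 2):
        vol = overlap_volume(a, b)
        if vol > 0:
            clips.append((a.name, b.name, vol))
    return clips


def floating_gate(parts):
    """All connected: names of parts not reachable from the first part."""
    if not parts:
        return []
    seen = {parts[0].name}
    frontier = [parts[0]]
    while frontier:
        current = frontier.pop()
        for other in parts:
            if other.name not in seen and touches(current, other):
                seen.add(other.name)
                frontier.append(other)
    return [p.name for p in parts if p.name not in seen]


# Made-up four-part example: two butted beams, a post that clips both,
# and a bracket placed far away from everything else.
parts = [
    Box("beam_a", "beam", (0, 0, 0), (100, 20, 20)),
    Box("beam_b", "beam", (100, 0, 0), (200, 20, 20)),
    Box("post", "post", (90, 0, 10), (110, 20, 110)),
    Box("bracket", "bracket", (500, 500, 500), (510, 510, 510)),
]
clips = interference_gate(parts)
floating = floating_gate(parts)
```

In this toy case the inventory matches a kit of two beams, one post, and one bracket; the post is reported as clipping each beam; and the bracket is reported as floating. Bounding boxes over-report interference for non-box shapes, so a real grader needs exact solid intersection.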

L1 sub-grade · 0–100

How well the kit is assembled, multiplied by a buildability factor that falls with interference (clip count + overlap volume). A self-intersecting frame can't be built, so it can't score high.
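The shape of that calculation can be sketched as follows. The canonical formula lives in the scoring spec and is not reproduced here, so the functional form and both weights below are illustrative assumptions only, and the function names are hypothetical.

```python
def buildability_factor(clip_count, overlap_volume_mm3,
                        clip_weight=0.05, volume_weight=1e-5):
    """Return a factor in (0, 1] that shrinks as interference grows.

    The 1 / (1 + penalty) form and both weights are assumptions made
    for illustration; they are not the MARB v0.9 formula.
    """
    if clip_count < 0 or overlap_volume_mm3 < 0:
        raise ValueError("interference measures must be non-negative")
    penalty = clip_weight * clip_count + volume_weight * overlap_volume_mm3
    return 1.0 / (1.0 + penalty)


def l1_subgrade(assembly_score, clip_count, overlap_volume_mm3):
    """Scale a 0-100 assembly score by the buildability factor."""
    if not 0.0 <= assembly_score <= 100.0:
        raise ValueError("assembly_score must be within 0-100")
    return assembly_score * buildability_factor(clip_count, overlap_volume_mm3)


# Same assembly score, three made-up interference levels.
clean = l1_subgrade(95.0, 0, 0.0)            # no clips: score unchanged
near_miss = l1_subgrade(95.0, 2, 1_000.0)    # a couple of small clips
mess = l1_subgrade(95.0, 10, 50_000.0)       # many clips, large overlap
```

With any monotonically decreasing factor of this kind, a clean build keeps its full assembly score and a lightly clipping build outranks a heavily clipping one, which is the ordering the sub-grade is meant to enforce.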

MARB full-stack · 0–100

Rung on the L0–L7 capability ladder. A clean L1 ≈ 15 today; the rest of the scale is the road to autonomous design. Mapped to TRL / MRL / IRL.

Effort (reported separately)

Wall-clock time, tokens, attempts, retries, corrections, and human interventions: all measured from the run log, never folded into the score.
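One way to keep the two kinds of number apart is to store them in separate fields and derive cost only from the effort side. The sketch below assumes hypothetical field names and reads the list-price figures quoted in the scoreboard note (~$15 / $75 per M tokens, $1.50 cache-read) as input, output, and cache-read rates; that reading is an assumption, and the token counts in the example are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Effort:
    """Effort measures for one run; field names are hypothetical."""
    wall_clock_min: float
    input_tokens: int
    output_tokens: int
    cache_read_tokens: int
    attempts: int
    retries: int
    corrections: int
    human_interventions: int


@dataclass(frozen=True)
class RunReport:
    """Artifact score and effort sit side by side and are never combined."""
    artifact_score: float
    effort: Effort


# USD per million tokens, assumed to mean input / output / cache-read.
PRICE_PER_MTOK = {"input": 15.00, "output": 75.00, "cache_read": 1.50}


def estimated_cost_usd(effort, prices=PRICE_PER_MTOK):
    """Estimate the bill for a run from its token counts alone."""
    total = (
        effort.input_tokens * prices["input"]
        + effort.output_tokens * prices["output"]
        + effort.cache_read_tokens * prices["cache_read"]
    )
    return total / 1_000_000


# Invented token counts, chosen only to make the arithmetic easy to follow.
example = RunReport(
    artifact_score=80.0,
    effort=Effort(
        wall_clock_min=30.0,
        input_tokens=1_000_000,
        output_tokens=200_000,
        cache_read_tokens=10_000_000,
        attempts=1, retries=0, corrections=0, human_interventions=0,
    ),
)
cost = estimated_cost_usd(example.effort)
```

Here each of the three token classes contributes $15, so the estimate is $45. The artifact score is untouched by the calculation.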

Same task, same kit, same grader; the only variable is the driver. That makes MARB a clean performance-vs-model-capability axis: every model (frontier, local, or agentic system) is one point on the same graph.

Today: L1, on a real machine

An AI placed roughly 100 parts into a correct, two-metre 3D-printer / CNC frame:

The reference above is the resolver-built answer key, a clean L1. Below: the first cross-tool head-to-head, in which independent AI workflows were given the same task and judged by this same grader.

The board, 2026-06-11

Nine graded builds of the same 100-part machine, from frontier hosted models down to a local open-weight anchor, ranked by GAP median on one ladder. The board updates as runs land; every cell's provenance (model, tool, timing, tokens, attempts) is in the open run registry.

MARB v0.9 scoreboard: nine AI builds ranked by GAP median, from Claude Opus 4.7 at 0.0 mm down to the local 80B open-weight model at 410 mm

The full board. Claude Fable 5's multi-agent ultra run holds rank 3 and the board-best relative position; the local 80B rows carry n=5 error bars.

The Angry Millimeter chart: billed tokens versus GAP median for six frontier builds, with Claude Fable 5 at medium effort in the value corner

The Angry Millimeter: what tokens cost today. Bills recovered from session transcripts; full ledger and method in the recap.

Deep dives: the recap article (findings to date + the token ledger), the studies log (the Fable 5 effort sweep and the graded local open-weight anchor), and the first-results write-up (the original three-way head-to-head, preserved below as the founding study).

First head-to-head (the founding study, 2026-05-26)

Three independent AI workflows: Claude Opus 4.7 (Fusion and CadQuery) and OpenAI Codex / GPT-5 (CadQuery; the coding agent, not the chat app). Each ran in a fresh prompt-only session with the same kit and was graded through the identical black-box lens. All three placed every part with zero human help; none is buildable yet.

The target M3-CRETE machine, the single reference image each AI was given

The goal image provided.

GAP median versus build time for all seven frontier builds; Claude Opus 4.7 on CadQuery hit 0.0 mm in 49 minutes, OpenAI Codex was fastest at 13 minutes

Gap correctness vs. speed, now across all seven frontier builds. OpenAI Codex one-shotted in 13 minutes; the Fable 5 effort sweep clusters mid-board; the ultra run trades 83 minutes for rank 3.

# | Model · tool | GAP median ↓ | ORIENT aligned ↑ | POS rel median ↓ | Time | Cost est.
1 | Claude Opus 4.7 · CadQuery | 0.0 mm | 51% | 49.9 mm | 48.8 min | ~$68
2 | Claude Opus 4.7 · Fusion | 2.0 mm | 47% | 47.7 mm | 33.7 min | ~$174
3 | GPT‑5 Codex · CadQuery | 7.8 mm | 69% | 47.2 mm | 13.0 min | not reported
n/a | CADCLAW reference (answer key) | 0.0 mm | 100% | 0.0 mm | resolver | n/a

MARB v0.9. GAP median = error vs the answer key's intended interface gap (≈0 mm bolted, ≈1 mm motion clearance). ORIENT aligned = % of asymmetric parts in the correct rotation. POS rel median = position error relative to each part's neighbours (frame‑invariant). Tolerances are tight by design (≤5 mm = located, ≤5° = aligned); none buildable yet. Cost est. at Anthropic Opus list price (~$15 / $75 per M tokens, $1.50 cache‑read); Codex CLI didn't report tokens. Results 2026‑05‑26 · CadQuery 2.7.0, Autodesk Fusion 2702.1.58 · graded per the MARB v0.9 scoring spec.
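Two of the definitions in the note above translate directly into a few lines of code. This is a minimal sketch of GAP median and ORIENT aligned as described there, not the grader's implementation: the input shapes (gap pairs, per-part rotation errors) are assumptions, the function names are hypothetical, and POS rel median is left out because it needs each part's neighbour set.

```python
from statistics import median

ALIGNED_TOL_DEG = 5.0  # from the note: "≤5° = aligned"


def gap_median(interfaces):
    """Median absolute error between measured and intended gaps, in mm.

    `interfaces` is a list of (measured_gap_mm, intended_gap_mm) pairs,
    e.g. intended 0.0 for a bolted joint or 1.0 for motion clearance.
    """
    return median(abs(measured - intended) for measured, intended in interfaces)


def orient_aligned_pct(rotation_errors_deg, tol_deg=ALIGNED_TOL_DEG):
    """Percent of asymmetric parts whose rotation error is within tolerance."""
    if not rotation_errors_deg:
        raise ValueError("need at least one asymmetric part")
    aligned = sum(1 for err in rotation_errors_deg if err <= tol_deg)
    return 100.0 * aligned / len(rotation_errors_deg)


# Invented measurements for three interfaces and four asymmetric parts.
gaps = [(0.0, 0.0), (1.5, 1.0), (3.0, 0.0)]
rotations = [0.0, 4.9, 90.0, 180.0]
gap_result = gap_median(gaps)
orient_result = orient_aligned_pct(rotations)
```

For the invented inputs the gap errors are 0, 0.5, and 3 mm, giving a median of 0.5 mm, and two of the four parts fall within 5°, giving 50%. Using the median rather than the mean keeps one badly placed part from dominating the gap figure.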

Every flagged clip is structural: beams overlapping at splice joints and post/frame junctions, and the centered 2040 inserts overlapping their beams. Both also placed the Z-posts in the wrong rotation versus the reference. That is an orientation error the current gates don't yet catch, and the target of the next gate we're adding.

Interference is the buildability gate. A self-intersecting frame can't be built, so interference scales the L1 score down: a near-miss outranks a mess. Effort (time, tokens, attempts) is reported separately, never folded into the artifact score.

Fairness note. CADCLAW and its placement resolver were first built around CadQuery, before the Fusion connection (MCP) existed, so the CadQuery runs may carry a home-field edge in tooling maturity. We flag it so the comparison stays honest; an orientation gate and more reference tasks will tighten it.

Claude-Fusion build progression from 10 to 100 parts

Claude-Fusion, building the frame (10 → 100 parts). The model emits no parametric timeline, so we recovered the build order afterward by driving the live Fusion model through its MCP, revealing the placed parts in order under a fixed isometric camera.

Grid of in-process CAD review renders the Claude-CadQuery driver generated while assembling the machine

Claude-CadQuery's own in-process renders: the orthographic + isometric checks it generated as it built. This is the human-reviewable output that lets a watcher catch a bad run early and stop it to save tokens.

What we asked, and what we didn't

We gave the goal, not the method. The driver got the target (the pictured assembly plus design constraints) and the kit, but not the build sequence. The original human-guided build specified an inside-out order (X axis → Y → Z-posts) and detailed steps; here we deliberately withheld that and let the AI decide how to reach the pictured result.

So MARB really tests whether a model understands the mechanical system: here a V-slot-and-roller gantry, a basic, well-documented pattern. A build that places all 100 parts but self-intersects, with mis-oriented posts, shows a model that recognizes the parts but not yet how they go together.

Human-reviewable throughout. Both drivers emitted orthographic + isometric renders as they built, so a person could watch the assembly take shape and abort a bad run early to save tokens. (Fusion didn't persist its in-session views, so we recovered the progression afterward straight from the live model via its MCP; see above.)

One point on a bigger graph, and the spread is instructive. The three frontier runs took 13–49 minutes for one ~100-part assembly. The model that iterated most (reviewing its own renders, re-extracting real hole patterns) scored highest but ran longest; the one-shot run was nearly 4× faster and scored lowest. More careful self-review buys accuracy at the cost of time and tokens. The local-anchor point has since landed as a 30-run, graded study of an 80B open-weight model (studies log), along with a four-setting reasoning-effort sweep and the token ledger.

Why it matters, in plain English

Checking a design for mistakes is one of the most common, and costly, jobs in engineering. The longer a mistake hides, the more it costs. A widely used rule of thumb, the 1-10-100 rule, puts it this way: a mistake caught while designing costs about $1 to fix; caught while building, about $10; caught after it ships, $100 or more.

Today that checking is mostly done by hand: an engineer rotating a model on screen, hunting for parts that overlap or don't line up. It's slow, it's easy to miss things, and it doesn't scale. The hours and the escaped mistakes are pure lost value.

Why now, and not a year ago? Two things just arrived at the same time: AI that can actually drive CAD and assemble parts, and an automated checker (CADCLAW) that can prove an assembly is right, instantly, the same way every time. Put them together and you can both build and verify by machine. MARB is the scoreboard that keeps it honest.

For engineers

A pytest-style, tool-independent check that an assembly is buildable, on every change, not just at sign-off.

For programs

Automated, auditable readiness evidence that plugs into the TRL/MRL/IRL gate reviews you already run.
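To make the "pytest-style" idea under For engineers concrete, here is a minimal sketch of a buildability check written as an ordinary test function. The report dictionary and its keys (`missing_parts`, `clips`, `floating`) are hypothetical stand-ins for whatever a grader emits; no actual CADCLAW interface is implied.

```python
def assert_buildable(report):
    """Fail with a message naming the first gate that did not pass."""
    assert report["missing_parts"] == [], (
        f"inventory gate: missing {report['missing_parts']}"
    )
    assert report["clips"] == [], (
        f"interference gate: {len(report['clips'])} clipping pair(s)"
    )
    assert report["floating"] == [], (
        f"floating gate: unconnected {report['floating']}"
    )


def test_frame_is_buildable():
    # In a real suite this report would be produced by grading the
    # exported STEP; here it is a hand-written passing example.
    report = {"missing_parts": [], "clips": [], "floating": []}
    assert_buildable(report)
```

Because the check is a plain function with plain assertions, a test runner such as pytest can collect and run it on every change, and a failing gate shows up as a named test failure rather than a sign-off surprise.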

What makes it different

MARB is not "a CAD checker." These specifics separate it from CAD-vendor tooling and from prior benchmarks:

Open standard, proprietary engine. The ladder, rubric, schemas, and chart are open to measure and audit against; the gate algorithms and resolver are the engine. Adopt the standard without the IP, and trust the score because the method is open. MARB sits on top of every CAD tool; it doesn't compete with where designs are drawn. CADCLAW reads and proposes fixes; it never edits your model. The human stays the engineer.

Papers & data, all of it open

Every result on this site traces to a public artifact. The benchmark repo holds the scoring spec, the graders, the blind kits, the grades, and the full run registry with per-run provenance (model, tool, timing, tokens, attempts).

Artifact | What it is | Where
The Angry Millimeter | Recap article: findings to date + the token ledger | marb.cadclaw.io/recap
Studies log | Every experiment: Fable 5 effort sweep, graded local open-weight anchor, first results | marb.cadclaw.io/studies
Scoring spec (v0.9) | The canonical, versioned method behind every grade | MARB/spec
Frontier track write-ups | Claude tracks comparison + prompt-framework findings | comparison · findings
Local-anchor study | 30 runs, six prompt cohorts, graded: the full write-up | MARB/results
Grades & registry | Raw graded metrics + per-run provenance incl. recovered token bills | grades · registry
Blind kits & graders | Run your own model against the board (kits, harness, metrics) | MARB repo
CADCLAW engine | The open verification engine that grades every run | CADCLAW repo

Read & review

We are publishing the ladder and the readiness correlation as a candidate standard and inviting the field to measure against it.

Reviewers and collaborators welcome: labs, CAD vendors, and standards bodies especially.
