Evals & Benchmarking — n8n Workflow Generation

The Problem

Every AI tool claims it makes the model better. Almost none prove it. For a tool that generates n8n workflows — JSON that either imports and runs or fails — "better" is measurable, so there's no excuse not to measure it. I wanted to answer three questions with evidence, not vibes: does my n8n Knowledge plugin actually produce more valid workflows than the leading alternative? At what cost? And when I change the plugin, do the numbers go up or down?

So before optimizing anything, I built the measurement instrument.

What I Measured

The harness runs 128 real-world workflow prompts spanning three difficulty groups (from simple single-node tasks to complex multi-integration agent wiring) on three AI models, against both my plugin and the most popular community tool in the space, the excellent n8n-mcp MCP server (21k+ GitHub stars). Same prompts, same models, same scoring: the only variable is the tool.

A real engine as judge. Every generated workflow from both tools is validated by a real validation engine (the validator bundled with n8n-mcp), not a heuristic and not an LLM's opinion of correctness. Valid means it passes that validator, which is a strong proxy for importing cleanly.
A second, semantic judge. Schema-validity isn't intent-fidelity. A blinded Claude Opus LLM-judge scores whether each result actually did what the prompt asked — a dimension the validator structurally cannot see. Verdicts are cached and the judge runs with no knowledge of which tool produced which workflow. Known-bug avoidance is scored separately, by 28 deterministic rules (one per bug) that read the workflow JSON, so it stays auditable rather than resting on the judge.
Pitfall coverage. Beyond "does it validate," I measure whether the tool warned the user about known n8n bugs before they hit them in production.
Cost & overhead. Every run records dollar cost, tool-call turns, and input tokens — because a tool that's marginally more accurate but 2× the cost is not obviously better.
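
To make the deterministic pitfall scoring concrete, here is a minimal sketch of what one such rule could look like. The rule, its ID, and the option it inspects are hypothetical and are not taken from the plugin's actual catalog. The only assumption about n8n is the usual workflow export shape: a top-level `nodes` list whose entries carry `type` and `parameters`.

```python
import json
from typing import Callable

# A rule reads the parsed workflow and returns True when the pitfall is avoided.
Rule = Callable[[dict], bool]

HTTP_REQUEST = "n8n-nodes-base.httpRequest"


def http_requests_set_timeout(workflow: dict) -> bool:
    """Hypothetical rule: every HTTP Request node sets an explicit timeout option."""
    for node in workflow.get("nodes", []):
        if node.get("type") != HTTP_REQUEST:
            continue
        options = node.get("parameters", {}).get("options", {})
        if "timeout" not in options:
            return False
    return True


# Rule IDs are placeholders; the real catalog's IDs are not shown in this write-up.
RULES: dict[str, Rule] = {"EXAMPLE-HTTP-TIMEOUT": http_requests_set_timeout}


def score_pitfalls(workflow_json: str, rule_ids: list[str]) -> dict[str, bool]:
    """Apply only the rules relevant to a prompt; the result is a plain audit trail."""
    workflow = json.loads(workflow_json)
    return {rule_id: RULES[rule_id](workflow) for rule_id in rule_ids}
```

Because each rule is a pure function of the workflow JSON, a reviewer can rerun it and get the same answer every time, which is the property that keeps this metric independent of the LLM judge.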

Iterate-to-Improve: the part that actually matters

Having a benchmark is table stakes. Using it to make decisions — and letting it overrule you — is the real discipline. An earlier version of the plugin lost the accuracy race on harder prompts. Closing that gap was measured engineering: validator feedback loops, node-spec injection on validation errors, automatic fixes for common structural mistakes, and a per-session validation budget that teaches the model to batch fixes instead of thrashing. Every one of those shipped only because the evals said it helped.

The harness also killed ideas. Several promising-sounding changes were reverted because the numbers went the wrong way — a model-invocation timeout that biased slow runs, a "static" validation-budget mode that made the repair loop worse on every run. Without the harness those would have shipped on intuition. With it, they got cut.
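
The per-session validation budget mentioned above can be pictured roughly as follows. This is only a sketch of the general idea: the class, the `limit` parameter, and the shape of the returned dict are all assumptions, and the plugin's real mechanism is not described in this write-up.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ValidationBudget:
    """Caps validator calls within one session. The limit is an assumed parameter."""
    limit: int
    used: int = 0

    @property
    def remaining(self) -> int:
        return max(self.limit - self.used, 0)

    def try_spend(self) -> bool:
        """Consume one validation call if any are left."""
        if self.used >= self.limit:
            return False
        self.used += 1
        return True


def validate_with_budget(
    budget: ValidationBudget,
    validate: Callable[[dict], list[str]],
    workflow: dict,
) -> dict:
    """Run the validator if the budget allows, and report what is left.

    Returning the remaining count next to the errors is what nudges the
    caller to fix every reported error before spending another call.
    """
    if not budget.try_spend():
        return {"ran": False, "errors": [], "remaining": 0}
    errors = validate(workflow)
    return {"ran": True, "errors": errors, "remaining": budget.remaining}
```

The design point is that the model sees a shrinking counter with every validator response, so fixing one error per call becomes visibly wasteful.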

Honest Scope & Limitations

The strongest signal of taking evals seriously is knowing what your eval does not prove. Stated plainly:

The validator measures n8n-mcp compliance, not a live n8n import. The judging engine is n8n-mcp's bundled validator. It checks workflows against the node definitions n8n itself ships, which makes it a strong proxy, but "valid" here means "passes the validator," not "I uploaded it to a running n8n instance and watched it execute." That's a deliberate, disclosed trade-off for reproducibility and speed.
The pitfall prompts share provenance with the pitfall catalog. Some prompts are derived from the same known-bug corpus the plugin recalls from, so the pitfall-avoidance metric flatters the tool that draws from that corpus. I report it as a directional signal, not a clean win.
The judge is an LLM. The Opus intent-judge is blinded and cached, but it's still a model scoring models. I treat its scores as a second opinion alongside the deterministic validator, never as ground truth on its own.
Sample and recency. The numbers below take the newest run per prompt rather than averaging multiple samples: this is a coverage study, not a variance study. The Sonnet gate-ON row is a full 128-prompt run on the current shipped plugin code; the DeepSeek rows use the latest available result per prompt; n8n-mcp is version-independent (it doesn't run my plugin), so it's a fair comparison cell. Models and both tools move; a benchmark is a snapshot, and this one is dated on purpose.

The Numbers

These come straight from the harness's own reporting script — no hand-picked figures, pulled from the results database. The basis is the newest run per prompt, integrity-cleaned (timeout-truncated runs excluded), every condition driven by the identical shared system prompt, scored at 100% judge coverage. Snapshot date: June 24, 2026.
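
The "newest run per prompt, integrity-cleaned" basis can be sketched as below. The record fields and the condition label are hypothetical, not the results database's real schema, and the sketch assumes truncated runs are dropped before the newest run is chosen.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    prompt_id: str
    condition: str      # e.g. "plugin-gate-on" (label is a placeholder)
    finished_at: str    # ISO 8601 UTC timestamp, so string order is time order
    timed_out: bool     # True when the run was truncated by a timeout


def newest_clean_runs(runs: list[Run]) -> dict[tuple[str, str], Run]:
    """Keep the newest non-truncated run for each (prompt, condition) pair."""
    newest: dict[tuple[str, str], Run] = {}
    for run in runs:
        if run.timed_out:
            continue  # integrity cleaning: truncated runs never enter the report
        key = (run.prompt_id, run.condition)
        current = newest.get(key)
        if current is None or run.finished_at > current.finished_at:
            newest[key] = run
    return newest
```

Filtering before selecting means a prompt whose latest attempt timed out falls back to its most recent complete run instead of vanishing from the report.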

I read the results as a funnel — each stage is a stricter bar than the last:

valid%: passes the n8n-mcp schema validator, a strong proxy for a clean import.
correct%: valid and actually accomplishes what the prompt asked (blinded Opus judge).
works%: correct and designs around the relevant known n8n bug, so it won't silently fail in production. This is the headline metric.
Pitfall avoidance%: of the 28 prompts that touch a known n8n bug, the share where the workflow designed around the bug rather than walking into it (scored by 28 deterministic rules, not the judge).
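
As a rough illustration of how the funnel stages nest, the sketch below computes the three rates from per-prompt outcomes. The `Outcome` type is hypothetical, and it assumes a prompt with no relevant known bug passes the final stage automatically.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    valid: bool        # passed the deterministic validator
    intent_ok: bool    # blinded judge said it does what the prompt asked
    pitfall_ok: bool   # relevant rule passed, or no known bug applies (assumption)


def funnel(outcomes: list[Outcome]) -> dict[str, float]:
    """Each stage requires every earlier stage, so the rates can only fall."""
    total = len(outcomes)
    if total == 0:
        raise ValueError("no outcomes to score")
    valid = sum(o.valid for o in outcomes)
    correct = sum(o.valid and o.intent_ok for o in outcomes)
    works = sum(o.valid and o.intent_ok and o.pitfall_ok for o in outcomes)
    return {
        "valid%": 100 * valid / total,
        "correct%": 100 * correct / total,
        "works%": 100 * works / total,
    }
```

Every rate shares the same denominator (all prompts in the battery), which is why works% can never exceed correct%, and correct% can never exceed valid%.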

The plugin ships with a pitfall gate — one knob that trades cost against how aggressively it surfaces known n8n bugs. Both settings are shown against the n8n-mcp baseline. Leading with the ship-default backend, Claude Sonnet 4.6 — a full 128-prompt run on the current shipped plugin code:

Claude Sonnet 4.6 · 128-prompt battery · newest run / prompt · Jun 24 2026
Condition	valid%	correct%	works% (headline)	Pitfall avoidance% (of 28 bug prompts)	$ / run	tool turns	time (mean / median)
Plugin — gate ON (ship default)	94%	93%	80%	39%	$0.75	9.8	375s / 269s
Plugin — gate OFF (cheaper pitfall trade)	98%	93%	79%	32%	$0.90	9.7	526s / 242s
n8n-mcp (baseline)	72%	70%	59%	32%	$1.26	19.4	601s / 367s
Raw model (no tools · status quo · 78/128)	26%	26%	26%	29%	$0.58	3.7	359s / 156s

On the headline works% metric, the plugin's gate-ON default lands at 80% vs the baseline's 59% (+21pp), and it gets there while being ~40% cheaper ($0.75 vs $1.26) and running ~49% fewer tool turns (9.8 vs 19.4). It beats the baseline in every column: valid 94% vs 72%, correct 93% vs 70%, pitfall avoidance 39% vs 32%. Gate-OFF gives up some pitfall coverage (32% vs 39%) for the highest validity of the three tooled conditions (98%) at near-identical tool turns (9.7 vs 9.8), though on this run it cost more per run than gate-ON ($0.90 vs $0.75).
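
For readers who want to check the arithmetic, the gaps quoted for the Sonnet table can be reproduced from the table values with two small helpers. The function names are illustrative and not part of the reporting script.

```python
def pp_delta(ours: float, baseline: float) -> float:
    """Percentage-point gap between two rates that are already in percent."""
    return ours - baseline


def pct_reduction(ours: float, baseline: float) -> float:
    """Relative saving versus the baseline, in percent."""
    return 100 * (baseline - ours) / baseline


print(pp_delta(80, 59))                      # 21
print(round(pct_reduction(0.75, 1.26), 1))   # 40.5
print(round(pct_reduction(9.8, 19.4), 1))    # 49.5
```

Keeping percentage points (a difference of rates) separate from percent reductions (a ratio against the baseline) avoids mixing the two kinds of "percent" in one comparison.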

The Raw model row is the status-quo control: same model, same prompts, no plugin and no n8n-mcp, the place most people actually start. On Sonnet it produces a valid, working workflow only 26% of the time, versus 80% works with the plugin; on DeepSeek Flash only 9% of raw outputs even validate. The tooling, not the model, is doing the heavy lifting. The Sonnet raw-model row is measured over 78/128 prompts with all three difficulty groups sampled; weighting each group's rate up to the full corpus lands in the same place, so it's representative, not cherry-picked.
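
The reweighting check for a partial sample can be sketched as a standard stratified estimate: weight each difficulty group's pass rate by that group's share of the full corpus. The group names and counts below are placeholders for illustration only; the benchmark's real per-group split is not given in this write-up.

```python
def reweighted_rate(
    sample: dict[str, tuple[int, int]],
    corpus_sizes: dict[str, int],
) -> float:
    """Stratified estimate of the full-corpus pass rate, in percent.

    sample maps group -> (passes, prompts sampled in that group);
    corpus_sizes maps group -> prompts in that group in the full corpus.
    """
    total = sum(corpus_sizes.values())
    weighted = 0.0
    for group, (passes, sampled) in sample.items():
        if sampled == 0:
            raise ValueError(f"group {group!r} has no sampled prompts")
        weighted += corpus_sizes[group] * passes / sampled
    return 100 * weighted / total


# Placeholder counts for illustration only, not the benchmark's real split.
toy_sample = {"simple": (6, 10), "medium": (3, 10), "complex": (1, 10)}
toy_corpus = {"simple": 20, "medium": 20, "complex": 60}
print(round(reweighted_rate(toy_sample, toy_corpus), 1))  # 24.0
```

In this toy case the unweighted sample rate would be 33.3%, so the two figures disagree when the sample over-represents easy prompts. When the reweighted and raw rates agree, as reported for the Sonnet raw-model row, the partial sample is not skewing the result.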

The edge is not Claude-specific. The same comparison on the DeepSeek v4 Flash backend (latest available per prompt):

DeepSeek v4 Flash · 128-prompt battery · newest run / prompt · Jun 24 2026
Condition	valid%	correct%	works% (headline)	Pitfall avoidance% (of 28 bug prompts)	$ / run	tool turns	time (mean / median)
Plugin — gate ON (ship default)	92%	75%	67%	46%	$0.024	27.5	430s / 321s
Plugin — gate OFF (cheaper pitfall trade)	88%	76%	64%	34%	$0.023	23.5	351s / 263s
n8n-mcp (baseline)	79%	70%	62%	36%	$0.033	38.1	347s / 265s
Raw model (no tools · status quo · 128/128)	9%	—	—	36%	$0.013	10.2	228s / 177s

Same shape, thinner margin: works% 67% vs 62% (+5pp), valid 92% vs 79% (+13pp), pitfall avoidance 46% vs 36% (+10pp), at ~27% lower cost and ~28% fewer tool turns. The plugin's lead narrows on the weaker model — its validation/repair loop has to do more work — but it still wins every stage of the funnel on both backends. That consistency, not any single number, is the point: a knob whose cost/coverage curve you can actually see is worth more than a headline figure, and getting that curve required the harness.

And on DeepSeek's stronger tier, DeepSeek v4 Pro (gate-ON and baseline only):

DeepSeek v4 Pro · 128-prompt battery · newest run / prompt · Jun 24 2026
Condition	valid%	correct%	works% (headline)	Pitfall avoidance% (of 28 bug prompts)	$ / run	tool turns	time (mean / median)
Plugin — gate ON (ship default)	98%	79%	68%	36%	$0.044	16.9	357s / 251s
n8n-mcp (baseline)	80%	72%	60%	32%	$0.059	30.6	283s / 218s

Tech Stack

Harness: bash orchestrator (run-eval-v2.sh) running both tools across 128 prompts (in three difficulty groups) × models, with per-run isolated config dirs so no condition leaks state into another
Deterministic judge: the n8n-mcp validation engine (an independent open-source project), run as a hosted validator microservice with a public health contract (engine version + content hashes) so plugin-time and scoring-time validation can never silently diverge
Semantic judge: a blinded Claude Opus LLM-judge (judge_results.py) scoring intent-fit only, fail-closed, verdicts cached as .judge.json
Deterministic pitfall scorer: 28 rules (one per known bug) that read the workflow JSON directly — known-bug avoidance is scored here, never by the LLM judge, so anyone can audit it
Models under test: Claude Sonnet 4.6, DeepSeek v4 Flash, and DeepSeek v4 Pro; tools under test: this plugin vs n8n-mcp
Reporting: a funnel of per-prompt validity → correctness → works-in-production, plus pitfall coverage, cost, tool-turns, and input tokens, taken from the newest run per prompt
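
The cached, fail-closed behaviour of the semantic judge could look something like the sketch below. It is not the contents of judge_results.py: the function name, the `pass` key, and the `<name>.judge.json` sidecar naming are assumptions, and `judge` stands in for whatever call reaches the model.

```python
import json
from pathlib import Path
from typing import Callable


def judged_pass(workflow_path: Path, judge: Callable[[str], dict]) -> bool:
    """Return True only for an explicit, well-formed pass verdict."""
    cache_path = workflow_path.with_name(workflow_path.stem + ".judge.json")
    verdict = None
    if cache_path.exists():
        try:
            verdict = json.loads(cache_path.read_text())
        except json.JSONDecodeError:
            verdict = None  # unreadable cache: fall through and judge again
    if verdict is None:
        try:
            verdict = judge(workflow_path.read_text())
        except Exception:
            return False  # fail closed: a judge error never counts as a pass
        # Cache only well-formed verdicts so a bad response can be retried later.
        if isinstance(verdict, dict) and isinstance(verdict.get("pass"), bool):
            cache_path.write_text(json.dumps(verdict))
    return isinstance(verdict, dict) and verdict.get("pass") is True
```

Fail-closed means every ambiguous case (an exception, a missing key, a non-boolean value) lowers the score instead of raising it, so judge flakiness cannot inflate correct%.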

My Role

Designed and built the entire harness — the prompt set and its difficulty tiers, the two-judge architecture (deterministic validator + blinded Opus judge), the cost/turn/token instrumentation, and the isolated-run plumbing that keeps the comparison fair. More importantly, I used it as the decision gate for the plugin: every significant change was benchmarked before it shipped, and several promising ideas were reverted because the evals said no. The validation engine itself is the MIT-licensed n8n-mcp by Romuald Członkowski — integrating it rather than rebuilding it is what let the quality bar move fast and kept the benchmark honest (the competitor's own engine judges both tools).