86.7% on 104 Benchmarks. The Other 13.3% Is the Interesting Part.
We ran Cipher against all 104 challenges in the XBOW Validation Benchmark — an independent suite of web security challenges used to evaluate pentesting tools and practitioners.
The results: 144 of 166 expected vulnerability tags detected (86.7%).
Beyond that, Cipher found 95 additional vulnerabilities — each independently verified with a working exploit — and 322 security hardening gaps the benchmark never asked for.
This is not a claim about single-run superiority. It’s about what cumulative coverage looks like when iteration is economically viable.
602 findings across 104 benchmarks. 214 rated Critical or High.
We published everything — reports, exploit scripts, and raw data — on our benchmark results page. Including the 22 misses.
Here’s why the misses are the interesting part.
The first clue determines everything
Every pentest starts the same way: probe the target, find a clue, follow it. The clue you find first determines the path you take, and the path determines what you discover. Two testers hitting the same application will find different things — not because one is better, but because they followed different threads.
XBEN-022 — a 5-step chain:
The agent discovered hardcoded credentials in a login page. That’s a clue — if the developer hardcoded passwords, they probably took shortcuts elsewhere too. So the agent kept pulling the thread: hardcoded creds → file upload → path traversal → server-side template injection → remote code execution.
Five vulnerabilities, chained. Each step only made sense because of what came before it. A different first clue would have led somewhere else entirely.
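The path traversal link in that chain follows a familiar shape: an upload handler joins a client-supplied filename onto its upload directory without checking where the result lands. A minimal sketch (the upload root and filenames here are illustrative, not taken from the benchmark):

```python
import posixpath

UPLOAD_DIR = "/var/www/uploads"  # hypothetical upload root

def resolve(filename):
    # vulnerable: trusts the client-supplied filename verbatim
    return posixpath.normpath(posixpath.join(UPLOAD_DIR, filename))

def escapes_root(filename):
    # a traversal payload lands outside the upload root after normalization
    return not resolve(filename).startswith(UPLOAD_DIR + "/")

print(resolve("avatar.png"))            # /var/www/uploads/avatar.png
print(resolve("../../../etc/passwd"))   # /etc/passwd -- outside the root
```

A handler that writes to `resolve(filename)` without the `escapes_root` check lets an attacker plant files anywhere the web server can write, which is exactly the pivot that set up the template injection step.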
XBEN-039 — SQLi to full admin access:
The agent found SQL injection in a login form, used it to bypass authentication entirely, then escalated to admin privileges. One vulnerability unlocked the next. The chain emerged from exploration, not from a checklist.
Every “miss” found something else
We missed 22 of 166 expected vulnerability tags. But when you look at what actually happened, the agent didn’t fail — it explored different terrain and found different things.
XBEN-060 — Expected: default credentials. Found: PHP type juggling bypass.
The benchmark planted hardcoded credentials. Cipher never discovered them because it found a more interesting path — a language-level flaw that bypassed authentication entirely. The agent didn’t need the credentials. It found a deeper way in.
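Type juggling deserves a sketch, because the flaw lives in the language, not the application. PHP's loose `==` converts two strings that both look like scientific notation ("0e" followed by digits) into the number zero before comparing, so two unrelated hash digests can compare equal. A Python model of that comparison rule (the login logic is hypothetical; the two preimages below are the widely cited "magic hash" examples whose MD5 digests have the `0e` shape):

```python
import hashlib
import re

def php_loose_eq(a, b):
    # models PHP's '==' on strings: if both look numeric, compare as numbers
    numeric = re.compile(r"^-?\d+(\.\d+)?(e[+-]?\d+)?$", re.IGNORECASE)
    if numeric.match(a) and numeric.match(b):
        return float(a) == float(b)
    return a == b

# Both digests start with "0e" followed only by digits,
# so a loose comparison parses each as the float 0.0.
stored   = hashlib.md5(b"QNKCDZO").hexdigest()
supplied = hashlib.md5(b"240610708").hexdigest()

print(php_loose_eq(stored, supplied))  # True under loose comparison
print(stored == supplied)              # False under strict comparison
```

A login that checks `md5($input) == $stored_hash` with `==` instead of `===` accepts any input whose digest happens to take the `0e` form, which is the bypass class the agent found.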
XBEN-006 — Expected: IDOR. Found: SQL injection that extracted the same private data.
The benchmark expected the agent to manipulate object references in the URL. Instead, the agent found SQL injection that extracted the private data directly from the database. Same outcome, different path. The data was compromised either way.
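Extraction through an injection point works by appending a `UNION SELECT` that pulls rows the original query was never meant to return. A sketch against in-memory SQLite (schema and data invented for illustration):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE public_posts (id INTEGER, title TEXT)")
db.execute("CREATE TABLE private_notes (owner TEXT, body TEXT)")
db.execute("INSERT INTO public_posts VALUES (1, 'hello world')")
db.execute("INSERT INTO private_notes VALUES ('alice', 'alice''s secret')")

def get_post(post_id):
    # vulnerable: the id parameter is interpolated, not bound
    query = f"SELECT id, title FROM public_posts WHERE id = {post_id}"
    return db.execute(query).fetchall()

print(get_post("1"))  # only the public row
# UNION splices private rows into the public result set
print(get_post("1 UNION SELECT 0, body FROM private_notes"))
```

The outcome is the same as the expected IDOR: private data reaches an unauthorized caller. The path differs, the compromise doesn't.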
XBEN-104 — Expected: SSTI. Found: credentials, XSS, session forgery, business logic.
The benchmark had one planted vulnerability. Cipher found four others and never tested for SSTI because it was busy exploiting everything else.
XBEN-025 — Expected: default credentials. Found: SSTI and XSS.
The agent attempted brute force but cracked no credentials, so it pivoted and found Server-Side Template Injection and Cross-Site Scripting instead.
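Server-Side Template Injection, which shows up twice in these misses, comes down to one condition: the template itself, not just the values filled into it, is attacker-controlled. A Python-flavored analogue using `str.format` shows the shape without any web framework (the `Config` class and its secret are hypothetical):

```python
class Config:
    def __init__(self):
        self.secret_key = "hypothetical-secret"

def render(template, config):
    # vulnerable: the template string comes from the user
    return template.format(config=config)

print(render("Hello!", Config()))               # benign input renders normally
# an attacker-supplied template walks object attributes to leak the secret
print(render("{config.secret_key}", Config()))  # hypothetical-secret
```

In real engines the probe is usually an expression like `{{7*7}}`: if the response contains `49`, the template syntax is being evaluated and the injection is confirmed.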
Most vendors hide their misses. We’re publishing all 22 with explanations, because the misses tell you as much as the hits. They fall into four categories:
- Agent found something else (8 of 22) — Explored a different thread and found real vulnerabilities through a different path.
- Agent tested but couldn’t bypass the defense (5 of 22) — Attempted the expected vulnerability class with dozens of payloads, but couldn’t generate the specific evasion required to bypass a proxy or trigger the flaw.
- Agent didn’t attempt this class (7 of 22) — Never tested this specific vulnerability type. These are real coverage gaps — the paths not taken.
- Infrastructure blocked (2 of 22) — Port firewalled or reverse proxy stripped auth headers. The agent couldn’t reach the vulnerability regardless of capability.
The fair question for categories 1 and 2: why not test both paths? Because even with multiple agents working together, the assessment follows productive threads — when one path yields results, the team doubles down. The same thing happens with a human pentest team. Adding agents doesn't eliminate path dependency. Adding assessments does.
Category 3 is the one that matters most. Those are the paths a second assessment would explore.
A 20-year pentester hits the same ceiling
The XBOW benchmark has been attempted by principal pentesters with 20+ years of experience. They score 85% on the same benchmark. Staff-level pentesters score 59%. Cipher edged past the 20-year veteran by 1.7 percentage points.
Nobody expects a human pentester to find 100% of vulnerabilities in a single engagement. A single pentest — whether human or AI — is a sample, not a census.
The answer is iteration — and the economics finally work
The reason most companies pentest once a year isn’t that annual testing is sufficient. It’s that manual pentests cost $30,000+ and take weeks to deliver. Testing every sprint was never economically viable.
At $999 per assessment with results in ~2 hours, it is now.
Run an assessment → fix what was found → run again → the agent explores new paths → find new things → fix → repeat. Each cycle covers more ground. Coverage compounds over time.
“We did a pentest last quarter, we’re secure” is like saying “we hiked one trail in Yellowstone, we’ve seen the whole park.” A second pass, starting from different clues, would find some of the 22 missed tags and miss some of the 144 expected tags it found the first time — but the union of both passes covers more than either alone.
Cipher’s reports also include Passed Tests — what was tested and found secure. This gives you a map of trails already explored, so the next assessment can focus on uncharted ground.
This is what we learned from running 104 benchmarks. Coverage is not something you achieve once. It’s something you accumulate.
A note on data integrity: We verified that Cipher’s underlying model has no prior knowledge of the XBOW benchmark data. Canary string tests and direct prompts about benchmark IDs both came back empty. These results were not memorized.
Have thoughts on this? Join the conversation on Twitter or reach out via our Contact form.