## TL;DR
- Vibehacker placed 2nd on BountyBench, a benchmark that grades AI agents on real bug-bounty scenarios.
- Score: $1,830. We ran fully blackbox. No hints, no narrowed scope, no guided walkthroughs.
- Most entries on the leaderboard use the benchmark's hint tiers. Second place without them is a different result than second place with them.
Most security benchmarks for AI agents let you peek at some version of the answer key. Generic bug category. Specific technique. Sometimes a full walkthrough. The scores people post are usually scores with training wheels.
I wanted to know what Vibehacker could do without any. So we ran BountyBench on hard mode and hit 2nd place with $1,830.
One thing before anyone gets the wrong idea: $1,830 is a benchmark score, not money we collected. Those bounties were paid to the original researchers who disclosed each bug. The benchmark mirrors the same dollar values so an agent's performance can be compared against what a human got paid for equivalent work.
## How BountyBench grades you
BountyBench puts AI agents into real bug-bounty scenarios: real vulnerable applications, graded on whether the agent finds the planted bug and writes a working exploit. Each target has a bounty value attached. Your total is the sum of the bounty values for the targets you actually pull off.
It also offers optional hint tiers per target. Generic category of bug. Specific technique needed. Step-by-step walkthrough. Hints are how most entries on the leaderboard get the scores they post.
We did not use any of them.
Vibehacker ran the same way it runs against a real target. No hints, no narrowed scope, no handholding. Just the endpoint and a test account.
That's the mode that matches what a real attacker has. It's also the mode that tells you whether an agent can find bugs on its own, or whether it's pattern-matching on hint text.
| Run mode | What the agent gets | Where most entries sit |
|---|---|---|
| Maximum hints | Category + technique + walkthrough | Most of the leaderboard |
| Partial hints | Category only | Some |
| Blackbox | Endpoint + test account, nothing else | Vibehacker, $1,830, 2nd place |
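The scoring model behind that table is simple enough to sketch. Here is a minimal, hypothetical illustration (the class, target names, and dollar values are mine, not the benchmark's actual API): the run mode changes only what information the agent sees, while the score is always the sum of bounty values for targets the agent actually exploited.

```python
from dataclasses import dataclass

@dataclass
class Target:
    name: str
    bounty: int    # dollar value mirrored from the original disclosure
    solved: bool   # did the agent produce a working exploit?

# Hypothetical targets; names and values are invented for illustration.
run = [
    Target("app-a", 1000, True),
    Target("app-b", 500, False),   # no working exploit -> no credit
    Target("app-c", 830, True),
]

# Score = sum of bounties for exploited targets.
# The hint tier affects the agent's inputs, never the payout per target.
score = sum(t.bounty for t in run if t.solved)
print(score)  # 1830
```

Under this model, a blackbox run and a maximum-hints run that solve the same targets post the same dollar figure, which is exactly why the leaderboard number alone doesn't tell you how much help the agent had.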
## Why the blackbox number is the one that matters
If you are evaluating an automated security tool for your production app, "scored 85% with maximum hints" is close to useless information. Your app does not ship with hints. The benchmark number only transfers to reality if the agent got its result under the same constraints a real attacker has.
So we ran blackbox. Second place with the hint system stripped feels better to me than first place with it turned on, and I would rather post the honest number.
## How we got there: coming next week
Second place on a benchmark designed to grade AI agents on real bug hunting is not something you get with a good prompt and a fast model. The swarm had to be taught. Or more precisely, it had to teach itself.
Next week I'll walk through how: the self-healing, self-improving loop behind Vibehacker. How the agents flag their own failures, rewrite their own playbooks, and quietly get better at attack classes they've never seen, all without a human writing the fix each time. I'll go under the hood, with the lab results that made me trust the thing enough to push it onto a public benchmark.
If you want to see what Vibehacker finds on your own site in the meantime, book a demo. First scan is free.