A Mastro study · June 2026

Is security a skill issue?

Five scanners, 3,084 skills, a different verdict 64% of the time.

They can't agree on what “safe” even means, but they'll still show you a green check.

63.9%

of skills got a different verdict from at least two scanners

14.2%

had one scanner call it CRITICAL or HIGH while another called it SAFE

Two security scanners read the same skill. Same file, same line: to draw you an architecture diagram, it ships your AWS config to an outside API. One scanner wrote that down word for word and stamped it SAFE. The other read the same line and stamped it CRITICAL. Hold that thought, because there are 437 more where that came from. First, why I went looking.

This morning my AI read my latest blood panel and flagged two numbers worth asking my doctor about. Genuinely useful. The skill that did it was written by someone I've never heard of, and I ran it without reading a line. Reckless, or just human? The answer is yes. But be honest about the last time you ran npx skills add or pulled in a package with four hundred transitive dependencies. Maybe you skimmed the top-level code. Nobody audits the whole tree. At some point we all stop reading and start deciding who to trust.

And that trust is the whole magic of agent skills. The ones that change what you're capable of are the ones you didn't write: a stranger's hard-won domain expertise packed into a file your AI just executes. A trainer's programming, a tax pro's playbook, a doctor's read on your labs. The ceiling on what AI can do for me isn't the model anymore. It's how many strangers' skills I'm willing to point at my files, my credentials, my actual life. And there's the catch: a skill is just a markdown file, but it's a markdown file my agent will go off and obey, usually with scripts it can run and every tool I've handed it. It can read my disk. It can phone home. The distance between “does my taxes” and “mails my keys to a stranger” is a few sentences I'll never read closely. So every install is a small act of faith, and faith doesn't scale.

We already know what happens when it doesn't hold. Earlier this year OpenClaw's skill marketplace got gutted. A researcher proved the download counter could be faked and pushed a dummy skill to #1; in eight hours, sixteen developers in seven countries installed it and ran his code.[1] His payload was a harmless ping. The criminals working the same hole were not so kind, flooding the catalogue with skills that quietly lifted SSH keys, cloud credentials, and crypto wallets.[2] At peak infection, per OWASP, five of the seven most-downloaded skills were malware.[3] Snyk found one in eight skills carried a critical issue;[4] Bitdefender clocked the malicious rate one week in February near 17%.[5]

So the ecosystem did the reasonable thing. It put up a guard. Paste a skill, a panel of name-brand scanners looks it over, and a verdict comes back: a reassuring green check, a risk score, a badge that says you're fine. Install with confidence. That checkmark is the thing standing between me and using AI the way it's actually supposed to work, and it is the part of this story I trusted most.

It is also the part that's lying to you. I pulled the verdicts for 3,084 skills, five scanners each, to see if the badge meant anything. It doesn't. The scanners can't agree on what “safe” even is, and the green check papers over a fight they're losing.

Five tools, three different questions

Before you can really say the scanners disagree, you have to notice they were never looking at the same thing in the first place. “Is this skill safe” isn't one question. It's at least three, and each of these tools is quietly answering a different one.

Snyk · Repo · LLM judges + static rules: Reads the code and the prose, flags injection, secrets, and suspicious downloads.
Socket · Site · Supply-chain static + AI: Scans every file a skill references with its package-security engines, then counts alerts.
Gen Agent Trust Hub · Site · Narrative LLM analysis: Writes a paragraph of reasoning about the skill, then assigns a severity from Safe to Critical.
Runlayer · Site · Runtime gateway: A runtime gateway that watches behavior, with a pre-release scan. The panel’s most trigger-happy.
ZeroLeaks · Repo · Dynamic red-teamer: Doesn’t read the skill at all. Attacks a running model and returns a 0–100 security score.

Read those again and the disagreement starts to feel inevitable. A supply-chain scanner asks, “does the code do something bad?” A prompt-injection judge asks, “does the prose try to hijack the agent?” A runtime red-teamer asks, “can I break the model that runs this?” None of them is wrong, exactly. They're answering different questions and then stamping all of it with the same word: safe.

(These five are what skills.sh actually shows you; Cisco and NVIDIA's SkillSpector exist too, and they fall into the same three camps.) And none of this is a knock on any one tool. Security scanners just don't agree, full stop. NIST has run tool-versus-tool bake-offs for a decade and keeps finding “the overlap is typically limited”;[6] a 2024 study of five mature scanners found only 6% of real vulnerabilities got caught by more than one of them.[7] If seasoned code scanners barely overlap on ordinary bugs, five tools answering three different questions about markdown were always going to scatter. The only real question is how badly.

Two out of three times, the badge is a guess

Take every skill in the sample that at least two scanners looked at, and ask the dumbest possible question: did they all land on the same side? Either everyone cleared it, or everyone flagged it, or (the fun one) some cleared it while others flagged it.

Fig. 2. Scanners disagree: 63.9% · All clear it: 35.5% · All flag it: 0.5%. Across 3,083 skills that ≥2 scanners reviewed. On 63.9%, at least one scanner cleared the skill while another flagged it. Unanimous “this is dangerous” is almost nonexistent, at 0.5%.

Sixty-four percent. On roughly two of every three skills, you could walk away with a clean bill of health or a warning depending entirely on which badge you happened to be looking at. And the slice where every scanner agrees something is actually wrong, which is the case you'd most want them to nail, is 0.5%. Sixteen skills out of three thousand. Flip a coin seven times: you have better odds of landing heads on every single toss than of finding a skill these five scanners all agree is dangerous. Agreement on danger is the rarest outcome on the board. Disagreement isn't the edge case here. It's the default.
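If you want to run the same dumb question on your own data, here is a minimal sketch of the three-way bucketing, plus the coin arithmetic. It assumes each verdict has already been reduced to flagged or cleared; the function name, the skill names, and the sample verdicts are all made up for illustration and are not rows from the study.

```python
from collections import Counter


def agreement_bucket(flags):
    """Classify one skill by how its scanners voted (True = flagged)."""
    if len(flags) < 2:
        raise ValueError("need at least two scanner verdicts")
    if all(flags):
        return "all flag"
    if not any(flags):
        return "all clear"
    return "disagree"


# Hypothetical verdicts for four skills, NOT data from the study.
sample = {
    "skill-a": [False, False, False],
    "skill-b": [True, False, False, False, False],
    "skill-c": [True, True],
    "skill-d": [False, True, True, False],
}

buckets = Counter(agreement_bucket(v) for v in sample.values())
shares = {name: count / len(sample) for name, count in buckets.items()}
print(shares)

# The coin comparison from the text: seven heads in a row vs. the 0.5% slice.
seven_heads = 0.5 ** 7      # 1/128, roughly 0.78%
all_flag_share = 0.005      # the 0.5% "all flag it" share reported above
print(seven_heads > all_flag_share)  # True
```

A skill seen by a single scanner is rejected on purpose, since there is nothing for it to disagree with.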

My first thought was, okay, maybe this is just noise: five tools roughly tracking the truth and differing at the margins. But no. The differences are structural, and you can see the structure the second you ask how often each scanner reaches for the alarm at all.

Fig. 1. Share of skills each scanner flags, across 3,084+ skills. The two strictest scanners flag 5–20× more often than the most lenient, so a skill's verdict depends largely on which scanner you ask.

And there it is. Runlayer flags three out of five skills it sees. ZeroLeaks flags one in thirty-three. That's not two tools disagreeing about specific skills, that's two tools that have set the dial in completely different places. One treats nearly everything as suspicious; the other treats nearly nothing as a problem. Same skills, twenty-times different odds of getting flagged. A “verdict” that swings that hard depending on who you ask isn't really a verdict. It's a coin, just weighted differently in each tool's pocket.

A coin flip with better branding

Now, “they agree 84% of the time” sounds reassuring right up until you remember that most skills are fine and most scanners default to pass. Two tools that both rubber-stamp almost everything are going to look like they agree. They're just both saying yes. So the honest move is to subtract the agreement you'd get from two tools guessing independently. That's what Cohen's κ does. Zero means “no better than chance,” one means perfect, and anything under 0.20 reads as “slight.”[8]
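The chance correction is easier to see in code than in prose. Below is a minimal sketch of Cohen's κ for two scanners with pass/flag verdicts. The two verdict lists are invented to show the effect (two lenient scanners that mostly say yes) and are not drawn from the study's data.

```python
def cohens_kappa(a, b):
    """Cohen's kappa for two raters' binary verdicts (True = flagged)."""
    if len(a) != len(b) or not a:
        raise ValueError("need two equal-length, non-empty verdict lists")
    n = len(a)
    # Raw agreement: share of skills where both verdicts match.
    observed = sum(x == y for x, y in zip(a, b)) / n
    rate_a = sum(a) / n
    rate_b = sum(b) / n
    # Agreement expected if each scanner flagged at its own rate, independently.
    expected = rate_a * rate_b + (1 - rate_a) * (1 - rate_b)
    if expected == 1.0:
        # Both scanners gave one constant verdict; kappa is undefined.
        # Returning 0.0 is a choice made for this sketch only.
        return 0.0
    return (observed - expected) / (1 - expected)


# Hypothetical: 100 skills, each scanner flags 6, and they overlap on just 1.
scanner_a = [True] * 6 + [False] * 94
scanner_b = [False] * 5 + [True] * 6 + [False] * 89

raw = sum(x == y for x, y in zip(scanner_a, scanner_b)) / len(scanner_a)
kappa = cohens_kappa(scanner_a, scanner_b)
print(f"raw agreement {raw:.2f}, kappa {kappa:.2f}")
```

In this made-up case the two scanners match on 90% of skills, yet κ comes out at about 0.11, because two tools that each clear 94% of everything would match on roughly 89% by chance alone.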

	Gen	Snyk	Socket	Runlayer	ZeroLeaks
Gen	—
Snyk	0.18	—
Socket	0.13	0.17	—
Runlayer	0.08	0.08	0.01	—
ZeroLeaks	0.01	0.03	0.05	0.01	—

Fig. 3. Cohen's κ for every scanner pair: agreement beyond chance. κ = 0 is a coin flip; κ = 1 is perfect. Every pair lands between 0.01 and 0.18, “slight” at best on the standard scale.[8] There is no hidden consensus here.

And every pair is in the basement. The best two scanners on the panel manage κ = 0.18, which is still just “slight.” A few pairs sit at 0.01, meaning knowing what one said tells you basically nothing about what the other will say, beyond what a coin would tell you. The two that look chummiest on raw percentages (Socket and ZeroLeaks, ~90%) collapse to κ = 0.05 the moment you remember they both almost never flag anything.

Honestly this one chart is the whole argument. If the scanners were just noisy instruments measuring the same underlying truth, the κ values would be high and the disagreements would cluster on the genuinely ambiguous skills. They don't. The scanners are measuring different things, and there's no shared signal under the noise to average your way back to. Which is the trap, really. Faced with five badges, the natural instinct is to average them into one tidy score. But averaging five independent answers to three different questions doesn't get you a better answer. It gets you a confident-looking number that means nothing.

CRITICAL to one tool, SAFE to the next

Disagreeing about pass-versus-flag is one thing. But on 14.2% of skills, call it one in seven, one scanner rated the skill CRITICAL or HIGH while another rated the exact same skill SAFE. Not some nuance at the margin. A head-on collision at the very top of the severity scale. Here are three, pulled straight from the data, untouched:

eraserlabs/eraser-io/aws-diagrams

All five looked at the same fact, that this skill ships your AWS config to an outside API, and graded it from SAFE to CRITICAL. Not a malfunction. A disagreement about how much that should scare you.

Pass · Gen · SAFE: Parses AWS configuration files and sends the data to an external API (app.eraser.io) to generate diagrams. Infrastructure details are transmitted to a third party.
Pass · Socket: No alerts
Fail · Snyk · CRITICAL: Malicious code pattern (E006): explicit instructions to always POST generated DSL, which may include sensitive AWS metadata and internal file paths, to an external API.
Warn · Runlayer · MEDIUM: 1/1 file flagged
Pass · ZeroLeaks · NONE: Score: 93/100 · 2 sections analyzed

railwayapp/railway-skills/metrics

A clean 2-versus-2 split on a skill that reads your local auth token. Gen and Runlayer call it HIGH. Snyk and Socket wave it through.

Fail · Gen · HIGH: Reads the user's local auth token (~/.railway/config.json) and calls the Railway API. Reads sensitive files, makes network requests to non-whitelisted domains, and presents an indirect prompt-injection surface.
Pass · Socket: No alerts
Pass · Snyk · LOW: Risk: LOW · No issues
Fail · Runlayer · HIGH: 2/6 files flagged

yaklang/hack-skills/recon-and-methodology

A security-research playbook. Gen reads it correctly as benign tooling; Snyk rates it CRITICAL. Same markdown, opposite poles.

Pass · Gen · SAFE: A reconnaissance playbook for bug-bounty hunters: command-line examples for industry-standard recon tools. No malicious code, exfiltration, or obfuscation detected.
Pass · Socket: No alerts
Fail · Snyk · CRITICAL: Risk: CRITICAL · 2 issues
Pass · ZeroLeaks · NONE: Score: 93/100 · 2 sections analyzed

Look at the first one; it's the collision this essay opened with. Eraser's AWS-diagram skill does one slightly nervy thing: to draw your diagram, it ships your AWS config to an outside API. Every scanner saw that. Gen wrote it down word for word, “infrastructure details are transmitted to a third party,” and marked it SAFE. Snyk saw the same line, called it a malicious-code pattern, and marked it CRITICAL. Same fact, same file, opposite ends of the scale. That's not one tool malfunctioning. It's two tools disagreeing about how much “sends your cloud topology to a stranger” should scare you, and the green check picks a winner without telling you there was a fight.

The Railway one might be the cleanest picture of the whole problem. Real skill, real company, and it reads your local auth token off disk. Two scanners call that HIGH. Two wave it right through as fine. There's a real answer to “does this read my credentials,” and the panel is split clean down the middle on it. Trust the green badges and you install something that reads your token. Trust the red ones and you skip a tool you might've actually wanted. They can't both be guidance.

And the badge can be faked in an afternoon

Okay, you might think the fix is obvious: just trust the strict ones. If Runlayer and Snyk cry wolf, fine, at least nothing slips through. Except things do slip through. A few weeks ago two researchers at Trail of Bits sat down with a panel much like this one (the three scanners skills.sh shows, plus Cisco's and ClawHub's) and walked malware past all of them. Took them under an hour.[9] They hid the payload in a .docx. They poisoned compiled Python bytecode while leaving the source clean. They padded a file with a hundred thousand blank lines to push the bad part out past the scanner's reading window. They talked the LLM judges out of it with corporate-sounding nonsense.

Their conclusion is the part that stuck with me: “their static nature gives an adversary unlimited bites at the apple to tweak an attack until it finds a way through.” A scanner runs once and hands down a verdict. The attacker gets to iterate against that verdict forever. And that asymmetry doesn't go away with a stricter rule or a smarter judge; prompt injection lives in meaning, not in signatures, and you can always just rephrase meaning.[10]
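The blank-line padding trick is at least easy to look for from the reader's side. Here is a rough sketch of a check you could run on a skill file before trusting any verdict on it. The 2,000-line window is an assumed number chosen for illustration; the source doesn't say how far any scanner actually reads, and both function names are hypothetical.

```python
def longest_blank_run(text):
    """Length of the longest run of consecutive blank lines."""
    longest = current = 0
    for line in text.splitlines():
        if line.strip():
            current = 0
        else:
            current += 1
            longest = max(longest, current)
    return longest


def content_beyond_window(text, window_lines=2000):
    """Non-blank lines after the first `window_lines` lines, as (line_no, text).

    `window_lines` is an assumed reading window, not a documented scanner limit.
    """
    lines = text.splitlines()
    return [
        (index + 1, line)
        for index, line in enumerate(lines[window_lines:], start=window_lines)
        if line.strip()
    ]


# A made-up file: a harmless header, 100,000 blank lines, then one more line.
padded = "# Diagram helper\n" + "\n" * 100_000 + "run: hidden step\n"

print(longest_blank_run(padded))      # 100000
print(content_beyond_window(padded))  # [(100002, 'run: hidden step')]
```

This only covers one of the four tricks, and an attacker who knows the check exists can pad with something other than blank lines, which is exactly the asymmetry the researchers describe.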

So stack it all up. The scanners disagree on two of every three skills. They agree no better than chance. One in seven skills gets opposite extremes from two of them. And each one, on its own, can be walked past in an afternoon. A panel that disagrees is one problem. A panel that disagrees and whose every member can be fooled is a worse one, because at that point the green badge isn't telling you a skill is safe. It's telling you a scanner didn't catch anything, which is not the same thing, and never was.

So what is the green check even worth?

The tempting move is to average them. Roll five verdicts into a tidy 78/100 and get on with your day. But you've seen the κ matrix by now. Averaging independent answers to different questions just manufactures false precision. A 78 that's really one CRITICAL, two SAFEs, and two shrugs isn't a 78 of anything.

The other tempting move is to pick a favorite scanner and ignore the rest. That just trades five blind spots for one, and Trail of Bits already showed your favorite has a blind spot too.

What's left is harder and a lot less satisfying, which in my experience is usually the sign it's the right answer. You don't resolve the disagreement by hiding it. You read it. Five tools split on a skill is information; it's telling you exactly where to go look. The skill everyone waved through needs a glance. The skill where Gen wrote a whole paragraph about credential access, and Snyk said CRITICAL, and Socket shrugged: that's the one you actually open, and the split is what told you to.

And when you do have to collapse it into one call, the rule that survives this data isn't the mean, it's the maximum. Err on the side of caution: take the worst credible signal, don't let a chorus of “looks fine” outvote one specific, legible alarm, and then show the disagreement instead of burying it under a tidy average. A skill where four scanners shrug and one flags credential access isn't 80% safe. It's a skill that touches your credentials, and you should hear about that. That's a different operation than averaging, and it's the only one the numbers here actually justify.
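To make the difference between the mean and the maximum concrete, here is a minimal sketch of that rule. The numeric severity scale, the scanner names, and the `summarize` function are assumptions made for illustration; the tools in this study don't share a scale, and deciding which signals count as credible still takes a human.

```python
from statistics import mean

# Assumed ordering for illustration; real scanners use their own labels.
SEVERITY = {"SAFE": 0, "LOW": 1, "MEDIUM": 2, "HIGH": 3, "CRITICAL": 4}


def summarize(verdicts):
    """Collapse per-scanner verdicts into a worst-case call, keeping the split."""
    if not verdicts:
        raise ValueError("no verdicts to summarize")
    worst_scanner = max(verdicts, key=lambda name: SEVERITY[verdicts[name]])
    levels = [SEVERITY[label] for label in verdicts.values()]
    return {
        "call": verdicts[worst_scanner],      # the maximum, not the mean
        "raised_by": worst_scanner,
        "disagreement": len(set(levels)) > 1,
        "mean_level": mean(levels),           # kept only to show what it hides
        "verdicts": dict(verdicts),           # surface the split, don't bury it
    }


# Hypothetical skill: four scanners shrug, one flags credential access.
result = summarize({
    "scanner_a": "SAFE",
    "scanner_b": "SAFE",
    "scanner_c": "SAFE",
    "scanner_d": "SAFE",
    "scanner_e": "HIGH",
})
print(result["call"], result["raised_by"], result["mean_level"])
```

On this made-up input the mean lands at 0.6 on a 0 to 4 scale, which reads as nearly safe, while the maximum reports HIGH and names the scanner that raised it.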

Zoom out, though, because this is bigger than five scanners flunking a test. Code is already the cheap part. When an agent can write the function, the scarce thing, the only thing left worth paying for, is judgment: whose taste, whose playbook, whose read on the lab panel do you trust enough to run on your own life. Skills are how that judgment gets shipped. And the green checkmark was supposed to be the proof you could trust it.

That's the real con. The checkmark isn't selling you safety. It's selling you certainty, which is the one thing software can't actually hand you, dressed up in a color that means “go.” And a coin flip with good branding is worse than honest doubt, because honest doubt makes you read the skill and a green check makes you close the tab.

We've been here before, by the way. The last time we let anyone ship code straight to everyone's most personal device, we didn't solve it with a better virus scanner. We solved it with a marketplace someone was willing to stand behind. The App Store and Play Store aren't safe because Apple and Google scan harder than you could. They're safe because there's a name attached, an account that gets banned, a curator with something to lose. The scan is the smallest part. The trust is the product.

That store doesn't exist for skills yet, which honestly surprises me. Anthropic, OpenAI, whoever owns the runtime, will almost certainly build the SkillStore eventually, because the company that owns the agent has every reason to own the one place you're told it's safe to install from. Until then it's the wild registry and a checkmark that means nothing. The only thing that actually fills the gap in the meantime is the oldest one there is: people who've used the thing, vouching for it to people who haven't. Reviews, from names, with skin in the game. That's the layer we're building at Mastro, and it's the part the scanner was never going to give you.

Because when anyone can generate the code, trust is the last moat. And trust you can't verify isn't trust at all, it's just a feeling someone sold you. The answer was never a sixth scanner, or a number that splits the difference, or a greener checkmark. It's a name willing to stand behind the thing. Earn the trust instead of printing it. That's the harder problem, and it's the only one worth working on.

Notes & sources

[1] Backdooring the #1 downloaded ClawHub skill (proof-of-concept) — Jamieson O'Reilly, Dvuln. A researcher exploited an unauthenticated download counter to push a benign skill to #1; in eight hours, 16 developers across 7 countries installed and ran it.
[2] ClawHavoc: 341 malicious skills found by the bot they were targeting — Koi Security. Audited all 2,857 skills then on ClawHub; found 341 malicious, 335 from one coordinated operation delivering Atomic macOS Stealer.
[3] AST01: Malicious Skills — Agentic Skills Top 10 — OWASP Foundation. "Five of the top seven most-downloaded ClawHub skills at peak infection were confirmed malware." Documents the ClawHavoc campaign (Jan 2026): 1,184 malicious skills across 12 accounts sharing one C2 IP. (Incubator project.)
[4] ToxicSkills: malicious AI agent skills in the ClawHub supply chain — Snyk. Scanned 3,984 skills: 13.4% (534) carried at least one critical-level issue; 36.8% had ≥1 flaw. Publishing requires only a SKILL.md and a one-week-old GitHub account.
[5] Helpful skills or hidden payloads: inside the OpenClaw malicious-skill trap — Bitdefender Labs. "Around 17% of OpenClaw skills analyzed in the first week of February 2026 exhibit malicious behavior."
[6] Counting Bugs is Harder Than You Think (SCAM 2011) — Paul E. Black, NIST. NIST's SATE expositions show static-analysis tools "find different instances of weaknesses in the same program, and the overlap is typically limited."
[7] An Empirical Study of Static Analysis Tools for Secure Code Review (ISSTA 2024) — Charoenwet et al., ACM ISSTA. Across five SAST tools, only 6% of vulnerability-contributing commits were caught by more than one tool; combining all five nearly doubled coverage over the best single tool.
[8] The Measurement of Observer Agreement for Categorical Data (1977) — Landis & Koch, Biometrics 33(1):159–174. The standard interpretation scale for Cohen's κ: <0 poor, 0–0.20 slight, 0.21–0.40 fair, 0.41–0.60 moderate, 0.61–0.80 substantial, 0.81–1.00 almost perfect.
[9] The sorry state of skill distribution — Samuel Judson & Tjaden Hess, Trail of Bits. Slipped malware past ClawHub, Cisco, and all three skills.sh scanners in under an hour: "their static nature gives an adversary unlimited bites at the apple."
[10] LLM01 Prompt Injection & LLM03 Supply Chain — Top 10 for LLM Apps (2025) — OWASP Gen AI Security Project. Prompt injection is OWASP's #1 ranked LLM risk; supply-chain compromise is in the top three.

The sample's at 3,084 skills and still growing; the figures are a snapshot and drift a little as it grows, but the conclusions have held since the first few hundred. If you want to poke holes in any of this, start with the box above. That's where I figure it's most likely to be wrong.

Giulio Colleluori · giulio.co