Why Claude Mythos Changes AppSec Research, Not Your Scanning Stack

If you’re like our team, the morning after the Claude Mythos announcement brought more questions than answers. Among them: “Serious question. Do customers still need SAST?”

It’s a fair question if you stop at the headline. Claude Mythos, Anthropic’s frontier AI model currently gated to vetted partners through Project Glasswing, has autonomously identified thousands of zero-day vulnerabilities across major operating systems and browsers [1][2]. No rule books, no checklists.

Depending on the article you read, the security community was split. Some called it a revolution; others called it a research demo. Both missed the only question that matters for your build pipeline: can it run on every pull request?

That question, and three more like it, is what separates a compelling AI research capability from a tool that belongs in your development process. Mythos fails all four.

What Claude Mythos Is Actually Doing

Claude Mythos is doing something genuinely different from traditional security scanning. Confusing the two leads to bad planning decisions.

Traditional static analysis tools like Veracode Static Analysis work by reading through your source code and checking it against a library of known vulnerability patterns. They look for things like SQL Injection, Cross-Site Scripting, exposed passwords in code, and unsafe file path handling. Consistent. Fast. Built to run automatically on every code change.

What these tools can’t do is catch vulnerabilities that don’t match any known pattern: bugs buried in complex business logic, authentication flaws that only appear when two parts of the system interact in a specific way, or security mistakes that are technically “not wrong” by any single rule but dangerous in the full context of the app. Those have always needed a human security engineer to spot.
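To make the distinction concrete, here is a deliberately tiny sketch of pattern-based checking. It is a toy, not how Veracode or any commercial engine works (real tools use data-flow analysis, not two regexes), and the rule names and sample code are invented for illustration.

```python
import re

# Hypothetical toy rules. Real SAST engines track data flow; this only
# shows the "match code against known patterns" idea in miniature.
RULES = {
    "hardcoded-password": re.compile(r"password\s*=\s*['\"][^'\"]+['\"]", re.IGNORECASE),
    "sql-string-concat": re.compile(r"execute\(\s*['\"].*['\"]\s*\+"),
}


def scan(source: str) -> list[tuple[int, str]]:
    """Return (line number, rule id) for every line that matches a rule."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for rule_id, pattern in RULES.items():
            if pattern.search(line):
                findings.append((lineno, rule_id))
    return findings


# Invented sample: two pattern-shaped bugs and one logic flaw.
SAMPLE = '''\
password = "hunter2"
cursor.execute("SELECT * FROM users WHERE id = " + user_id)
if user.role == "admin" or request.args.get("debug"):
    grant_access()
'''

print(scan(SAMPLE))
# [(1, 'hardcoded-password'), (2, 'sql-string-concat')]
```

Run it twice and you get the same two findings both times. The third line of the sample, where any request carrying a `debug` parameter is granted access, matches no rule at all. That is the kind of flaw no single pattern describes.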

Mythos goes after that gap. Instead of checking code against a list of known patterns, it tries to understand what the code is supposed to do, then looks for places where the actual behavior doesn’t match the intent [3]. That’s why it reportedly found vulnerabilities 27 years old in codebases that had been continuously scanned throughout their entire lifespan [2].

What it isn’t is a replacement for standard security scanning. Anthropic isn’t positioning it that way either.

The restricted rollout through Project Glasswing, accessible only to vetted partners including AWS, Apple, Google, Microsoft, Cisco, CrowdStrike, NVIDIA, and JPMorganChase, is a deliberate safety decision [4]. The same AI capability that finds novel vulnerabilities can also be used to write working exploits for them, and Anthropic’s 244-page safety report makes that concern explicit [3].

OpenAI said it plainly in the Codex Security FAQ: “Does it replace SAST? No. Codex Security complements SAST. It adds semantic, LLM-based reasoning and automated validation, while existing SAST tools still provide broad deterministic coverage.” [5]

Semgrep, used in millions of developer workflows, confirmed: “Anthropic and OpenAI explicitly recommend running both LLMs and SAST.” [6]

When the people who built these tools say they don’t replace standard scanning, write it down.

Four Tests. Claude Mythos Fails All of Them.

Impressive demos are one thing. These are the tests that determine whether a tool actually belongs in how your team ships software.

Test 1: Can It Run Before Code Gets Merged?

The modern approach to security is catching problems during development, not after the fact. That means security checks need to complete before a developer merges their branch, while they still remember what they built and why. Most build systems are configured to time out after 10 to 15 minutes. Tools that take longer get disabled or ignored.

AI security tools doing the kind of deep analysis Mythos is designed for typically require 8 or more hours to review a single codebase [7]. That’s not a bug to be fixed. It’s the direct cost of what makes it different: reading the entire codebase to understand context, tracing how data moves through the whole application, and verifying each finding before surfacing it. Cut that process short and you lose the quality advantage.

Veracode Static Analysis finishes in minutes on large enterprise applications. For a security leader managing 200 applications across dozens of teams, a tool that takes up most of a workday per codebase can only run as an occasional audit. Occasional audits, run monthly or quarterly because nothing faster is practical, are exactly how serious vulnerabilities survive in production undetected.
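The mismatch is easy to put in numbers. The sketch below uses only the figures quoted above (a 15-minute build timeout, 8 hours per deep review, a 200-application portfolio); the function and constant names are ours, and the 8-hour figure is treated as a lower bound.

```python
CI_TIMEOUT_MINUTES = 15   # upper end of the 10-15 minute range quoted above
DEEP_REVIEW_HOURS = 8     # "8 or more hours" per codebase, so a lower bound
PORTFOLIO_SIZE = 200      # the example portfolio from the text


def fits_pr_gate(scan_minutes: float, timeout_minutes: float = CI_TIMEOUT_MINUTES) -> bool:
    """A check can gate a pull request only if it finishes before the build times out."""
    return scan_minutes <= timeout_minutes


deep_review_minutes = DEEP_REVIEW_HOURS * 60
overrun_factor = deep_review_minutes / CI_TIMEOUT_MINUTES
portfolio_hours = DEEP_REVIEW_HOURS * PORTFOLIO_SIZE

print(fits_pr_gate(deep_review_minutes))  # False
print(overrun_factor)                     # 32.0
print(portfolio_hours)                    # 1600
```

One deep review overshoots the build timeout by a factor of 32, and a single pass over the portfolio is at least 1,600 hours of analysis time before anyone triages a result.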

Test 2: What Does It Cost Per App Per Sprint?

Standard security scanning is essentially free to run repeatedly: once the tool is set up, scanning one more application one more time costs almost nothing. That’s what makes it possible to check every code change, across every project, every day.

AI scanning breaks that math. The Anthropic/Mozilla project, a two-week security review of the Firefox codebase, cost approximately $4,000 in AI compute costs alone, not including people or operational overhead [8]. Independent research on comprehensive AI security testing has reported costs exceeding $1,200 per run for even moderate-sized projects [9].

Scaled to running continuously across hundreds of applications, costs can reach upwards of $15,000 per run when compute, review infrastructure, result triage, and overhead are fully counted.

At that price, checking every code change is economically impossible. You can’t review every pull request. What you can afford is selective, infrequent, expensive reviews: exactly the situation the industry spent a decade trying to move away from.
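A back-of-the-envelope version of that math, as a sketch only. The $1,200 per-run figure and the 200-application portfolio come from the text above; the pull-request volume and working-day count are assumptions we made up to show the shape of the problem, so substitute your own.

```python
COST_PER_RUN_USD = 1_200      # per-run figure cited above [9]
PORTFOLIO_SIZE = 200          # example portfolio from the text
PRS_PER_APP_PER_DAY = 5       # ASSUMPTION: illustrative only
WORKING_DAYS_PER_YEAR = 250   # ASSUMPTION: illustrative only
AUDITS_PER_APP_PER_YEAR = 4   # a quarterly audit cadence


def annual_cost(runs_per_app_per_year: int) -> int:
    """Yearly compute spend if every app gets this many AI review runs."""
    return PORTFOLIO_SIZE * runs_per_app_per_year * COST_PER_RUN_USD


every_pr = annual_cost(PRS_PER_APP_PER_DAY * WORKING_DAYS_PER_YEAR)
quarterly = annual_cost(AUDITS_PER_APP_PER_YEAR)

print(f"${every_pr:,}")   # $300,000,000
print(f"${quarterly:,}")  # $960,000
```

Under these assumptions, reviewing every pull request costs about 300 times what a quarterly audit does. The exact inputs matter less than the gap between the two numbers.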

The applications that don’t make the short list for AI scanning get covered only by whatever other tools are running. In a portfolio of 200 applications, that’s the vast majority of your actual risk.

Anthropic’s $100 million commitment in compute credits to Project Glasswing participants [10] is the clearest signal available: the economics of running this kind of AI security analysis at commercial scale are still unsolved. When the AI company must subsidize the tool cost for its own launch partners, those economics aren’t ready for enterprise security budgets.

Test 3: Can You Build an Audit Trail from It?

When Veracode finds a SQL Injection flaw in a pre-release build, that finding gets a unique ID, a severity rating, an owner, and a ticket. When the developer fixes it, the tool rescans and confirms it’s closed. The full record of discovery, assignment, fix, and verification becomes documentation for compliance under PCI DSS, NIST, and OWASP requirements.

That whole process depends on one thing: consistency. The same code scanned today and scanned tomorrow should produce the same findings at the same severity.

AI scanning tools can’t guarantee this. The same codebase reviewed by the same model on the same day can return different numbers of issues with different severity scores. VentureBeat, citing Cycode CTO Ronen Slavin, reported that AI-based scanning is “inconsistent by nature” and that security teams require “consistent, reproducible, audit-grade results,” a standard that AI scanning “does not constitute as infrastructure” yet [11].

There’s a second problem worth naming: when severity scores change between scans, your tracking breaks. If a flaw rated critical in one scan comes back rated high in the next, your time-to-find and time-to-fix metrics become meaningless.
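One way to see what “consistent” means in practice is to diff two scans of the same, unchanged code. The sketch below is a minimal illustration, not any vendor’s data model: the `Finding` class, the idea of a stable `fingerprint`, and the sample findings are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    fingerprint: str  # hypothetical stable ID, e.g. a hash of rule + file + location
    severity: str


def scan_drift(first: list[Finding], second: list[Finding]) -> dict[str, list[str]]:
    """Compare two scans of the same unchanged code and report what moved."""
    a = {f.fingerprint: f.severity for f in first}
    b = {f.fingerprint: f.severity for f in second}
    return {
        "missing": sorted(a.keys() - b.keys()),
        "new": sorted(b.keys() - a.keys()),
        "severity_changed": sorted(k for k in a.keys() & b.keys() if a[k] != b[k]),
    }


def is_reproducible(first: list[Finding], second: list[Finding]) -> bool:
    """Audit-grade means no drift at all between identical inputs."""
    return not any(scan_drift(first, second).values())


# Invented example: same code, two runs, different answers.
run_1 = [Finding("sqli-login", "critical"), Finding("xss-search", "medium")]
run_2 = [Finding("sqli-login", "high"), Finding("path-upload", "medium")]

print(scan_drift(run_1, run_2))
# {'missing': ['xss-search'], 'new': ['path-upload'], 'severity_changed': ['sqli-login']}
print(is_reproducible(run_1, run_1))  # True
print(is_reproducible(run_1, run_2))  # False
```

Every non-empty list in that output is a question an auditor will ask. Was `xss-search` fixed, or did the tool just not mention it this time?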

For a CISO walking into a compliance review, or a security VP presenting improvement trends to the board, inconsistent findings don’t just create audit risk. They make the program impossible to manage.

Test 4: Does the Auto-Fix Actually Work?

Auto-remediation is one of the most heavily marketed AI security capabilities. The right question isn’t whether an AI can generate a code fix. It’s whether that fix is based on a confirmed real problem, and reliable enough that a developer can apply it without second-guessing it.

AI scanning tools return a mix of confident findings and possible-but-unverified observations that need expert review to sort out. For a security team, 40 findings of mixed confidence spread across a codebase is a days-long triage project. For a developer trying to merge a branch, it’s a reason to stop trusting the tool entirely.

Veracode Fix works differently. It generates specific code fixes tied directly to findings from a Veracode Static Analysis scan, using Veracode’s proprietary database of real-world fix patterns rather than a general-purpose AI writing code from scratch [12]. Because Fix works off confirmed, categorized findings, every suggested patch addresses a real vulnerability with known location, type, and severity. Developers aren’t asked to fix something that might or might not exist.

Fix covers more than 70% of detected flaws across ten languages [13], speeds up remediation by a reported 200% [14], and works directly inside VS Code, the Veracode CLI, or the pipeline. The CLI can auto-apply the best suggestion across an entire project for bulk cleanup [13].

At RSA 2026, Veracode announced the expansion of Fix to open-source dependency vulnerabilities as well, not just code your developers wrote but the third-party libraries they import, all without breaking builds [15].

AI auto-fix suggestions are only as reliable as the underlying finding. When the detection is inconsistent, the fix is too. Veracode Fix doesn’t have that problem: the fix is grounded in a confirmed, categorized finding from a consistent, repeatable scan. The suggestion your developer applies is correct for the actual code, not a generic patch that may or may not address the real issue.

The Harder Question Claude Mythos Is Actually Forcing

Reading Claude Mythos as a threat to your scanning budget misses the real signal.

The actual question it’s forcing: since AI tools are writing application code and finding vulnerabilities at scale, can your security program prove that what you’re shipping can be trusted?

“Did we run a scan?” isn’t sufficient anymore when code is being written by AI assistants, changed at machine speed, and shipped in delivery pipelines faster than any human can review [16]. The question that matters: can you continuously prove trustworthiness, with a documented audit trail, findings tracked to closure, and compliance records that hold up under scrutiny, before the software reaches production?

Answering that requires layers. Veracode Static Analysis runs on every code change and every merge request for consistent baseline coverage across the full portfolio. Open-source dependency scanning runs alongside it to catch vulnerable third-party libraries before they enter the build. Runtime testing validates behavior in staging before release. AI-assisted deep review gets applied selectively to the highest-risk parts of the codebase, where the cost and time can be absorbed.

The Cloud Security Alliance is right: organizations that benefit from AI security tools treat them “not as drop-in replacements for existing SAST or DAST platforms, but as a qualitatively different class of capability requiring new governance structures, new triage workflows” [18].

Five Questions to Ask About Your Program Right Now

1) Is security scanning running on every code commit and merge request, or only before release?

If developers first see findings after the build is done, your “catch it early” strategy has gaps that AI tools widen, not close. Scanning before code gets merged is the foundation everything else is built on.

2) Do you have a consistent, tracked record of findings across your full portfolio?

If you can report how fast you’re finding and fixing vulnerabilities by type, and show improvement trends quarter over quarter, you have a mature baseline. If not, adding inconsistent AI scanning on top of an already unclear picture makes the problem harder, not easier.

3) Are you scanning AI-written code the same way you scan human-written code?

A growing share of committed code was never reviewed by a human developer. Veracode doesn’t care whether code came from a person or an AI assistant; it runs the same checks either way. Applying that consistently to AI-generated code is one of the highest-value security steps available right now.

4) Are your developers getting fix suggestions based on confirmed findings, or AI guesses?

The question isn’t whether an AI can suggest a fix. It’s whether that suggestion is based on a confirmed, categorized finding from your actual codebase, or is a generic AI output that may not address the real issue. Veracode Fix covers 70%+ of detected flaws, speeds up remediation by a reported 200%, and recently expanded to open-source dependency fixes [13][14][15].

5) Can you produce the documentation your compliance frameworks require?

PCI DSS requires a documented security review process. NIST requires using a combination of analysis techniques. OWASP requires comprehensive, documented verification of application security. Consistent, repeatable scanning with a full audit trail satisfies those requirements. Occasional AI scanning with variable results doesn’t.

Claude Mythos is a genuine technical achievement. Project Glasswing is a serious effort to apply frontier AI to one of the hardest problems in software engineering.

But in security program management, the gap between an impressive research capability and a tool that belongs in your daily process has real consequences: failed audits, breach investigations, board conversations nobody wants to have. Not research papers.

Your consistent, repeatable scanning foundation isn’t what AI is replacing. It’s what makes adding AI worth doing.

Veracode helps organizations build and prove Software Trust across their entire application portfolio, from first commit through production release. To learn how Veracode Static Analysis integrates into your development pipeline, visit veracode.com.

All statistics and citations are drawn from publicly available research, vendor documentation, and independent market analysis current as of May 2026.

Sources

[1] Project Glasswing: Securing critical software for the AI era \ Anthropic (anthropic.com)

[2] Claude Mythos and Project Glasswing: Cybersecurity Nightmare (abhigarg.com)

[3] Claude Mythos Found Thousands of Zero-Days in Weeks. Here’s the Architectu… (medium.com)

[4] Vulnerability Management in the Claude Mythos Era | Holm Security (holmsecurity.com)

[5] FAQ – Codex Security | OpenAI Developers (developers.openai.com)

[6] Mythos & Semgrep | Semgrep (semgrep.dev)

[7] Web source

[8] The AI Vulnerability Scanning Market: OpenAI Codex Security and the Anthro… (labs.cloudsecurityalliance.org)

[9] AI Security Testing Costs Overview (netguru.com)

[10] Project Glasswing \ Anthropic (anthropic.com)

[11] Anthropic and OpenAI just exposed SAST’s structural blind spot with free t… (venturebeat.com)

[12] Veracode Fix | Veracode Docs (docs.veracode.com)

[13] AI Code Remediation | Fix Application Vulnerabilities with Veracode (veracode.com)

[14] Application Security Remediation Guidance | Veracode (veracode.com)

[15] Veracode Expands Industry-Leading Fix with AI-Powered SCA Remediation to C… (veracode.com)

[16] Anthropic’s Mythos: Why the Future of Security Is Software Trust (veracode.com)

[17] Checkmarx Application Security Guide to Claude Mythos (checkmarx.com)

[18] The AI Vulnerability Scanning Market: OpenAI Codex Security and the Anthro… (labs.cloudsecurityalliance.org)

By Joe Ariganello

Joe Ariganello, Senior Director, Product Marketing at Veracode, is a forward-thinking Product Marketing & GTM Strategy Executive. With over 20 years of experience transforming complex technologies into compelling market narratives, Joe has driven measurable business growth through strategic positioning and innovative go-to-market strategies for companies including MixMode, FireEye, Anomali, Blue Voyant, and more.
