In-Depth
AIxCC -- Find and Fix Flaws in Software Automatically
I've been in IT my whole (long) working life, and I've watched society's worldwide transition from mostly manual processes to predominantly computer-based ones. This is true for my MSP clients, for my personal life, and for government and society at large. We're now incredibly dependent on IT infrastructure for every part of our lives -- banking, business, real estate, health, news and social media, and almost everything else. And all of it is built on software, much of which is Open-Source Software (OSS), in most cases maintained by dedicated volunteers.
This leaves us with huge technical debt, and systems that can't be fixed with the limited human resources that can be dedicated to them.
A few years ago, DARPA took a look at this and launched a competition, the AI Cyber Challenge (AIxCC), which asked teams to create autonomous systems that investigate a given codebase for vulnerabilities, determine whether each flaw is "reachable" (exploitable) and, if so, create a patch for it, test the patch and then apply it -- all in a system that's completely hands-off.
It's worth mentioning that the precursor to AIxCC was the Cyber Grand Challenge, DARPA's first crack (2014-2016) at enticing the community to build systems that could find and patch bugs autonomously. However, the machine learning techniques of the time, combined with other competition limitations, meant the results had limited real-world impact. The advent of LLMs trained on vast amounts of code changed the playing field and made the AIxCC results much more useful in practice.
The AIxCC competition launched in August 2023, and the semifinal was held in August 2024 at DEF CON 32, where 42 teams competed. Seven of those teams progressed and worked for another year, until the final round was held shortly before DEF CON 33, with the winning teams announced in August 2025. In this article I'll look at the composition of these systems, what makes them so interesting from a cybersecurity point of view, and speculate on where this approach might take software vulnerability patching in the future.
The Core Challenge
I'm sure we've all seen this image:
[Click on image for larger view.] Source: xkcd.
And it really shows the challenge in one simple illustration. Recently we've seen signs of the situation deteriorating further, with maintainers having to wade through floods of low-quality, AI-generated bug reports submitted to their projects with no human vetting. One package (libxml2) stopped treating security vulnerability reports as "secrets" and now simply publishes them along with other bugs, in an effort to push downstream users of the package to not just consume it but also contribute their own time and effort to it. Libxml2 is used by Apple, Google and Microsoft -- companies that could certainly afford to devote money and time to the OSS packages they incorporate into their products and services.
The Future Is Here
There were different tracks in this competition, but to keep it simple I'll focus on the seven teams that made it to the final and their efforts in creating a Cyber Reasoning System (CRS) that could operate completely autonomously. The semifinal covered five OSS repositories, whereas the final provided 28 OSS packages (Wireshark, OpenSSL, ZooKeeper, SystemD, Curl and many others), totaling 54 million lines of C and Java code to be tested. The organizers injected 70 synthetic vulnerabilities across the different packages.
Each CRS had to operate completely autonomously and was given an infrastructure of three nodes, 64 CPU cores and 256 GB of RAM to run in. Each team was also given a certain amount of LLM credit from Anthropic, Google, OpenAI and Microsoft, which together donated $1 million worth of AI usage to the competition.
The systems had to identify vulnerabilities, using fuzzing and/or LLM code analysis (static or dynamic), and create a Proof of Vulnerability (PoV) for each of them. Then the system needed to write a patch for the bug, test it and apply it. The final added the requirement to create a Static Analysis Results Interchange Format (SARIF) report for each bug; SARIF is a standard JSON format for the output of static analysis tools (a minimal example follows below). The fourth ask was to collate the PoV, patch and SARIF assessment into a bundle. The systems also had to handle both delta scans (base code plus a diff of changes) and full scans.
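To make the SARIF piece more concrete, here's a minimal sketch (in Python, since that's what most teams built their tooling in) of what a SARIF 2.1.0 report for a single finding might look like. The tool name, rule ID, file path and line numbers are purely illustrative and not taken from any competition submission.

```python
import json

# A minimal SARIF 2.1.0 report describing one hypothetical finding.
# Tool name, rule ID, file path and line numbers are illustrative only.
sarif_report = {
    "$schema": "https://json.schemastore.org/sarif-2.1.0.json",
    "version": "2.1.0",
    "runs": [
        {
            "tool": {
                "driver": {
                    "name": "example-crs",
                    "rules": [
                        {
                            "id": "CWE-122",
                            "shortDescription": {"text": "Heap-based buffer overflow"},
                        }
                    ],
                }
            },
            "results": [
                {
                    "ruleId": "CWE-122",
                    "level": "error",
                    "message": {"text": "Heap overflow reachable from the fuzz harness."},
                    "locations": [
                        {
                            "physicalLocation": {
                                "artifactLocation": {"uri": "src/parser.c"},
                                "region": {"startLine": 142, "endLine": 158},
                            }
                        }
                    ],
                }
            ],
        }
    ],
}

print(json.dumps(sarif_report, indent=2))
```

Because the format is standardized, a scoring system (or a human reviewer) can ingest findings from any CRS with the same tooling, which is exactly why the organizers required it.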
In the semifinal the results looked like this:
[Click on image for larger view.] Results for the semifinal across the seven teams
As you can see, performance on C code was much better than on Java, and there was clearly more fine-tuning needed to reliably find and patch bugs. It's worth noting that one CRS found a real (not injected) bug in SQLite during the semifinal. Each of the seven teams that continued from the semifinal received $2 million to support their work leading up to the final competition.
In the final the results were much better:
[Click on image for larger view.] Results for the Final across the seven teams
Good results across both C and Java. Here are the overall statistics for all teams, comparing the semifinal with the final:
[Click on image for larger view.] Performance improvement between Semifinal and Final
In the final the CRSs also found six real-world bugs in C and twelve in Java, all of which were responsibly disclosed to the respective projects. Eleven of the Java bugs had patches created for them automatically.
[Click on image for larger view.] Real vulnerabilities found in the semifinal and final
The Winning Teams
Team Atlanta, with their CRS called Atlantis, took home the first prize of $4 million, with Trail of Bits' Buttercup in second place for $3 million and Theori's RoboDuck in third place for $1.5 million. The links lead to the GitHub repositories for each CRS; a condition of entry was that the final (and semifinal) code had to be released as open source.
Team Atlanta was made up of about 30 people, roughly half of them PhD students, drawn from both academia (Georgia Tech) and Samsung. Interestingly, they've already begun using Atlantis internally at Samsung to find and patch bugs. Their CRS used o4-mini, GPT-4o and o3.
The Trail of Bits team was 8-10 people who optimized their solution for low cost, so that it can be used by open-source projects, which often have very limited budgets. Their favored LLMs were Claude Sonnet 4, GPT-4.1 mini and GPT-4.1.
Theori's team built their solution to rely heavily on LLMs, breaking new ground in an area of code analysis that has traditionally relied on fuzzing. They used o3, Claude Sonnet 4 and o4-mini.
Overall, most teams built their solutions in Python, with some adding a sprinkling of Rust for speed. Buttercup offers a standalone version that can run on a laptop, and Team Atlanta built Atlantis' multilanguage fuzzer to accept not only the C and Java source code required in the competition but also Python, Rust and Go.
Here are the scores of the seven teams in the final:
[Click on image for larger view.] Scoreboard breakdown of the Finals
System Architecture
While each team's design varies widely, some fundamentals are common to all of them. They run as code on top of a Kubernetes cluster (Azure Kubernetes Service, AKS, was the platform for the competition). They receive input in the form of OSS-Fuzz projects; OSS-Fuzz is Google's free fuzzing platform for critical OSS projects. The input consists of at least two parts: the source code itself, lifted from GitHub, and an OSS-Fuzz integration wrapper with metadata. Optionally there can also be a diff file, in which case the CRS should only report bugs introduced by the diff, not ones already present in the base code (a simple sketch of that filtering idea follows below).
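As a rough illustration of the delta-scan idea (not how any particular team implemented it), a CRS could parse the unified diff to work out which files and line ranges changed, then discard any candidate finding that falls outside those ranges. A minimal Python sketch:

```python
import re
from collections import defaultdict

# Matches unified-diff hunk headers such as "@@ -10,7 +12,9 @@ some_function()"
HUNK_RE = re.compile(r"^@@ -\d+(?:,\d+)? \+(\d+)(?:,(\d+))? @@")

def changed_lines(diff_text: str) -> dict[str, set[int]]:
    """Map each file touched by a unified diff to the new-file line numbers
    its hunks cover (a coarse over-approximation that includes context lines)."""
    changes: dict[str, set[int]] = defaultdict(set)
    current_file = None
    for line in diff_text.splitlines():
        if line.startswith("+++ b/"):
            current_file = line[len("+++ b/"):]
        elif current_file and (m := HUNK_RE.match(line)):
            start = int(m.group(1))
            count = int(m.group(2) or 1)
            changes[current_file].update(range(start, start + count))
    return changes

def in_scope(finding_file: str, finding_line: int,
             changes: dict[str, set[int]]) -> bool:
    """Keep a candidate finding only if it lands inside a hunk the diff touched."""
    return finding_line in changes.get(finding_file, set())
```

In a full-scan task the filter is simply skipped and every candidate finding stays in scope.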
The CRS then builds the software using individual agents and uses fuzzing, LLMs, or a combination of both to identify bugs, tries to verify them, and for each one that is valid and reachable (able to be triggered) builds out a Proof of Vulnerability (PoV). These PoVs are then passed on to other agents that use LLMs to build patches; yet other agents test those patches and, if they fix the vulnerability successfully, submit them. If possible, the discovered bug is also documented in a SARIF file, and the PoV, patch and SARIF file are bundled together (a simplified sketch of this loop follows below). And the CRS had to be completely solid and resilient from an infrastructure point of view, as no human interaction was allowed during the scoring runs, something all the teams described as challenging in their interviews.
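Here's a deliberately simplified Python sketch of that verify-patch-retest loop. It assumes a hypothetical fuzz harness binary, and the placeholder callables find_patch and to_sarif stand in for the LLM-backed agents a real CRS would use; none of this is taken from any team's actual code.

```python
import subprocess
from dataclasses import dataclass

@dataclass
class Bundle:
    pov: bytes   # input that triggers the crash
    patch: str   # unified diff produced by the patch agent
    sarif: dict  # SARIF description of the finding

def crashes(repo: str, harness: str, pov: bytes) -> bool:
    """Feed the PoV input to the fuzz harness; assume a non-zero exit code
    means the crash still reproduces."""
    proc = subprocess.run([harness], input=pov, cwd=repo, capture_output=True)
    return proc.returncode != 0

def triage(repo: str, harness: str, pov: bytes, find_patch, to_sarif) -> Bundle | None:
    """Verify a PoV candidate, generate and test a patch, and bundle the results."""
    if not crashes(repo, harness, pov):
        return None                      # not reproducible: discard the candidate
    patch = find_patch(repo, pov)        # e.g. an LLM prompted with the crash trace and code
    subprocess.run(["git", "apply", "-"], input=patch.encode(), cwd=repo, check=True)
    subprocess.run(["make", "-C", repo], check=True)   # rebuild with the candidate fix
    if crashes(repo, harness, pov):
        subprocess.run(["git", "checkout", "--", "."], cwd=repo, check=True)
        return None                      # patch didn't remove the crash: roll back
    return Bundle(pov=pov, patch=patch, sarif=to_sarif(repo, pov, patch))
```

In a real CRS each of these steps is its own agent or service, many candidates run in parallel, and the project's functional test suite is re-run as well so a patch doesn't remove the crash by breaking intended behavior.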
What AIxCC Means for Software Code Quality
As many people way smarter than me have pointed out, we have a software quality problem, and it's affecting our ability, both in business and in our personal lives, to protect ourselves from criminals. Just as the DARPA Grand Challenge in 2004 eventually led to self-driving cars becoming a reality, I think that AIxCC and the seven open-source, freely available CRS systems will enable an additional layer of code quality improvements. The statistics from the final competition run show that the average cost of each successful task (PoV, patch, SARIF, bundle) was $152, which isn't prohibitively expensive and puts automatic patching systems within reach of even small businesses and startups.
I suspect we'll see some of these CRS solutions commercialized by vendors in the code-checking space. What I'd really like to see is Google, Microsoft and AWS offering free Kubernetes runtime in their clouds for qualifying OSS packages. If a volunteer-led OSS package is valuable to the wider software field (used in critical infrastructure, used in X other important packages, or other criteria), they should provide Y hours of K8s runtime per month to run one (or more) of the CRSs against its code.
[Click on image for larger view.] Statistics for the Final round
Conclusion
The concept of an autonomous system that analyzes source code, runs the programs, identifies potential bugs, validates the flaws, develops patches and tests them, all without human intervention, is exciting and points to one way LLMs can genuinely benefit software quality.
Whether uptake of these systems will be widespread in either the commercial world or the free OSS world remains to be seen, but as AIxCC program manager Andrew Carney said when revealing the winners of the final, this is the floor. With improvements in LLMs and fuzzing, these systems will only get better from here.