KubeCon 2025: Exploring the KubeCon Ecosystem, Part 1
KubeCon + CloudNativeCon North America 2025, which was held in Atlanta during the second week of November, was a showcase for innovative technologies.
The Kubernetes (K8s) and Cloud Native ecosystem showcased at this conference is extensive. See my previous coverage of Day 0, Day 1, and Day 2 for more details on the big show.
During the event, I had the opportunity to speak with representatives from many companies. As always, I was impressed not only by the scope but also by the depth of the technology on display. It's worth noting that this technology originates from both established companies and startups.
With so many companies at the event, choosing which ones to highlight for this article was a challenge. The breadth of projects associated with K8s is best captured by the Projects and Products Landscape chart maintained by the Cloud Native Computing Foundation (CNCF). The chart contains far too many entries to reproduce legibly here, but you can view it in its entirety online.
The CNCF members that develop and support this technology range from some of the world's largest IT companies to shops with just a few employees. Again, the membership chart has far too many entries to reproduce legibly here, but you can view the full chart online.
After careful consideration, I decided to cover the broadest possible range of companies and products, from mature, well-established companies like Nutanix, which delivers a complete, turnkey Kubernetes solution, to Tailscale, a startup laser-focused on providing network connectivity to containers running in Kubernetes.
Before discussing these companies, I would like to share highlights from a fascinating panel I had the opportunity to attend.
Cloud Native AI Production Roundtable
The key insights I took away from the Cloud Native AI Production Roundtable centered on the intersection of artificial intelligence and cloud-native technologies. The panelists agreed that Kubernetes has become the de facto platform for AI workloads. Rapid adoption within the open source community drove this trend, which eventually reached enterprise customers. That adoption was fueled by the ability to leverage existing investments and skills in cloud-native infrastructure, avoiding the need to build a separate, parallel stack for AI.
The panel, expertly moderated by Natasha Woods of the CNCF, consisted of the following industry experts:
| Panelist | Title & Company | Key Area of Expertise |
| --- | --- | --- |
| Lachlan Evenson | Principal Product Manager, Azure (Microsoft) | Kubernetes Steering Committee, cloud-native open source |
| Brandon Royal | Product Manager, Google Cloud | Agentic AI systems, AI infrastructure at scale |
| Keith Babo | Vice President of Product, solo.io | Agentic infrastructure, cloud-native application networking |
| Hong Wang | Co-founder and CEO, Akuity | GitOps, cloud-native control planes, and distributed systems |
The panelists unanimously confirmed that Kubernetes is a no-brainer choice for running AI. It provides a stable, extensible, and predictable distributed foundation for all three pillars of AI workloads: training, inference, and agentic systems.
Tom's Tip - Agentic systems are AI systems designed to take actions autonomously based on goals, context, and real-time information. They behave more like "agents" than tools.
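To make that distinction concrete, here is a minimal, illustrative sketch of an agentic loop in Python. Every function in it is a hypothetical stand-in (no real model or tool APIs are used); the point is the shape of the loop: plan, act through a tool, observe, and repeat until the goal is met.

```python
# Minimal agentic loop -- all functions are hypothetical stand-ins.

def check_inventory(item: str) -> int:
    """Stand-in for an external tool the agent can invoke."""
    return {"gpu-node": 3}.get(item, 0)

def plan_next_action(goal: str, observations: list):
    """Stand-in for a model call that picks the next step toward the goal."""
    if not observations:                      # nothing observed yet: gather data
        return (check_inventory, "gpu-node")
    return None                               # goal satisfied: stop acting

def run_agent(goal: str) -> list:
    observations = []
    while (action := plan_next_action(goal, observations)) is not None:
        tool, arg = action
        observations.append(tool(arg))        # act autonomously, feed results back
    return observations

print(run_agent("report available GPU capacity"))  # -> [3]
```

A tool, by contrast, is a single call with no loop; the agent's defining trait is that it decides for itself which calls to make and when to stop.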
While inference on Kubernetes is mature, the emergence of agentic AI is creating new infrastructure challenges. These systems, which integrate with numerous external tools, are driving a fundamental change in what security and observability look like, leading to a resurgence in technologies like service mesh to manage complex, large-scale interactions.
Initiatives like the new CNCF AI Conformance for Kubernetes (which was announced on the first day of the conference) are critical for establishing a reliable baseline, signaling to the ecosystem that the platform is ready for prime time. The goal is to create standard abstractions for hardware accelerators (GPUs, TPUs) and networking, allowing the community to innovate on higher-level problems without getting bogged down in low-level implementation details. Lachlan Evenson put it succinctly when he said:
"We can put a stamp and signal to the community that Kubernetes as a platform with this set of APIs is ready for prime time, so the platform Builders...can come in, take that as standard, and build brand new communities and tools."
This allows tool creators and platform teams to build with confidence, knowing that the underlying Kubernetes distribution will perform as required.
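To illustrate what such a standard abstraction looks like in practice, the sketch below uses the official Kubernetes Python client to request a GPU through Kubernetes' extended-resource mechanism. The image name is hypothetical; "nvidia.com/gpu" is the resource key conventionally exposed by NVIDIA's device plugin, and a TPU or other accelerator would use a different key.

```python
# Sketch: requesting one GPU via Kubernetes' extended-resource abstraction.
# Assumes a reachable cluster in ~/.kube/config; the image is hypothetical.
from kubernetes import client, config

config.load_kube_config()

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="training-pod"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="trainer",
                image="example.com/trainer:latest",  # hypothetical image
                resources=client.V1ResourceRequirements(
                    # The scheduler places this pod on a node advertising a GPU.
                    limits={"nvidia.com/gpu": "1"}
                ),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```

Because the workload names only the resource, not the driver or hardware details, the same pattern carries across clusters and vendors that implement the abstraction.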
One of the significant challenges is bridging the gap between AI engineers, who typically work with Python frameworks and are not experts in Kubernetes, and the underlying infrastructure. The solution is not to force AI developers to learn Kubernetes, but to meet them where they are by providing interfaces and tools that abstract away the complexity of the platform.
The industry is shifting from simple cost optimization to day-one value optimization, driven by the scarcity of AI accelerators. Concurrently, service mesh technology is evolving from complex, sidecar-based models to simpler, more integrated ambient and sidecarless approaches that are better suited to the new security and traffic management demands of AI.
The demands of AI are pushing Kubernetes along two different scaling dimensions. On one hand, a handful of organizations need massive single clusters for large-scale distributed training. Brandon Royal noted that Google recently announced GKE support for 130,000 nodes in a single cluster to meet this demand.
On the other hand, edge AI use cases invert the problem: the challenge is managing a huge fleet of smaller clusters. Hong Wang cited a restaurant chain planning to run GPUs in individual restaurant locations, resulting in a massive number of endpoints to manage.
The ultimate measure of Kubernetes' success for AI is making it boring: a stable, reliable, and invisible foundation. When Kubernetes achieves this, engineers can submit a job and trust the platform to run it correctly, without needing to understand the underlying nodes, schedulers, or kernels.
Vendor Recaps
Below is a recap of some of my conversations with vendors at KubeCon. I tried to cover a wide range of technologies and vendor sizes, from startups with one or two employees to large, well-established companies such as Nutanix and SUSE.
Komodor
Founded in 2020, Komodor emerged from stealth with the mission of simplifying the operational complexity of Kubernetes environments. Since then, it has grown its customer base dramatically and extended its product to meet the demands of day-two operations, cost optimization, and drift detection in multi-cluster, multi-cloud Kubernetes landscapes. Companies large and small, including Intel, Priceline, Cisco, and OpenTable, use it.
Essentially, Komodor positions itself as an autonomous AI Site Reliability Engineering (SRE) platform designed to manage the increasing complexity of day-two Kubernetes operations. Its core AI agent continuously analyzes cluster health, identifies root causes, and recommends or executes remediations. This offloads a large portion of work traditionally handled by SRE and DevOps teams. The platform benefits from a deeply trained AI model with high reported accuracy and integrates tightly with the cloud-native ecosystem.
I asked how difficult it is to deploy, and they said the platform was designed for simplicity: it can be deployed with a single Helm command (a representative sketch appears below). It supports a wide range of use cases, from resolving GPU-related failures in AI training workloads to empowering non-experts to diagnose and solve issues.
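For context, a single-command Helm deployment typically looks like the following. I have not verified Komodor's actual chart coordinates, so the repository URL, chart name, and --set keys here are assumptions for illustration; consult Komodor's documentation for the real values.

```shell
# Illustrative only: repo URL, chart name, and --set keys are assumed, not verified.
helm repo add komodorio https://helm-charts.komodor.io
helm repo update
# apiKey and clusterName are hypothetical parameters (agent auth, display name).
helm install komodor-agent komodorio/komodor-agent \
  --set apiKey=<YOUR_API_KEY> \
  --set clusterName=<CLUSTER_NAME>
```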
They also emphasized their reliability-first approach to cost optimization, which distinguishes them from tools that prioritize savings at the expense of system stability. The platform's AI intelligently right-sizes workloads, optimizes pod placement, manages spot instances, and reduces over-provisioning while ensuring performance and availability remain intact.
When I asked what was on their roadmap, they said they plan to extend the platform's capabilities to AI inferencing workloads. The company aims to become the central AI-driven operations layer for the entire AI/ML lifecycle running on Kubernetes.
In part two of this post, I will cover a few more of the companies that I wanted to highlight.