KubeCon '25 Takeaways: Top Trends, Tools & Technologies

Kubernetes expert Tom Fenton helmed a post-KubeCon + CloudNativeCon 2025 webcast today to walk through what stood out at this year's event, and why several of those takeaways matter to enterprise IT teams running containers in production.

"If Kubernetes is to become the OS for AI, we need a clear standard to guarantee portability and interoperability. And that led to the the major announcement on the first day of the conference, the CNCF AI Conformance Program. This is the single most significant outcome of the entire event."

Tom Fenton, Kubernetes Expert

Kubernetes is a vital component of modern cloud-native computing, serving as an open source container orchestration platform used to deploy, scale, and manage distributed applications across clusters of servers. Originally created to automate the scheduling and lifecycle of containerized workloads, it has evolved into the de facto control plane for cloud-native infrastructure, with a large ecosystem of related projects for networking, storage, security, observability, and application delivery. Over the past decade, Kubernetes has expanded from a tool primarily associated with stateless microservices to a general-purpose platform for running a wide range of production workloads across public cloud, private data centers, and edge environments, emphasizing portability, automation, and consistent operations at scale.
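
To make that deploy-and-scale workflow concrete, here is a minimal sketch (my illustration, not something from Tom's talk) using the official Kubernetes Python client to create a three-replica Deployment. The `web` name, `nginx:1.27` image, and `default` namespace are placeholders, and it assumes the client is installed (`pip install kubernetes`) with a reachable cluster configured:

```python
from kubernetes import client, config

# Load credentials from the local kubeconfig; inside a cluster,
# config.load_incluster_config() would be used instead.
config.load_kube_config()

# A Deployment declares desired state: three replicas of one container.
# Kubernetes controllers then create, reschedule, and replace pods to match it.
deployment = client.V1Deployment(
    api_version="apps/v1",
    kind="Deployment",
    metadata=client.V1ObjectMeta(name="web"),  # placeholder name
    spec=client.V1DeploymentSpec(
        replicas=3,
        selector=client.V1LabelSelector(match_labels={"app": "web"}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "web"}),
            spec=client.V1PodSpec(
                containers=[client.V1Container(name="web", image="nginx:1.27")]
            ),
        ),
    ),
)

client.AppsV1Api().create_namespaced_deployment(namespace="default", body=deployment)
```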

Tom's talk about last month's big conference, which he attended in Atlanta, covered a lot of ground, but two themes kept surfacing: Kubernetes' continuing shift toward being the default platform for AI workloads, and the operational realities of scaling those workloads efficiently when compute accelerators are scarce.

Note that today's webcast, titled "KubeCon '25 Takeaways: Top Trends, Tools & Technologies," is being made available for on-demand replay courtesy of the sponsor, Nutanix, which also presented a session.

Because we don't have space here to detail Tom's entire presentation, which was chock-full of advice and expert observations, below is a focused recap of two of his main themes.

Kubernetes Is Positioning Itself as the Operating System for AI
One of Tom's central takeaways was how thoroughly AI has moved into the Kubernetes mainstream, as was evident throughout KubeCon 2025. Over the last several years, Kubernetes proved itself as a durable orchestration layer for cloud-native apps. At this event, Tom said, the community was clearly extending that foundation into AI, treating Kubernetes not just as a container scheduler but as the underlying platform that AI stacks will assume.

He summarized this direction succinctly: "Kubernetes is the AI operating system."

Tom connected that idea to what he described as the event's most consequential announcement: a new Cloud Native Computing Foundation (CNCF) AI Conformance Program. The motivation, he explained, is the growing fragmentation in AI infrastructure and tooling. Teams building on Kubernetes today often face different GPU and accelerator configurations, different AI frameworks, and different deployment patterns depending on where workloads run. Without shared expectations, portability becomes difficult and operational overhead grows.

CNCF AI Conformance Program (source: Tom Fenton).

The AI Conformance Program is intended to reduce that friction by establishing a baseline of capabilities and APIs that AI workloads can rely on across Kubernetes environments. In practice, that means less hand-tuning per cloud or cluster type, and more confidence that an AI pipeline developed in one environment will behave consistently in another. Tom stressed that this type of standardization is arriving at a key moment, when many organizations are still working out how to move AI proofs of concept into repeatable production deployments.
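
One concrete example of the fragmentation Tom described: a pod that needs an accelerator today requests it through a vendor-specific extended resource name. The hedged sketch below (my illustration, not drawn from the program's spec) uses the Python client and assumes the NVIDIA device plugin's `nvidia.com/gpu` resource name; other vendors expose different names, which is exactly the kind of variation a conformance baseline can smooth over:

```python
from kubernetes import client, config

config.load_kube_config()

# GPU-type extended resources are requested via limits, and the resource
# name (here nvidia.com/gpu, from the NVIDIA device plugin) varies by vendor.
pod = client.V1Pod(
    api_version="v1",
    kind="Pod",
    metadata=client.V1ObjectMeta(name="trainer"),  # placeholder name
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="trainer",
                image="registry.example.com/train:latest",  # placeholder image
                resources=client.V1ResourceRequirements(
                    limits={"nvidia.com/gpu": "1"}  # one whole GPU; not fractional
                ),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```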

He also described a broader shift in how Kubernetes contributors are thinking about the stack. Instead of viewing Kubernetes as "just" an orchestrator sitting below AI tooling, the community is increasingly treating it as the stable substrate on top of which training, inference, and emerging AI application patterns can be built. That framing helps explain the push for conformance and shared developer expectations.

Scaling AI Workloads: Dual Challenges and Scarce Accelerators

Tom's second major theme focused on the operational side of AI in Kubernetes environments. He pointed to what he called dual scaling challenges: on one end of the spectrum, organizations are building very large centralized Kubernetes clusters to support heavy AI training workloads; on the other end, they're increasingly deploying far-flung Kubernetes footprints at the edge to support inference close to users or devices.

Dual Scaling Challenges (source: Tom Fenton).

The operational demands of those two models are different. Massive centralized clusters require careful scheduling and resource controls to keep training pipelines fed efficiently. Edge footprints introduce multi-cluster management and lifecycle complexity at scale. Tom emphasized that both patterns are expanding quickly, and that Kubernetes operators are being asked to solve both at once.

Part of the pressure comes from AI engineers' expectations. Tom noted that AI developers increasingly want infrastructure to be "invisible" so they can focus on experimentation and model work, not on cluster mechanics. In that context, he said, it's "essential to kind of keep in mind that AI engineers" are trying to iterate quickly without getting pulled into infrastructure details. For platform teams, that translates into pressure to simplify consumption while still keeping expensive resources productive and available.

That leads directly to the scarcity problem. Tom described the persistent shortage of GPUs and other accelerators as a defining constraint for AI operations. When accelerators are hard to get, wasting them is not just a cost issue -- it's a velocity issue. Organizations that can't keep accelerators utilized risk slowing down development cycles and delaying time to value.
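
As a rough illustration of how a platform team might start quantifying that waste, the hedged sketch below uses the Kubernetes Python client to compare GPUs requested by scheduled pods against the cluster's allocatable total. It again assumes the `nvidia.com/gpu` resource name, and it measures requests, not live utilization:

```python
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

GPU = "nvidia.com/gpu"  # assumes the NVIDIA device plugin's resource name

# Total GPUs the cluster can hand out.
allocatable = sum(
    int(node.status.allocatable.get(GPU, "0")) for node in v1.list_node().items
)

# GPUs claimed by running or pending pods (requests, not actual usage).
requested = 0
for pod in v1.list_pod_for_all_namespaces().items:
    if pod.status.phase not in ("Running", "Pending"):
        continue
    for c in pod.spec.containers:
        limits = (c.resources.limits if c.resources else None) or {}
        requested += int(limits.get(GPU, "0"))

print(f"GPUs allocatable: {allocatable}, requested: {requested}")
if allocatable:
    print(f"Claimed share of the pool: {requested / allocatable:.0%}")
```

A request-level view like this only shows whether GPUs are claimed; measuring whether claimed GPUs are actually busy requires accelerator-level telemetry, such as the metrics NVIDIA's DCGM tooling exposes.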

His framing was that AI operations are pushing IT teams to optimize for throughput and business impact rather than solely for cost reduction. As he put it, "so the new goal is to maximize value and streamline operations" around accelerator-heavy workloads. The implication for Kubernetes teams is clear: scheduling, resource governance, and multi-cluster tooling need to evolve to treat accelerators as strategic assets, not just another pool of compute.

And More
Be sure to register for these upcoming Virtualization & Cloud Review events. Attending live lets you ask questions during the webcast to get expert one-on-one advice for your particular circumstances -- and, for many sessions, qualify for attendee giveaways.

For the full and continually updated lineup, see our webcasts and summits list.

About the Author

David Ramel is an editor and writer at Converge 360.
