Q&A
10 Questions from KubeCon '25 Takeaways Webinar
Key Takeaways
- Kubernetes is increasingly positioned as a foundational platform for AI workloads, requiring new operator and developer skills.
- AI-assisted operations and standardized conformance aim to reduce toil and improve portability and reliability across clusters.
- Security, observability, and simplification platforms remain central as Kubernetes grows more complex with GPUs and multi-cluster demands.
In 2025, I had the opportunity to present at a webinar sponsored by Redmond Magazine and 1105 Media (the parent company of Virtualization and Cloud Review), entitled "KubeCon '25 Takeaways: Top Trends, Tools & Technologies." It was a great event, and I had a lot of fun doing it. During my presentation, I discussed the key points I took away from KubeCon over the four days I attended.
There were a few questions I didn't have time to address during the event, so below are the questions and, more importantly, my answers.
Tom's Tip - The webinar was recorded and can be watched here.
1. You hinted that Kubernetes is evolving from a container orchestrator into a foundational platform for AI workloads. What implications does that shift have for operators and developers?
My Answer:
Yes, Kubernetes is now being positioned as the operating system for AI. This makes sense as it provides workload portability, autoscaling, GPU scheduling, in-place pod resizing, and dynamic IPAM. These are all features that AI and ML systems require. This shift means operators must become comfortable with GPU-aware scheduling, multi-cloud pipelines, and model-serving architecture. Developers will gain a standardized, platform-agnostic way to deploy inference services, agentic workers, and model training pipelines, dramatically simplifying the development of AI-native applications.
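To make the GPU-scheduling point concrete, here is a minimal pod spec sketch. It assumes a cluster with the NVIDIA device plugin installed (which exposes the `nvidia.com/gpu` resource); the pod name and image are hypothetical placeholders, not anything shown at KubeCon.

```yaml
# Hypothetical pod spec illustrating GPU-aware scheduling.
# Assumes the NVIDIA device plugin is running on the cluster's GPU nodes.
apiVersion: v1
kind: Pod
metadata:
  name: inference-server        # hypothetical name
spec:
  containers:
  - name: model-server
    image: example.com/llm-server:latest   # placeholder image
    resources:
      requests:
        cpu: "4"
        memory: 16Gi
        nvidia.com/gpu: 1       # scheduler places the pod on a GPU node
      limits:
        nvidia.com/gpu: 1       # GPUs must be requested in whole units
```

Because extended resources like `nvidia.com/gpu` are opaque to the scheduler beyond counting, this is also why conformance and standardization efforts matter for AI workloads.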
2. You mentioned that Red Hat and Google had new AI-integrated tooling for OpenShift and GKE. Which of these innovations do you think will have the biggest long-term impact on cluster operations?
My Answer:
The most impactful innovation is AI-assisted cluster operations; specifically, I mentioned Red Hat Lightspeed and Google's AI-enabled management portal. These systems use AI to answer operational questions, suggest remediations, and even automate certain troubleshooting tasks. While accelerated GPU provisioning and IPAM improvements are important, AI-driven insight will drastically reduce toil and elevate the role of platform engineers, developers, and others.
3. Why is the Kubernetes AI Conformance Program so significant for AI workloads, and what risks arise without standardization?
My Answer:
AI workloads by their very nature vary widely in resource demands, scheduling behavior, and data access patterns. The AI Conformance Program establishes consistent expectations for running them across distributions. Without standardization, organizations risk unpredictable model performance, inconsistent scaling, lock-in to specific cloud providers, and difficulty debugging workloads that behave differently between clusters. Conformance ensures portability and reliability.
4. You mentioned that Niantic and Scopely use Kubernetes to run global Pokémon GO events. What does this reveal about Kubernetes's ability to support real-time, latency-sensitive applications?
My Answer:
It shows that Kubernetes can handle extreme, globally distributed, low-latency workloads, especially when they are paired with advanced autoscaling and ML-powered prediction pipelines. These workloads require rapid, location-dependent scaling, and Kubernetes has matured enough to meet those demands through features such as regional clusters, autoscaling optimizations, and model-driven traffic forecasting.
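As a rough sketch of the autoscaling side of this, a HorizontalPodAutoscaler can grow a deployment ahead of or during a regional traffic spike. The names and thresholds below are illustrative assumptions, not details from Niantic or Scopely:

```yaml
# Hypothetical HPA sketch for a latency-sensitive, event-driven service.
# Names, replica counts, and the 60% CPU target are illustrative only.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: game-event-api          # hypothetical workload
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: game-event-api
  minReplicas: 10               # floor to absorb the first wave of players
  maxReplicas: 500              # ceiling for a global event spike
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60  # scale out before CPU saturates
```

In practice, the ML-powered prediction pipelines mentioned above would feed custom or external metrics rather than raw CPU, but the scaling mechanism is the same.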
5. It sounds like CNCF is investing heavily in supply chain integrity. What are the biggest remaining gaps in cloud-native security, and how should they be addressed?
My Answer:
The most significant gaps that I see are securing the software lifecycle end-to-end, reducing vulnerabilities introduced through dependencies, and ensuring consistent runtime security across multi-cluster deployments. The community must expand its automated tooling; the CNCF already uses Antithesis for fault injection and has increased the frequency of third-party audits. Better education around SBOM usage and stronger signing and verification systems are also needed to keep the ecosystem secure.
6. You said that OpenAI saved 30,000 CPU cores with a one-line logging fix. What does this tell us about the importance of understanding system internals at scale?
My Answer:
I think that it shows that small inefficiencies multiply drastically at hyperscale. Engineers working at a large scale must understand not just the application but the underlying libraries, sidecars, runtime systems, and observability agents. Optimization isn't optional; it results in significant cost savings, reduced environmental impact, and simpler operational overhead. This example reinforces that performance engineering still matters, even in a world where cloud resources feel infinite.
But what I found most interesting is that a human found and fixed this issue, not AI.
7. I found the "Cloud Native for Good" stories to be powerful. Where else do you see cloud-native technology having a global humanitarian impact?
My Answer:
Thanks, I enjoyed these stories as well. They showed how cloud-native infrastructure can help in disaster response, refugee logistics, health outreach, real-time climate monitoring, crisis communications, and educational access. Any sector that relies on distributed systems, real-time insights, or data coordination can benefit. Kubernetes democratizes compute, making advanced IT capabilities accessible to humanitarian organizations that traditionally couldn't afford them.
8. Chronosphere emphasized AI-guided observability using a Temporal Knowledge Graph. How will AI reshape observability and SRE workflows?
My Answer:
AI will shift observability from reactive dashboards to proactive guidance. Instead of operators digging through charts and logs, platforms like Chronosphere can correlate events, analyze time-dependent patterns, surface suspected causes, and even generate corrective actions. SRE workflows shift toward validation and governance rather than manual troubleshooting, reducing engineer burnout and significantly shortening incident duration.
9. As I understand it, SUSE is building a private, governed AI platform rather than a SaaS LLM. Which industries benefit most from this architecture?
My Answer:
It will definitely help heavily regulated industries such as finance, healthcare, defense, government, energy, pharmaceuticals, and manufacturing. These organizations simply cannot send sensitive IP or datasets to public LLMs. SUSE's sovereign AI model gives them full ownership of their data, models, GPU usage, and observability stack. It also helps organizations manage MCP sprawl and coordinate tool governance, which are critical for enterprise AI adoption in these industries.
10. Companies like Devtron, Nutanix NKP, and Tailscale aim to simplify Kubernetes operations. Is Kubernetes becoming easier or harder for organizations to adopt?
My Answer:
Both -- and that tension defines the current era. Kubernetes itself continues to grow in complexity, especially with AI-native workloads, GPUs, and multi-cluster architecture. But platforms like NKP, Devtron 2.0, and Tailscale's operator make the experience easier by abstracting away infrastructure details. Kubernetes remains powerful but challenging; the ecosystem around it is what makes adoption realistic for most enterprises. Over time, these simplifications will make Kubernetes "feel" easier, even if the underlying system continues to evolve rapidly and grow more complex.
About the Author
Tom Fenton has a wealth of hands-on IT experience gained over the past 30 years in a variety of technologies, with the past 20 years focusing on virtualization and storage. He previously worked as a Technical Marketing Manager for ControlUp, and before that at VMware in Staff and Senior level positions. He also worked as a Senior Validation Engineer with The Taneja Group, where he headed the Validation Service Lab and was instrumental in starting up its vSphere Virtual Volumes practice. He's on X @vDoppler.