Summary
OpenAI, facing massive telemetry volumes from its growing Kubernetes clusters, uncovered a significant CPU bottleneck in Fluent Bit, the log forwarder in its observability stack. Fabian Ponce, a member of OpenAI's technical staff, revealed at KubeCon+CloudNativeCon North America 2025 that a single system call, `fstatat64`, used to determine log file sizes, consumed 35% of Fluent Bit's CPU cycles. By disabling the file-size check, which proved unnecessary at their scale, OpenAI recovered roughly 30,000 CPU cores across its Kubernetes clusters, demonstrating how a small optimization in a large system can yield substantial resource savings.
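The hotspot described above comes from repeatedly stat-ing tailed log files to read their size. A minimal Python sketch (file names and poll loop are hypothetical, not OpenAI's code) illustrates the pattern: each polling pass issues one stat system call per tracked file, which on Linux surfaces as `fstatat64`/`newfstatat`, so at cluster scale the per-file cost multiplies quickly.

```python
import os
import tempfile

# Create a few throwaway "log files" to stand in for real tailed logs.
tmpdir = tempfile.mkdtemp()
paths = []
for i in range(3):
    p = os.path.join(tmpdir, f"app-{i}.log")
    with open(p, "w") as f:
        f.write("hello\n")
    paths.append(p)

def poll_sizes(paths):
    """One polling pass: stat every tailed file to see whether it grew.

    On Linux, each os.stat() here is a stat-family system call
    (fstatat64/newfstatat) -- the same call OpenAI's profile showed
    as the hotspot. A tailer polling N files every T ms issues N/T
    such calls continuously, whether or not the files changed.
    """
    return {p: os.stat(p).st_size for p in paths}

sizes = poll_sizes(paths)
print(sizes)
```

A quick way to see this cost in a running process is `strace -c -p <pid>`, which summarizes time spent per system call; a tailer dominated by size checks shows the stat family near the top of that table.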
Why It Matters
IT operations leaders should read this article because it highlights the value of deep performance profiling, even of seemingly minor components, in large-scale distributed systems. OpenAI's experience with Fluent Bit is a powerful case study: bottlenecks can surface in unexpected places, and small, targeted tweaks can recover massive resources (30,000 CPU cores in this instance). For leaders managing complex infrastructure, the lesson is that proactively identifying and addressing such 'insatiable appetites' for resources in their own observability and infrastructure tooling can unlock significant cost savings, improved efficiency, and added capacity.





