Secure, High-Performance Networking for Data-Intensive Kubernetes Workloads
The intersection of Generative AI and cloud computing has been transforming how organizations build and manage their infrastructure. The demands on networking infrastructure are greater than ever, especially as data-intensive workloads are increasingly built on Kubernetes-based compute. This is particularly true in high-performance computing (HPC) environments, where the need to train advanced models securely on Kubernetes is paramount. The scalability and flexibility offered by its ecosystem make Kubernetes a preferred choice for managing complex workloads, but they also introduce unique networking challenges that need to be addressed. This blog series offers practical insights and strategies for building secure, scalable Kubernetes clusters on Azure infrastructure.
Networking Requirements for HPC and AI Workloads
High-performance computing and AI workloads, such as the training of large language models (LLMs), demand networking platforms with high input/output (I/O) capabilities. These platforms must provide low latency and high bandwidth to ensure efficient data handling and processing. As the size and complexity of datasets grow, the networking infrastructure must scale accordingly to maintain performance and reliability. Overall, the requirements can be grouped as follows:
Scalability: As organizations expand their AI initiatives, the networking infrastructure must be capable of scaling up to accommodate increasing data loads and more complex models. Scalable solutions allow seamless growth without compromising performance.
Security: Protecting data integrity and ensuring secure access to workloads are paramount. Networking platforms must incorporate robust security measures to safeguard sensitive information and prevent unauthorized access. Implementing a least-privilege approach minimizes the attack surface by granting only the necessary permissions to users and applications.
Observability: Monitoring network performance and identifying potential issues are critical for maintaining optimal operations. Advanced observability tools help in tracking traffic patterns, diagnosing problems, and ensuring efficient data flow across the network.
Low Latency: AI model training, particularly for LLMs, requires high-speed data transfer to process vast amounts of information in real time. Low latency is crucial to minimize delays in data communication, which can impact the overall training time and model accuracy.
High Bandwidth: The volume of data exchanged between compute nodes during training processes necessitates high bandwidth. This ensures that data can be transferred quickly and efficiently, preventing bottlenecks that could slow down computations.
Key Implementation Strategies
By leveraging AKS, developers can easily deploy and manage containerized AI models, ensuring consistent performance and rapid iteration. The built-in integration with Azure’s high-performance storage, networking, and security features ensures that AI workloads can be processed efficiently. Additionally, AKS supports advanced GPU scheduling, enabling the use of specialized hardware for training and inference and accelerating the development of sophisticated GenAI applications.
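For illustration, here is a minimal sketch of how a training pod might request a GPU through the standard Kubernetes resource model; the pod name and container image are placeholders, and it assumes the NVIDIA device plugin is available on the cluster’s GPU node pool:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-trainer                 # hypothetical name for illustration
spec:
  containers:
    - name: trainer
      image: your-registry/your-training-image:latest   # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1         # schedules the pod onto a node with a free GPU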
Let’s now examine some of the latest cluster networking features we’ve introduced to deliver a high-performance network datapath architecture and help users build a secure and scalable network platform. With Azure CNI powered by Cilium, users have the right foundational infrastructure to address these requirements, along with comprehensive integrations with Azure’s extensive networking capabilities.
Azure CNI Powered by Cilium
Azure Container Networking Interface (CNI), powered by Cilium, is built on a Linux technology called eBPF (Extended Berkeley Packet Filter). eBPF allows the execution of sandboxed programs in the kernel with high efficiency and minimal overhead, making it ideal for advanced networking tasks. Azure CNI leverages eBPF to offer multiple performance benefits, along with advanced in-cluster security and observability capabilities.
Performance Benefits of eBPF
eBPF provides numerous advantages that are essential for high-performance networking:
Efficient Packet Processing: eBPF enables the execution of custom packet processing logic directly in the kernel, reducing the need for context switches between user space and kernel space. This results in faster packet handling and lower latency.
Dynamic Programmability: eBPF allows for dynamic updates to networking policies and rules without requiring kernel recompilation or system restarts. This flexibility is crucial for adapting to changing network conditions and security requirements.
High Throughput: By offloading packet processing to the kernel, eBPF can handle high throughput with minimal impact on system performance. This is particularly beneficial for data-intensive workloads that demand high bandwidth.
Efficient IP Addressing for Scale and Interoperability
Planning IP addressing is a cornerstone of building dynamic data workloads on AKS. Azure CNI powered by Cilium supports both overlay mode, which is the default in AKS clusters starting with v1.30, and VNet addressing for direct-to-pod access. Furthermore, Azure CNI powered by Cilium supports dual-stack IP addressing, which allows IPv4 and IPv6 to coexist within the same network. This flexibility is essential for supporting legacy applications that may still rely on IPv4 while simultaneously enabling the adoption of newer, more efficient IPv6-based systems. By utilizing dual-stack network configurations, organizations can ensure compatibility and smooth interoperability, reducing the overhead associated with maintaining separate network infrastructures. Additionally, mixed IP addressing facilitates a smoother transition to IPv6, enhancing future-proofing and scalability as network demands grow.
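As a minimal sketch of what dual stack looks like at the workload level, the Service below (with an illustrative name and ports) requests both an IPv4 and an IPv6 address on a cluster provisioned with dual-stack networking:

apiVersion: v1
kind: Service
metadata:
  name: genai-backend-svc           # hypothetical service name
spec:
  selector:
    app: genai_backend
  ipFamilyPolicy: PreferDualStack   # use both families when the cluster supports them
  ipFamilies:
    - IPv4
    - IPv6
  ports:
    - port: 80
      targetPort: 8080              # assumes the backend listens on 8080

Because PreferDualStack falls back to a single address family on single-stack clusters, the same manifest can be rolled out unchanged while a fleet migrates to IPv6.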
In-Cluster Security and Observability
Azure CNI, powered by Cilium, enhances in-cluster security and observability through several key features:
Advanced Network Policies: Azure CNI supports Layer 3 and Layer 4 network policies, along with advanced policies based on Fully Qualified Domain Names (FQDN). This enables users to restrict connections to specific DNS names, enhancing security by limiting access to trusted endpoints.
Comprehensive Network Observability: Azure CNI’s network observability platform, based on Cilium, provides detailed insights into network traffic and performance. Users can identify DNS performance issues, such as throttling of DNS queries, missing DNS responses, and errors, as well as track top DNS queries. This level of visibility is crucial for diagnosing problems and optimizing network performance. Users can also trace packet flows across their cluster on demand with the Hubble CLI for detailed analysis and debugging.
Users can unlock the recently launched observability and FQDN-based features by enabling Advanced Container Networking Services (ACNS) on AKS clusters. Let’s take a closer look at how you can enable FQDN filtering through a CiliumNetworkPolicy (CNP), backed by a DNS proxy that allows you to upgrade the Cilium agent with minimal impact to DNS resolution. Say you have a Kubernetes pod labeled app: genai_backend and you want to control its egress traffic. Specifically, you want to allow it to access “myblobstorage.com” while blocking all other egress traffic, except for DNS queries to the kube-dns service.
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: allow-genai-to-blobstorage
spec:
  endpointSelector:
    matchLabels:
      app: genai_backend
  egress:
    - toEndpoints:
        - matchLabels:
            "k8s:io.kubernetes.pod.namespace": kube-system
            "k8s:k8s-app": kube-dns
      toPorts:
        - ports:
            - port: "53"
              protocol: ANY
          rules:
            dns:
              - matchPattern: "*.myblobstorage.com"
    - toFQDNs:
        - matchName: app1.myblobstorage.com
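With this policy applied, Cilium’s DNS proxy inspects the pod’s lookups sent to kube-dns, learns the IP addresses returned for names matching *.myblobstorage.com, and permits egress only to the addresses resolved for app1.myblobstorage.com; all other egress traffic from pods labeled app: genai_backend is denied.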
Additional Considerations for High-Performance Networking
Kubernetes-based data applications also require high-performance networking from the container networking platform. The underlying networks often demand high throughput and low latency, which translates into high-speed interfaces configured with technologies like InfiniBand. These interfaces can deliver bandwidths of 100 Gbps or more, significantly reducing data transfer times and enhancing application performance.
Configuration management of multiple interfaces can often be cumbersome, as it involves setting up network fabrics, managing traffic flows, and ensuring compatibility with existing infrastructure. We have heard from many of our users the need for native features that integrate seamlessly with their Kubernetes environments. With Azure CNI, users have the flexibility to securely configure these high-speed interfaces using native Kubernetes constructs like Custom Resource Definitions (CRDs). Additionally, Azure CNI supports SR-IOV (Single Root I/O Virtualization), which provides dedicated network interfaces for pods and further enhances performance by reducing the CPU overhead associated with networking. We will cover this more in a future blog.
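To make the CRD-based pattern concrete in the meantime, here is a generic illustration using the community NetworkAttachmentDefinition CRD (from the Multus ecosystem) together with the SR-IOV CNI plugin, rather than the Azure CNI-native CRDs that the future post will detail; the network name, subnet, and image are assumptions for this sketch:

apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: sriov-hpc-net               # hypothetical secondary network
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "type": "sriov",
      "ipam": {
        "type": "host-local",
        "subnet": "192.168.100.0/24"
      }
    }
---
apiVersion: v1
kind: Pod
metadata:
  name: hpc-worker                  # hypothetical workload pod
  annotations:
    k8s.v1.cni.cncf.io/networks: sriov-hpc-net   # attach the SR-IOV interface
spec:
  containers:
    - name: worker
      image: your-registry/your-hpc-image:latest # placeholder image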
Conclusion
The demands on networking infrastructure are intensifying as data-intensive workloads become more prevalent in HPC and AI environments. Kubernetes-based compute offers the scalability and flexibility needed to manage these workloads, but it also presents unique networking challenges. Azure CNI, with its eBPF-based architecture, addresses these challenges by providing a high-performance networking dataplane, advanced security, and comprehensive observability. So why wait? Give it a try and let us know on the Azure Kubernetes Service Roadmap (Public) on GitHub how we can evolve our roadmap to help you build the best with Azure. In the next blog, we will focus on how you can extend your security controls from Layer 4 to Layer 7, along with configuration simplifications. So, stay tuned!