Ops4AI Enables High-Performance AI Data Centres to Accelerate Value Realisation While Reducing O&M Costs and Outages

Industry News

2024-11-06

In January 2024, Juniper Networks launched its AI-Native Networking Platform, which leverages the right data with the right infrastructure to deliver the right responses for the best end-user and operator experience. By using AI to simplify network operations (AI for Networking) and using AI-optimised Ethernet fabrics to improve AI workload and GPU performance (Networking for AI), Juniper is delivering on its ongoing commitment to provide customers with an experience-first network.

With a long history of delivering high-performance, secure data centre infrastructure, including QFX switches, PTX routers and SRX firewalls, Juniper Networks is proud to announce an extension to our AI-Native networking architecture that enables customers to build end-to-end AI data centres with multi-vendor operations. The new Ops4AI solution offers impactful enhancements that deliver even greater value to customers.

Ops4AI includes a unique combination of the following Juniper components:

  • Data centre AIOps built on the Marvis Virtual Network Assistant;
  • Intent-based automation through Juniper Apstra multi-vendor data centre fabric management;
  • AI-optimised Ethernet features, including RoCEv2 for IPv4/v6, congestion management, efficient load balancing and telemetry.

By integrating these components, Ops4AI enables high-performance AI data centres to accelerate value realisation while reducing O&M costs and streamlining processes. The solution is now even better with the addition of several enhancements: a new multi-vendor Juniper Ops4AI Lab, open to customers for testing open-source and private AI models and workloads; Juniper Validated Designs featuring technologies from Juniper, Nvidia, Broadcom, Intel, Weka and other partners to provide assurance for Networking for AI; and Junos software and Apstra enhancements for AI-optimised data centre networks, which are the main focus of this blog post.

Let's take a look at the new enhancements introduced in Junos® software and Juniper Apstra. These features include:

AI Fabric Auto-Tuning

Remote Direct Memory Access (RDMA) for GPUs drives significant traffic in AI networks. Despite the use of various congestion-avoidance techniques, such as load balancing, congestion still occurs (for example, multiple GPUs transferring data to the same GPU can congest the last-hop switch).

When this happens, customers can apply congestion-control techniques such as Data Center Quantized Congestion Notification (DCQCN), which combines Explicit Congestion Notification (ECN) and Priority-based Flow Control (PFC). For optimal performance, parameter settings must be calculated and configured for every queue on every port across all switches, and doing so manually across thousands of queues is difficult and cumbersome.

To solve this challenge, Apstra periodically collects telemetry data from every queue on every port. This telemetry information is used to calculate the optimal ECN and PFC parameter settings for each queue on each port. Using closed-loop automation, optimal settings can be configured on all switches in the network.
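To make the closed loop concrete, here is a minimal Python sketch of telemetry-driven ECN tuning, under stated assumptions: the telemetry feed, the configuration push and the threshold heuristic are all invented stand-ins for this example, not Apstra's actual APIs or algorithm.

```python
# Minimal sketch of a closed-loop ECN tuning cycle. Everything here is
# illustrative: fetch_queue_telemetry() and push_queue_config() are
# invented stand-ins, not real Apstra or Junos interfaces.
from dataclasses import dataclass

@dataclass(frozen=True)
class QueueStats:
    switch: str
    port: str
    queue: int
    avg_depth_kb: float    # moving average of queue depth
    pfc_pauses: int        # PFC pause frames seen in the interval

def fetch_queue_telemetry() -> list[QueueStats]:
    """Stand-in for the fabric manager's per-queue telemetry feed."""
    return [
        QueueStats("leaf-1", "et-0/0/1", 3, avg_depth_kb=1400.0, pfc_pauses=7),
        QueueStats("leaf-1", "et-0/0/2", 3, avg_depth_kb=80.0, pfc_pauses=0),
    ]

def push_queue_config(switch, port, queue, ecn_min_kb, ecn_max_kb):
    """Stand-in for pushing the computed settings to a switch."""
    print(f"{switch} {port} queue {queue}: ECN min={ecn_min_kb} KB, max={ecn_max_kb} KB")

def compute_ecn_thresholds(s: QueueStats) -> dict:
    """Illustrative heuristic: mark earlier on queues that run deep or
    trigger PFC, so RoCEv2 senders back off before lossless flow
    control kicks in."""
    base_min, base_max = 150, 1500  # KB, example defaults only
    if s.pfc_pauses > 0 or s.avg_depth_kb > 0.8 * base_max:
        return {"ecn_min_kb": base_min // 2, "ecn_max_kb": base_max // 2}
    return {"ecn_min_kb": base_min, "ecn_max_kb": base_max}

def run_cycle(applied: dict) -> None:
    """One pass of the loop: poll, compute, push only what changed."""
    for s in fetch_queue_telemetry():
        key = (s.switch, s.port, s.queue)
        desired = compute_ecn_thresholds(s)
        if applied.get(key) != desired:
            push_queue_config(*key, **desired)
            applied[key] = desired

if __name__ == "__main__":
    run_cycle({})
```

In a real deployment this cycle would run continuously, and pushing only the settings that changed keeps the configuration churn on the switches low.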

This solution delivers optimal congestion-control settings that significantly simplify operations and maintenance, reduce latency and shorten job completion time (JCT). Our customers have invested heavily in their AI infrastructure, and to help them get the most from that investment we offer these features free of charge in Juniper Apstra. Watch the latest Cloud Field Day demo to learn more about how they work; we have also uploaded the application to GitHub.

Global Load Balancing

AI network traffic has unique characteristics. It is dominated by RDMA traffic generated by the GPUs: a relatively small number of individually very large, bandwidth-hungry flows (often referred to as 'elephant flows'). Static load balancing based on 5-tuple hashing is therefore ineffective: multiple elephant flows can hash onto the same link and cause congestion. This leads to longer JCTs, which can be very disruptive given the size of GPU investments.
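To see why, consider this small Python demonstration. The addresses and link names are invented (4791 is the standard RoCEv2 UDP port): with only a handful of large flows, hashing the 5-tuple modulo the number of uplinks can easily place several of them on the same link, regardless of how loaded that link already is.

```python
# Illustrative demo of static 5-tuple hashing. With few elephant flows,
# collisions on the same uplink are likely; the hash never considers
# flow size or link load.
import hashlib

UPLINKS = ["uplink-0", "uplink-1", "uplink-2", "uplink-3"]

def pick_uplink(src_ip, dst_ip, src_port, dst_port, proto="udp"):
    """Static ECMP-style choice: hash the 5-tuple, mod the link count.
    The same flow always maps to the same link."""
    five_tuple = f"{src_ip},{dst_ip},{src_port},{dst_port},{proto}"
    digest = hashlib.sha256(five_tuple.encode()).digest()
    return UPLINKS[int.from_bytes(digest[:4], "big") % len(UPLINKS)]

# Four invented RDMA elephant flows converging on the same destination;
# with so few flows, the hash can easily map several to one uplink.
flows = [
    ("10.0.0.1", "10.0.1.9", 49152, 4791),
    ("10.0.0.2", "10.0.1.9", 49153, 4791),
    ("10.0.0.3", "10.0.1.9", 49154, 4791),
    ("10.0.0.4", "10.0.1.9", 49155, 4791),
]
for f in flows:
    print(f[0], "->", pick_uplink(*f))
```

With thousands of small flows, such collisions average out; with a few elephant flows, a single collision can saturate a link.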

A first step toward solving this problem is dynamic load balancing (DLB), which takes into account the state of the uplinks on the local switch.

DLB can significantly increase fabric bandwidth utilisation compared to traditional static load balancing. However, it has limitations: it only tracks the quality of local links, without knowing the quality of the complete path from the ingress node to the egress node.

Suppose we have a Clos topology in which server 1 and server 2 each send a flow, flow-1 and flow-2 respectively. With DLB, leaf-1 only knows the utilisation of its own links and makes decisions based solely on its local quality table, so a link can look perfect locally while the path behind it is congested. With global load balancing (GLB), the quality of the entire path is known, including congestion occurring at the spine-to-leaf hops.
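The toy Python comparison below illustrates the difference. The link-quality numbers are invented and real DLB/GLB decisions happen in the switch hardware, but it shows how a purely local view can miss congestion deeper in the fabric.

```python
# Toy comparison of DLB vs GLB path selection in a two-tier Clos fabric.
# All link-quality values are invented for illustration (lower = worse).

link_quality = {
    ("leaf-1", "to-spine-1"): 0.9,   # both local uplinks look healthy...
    ("leaf-1", "to-spine-2"): 0.9,
    ("spine-1", "to-leaf-2"): 0.2,   # ...but spine-1's downlink is congested
    ("spine-2", "to-leaf-2"): 0.8,
}

paths = {
    "via-spine-1": [("leaf-1", "to-spine-1"), ("spine-1", "to-leaf-2")],
    "via-spine-2": [("leaf-1", "to-spine-2"), ("spine-2", "to-leaf-2")],
}

def dlb_choice():
    """DLB sees only leaf-1's local uplinks: both look equally good."""
    local = {p: link_quality[hops[0]] for p, hops in paths.items()}
    return max(local, key=local.get), local

def glb_choice():
    """GLB scores the whole path: the congested spine downlink counts."""
    end_to_end = {p: min(link_quality[h] for h in hops)
                  for p, hops in paths.items()}
    return max(end_to_end, key=end_to_end.get), end_to_end

print("DLB:", dlb_choice())   # cannot distinguish the two paths
print("GLB:", glb_choice())   # avoids the path through spine-1
```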

This is similar to Google Maps, which selects a route based on an end-to-end view of traffic conditions.

By choosing the best end-to-end network path, GLB provides lower latency, higher network utilisation and faster JCTs, which means better AI workload performance and more efficient use of expensive GPUs.

End-to-End Visibility from the Network to the SmartNIC

Today, administrators can discover where congestion is occurring only by looking at the network switches; they cannot see which endpoints (or, in an AI data centre, which GPUs) are affected by it. This makes identifying and resolving performance issues difficult. In an environment running multiple training jobs, switch telemetry alone cannot reveal which jobs are being slowed by congestion; the NIC RoCE v2 statistics on every server would have to be checked by hand, which is highly impractical.

To solve this problem, the rich RoCE v2 streaming telemetry from the AI servers' SmartNICs can be integrated into Juniper Apstra and correlated with the existing network switch telemetry, dramatically improving the ability to observe and debug workloads when performance issues occur.
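As a rough sketch of what such a correlation might look like, assuming an invented data model (the record layouts, counters and threshold below are not Apstra's), the Python below joins congested switch ports to the NIC counters, and hence the training jobs, behind them.

```python
# Hedged sketch: correlate switch congestion telemetry with SmartNIC
# RoCEv2 counters to find the training jobs congestion is hurting.
# All records, field names and thresholds are invented for illustration.
from dataclasses import dataclass

@dataclass
class SwitchEvent:
    switch: str
    port: str
    attached_server: str   # the AI server hanging off this port
    ecn_marks: int         # ECN marks seen on the port's queues

@dataclass
class NicStats:
    server: str
    job: str               # training job using this NIC
    cnp_received: int      # RoCEv2 congestion notification packets
    out_of_order: int

switch_events = [
    SwitchEvent("leaf-2", "et-0/0/5", "gpu-server-3", ecn_marks=12000),
]
nic_stats = [
    NicStats("gpu-server-3", "llm-pretrain-a", cnp_received=11800, out_of_order=40),
    NicStats("gpu-server-7", "vision-finetune", cnp_received=3, out_of_order=0),
]

def affected_jobs(events, stats, cnp_threshold=1000):
    """Join congested switch ports to the NICs (and jobs) behind them."""
    hot_servers = {e.attached_server for e in events if e.ecn_marks > 0}
    return [s.job for s in stats
            if s.server in hot_servers and s.cnp_received > cnp_threshold]

print(affected_jobs(switch_events, nic_stats))  # -> ['llm-pretrain-a']
```

The key design point is the join itself: neither data source alone can answer "which jobs are slow because of this congested port?".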

This correlation provides a more comprehensive view of the network to better understand the relationship between AI servers and network behaviour. Real-time data provides insight into network performance, traffic patterns, potential congestion points and affected endpoints, helping to identify performance bottlenecks and anomalies.

This capability enhances network observability, simplifies the debugging of performance issues, and helps improve overall network performance through closed-loop actions. For example, monitoring out-of-order packets on the SmartNIC can help tune the parameters of the dynamic load-balancing feature on the switch. As a result, end-to-end visibility helps users run their AI infrastructure at peak performance.
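A tiny illustrative rule for such a closed loop might look like the following; the mode names and threshold are assumptions made for the sketch, not real Junos knobs.

```python
# Illustrative closed-loop rule: if a SmartNIC reports a rising rate of
# out-of-order RoCEv2 packets, a load-balancing mode that sprays packets
# too aggressively may be the cause, so fall back to a stickier mode.
# "per-packet" / "flowlet" and the 1% threshold are assumed values.
def recommend_lb_mode(out_of_order_rate: float, current_mode: str) -> str:
    if current_mode == "per-packet" and out_of_order_rate > 0.01:
        return "flowlet"  # stickier mode: less reordering, still adaptive
    return current_mode

print(recommend_lb_mode(0.05, "per-packet"))  # -> flowlet
```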