Drop Congestion Notification (DCN): A smarter, faster and synchronised approach to accelerating AI job completion

2025-09-24

Driven by technological advances in machine learning, natural language processing, generative AI, robotics and autonomous systems, artificial intelligence (AI) and high-performance computing (HPC) are experiencing significant growth.

At the heart of these innovations lie large-scale distributed training jobs, whose models typically comprise billions or even trillions of parameters spread across many GPUs. During training, the GPU nodes synchronise by exchanging vast amounts of data and real-time updates over a backend AI Ethernet switch fabric. Packet loss severely disrupts this synchronisation, causing retransmissions or stalled communication. The result is higher latency, longer job completion times (JCT), and inefficient use of costly GPU resources.

Silent packet loss in AI data centre switching fabrics

JCT is a critical metric because modern AI workloads, particularly large-scale training and inference tasks, rely on tight synchronisation across clusters. Even a single lost packet may significantly degrade performance and increase operational costs.

For instance, when switch buffers overflow due to traffic congestion, RoCEv2 packets may be dropped within the AI Ethernet/IP switching fabric. These discarded packets must be retransmitted, which adds latency and interrupts training or inference.

Explicit Congestion Notification (ECN) signals congestion by setting bits in the IP header of packets that are still forwarded, but it says nothing about packets that were discarded. Consequently, the endpoints cannot tell from ECN alone which packets require retransmission.
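For reference, the ECN field is the two least-significant bits of the IPv4 Type of Service byte (or the IPv6 Traffic Class byte), as defined in RFC 3168. The short Python sketch below uses illustrative helper names (ecn_of, mark_congestion) that are not taken from any product. It shows that a Congestion Experienced mark only changes bits on a packet that survived the queue and carries no record of packets that were dropped.

```python
# ECN codepoints from RFC 3168: the low two bits of the
# IPv4 ToS byte / IPv6 Traffic Class byte.
NOT_ECT = 0b00  # sender is not ECN-capable
ECT_1 = 0b01    # ECN-capable transport, codepoint 1
ECT_0 = 0b10    # ECN-capable transport, codepoint 0
CE = 0b11       # Congestion Experienced

ECN_MASK = 0b11


def ecn_of(tos: int) -> int:
    """Extract the 2-bit ECN field from a ToS / Traffic Class byte."""
    return tos & ECN_MASK


def mark_congestion(tos: int) -> int:
    """Set CE on an ECN-capable packet; leave a Not-ECT packet unchanged."""
    if ecn_of(tos) == NOT_ECT:
        return tos  # cannot be marked, only forwarded or dropped
    return tos | CE


if __name__ == "__main__":
    # DSCP 26 is an arbitrary example value, not a recommendation.
    tos = (26 << 2) | ECT_0
    marked = mark_congestion(tos)
    print(f"before: tos={tos:#04x} ecn={ecn_of(tos):02b}")
    print(f"after:  tos={marked:#04x} ecn={ecn_of(marked):02b}")
```

The mark tells the receiver that congestion occurred somewhere on the path; it does not list the sequence numbers of any packets that never arrived.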

Drop Congestion Notification (DCN) Solution

To address this issue, Juniper Networks introduced Drop Congestion Notification (DCN), a new congestion management feature developed for Junos OS(TM) Evolved software version 23.4x100d40 on the QFX5240-OD and QFX5240-QD (64 x 800GbE Ethernet/IP platforms) based on the Tomahawk 5 chip.

When congestion occurs, the switch does not silently drop the affected packets. Instead, it trims their payloads, marks them with DCN, and forwards the trimmed packets to the receiving host via a high-priority queue. Transit switches within the network switching fabric identify trimmed packets bearing DCN markers and also direct them to high-priority queues.
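The switch-side behaviour can be illustrated with a toy Python model. This is only a sketch under stated assumptions: the Packet fields, the dcn flag, and the CongestedSwitch class are hypothetical names invented for illustration, and the real DCN packet format and queue configuration in Junos OS Evolved are not described here.

```python
from collections import deque
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Packet:
    seq: int            # sequence number the receiver can use to identify the packet
    payload: bytes
    dcn: bool = False   # hypothetical marker meaning "payload was trimmed"


class CongestedSwitch:
    """Toy egress port: a bounded data queue plus a high-priority queue."""

    def __init__(self, data_capacity: int) -> None:
        self.data_capacity = data_capacity
        self.data_queue = deque()
        self.priority_queue = deque()

    def enqueue(self, pkt: Packet) -> None:
        if pkt.dcn:
            # Transit behaviour: the packet was trimmed upstream, so it
            # goes straight to the high-priority queue.
            self.priority_queue.append(pkt)
        elif len(self.data_queue) < self.data_capacity:
            self.data_queue.append(pkt)
        else:
            # Buffer is full: rather than dropping, keep the header
            # information, discard the payload, and mark the packet.
            self.priority_queue.append(replace(pkt, payload=b"", dcn=True))

    def drain(self) -> list:
        """Transmit everything queued, high-priority packets first."""
        out = list(self.priority_queue) + list(self.data_queue)
        self.priority_queue.clear()
        self.data_queue.clear()
        return out


if __name__ == "__main__":
    switch = CongestedSwitch(data_capacity=2)
    for seq in range(4):
        switch.enqueue(Packet(seq, b"x" * 8))
    for pkt in switch.drain():
        print(pkt.seq, "trimmed" if pkt.dcn else "full", len(pkt.payload))
```

In this model, packets 2 and 3 arrive when the data queue is full, so they leave the switch first as zero-payload notifications, ahead of the full packets 0 and 1.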

The destination host then processes these trimmed DCN packets, determines exactly which packets were discarded due to congestion, and immediately requests retransmission of the lost packets from the source.

The trimmed packets themselves are not written to the target server's memory. They serve only to identify precisely which packets require selective retransmission. This avoids waiting on the slower default retransmission process, which shortens end-to-end latency and helps jobs complete on time.
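The receiver-side logic described above can be sketched in a few lines of Python. The Packet and Receiver classes and the dcn flag are hypothetical names used for illustration only; an actual NIC implements this in hardware or firmware, and its interfaces are not described in this article.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Packet:
    seq: int
    payload: bytes
    dcn: bool = False   # hypothetical marker for a trimmed packet


@dataclass
class Receiver:
    # Stands in for the target server's memory: seq -> payload.
    memory: dict = field(default_factory=dict)

    def receive(self, packets) -> list:
        """Store full packets; return the sequence numbers to re-request."""
        to_retransmit = []
        for pkt in packets:
            if pkt.dcn:
                # A trimmed packet is never written to memory. It only
                # tells us which sequence number was lost to congestion.
                to_retransmit.append(pkt.seq)
            else:
                self.memory[pkt.seq] = pkt.payload
        return sorted(to_retransmit)


if __name__ == "__main__":
    rx = Receiver()
    arrivals = [
        Packet(2, b"", dcn=True),  # trimmed notification arrives first
        Packet(0, b"a"),
        Packet(1, b"b"),
        Packet(3, b"d"),
    ]
    missing = rx.receive(arrivals)
    print("request retransmission of:", missing)

    # The source resends only the packets that were named.
    rx.receive([Packet(seq, b"c") for seq in missing])
    print("sequence numbers in memory:", sorted(rx.memory))
```

Only the packet named by the trimmed notification is resent, so the receiver does not need to wait for a timeout or ask for a whole window of packets again.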

The diagram below illustrates a simplified topology. When packets enter the first switch under extreme congestion (beyond the ECN threshold), they are not discarded but trimmed before being sent on to the NIC of the target GPU server. Although the first switch performs the trimming, intermediate switches can also recognise the trimmed frames and send them immediately to the output interface via a high-priority queue. When a trimmed packet arrives at the target NIC, the receiving host sends a retransmission request to the source server.

In the QFX5240-OD and QFX5240-QD switches, a dedicated queue handles DCN-related packets independently of the queues that carry regular data packets. This separation lets users manage the latency and bandwidth allocated to DCN packets more effectively.
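To illustrate why a separate queue makes latency and bandwidth tunable, the Python sketch below serves a DCN queue and a data queue in rounds with a per-round budget for each. The scheduling discipline, the schedule function, and the budget parameters are assumptions made for illustration; the article does not specify how the QFX5240 schedules its queues.

```python
from collections import deque


def schedule(dcn_queue, data_queue, dcn_budget: int, data_budget: int) -> list:
    """Return a transmit order for two queues served in rounds.

    Each round sends up to dcn_budget DCN items first, then up to
    data_budget data items. Hypothetical model, not a vendor algorithm.
    """
    if dcn_budget < 1 or data_budget < 1:
        raise ValueError("both budgets must be at least 1")
    dcn, data = deque(dcn_queue), deque(data_queue)
    order = []
    while dcn or data:
        for _ in range(dcn_budget):
            if not dcn:
                break
            order.append(dcn.popleft())
        for _ in range(data_budget):
            if not data:
                break
            order.append(data.popleft())
    return order


if __name__ == "__main__":
    notifications = ["dcn1", "dcn2", "dcn3"]
    data = ["d1", "d2", "d3", "d4"]
    print(schedule(notifications, data, dcn_budget=2, data_budget=1))
```

Raising dcn_budget gets loss notifications to the receiver sooner, while data_budget bounds how long regular traffic can be held back; with a single shared queue neither could be adjusted independently.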

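[Figure: simplified topology showing packet trimming at the first switch, high-priority forwarding through intermediate switches, and the retransmission request from the target NIC to the source server]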

Within an AI Ethernet switching fabric, maintaining consistent performance and synchronised operation is paramount, particularly when workloads scale across distributed GPU clusters. DCN addresses this challenge by providing real-time visibility into packet loss during severe congestion. By alerting endpoints to exactly which packets were lost, DCN enables faster recovery, minimises hidden latency, and helps keep AI JCT low.

Ultimately, DCN bridges the visibility gap between the network switching fabric and AI workloads, making it a foundational capability for building scalable, high-performance AI infrastructure.