The Elephant Flows Through the Data Center Network

Techniques for optimizing data center networks to support AI workloads are not intuitive. You first need a baseline understanding of how AI workloads behave in the data center and how that’s different from non-AI or traditional workloads.

In this blog, we’ll explore how AI workloads behave in the data center and which networking features support this use case. We’ll start with some axiomatic one-liners, followed by in-depth explanations of more complex processes—graphics processing unit (GPU) clustering, synchronicity, tolerance, subscription architectures, and knock-on effects. Lastly, we’ll describe features that data center switching solutions can offer to support organizations that are developing and deploying AI applications.

AI Traffic Patterns in Data Center Networks

The Basics

To form a baseline for understanding AI traffic patterns in data center networks, let’s consider the following postulates:

  • The most computationally intensive (and implicitly, network-heavy) phase of AI applications is the training phase. This is where data center network optimization must focus.
  • AI data centers are dedicated. You don’t run other applications on the same infrastructure.
  • During the training phase, all traffic is east-west.
  • Leaf-spine is still the most suitable architecture.

More Complex Processes

GPU Clustering
In most cases today, AI is trained on clusters of GPUs. This breaks large data sets up across GPU servers, each handling a subset. Once a cluster finishes processing a batch of data, it sends all the output in a single burst to the next cluster. These large bursts of data are dubbed “elephant flows”: while one is in flight, network utilization nears 100%. These fabrics of GPU clusters connect to the network with very high bandwidth network interface controllers (NICs), ranging from 200 Gbps up to 800 Gbps.

Synchronicity
Non-AI workloads are typically asynchronous: end users make database queries or web server requests, and each is fulfilled on demand. AI training workloads are synchronous, which means that the clusters of GPUs must receive all the data before they can start their own job. Outputs from previous steps, such as gradients and model parameters, become vital inputs to subsequent phases.
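The synchronous step can be sketched with a toy Python model (not a real training framework; the worker count and shard values are invented for illustration). A barrier blocks every worker until all partial gradients exist, mimicking the all-reduce pattern described above:

```python
import threading

# Toy illustration: each "GPU" computes a partial gradient for its data
# shard, and a barrier enforces the synchronous step -- no worker proceeds
# until every peer's output has arrived.
NUM_WORKERS = 4
partial_grads = [None] * NUM_WORKERS
averaged = [None] * NUM_WORKERS
barrier = threading.Barrier(NUM_WORKERS)

def worker(rank, data_shard):
    # Each worker handles its own subset of the batch.
    partial_grads[rank] = sum(data_shard) / len(data_shard)
    barrier.wait()  # synchronous step: block until all partials exist
    # Every worker now computes the same aggregated value (an "all-reduce").
    averaged[rank] = sum(partial_grads) / NUM_WORKERS

shards = [[1, 2], [3, 4], [5, 6], [7, 8]]
threads = [threading.Thread(target=worker, args=(r, shards[r]))
           for r in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(averaged)  # all workers hold the identical averaged gradient
```

If any one worker's data is delayed, every other worker sits idle at the barrier, which is exactly why the network matters so much here.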

Low Tolerance
Given that GPUs require all data before starting their job, there is no acceptable tolerance for missing data or out-of-order packets. Yet packets are sometimes dropped, and the resulting retransmissions add latency and utilization; packets may also arrive out of order when per-packet load balancing is used.

Oversubscription
For non-AI workloads, networks can be configured with 2:1, 3:1, or 4:1 oversubscription between tiers, working on the assumption that not all connected devices communicate at maximum bandwidth all the time. For AI workloads, each leaf needs a 1:1 ratio between its server-facing capacity and its spine-facing capacity, as we expect nearly 100% utilization.
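The arithmetic behind these ratios is simple; here is a sketch using hypothetical port counts and speeds:

```python
# Hypothetical leaf switch: 32 server-facing ports at 400 Gbps and
# 8 spine-facing uplinks at 800 Gbps.
downlink_gbps = 32 * 400   # 12,800 Gbps toward servers
uplink_gbps = 8 * 800      # 6,400 Gbps toward spines
ratio = downlink_gbps / uplink_gbps
print(f"{ratio:.0f}:1 oversubscription")  # acceptable for general workloads

# For an AI fabric, uplink capacity is sized to match downlink capacity
# (1:1), e.g. 16 uplinks at 800 Gbps against the same 32 x 400 Gbps:
assert 32 * 400 == 16 * 800
```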

Knock-On Effect
Latency, missing packets, or out-of-order packets have a huge knock-on effect on the overall job completion time; stalling one GPU will stall all the subsequent ones. This means that the slowest performing subtask dictates the performance of the whole system.
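A toy calculation (with illustrative numbers) shows why the slowest worker, not the average, sets the step time:

```python
# Sketch: in a synchronous step, job completion time is gated by the
# slowest worker, because every other GPU waits at the barrier.
step_times_ms = [10.1, 10.3, 9.9, 10.2, 42.0]  # one GPU stalled by the network
average = sum(step_times_ms) / len(step_times_ms)
completion = max(step_times_ms)
print(f"average worker: {average:.1f} ms, step completes in: {completion:.1f} ms")
```

One stalled GPU more than doubles the step time here, even though the other four finished in about 10 ms.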

Networking Features that Support AI Workloads

General-purpose advice for supporting AI workloads includes focusing on end-to-end telemetry, higher port speeds, and the scalability of the system. While these are key components for supporting AI workloads, they are just as important for any type of workload.

To minimize tail latency and ensure network performance, data center switching solutions must support new protocols and optimization mechanisms. Some of these include:

RoCE (RDMA over Converged Ethernet) and InfiniBand

Both technologies use remote direct memory access (RDMA), which provides memory-to-memory transfers without involving the processor, cache, or operating system of either host. RoCE supports the RDMA protocol over Ethernet connections, while InfiniBand uses a non-Ethernet-based networking stack.

Congestion Management

Ethernet is a lossy protocol: packets are dropped when queues overflow. To prevent packets from dropping, data center networks can employ congestion management techniques such as:

  • Explicit congestion notification (ECN): a technique whereby routers signal congestion by setting a flag in packet headers when queue thresholds are crossed, rather than dropping packets. This proactively throttles sources before queues overflow and packet loss occurs.
  • Priority flow control (PFC): an enhancement to the Ethernet flow control pause command. While the Ethernet pause mechanism stops all traffic on a link, PFC controls traffic in individual priority queues of an interface, pausing or restarting any queue without interrupting traffic in the others.
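To make the two mechanisms concrete, here is a toy Python model of a single priority queue; the thresholds and queue limit are invented for illustration, not taken from any real switch:

```python
# Toy model: ECN marks packets once queue depth crosses a threshold, and
# PFC pauses the upstream sender for this priority queue as it nears
# overflow -- other priority queues (not modeled here) keep flowing.
ECN_THRESHOLD = 50
PFC_XOFF = 90
QUEUE_LIMIT = 100

class PriorityQueueModel:
    def __init__(self):
        self.depth = 0
        self.paused_upstream = False

    def enqueue(self, packet):
        if self.depth >= QUEUE_LIMIT:
            return "dropped"             # lossy Ethernet behavior
        self.depth += 1
        if self.depth >= PFC_XOFF:
            self.paused_upstream = True  # PFC pause for this priority only
        if self.depth >= ECN_THRESHOLD:
            packet["ecn_ce"] = True      # ECN: mark instead of drop
            return "marked"
        return "forwarded"
```

The ordering matters: ECN signals congestion early, while PFC is the last line of defense before the queue overflows and packets are lost.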

Out-of-Order Packet Handling

Re-sequencing buffers properly order packets that arrive out of sequence before forwarding them to applications.
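A minimal sketch of such a re-sequencing buffer (a simplified model, not any vendor's implementation):

```python
# Hold out-of-order packets in a buffer and release them to the
# application strictly in sequence-number order.
def resequence(packets):
    buffer = {}
    expected = 0
    delivered = []
    for seq, payload in packets:
        buffer[seq] = payload
        while expected in buffer:  # flush every contiguous in-order run
            delivered.append(buffer.pop(expected))
            expected += 1
    return delivered

print(resequence([(0, "a"), (2, "c"), (1, "b"), (3, "d")]))
# -> ['a', 'b', 'c', 'd']
```

Packet 2 is held until packet 1 arrives, then both are released together, so the application never sees a gap.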

Load Balancing

We’ll need to compare different flavors of load balancing:

  • Equal cost multipath (ECMP): A hash of flow header fields pins each flow to a single path, load-balancing entire flows from the first packet to the last rather than individual packets. With a small number of elephant flows, this can result in hash collisions and ingestion bottlenecks.
  • Per-packet ECMP: Per-packet mode hashes each individual packet across all available paths. Packets of the same flow may traverse multiple physical paths, which achieves better link utilization but can reorder packets.
  • Dynamic or adaptive load balancing: This technique factors next-hop path quality into path selection, adjusting routing or switching decisions based on link load, congestion, link failures, or other dynamic variables in the current state of the network.
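The difference between flow-based and per-packet ECMP can be sketched in a few lines of Python. The flow tuple and path count are hypothetical, real switches hash header fields in hardware, and per-packet spraying is modeled here as simple round-robin for clarity:

```python
import hashlib

PATHS = 4

def ecmp_path(flow_tuple):
    # Flow-based ECMP: hash the flow identifier once; every packet of the
    # flow takes the same path, so an elephant flow can saturate one link.
    digest = hashlib.sha256(repr(flow_tuple).encode()).digest()
    return digest[0] % PATHS

def per_packet_path(packet_index):
    # Per-packet spraying (modeled as round-robin): better link
    # utilization, but packets of one flow may be reordered in flight.
    return packet_index % PATHS

flow = ("10.0.0.1", "10.0.0.2", 49152, 4791, "udp")  # hypothetical flow
print([ecmp_path(flow) for _ in range(4)])     # same path every time
print([per_packet_path(i) for i in range(4)])  # sprayed across all paths
```

This is why per-packet schemes are usually paired with the re-sequencing buffers described earlier.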

I recommend this whitepaper from the Ultra Ethernet Consortium as further reading on the topic.

Next Steps

Designing network architectures and features to cater to AI workloads is an emerging discipline. While non-specialized networks are still suitable for AI workloads, optimizing the data center switching process will bring considerable returns on investment, because more, and larger, AI deployments are inevitably on the way.

To learn more, take a look at GigaOm’s data center switching Key Criteria and Radar reports. These reports provide a comprehensive overview of the market, outline the criteria you’ll want to consider in a purchase decision, and evaluate how a number of vendors perform against those decision criteria.

If you’re not yet a GigaOm subscriber, you can access the research using a free trial.