RTGWG Working Group                                              P. Huo
Internet Draft                                                   G. Chen
Intended status: Informational                                 ByteDance
Expires: December 12, 2024                                        C. Lin
                                                    New H3C Technologies
                                                           June 14, 2024


    Gap Analysis, Problem Statement, and Requirements in AI Networks
                 draft-hcl-rtgwg-ai-network-problem-00

Abstract

   This document provides a gap analysis of AI networks, describes the
   fundamental problems, and defines the requirements for technical
   improvements.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on December 12, 2024.

Copyright Notice

   Copyright (c) 2024 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

hcl, et al.             Expires December 12, 2024               [Page 1]

Internet-Draft  Gap Analysis, Problem Statement, and Requirements In AI networks  June 2024

Table of Contents

   1. Introduction...................................................3
      1.1. Requirements Language.....................................4
      1.2. Terminology...............................................4
   2. Existing Mechanisms............................................4
      2.1. Load Balancing............................................4
      2.2. Congestion Control........................................7
      2.3. Network Reliability.......................................8
   3. Gap Analysis...................................................9
      3.1. Gap Analysis of Load Balancing............................9
      3.2. Gap Analysis of Congestion Control.......................10
      3.3. Gap Analysis of Fast Failover............................10
   4. Problem Statement.............................................11
   5. Requirements for AI Network Mechanisms........................11
   6. Security Considerations.......................................12
   7. IANA Considerations...........................................12
   8. References....................................................12
      8.1. Normative References.....................................12
      8.2. Informative References...................................12
   Authors' Addresses...............................................13

1. Introduction

   Artificial Intelligence (AI) is a discipline and technology that
   studies how to enable machines to imitate and perform human
   intelligent activities.  It involves simulating human thinking and
   decision-making processes, as well as analyzing and interpreting
   large amounts of data, allowing computer systems to learn, reason,
   judge, and predict automatically.  AI has achieved significant
   breakthroughs in fields including machine learning, deep learning,
   natural language processing, and computer vision.
   AI has a wide range of applications, covering areas such as
   healthcare, financial services, transportation, smart manufacturing,
   social media, and many more.  As AI continues to advance, it will
   bring more convenience and intelligent solutions to people's lives
   and work.

   The AI training network is a critical component in the field of
   artificial intelligence: a computer network system specifically
   designed for training and optimizing AI models.  Using large-scale
   datasets and optimization algorithms, AI training networks
   continuously drive the learning and evolution of AI models so that
   they adapt to changing environments and demands.  Training networks
   provide strong support and foundations for technologies such as
   deep learning, machine learning, and neural networks, laying a
   solid foundation for the progress of AI technology and promoting
   its widespread use across industries.

   As AI develops, the model parameters involved in training are
   becoming increasingly large.  To meet the demands of large-scale AI
   training, AI training networks typically adopt a distributed
   cluster approach, which brings forth the following new
   requirements:

   a. Ultra-high bandwidth: training large models generates a massive
      amount of communication data, which imposes higher bandwidth
      requirements on the network.

   b. Stability: because large models train for a long time, any
      failure during the training process can result in prolonged
      downtime, significantly affecting the efficiency of AI training.
      It is therefore necessary to recover from failures quickly and
      to minimize their impact on AI training efficiency.

   c. Low latency: large-scale AI training employs parallel
      distributed computing across multiple GPU nodes, and each
      computation sub-process must wait for all participating GPUs to
      complete.  Higher network latency means a lower proportion of
      time spent on GPU computing.  Since network congestion is a
      major cause of latency, minimizing latency is crucial in AI
      training networks.

   Regarding traffic characteristics, compared to traditional network
   communication traffic, the traffic of AI training for large models
   has the following features:

   a. Fewer flows: traditional networks often carry a large number of
      small flows, whereas AI training of large models produces a
      smaller number of flows, each with a substantial load.

   b. Bursty traffic: traditional networks have a more balanced
      overall traffic distribution due to the predominance of small
      flows.  In AI training for large models, however, individual
      flows carry significant loads, resulting in bursts of
      high-intensity traffic.

1.1. Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY",
   and "OPTIONAL" in this document are to be interpreted as described
   in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in
   all capitals, as shown here.

1.2. Terminology

   TBD

2. Existing Mechanisms

   As described above, the requirements of AI training for large
   models are primarily reflected in bandwidth, stability, and low
   latency.  The following sections illustrate the gaps between these
   requirements and the actual capabilities of today's networks.

2.1. Load Balancing

   The load balancing method in common use today is the N-tuple hash
   algorithm, which forwards traffic flow by flow.
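   As a rough sketch of this per-flow selection (a hypothetical Python
   model; the MD5-based hash and the addresses are illustrative
   assumptions, not any particular switch ASIC's hash function):

```python
import hashlib

def ecmp_link(flow_tuple, num_links):
    """Pick an uplink for a flow by hashing its N-tuple; every packet
    of the flow then maps to the same link (per-flow forwarding)."""
    key = "|".join(str(field) for field in flow_tuple).encode()
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % num_links

# Two hypothetical elephant flows (H1->H5 and H2->H6).  With so few
# flows, nothing prevents both from hashing onto the same one of two
# leaf-to-spine uplinks, however loaded that uplink already is.
flow1 = ("10.0.0.1", "10.0.0.5", 4791, 4791, "UDP")
flow2 = ("10.0.0.2", "10.0.0.6", 4791, 4791, "UDP")
print(ecmp_link(flow1, 2), ecmp_link(flow2, 2))
```

   Because the link choice depends only on header fields, two large
   flows that collide stay collided for their entire lifetime.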
   However, because traffic in AI training networks consists of only a
   few flows, a per-flow hash has difficulty distributing the load
   evenly.

   The diagram below illustrates a typical AI training network using a
   Spine-Leaf architecture.

      +---------+             +---------+
      |   R11   |             |   R12   |
      +-#--#--#-+             +-#--#--#-+
        |  |  |                 |  |  |
        |  |  +-----------------)--)--)-------+
        |  |                    |  |  |       |
        |  |   +----------------+  |  |       |
        |  |   |                   |  |       |
        |  +---)---------+         |  |       |
        |      |         |         |  |       |
        |      |         |      +--+  +-------)------+
        |      |         |      |             |      |
      +-#------#+      +-#------#+          +-#------#+
      |   R21   |      |   R22   |          |   R23   |
      +-#------#+      +-#------#+          +-#------#+
        |      |         |      |             |      |
      +-#+   +-#+      +-#+   +-#+          +-#+   +-#+
      |H1|   |H2|      |H3|   |H4|          |H5|   |H6|
      +--+   +--+      +--+   +--+          +--+   +--+

                   Figure 1: AI network diagram

   In AI training networks, congestion is generally classified into
   three categories.

   The first type is congestion from the source leaf (S-Leaf) to the
   Spine.  For instance, suppose traffic flow 1 runs from node H1 to
   node H5 and traffic flow 2 runs from node H2 to node H6.  Based on
   network bandwidth calculations, this should not cause congestion.
   However, if an inappropriate load balancing algorithm selects the
   same link from S-Leaf to Spine for both flows, that link becomes
   congested.
      +---------+             +---------+
      |   R11   |             |   R12   |
      +-#--#--#-+             +-#--#--#-+
        |  |  |                 |  |  |
        |  |  +-----------------)--)--)-------+
        x  |                    |  |  |       |
        |  |   +----------------+  |  |       |
        |  |   |                   |  |       |
        |  +---)---------+         |  |       |
        |      |         |         |  |       |
        |      |         |      +--+  +-------)------+
        |      |         |      |             |      |
      +-#------#+      +-#------#+          +-#------#+
      |   R21   |      |   R22   |          |   R23   |
      +-#------#+      +-#------#+          +-#------#+
        |      |         |      |             |      |
      +-#+   +-#+      +-#+   +-#+          +-#+   +-#+
      |H1|   |H2|      |H3|   |H4|          |H5|   |H6|
      +--+   +--+      +--+   +--+          +--+   +--+

              Figure 2: Congestion from S-Leaf to Spine

   The second type is congestion on a link from the Spine to the
   destination leaf (D-Leaf).  For example, suppose traffic flow 1
   runs from H1 to H5, flow 2 from H2 to H6, flow 3 from H3 to H5, and
   flow 4 from H4 to H6.  Based on network bandwidth calculations,
   this should not cause congestion.  However, if an inappropriate
   load balancing algorithm directs all of this traffic to the same
   Spine, the link from that Spine to the D-Leaf becomes congested.

      +---------+             +---------+
      |   R11   |             |   R12   |
      +-#--#--#-+             +-#--#--#-+
        |  |  |                 |  |  |
        |  |  +-----------------)--)--)-------+
        |  |                    |  |  |       |
        |  |   +----------------+  |  |       |
        |  |   |                   |  |       |
        |  +---)---------+         |  |       |
        |      |         |         |  |       |
        |      |         |      +--+  +-------)------+
        |      |         |      |             x      |
      +-#------#+      +-#------#+          +-#------#+
      |   R21   |      |   R22   |          |   R23   |
      +-#------#+      +-#------#+          +-#------#+
        |      |         |      |             |      |
      +-#+   +-#+      +-#+   +-#+          +-#+   +-#+
      |H1|   |H2|      |H3|   |H4|          |H5|   |H6|
      +--+   +--+      +--+   +--+          +--+   +--+

              Figure 3: Congestion from Spine to D-Leaf

   The third type is congestion on the network edge exit links.  For
   example, suppose traffic flow 1 runs from H1 to H5, flow 2 from H2
   to H5, flow 3 from H3 to H6, and flow 4 from H4 to H6.
   Although there is no congestion within the network, uneven traffic
   planning lets flow 1 and flow 2 occupy a large amount of bandwidth
   while flow 3 and flow 4 occupy little, resulting in congestion on
   the exit link from R23 to H5.

   The above three scenarios illustrate that a flow-based load
   balancing strategy can easily lead to uneven load distribution and
   hence to network congestion.  Packet-based load balancing
   techniques can alleviate the unevenness to some extent, but because
   packets of the same flow travel different paths, they can arrive
   out of order, requiring the network to handle packet reordering.
   The inherent drawback of existing load balancing technologies is
   that they cannot perceive the actual utilization and congestion
   status of the network, which leads to frequent congestion.
   Consequently, AI training networks require a more fine-grained load
   balancing capability to address these issues.

2.2. Congestion Control

   The mainstream congestion control methods in today's networks
   include ECN (Explicit Congestion Notification) and PFC (Priority-
   based Flow Control).  While similar in principle, these two
   techniques essentially represent a form of unidirectional
   congestion control: when the receive queue at the receiving end
   reaches a threshold, the sending end is notified to reduce its
   transmission speed, thereby preventing congestion.  ECN monitors
   the state of the packet queue in the outgoing direction, while PFC
   monitors the state of the packet queue in the incoming direction.

   One issue with this method is the setting of the queue threshold.
   If the threshold is set too low, it can limit packet throughput and
   fail to make effective use of the available bandwidth.
   On the other hand, setting the threshold too high may not
   effectively prevent network congestion.

   Another issue is the extent to which the sending end reduces its
   transmission speed when it receives a congestion notification.  Too
   large a reduction results in suboptimal network utilization, while
   too small a reduction may not sufficiently relieve the congestion.

   Furthermore, when the upstream sender is signaled to reduce its
   transmission speed, the adjustment affects all traffic rather than
   providing targeted control of individual flows.  In addition,
   congestion can only propagate upstream gradually, which makes the
   adjustment inefficient.  Therefore, AI training networks require
   global congestion control mechanisms that can manage congestion
   effectively.

2.3. Network Reliability

   The following methods respond to local link faults and perform
   switchover:

   Equal-Cost Multipath (ECMP): ECMP allows fast fault switching by
   distributing traffic across multiple equal-cost paths.  In the
   event of a failure on one path, traffic can be quickly redirected
   to an alternate path.

   Fast Reroute (FRR): FRR is a mechanism that enables rapid switching
   to precomputed backup paths upon failure detection.  It reduces the
   convergence time by bypassing the traditional control-plane route
   convergence process.

   The following method responds to remote link faults and performs
   switchover:

   BGP PIC (Prefix Independent Convergence): BGP PIC is a technique
   for fast switchover to precomputed alternate forwarding state
   during network failures.

3. Gap Analysis

   The training of large-scale AI models forms the foundation of
   artificial intelligence development.  In comparison to small
   models, large models place stronger demands on large-scale
   distributed parallel training.
   On one hand, this is due to the sheer size of the models: limited
   by today's GPU memory, a single model must be partitioned across
   numerous GPUs for storage.  On the other hand, training a larger
   number of parameters requires more computational power, mandating a
   larger scale of GPUs for acceleration.  Consequently, a significant
   increase in the number of GPUs is needed.

   Currently, training scale is generally denoted by the number of GPU
   cards employed for a task.  For instance, tasks involving fewer
   than a hundred cards are called small-scale, tasks involving a
   hundred to a thousand cards medium-scale, and tasks involving over
   a thousand cards large-scale.  Training runs using over ten
   thousand GPU cards are considered extremely large scale.

   The large scale of AI networks gives rise to numerous challenges.

3.1. Gap Analysis of Load Balancing

   As mentioned earlier, current load balancing technologies primarily
   rely on per-flow hash-based forwarding or per-packet forwarding.
   Almost all network transports face an inherent constraint: packet
   reordering within the network must be avoided, because reordering
   triggers retransmission logic and reduces speed at the receiving
   end.  Consequently, when switches forward packets, packets from the
   same connection are directed along a single path, and the selection
   of that path relies on a hash algorithm.

   It is well known that hash algorithms inevitably encounter
   collisions.  If the hash distribution is uneven and the majority of
   the traffic chooses the same link, congestion occurs on that link
   while other links remain underutilized.  This issue is quite common
   in large-scale training scenarios.  Furthermore, due to the
   characteristics of traffic during AI training, bursts of
   high-bandwidth traffic often occur between the same connections.
   Consequently, hash-based path selection for these traffic bursts
   can lead to severe hash conflicts, resulting in network congestion.

   AI training networks require a new form of load balancing that can
   mitigate the uneven loads caused by traffic bursts and achieve load
   balancing as fully as possible, for instance by breaking the
   assumption that packets from a single connection must follow a
   single path and allowing out-of-order packet reception, thus fully
   exploiting the network's multipath forwarding capabilities.

3.2. Gap Analysis of Congestion Control

   When a network experiences congestion, traffic must be adjusted
   promptly: directed to paths with more available bandwidth while the
   traffic on congested paths is reduced, so that the idle capacity of
   underused paths is fully utilized.

   Current congestion control methods mainly involve local congestion
   detection and adjustment at the congestion point; they do not
   achieve global congestion control.  While these methods have some
   effect, their efficiency is low in certain situations, because the
   congestion state must propagate upstream, hop by hop, until it
   reaches a point several hops away before congestion control takes
   effect and relieves the current congestion.  Furthermore, this type
   of congestion control affects the forwarding of all traffic and
   cannot perform targeted control of individual flows.

   For AI training networks, a new congestion control routing protocol
   is needed: when congestion is detected locally, it should be
   communicated swiftly, allowing for global congestion control.
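   As a back-of-envelope illustration of why hop-by-hop propagation is
   slow (all timing figures below are hypothetical assumptions, chosen
   only to show the shape of the difference, not measurements of any
   real device):

```python
# Hypothetical figures, for illustration only.
HOP_DELAY_US = 5        # per-hop propagation + processing
QUEUE_FILL_US = 50      # time for each upstream queue to hit its threshold
HOPS_TO_SOURCE = 3      # congestion point is three hops from the sender

# Hop-by-hop backpressure (PFC-like): the pause signal cascades one
# queue at a time, each upstream queue filling before the next pauses.
hop_by_hop_us = HOPS_TO_SOURCE * (HOP_DELAY_US + QUEUE_FILL_US)

# Direct-to-source notification: the congested node signals the sender
# immediately, paying only per-hop propagation delay.
direct_us = HOPS_TO_SOURCE * HOP_DELAY_US

print(f"hop-by-hop: {hop_by_hop_us} us, direct to source: {direct_us} us")
# -> hop-by-hop: 165 us, direct to source: 15 us
```

   Whatever the actual numbers, the queue-fill term is paid once per
   hop under cascaded backpressure but not at all under direct
   notification, which is the efficiency gap argued above.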
   Such a global, end-to-end congestion control mechanism is more
   efficient than local congestion control.

3.3. Gap Analysis of Fast Failover

   Keeping tasks uninterrupted for long periods is crucial for AI
   training of large models.  However, hardware is prone to failures,
   and as the scale of a training network increases, the likelihood of
   network failures rises with the growing number of switches, network
   interface cards, and GPUs.  Therefore, AI training networks require
   rapid fault recovery.  For instance, if a link in the network
   fails, packets transmitted over that link are lost; the duration of
   packet loss must be kept shorter than the timeout period typically
   set by communication libraries, or the task is interrupted.  For AI
   training networks, fast fault recovery within a millisecond range
   is generally required to keep training uninterrupted.

   The current network fault switchover time comprises fault
   detection, notification, and switchover.  For local fault scenarios
   with a backup link, fast switchover can achieve recovery at the
   millisecond level.  For remote fault scenarios, however, the only
   option currently available is software convergence through routing
   protocols, with recovery times in the range of seconds, which does
   not meet the demand for rapid switchover in AI training networks.
   AI training networks therefore need a mechanism that handles remote
   fault points through rapid fault detection, notification, and
   response, achieving fast fault recovery at a global level.

4. Problem Statement

   The main issues in current AI training scenarios are the following:

   Load Imbalance: There is a lack of appropriate load balancing
   mechanisms to handle hash imbalance and bursty traffic in AI
   training networks.
   Improved load balancing is needed to fully utilize the numerous
   links in AI networks, including optimal and non-optimal paths.

   Reliability: There is a lack of fast handling mechanisms for remote
   network faults.  A new global fast fault handling method is
   required, including fault detection, fault propagation, and fast
   routing protocol processing.

   Fast Failover: The current fast failover mechanisms primarily
   respond to local faults and cannot achieve a global fast response.
   Additionally, the performance of these failover mechanisms does not
   meet the requirements of AI training networks.

   These issues need to be addressed to enhance the efficiency and
   reliability of AI training networks.

5. Requirements for AI Network Mechanisms

   For existing AI training networks, the new requirements include:

   *  New load balancing mechanisms: capable of performing load
      balancing on a per-packet basis, to avoid the imbalance caused
      by the relatively small number of flows and the bursty traffic
      in AI training networks.

   *  New congestion control mechanisms: to avoid the inflexibility of
      current congestion control mechanisms and achieve global,
      end-to-end congestion control.

   *  Fast failover mechanisms: new fast failover mechanisms that can
      quickly detect faults, rapidly notify remote endpoints, and
      enable rapid global fault handling.

6. Security Considerations

   TBD.

7. IANA Considerations

   This document does not request any IANA allocations.

8. References

8.1. Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017, <https://www.rfc-editor.org/info/rfc8174>.

8.2. Informative References

   TBD
Authors' Addresses

   PengFei Huo
   ByteDance
   China

   Email: huopengfei@bytedance.com

   Gang Chen
   ByteDance
   China

   Email: chengang.gary@bytedance.com

   Changwang Lin
   New H3C Technologies
   China

   Email: linchangwang.04414@h3c.com