Routing Area Working Group                                         C. Li
Internet-Draft                                                     S. Ji
Intended status: Standards Track                           China Telecom
Expires: 24 April 2025                                            K. Zhu
                                                     Huawei Technologies
                                                         21 October 2024


                 Framework of Distributed AIDC Network
           draft-li-rtgwg-distributed-lossless-framework-00

Abstract

   With the rapid development of large language models (LLMs), higher
   requirements are placed on the networking scale of data centers.
   Distributed model training has been proposed to shorten training
   time and to relieve the resource demand on a single data center.
   This document proposes a framework to address the challenge of
   efficient lossless interconnection and reliable data transmission
   between multiple data centers, which can be connected through the
   network to form a larger cluster.  The document further examines the
   key technologies and application scenarios of such a distributed
   AIDC network.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on 24 April 2025.

Copyright Notice

   Copyright (c) 2024 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Revised BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Revised BSD License.

Table of Contents

   1.  Introduction
   2.  Terminology
   3.  Scenarios
     3.1.  Scenario 1: Distributed Model Training
     3.2.  Scenario 2: Distributed Storage and Computation
   4.  Framework
     4.1.  Overview
     4.2.  Technical Requirements
     4.3.  Key Mechanisms
       4.3.1.  Collective Communication Mechanism in Heterogeneous
               Networks
       4.3.2.  Global Load Balancing
       4.3.3.  Precise Flow-Control Technology
       4.3.4.  Packet Loss Detection
   5.  Conclusion
   6.  Security Considerations
   7.  IANA Considerations
   8.  References
     8.1.  Normative References
     8.2.  Informative References
   Authors' Addresses
1.  Introduction

   In the realm of artificial intelligence (AI), the computational
   demands of training large models have grown exponentially, posing
   significant challenges in terms of hardware resources, data storage,
   and training time.  For example, GPT-4 is reported to have over 1.8
   trillion parameters and to have been trained on some 20,000 NVIDIA
   A100 GPUs.  Such models require immense computational power and
   memory capacity to perform effectively.  This means that training
   very large AI models requires a high-speed interconnected network of
   thousands or even millions of GPUs in a cluster.

   From a technical perspective, building a unified, ultra-large-scale
   data center is the ideal solution, offering efficient data
   processing and storage.  In practice, however, this solution faces
   several challenges and constraints.  The first is investment cost,
   which includes hardware procurement, infrastructure construction,
   and software and hardware platform development; the construction
   period often ranges from several months to several years.  Large-
   scale data centers also place high demands on floor space: for
   example, 10,000 GPUs require at least 5,000 square meters of data
   center space.  In addition, there are energy challenges such as heat
   dissipation and power supply.  Large models are true "power
   consumers"; placing more than 100,000 H100 GPUs in the same area
   could paralyze the local power grid.

   To make full use of the resources of smaller data centers, multiple
   data centers can be connected through the network to provide
   infrastructure services for AI tasks.  To shorten training time and
   relieve the resource demand on a single data center, distributed
   model training has been proposed in the form of cloud-network
   coordination.

2.  Terminology

   The following terms are used in this document:

   AI:    Artificial Intelligence

   DC:    Data Center

   LLM:   Large Language Model

   GPU:   Graphics Processing Unit

   OTN:   Optical Transport Network

   RDMA:  Remote Direct Memory Access

3.  Scenarios

   To solve the problems of insufficient space, power, and heat
   dissipation when building a single 10,000-GPU or even 100,000-GPU
   cluster, distributed networking of AI data centers can connect
   multiple intelligent computing centers into one large virtual
   computing cluster.  At present, a lossless network between
   distributed AI data centers mainly applies to the following two
   scenarios.

3.1.  Scenario 1: Distributed Model Training

   At present, the scale of most AIDCs is between 100 and 300 PFLOPS,
   which is insufficient for training large-scale models.  Distributed
   model training coordinates the computing of multiple intelligent
   computing centers in a region, so that larger models can be trained
   without building super-scale intelligent computing centers.

   In practice, the computing power demanded by tenants is often
   inconsistent with the computing power actually deployed, resulting
   in fragmentation of computing resources.  Some AI data centers
   therefore suffer from low resource utilization and wasted computing
   resources.  In this scenario, distributed AI data center networking
   can provide lossless network connections between servers in remote
   data centers.  This makes full use of the fragmented resources of
   data centers in different geographical locations to perform suitable
   model training tasks and improves overall resource utilization.

3.2.  Scenario 2: Distributed Storage and Computation

   High-performance, highly reliable storage is one of the most basic
   services of a public cloud.  At present, the storage-compute
   separation architecture is widely used in public clouds; that is,
   computing clusters and storage clusters may be located in different
   DCs within a region.  The network connecting computing clusters and
   storage clusters is the key to achieving high performance and high
   reliability for cloud storage services.  The distributed AIDC
   network can connect computing clusters and storage clusters in a
   region, meeting data-localization requirements and ensuring data
   security.

4.  Framework

   This document proposes a framework to address the challenge of
   efficient lossless interconnection and reliable data transmission
   between multiple data centers, suitable for the scenarios described
   above, such as large model training and inference.

4.1.  Overview

   The distributed AIDC lossless network architecture is composed of
   multiple independent data center networks interconnected through a
   wide-area interconnection layer, which jointly support the operation
   of the multiple AIDCs.  The AI cluster network architecture is
   divided into five layers from bottom to top:

                    +------------------------------------------+
 Control layer      |                Controller                |
                    +-+-----------+--------------+-----------+-+
  --------------------|-----------|--------------|-----------|--------
                      |           |              |           |
                  +---+---+   +---+---+      +---+---+   +---+---+
 Interconnection  |  OTN  |---|  OTN  |-....-|  OTN  |---|  OTN  |
 layer            +--+--+-+   +-+--+--+      +--+--+-+   +-+--+--+
                     |   \     /   |            |   \     /   |
                     |    \   /    |            |    \   /    |
                     |     \ /     |            |     \ /     |
                     |      X      |            |      X      |
                     |     / \     |            |     / \     |
                     |    /   \    |            |    /   \    |
                     |   /     \   |            |   /     \   |
                  +--+--+-+   +-+--+--+      +--+--+-+   +-+--+--+
 Cluster egress   |S-Spine|   |S-Spine|      |S-Spine|   |S-Spine|
 layer            +--+--+-+   +-+--+--+      +--+--+-+   +-+--+--+
                     |   \     /   |            |   \     /   |
                     |    \   /    |            |    \   /    |
                     |     \ /     |            |     \ /     |
                     |      X      |            |      X      |
                     |     / \     |            |     / \     |
                     |    /   \    |            |    /   \    |
                     |   /     \   |            |   /     \   |
                  +--+--+-+   +-+--+--+      +--+--+-+   +-+--+--+
 Aggregation      | Spine |   | Spine |      | Spine |   | Spine |
 layer            +--+--+-+   +-+--+--+      +--+--+-+   +-+--+--+
                     |   \     /   |            |   \     /   |
                     |    \   /    |            |    \   /    |
                     |     \ /     |            |     \ /     |
                     |      X      |            |      X      |
                     |     / \     |            |     / \     |
                     |    /   \    |            |    /   \    |
                     |   /     \   |            |   /     \   |
                  +--+--+-+   +-+--+--+      +--+--+-+   +-+--+--+
 Access layer     | leaf  |   | leaf  |      | leaf  |   | leaf  |
                  +-+---+-+   +-+---+-+      +-+---+-+   +-+---+-+
                    |   |       |   |          |   |       |   |
                    H1  H2      H3  H4         H1  H2      H3  H4
                        Cluster A                  Cluster N

            Figure 1: Framework of distributed AIDC network

   *  The access layer is composed of server leaf switches, which
      support high-density access of AI servers; a 1:1 uplink/downlink
      convergence ratio is recommended.  Each interface of an AI server
      is configured with an independent IP address and is connected to
      a server leaf switch over an independent link, without link
      bundling.  The access layer supports an optical-module fault
      protection mechanism to avoid training interruptions caused by
      access-side link failures.

   *  The aggregation layer consists of Spine switches, which connect
      to server leaf switches downstream and to DCI gateways upstream.
      The number of Spine switches determines the total size of the
      node's AI cluster.  Depending on the training workload, the
      aggregation layer may adopt a certain convergence ratio.

   *  The cluster egress layer consists of DCI gateways.  As the exit
      of the AI cluster, a DCI gateway is fully interconnected with
      multiple Spine switches downstream and with the OTN and other
      nodes upstream.  The cluster egress layer may also apply a
      convergence ratio according to the workload.  In addition, this
      layer needs to support technologies such as service awareness and
      precise flow control to realize network-wide load balancing and
      long-distance lossless transmission, providing the basic network
      guarantee for efficient LLM training.

   *  The wide-area interconnection layer consists of routers, OTN
      equipment, and other devices.  Multiple AIDCs are connected
      through a high-throughput OTN, which uses high-speed, large-
      capacity technology to provide high-quality, large-bandwidth
      connections and realize cross-DC interconnection of AI training
      networks.  This layer provides intelligent operation and
      maintenance capabilities to ensure high reliability of the
      interconnection and supports flexible link setup and teardown
      according to service needs.

   *  The control layer consists of controllers; forwarding and control
      in the distributed AIDC network are separated.  The controller
      collects the network topology and the network traffic reported by
      the network devices, as well as service information reported by
      the servers, such as model segmentation and parallel mode.  The
      controller then computes global traffic paths and installs them
      on the network devices (see the sketch after this list).

   Based on these layers, the distributed AI cluster network
   architecture can provide stable and efficient data transmission in
   long-distance, large-scale distributed computing environments.
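   As a non-normative illustration, the following Python sketch shows
   one possible shape of the controller's collect/compute/push cycle
   described in the control-layer bullet above.  All class names,
   report fields, and the example path are hypothetical assumptions,
   not a defined interface.

   # Non-normative sketch: collect topology/traffic/service reports,
   # compute global paths, and push them to devices.
   class Controller:
       def __init__(self):
           self.topology = {}   # link -> capacity reported by devices
           self.traffic = {}    # link -> measured load
           self.services = []   # service info reported by servers

       def collect(self, device_reports, server_reports):
           # Devices report links and load; servers report how the
           # model is segmented and which parallel mode a job uses.
           for report in device_reports:
               self.topology.update(report["links"])
               self.traffic.update(report["load"])
           self.services.extend(server_reports)

       def compute_paths(self):
           # Placeholder for global path computation; Section 4.3.2
           # sketches one way to plan conflict-free paths.
           return {svc["job"]: ["leaf1", "spine1", "dci1", "otn1"]
                   for svc in self.services}

       def push(self, paths):
           # Stand-in for installing explicit paths on devices.
           for job, path in paths.items():
               print(job, "->", " -> ".join(path))

   ctrl = Controller()
   ctrl.collect(
       device_reports=[{"links": {("leaf1", "spine1"): 400},
                        "load": {("leaf1", "spine1"): 120}}],
       server_reports=[{"job": "llm-train-1",
                        "segmentation": "pipeline",
                        "parallel_mode": "data"}])
   ctrl.push(ctrl.compute_paths())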
4.2.  Technical Requirements

   The distributed AIDC lossless network extends the DC lossless
   network from the data center network to the wide area network.  From
   the perspective of network operation, a distributed AIDC network
   should meet the following requirements.

   Requirement 1: Long-distance lossless interconnection

   RDMA is used as the input/output protocol during large model
   training.  Since RDMA is very sensitive to network congestion and
   packet loss, even a small number of lost packets causes a sharp
   performance degradation.  Therefore, the underlying network must
   provide lossless transmission to ensure that there is no congestion
   or packet loss during data transmission, avoiding performance
   degradation of the upper-layer protocols.

   Requirement 2: Large-capacity interconnection bandwidth

   Large-capacity interconnection bandwidth ensures the rapid
   transmission of large amounts of data between distributed AI data
   centers, accelerating the training and inference of AI models.  As
   data volumes increase, efficient synchronization of data and model
   parameters among distributed AI data centers is required, which in
   turn requires the network to provide sufficient throughput to avoid
   congestion and performance degradation.

   Requirement 3: Ultra-high reliability

   To ensure long-term stable training across distributed AI data
   centers and to prevent training interruptions caused by external
   factors such as network failures, the transport network needs high
   reliability.  For example, the network should recover quickly from a
   link failure so that the AI service is not interrupted, avoiding the
   training rollback and reduced computational efficiency that a link
   interruption would cause.

   Requirement 4: Elasticity and agility

   The distributed AIDC lossless network needs to flexibly set up
   clusters of different sizes and types according to the needs of
   multiple tenants.  This means the network must support elastic, on-
   demand setup and teardown, adjusting quickly to changing computing
   requirements and dynamically allocating large-bandwidth resources.

   Requirement 5: Intelligent network operation

   In model training scenarios, packet loss leads to a large decrease
   in training performance.  This places higher requirements on network
   quality and on operation and maintenance: network links must be
   monitored in real time.  The distributed AIDC lossless network needs
   intelligent operation and maintenance capabilities that can quickly
   and accurately locate and resolve problems, improve the accuracy of
   fault location, and ensure stable network operation.

4.3.  Key Mechanisms

4.3.1.  Collective Communication Mechanism in Heterogeneous Networks

   The communication pattern of large model training is collective
   communication.  Collective communication among GPUs refers to
   information exchange conducted by multiple GPU nodes following
   specific rules, using a series of predefined communication patterns
   such as all-reduce, all-gather, reduce, and broadcast to achieve
   data synchronization and sharing.  Each iteration of LLM training
   synchronizes parameters through collective communication, and each
   collective operation involves multiple rounds of data interaction
   and multiple long-distance communications.  Long-distance links
   increase communication delay, which affects LLM training efficiency.
   It is therefore necessary to optimize the collective communication
   mechanism for heterogeneous networks, reducing both the amount of
   data transmitted on long-distance links and the number of
   transmissions, thereby greatly reducing the possibility of
   congestion on long-distance links.
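   As a non-normative illustration of this optimization, the Python
   sketch below shows a two-level, DC-aware all-reduce: gradients are
   first reduced inside each DC over the local lossless fabric, only
   per-DC aggregates cross the long-distance link, and the result is
   broadcast locally.  The flat data layout, the elementwise-sum
   reduction, and the single exchange per DC are illustrative
   assumptions, not a prescribed algorithm.

   def hierarchical_allreduce(dcs):
       """dcs: list of DCs, each a list of per-GPU gradient vectors."""
       # Step 1: reduce inside each DC over the local lossless fabric.
       local_sums = [
           [sum(vals) for vals in zip(*gpus)]   # elementwise sum per DC
           for gpus in dcs
       ]
       # Step 2: only per-DC aggregates cross the long-distance link,
       # so WAN traffic is one vector per DC instead of one per GPU.
       global_sum = [sum(vals) for vals in zip(*local_sums)]
       # Step 3: broadcast the result inside each DC (local fabric).
       return [[list(global_sum) for _ in gpus] for gpus in dcs]

   # Example: 2 DCs x 4 GPUs.  A flat all-reduce can move one gradient
   # vector per GPU across the WAN; the hierarchical form moves one
   # per DC.
   dcs = [[[1.0, 2.0]] * 4, [[3.0, 4.0]] * 4]
   print(hierarchical_allreduce(dcs)[0][0])   # [16.0, 24.0]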
4.3.2.  Global Load Balancing

   Load balancing mainly addresses congestion and packet loss in non-
   faulty, homogeneous networks in AI computing scenarios.  LLM
   training traffic is highly synchronized, consists of large flows,
   and appears periodically, and flows are present on every equal-cost
   path in the network.  Traditional load balancing based on ECMP
   hashing cannot balance all paths perfectly.  Network-level load
   balancing can pre-plan traffic across the whole network based on
   these traffic characteristics, so that all paths are balanced and
   conflict-free, avoiding congestion and packet loss.
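   The following non-normative Python sketch illustrates centrally
   planned, conflict-free path assignment, in contrast to per-flow ECMP
   hashing.  The greedy largest-first placement and the flow, path, and
   capacity values are illustrative assumptions.

   def plan_paths(flows, paths, capacity):
       """Greedily place each (flow, size) on the least-loaded path."""
       load = {p: 0.0 for p in paths}
       assignment = {}
       # LLM training flows are few, large, and periodic, so they are
       # known in advance and can be sorted largest-first for packing.
       for flow, size in sorted(flows.items(), key=lambda kv: -kv[1]):
           best = min(paths, key=lambda p: load[p])
           if load[best] + size > capacity:
               raise RuntimeError("no conflict-free fit; defer flow")
           assignment[flow] = best
           load[best] += size
       return assignment, load

   flows = {"f1": 40.0, "f2": 40.0, "f3": 40.0, "f4": 40.0}   # Gbps
   paths = ["spine1-dci1", "spine2-dci1", "spine1-dci2", "spine2-dci2"]
   assignment, load = plan_paths(flows, paths, capacity=100.0)
   print(assignment)   # one large flow per path; ECMP hashing may
                       # instead collide several flows onto one path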
4.3.3.  Precise Flow-Control Technology

   To address the performance degradation caused by packet loss in AI
   computing workloads, the distributed AIDC network architecture
   adopts a precise flow-control technology that transfers the
   "congestion point" from the long-distance link to the network device
   closest to the sending node (the first-hop device), effectively
   alleviating congestion by shortening the congestion feedback path.
   Specifically, each network device determines whether congestion
   occurs on its links by checking local state, such as the queue
   occupancy and buffer usage of each port.  If congestion occurs and
   the device is not the first-hop device for the congested flows, a
   congestion message is sent to the first-hop device.  The first-hop
   device then runs an algorithm based on the degree of congestion to
   determine the proportion by which the congested flows should be
   limited.  Finally, the first-hop device controls the rate of the
   flows by sending PFC/CNP or other flow-control protocol packets.
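   As a non-normative illustration, the Python sketch below shows the
   shape of this mechanism: a congested device reports the congestion
   degree to the first-hop device, which derives a rate limit for the
   affected flows.  The queue threshold and the proportional back-off
   rule are illustrative assumptions; in practice the signaling would
   use PFC/CNP or a comparable flow-control protocol.

   QUEUE_THRESHOLD = 0.7    # queue fill ratio regarded as congestion

   def check_congestion(port_queue_fill):
       """Run on every device: detect congestion from queue state."""
       return port_queue_fill > QUEUE_THRESHOLD

   def notify_first_hop(flow, degree, first_hop):
       """Congested device sends the congestion degree upstream."""
       first_hop.on_congestion(flow, degree)

   class FirstHopDevice:
       def __init__(self):
           self.rate_limit = {}   # flow -> allowed fraction of rate

       def on_congestion(self, flow, degree):
           # Proportional rule: the more the queue overshoots the
           # threshold, the harder the flow is throttled at the edge;
           # a 10% floor keeps the flow alive.
           overshoot = max(0.0, degree - QUEUE_THRESHOLD)
           self.rate_limit[flow] = max(
               0.1, 1.0 - overshoot / (1.0 - QUEUE_THRESHOLD))

   first_hop = FirstHopDevice()
   fill = 0.85                       # queue fill on a WAN-facing port
   if check_congestion(fill):
       notify_first_hop("job42-flow7", fill, first_hop)
   print(first_hop.rate_limit)       # {'job42-flow7': 0.5}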
4.3.4.  Packet Loss Detection

   The packet loss monitoring technology supports the following
   capabilities:

   (1)  Fast fault localization: real-time monitoring of traffic delay,
        packet loss, and other indicators.

   (2)  Visualization: the centralized control of the network supports
        visualization of flow paths.

   (3)  Packet loss statistics: within a statistical period, the
        difference between all traffic entering the network and all
        traffic leaving the network is counted; this difference is the
        number of packets lost by the carrying network in that period
        (see the sketch after this list).

   (4)  Delay statistics: within a statistical period, the difference
        between the time at which a flow enters the network and the
        time at which it leaves the network is calculated between two
        specified network nodes; this difference is the network delay
        for that period.
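   The loss and delay statistics in items (3) and (4) reduce to simple
   differences over a statistical period, as in this non-normative
   sketch; the counter values and timestamps are illustrative, and real
   devices would export them via telemetry.

   def loss_count(ingress_pkts, egress_pkts):
       """Packets entering minus packets leaving in one period."""
       return ingress_pkts - egress_pkts

   def flow_delay(t_enter, t_leave):
       """Delay of one flow between two specified nodes."""
       return t_leave - t_enter

   # One statistical period, one flow monitored between two leaves:
   ingress, egress = 1_000_000, 999_996   # packets counted at edges
   print(loss_count(ingress, egress))     # 4 packets lost in transit
   print(flow_delay(t_enter=12.000341, t_leave=12.003141))  # ~2.8 ms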
5.  Conclusion

   This document proposes a distributed AIDC network framework and
   introduces the key technologies under this framework, including the
   collective communication mechanism, global load balancing, and
   precise flow-control technology, in order to further promote the
   verification of distributed AIDC interconnection in the future.

6.  Security Considerations

   This design introduces no additional security risks.

7.  IANA Considerations

   This document introduces no additional considerations for IANA.

8.  References

8.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017, <https://www.rfc-editor.org/info/rfc8174>.

8.2.  Informative References

   [I-D.huang-rtgwg-wan-lossless-uc]
              Huang, Z., He, T., Huang, H., and T. Zhou, "Use Cases and
              Requirements for Implementing Lossless Techniques in Wide
              Area Networks", Work in Progress, Internet-Draft, draft-
              huang-rtgwg-wan-lossless-uc-01, 8 July 2024.

Authors' Addresses

   Cong Li
   China Telecom
   Beiqijia Town, Changping District
   Beijing, 102209
   China

   Email: licong@chinatelecom.cn

   Siwei Ji
   China Telecom
   Beiqijia Town, Changping District
   Beijing, 102209
   China

   Email: jisw@chinatelecom.cn

   Keyi Zhu
   Huawei Technologies
   Huawei Campus, No.156 Beiqing Road
   Beijing, 100095
   China

   Email: zhukeyi@huawei.com