Workgroup:
Transport Area Working Group
Internet-Draft:
draft-yao-tsvwg-cco-requirement-and-analysis-02
Published:
8 July 2024
Intended Status:
Informational
Expires:
9 January 2025
Authors:
K. Yao
China Mobile
S. Xu
China Mobile
C. Liu
China Mobile
Y. Li
Huawei Technologies
H. Huang
Huawei Technologies
W. Wang
New H3C Technologies Co., Ltd
D. Kutscher
The Hong Kong University of Science and Technology (Guangzhou)

Collective Communication Optimizations: Requirement and Analysis

Abstract

Generative AI applications depend on large-scale parallel computing clusters for model training and inference. Existing implementations of collective communication in parallel computing are built on top of RDMA, the most widely adopted transport for AI workloads. However, One-to-Many, Many-to-One, and Many-to-Many collective operations all rely on the point-to-point transport semantics of RDMA, which inevitably introduces extra bandwidth occupancy and transmission overhead. Emerging approaches for collective communication optimization focus on network-assisted collective acceleration and can work compatibly with RDMA. This document analyzes different technical schemes for network-assisted collective acceleration based on RDMA, and presents the gaps between this work and current IETF standards, notably iWARP. Requirements for designing new standards are proposed accordingly.

Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [RFC2119].

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on 9 January 2025.

Table of Contents

1.  Introduction
2.  Definition of Terms
3.  Existing Work and Analysis
  3.1.  Classification and Analysis of Two Modes for NACA
    3.1.1.  NACA Based on Server-to-server RDMA Connection
    3.1.2.  NACA Based on Server-to-switch RDMA Connection
  3.2.  Gap Analysis of Existing Solutions
    3.2.1.  InfiniBand SHARP
    3.2.2.  RoCEv2 Solutions
    3.2.3.  iWARP
4.  Requirements
  4.1.  NACA Function Design and Header Definition
  4.2.  Bridge RDMA Transport Semantics
  4.3.  Reliability, Flow Control, and Congestion Control
  4.4.  Proper Data Input and Output Methods
  4.5.  Joint Design of NACA Task Assignment and Routing Policies
  4.6.  Security and Traffic Isolation
  4.7.  Fault Tolerance
5.  Security Considerations
6.  Operational Considerations
7.  IANA Considerations
8.  References
  8.1.  Normative References
  8.2.  Informative References
Authors' Addresses

1. Introduction

With the development of distributed applications, especially High Performance Computing (HPC) and Artificial Intelligence (AI), the scale of parallel computing clusters is constantly expanding, and the pressure that collective communication places on the network is also increasing. Existing implementations of collective communication are based on RDMA (Remote Direct Memory Access). The most obvious problem is that the point-to-point transmission semantics of RDMA are not well aligned with the logical communication patterns defined in collective communication, which incurs more bandwidth occupancy, more memory copies at endpoints, and more data movement, thus lowering overall parallel computing efficiency. Detailed use cases and problems are presented in [I-D.yao-tsvwg-cco-problem-statement-and-usecases].

Emerging technical schemes for collective communication optimization focus on network-assisted collective acceleration, which can greatly alleviate network pressure, improve transmission efficiency, and shorten flow completion time (FCT). Some of these approaches can also work compatibly with RDMA, which opens new standardization design space for extended RDMA-based protocols for collective communication optimization. In the following sections, this document analyzes different technical schemes for network-assisted collective acceleration based on RDMA, and presents the gaps between this work and current IETF standards. Requirements for designing new standards are proposed accordingly.

2. Definition of Terms

*Collective communication: A set of communication patterns that application processes follow to communicate with each other in a parallel computing cluster. These patterns include One-to-Many, Many-to-One, and Many-to-Many delivery modes.

*Network-Assisted Collective Acceleration (NACA): Using network devices, such as switches, to offload and perform collective operations, so as to improve overall collective communication efficiency.

3. Existing Work and Analysis

NACA offloads collective operations to switches. For example, Allreduce aggregation is performed in the switch, Broadcast is offloaded to the switch for data replication, and Scatter leverages the switch for data partitioning. Detailed collective operations are listed in [I-D.yao-tsvwg-cco-problem-statement-and-usecases]. NACA can be built on RDMA so as to optimize collective communication. RDMA allows endpoints to directly read and write memory on other endpoints at high speed without requiring kernel processing or CPU resources. Zero-copy memory access and kernel bypass have gradually made RDMA the mainstream communication technology for HPC and AI applications in data centers. This draft mainly focuses on the analysis of two different communication modes for RDMA-based network-assisted collective acceleration.
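
As a non-normative illustration of the offloading idea, the following Python sketch shows the per-round, element-wise reduction a switch could perform for Allreduce. Packet parsing, reliability, and timers are deliberately omitted, and the function is our own construction rather than part of any cited design.

def switch_aggregate(chunks):
    """Element-wise sum over equally sized chunks, one per sender."""
    # One aggregated result replaces the N input packets of the round.
    return [sum(values) for values in zip(*chunks)]

# Three workers contribute gradient chunks for the same round; the
# switch forwards a single aggregated chunk downstream.
print(switch_aggregate([[1, 2], [3, 4], [5, 6]]))   # -> [9, 12]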

3.1. Classification and Analysis of Two Modes for NACA

When using network devices to offload collective operations, RDMA communication modes can be divided into a server-to-server mode and a server-to-switch mode.

3.1.1. NACA Based on Server-to-server RDMA Connection

In the server-to-server RDMA connection mode, the logic of existing applications is not changed, and the switches that participate in collective communication are not visible to applications. The destination of the RDMA connection is set to another server endpoint, but switches can participate in the collective operations during data transmission. In this communication mode, native transport protocols can produce false positives. For example, when the switch helps perform data aggregation during Allreduce, in each round there will be only one aggregated packet sent from the switch to the destination server; packets from multiple senders are dropped after the aggregation. The destination side will judge this as packet loss and trigger packet retransmission [I-D.yao-tsvwg-cco-problem-statement-and-usecases] (the sketch after Figure 1 illustrates this). The reliability mechanisms of the native RDMA transport therefore need to be modified.

+------------+           +----------------------+            +------------+
|            <-----------------------------------------------> receiver 1 |
|            |           |                      |            +------------+
|            |           |        switch        |
|   sender   |           | forwarding and NACA  |
|            |           |                      |            +------------+
|            <-----------------------------------------------> receiver 2 |
|            |           |   RDMA connection    |            +------------+
|            |           |                      |
|            |           |                      |
|            |           |                      |            +------------+
|            <-----------------------------------------------> receiver 3 |
+------------+           +----------------------+            +------------+
Figure 1: NACA Based on Server-to-server RDMA Connection
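
The false positive described above can be made concrete with a small sketch. Assuming, purely for illustration, a per-connection go-back-N receiver, the packet sequence number (PSN) gap left by packets consumed during in-switch aggregation is indistinguishable from loss:

def go_back_n_receiver(expected_psn, arriving_psns):
    # Native go-back-N: any PSN gap is treated as loss.
    for psn in arriving_psns:
        if psn != expected_psn:
            return "NAK, retransmit from PSN %d" % expected_psn
        expected_psn += 1
    return "ACK"

# The switch consumed PSNs 1 and 2 while aggregating, so only PSNs 0
# and 3 reach the receiver, which wrongly requests retransmission.
print(go_back_n_receiver(expected_psn=0, arriving_psns=[0, 3]))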

3.1.2. NACA Based on Server-to-switch RDMA Connection

In the server-to-switch mode, switches act as RDMA endpoints, and RDMA connections are built between the sender, i.e., the server, and the destination, i.e., the switch. In this case, hop-by-hop RDMA connections are built for end-to-end data transmission. It is necessary to define which RDMA functions should be offloaded to switches, since offloading all RDMA functions to the network would place a heavy burden on switches, not only in memory and buffer space but also in protocol complexity (a non-normative sketch follows Figure 2).

                                               RDMA connection
                      +------------------------+             +------------+
                      |                        <-------------> receiver 1 |
                      |                        |             +------------+
               RDMA   |                        |
            Connection|                        |
+----------+          |         switch         |             +------------+
|  sender  <---------->  forwarding and NACA   <-------------> receiver 2 |
+----------+          |                        |             +------------+
                      |                        |
                      |                        |
                      |                        |             +------------+
                      |                        <-------------> receiver 3 |
                      +------------------------+             +------------+
Figure 2: NACA Based on Server-to-switch RDMA Connection
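
The question of which RDMA functions to offload can be illustrated with a sketch. The verb subset and the fallback behavior below are assumptions chosen for illustration, not a proposal:

SUPPORTED_VERBS = {"SEND", "RECV", "WRITE", "READ"}   # assumed subset

def handle_verb(verb, payload):
    if verb not in SUPPORTED_VERBS:
        # Fall back to the end-to-end path rather than emulating
        # heavyweight semantics (e.g., atomics) in the switch.
        return ("FALLBACK", payload)
    return ("OFFLOADED", payload)

print(handle_verb("WRITE", b"grad-chunk"))    # ('OFFLOADED', ...)
print(handle_verb("ATOMIC_CAS", b"x"))        # ('FALLBACK', ...)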

3.2. Gap Analysis of Existing Solutions

3.2.1. InfiniBand SHARP

The Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) [SHARP] breaks the end-to-end transport rule by implementing a Target Channel Adapter (TCA) in switches. The TCA supports both Reliable Connection (RC) transport, to enable reliable delivery of data through the aggregation tree, and Unreliable Datagram (UD) transport, to enable multicast distribution of the aggregation result. SHARP has been realized in InfiniBand commodity switches, but as stated, SHARP is based on the InfiniBand architecture. Currently it cannot interoperate with other network architectures, which limits its applicability.

Figure 3 shows the SHARP protocol headers. The protocol has two primary phases, Aggregation Request and Aggregation Response. The SHARP header is carried above the IBA header, followed by aggregation operations and other data description information.

+---------+---------+-------+------+-----------+--------+--------+----+
|   IBA   |  SHARP  | Tuple | User | Operation | Target | SHARP  | CRC|
| Header  |  Header | Header| Data |  Header   | Header | Payload|    |
+---------+---------+-------+------+-----------+--------+--------+----+

           Aggregation Request Pkt

+---------+---------+-------+------+--------+----+
|   IBA   |  SHARP  | Tuple | User | SHARP  | CRC|
| Header  |  Header | Header| Data | Payload|    |
+---------+---------+-------+------+--------+----+

           Aggregation Response Pkt
Figure 3: SHARP Protocol
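
As a non-normative aid to reading Figure 3, the sketch below packs a SHARP-like Aggregation Request header. Every field width and value here is an assumption; the authoritative layout is defined by [SHARP].

import struct

SHARP_REQ = struct.Struct("!BBHIHH")   # ver, flags, tuple, user,
                                       # operation, target (assumed)

def pack_request(payload: bytes) -> bytes:
    header = SHARP_REQ.pack(1, 0, 0x0010, 42, 0x0001, 0x0007)
    return header + payload            # CRC is appended by hardware

print(len(pack_request(b"\x00" * 16)))   # 12-byte header + 16B payload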

3.2.2. RoCEv2 Solutions

RDMA over Converged Ethernet version 2 (RoCEv2) is an RDMA scheme based on UDP over Ethernet. Its core design reuses InfiniBand's transport layer, where data is transmitted in sequence and retransmitted using go-back-N. Therefore, a lossless and ordered network is required to achieve ideal performance. Such networks introduce Priority Flow Control (PFC) and IP-based Explicit Congestion Notification (ECN) to ensure lossless transmission. Technical schemes of NACA based on RoCEv2 have been analyzed in both academia and industry, but compared with InfiniBand SHARP there are currently no commercial solutions, which means there is considerable standardization space in this area.
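
For readers unfamiliar with PFC, the toy state machine below shows the pause/resume behavior that keeps the fabric lossless; the XOFF/XON thresholds are invented for illustration.

XOFF, XON = 80, 40                      # queue depth thresholds in KB

def pfc_action(queue_kb, paused):
    if not paused and queue_kb >= XOFF:
        return "send PAUSE", True       # stop the upstream sender
    if paused and queue_kb <= XON:
        return "send RESUME", False
    return "forward", paused

print(pfc_action(85, paused=False))     # ('send PAUSE', True)
print(pfc_action(35, paused=True))      # ('send RESUME', False)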

Consider [NetReduce] and [Cepheus] as two examples of the server-to-server communication mode of RoCEv2-based NACA.

[NetReduce] is designed to offload Allreduce to switches. For ring-Allreduce, workers establish RDMA connections with their preceding and succeeding workers, using RDMA Write to send parameters and RDMA Read to receive aggregation results. The switch acts as a man-in-the-middle that receives data, aggregates it locally, and then returns the results to workers via RDMA Read. This approach has little impact on applications, and it improves performance because it reduces aggregation rounds compared to the traditional ring-Allreduce method (a back-of-the-envelope comparison follows below). However, mechanisms such as transport reliability and flow control are designed for a server-to-server communication model, so they need to be redesigned or adapted accordingly.
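
The round reduction can be sketched with a back-of-the-envelope count; the numbers below are our own rough model, not measurements from [NetReduce].

def ring_allreduce_steps(n_workers):
    # Classic ring: reduce-scatter + allgather = 2 * (N - 1) steps.
    return 2 * (n_workers - 1)

def in_network_steps(n_workers):
    # Switch aggregation: send up, receive result, regardless of N.
    return 2

for n in (4, 16, 64):
    print(n, ring_allreduce_steps(n), in_network_steps(n))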

Figure 5 shows the three packet formats of the NetReduce protocol: the first packet, middle packets, and the last packet. The NetReduce header is built over RoCEv2. NetReduce has a similar function to SHARP, but it is designed for the aggregation used in ring-Allreduce, so it carries ring information, message information, and rank information.

+---------------------------------------------------------+
|                       Switch                            |
|                  man-in-the-middle                      |
|                                                         |
|  +---------------------------------------------------+  |
|  |                                                   |  |
|  |      +--------------+      +--------------+       |  |
|  |      |              |      |              |       |  |
+---------------------------------------------------------+
   |      |              |      |              |       |
   |      |              |      |              |       |
   |      |              |      |              |       |
+--v------+-+          +-v------+--+         +-v-------+-+
|  worker 1 |          |  worker 2 |         |  worker 3 |
+-----------+          +-----------+         +-----------+
Figure 4: NetReduce Illustration
                        First Pkt
+---------+---------+---------+---------------+----------+------+
|   UDP   | IB BTH  | IB RETH | NetReduce Hdr |  Payload | ICRC |
+---------+---------+---------+---------------+----------+------+

                       Middle Pkt
+---------+---------+----------+------+
|   UDP   | IB BTH  |  Payload | ICRC |
+---------+---------+----------+------+

                       Last Pkt
+---------+---------+---------+---------+------+
|   UDP   | IB BTH  | IB IMM  | Payload | ICRC |
+---------+---------+---------+---------+------+
Figure 5: NetReduce Protocol

The design objective of [Cepheus] is to offload Broadcast operations. Multiple receivers first send RDMA-related information, such as the Queue Pair (QP) number and destination address, to the sender host for registration, and a Multicast Forwarding Tree (MFT) is built from this information. Intermediate switches make decisions based on their downstream connectors. If a leaf switch is directly connected to a receiver host, it works as an RDMA bridge by modifying data packets. In this way, multicast is performed in the forward direction, and acknowledgment signals are aggregated in the reverse direction to provide reliability. This kind of implementation incurs fewer modifications to native RDMA and has better compatibility.

Figure 6 shows the Cepheus Multicast Registration Protocol (MRP). Before the Broadcast operation starts, the source of the multicast propagates the MRP into the entire fabric to install a multicast forwarding table in each switch and build the MFT. Receivers' RDMA information, i.e., the QP number and destination IP address, is predefined before the MFT is set up. During multicast, there is no real RDMA communication; switches that are directly connected to receivers modify the packet header to make the logical RDMA connection complete.

             Cepheus MRP
+---------+----------+--------------+
|   UDP   | Metadata | Node Payload |
+---------+----------+--------------+

              Metadata
+--------+------+---------------+
|  Total | Seq  |  Node Numbers |
+--------+------+---------------+

            Node Payload
+----------+---------+---------+
| Node QPN | Node IP | Reserve |
+----------+---------+---------+
Figure 6: Cepheus Multicast Registration Protocol
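
To make Figure 6 concrete, the sketch below encodes an MRP message; the field widths are assumptions for illustration, and [Cepheus] defines the real format.

import socket
import struct

META = struct.Struct("!HHH")            # total, seq, node count (assumed)
NODE = struct.Struct("!I4sH")           # QPN, IPv4 address, reserved

def pack_mrp(seq, total, nodes):
    # Each node entry carries the receiver's QP number and IP so that
    # leaf switches can later rewrite packet headers (RDMA bridging).
    body = b"".join(NODE.pack(qpn, socket.inet_aton(ip), 0)
                    for qpn, ip in nodes)
    return META.pack(total, seq, len(nodes)) + body

msg = pack_mrp(seq=0, total=1, nodes=[(0x11, "10.0.0.2"),
                                      (0x12, "10.0.0.3")])
print(len(msg))                          # 6-byte metadata + 10 B/node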

In RoCEv2 networks there is also a server-to-switch mode, where switches implement the RDMA protocol stack and workers establish RDMA connections with switches. This approach is similar to InfiniBand SHARP, but based on Ethernet. Due to capacity limitations, network devices need not support the complete RDMA transport protocol. As before, the shortcoming of this mode is that it requires network devices to support RDMA.

3.2.3. iWARP

iWARP [RFC5040] is another RDMA scheme for Ethernet, based on the TCP protocol. Like RoCEv2, iWARP uses InfiniBand Verbs to interact with applications. RDMAP (Remote Direct Memory Access Protocol) provides RDMA semantic support for upper-layer requests such as RDMA Send, RDMA Read, and RDMA Write. DDP (Direct Data Placement) implements the zero-copy function: a DDP packet contains information describing the target memory area, and hardware can move the data in the DDP packet directly to its destination in memory through DMA, based on that control information, without involving the CPU. MPA (Marker PDU Aligned Framing) is responsible for adding control information to the TCP stream at the sending end according to a defined algorithm, so that the receiving end can recognize the boundaries of DDP packets in the stream.

+--------------+--------------+--------------+-----------------+------------+----------+
|  TCP Header  |  MPA Header  |  DDP Header  |   RDMA Header   |  Payload   |  MPA CRC |
+--------------+--------------+--------------+-----------------+------------+----------+
Figure 7: iWARP Protocol
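
The sketch below shows simplified MPA-style framing, with markers omitted and a generic CRC-32 standing in for CRC32c, to illustrate how DDP boundaries are recovered from the TCP byte stream; it is a teaching aid, not the actual MPA algorithm.

import struct
import zlib

def frame(ddp_segment: bytes) -> bytes:
    # A 2-byte length delimits the segment; a CRC trails it.
    return (struct.pack("!H", len(ddp_segment)) + ddp_segment
            + struct.pack("!I", zlib.crc32(ddp_segment)))

def deframe(stream: bytes) -> bytes:
    (length,) = struct.unpack_from("!H", stream, 0)
    segment = stream[2:2 + length]
    (crc,) = struct.unpack_from("!I", stream, 2 + length)
    # Any in-network modification of the payload breaks the CRC,
    # which is why server-to-server NACA is hard on iWARP.
    assert crc == zlib.crc32(segment), "payload modified in flight"
    return segment

print(deframe(frame(b"ddp-header+payload")))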

Because TCP ensures ordered delivery and transmission reliability, iWARP can adapt to larger network scales than RoCEv2, but its performance is lower. Because of the high cost of offloading the complete TCP/IP stack to hardware and the resource-intensive maintenance of TCP protocol state, iWARP is not as widely used as RoCEv2.

In server-to-server NACA based on iWARP, any in-network change to the payload may be treated as a corruption of the stream (as the framing sketch above illustrates), and any packet loss must be retransmitted. The transport-layer mechanisms are too complex and difficult to modify.

In server-to-switch NACA based on iWARP, resource limitations mean network devices need not implement the complete protocol stack, so it is necessary to clarify which parts of the existing protocols must be implemented. Meanwhile, if network devices maintain TCP connections, they need to manage the associated resources carefully.

4. Requirements

4.1. NACA Function Design and Header Definition

NACA offloads collective operations with low computational precision and high I/O communication to network devices. Network devices then not only perform packet routing and forwarding but also process collective messages. Therefore, NACA functions should be designed to instruct network devices to distinguish and process different traffic. Accordingly, a NACA header should be defined above the transport layer to provide the mapping between packets and collective messages. The following requirements are proposed to support collective communication optimization (a hypothetical header sketch follows R2):

R1: MUST define a NACA header to indicate which collective operations switches need to offload, together with relevant information, for example, message ID, sequence number, and job ID.

R2: SHOULD support a fallback mechanism, in case network devices lack the resources to process complete collective operations.
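
To make R1 concrete, the following sketch packs a purely hypothetical NACA header; every field name and width is an assumption made for illustration, not a protocol proposal.

import struct

NACA_HDR = struct.Struct("!BBHIIH")    # ver, opcode, job id, message
                                       # id, sequence number, flags

OP_ALLREDUCE, OP_BROADCAST = 1, 2      # hypothetical opcodes

def naca_packet(opcode, job_id, msg_id, seq, payload):
    return NACA_HDR.pack(1, opcode, job_id, msg_id, seq, 0) + payload

pkt = naca_packet(OP_ALLREDUCE, job_id=7, msg_id=100, seq=0,
                  payload=b"\x00" * 64)
print(len(pkt))                        # 14-byte header + payload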

4.2. Bridge RDMA Transport Semantics

As explained in previous sections, the major gap between native RDMA and NACA lies in the transport semantics. Mechanisms for transport semantic bridging are needed in order to combine the high-performance transport capability of RDMA with NACA functionality. Besides, NACA may not need the full functionality of native RDMA, and it is not practical to implement full RDMA functionality within switches because of limited hardware resources. For example, most RDMA-based NACA solutions only invoke RDMA Read, Write, Send, and Receive operations. Accordingly, the following requirements need to be met:

R3: The transport layer MUST support RDMA functions.

R4: SHOULD allow for different RDMA communication modes for NACA as described in Section 3.

R5: In the server-to-switch mode, SHOULD clarify which subset of RDMA functions the switch must support, in order to establish an RDMA connection with the server and complete NACA.

As analyzed in Section 3, iWARP solutions cannot work well with NACA because they build RDMA functions on top of TCP, which is too complex to implement in switches. The most promising direction is RDMA over UDP, for example RoCEv2. However, native RoCEv2 has several limitations and cannot work well with NACA in large-scale clusters. These limitations lie in the mechanisms for reliability, flow control, and congestion control. For reliability, go-back-N packet retransmission is inefficient and may incur high buffer occupancy in NACA switches. PFC also has high requirements for buffer space, and for Many-to-One collective operations PFC will take up even more buffer space. As for congestion control, there are many algorithms and not all of them work well with NACA; a common congestion control mechanism needs to be designed. Thus, there are the following requirements (a non-normative sketch follows R8):

R6: NACA MUST be designed with reliability, and the reliability mechanism of RoCEv2 SHOULD be modified to be more efficient.

R7: Flow control SHOULD be optimized in order to save more buffer and memory space for NACA functions.

R8: The congestion control of NACA SHOULD work compatibly with other congestion control mechanisms applied for other network traffic that runs in the same fabric.
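
The efficiency argument behind R6 can be sketched as follows: selective retransmission resends only the PSNs actually reported missing, whereas go-back-N resends everything from the first gap onward. This illustrates the principle only; it is not a concrete protocol design.

def go_back_n(window, first_lost):
    return [p for p in window if p >= first_lost]

def selective(window, lost):
    return [p for p in window if p in lost]

window = list(range(16))
print(len(go_back_n(window, first_lost=1)))   # 15 packets resent
print(len(selective(window, lost={1, 9})))    # 2 packets resent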

4.4. Proper Data Input and Output Methods

As noted above, switches and intermediate network nodes have relatively limited resources, e.g., memory. Thus, resource management should be coordinated with the data input and output methods on the data plane, especially in massive-scale distributed AI training applications. If they do not match, problems such as packet loss and high communication latency may occur. Therefore:

R9: Suitable data PUSH/PULL methods are REQUIRED to be designed to overcome the limited memory of switches (one possible discipline is sketched below).
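
One possible discipline satisfying R9, shown purely as an illustration, is credit-based PULL: the switch grants credits bounded by its buffer capacity, so senders never push more data than the switch can hold.

BUFFER_SLOTS = 4                        # assumed switch buffer capacity

class CreditedPort:
    def __init__(self):
        self.credits = BUFFER_SLOTS
    def grant(self):                    # switch asks a sender for data
        if self.credits > 0:
            self.credits -= 1
            return True
        return False                    # sender waits; nothing is lost
    def release(self):                  # aggregation done, slot freed
        self.credits += 1

port = CreditedPort()
print([port.grant() for _ in range(5)])   # [True, True, True, True, False]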

4.5. Joint Design of NACA Task Assignment and Routing Policies

AI model training tasks usually follow a predefined rule: the task and the training group are settled in advance, and once training starts, no new members join the group. On this basis, NACA task assignment usually follows a centralized pattern; for example, NACA supports Allreduce by following an aggregation tree and supports Broadcast by building a multicast forwarding tree. Some routing policies, however, follow distributed patterns; for example, Adaptive Routing (AR) selects the optimal path at each network node in a distributed manner. These approaches may not coexist well with each other. In order to better balance traffic management and task assignment:

R10: NACA Task assignment SHOULD be co-designed with routing policies for joint optimization.

4.6. Security and Traffic Isolation

In multi-tenant scenarios, a single switch may need to perform different NACA functions while also forwarding normal traffic. Since the NACA header contains collective operation metadata and payload parameters, severe security issues will arise if the switch logic designed for NACA is incorrectly applied to normal traffic. Hence, the security requirement is as follows:

R11: Resources MUST be isolated on switches to ensure that different tasks do not interfere with each other and that NACA functions do not operate on normal traffic.

4.7. Fault Tolerance

Packet drops due to congestion can negatively impact collective communication. Responding to this kind of error depends on the design of transport-layer retransmission mechanisms or congestion avoidance mechanisms. Such tolerance mechanisms are usually designed for transient errors and are covered by the transport-related requirements above. In contrast, errors on one or more switches in the aggregation-tree/broadcast-tree fabric of NACA may cause long-lasting or permanent failures, so asynchronous notification mechanisms need to be designed to mitigate the catastrophic impact of these errors on NACA.

R12: A mechanism for choosing an alternative node to implement NACA functions MUST be designed, to ensure system robustness and reliability. Asynchronous notification mechanisms are REQUIRED to be designed to mitigate errors on switches in the NACA fabric.

5. Security Considerations

Some security concerns have been described in the [I-D.yao-tsvwg-cco-problem-statement-and-usecases].

6. Operational Considerations

Use cases such as AI model training, distributed storage, and big data analysis usually require infrastructure to be deployed in clusters operated by single entities, for example, a limited domain [RFC8799]. In this case, not only the compute and network infrastructure but also the application may be owned by a single service provider. These use cases are typically performance-driven, which means that the application and the infrastructure need to be co-designed for optimization. However, applications need not be co-designed with the underlying network protocols case by case: as long as consensus can be reached across vendors on the definition and realization of the collective operations to be offloaded, such as unified primitives for implementing collective communication, applications can leverage a standardized northbound API to improve performance, even if the applications do not belong to the same service provider.

7. IANA Considerations

TBD.

8. References

8.1. Normative References

[I-D.yao-tsvwg-cco-problem-statement-and-usecases]
Yao, K., Xu, S., Li, Y., Huang, H., and D. Kutscher, "Collective Communication Optimization: Problem Statement and Use cases", Work in Progress, Internet-Draft, draft-yao-tsvwg-cco-problem-statement-and-usecases-00, <https://datatracker.ietf.org/doc/html/draft-yao-tsvwg-cco-problem-statement-and-usecases-00>.
[RFC2119]
Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March 1997, <https://www.rfc-editor.org/info/rfc2119>.
[RFC5040]
Recio, R., Metzler, B., Culley, P., Hilland, J., and D. Garcia, "A Remote Direct Memory Access Protocol Specification", RFC 5040, DOI 10.17487/RFC5040, October 2007, <https://www.rfc-editor.org/info/rfc5040>.
[RFC8799]
Carpenter, B. and B. Liu, "Limited Domains and Internet Protocols", RFC 8799, DOI 10.17487/RFC8799, July 2020, <https://www.rfc-editor.org/info/rfc8799>.

8.2. Informative References

[Cepheus]
Zhang, W. L. A. J., "Cepheus: Accelerating Datacenter Applications with High-Performance RoCE-Capable Multicast".
[NetReduce]
Liu, S., "In-Network Aggregation with Transport Transparency for Distributed Training", DOI 10.1145/3582016.3582037, <https://doi.org/10.1145/3582016.3582037>.
[SHARP]
Graham, R. L., "Scalable Hierarchical Aggregation and Reduction Protocol (SHARP): A Hardware Architecture for Efficient Data Reduction", DOI 10.1109/COMHPC.2016.006, <https://doi.org/10.1109/COMHPC.2016.006>.

Authors' Addresses

Kehan Yao
China Mobile
Beijing
100053
China
Shiping Xu
China Mobile
Beijing
100053
China
Chang Liu
China Mobile
Beijing
100053
China
Yizhou Li
Huawei Technologies
Nanjing, Jiangsu
China
Hongyi Huang
Huawei Technologies
Beijing
China
Weifeng Wang
New H3C Technologies Co., Ltd
Beijing
China
Dirk Kutscher
The Hong Kong University of Science and Technology (Guangzhou)
Guangzhou
China