Workgroup:
DMSC Working Group
Internet-Draft:
draft-yang-dmsc-distributed-model-02
Published:
Intended Status:
Standards Track
Expires:
2 September 2025
Authors:
H. Yang
Beijing University of Posts and Telecommunications
T. Yu
Beijing University of Posts and Telecommunications
Q. Yao
Beijing University of Posts and Telecommunications
Z. Zhang
Beijing University of Posts and Telecommunications

Distributed AI model architecture for microservices communication and computing power scheduling

Abstract

This document describes the distributed AI micromodel computing power scheduling service architecture.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on 2 September 2025.

1. Introduction

The Distributed AI Micromodel Computing Power Scheduling Service Architecture is a structured framework designed to address the challenges of scalability, flexibility, and efficiency in modern AI systems. By integrating model segmentation, micro-model deployment, and microservice orchestration, this architecture enables the effective allocation and management of computing resources across distributed environments. The primary focus lies in leveraging model segmentation to decompose large AI models into smaller, modular micro-models, which are executed collaboratively across distributed nodes.

The architecture is organized into four tightly integrated layers, each with distinct roles and responsibilities that together ensure seamless functionality:

Business Layer: This layer acts as the interface between the user-facing applications and the underlying system. It encapsulates AI capabilities as microservices, enabling modular deployment, elastic scaling, and independent version control. By routing user requests through service gateways, it ensures efficient interaction with back-end micro-models while balancing workloads. The business layer also facilitates collaboration between multiple micro-models, allowing them to function as part of a cohesive distributed system.

Control Layer: The control layer is the central coordination hub, responsible for task scheduling, resource allocation, and the implementation of model segmentation strategies. It decomposes large AI models into smaller, manageable components, assigns tasks to specific nodes, and ensures synchronized execution across distributed environments. This layer dynamically balances compute and network resources while adapting to system demands, ensuring high efficiency for training and inference workflows.

Computing Power Layer: As the execution core, this layer translates the decisions made by the control layer into distributed computation. It executes segmented micro-models on diverse hardware resources such as GPUs, CPUs, and accelerators, optimizing parallelism and fault tolerance. By coordinating with the control layer, it ensures that tasks are executed efficiently while leveraging distributed orchestration frameworks to handle diverse workloads.

Data Layer: The data layer underpins the entire system by managing secure storage, access, and transmission of data. It provides the necessary datasets, intermediate results, and metadata required for executing segmented micro-models. Privacy protection mechanisms, such as federated learning and differential privacy, ensure data security and compliance, while distributed database operations guarantee consistent access and high availability across nodes.

At the heart of this architecture is model segmentation, which serves as the foundation for effectively distributing computation and optimizing resource utilization. The control layer breaks down models into smaller micro-models using strategies such as layer-based, business-specific, or block-based segmentation. These micro-models are then deployed as independent services in the business layer, where they are dynamically scaled and orchestrated to meet real-time demands. The computing power layer executes these tasks using parallel processing techniques and advanced scheduling algorithms, while the data layer ensures secure and efficient data flow to support both training and inference tasks.

By tightly integrating these layers, the architecture addresses critical challenges such as balancing compute and network resources, synchronizing distributed micro-models, and minimizing communication overhead. This cohesive design enables AI systems to achieve high performance, scalability, and flexibility across dynamic and resource-intensive workloads.

This document outlines the design principles, key components, and operational advantages of the Distributed AI Micromodel Computing Power Scheduling Service Architecture, emphasizing how model segmentation, micro-models, and microservices form the foundation for scalable and efficient distributed AI systems.

2. Conventions used in this document

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [RFC2119].

3. Terminology

TBD

4. Scenarios and requirements

4.1. AI Microservice model scenario requirements

As artificial intelligence technology evolves at an accelerated pace, the scale and complexity of AI models continue to grow. Traditional monolithic applications and centralized inference or training deployments are increasingly unable to keep up with rapidly changing business demands. Encapsulating AI capabilities within a microservices architecture offers substantial advantages in system flexibility, scalability, and service governance. By decoupling models into microservices, an independent AI model service avoids the bottlenecks that arise from deep coupling with other business logic components and can scale elastically during surges in requests or training load. Given the rapid iteration and upgrade cycles of AI models, a microservice architecture also allows multiple model versions to coexist, enables gray-scale (canary) releases, and supports rapid rollbacks, minimizing the impact on the overall system.

The computing power requirements of AI microservice models are often extremely demanding. On the one hand, training or inference usually involves massive data processing and high-density parallel computing, requiring the collaboration of heterogeneous hardware resources such as GPUs, CPUs, FPGAs, and NPUs. On the other hand, when the model is large or the request volume is high, the computing power of a single machine is often insufficient, so the workload must be parallelized across multiple nodes in a distributed mode, with resources released during idle periods to improve utilization. Such distributed training or inference typically relies on efficient communication strategies to synchronize model parameters or gradients, and collective operations such as AllReduce or All-to-All are often used to reduce communication overhead and keep the model consistent.

In a distributed system, the network plays a crucial role. Large volumes of model parameters and gradients must be exchanged frequently during computation, which places high demands on network bandwidth and latency. In large-scale cluster scenarios, the design of the network topology and the choice of communication framework cannot be ignored. Only in a high-bandwidth, low-latency network environment, combined with an appropriate communication library (such as NCCL or MPI), can a cluster fully exploit its computing potential and prevent communication from becoming the global performance bottleneck.

4.2. Distributed Micro-model Service Flow

In the distributed AI micro-model computing power scheduling service architecture, the core of the business process is how to orchestrate the model across multiple nodes and coordinate their work so that parameters are synchronized and communicated efficiently. Typically, a model is trained and evaluated with a deep learning framework during development and is then packaged, together with its dependencies, into a container image so that it can be deployed as an independent service. These encapsulated model services are then registered with the system's microservice management platform for unified scheduling and access.

Once a micro-model is deployed to a distributed cluster, computing power orchestration and resource scheduling allocate computing resources such as GPUs or CPUs according to real-time load, business priority, and hardware topology, and container orchestration tools (such as Kubernetes) start the corresponding service instances on each node. When distributed cooperation is needed, frameworks such as NCCL or Horovod handle inter-process communication. Requests from upper-layer business systems or users usually arrive first at an API gateway or service gateway and are then distributed to the target service instances according to load balancing or other routing policies. If distributed inference is needed, multiple nodes cooperate to run inference over the segmented model and merge the results, and the final inference result is returned to the requester. Throughout this process, real-time monitoring and elastic scaling mechanisms play an important role in keeping the system stable and optimizing resource utilization. On the monitoring side, a unified data collection and analysis platform tracks core indicators such as GPU utilization, network traffic, and request latency for each service node, raises timely alarms in case of failures, performance bottlenecks, or insufficient resources, and performs automatic failover or takes nodes offline as needed.

In addition, the distributed micro-model business flow needs to be combined with a data backflow mechanism. The large volume of logs, user feedback, and interaction information generated during inference can be fed back to the data platform, provided that privacy and compliance requirements are met, and then used to train new models or to optimize the performance of existing ones.

5. Key issues and challenges

5.1. Balancing Compute and Network Resources under Constraints

With the continuous growth of AI model size and business demand, the computing resources of a single node or single cluster are often insufficient to support high-intensity training and inference tasks, leading to computing power shortages or sharply rising costs. Coordinating computing resources across multiple nodes and regions through a distributed architecture can improve overall efficiency and fault tolerance to some extent. However, distributed deployment also brings higher complexity: it must account for the differences between heterogeneous hardware (such as GPUs, CPUs, and FPGAs) and balance the allocation of computing power under varying network topology and bandwidth conditions.

When computing and network resources are scarce, computing power must be scheduled and allocated dynamically according to business priority, model scale, and real-time load, combining strategies such as queuing, elastic scaling, and cross-cluster resource collaboration to improve overall service efficiency. In this process, model partitioning and parallelism play a key role. On the one hand, a model can be decomposed across multiple nodes through tensor partitioning or pipelined execution, with each node responsible only for a specific sub-module or slice. On the other hand, for inference scenarios, input data can flow through a series of model microservice nodes to form a pipelined processing mode that makes full use of scattered computing resources. By splitting the model for parallel execution, this strategy avoids concentrating too much computing pressure on a single server and, when network resources permit, maximizes the use of GPU/CPU capacity on idle nodes, achieving balance and optimization between computing and network resources.
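
The following sketch illustrates one way such a partitioning decision could be made. The stage costs, node speeds, and bandwidth figure are invented for illustration, and the greedy heuristic merely stands in for the tensor-partitioning or pipelining strategies discussed above; it is not a normative algorithm.

   # Sketch: greedily place each model stage on a node, weighing the
   # compute cost of the stage against the cost of moving the previous
   # stage's activations across the network.  All numbers are invented.

   stages = [                       # (name, relative FLOPs, activation MB)
       ("embedding", 2.0, 8.0),
       ("encoder_block_1", 6.0, 16.0),
       ("encoder_block_2", 6.0, 16.0),
       ("classifier", 1.0, 0.5),
   ]
   nodes = {"node-a": 1.0, "node-b": 2.0}   # name -> relative compute speed
   bandwidth_mb_per_s = 1000.0              # assumed inter-node bandwidth

   def assign(stages, nodes):
       placement, prev_node = [], None
       for name, flops, act_mb in stages:
           best = None
           for node, speed in nodes.items():
               compute = flops / speed
               transfer = 0.0 if node == prev_node else act_mb / bandwidth_mb_per_s
               cost = compute + transfer
               if best is None or cost < best[1]:
                   best = (node, cost)
           placement.append((name, best[0]))
           prev_node = best[0]
       return placement

   for stage, node in assign(stages, nodes):
       print(f"{stage:>16} -> {node}")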

5.2. Data Collaboration Challenges under Block Isolation

In many distributed systems, large-scale data is split into multiple data blocks that are stored and processed separately. Although this improves data security and processing efficiency, it also complicates data coordination. When multiple nodes or microservice modules need to share or exchange data, the interfaces and call sequences must be defined in advance, and consistency and concurrency control must be managed. Especially when different data blocks have cross-node dependencies, scheduling, loading, and distributing data effectively becomes one of the key bottlenecks for system scalability and computational efficiency.

A key difficulty lies in synchronizing data across distributed nodes while minimizing latency and avoiding bottlenecks. Cross-node dependencies require precise scheduling to ensure data arrives at the correct location and time without conflicts. As the scale of data and the number of nodes grow, the management overhead for maintaining these dependencies can increase exponentially, particularly when network bandwidth or latency constraints exacerbate delays. Additionally, ensuring data consistency across multiple data blocks during concurrent access or updates adds another layer of complexity. High levels of concurrency can increase the risk of inconsistencies, data races, and synchronization issues, demanding advanced mechanisms to enforce data integrity.

Traditional distributed communication strategies, such as AllReduce and All-to-All, are widely used and remain effective in addressing certain data collaboration needs in training and inference tasks. For example, AllReduce is well-suited for data parallel scenarios, where all nodes compute on the same model with different data splits, and gradients or weights are synchronized via aggregation and broadcast. Similarly, All-to-All is valuable in more complex distributed tasks that require frequent intermediate data exchanges across nodes. However, these methods are not without limitations. As data and system complexity grow, they can lead to increased communication overhead, especially in scenarios where synchronization is uneven or poorly timed.

The effectiveness of traditional methods relies on fine-tuning and precise execution. Poorly timed data exchange can lead to long waiting times, underutilized resources, and even data mismatches. Although approaches such as AllReduce and All-to-All provide reliable communication frameworks, their scalability and efficiency are often limited by challenges such as cross-node synchronization, network variation, and system heterogeneity. Continuous improvement and innovation in distributed communication and data collaboration strategies are therefore needed to overcome the challenges posed by block isolation.

6. Distributed solution based on model segmentation

Based on these key problems and challenges, this document proposes a distributed AI micro-model computing power scheduling service architecture divided into four layers: the business layer, the control layer, the computing power layer, and the data layer. The hierarchical relationship is shown in Figure 1, and the overall architecture is shown in Figure 2. The functional modules realize soft collaboration at the control layer and hard isolation at the data layer, as shown in Figure 3.

 ---------------------------------
|          Business layer         |
|                 |               |
|           Control layer         |
|                 |               |
|      Computing power layer      |
|                 |               |
|             Data layer          |
 ---------------------------------
 -----------------------------------------------------------------------------------------------------------------------------------------------------------
|                       -----------      -----------                                      -----------      -----------                                      |
|                      |Service A/1|    |Service B/1|                                    |Service A/2|    |Service B/2|                                     |
|                       -----|-----      -----|-----                                      -----|-----      -----|-----                                      |
|                            |                |                                                |                |                                           |
|                            |                |                                                |                |                                           |
|                       -----------------------------                                    -----------------------------                                      |
|                      |  Microservices Gateway -1   |                                  |  Microservices Gateway -2   |                                     |
|                       ------------|----------------                                    -----------|-----------------                                      |
|                                   |                                                               |                                                       |
|                              -----|-----                                                     -----|-----                                                  |
|                             | Interface |                                                   | Interface |                                                 |
|                             | address 1 |- - - - - - - - - - - - - - - - - - - - - - - - - -| address 2 |----------------------------------               |
|                              -----\-----                                                     -----/-----            Address caching        |              |
|                                     \                                                            /                                         |              |
|                                       \                                                        /                                           |              |
|           --------------------        --\-------------                          -------------/--       --------------------                |              |
|          | Functional modules |------| Service Router |------------------------| Service Router |-----| Functional modules |               |              |
|           --------------------        -------\--------                          --------/-------       --------------------                |              |
|                                                \                                      /                                                    |              |
|                                                  \                                  /                                            ----------------------   |
|                                                    \                              /                                --------     | Service Registration |  |
|                                                      \                          /                                 |  Feign | ---| and Discovery Centre |  |
|                                                        \                      /                                    --------      ----------------------   |
|                                                          \                  /                                                              |              |
|                                                            \              /                                                                |              |
|                                --------------------        --\----------/--                                                                |              |
|                               | Functional modules |------| Service Router |                                                               |              |
|                                --------------------        --------|-------                                                                |              |
|                                                                    |                                                                       |              |
|                                                                    |                                                                       |              |
|                                                               -----|-----                                                                  |              |
|                                                              | Interface |                                           Address caching       |              |
|                                                              | address 3 |-----------------------------------------------------------------               |
|                                                               -----|-----                                                                                 |
|                                                                    |                                                                                      |
|                                                        ------------|----------------                                                                      |
|                                                       |  Microservices Gateway -3   |                                                                     |
|                                                        -----------------------------                                                                      |
|                                                             |                |                                                                            |
|                                                             |                |                                                                            |
|                                                        -----|-----      -----|-----                                                                       |
|                                                       |Service A/3|    |Service B/3|                                                                      |
|                                                        -----------      -----------                                                                       |
|                                                                                                                                                           |
|                                                                                                                                                           |
 -----------------------------------------------------------------------------------------------------------------------------------------------------------
                                           RPC  | REST API
                                                |
 -----------------------------------------------|---------------------------------------
|                      -|-* * *-----------------|---------------------|-                |
|                     |        Task         management      module      |               |
|                      -|---|-------------------|-----------------------                |
|                       |   |                   |                                       |
|                 ------    |                   |                                       |
|                |          |                   |                                       |
|     ---|-* * *-|-|--      |         -|-* * *--|---|-----|---                          |
|    |  Asynchronous  |     |        |  AI Model Segmentation |                         |
|    |   task queue   |     |        |     and aggregation    |                         |
|    |     module     |     |        |          module        |                         |
|     ---|-* * *-|-|--      |         -|-* * *--|---|-----|---                          |
|                |          |          |        |   |     |                             |
|                |          |  --------         |   |     |                             |
|                |          | |                 |   |      -------------                |
|               -|-* * *----|-|-                |   |                   |               |
|              | Log management |               |  -|-* * *--|-|---    -|-* * *---|-|-  |
|              |    system      |               | | Fault-tolerant |  | Model storage | |
|               ----------------                | |    mechanism   |  |     module    | |
|                                               |  ----------------    -|-* * *---|-|-  |
|                                               |                                       |
|    Control layer                              |           (Soft collaboration)        |
------------------------------------------------|---------------------------------------
                                                |
 -----------------------------------------------|---------------------------------------
|                                               |                                       |
|                                               |                                       |
|                                     -|-* * *--|-----|-|--                             |
|                                    |     Distributed     |                            |
|                                    | unified cooperation |                            |
|                                    |       module        |                            |
|                                     -|-* * *---|----|-|--                             |
|                                                |                                      |
|                                     -|-* * *---|----|-|--                             |
|                                    |   Load balancing    |                            |
|                                    |    and resource     |                            |
|                                    |allocation mechanism |                            |
|                                     -|-* * *---|----|-|--                             |
|                                                |                                      |
|                                     -|-* * *---|----|-|--                             |
|                                    |   Computing power   |                            |
|                                    |  execution  module  |                            |
|                                     -|-* * *---|----|-|--                             |
|                                                |    |                                 |
|                                                |     -------------                    |
|                                                |                  |                   |
|                                     -|-* * *---|----|-|--        -|-* * *----|-|-     |
|                                     |         Data       |      | Fault tolerance|    |
|                                     |  management module |      |  and recovery  |    |
|                                     -|-* * *---|----|-|--       |     module     |    |
|                                                |                 ----------------     |
|                                     -|-* * *---|----|-|--                             |
|                                    |   Computing power   |                            |
|                                    |    resource pool    |                            |
|                                     -|-* * *---|----|-|--                             |
|                                                |                                      |
|  Computing power layer                         |                                      |
|                                                |                                      |
 ------------------------------------------------|--------------------------------------
 ------------------------------------------------|--------------------------------------
|                                                | Packing data                         |
|                                                |                                      |
|                                       -|-* * *-|-|-                                   |
|       Data layer                     |   Database  |                                  |
|                                       -------------               (Hard isolation)    |
 ---------------------------------------------------------------------------------------

6.1. Business layer

The business layer is the core of the whole system and hosts the main business logic and microservice components. It interacts with the user-facing front-end presentation layer, receives requests from various channels, processes them according to models or business rules, and returns the results to the upper layer or synchronizes them with other microservices. Typically, the business layer is deployed on a microservice container platform (such as Kubernetes) and is managed through a service gateway or API gateway, with a service registration and discovery center maintaining communication and load balancing between microservices. Internal communication can use RPC, REST APIs, or Feign-based remote calls.

6.1.1. Microservices and Micromodels

Microservices and micro-models are realized as multiple services (e.g., "Service 1", "Service 2", "Service 3", up to "Service n") that invoke each other across the business and logical layers. Each service encapsulates a separate model or a functional slice of a model, and when these services communicate with one another via RPC, REST APIs, or an internal event bus, the overall effect is coordinated distributed micro-model execution. Through the service registration and discovery center, these micro-models can automatically discover each other's available instances when needed, so that computing power and network resources can be scaled and balanced flexibly in large-scale concurrent scenarios.

6.1.2. Microservice Gateway and API Gateway

The microservice gateway and API gateway provide traffic scheduling and a unified entry point in the business layer. The microservice gateway mainly serves internal service calls; through load balancing, routing rules, and security policy configuration, it makes communication between business modules more efficient and stable.

The API gateway faces external clients or the front-end layer, providing users with a consistent HTTP or gRPC interface. At the same time, it is responsible for authentication, rate limiting, circuit breaking, and monitoring, ensuring that the impact on internal services remains controllable when external requests surge.

6.1.3. Service Registration and Discovery Center

The service registration and discovery center records the network address, version information, and health status of all available microservices in the system, so that other modules or gateways in the business layer can locate the correct target instance whenever they need to call a microservice. For example, in a "real-time recommendation and user behavior analysis" business, when the "user portrait generation" microservice needs to be called, the system first queries the registration and discovery center for the load status and list of available instances of that service, and then selects an appropriate node according to the load balancing strategy. This not only prevents single points of failure but also automatically updates routing information as microservice instances are added or removed.

The service registration and discovery center frees business function modules from manually maintaining complex service addresses or dependencies. Each microservice only needs to register its own information after startup, and when an instance goes offline or crashes, the registry updates its state accordingly. Common implementations include Eureka, Consul, and Zookeeper. These registration and discovery centers can be deeply integrated with microservice gateways or load balancing layers to achieve highly available governance in distributed environments.

Each service registers its interface address with the registry, and a calling service looks up the callee's interface address through the registry before invoking the interface. The calls themselves are made peer-to-peer; although a registry is present, it serves only for discovery and traffic control.
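
The register/discover/call cycle described above can be illustrated with a minimal in-memory sketch. A real deployment would rely on Eureka, Consul, or Zookeeper rather than this toy class, and the service name and instance addresses below are invented for illustration.

   # Sketch of the register -> discover -> call cycle.  Real systems
   # delegate this to Eureka, Consul, or Zookeeper.
   import itertools

   class Registry:
       def __init__(self):
           self._instances = {}    # service name -> list of addresses
           self._cursors = {}      # service name -> round-robin iterator

       def register(self, service, address):
           self._instances.setdefault(service, []).append(address)
           self._cursors[service] = itertools.cycle(self._instances[service])

       def deregister(self, service, address):
           self._instances[service].remove(address)
           self._cursors[service] = itertools.cycle(self._instances[service])

       def discover(self, service):
           """Return the next available instance, round-robin."""
           return next(self._cursors[service])

   registry = Registry()
   registry.register("user-portrait", "10.0.0.11:8080")   # instance 1
   registry.register("user-portrait", "10.0.0.12:8080")   # instance 2

   # A gateway or peer service resolves an address per request; the
   # call itself is then made point-to-point, as noted above.
   print("routing request to", registry.discover("user-portrait"))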

6.2. Control layer

The control layer is mainly responsible for scheduling and managing the various tasks and resources in the distributed AI system, including task creation, allocation, and exception handling, as well as key processes such as model segmentation, training, and aggregation. A carefully designed control layer enables multiple models to run in parallel with high efficiency and high availability, and performs timely scheduling and fault tolerance when computing power is insufficient.

6.2.1. Task management module

The task management module is the "hub" of the control layer. It receives different types of task requests from the business layer or the data layer, such as model training, model inference, or batch data processing, and assigns tasks to nodes for execution according to real-time load conditions and computing power resource information. The task management module usually maintains a task queue or priority queue and orders tasks with FCFS (first-come, first-served), FIFO (first-in, first-out), or weight-based scheduling policies. Internally, the module interfaces with the service registration and discovery center or a resource orchestration system (e.g., Kubernetes) to dynamically obtain key metrics such as the health, bandwidth, and memory usage of available nodes (GPU/CPU). Some advanced implementations also use load balancing strategies or node affinity algorithms to choose the best location for each task and trigger auto-scaling or resource recycling when the overall cluster load reaches a threshold.
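
A minimal sketch of such a weight-based queue and node selection step follows. The node metrics and task fields are hypothetical; an actual module would obtain them from the registration and discovery center or the cluster orchestrator.

   # Sketch: priority task queue plus a simple node-selection rule.
   import heapq, itertools

   _seq = itertools.count()         # tie-breaker keeps FIFO order
   task_queue = []                  # min-heap of (priority, seq, task)

   def submit(task, priority):
       """Lower number = higher priority; equal priorities run FIFO."""
       heapq.heappush(task_queue, (priority, next(_seq), task))

   def pick_node(nodes, required_gpu_mem):
       """Pick the healthy node with the most free GPU memory that still
       satisfies the task's requirement (None if no node qualifies)."""
       ok = [n for n in nodes
             if n["healthy"] and n["free_gpu_mem"] >= required_gpu_mem]
       return max(ok, key=lambda n: n["free_gpu_mem"], default=None)

   nodes = [
       {"name": "gpu-1", "healthy": True,  "free_gpu_mem": 24},
       {"name": "gpu-2", "healthy": True,  "free_gpu_mem": 8},
       {"name": "gpu-3", "healthy": False, "free_gpu_mem": 40},
   ]

   submit({"id": "train-42", "gpu_mem": 16}, priority=1)
   submit({"id": "infer-7",  "gpu_mem": 4},  priority=0)  # jumps the queue

   while task_queue:
       _, _, task = heapq.heappop(task_queue)
       node = pick_node(nodes, task["gpu_mem"])
       print(task["id"], "->", node["name"] if node else "held: no capacity")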

6.2.2. Exception task queue module

The exception task queue module acts as a "fault buffer" that captures and stores exceptions occurring during task execution. In distributed AI systems, network jitter, node failures, or data anomalies often cause some tasks to fail or hang for a long time. The exception task queue module collects and isolates these abnormal tasks so that they do not block the main task queue and degrade overall performance. The module continuously monitors error logs and timeouts during training or inference. When an exception is found, the details of the corresponding task (e.g., task ID, exception type, and execution log) are transferred to a separate exception queue and recorded in the fault tracking system.
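
The following sketch illustrates this fault-buffer behavior under an assumed retry policy: a failed task is either re-queued or moved, together with its diagnostic context, into a separate exception queue. The field names and the retry limit are illustrative only.

   # Sketch: isolate failed tasks so they do not block the main queue.
   import time
   from collections import deque

   main_queue = deque()
   exception_queue = deque()
   MAX_RETRIES = 3                        # assumed retry policy

   def run_with_isolation(task, execute):
       try:
           return execute(task)
       except Exception as err:
           record = {
               "task_id": task["id"],
               "exception": type(err).__name__,
               "detail": str(err),
               "timestamp": time.time(),
               "retries": task.get("retries", 0),
           }
           if record["retries"] < MAX_RETRIES:
               task["retries"] = record["retries"] + 1
               main_queue.append(task)          # try again later
           else:
               exception_queue.append(record)   # isolate and alert
           return None

   main_queue.append({"id": "train-17"})
   run_with_isolation(main_queue.popleft(), execute=lambda t: 1 / 0)
   print(len(main_queue), "re-queued,", len(exception_queue), "isolated")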

6.2.3. Log management system

The log management module is responsible for tracking all critical operations and events during distributed training, inference, and scheduling. This module usually uses a centralized log storage and analysis framework so that log data can be retrieved and aggregated efficiently even in a large system. It records not only the timestamps and execution results of events such as model segmentation, computing power allocation, and communication synchronization, but also the hardware metrics of each node during execution (such as GPU utilization, memory usage, and I/O throughput). When failure symptoms or performance bottlenecks are detected in the logs, such as slow training or frequent node timeouts, the log management module pushes the information to the exception task queue module or the alert system, helping the operations and development teams diagnose and troubleshoot in time. Centralized management and visual analysis of log data also provide a reliable basis for subsequent model optimization, resource budgeting, and business decision-making.

6.2.4. Model segmentation interface

This interface is mainly used to receive configuration information related to segmentation strategy or algorithm. Through this interface, the caller (e.g., a task management module, a business layer, or a scheduling system) can specify the splitting mode (per layer, per service, per block, etc.) and the corresponding parameter restrictions for each policy, such as the range of the number of layers to be split, the heuristic rules of the tabu search algorithm, the number of shared layers for multiple tasks, and the privacy protection requirements. The interface is typically provided in the form of a REST API, gRPC, RPC, or messaging middleware, giving the upstream system the flexibility to send or update policies.
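
As an illustration, the configuration carried over this interface might resemble the sketch below. The draft does not define a concrete schema; the field names and example values are assumptions.

   # Sketch of a segmentation-policy payload; the schema is hypothetical.
   from dataclasses import dataclass
   from typing import Optional

   @dataclass
   class SegmentationPolicy:
       mode: str                          # "layer" | "business" | "block"
       max_layers_per_slice: Optional[int] = None
       shared_layers: int = 0             # backbone layers shared by tasks
       tabu_iterations: int = 0           # 0 disables the tabu search
       privacy_level: str = "none"        # e.g. "none" | "dp" | "federated"

   # Layer-based segmentation, at most 4 layers per slice, refined by
   # 50 iterations of tabu search.
   policy = SegmentationPolicy(mode="layer",
                               max_layers_per_slice=4,
                               tabu_iterations=50)
   # The policy would then be serialized (e.g. JSON) and delivered to the
   # segmentation module over REST, gRPC, or a message queue.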

6.2.5. Model segmentation module

Model segmentation is a key innovation in distributed AI architectures, offering a more efficient and flexible way to allocate computational resources and manage workloads. Within the control layer, segmentation strategies are carefully selected based on specific objectives, such as improving parallelism, optimizing resource utilization, or meeting privacy requirements. These strategies are tightly integrated into the system, with each segmented component packaged as a modular microservice to ensure seamless deployment and operation in distributed environments. Figure 4 shows the framework of the model segmentation and aggregation module.

Layer-based segmentation divides a model according to its structural hierarchy, segmenting the network layer by layer. Each resulting sub-model, typically consisting of one or more layers, is assigned to different nodes for parallel execution. This method is particularly effective for deep neural networks with significant depth and computational complexity. For example, in a deep convolutional neural network (CNN) for image classification, the initial convolutional layers responsible for extracting features might be executed on Node A, the intermediate fully connected layers on Node B, and the output classification layer on Node C. To enhance efficiency, heuristic or tabu search algorithms can determine optimal segmentation points by considering factors like computational load, inter-node communication overhead, and overall network latency. This strategy is especially valuable in real-time inference scenarios, such as autonomous driving, where computational throughput and low latency are critical for decision-making.
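
A minimal sketch of layer-based segmentation on a toy network follows. The split points are arbitrary, and the three stages are simply run in sequence within one process; in a real deployment each stage would be packaged as its own micro-service on Node A, B, or C as described above.

   # Sketch: slice a toy CNN into three stages at layer boundaries.
   import torch
   import torch.nn as nn

   model = nn.Sequential(
       nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),   # feature extraction
       nn.AdaptiveAvgPool2d(1), nn.Flatten(),
       nn.Linear(16, 64), nn.ReLU(),                # intermediate layers
       nn.Linear(64, 10),                           # classification head
   )

   stage_a = model[:4]    # convolution + pooling     -> Node A
   stage_b = model[4:6]   # fully connected layers    -> Node B
   stage_c = model[6:]    # output layer              -> Node C

   x = torch.randn(1, 3, 32, 32)
   with torch.no_grad():
       h = stage_a(x)     # Node A emits intermediate activations,
       h = stage_b(h)     # which are shipped to Node B, and so on.
       out = stage_c(h)
   print(out.shape)       # torch.Size([1, 10])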

Business segmentation is usually applied in multi-task learning scenarios, where the same "backbone" model is derived into several sub-models (or sub-tasks) according to business requirements, and co-training or inference of multiple tasks is realized by sharing part of the network structure or parameters. For example, an e-commerce platform may care about recommendations, ad click prediction, and user personas at the same time; these requirements can be split into different "branches" on the common part of the same model, sharing the feature extraction layers while each has its own task-specific output or fine-tuning layers.
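
A minimal sketch of such a shared backbone with task-specific branches follows; the task names and layer sizes are illustrative only.

   # Sketch: one shared feature extractor, several business branches.
   import torch
   import torch.nn as nn

   backbone = nn.Sequential(nn.Linear(128, 64), nn.ReLU())  # shared part
   heads = {
       "recommendation": nn.Linear(64, 32),   # item scores
       "ctr_prediction": nn.Linear(64, 1),    # click-probability logit
   }

   features = backbone(torch.randn(4, 128))   # computed once per request
   outputs = {task: head(features) for task, head in heads.items()}
   print({task: tuple(o.shape) for task, o in outputs.items()})
   # {'recommendation': (4, 32), 'ctr_prediction': (4, 1)}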

Block-based segmentation provides maximum flexibility by dividing the model into smaller, independent chunks of computation that can be executed on separate nodes. Unlike layer-based or business-based segmentation, this approach does not adhere to the structural hierarchy or task boundaries of the model. Instead, it focuses on resource adaptability and efficient computation in heterogeneous environments. For example, in a federated learning system for healthcare, hospitals can train local model blocks on sensitive patient data. These blocks perform their computations securely on site, and only encrypted intermediate results are aggregated globally. Similarly, in high-density cloud environments, block-based segmentation can dynamically allocate computational tasks to available hardware.

In addition to the above common segmentation methods, for scenarios that need to take into account data privacy or compliance requirements, privacy protection logic can also be built into the segmentation strategy, such as putting sensitive data related calculations into a separate secure node, or performing differential privacy processing on gradient information and then aggregating. Through the multi-level and multi-angle model segmentation scheme, the control layer can maximize the use of distributed computing power, and flexibly schedule AI tasks in a multi-business and multi-data source environment.

 ---------------------------------------------------------------------------
|                                -----------------------------------------  |
|    --|-* * *---------|-|--    | Task requests are collected and stored  | |
|   | AI Model Segmentation |   |                       |                 | |
|   |    and aggregation    | --|      The feature algorithm extracts     | |
|   |         module        |   |           the generated features        | |
|    --|-* * *---------|-|--    |                       |                 | |
|                               |          The data matching algorithm    | |
|                               |           performs the task grouping    | |
|    -----------------------    |                       |                 | |
|   | Layer segmentation    |   |                 Model training          | |
|   | Business segmentation |---|                       |                 | |
|   | Block segmentation    |   |         Model parameter aggregation     | |
|    -----------------------     -----------------------------------------  |
 ---------------------------------------------------------------------------

6.2.6. Model segmentation scheduling

After model segmentation, the control layer undertakes the key task of scheduling the execution of the segmented sub-models. Scheduling is more than just assigning tasks to nodes; it must optimize collaboration efficiency, minimize resource idleness, and reduce data bias across the distributed system. The scheduling process requires careful consideration of factors such as task timing, resource availability, data dependency, and system load to determine the optimal execution order and synchronization strategy for each sub-model.

To manage incoming requests effectively, the scheduling algorithm must decide how tasks are prioritized and allocated. For instance, using a First Come, First Serve (FCFS) strategy ensures that tasks are executed in the order they arrive. However, this approach may leave some nodes underutilized if tasks vary significantly in complexity or resource requirements. To address such inefficiencies, advanced scheduling methods like priority queues or dynamic insertion algorithms can be employed. These methods prioritize tasks based on urgency, computational cost, or value to the system, ensuring that high-priority or time-sensitive tasks are assigned computational resources more quickly. For example, in a real-time fraud detection system, high-risk transactions can be processed immediately by prioritizing their execution, while lower-risk transactions are queued for later.

At the same time, to ensure correctness and consistency in the distributed environment, appropriate communication points must be arranged after each segment's computation to avoid data disorder or excessive delay. For scenarios in which training or inference is highly time-sensitive, exclusive GPU/CPU nodes can be reserved for critical tasks at the scheduling level, or timing synchronization mechanisms can be enabled to ensure that all sub-models complete their updates and feedback within the same iteration cycle.

6.2.7. Model segmentation aggregation

Once all calculations distributed across different nodes or sub-models are completed, the intermediate results or parameters must be aggregated to produce the final output, whether it is a model prediction result or updated model parameters. The aggregation module plays a pivotal role in consolidating these outputs into a unified result, ensuring consistency and accuracy in distributed AI workflows.

The aggregation process typically employs strategies such as voting, weighted averaging, or attention mechanisms to combine the outputs of sub-models. For instance, in an ensemble-based recommendation system, each sub-model might provide a recommendation score, and the aggregation module could compute a weighted average based on the performance or confidence of each sub-model. Similarly, in distributed neural networks, attention mechanisms can be used to assign different importance to outputs from various nodes, enabling more precise aggregation based on task-specific contexts. These strategies ensure that the aggregated result reflects the strengths and contributions of individual sub-models while maintaining overall coherence.

However, aggregation in distributed systems is inherently challenging due to the possibility of node failures or delays. Network jitter, node outages, or computation delays can prevent certain nodes from returning their results in time, potentially disrupting the aggregation process. To address this, the control layer incorporates fault-tolerant mechanisms such as timeout retries, data playback, or redundant computation strategies. For example, if a node fails to provide its result within a specified time frame, the system might either retry the computation on the same node or reassign the task to a different node. In scenarios where redundancy is feasible, multiple nodes can perform the same computation, ensuring that at least one result is available for aggregation.

The aggregation module also monitors system-wide performance to evaluate the trade-off between computational benefits and coordination overhead. By refining fault-tolerant logic and aggregation strategies, the control layer ensures that the advantages of distributed computation—such as scalability and parallelism—are not offset by excessive synchronization or error-handling delays. For example, in large-scale model training, the aggregation process might include gradient averaging or parameter summation across nodes, with mechanisms to handle delayed or missing gradients, ensuring that the global model converges effectively despite intermittent node failures.
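
The sketch below illustrates one possible aggregation rule of this kind: results that arrive in time are combined by weighted averaging, and stragglers are excluded (they could equally be retried or reassigned, as described above). The weights and example tensors are hypothetical.

   # Sketch: weighted aggregation that tolerates missing sub-model results.
   import torch

   def aggregate(results, weights):
       """results: node -> tensor (None if the node timed out);
       weights: node -> confidence/performance weight."""
       received = {n: t for n, t in results.items() if t is not None}
       if not received:
           raise RuntimeError("no sub-model returned a result in time")
       total = sum(weights[n] for n in received)
       return sum(weights[n] / total * received[n] for n in received)

   results = {
       "node-a": torch.tensor([1.0, 2.0]),
       "node-b": torch.tensor([3.0, 4.0]),
       "node-c": None,                      # timed out; excluded here
   }
   weights = {"node-a": 0.7, "node-b": 0.3, "node-c": 0.5}
   print(aggregate(results, weights))       # tensor([1.6000, 2.6000])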

6.3. Computing power layer

The computing power layer is the execution core of the distributed artificial intelligence system: it converts the strategies and decisions of the control layer into actual computation. This layer processes tasks, manages resources, and executes distributed models across nodes, ensuring that the computational benefits of model segmentation are fully realized. By integrating advanced scheduling, resource allocation, and fault tolerance mechanisms, the computing power layer ensures efficient task execution while maintaining system stability under dynamic load.

The model segmentation strategy at the control layer determines how the sub-models or operators are distributed over the nodes. The computing power layer, in turn, optimizes resource allocation and execution to align with the segmentation design, ensuring that data dependencies and computational workflows are effectively managed. Through dynamic orchestration, parallel processing, and feedback mechanisms, this layer provides high performance and scalability for large-scale distributed AI systems.

6.3.1. Calculation of micro-model parameters

In the micro-model parameter calculation phase, the computing power layer receives scheduling instructions from the control layer and obtains the aggregated model information provided by the distributed unified collaboration module. The input usually includes the structural description of the micro-model (e.g., network topologies such as convolutional networks, DNNs, or Transformers) and the corresponding data fragments or data blocks. In addition, the computing power layer takes into account the requirements of the business layer, such as inference latency, training accuracy, and throughput, to pre-allocate and schedule resources before execution.

When the micro-model and data are ready, the computing power execution module loads the corresponding operators onto GPUs, CPUs, or other hardware acceleration units according to the pre-selected computing framework (such as TensorFlow, PyTorch, or a self-developed lightweight AI inference engine) and performs parallel computing according to the parallelism configuration provided by the distributed unified collaboration module. For larger convolutional layers or attention mechanisms, the system may use AllReduce, All-to-All, and similar modes to distribute computing tasks, synchronizing or updating gradients after each iteration. For lightweight AI models, the computing power layer gives priority to fast-responding nodes to meet low-latency application scenarios. Throughout the process, the load balancing and resource allocation mechanism monitors the load of each resource pool (such as "computing power resource pool 1" and "computing power resource pool 2") in real time and makes dynamic adjustments when a node hits a performance bottleneck or has idle resources, reducing waiting time and improving overall throughput.

When the calculation is finished, the computing power layer summarizes the execution of each micro-model, generates records including computation latency, model metrics (such as loss or accuracy), and hardware utilization, and archives these records through the data management module in preparation for the next distributed computing power parameter update.

6.3.2. Distributed computing power parameter update

In the distributed computing power parameter update stage, the computing power layer globally merges and synchronizes the intermediate results or model gradients computed in the previous step, and then feeds the updated model parameters back to the control layer or the data layer. The input usually includes the training gradients uploaded by each node, model weight chunks, and node health status. The distributed unified collaboration module, combined with the fault tolerance and recovery mechanisms, ensures that parameters can be aggregated smoothly even when some nodes are delayed or fail.

According to the business requirements and model scale, the computing power layer chooses the optimal parallel communication strategy, such as Ring AllReduce, Tree AllReduce, or gradient compression followed by aggregation, to reduce network bandwidth consumption and accelerate the synchronization of model parameters. For large models using Transformer or attention structures, the computing power layer can distribute model parameters across different resource pools to be updated in parallel using block or pipeline parallelism, with the results then collected and consolidated on the master node or master process.
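
The Ring AllReduce pattern named above can be illustrated with a single-process simulation. Each simulated node holds a full gradient vector that is split into one chunk per node, reduced in a scatter-reduce ring pass, and then propagated in an all-gather ring pass. The gradient values are arbitrary, and a real system would perform these exchanges through NCCL or a similar collective library rather than in Python.

   # Sketch: Ring AllReduce simulated in one process.
   import torch

   def ring_allreduce(grads):
       n = len(grads)                              # number of nodes
       chunks = [list(torch.chunk(g, n)) for g in grads]
       # Scatter-reduce: after n-1 steps node i holds the fully reduced
       # chunk (i + 1) % n.
       for step in range(n - 1):
           for i in range(n):
               c = (i - step) % n                  # chunk passed by node i
               chunks[(i + 1) % n][c] = chunks[(i + 1) % n][c] + chunks[i][c]
       # All-gather: each node forwards its reduced chunk around the ring
       # until every node holds every reduced chunk.
       for step in range(n - 1):
           for i in range(n):
               c = (i + 1 - step) % n
               chunks[(i + 1) % n][c] = chunks[i][c]
       return [torch.cat(c) for c in chunks]

   grads = [torch.full((4,), float(i + 1)) for i in range(4)]   # 4 nodes
   print(ring_allreduce(grads)[0])   # tensor([10., 10., 10., 10.]) everywhere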

After the distributed parameter update is completed, the computing power layer will send the final model weights or inference engine image back to the control layer to be registered in the model warehouse as the "latest version of the model", and may also synchronize some intermediate features or labels to the data layer for subsequent analysis. At the same time, the fault tolerance and recovery module evaluates the stability and performance of the node according to the monitoring data collected during the training and update process, and provides a decision basis for the next iteration cycle or new task scheduling.

6.3.3. Distributed unified collaboration module

The distributed unified collaboration module sits at the core of the computing power layer. It receives and integrates task instructions from the control layer (such as the model segmentation strategy and training or inference goals) and interfaces with the underlying computing power resource pools. Its inputs include the architecture of the individual micro-models or aggregated models, the type of computation to be performed (training or inference), and an overview of the hardware available in the current cluster. Its output is a global orchestration plan for computing resources and computing processes, which guides the computing power execution module and other functional modules to work together. A distributed unified collaboration module typically works with a service registry or cluster orchestration system (e.g., Kubernetes or YARN) and may have a built-in distributed communication framework (e.g., NCCL or Horovod) to manage and synchronize multiple GPUs or nodes. Its most prominent feature is that it can dynamically map different sub-models or operators to the most appropriate nodes according to the model block information and computing requirements, so that distributed computation maintains high throughput and scalability in multi-task, multi-model environments.

6.3.4. Load balancing and resource allocation mechanism

The load balancing and resource allocation mechanism monitors the load of each computing resource pool (such as GPU clusters, CPU clusters, and heterogeneous accelerators) in real time and, combined with the task scheduling strategy given by the distributed unified collaboration module, decides how to distribute the computing load between nodes. Its input mainly includes the real-time status of each node (idle capacity, free memory, and computing power utilization) and the hardware requirements of the task to be assigned (e.g., how many GPUs are needed and whether mixed-precision training is supported). Its output is the specific node allocation scheme and task routing instructions, which guide the computing power execution module to deliver computing tasks to the optimal location.

6.3.5. Computing power execution module

According to the instructions from the distributed unified collaboration module and the load balancing module, the computing power execution module loads the specific micro-model or operator onto the corresponding node for execution. The inputs include model parameters, network topology, and data blocks; the outputs are computed inference results or intermediate training gradients. The module can run on multiple servers through containerization (e.g., Docker or Kubernetes Pods) and, combined with AI frameworks (TensorFlow, PyTorch, etc.) or self-developed inference engines, can flexibly switch the execution environment and underlying computing power.

6.3.6. Data management module

The data management module exchanges the necessary features, labels, and metadata with the control layer and the business layer. Its input sources usually include already chunked or segmented data sets, as well as intermediate results generated during model execution (e.g., local gradients and temporary features). Its outputs are updated snapshots of model parameters or preprocessed feature data for later use. The data management module can support highly concurrent reads and writes with the help of a distributed file system (HDFS), object storage (such as S3), or message queues (Kafka, RabbitMQ), and it handles small-scale, high-frequency data queries with database or cache systems.

6.3.7. Fault tolerance and recovery module

The fault tolerance and recovery module continuously monitors the heartbeat, load, and network status of each node while the system is running. Once an anomaly is detected, the fault information is reported to the distributed unified collaboration module, and the automatic fault tolerance logic is triggered. The inputs are real-time cluster health data, task execution logs, and node failure reports. The output is a series of decision instructions, including restarting tasks, reallocating resources, or rolling back to the last stable snapshot. This often involves self-healing driven by automation scripts (Ansible, SaltStack, etc.) or the cluster orchestrator (Kubernetes), and may also include checkpointing the training process: when a crash occurs, the current iteration number and intermediate parameters are recorded, and execution resumes once the node recovers.
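
A minimal sketch of the checkpoint-and-resume path mentioned above follows. The checkpoint path, snapshot interval, and toy model are assumptions; a production system would write snapshots to replicated storage managed through the data management module.

   # Sketch: periodic snapshots plus resume-from-last-snapshot on restart.
   import os
   import torch
   import torch.nn as nn

   CKPT = "/tmp/dmsc_checkpoint.pt"        # assumed shared/replicated path
   model = nn.Linear(8, 2)
   optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

   start_step = 0
   if os.path.exists(CKPT):                # restarted node: roll back
       snap = torch.load(CKPT)
       model.load_state_dict(snap["model"])
       optimizer.load_state_dict(snap["optimizer"])
       start_step = snap["step"] + 1

   for step in range(start_step, 100):
       optimizer.zero_grad()
       loss = model(torch.randn(4, 8)).pow(2).mean()
       loss.backward()
       optimizer.step()
       if step % 10 == 0:                  # periodic stable snapshot
           torch.save({"step": step,
                       "model": model.state_dict(),
                       "optimizer": optimizer.state_dict()}, CKPT)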

6.3.8. Computing resource pool

A computing resource pool represents a collection of the underlying hardware that actually provides computing power. Each pool may correspond to a different type or specification of hardware, such as GPU server farms, CPU clusters, FPGA/ASIC accelerator cards, or even hybrid computing power spanning cloud and local data centers. Their inputs are usually task assignments and model execution requirements from the load balancing and resource allocation mechanism; their outputs are the computed inference results or training outputs, along with relevant performance indicators (such as temperature, power consumption, and throughput) that are fed back to the upper modules for analysis.

6.4. Data layer

The data layer is the backbone of distributed AI systems, enabling efficient data management while ensuring privacy protection, scalability, and seamless integration with other layers, including control, computing, and business layers. It plays a pivotal role in storing, transmitting, and processing diverse datasets, supporting distributed training, inference, and model segmentation workflows. Through its robust design, the data layer balances security and performance while maintaining the flexibility required by dynamic, large-scale AI systems.

6.4.1. Privacy protection

Privacy protection is at the core of the data layer, ensuring secure data handling across the entire AI workflow. Multiple databases (e.g., DB1, DB2, ..., DBn) store datasets from various business domains or sensitivity levels, enabling the system to manage and segregate data efficiently. For high-sensitivity scenarios, such as healthcare or financial applications, only encrypted or desensitized data fields are stored and transmitted. For instance, patient medical records might be encrypted locally, and only aggregated gradients or anonymized insights are shared during federated learning tasks.

When the system executes model training or inference, the control layer determines the appropriate data transmission strategy based on predefined privacy policies. Federated learning ensures that raw data remains localized, sharing only intermediate model gradients or parameters, while differential privacy adds noise to data or computations to prevent individual information leakage.

To further strengthen security, the data layer integrates advanced privacy-preserving technologies, such as homomorphic encryption, multi-party secure computation, and differential privacy injection. These techniques enable micro-models and segmented workflows to process data securely while complying with privacy regulations. For instance, in a cross-database integration scenario, the data layer ensures that access control policies and metadata updates prevent unauthorized sharing of sensitive data, maintaining compliance without hindering system performance.
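
As an illustration of the differential-privacy step, the sketch below clips a local update and adds Gaussian noise before it leaves the data owner's environment. The clipping norm and noise multiplier are hypothetical parameters, not values prescribed by this document.

   # Sketch: clip and perturb a local update before sharing it.
   import torch

   def privatize(update, clip_norm=1.0, noise_multiplier=1.1):
       norm = update.norm().item()
       clipped = update * min(1.0, clip_norm / (norm + 1e-12))   # L2 clipping
       noise = torch.normal(0.0, noise_multiplier * clip_norm,
                            size=clipped.shape)
       return clipped + noise

   local_update = torch.randn(10)       # e.g. a local gradient slice
   shared = privatize(local_update)     # only this noisy update is shared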

6.4.2. Database maintenance and update

The data layer's database infrastructure ensures reliable storage, high availability, and scalability, supporting the execution of micro-models and model segmentation workflows. Distributed databases are deployed to manage datasets associated with various system segments, enabling parallel operations and efficient data provisioning for training and inference tasks.

To handle high-concurrency environments, the data layer leverages distributed database architectures such as NoSQL, NewSQL, and relational databases, each selected based on the nature of the workload:

NoSQL databases (e.g., HBase, Cassandra) are ideal for handling unstructured or semi-structured data, such as logs and user behavior data, offering high write throughput and horizontal scalability.

NewSQL systems (e.g., TiDB) provide a hybrid solution, balancing transactional consistency with scalability, making them suitable for workloads requiring real-time updates, such as model parameter synchronization.

Relational databases (e.g., MySQL, PostgreSQL) handle structured datasets, such as model version histories or feature engineering outputs, ensuring strong consistency and query efficiency.

The data layer ensures data consistency and fault tolerance through mechanisms such as master-slave replication, shard-based architectures, and automated failover. For example, if a database shard responsible for storing training gradients becomes unavailable, the system redirects queries to backup replicas or initiates a failover process to restore service. Regular incremental backups and disaster recovery protocols safeguard critical data against long-term loss due to network or hardware failures.

Real-time monitoring tools, such as Prometheus and ELK Stack, track database performance metrics, including query latency, synchronization delays, and disk usage. If anomalies are detected, automated alerts trigger recovery actions such as reallocating workloads, rerouting queries, or scaling database resources to prevent bottlenecks. For instance, during a high-demand scenario like a shopping festival, the data layer may dynamically scale up storage resources to accommodate surging user activity logs, ensuring uninterrupted data availability for recommendation models.

7. IANA Considerations

TBD

8. Acknowledgement

TBD

9. References

9.1. Normative References

[RFC2119]
Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March 1997, <https://www.rfc-editor.org/info/rfc2119>.

Authors' Addresses

Hui Yang
Beijing University of Posts and Telecommunications
10 Xitucheng Road, Haidian District
Beijing
Beijing, 100876
China
Tiankuo Yu
Beijing University of Posts and Telecommunications
10 Xitucheng Road, Haidian District
Beijing
Beijing, 100876
China
Qiuyan Yao
Beijing University of Posts and Telecommunications
10 Xitucheng Road, Haidian District
Beijing
Beijing, 100876
China
Zepeng Zhang
Beijing University of Posts and Telecommunications
10 Xitucheng Road, Haidian District
Beijing
Beijing, 100876
China