Internet-Draft | IPv6 Packet Marking | July 2024 |
Carder, et al. | Expires 4 January 2025 | [Page] |
This document describes an experimentally deployed approach currently used within the Worldwide Large Hadron Collider Computing Grid (WLCG) to mark packets with their project (experiment) and application. The marking uses the 20-bit IPv6 Flow Label in each packet, with 15 bits used for semantics (community and activity) and 5 bits for entropy. Alternatives, in particular use of IPv6 Extension Headers (EH), were considered but found to not be practical. The WLCG is one of the largest worldwide research communities and has adopted IPv6 heavily for movement of many hundreds of PB of data annually, with the ultimate goal of running IPv6 only.¶
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.¶
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.¶
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."¶
This Internet-Draft will expire on 4 January 2025.¶
Copyright (c) 2024 IETF Trust and the persons identified as the document authors. All rights reserved.¶
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Revised BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Revised BSD License.¶
High Energy Physics (HEP) experiments such as those using the Large Hadron Collider, as well as many similar data intensive global science domains, rely on networks as one of the critical components of their infrastructure both within the laboratories as well as globally to interconnect participating sites, data centers and experiment instrumentation.¶
The Worldwide Large Hadron Collider Computing Grid (WLCG) as a specific (and very large) example of HEP research infrastructure supports multiple CERN experiments, with a reported 200PB of data generated annually and distributed to over 170 computing centers in 42 countries. As a massively distributed infrastructure with approximately 1.4 million cpu cores and 1.5 exabytes of storage, WLCG makes use of Research and Education (R&E) networks which have been highly engineered to handle this as well as other data-intensive sciences. Within the connected R&E networks, WLCG further makes use of the Large Hadron Collider Optical Private Network (LHCOPN) consisting of dedicated physical and virtual links, as well as a global-scale L3VPN overlay called the Large Hadron Collider Open Network Environment (LHCONE) which provides additional dedicated resources and segmentation from other R&E traffic.¶
IPv6 is used heavily by the WLCG, with over 90% of the main storage facilities now supporting it, and a significant percentage of traffic flows being IPv6. The ultimate goal is to run the WLCG IPv6-only. While WLCG transfers may aggregate to hundreds of Gbit/s, the constituent flows are usually not that large, a few hundred Mbit/s or very low Gbit/s. Large 5-10G+ flows are unusual.¶
Analyzing the pattern of traffic flows in detail is critical for understanding how the various complex systems developed are actually using the network. The motivation for the use of packet marking is to label traffic to indicate the user community and application workflow it is a part of so that the purpose of data transfers may be understood. This capability is especially important for sites which support many simultaneous experiments' workflows where any worker node or storage system may quickly change between different users. With a standardized way of marking traffic, any intermediate network or end-site could quickly provide detailed visibility into the nature of the HEP traffic running to and from their site.¶
Backbone networks may also use this metadata in order to summarize traffic as belonging to certain science experiments and their applications. HEP user communities may then use the data provided by participating backbone networks to characterize the scientific workloads running at global scale, measuring for example the impact of tradeoffs between storage and workload placement, or to examine that scarce resources such as undersea cables are used efficiently.¶
While the initial rationale for the packet marking was better understanding of the flow of traffic belonging to certain experiments around Research and Education (R&E) networks, there is also the potential for traffic to be steered by its Flow Label value and some early implementations are exploring this.¶
This document describes a packet marking scheme currently being applied and tested within the WLCG community, but the approach is extensible (given the number of bits available to mark applications and experiments) to other HEP and R&E communities. To accommodate such future use cases, we refer in the remainder of this document to activities and communities rather than applications and experiments.¶
A classic network flow is defined as a five tuple, i.e. source IP, destination IP, source port, destination port and protocol (TCP, UDP, ...). The packet marking is intended to complement the five tuple by denoting the packet owner community and the traffic type (application). One application may source multiple network flows for example from multiple source ports or to multiple destination IPs but for accounting purposes they may all be of the same application "type" of traffic (activity) and corresponding to the same owner (community), and inherently asking to be treated the same by the network. The applications would have, as part of their configuration, the owner and the type of traffic marking to set. A given host may be running multiple such applications.¶
Summarization of this data is expected to be coarse. A set of applications working on the same activity on different hosts would likely all use the same packet marking. Traffic "type" needs to be defined and agreed upon within a specific user community, the set of application owners, or users, need to be agreed upon within a limited domain. But it would be considered normal for multiple network flows (in the five tuple sense) to share a common marking if they belong to the same community and activity.¶
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.¶
The format of the IPv6 packet header is described in Section 3 of [RFC8200], and includes the 20 bit IPv6 Flow Label field.¶
The packet marking approach uses all 20 bits of the flow label field available in the IPv6 header.¶
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |Version| Traffic Class | Flow Label | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Payload Length | Next Header | Hop Limit | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+¶
The packet marking has the following characteristics and subfields, containing the activity and community identifiers encoded as bits in the Flow label in the following way:¶
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+ |Flow Label Bits | |01|02|03|04|05|06|07|08|09|10|11|12|13|14|15|16|17|18|19|20| +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+ |E | E| C| C| C| C| C| C| C| C| C| E| A| A| A| A| A| A| E| E| +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+¶
The flow label is set on each packet that is sent by a given activity. Network flows belonging to the same community and activity may thus have 32 different flow label values.¶
As the initial work is to be applicable within the global R&E user community, the majority of the bits available are used to indicate the science community (owner) and, therefore, fewer bits are available to denote traffic type (activity).¶
A registry, known as [SciTags] has been proposed as an authoritative reference for agreed community and activity values for the marking. For the current experimental use, this is effectively operating as a centralized resource and API. Future work may include a more complex system for broader distribution.¶
A given intermediate network capturing the flow data doesn't necessarily need to decode this information, as it's only truly relevant to the end sites using this scheme.¶
Part of the reason for documenting this use of the IPv6 Flow Label was to note that, at least in the domain of certain HEP research networks, the IPv6 Flow Label is not being used exactly as specified, and to record the reason why.¶
Section 6 of [RFC8200] states that "the 20-bit Flow Label field in the IPv6 header is used by a source to label sequences of packets to be treated in the network as a single flow".¶
Section 3 of [RFC6437] states that "It is therefore RECOMMENDED that source hosts support the flow label by setting the flow label field for all packets of a given flow to the same value chosen from an approximation to a discrete uniform distribution" and that the algorithm (for setting the Flow Label value) "SHOULD ensure that the resulting flow label values are unique with high probability."¶
Section 1 of [RFC6437] further adds that "a specific goal is to enable and encourage the use of the flow label for various forms of stateless load distribution, especially across Equal Cost Multi-Path (ECMP) and/or Link Aggregation Group (LAG) paths."¶
In this packet marking scheme, all traffic belonging to the same community and activity will carry a flow label with 15 fixed, common bits and 5 varying (entropy) bits. Given use of the Flow label as described above should use 20 entropy bits (with a uniform distribution), it is not the case here that the flow label values will be unique with such a high probability, i.e., 1 in 32 network flows will in principle be unique rather than around 1 in a million.¶
The 5 entropy bits are used to still support a level of conformance with the requirement stated in RFC 6437 to support traffic distribution in ECMP and LAG scenarios. The number of bits chosen is a tradeoff between the number of bits available for the community and activity labeling and the number of entropy bits. Section 1.2 of [RFC6437] specifically quotes Section 2 of [RFC3697] and is worth restating here. "Router performance SHOULD NOT be dependent on the distribution of the Flow Label values. Especially, the Flow Label bits alone make poor material for a hash key." Section 3 of [RFC6438] clarifies that intermediate routers using ECMP or LAG "MUST minimally include the 3-tuple {dest addr, source addr, flow label}" and recommends additional sources of entropy to be considered. Additional clarifications are also made for tunneled traffic. In neither case is the flow label exclusively used.¶
As packets are marked using the IPv6 flow label, it is possible for intermediate routers to sample traffic in the forwarding hardware and send this data off to central collectors for analysis. In many network environments, the standard approach is for a hardware-specific implementation on a router to sample the traffic and use the IPFIX protocol [RFC7011] to send the sampled data to a collector. Section 5.4.21 of [RFC5102] defines field #31 for carrying the IPv6 Flow Label information, and major router hardware and collector software implementations are known to support this.¶
Some hardware platforms, primarily with a lineage more firmly rooted in switching vs routing, support traffic sampling via sflow [RFC3176]. Unlike IPFIX, the traffic is not summarized by the router/switch, but a significant part of the sampled raw packet is encapsulated and sent to the collector for analysis. sFlow datagrams include one or more packet flow records which in turn include the original datagram header [SFlow]. Individual fields such as the IPv6 flow label are able to be collected.¶
Traffic mirroring and/or optical taps can also be used to copy raw traffic to a server for analysis. The data rates, number of links, and power availability to run servers for large scale collection may make traditional packet capture and analysis impractical in many environments such as international R&E networks, though there has been initial success with the P4 implementation running on the FPGA platform deployed throughout ESnet (the US Energy Sciences network).¶
Corresponding to the expectations in Section 4 of [RFC6436], a brief, unscientific sampling of non-MPLS encapsulated traffic collected via IPFIX on ESnet does show that there is a mix of [RFC3697] compliant hosts where all-zero flow labels are used, as well as updated [RFC6437] compliant hosts that by default choose uniformly distributed labels between 1 and 0xFFFFF. A traffic analysis system may need to know which specific endpoints are using the packet marking meaning of the flow label and that the field's values are relevant. As the deployments are for rather narrow accounting use cases within specific user communities, it has been practical to match for known flow labels vs trying to keep the accounting state for 2^20 possible labels in use for each link of the network.¶
If there are concerns about preserving entropy and reducing the possible collisions with the standard use of the IPv6 Flow Label, we could potentially use the "entropy" bits defined above to instead calculate a Hamming Code. A Hamming Code calculates a set of Parity Bits to be used to extend a set of Message (Data) Bits, that will maximize the number of bits that are different between "valid" messages. This may better support existing use of the flow label for ECMP as described in [RFC6437]. This suggestion is currently open for further discussion.¶
Extension headers are known to be problematic in that they have a history of being filtered or dropped in transit, as measured in [RFC7872] with substantial further discussion in [RFC9098]. In our testing, these issues are no less common in Research and Education networks. As an example, [RFC9343] defines an alternate marking encoding for use in either hop-by-hop or destination options headers. [RFC7837] defines a marking in support of congestion control (ConEx), and [RFC8250] is a Standards Track document that defines a destination option for Performance and Diagnostic Metrics (PDM) for IPv6. There is also this draft defining an option [I-D.ietf-ippm-ioam-ipv6-options], for the use case of carrying OAM information.¶
The Destination option header could therefore be a logical choice to place activity-specific telemetry identifiers, as there is less of a constraint on space than the IPv6 Flow Label, less history of defined pre-existing intentions from the standards body, and low deployed usage on the Internet. However, at present, the linux implementation in particular requires either setuid 0 or CAP_NET_RAW capability to be able to call setsockopt(s, IPPROTO_IPV6, IPV6_DSTOPTS, ext_hdr_p, ext_hdr_size), making it unusable by typical userspace activities. There has been a set of patches made that could address this as well as extend the functionality, though they have not been met with support from the linux network maintainers. Additionally, extracting that field by intermediate routers and exporting it via IPFIX may be further subject to lack of support compared to the fixed field and known position of the flow label.¶
While in principle it's possible, it is less practical to use a Hop-by-Hop option, for the reasons discussed in [I-D.krishnan-ipv6-hopbyhop]. However, there is a recent example of its use in [RFC9268] where a host can signal this option, routers will not process it unless configured to do so, and if not, they may well drop the packet according to Section 4.8 of [RFC8200].¶
Given the size of IPv6 addresses, it is possible to mark or "color" packets by using specific site network prefixes (within a site /64) or values in (a part of) the host identifier part of an address (typically 64 bits). Hosts already currently use multiple IPv6 source addresses. Applications supporting specific activities would need to bind sockets to the correct source address, per flow, corresponding to the accounting details to be conveyed. Dispatching computation jobs into a high-throughput computational cluster along with network-specific metadata has for example been explored in [Lark].¶
Hosts serving different communities and activities would need multiple addresses, one for each possible, configured in advance of an activity's application requiring it. Adding an IP address onto a host requires root level access to a system and is typically not available as a dynamic function available for userspace. There also may be limits on the number of source addresses able to be concurrently configured, so a garbage collection process may need to deprovision addresses no longer in use. This dynamic use of source addresses also may cause operational issues around access-control list management, and security implementations at a site.¶
The use of marked source and destination addresses in communications could facilitate the routing of packets in different routing domains (or VPNs), if needed. Unfortunately, depending on the position of the marking in the address, it may not be possible to use it for policy routing, since very few network hardware implement bitmask packet matching for IPv6, leaving this likely feasible for host-initiated tunnels.¶
Marking in the payload has been considered to be out of scope given the prevalence of TLS/SSL/etc, which means that payloads cannot be inspected on path.¶
A recently published IETF personal draft documents the concept of "Network Tokens", see [I-D.yiakoumis-network-tokens].¶
"A network token is a small piece of data that end users attach to their packets. As packets flow through the network, intermediate nodes MAY detect tokens, interpret them, and apply the desired service to the packets that carry them (and possibly to all other packets from the same flow). For example, a token might just state the name of an activity's application that a packet originates from." The draft proposes a 28-bit token ID field and discusses multiple mechanisms for tokens to be conveyed. [RFC9419] puts this work into a broader context.¶
[Firefly] packets are an approach to mark flows by sending separate telemetry packets alongside activity traffic from the source to the same destination node, but always to a specific destination port. These packets are large enough to contain rich metadata about the flow, formatted as a json payload carried in syslog. The firefly packets can be collected en route by participating networks, by the end host, or sent to a central collector.¶
While the packet marking approach described in this document is IPv6-specific, as it uses the IPv6 Flow Label field, fireflies can be used for IPv4 or IPv6 network flows, to mark flows. A firefly would typically be sent at the start and end of each flow.¶
[RFC7098] describes use of the flow label in server load balancing environments. Minimally a 2-tuple {source address, flow label} could be used as hash input for a basic implementation. If this is not adequate in practice, it very well might be that a load balancer must process the packet deeper to examine a larger n-tuple and/or watch state to more aggressively tear down stale sessions. While use of server load balancers are currently rare in our specific operating environment and there is typically a large number of source hosts with many small flows, any potential future use of load balancers should still carefully examine the adequacy of the implementation for the given mix of traffic.¶
[PLB] (Protective Load Balancing) is a mechanism where hosts experiencing congestion on a flow change the IPv6 Flow Label. Devices in the network path between the hosts use the flow label as part of the entropy in hashing traffic for load balancing across multiple links such as with ECMP or LACP. When the flow label is changed, the flow is rehashed by the devices onto a new set of links, presumably avoiding the congestion hotspot. This capability, targeted at datacenter or limited-domain use along with explicit congestion signaling, is a feature available in the Linux kernel configurable via a sysctl and disabled by default. Hosts with applications implementing flow label marking should avoid use of PLB, though it could be theoretically feasible to manipulate the reserved entropy bits for a comparable outcome.¶
This memo includes no request to IANA.¶
The security considerations in Section 6 of [RFC6437] still apply. It states that "third parties should be unlikely to be able to guess the next value that a source of flow labels will choose", but this use case specifically requires common marking for the majority of the bits for a specific pairing of community and activity.¶
A related consideration is that well-known flow labels could further encourage pervasive monitoring attacks described in [RFC7258], but our use case for the flow labels is to intentionally permit monitoring use cases. This use of the flow label is directly controlled by the end hosts choosing to participate.¶
Members of the Worldwide LHC Computing Grid (WLCG) Research Networking Technical Working Group: Garhan Attebury (UNL) Marian Babik (CERN), Joe Breen (Univ of Utah), Eric Brown (Virginia Tech), Dale W. Carder (LBNL/ESnet), Tim Chown (Jisc), Eli Dart (LBNL/ESnet), Mariam Kiran (ESnet), Michael Lambert (PSC), Mario Lassnig (CERN), Jason Lomonaco (Internet2), Joe Mambretti (StarLight, iCAIR NU, MREN), Edoardo Martelli (CERN), Shawn McKee (Michigan), Karl Newell (Internet2), Casey Russell (KanREN), Marcos Schwarz (RNP), Pavlo Svirin (CERN/UTA), Fatema Bannat Wala (LBNL/ESNet), Alexandr Zaytsev (BNL)¶
The V6OPS working group and comments from Brian E. Carpenter and Tom Herbert.¶