Use of the IPv6 Flow Label for WLCG Packet Marking

Internet-Draft	IPv6 Packet Marking	July 2024
Carder, et al.	Expires 4 January 2025

Abstract

This document describes an experimentally deployed approach currently used within the Worldwide Large Hadron Collider Computing Grid (WLCG) to mark packets with their project (experiment) and application. The marking uses the 20-bit IPv6 Flow Label in each packet, with 15 bits used for semantics (community and activity) and 5 bits for entropy. Alternatives, in particular the use of IPv6 Extension Headers (EH), were considered but found not to be practical. The WLCG is one of the largest worldwide research communities and has adopted IPv6 heavily for the movement of many hundreds of PB of data annually, with the ultimate goal of running IPv6-only.¶

1. Introduction

High Energy Physics (HEP) experiments such as those using the Large Hadron Collider, as well as many similar data-intensive global science domains, rely on networks as one of the critical components of their infrastructure, both within the laboratories and globally, to interconnect participating sites, data centers and experiment instrumentation.¶

1.1. About the Worldwide Large Hadron Collider Computing Grid (WLCG)

The Worldwide Large Hadron Collider Computing Grid (WLCG), a specific (and very large) example of HEP research infrastructure, supports multiple CERN experiments, with a reported 200 PB of data generated annually and distributed to over 170 computing centers in 42 countries. As a massively distributed infrastructure with approximately 1.4 million CPU cores and 1.5 exabytes of storage, WLCG makes use of Research and Education (R&E) networks, which have been highly engineered to handle this as well as other data-intensive sciences. Within the connected R&E networks, WLCG further makes use of the Large Hadron Collider Optical Private Network (LHCOPN), consisting of dedicated physical and virtual links, as well as a global-scale L3VPN overlay called the Large Hadron Collider Open Network Environment (LHCONE), which provides additional dedicated resources and segmentation from other R&E traffic.¶

IPv6 is used heavily by the WLCG, with over 90% of the main storage facilities now supporting it, and a significant percentage of traffic flows being IPv6. The ultimate goal is to run the WLCG IPv6-only. While WLCG transfers may aggregate to hundreds of Gbit/s, the constituent flows are usually not that large, typically a few hundred Mbit/s to a few Gbit/s. Individual flows of 5-10 Gbit/s or more are unusual.¶

1.2. The rationale for Packet Marking

Analyzing the pattern of traffic flows in detail is critical for understanding how the various complex systems developed are actually using the network. The motivation for the use of packet marking is to label traffic to indicate the user community and application workflow it is a part of so that the purpose of data transfers may be understood. This capability is especially important for sites which support many simultaneous experiments' workflows where any worker node or storage system may quickly change between different users. With a standardized way of marking traffic, any intermediate network or end-site could quickly provide detailed visibility into the nature of the HEP traffic running to and from their site.¶

Backbone networks may also use this metadata in order to summarize traffic as belonging to certain science experiments and their applications. HEP user communities may then use the data provided by participating backbone networks to characterize the scientific workloads running at global scale, for example to measure the impact of tradeoffs between storage and workload placement, or to verify that scarce resources such as undersea cables are used efficiently.¶

While the initial rationale for the packet marking was better understanding of the flow of traffic belonging to certain experiments around Research and Education (R&E) networks, there is also the potential for traffic to be steered by its Flow Label value and some early implementations are exploring this.¶

1.3. Packet Marking and Network Flows

This document describes a packet marking scheme currently being applied and tested within the WLCG community, but the approach is extensible (given the number of bits available to mark applications and experiments) to other HEP and R&E communities. To accommodate such future use cases, we refer in the remainder of this document to activities and communities rather than applications and experiments.¶

A classic network flow is defined as a five-tuple, i.e., source IP address, destination IP address, source port, destination port and protocol (TCP, UDP, ...). The packet marking is intended to complement the five-tuple by denoting the packet owner community and the traffic type (application). One application may source multiple network flows, for example from multiple source ports or to multiple destination IP addresses, but for accounting purposes they may all be of the same application "type" of traffic (activity) and correspond to the same owner (community), and are thus inherently asking to be treated the same by the network. The applications would have, as part of their configuration, the owner and the type of traffic marking to set. A given host may be running multiple such applications.¶

Summarization of this data is expected to be coarse. A set of applications working on the same activity on different hosts would likely all use the same packet marking. The traffic "type" needs to be defined and agreed upon within a specific user community, and the set of application owners, or users, needs to be agreed upon within a limited domain. It would, however, be considered normal for multiple network flows (in the five-tuple sense) to share a common marking if they belong to the same community and activity.¶
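As an illustration of this coarse summarization, the following minimal Python sketch groups flow records by their (community, activity) marking and sums their byte counts. The record layout, the field names and the example values are assumptions made purely for illustration; they are not taken from any WLCG tool or from the [SciTags] registry.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class FlowRecord(NamedTuple):
    """One observed network flow: the five-tuple plus its marking.

    Hypothetical record layout used only for this sketch.
    """
    src: str
    dst: str
    sport: int
    dport: int
    proto: int
    community: int
    activity: int
    octets: int


def summarize(records: Iterable[FlowRecord]) -> dict:
    """Sum octets per (community, activity), ignoring the five-tuple."""
    totals = Counter()
    for rec in records:
        totals[(rec.community, rec.activity)] += rec.octets
    return dict(totals)


# Illustrative values only: three distinct five-tuples, two markings.
records = [
    FlowRecord("2001:db8:1::10", "2001:db8:2::20", 40001, 1094, 6, 1, 1, 1000),
    FlowRecord("2001:db8:1::11", "2001:db8:3::30", 40002, 1094, 6, 1, 1, 500),
    FlowRecord("2001:db8:1::10", "2001:db8:2::20", 40003, 1094, 6, 2, 3, 250),
]
print(summarize(records))  # {(1, 1): 1500, (2, 3): 250}
```

The first two records are different network flows in the five-tuple sense, yet they are accounted for together because they carry the same marking.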

1.4. Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.¶

2. Use of the IPv6 Flow Label

The format of the IPv6 packet header is described in Section 3 of [RFC8200], and includes the 20-bit IPv6 Flow Label field.¶

2.1. Setting the Flow Label bits

The packet marking approach uses all 20 bits of the flow label field available in the IPv6 header.¶

+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|Version| Traffic Class |           Flow Label                  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|         Payload Length        |  Next Header  |   Hop Limit   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The packet marking encodes the activity and community identifiers as subfields of the Flow Label in the following way:¶

+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|Flow Label Bits                                            |
|01|02|03|04|05|06|07|08|09|10|11|12|13|14|15|16|17|18|19|20|
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
| E| E| C| C| C| C| C| C| C| C| C| E| A| A| A| A| A| A| E| E|
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+

The activity identifier uses 6 bits (A) and is encoded in bits 13-18.¶
The entropy bits (E) are the 5 bits in positions 1, 2, 12, 19 and 20; they are set at random once per network flow and retained for the duration of its lifetime.¶
The community identifier uses 9 bits (C) and is encoded in bits 3-11; these bits are used in reversed order to allow for possible future adjustments of the bit boundary.¶

The flow label is set on each packet that is sent by a given activity. Network flows belonging to the same community and activity may thus have 32 different flow label values.¶
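The bit layout above can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation, and it rests on two assumptions that the text does not spell out: that bit 01 in the figure is the most significant bit of the 20-bit field, and that "reversed order" means the 9-bit community value is bit-reversed before being placed in bits 3-11. The function names and the way a 5-bit entropy value is spread over the five E positions are likewise hypothetical.

```python
import secrets

COMMUNITY_BITS = 9
ACTIVITY_BITS = 6
ENTROPY_BITS = 5


def reverse_bits(value: int, width: int) -> int:
    """Reverse the order of the lowest `width` bits of `value`."""
    result = 0
    for _ in range(width):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def encode_flow_label(community: int, activity: int, entropy=None) -> int:
    """Build a 20-bit flow label (assumes bit 01 of the figure is the MSB)."""
    if not 0 <= community < (1 << COMMUNITY_BITS):
        raise ValueError("community must fit in 9 bits")
    if not 0 <= activity < (1 << ACTIVITY_BITS):
        raise ValueError("activity must fit in 6 bits")
    if entropy is None:
        # Chosen once per network flow and then kept for its lifetime.
        entropy = secrets.randbits(ENTROPY_BITS)
    if not 0 <= entropy < (1 << ENTROPY_BITS):
        raise ValueError("entropy must fit in 5 bits")
    label = (entropy >> 3) << 18                                # bits 1-2
    label |= reverse_bits(community, COMMUNITY_BITS) << 9       # bits 3-11
    label |= ((entropy >> 2) & 1) << 8                          # bit 12
    label |= activity << 2                                      # bits 13-18
    label |= entropy & 0b11                                     # bits 19-20
    return label


def decode_flow_label(label: int):
    """Return (community, activity, entropy) from a 20-bit flow label."""
    community = reverse_bits((label >> 9) & 0x1FF, COMMUNITY_BITS)
    activity = (label >> 2) & 0x3F
    entropy = (((label >> 18) & 0b11) << 3) | (((label >> 8) & 1) << 2) | (label & 0b11)
    return community, activity, entropy


print(hex(encode_flow_label(1, 1, entropy=0)))  # 0x20004
```

Iterating the entropy argument over 0..31 for a fixed community and activity yields the 32 distinct flow label values mentioned above.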

As the initial work is to be applicable within the global R&E user community, the majority of the bits available are used to indicate the science community (owner) and, therefore, fewer bits are available to denote traffic type (activity).¶

2.2. The SciTags registry

A registry, known as [SciTags], has been proposed as an authoritative reference for agreed community and activity values for the marking. For the current experimental use, this is effectively operating as a centralized resource and API. Future work may include a more complex system for broader distribution.¶

A given intermediate network capturing the flow data doesn't necessarily need to decode this information, as it's only truly relevant to the end sites using this scheme.¶

2.3. Deviation from IPv6 Specifications

Part of the reason for documenting this use of the IPv6 Flow Label was to note that, at least in the domain of certain HEP research networks, the IPv6 Flow Label is not being used exactly as specified, and to record the reason why.¶

Section 6 of [RFC8200] states that "the 20-bit Flow Label field in the IPv6 header is used by a source to label sequences of packets to be treated in the network as a single flow".¶

Section 3 of [RFC6437] states that "It is therefore RECOMMENDED that source hosts support the flow label by setting the flow label field for all packets of a given flow to the same value chosen from an approximation to a discrete uniform distribution" and that the algorithm (for setting the Flow Label value) "SHOULD ensure that the resulting flow label values are unique with high probability."¶

Section 1 of [RFC6437] further adds that "a specific goal is to enable and encourage the use of the flow label for various forms of stateless load distribution, especially across Equal Cost Multi-Path (ECMP) and/or Link Aggregation Group (LAG) paths."¶

In this packet marking scheme, all traffic belonging to the same community and activity will carry a flow label with 15 fixed, common bits and 5 varying (entropy) bits. The use of the Flow Label described in [RFC6437] assumes 20 entropy bits with a uniform distribution. That is not the case here, so flow label values will not be unique with such a high probability: for a given community and activity only 2^5 = 32 distinct values are available, so two such flows share a flow label with a probability of 1 in 32 rather than around 1 in a million (2^20).¶
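The difference in uniqueness can be made concrete with a short calculation. The sketch below assumes that the entropy bits are drawn independently and uniformly at random for each flow; the function names are illustrative.

```python
def p_same_label(entropy_bits: int) -> float:
    """Probability that two independent flows draw the same entropy value."""
    return 1 / 2 ** entropy_bits


def p_all_distinct(n_flows: int, entropy_bits: int) -> float:
    """Probability that n_flows independent flows all get different values."""
    space = 2 ** entropy_bits
    if n_flows > space:
        return 0.0  # more flows than available values: a repeat is certain
    prob = 1.0
    for i in range(n_flows):
        prob *= (space - i) / space
    return prob


for bits in (5, 20):
    print(bits, p_same_label(bits), p_all_distinct(10, bits))
```

With 5 entropy bits, a repeated flow label among flows of the same community and activity is certain once more than 32 such flows are active at the same time.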

The 5 entropy bits retain a level of conformance with the requirement stated in [RFC6437] to support traffic distribution in ECMP and LAG scenarios. The number of bits chosen is a tradeoff between the number of bits available for the community and activity labeling and the number of entropy bits. Section 1.2 of [RFC6437] specifically quotes Section 2 of [RFC3697], which is worth restating here: "Router performance SHOULD NOT be dependent on the distribution of the Flow Label values. Especially, the Flow Label bits alone make poor material for a hash key." Section 3 of [RFC6438] clarifies that intermediate routers using ECMP or LAG "MUST minimally include the 3-tuple {dest addr, source addr, flow label}" and recommends that additional sources of entropy be considered. Additional clarifications are also made for tunneled traffic. In neither case is the flow label used exclusively.¶
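To illustrate why 5 entropy bits can still contribute to load distribution when the flow label is hashed together with the addresses, the following sketch selects one of several equal-cost paths from the 3-tuple {dest addr, source addr, flow label}. It is not a description of any router implementation: SHA-256 is used only as a convenient stand-in for a hardware hash, the function names are hypothetical, and the entropy bit positions assume that bit 01 of the Flow Label figure is the most significant bit.

```python
import hashlib
import ipaddress

# Values of the five entropy bits (positions 1, 2, 12, 19, 20), assuming
# bit 01 of the figure is the most significant bit of the 20-bit field.
ENTROPY_BIT_VALUES = (0x80000, 0x40000, 0x00100, 0x00002, 0x00001)


def entropy_variants(base_label: int) -> list:
    """All 32 labels that share base_label's community and activity bits."""
    labels = []
    for e in range(32):
        label = base_label
        for i, bit in enumerate(ENTROPY_BIT_VALUES):
            if (e >> i) & 1:
                label |= bit
        labels.append(label)
    return labels


def ecmp_path(src: str, dst: str, flow_label: int, n_paths: int) -> int:
    """Pick a path index from the 3-tuple (stand-in hash, not a real ASIC)."""
    key = (
        ipaddress.IPv6Address(dst).packed
        + ipaddress.IPv6Address(src).packed
        + flow_label.to_bytes(3, "big")
    )
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % n_paths


# Same hosts, same marking (entropy bits zero in the base label), 4 paths.
labels = entropy_variants(0x20004)
paths = [ecmp_path("2001:db8::1", "2001:db8::2", label, 4) for label in labels]
print(sorted(set(paths)))
```

Because the addresses are part of the hash key, flows between different host pairs are distributed even when they carry identical flow labels.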

2.4. Traffic inspection and collection on path

As packets are marked using the IPv6 flow label, it is possible for intermediate routers to sample traffic in the forwarding hardware and send this data off to central collectors for analysis. In many network environments, the standard approach is for a hardware-specific implementation on a router to sample the traffic and use the IPFIX protocol [RFC7011] to send the sampled data to a collector. Section 5.4.21 of [RFC5102] defines field #31 for carrying the IPv6 Flow Label information, and major router hardware and collector software implementations are known to support this.¶

Some hardware platforms, primarily those with a lineage more firmly rooted in switching than routing, support traffic sampling via sFlow [RFC3176]. Unlike IPFIX, the traffic is not summarized by the router/switch; instead, a significant part of the sampled raw packet is encapsulated and sent to the collector for analysis. sFlow datagrams include one or more packet flow records, which in turn include the original datagram header [SFlow]. Individual fields such as the IPv6 flow label can therefore be collected.¶
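When a collector receives the raw IPv6 header of a sampled packet, the flow label can be read from the first 32-bit word of the header as laid out in Section 3 of [RFC8200]. The following Python sketch shows this step only; parsing of the sFlow or IPFIX framing around the header is out of scope, and the function name is illustrative.

```python
import struct


def parse_ipv6_first_word(header: bytes):
    """Return (version, traffic_class, flow_label) from a raw IPv6 header."""
    if len(header) < 40:
        raise ValueError("an IPv6 fixed header is 40 bytes long")
    (word,) = struct.unpack("!I", header[:4])
    version = word >> 28                  # top 4 bits
    traffic_class = (word >> 20) & 0xFF   # next 8 bits
    flow_label = word & 0xFFFFF           # low 20 bits
    if version != 6:
        raise ValueError("not an IPv6 header")
    return version, traffic_class, flow_label


# Synthetic header: version 6, traffic class 0, flow label 0x20004,
# payload length 0, next header 59 (no next header), hop limit 64,
# followed by all-zero source and destination addresses.
sample = struct.pack("!IHBB", (6 << 28) | 0x20004, 0, 59, 64) + bytes(32)
print(parse_ipv6_first_word(sample))  # (6, 0, 131076)
```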

Traffic mirroring and/or optical taps can also be used to copy raw traffic to a server for analysis. The data rates, number of links, and power availability to run servers for large scale collection may make traditional packet capture and analysis impractical in many environments such as international R&E networks, though there has been initial success with the P4 implementation running on the FPGA platform deployed throughout ESnet (the US Energy Sciences Network).¶

2.5. Implications for traffic analysis

Corresponding to the expectations in Section 4 of [RFC6436], a brief, unscientific sampling of non-MPLS encapsulated traffic collected via IPFIX on ESnet does show that there is a mix of [RFC3697] compliant hosts where all-zero flow labels are used, as well as updated [RFC6437] compliant hosts that by default choose uniformly distributed labels between 1 and 0xFFFFF. A traffic analysis system may therefore need to know which specific endpoints use the flow label for packet marking, and thus for which traffic the field's values are meaningful. As the deployments are for rather narrow accounting use cases within specific user communities, it has been practical to match for known flow labels rather than trying to keep the accounting state for 2^20 possible labels in use for each link of the network.¶
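Matching for known flow labels can be sketched as a mask-and-lookup operation: the entropy bits are cleared and the remaining 15 bits are looked up in a small table of expected markings. The mask values below assume that bit 01 of the Flow Label figure is the most significant bit; the table contents and names are placeholders and are not values from the [SciTags] registry.

```python
# Entropy bits are positions 1, 2, 12, 19 and 20 of the 20-bit field
# (assumption: position 1 is the most significant bit).
ENTROPY_MASK = 0xC0103
SEMANTIC_MASK = 0xFFFFF & ~ENTROPY_MASK  # 0x3FEFC: community + activity bits

# Hypothetical table: semantic bits -> human-readable name.
KNOWN_MARKINGS = {
    0x20004: ("example-community", "example-activity"),
}


def classify(flow_label: int, known=KNOWN_MARKINGS):
    """Return the known marking for a flow label, or None if unknown."""
    return known.get(flow_label & SEMANTIC_MASK)


print(classify(0x20004 | ENTROPY_MASK))  # ('example-community', 'example-activity')
print(classify(0x00000))                 # None
```

A randomly chosen [RFC6437] flow label can match an entry in such a table by chance, which is one reason the analysis system may also need to know which endpoints participate in the marking.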

2.6. Additional Considerations

If there are concerns about preserving entropy and reducing the possible collisions with the standard use of the IPv6 Flow Label, the "entropy" bits defined above could potentially be used instead to carry a Hamming code. A Hamming code calculates a set of parity bits that extend a set of message (data) bits so that any two "valid" messages are guaranteed to differ in a minimum number of bit positions. This may better support existing use of the flow label for ECMP as described in [RFC6437]. This suggestion is currently open for further discussion.¶
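As a rough sketch of what such a calculation could look like, the following Python code derives 5 parity bits from the 15 semantic bits using the classic Hamming construction (parity bits at the power-of-two positions of a 20-bit codeword). This is only one possible construction under stated assumptions: the text does not define which code would be used or how the parity bits would map onto the five E positions, and the function name is hypothetical.

```python
# Classic Hamming layout over codeword positions 1..20: parity bits sit at
# the power-of-two positions, the 15 data bits fill the remaining ones.
PARITY_POSITIONS = (1, 2, 4, 8, 16)
DATA_POSITIONS = [p for p in range(1, 21) if p not in PARITY_POSITIONS]


def hamming_parity(data15: int) -> int:
    """Return the 5 parity bits (as an integer 0..31) for 15 data bits.

    Bit k of the result is the parity bit at codeword position 2**k; it
    equals the XOR of all data bits whose position has bit k set.
    """
    if not 0 <= data15 < (1 << 15):
        raise ValueError("data must fit in 15 bits")
    parity = 0
    for i, pos in enumerate(DATA_POSITIONS):
        if (data15 >> i) & 1:
            parity ^= pos
    return parity


print(hamming_parity(0b1))   # 3: the first data bit sits at position 3
```

With this construction any two distinct 15-bit data words yield 20-bit codewords that differ in at least 3 bit positions. The parity bits are a deterministic function of the data bits, so how this interacts with per-flow entropy remains part of the open discussion.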

3. Alternative packet marking approaches considered

3.1. IPv6 Hop-by-Hop or Destination Options

Extension headers are known to be problematic in that they have a history of being filtered or dropped in transit, as measured in [RFC7872], with substantial further discussion in [RFC9098]. In our testing, these issues are no less common in Research and Education networks. Several specifications nevertheless define markings carried in extension headers. For example, [RFC9343] defines an alternate marking encoding for use in either Hop-by-Hop or Destination Options headers, [RFC7837] defines a marking in support of congestion exposure (ConEx), and [RFC8250] is a Standards Track document that defines a destination option for Performance and Diagnostic Metrics (PDM) for IPv6. [I-D.ietf-ippm-ioam-ipv6-options] also defines an option, for the use case of carrying OAM information.¶

The Destination Options header could therefore be a logical choice to place activity-specific telemetry identifiers, as there is less of a constraint on space than with the IPv6 Flow Label, less history of defined pre-existing intentions from the standards body, and low deployed usage on the Internet. However, at present, the Linux implementation in particular requires either root privileges (setuid 0) or the CAP_NET_RAW capability to be able to call setsockopt(s, IPPROTO_IPV6, IPV6_DSTOPTS, ext_hdr_p, ext_hdr_size), making it unusable by typical userspace activities. A set of patches has been proposed that could address this as well as extend the functionality, though they have not been met with support from the Linux network maintainers. Additionally, extraction of that field by intermediate routers and its export via IPFIX may be less widely supported than for the flow label, which is a fixed field at a known position.¶

While in principle it is possible, it is less practical to use a Hop-by-Hop option, for the reasons discussed in [I-D.krishnan-ipv6-hopbyhop]. However, there is a recent example of its use in [RFC9268], where a host can signal this option; routers will not process it unless configured to do so, and routers that are not so configured may well drop the packet according to Section 4.8 of [RFC8200].¶

3.2. IPv6 Addresses as identifiers

Given the size of IPv6 addresses, it is possible to mark or "color" packets by using specific site network prefixes (within a site /64) or values in (a part of) the host identifier part of an address (typically 64 bits). Hosts already use multiple IPv6 source addresses. Applications supporting specific activities would need to bind sockets to the correct source address, per flow, corresponding to the accounting details to be conveyed. Dispatching computation jobs into a high-throughput computational cluster along with network-specific metadata has, for example, been explored in [Lark].¶
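As an illustration of marking in the host identifier part of an address, the following sketch builds a source address from a /64 prefix and a marking. The bit layout inside the 64-bit identifier (community in the upper 16 bits, activity in the next 8 bits, a host-local value in the remaining 40 bits) is entirely hypothetical and chosen only for this example; it is not a layout used by WLCG or described in [Lark].

```python
import ipaddress


def marked_address(prefix: str, community: int, activity: int, host_id: int):
    """Embed a marking in the interface identifier (hypothetical layout)."""
    net = ipaddress.IPv6Network(prefix)
    if net.prefixlen != 64:
        raise ValueError("this sketch expects a /64 prefix")
    if not 0 <= community < (1 << 9):
        raise ValueError("community must fit in 9 bits")
    if not 0 <= activity < (1 << 6):
        raise ValueError("activity must fit in 6 bits")
    if not 0 <= host_id < (1 << 40):
        raise ValueError("host_id must fit in 40 bits")
    iid = (community << 48) | (activity << 40) | host_id
    return net[iid]  # address at offset `iid` within the prefix


addr = marked_address("2001:db8:0:1::/64", community=1, activity=1, host_id=1)
print(addr)
```

An application would then need such an address to be configured on the host before it could bind a socket to it.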

Hosts serving different communities and activities would need multiple addresses, one for each possible combination, configured in advance of an activity's application requiring it. Adding an IP address to a host requires root-level access to the system and is typically not available as a dynamic function to userspace. There may also be limits on the number of source addresses that can be configured concurrently, so a garbage collection process may be needed to deprovision addresses no longer in use. This dynamic use of source addresses may also cause operational issues around access-control list management and security implementations at a site.¶

The use of marked source and destination addresses in communications could facilitate the routing of packets in different routing domains (or VPNs), if needed. Unfortunately, depending on the position of the marking in the address, it may not be possible to use it for policy routing, since very few network hardware platforms implement bitmask packet matching for IPv6, leaving this likely feasible only for host-initiated tunnels.¶

3.3. Marking in the Payload

Marking in the payload has been considered to be out of scope given the prevalence of TLS and similar encryption, which means that payloads cannot be inspected on path.¶

3.4. Network Tokens

An individual IETF Internet-Draft documents the concept of "Network Tokens"; see [I-D.yiakoumis-network-tokens].¶

"A network token is a small piece of data that end users attach to their packets. As packets flow through the network, intermediate nodes MAY detect tokens, interpret them, and apply the desired service to the packets that carry them (and possibly to all other packets from the same flow). For example, a token might just state the name of an activity's application that a packet originates from." The draft proposes a 28-bit token ID field and discusses multiple mechanisms for tokens to be conveyed. [RFC9419] puts this work into a broader context.¶

3.5. Firefly Packets for marking Network Flows

[Firefly] packets are an approach to mark flows by sending separate telemetry packets alongside activity traffic from the source to the same destination node, but always to a specific destination port. These packets are large enough to contain rich metadata about the flow, formatted as a JSON payload carried in syslog. The firefly packets can be collected en route by participating networks, by the end host, or sent to a central collector.¶

While the packet marking approach described in this document is IPv6-specific, as it uses the IPv6 Flow Label field, fireflies can be used to mark both IPv4 and IPv6 network flows. A firefly would typically be sent at the start and end of each flow.¶
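The general shape of such a telemetry message, a JSON body carried in a syslog-style frame, can be sketched as follows. The JSON field names, the application name and the hostname below are hypothetical placeholders and do not reproduce the schema defined by [Firefly]; the sketch only builds the message bytes and does not send them.

```python
import json
from datetime import datetime, timezone


def build_firefly(state, src, dst, sport, dport, community, activity,
                  hostname="host.example.org", now=None):
    """Return a syslog-framed message with a JSON body (illustrative only)."""
    if state not in ("start", "end"):
        raise ValueError("state must be 'start' or 'end'")
    now = now or datetime.now(timezone.utc)
    # Hypothetical field names; the real schema is defined by [Firefly].
    payload = {
        "state": state,
        "src": src,
        "dst": dst,
        "sport": sport,
        "dport": dport,
        "community": community,
        "activity": activity,
    }
    # <134> = facility local0 (16) * 8 + severity informational (6).
    header = f"<134>1 {now.isoformat()} {hostname} firefly - - - "
    return (header + json.dumps(payload, sort_keys=True)).encode("utf-8")


msg = build_firefly("start", "2001:db8::1", "2001:db8::2", 40000, 1094, 1, 1)
print(msg.decode("utf-8"))
```

A matching message with the state set to "end" would be built when the flow terminates.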

8. References

8.1. Normative References

[RFC2119]: Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March 1997, <https://www.rfc-editor.org/info/rfc2119>.
[RFC8174]: Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, May 2017, <https://www.rfc-editor.org/info/rfc8174>.
[RFC8200]: Deering, S. and R. Hinden, "Internet Protocol, Version 6 (IPv6) Specification", STD 86, RFC 8200, DOI 10.17487/RFC8200, July 2017, <https://www.rfc-editor.org/info/rfc8200>.
[RFC6437]: Amante, S., Carpenter, B., Jiang, S., and J. Rajahalme, "IPv6 Flow Label Specification", RFC 6437, DOI 10.17487/RFC6437, November 2011, <https://www.rfc-editor.org/info/rfc6437>.

8.2. Informative References

[RFC3176]: Phaal, P., Panchen, S., and N. McKee, "InMon Corporation's sFlow: A Method for Monitoring Traffic in Switched and Routed Networks", RFC 3176, DOI 10.17487/RFC3176, September 2001, <https://www.rfc-editor.org/info/rfc3176>.
[RFC3697]: Rajahalme, J., Conta, A., Carpenter, B., and S. Deering, "IPv6 Flow Label Specification", RFC 3697, DOI 10.17487/RFC3697, March 2004, <https://www.rfc-editor.org/info/rfc3697>.
[RFC5102]: Quittek, J., Bryant, S., Claise, B., Aitken, P., and J. Meyer, "Information Model for IP Flow Information Export", RFC 5102, DOI 10.17487/RFC5102, January 2008, <https://www.rfc-editor.org/info/rfc5102>.
[RFC6436]: Amante, S., Carpenter, B., and S. Jiang, "Rationale for Update to the IPv6 Flow Label Specification", RFC 6436, DOI 10.17487/RFC6436, November 2011, <https://www.rfc-editor.org/info/rfc6436>.
[RFC6438]: Carpenter, B. and S. Amante, "Using the IPv6 Flow Label for Equal Cost Multipath Routing and Link Aggregation in Tunnels", RFC 6438, DOI 10.17487/RFC6438, November 2011, <https://www.rfc-editor.org/info/rfc6438>.
[RFC7011]: Claise, B., Ed., Trammell, B., Ed., and P. Aitken, "Specification of the IP Flow Information Export (IPFIX) Protocol for the Exchange of Flow Information", STD 77, RFC 7011, DOI 10.17487/RFC7011, September 2013, <https://www.rfc-editor.org/info/rfc7011>.
[RFC7098]: Carpenter, B., Jiang, S., and W. Tarreau, "Using the IPv6 Flow Label for Load Balancing in Server Farms", RFC 7098, DOI 10.17487/RFC7098, January 2014, <https://www.rfc-editor.org/info/rfc7098>.
[RFC7258]: Farrell, S. and H. Tschofenig, "Pervasive Monitoring Is an Attack", BCP 188, RFC 7258, DOI 10.17487/RFC7258, May 2014, <https://www.rfc-editor.org/info/rfc7258>.
[RFC7837]: Krishnan, S., Kuehlewind, M., Briscoe, B., and C. Ralli, "IPv6 Destination Option for Congestion Exposure (ConEx)", RFC 7837, DOI 10.17487/RFC7837, May 2016, <https://www.rfc-editor.org/info/rfc7837>.
[RFC7872]: Gont, F., Linkova, J., Chown, T., and W. Liu, "Observations on the Dropping of Packets with IPv6 Extension Headers in the Real World", RFC 7872, DOI 10.17487/RFC7872, June 2016, <https://www.rfc-editor.org/info/rfc7872>.
[RFC8250]: Elkins, N., Hamilton, R., and M. Ackermann, "IPv6 Performance and Diagnostic Metrics (PDM) Destination Option", RFC 8250, DOI 10.17487/RFC8250, September 2017, <https://www.rfc-editor.org/info/rfc8250>.
[RFC9098]: Gont, F., Hilliard, N., Doering, G., Kumari, W., Huston, G., and W. Liu, "Operational Implications of IPv6 Packets with Extension Headers", RFC 9098, DOI 10.17487/RFC9098, September 2021, <https://www.rfc-editor.org/info/rfc9098>.
[RFC9268]: Hinden, R. and G. Fairhurst, "IPv6 Minimum Path MTU Hop-by-Hop Option", RFC 9268, DOI 10.17487/RFC9268, August 2022, <https://www.rfc-editor.org/info/rfc9268>.
[RFC9343]: Fioccola, G., Zhou, T., Cociglio, M., Qin, F., and R. Pang, "IPv6 Application of the Alternate-Marking Method", RFC 9343, DOI 10.17487/RFC9343, December 2022, <https://www.rfc-editor.org/info/rfc9343>.
[RFC9419]: Arkko, J., Hardie, T., Pauly, T., and M. Kühlewind, "Considerations on Application - Network Collaboration Using Path Signals", RFC 9419, DOI 10.17487/RFC9419, July 2023, <https://www.rfc-editor.org/info/rfc9419>.
[I-D.ietf-ippm-ioam-ipv6-options]: Bhandari, S. and F. Brockners, "In-situ OAM IPv6 Options", Work in Progress, Internet-Draft, draft-ietf-ippm-ioam-ipv6-options-12, 7 May 2023, <https://datatracker.ietf.org/doc/html/draft-ietf-ippm-ioam-ipv6-options-12>.
[I-D.yiakoumis-network-tokens]: Yiakoumis, Y., McKeown, N., and F. Sorensen, "Network Tokens", Work in Progress, Internet-Draft, draft-yiakoumis-network-tokens-02, 22 December 2020, <https://datatracker.ietf.org/doc/html/draft-yiakoumis-network-tokens-02>.
[I-D.krishnan-ipv6-hopbyhop]: Krishnan, S., "The case against Hop-by-Hop options", Work in Progress, Internet-Draft, draft-krishnan-ipv6-hopbyhop-05, 22 October 2010, <https://datatracker.ietf.org/doc/html/draft-krishnan-ipv6-hopbyhop-05>.
[Lark]: Zhang, Z., Bockelman, B., Carder, D., and T. Tannenbaum, "Lark: An effective approach for software-defined networking in high throughput computing clusters", Future Generation Computer Systems, Volume 72, Pages 105-117, ISSN 0167-739X, DOI 10.1016/j.future.2016.03.010, 2017, <https://doi.org/10.1016/j.future.2016.03.010>.
[Firefly]: "Identifying and Understanding Scientific Network Flows", 26th International Conference on Computing in High Energy & Nuclear Physics (CHEP 2023), <https://indico.jlab.org/event/459/contributions/11321/>.
[SciTags]: "Scientific network tags (scitags) website and accompanying registry", <https://www.scitags.org/>.
[XRootD]: "XRootD software framework", <https://xrootd.slac.stanford.edu/>.
[FlowD]: "FlowD software", <https://github.com/scitags/flowd>.
[Iperf3]: "Iperf3 software", <https://github.com/esnet/iperf>.
[PerfSONAR]: "PerfSONAR performance Service-Oriented Network monitoring ARchitecture", <https://www.perfsonar.net/>.
[SFlow]: "sFlow telemetry streaming of SciTags", <https://blog.sflow.com/2022/11/scientific-network-tags-scitags.html>.
[PLB]: Qureshi, M. A., Cheng, Y., Yin, Q., Fu, Q., Kumar, G., Moshref, M., Yan, J., Jacobson, V., Wetherall, D., and A. Kabbani, "PLB: congestion signals are simple and effective for network load balancing", Proceedings of the ACM SIGCOMM 2022 Conference, pages 207–218, DOI 10.1145/3544216.3544226, 2022, <https://doi.org/10.1145/3544216.3544226>.