--- 1/draft-ietf-mboned-dc-deploy-02.txt 2018-06-29 13:13:14.447590625 -0700 +++ 2/draft-ietf-mboned-dc-deploy-03.txt 2018-06-29 13:13:14.487591599 -0700 @@ -1,464 +1,695 @@ MBONED M. McBride Internet-Draft Huawei -Intended status: Informational February 28, 2018 -Expires: September 1, 2018 +Intended status: Informational O. Komolafe +Expires: December 31, 2018 Arista Networks + June 29, 2018 Multicast in the Data Center Overview - draft-ietf-mboned-dc-deploy-02 + draft-ietf-mboned-dc-deploy-03 Abstract - There has been much interest in issues surrounding massive amounts of - hosts in the data center. These issues include the prevalent use of - IP Multicast within the Data Center. Its important to understand how - IP Multicast is being deployed in the Data Center to be able to - understand the surrounding issues with doing so. This document - provides a quick survey of uses of multicast in the data center and - should serve as an aid to further discussion of issues related to - large amounts of multicast in the data center. + The volume and importance of one-to-many traffic patterns in data + centers is likely to increase significantly in the future. Reasons + for this increase are discussed and then attention is paid to the + manner in which this traffic pattern may be judiously handled in data + centers. The intuitive solution of deploying conventional IP + multicast within data centers is explored and evaluated. Thereafter, + a number of emerging innovative approaches are described before a + number of recommendations are made. Status of This Memo This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet- Drafts is at https://datatracker.ietf.org/drafts/current/. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." - This Internet-Draft will expire on September 1, 2018. + This Internet-Draft will expire on December 31, 2018. Copyright Notice Copyright (c) 2018 IETF Trust and the persons identified as the document authors. All rights reserved. This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License. Table of Contents 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2 1.1. Requirements Language . . . . . . . . . . . . . . . . . . 3 - 2. Multicast Applications in the Data Center . . . . . . . . . . 3 - 2.1. Client-Server Applications . . . . . . . . . . . . . . . 3 - 2.2. Non Client-Server Multicast Applications . . . . . . . . 4 - 3. L2 Multicast Protocols in the Data Center . . . . . . . . . . 5 - 4. L3 Multicast Protocols in the Data Center . . . . . . . . . . 6 - 5. Challenges of using multicast in the Data Center . . . . . . 7 - 6. 
Layer 3 / Layer 2 Topological Variations . . . . . . . . . . 8 - 7. Address Resolution . . . . . . . . . . . . . . . . . . . . . 9 - 7.1. Solicited-node Multicast Addresses for IPv6 address - resolution . . . . . . . . . . . . . . . . . . . . . . . 9 - 7.2. Direct Mapping for Multicast address resolution . . . . . 9 - 8. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 10 - 9. Security Considerations . . . . . . . . . . . . . . . . . . . 10 - 10. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 10 - 11. References . . . . . . . . . . . . . . . . . . . . . . . . . 10 - 11.1. Normative References . . . . . . . . . . . . . . . . . . 10 - 11.2. Informative References . . . . . . . . . . . . . . . . . 10 - Author's Address . . . . . . . . . . . . . . . . . . . . . . . . 10 + 2. Reasons for increasing one-to-many traffic patterns . . . . . 3 + 2.1. Applications . . . . . . . . . . . . . . . . . . . . . . 3 + 2.2. Overlays . . . . . . . . . . . . . . . . . . . . . . . . 5 + 2.3. Protocols . . . . . . . . . . . . . . . . . . . . . . . . 5 + 3. Handling one-to-many traffic using conventional multicast . . 5 + 3.1. Layer 3 multicast . . . . . . . . . . . . . . . . . . . . 6 + 3.2. Layer 2 multicast . . . . . . . . . . . . . . . . . . . . 6 + 3.3. Example use cases . . . . . . . . . . . . . . . . . . . . 8 + 3.4. Advantages and disadvantages . . . . . . . . . . . . . . 9 + 4. Alternative options for handling one-to-many traffic . . . . 9 + 4.1. Minimizing traffic volumes . . . . . . . . . . . . . . . 9 + 4.2. Head end replication . . . . . . . . . . . . . . . . . . 10 + 4.3. BIER . . . . . . . . . . . . . . . . . . . . . . . . . . 11 + 4.4. Segment Routing . . . . . . . . . . . . . . . . . . . . . 12 + 5. Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . 12 + 6. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 12 + 7. Security Considerations . . . . . . . . . . . . . . . . . . . 13 + 8. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 13 + 9. References . . . . . . . . . . . . . . . . . . . . . . . . . 13 + 9.1. Normative References . . . . . . . . . . . . . . . . . . 13 + 9.2. Informative References . . . . . . . . . . . . . . . . . 13 + Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 15 1. Introduction - Data center servers often use IP Multicast to send data to clients or - other application servers. IP Multicast is expected to help conserve - bandwidth in the data center and reduce the load on servers. IP - Multicast is also a key component in several data center overlay - solutions. Increased reliance on multicast, in next generation data - centers, requires higher performance and capacity especially from the - switches. If multicast is to continue to be used in the data center, - it must scale well within and between datacenters. There has been - much interest in issues surrounding massive amounts of hosts in the - data center. There was a lengthy discussion, in the now closed ARMD - WG, involving the issues with address resolution for non ARP/ND - multicast traffic in data centers. This document provides a quick - survey of multicast in the data center and should serve as an aid to - further discussion of issues related to multicast in the data center. + The volume and importance of one-to-many traffic patterns in data + centers is likely to increase significantly in the future. 
Reasons + for this increase include the nature of the traffic generated by + applications hosted in the data center, the need to handle broadcast, + unknown unicast and multicast (BUM) traffic within the overlay + technologies used to support multi-tenancy at scale, and the use of + certain protocols that traditionally require one-to-many control + message exchanges. These trends, allied with the expectation that + future highly virtualized data centers must support communication + between potentially thousands of participants, may lead to the + natural assumption that IP multicast will be widely used in data + centers, specifically given the bandwidth savings it potentially + offers. However, such an assumption would be wrong. In fact, there + is widespread reluctance to enable IP multicast in data centers for a + number of reasons, mostly pertaining to concerns about its + scalability and reliability. - ARP/ND issues are not addressed in this document except to explain - how address resolution occurs with multicast. + This draft discusses some of the main drivers for the increasing + volume and importance of one-to-many traffic patterns in data + centers. Thereafter, the manner in which conventional IP multicast + may be used to handle this traffic pattern is discussed and some of + the associated challenges highlighted. Following this discussion, a + number of alternative emerging approaches are introduced, before + concluding by discussing key trends and making a number of + recommendations. 1.1. Requirements Language The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119. -2. Multicast Applications in the Data Center +2. Reasons for increasing one-to-many traffic patterns - There are many data center operators who do not deploy Multicast in - their networks for scalability and stability reasons. There are also - many operators for whom multicast is a critical protocol within their - network and is enabled on their data center switches and routers. - For this latter group, there are several uses of multicast in their - data centers. An understanding of the uses of that multicast is - important in order to properly support these applications in the ever - evolving data centers. If, for instance, the majority of the - applications are discovering/signaling each other, using multicast, - there may be better ways to support them then using multicast. If, - however, the multicasting of data is occurring in large volumes, - there is a need for good data center overlay multicast support. The - applications either fall into the category of those that leverage L2 - multicast for discovery or of those that require L3 support and - likely span multiple subnets. +2.1. Applications -2.1. Client-Server Applications + Key trends suggest that the nature of the applications likely to + dominate future highly-virtualized multi-tenant data centers will + produce large volumes of one-to-many traffic. For example, it is + well-known that traffic flows in data centers have evolved from being + predominantly North-South (e.g. client-server) to predominantly East- + West (e.g. distributed computation). This change has led to the + consensus that topologies such as the Leaf/Spine, that are easier to + scale in the East-West direction, are better suited to the data + center of the future. 
This increase in East-West traffic flows
+ results from VMs often having to exchange numerous messages between
+ themselves as part of executing a specific workload. For example, a
+ computational workload could require data, or an executable, to be
+ disseminated to workers distributed throughout the data center which
+ may be subsequently polled for status updates. The emergence of such
+ applications means there is likely to be an increase in one-to-many
+ traffic flows with the increasing dominance of East-West traffic.

- IPTV servers use multicast to deliver content from the data center to
- end users. IPTV is typically a one to many application where the
- hosts are configured for IGMPv3, the switches are configured with
- IGMP snooping, and the routers are running PIM-SSM mode. Often
- redundant servers are sending multicast streams into the network and
- the network is forwarding the data across diverse paths.

+ The TV broadcast industry is another potential future source of
+ applications with one-to-many traffic patterns in data centers. The
+ requirement for robustness, stability and predictability has meant the
+ TV broadcast industry has traditionally used TV-specific protocols,
+ infrastructure and technologies for transmitting video signals
+ between cameras, studios, mixers, encoders, servers etc. However,
+ the growing cost and complexity of supporting this approach,
+ especially as the bit rates of the video signals increase due to
+ demand for formats such as 4K-UHD and 8K-UHD, means there is a
+ consensus that the TV broadcast industry will transition from
+ industry-specific transmission formats (e.g. SDI, HD-SDI) over TV-
+ specific infrastructure to using IP-based infrastructure. The
+ development of pertinent standards by the SMPTE, along with the
+ increasing performance of IP routers, means this transition is
+ gathering pace. A possible outcome of this transition will be the
+ building of IP data centers in broadcast plants. Traffic flows in
+ the broadcast industry are frequently one-to-many and so if IP data
+ centers are deployed in broadcast plants, it is imperative that this
+ traffic pattern is supported efficiently in that infrastructure. In
+ fact, a pivotal consideration for broadcasters considering
+ transitioning to IP is the manner in which these one-to-many traffic
+ flows will be managed and monitored in a data center with an IP
+ fabric.

- Windows Media servers send multicast streaming to clients. Windows
- Media Services streams to an IP multicast address and all clients
- subscribe to the IP address to receive the same stream. This allows
- a single stream to be played simultaneously by multiple clients and
- thus reducing bandwidth utilization.

+ Arguably one of the (few?) success stories in using conventional IP
+ multicast has been for disseminating market trading data. For
+ example, IP multicast is commonly used today to deliver stock quotes
+ from the stock exchange to financial services providers and then to
+ the stock analysts or brokerages. The network must be designed with
+ no single point of failure and in such a way that the network can
+ respond in a deterministic manner to any failure. Typically,
+ redundant servers (in a primary/backup or live-live mode) send
+ multicast streams into the network, with diverse paths being used
+ across the network.
Another critical requirement is reliability and
+ traceability; regulatory and legal requirements mean that the
+ producer of the market data must know exactly where the flow was
+ sent and be able to prove conclusively that the data was received
+ within agreed SLAs. The stock exchange generating the one-to-many
+ traffic and the stock analysts/brokerages that receive the traffic will
+ typically have their own data centers. Therefore, the manner in
+ which one-to-many traffic patterns are handled in these data centers
+ is extremely important, especially given the requirements and
+ constraints mentioned.

- Market data relies extensively on IP multicast to deliver stock
- quotes from the data center to a financial services provider and then
- to the stock analysts. The most critical requirement of a multicast
- trading floor is that it be highly available. The network must be
- designed with no single point of failure and in a way the network can
- respond in a deterministic manner to any failure. Typically
- redundant servers (in a primary/backup or live live mode) are sending
- multicast streams into the network and the network is forwarding the
- data across diverse paths (when duplicate data is sent by multiple
- servers).

+ Many data center cloud providers provide publish and subscribe
+ applications. There can be numerous publishers and subscribers and
+ many message channels within a data center. With publish and
+ subscribe servers, a separate message is sent to each subscriber of a
+ publication. With multicast publish/subscribe, only one message is
+ sent, regardless of the number of subscribers. In a publish/
+ subscribe system, client applications, some of which are publishers
+ and some of which are subscribers, are connected to a network of
+ message brokers that receive publications on a number of topics, and
+ send the publications on to the subscribers for those topics. The
+ more subscribers there are in the publish/subscribe system, the
+ greater the improvement to network utilization there might be with
+ multicast.

- With publish and subscribe servers, a separate message is sent to
- each subscriber of a publication. With multicast publish/subscribe,
- only one message is sent, regardless of the number of subscribers.
- In a publish/subscribe system, client applications, some of which are
- publishers and some of which are subscribers, are connected to a
- network of message brokers that receive publications on a number of
- topics, and send the publications on to the subscribers for those
- topics. The more subscribers there are in the publish/subscribe
- system, the greater the improvement to network utilization there
- might be with multicast.

+2.2. Overlays

-2.2. Non Client-Server Multicast Applications

+ The proposed architecture for supporting large-scale multi-tenancy in
+ highly virtualized data centers [RFC8014] consists of a tenant's VMs
+ distributed across the data center connected by a virtual network
+ known as the overlay network. A number of different technologies
+ have been proposed for realizing the overlay network, including VXLAN
+ [RFC7348], VXLAN-GPE [I-D.ietf-nvo3-vxlan-gpe], NVGRE [RFC7637] and
+ GENEVE [I-D.ietf-nvo3-geneve].
The often fervent and arguably
+ partisan debate about the relative merits of these overlay
+ technologies belies the fact that, conceptually, these overlays
+ typically simply provide a means to encapsulate and
+ tunnel Ethernet frames from the VMs over the data center IP fabric,
+ thus emulating a layer 2 segment between the VMs. Consequently, the
+ VMs believe and behave as if they are connected to the tenant's other
+ VMs by a conventional layer 2 segment, regardless of their physical
+ location within the data center. Naturally, in a layer 2 segment,
+ point to multi-point traffic can result from handling BUM (broadcast,
+ unknown unicast and multicast) traffic. Compounding this issue
+ within data centers, since the tenant's VMs attached to the emulated
+ segment may be dispersed throughout the data center, the BUM traffic
+ may need to traverse the data center fabric. Hence, regardless of
+ the overlay technology used, due consideration must be given to
+ handling BUM traffic, forcing the data center operator to consider
+ the manner in which one-to-many communication is handled within the
+ IP fabric.

- Routers, running Virtual Routing Redundancy Protocol (VRRP),
- communicate with one another using a multicast address. VRRP packets
- are sent, encapsulated in IP packets, to 224.0.0.18. A failure to
- receive a multicast packet from the master router for a period longer
- than three times the advertisement timer causes the backup routers to
- assume that the master router is dead. The virtual router then
- transitions into an unsteady state and an election process is
- initiated to select the next master router from the backup routers.
- This is fulfilled through the use of multicast packets. Backup
- router(s) are only to send multicast packets during an election
- process.

+2.3. Protocols

- Overlays may use IP multicast to virtualize L2 multicasts. IP
- multicast is used to reduce the scope of the L2-over-UDP flooding to
- only those hosts that have expressed explicit interest in the
- frames.VXLAN, for instance, is an encapsulation scheme to carry L2
- frames over L3 networks. The VXLAN Tunnel End Point (VTEP)
- encapsulates frames inside an L3 tunnel. VXLANs are identified by a
- 24 bit VXLAN Network Identifier (VNI). The VTEP maintains a table of
- known destination MAC addresses, and stores the IP address of the
- tunnel to the remote VTEP to use for each. Unicast frames, between
- VMs, are sent directly to the unicast L3 address of the remote VTEP.
- Multicast frames are sent to a multicast IP group associated with the
- VNI. Underlying IP Multicast protocols (PIM-SM/SSM/BIDIR) are used
- to forward multicast data across the overlay.

+ Conventionally, some key networking protocols used in data centers
+ require one-to-many communication. For example, ARP and ND use
+ broadcast and multicast messages within IPv4 and IPv6 networks
+ respectively to discover MAC address to IP address mappings.
+ Furthermore, when these protocols are running within an overlay
+ network, it is essential to ensure the messages are delivered to
+ all the hosts on the emulated layer 2 segment, regardless of physical
+ location within the data center. The challenges associated with
+ optimally delivering ARP and ND messages in data centers have
+ attracted much attention [RFC6820]. Popular approaches in use
+ mostly seek to exploit characteristics of data center networks to
+ avoid having to broadcast/multicast these messages, as discussed in
+ Section 4.1.
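   To make the ND case concrete, the short Python sketch below
   (illustrative only; the target address is an example and the function
   names are arbitrary) derives the solicited-node multicast group to
   which a Neighbor Solicitation for a given target address is sent,
   following RFC 4291, together with the Ethernet MAC address it maps to
   per RFC 2464. Because the group is a function of only the last 24
   bits of the target address, typically just the intended host needs to
   process the message.

   # Sketch: derive the IPv6 solicited-node multicast group (RFC 4291)
   # and its Ethernet MAC mapping (RFC 2464) for an ND target address.
   import ipaddress

   def solicited_node_group(target):
       last24 = int(ipaddress.IPv6Address(target)) & 0xFFFFFF
       return ipaddress.IPv6Address(
           int(ipaddress.IPv6Address("ff02::1:ff00:0")) | last24)

   def ipv6_multicast_mac(group):
       last32 = int(group) & 0xFFFFFFFF
       return "33:33:" + ":".join(f"{(last32 >> s) & 0xff:02x}" for s in (24, 16, 8, 0))

   group = solicited_node_group("2001:db8::20c:29ff:fe0c:47d5")
   print(group)                       # ff02::1:ff0c:47d5
   print(ipv6_multicast_mac(group))   # 33:33:ff:0c:47:d5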
- The Ganglia application relies upon multicast for distributed - discovery and monitoring of computing systems such as clusters and - grids. It has been used to link clusters across university campuses - and can scale to handle clusters with 2000 nodes - Windows Server, cluster node exchange, relies upon the use of - multicast heartbeats between servers. Only the other interfaces in - the same multicast group use the data. Unlike broadcast, multicast - traffic does not need to be flooded throughout the network, reducing - the chance that unnecessary CPU cycles are expended filtering traffic - on nodes outside the cluster. As the number of nodes increases, the - ability to replace several unicast messages with a single multicast - message improves node performance and decreases network bandwidth - consumption. Multicast messages replace unicast messages in two - components of clustering: +3. Handling one-to-many traffic using conventional multicast +3.1. Layer 3 multicast - o Heartbeats: The clustering failure detection engine is based on a - scheme whereby nodes send heartbeat messages to other nodes. - Specifically, for each network interface, a node sends a heartbeat - message to all other nodes with interfaces on that network. - Heartbeat messages are sent every 1.2 seconds. In the common case - where each node has an interface on each cluster network, there - are N * (N - 1) unicast heartbeats sent per network every 1.2 - seconds in an N-node cluster. With multicast heartbeats, the - message count drops to N multicast heartbeats per network every - 1.2 seconds, because each node sends 1 message instead of N - 1. - This represents a reduction in processing cycles on the sending - node and a reduction in network bandwidth consumed. + PIM is the most widely deployed multicast routing protocol and so, + unsurprisingly, is the primary multicast routing protocol considered + for use in the data center. There are three potential popular + flavours of PIM that may be used: PIM-SM [RFC4601], PIM-SSM [RFC4607] + or PIM-BIDIR [RFC5015]. It may be said that these different modes of + PIM tradeoff the optimality of the multicast forwarding tree for the + amount of multicast forwarding state that must be maintained at + routers. SSM provides the most efficient forwarding between sources + and receivers and thus is most suitable for applications with one-to- + many traffic patterns. State is built and maintained for each (S,G) + flow. Thus, the amount of multicast forwarding state held by routers + in the data center is proportional to the number of sources and + groups. At the other end of the spectrum, BIDIR is the most + efficient shared tree solution as one tree is built for all (S,G)s, + therefore minimizing the amount of state. This state reduction is at + the expense of optimal forwarding path between sources and receivers. + This use of a shared tree makes BIDIR particularly well-suited for + applications with many-to-many traffic patterns, given that the + amount of state is uncorrelated to the number of sources. SSM and + BIDIR are optimizations of PIM-SM. PIM-SM is still the most widely + deployed multicast routing protocol. PIM-SM can also be the most + complex. PIM-SM relies upon a RP (Rendezvous Point) to set up the + multicast tree and subsequently there is the option of switching to + the SPT (shortest path tree), similar to SSM, or staying on the + shared tree, similar to BIDIR. 
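   The trade-off between forwarding optimality and state can be made
   concrete with a rough calculation. The Python sketch below is purely
   illustrative (it counts only multicast route entries and ignores
   interface lists, timers and implementation detail); it simply
   contrasts how state grows with sources and groups under the three
   flavours discussed above.

   # Rough multicast state comparison for G groups with S sources each.
   def ssm_entries(groups, sources_per_group):
       return groups * sources_per_group            # one (S,G) entry per source and group

   def pim_sm_entries(groups, sources_per_group):
       # worst case after switching to the SPT: a (*,G) plus the (S,G) entries
       return groups + groups * sources_per_group

   def bidir_entries(groups, sources_per_group):
       return groups                                # a single (*,G) entry per group

   for name, fn in (("SSM", ssm_entries), ("PIM-SM", pim_sm_entries), ("BIDIR", bidir_entries)):
       print(name, fn(4000, 10))                    # SSM 40000, PIM-SM 44000, BIDIR 4000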
- o Regroup: The clustering membership engine executes a regroup
- protocol during a membership view change. The regroup protocol
- algorithm assumes the ability to broadcast messages to all cluster
- nodes. To avoid unnecessary network flooding and to properly
- authenticate messages, the broadcast primitive is implemented by a
- sequence of unicast messages. Converting the unicast messages to
- a single multicast message conserves processing power on the
- sending node and reduces network bandwidth consumption.

+3.2. Layer 2 multicast

- Multicast addresses in the 224.0.0.x range are considered link local
- multicast addresses. They are used for protocol discovery and are
- flooded to every port. For example, OSPF uses 224.0.0.5 and
- 224.0.0.6 for neighbor and DR discovery. These addresses are
- reserved and will not be constrained by IGMP snooping. These
- addresses are not to be used by any application.

+ With IPv4 unicast address resolution, the translation of an IP
+ address to a MAC address is done dynamically by ARP. With multicast
+ address resolution, the mapping from a multicast IPv4 address to a
+ multicast MAC address is done by assigning the low-order 23 bits of
+ the multicast IPv4 address to fill the low-order 23 bits of the
+ multicast MAC address. Each IPv4 multicast address has 28 unique
+ bits (the multicast address range is 224.0.0.0/4), therefore mapping
+ a multicast IP address to a MAC address ignores 5 bits of the IP
+ address. Hence, groups of 32 multicast IP addresses are mapped to
+ the same MAC address, meaning a multicast MAC address cannot be
+ uniquely mapped to a multicast IPv4 address. Therefore, planning is
+ required within an organization to choose IPv4 multicast addresses
+ judiciously in order to avoid address aliasing. When sending IPv6
+ multicast packets on an Ethernet link, the corresponding destination
+ MAC address is a direct mapping of the last 32 bits of the 128 bit
+ IPv6 multicast address into the 48 bit MAC address. It is possible
+ for more than one IPv6 multicast address to map to the same 48 bit
+ MAC address.

-3. L2 Multicast Protocols in the Data Center

+ The default behaviour of many hosts (and, in fact, routers) is to
+ block multicast traffic. Consequently, when a host wishes to join an
+ IPv4 multicast group, it sends an IGMP [RFC2236], [RFC3376] report to
+ the router attached to the layer 2 segment and also instructs its
+ data link layer to receive Ethernet frames that match the
+ corresponding MAC address. The data link layer filters the frames,
+ passing those with matching destination addresses to the IP module.
+ Similarly, hosts simply hand the multicast packet for transmission to
+ the data link layer which would add the layer 2 encapsulation, using
+ the MAC address derived in the manner previously discussed.

- The switches, in between the servers and the routers, rely upon igmp
- snooping to bound the multicast to the ports leading to interested
- hosts and to L3 routers. A switch will, by default, flood multicast
- traffic to all the ports in a broadcast domain (VLAN). IGMP snooping
- is designed to prevent hosts on a local network from receiving
- traffic for a multicast group they have not explicitly joined. It
- provides switches with a mechanism to prune multicast traffic from
- links that do not contain a multicast listener (an IGMP client).
- IGMP snooping is a L2 optimization for L3 IGMP.
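   The 32-to-1 aliasing described above is straightforward to
   demonstrate. The Python sketch below (illustrative only; the group
   addresses are examples) applies the standard 23-bit mapping into the
   01:00:5e prefix and shows two different IPv4 groups colliding on the
   same MAC address.

   # Sketch of the IPv4 multicast IP-to-MAC mapping discussed above:
   # the low-order 23 bits of the group address are copied into the
   # 01:00:5e:00:00:00 prefix, so 32 IPv4 groups share each MAC address.
   import ipaddress

   def ipv4_multicast_mac(group):
       low23 = int(ipaddress.IPv4Address(group)) & 0x7FFFFF
       mac = 0x01005E000000 | low23
       return ":".join(f"{(mac >> s) & 0xff:02x}" for s in range(40, -1, -8))

   print(ipv4_multicast_mac("224.1.1.1"))     # 01:00:5e:01:01:01
   print(ipv4_multicast_mac("225.129.1.1"))   # 01:00:5e:01:01:01 (aliases the group above)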
+ When this Ethernet frame with a multicast MAC address is received by + a switch configured to forward multicast traffic, the default + behaviour is to flood it to all the ports in the layer 2 segment. + Clearly there may not be a receiver for this multicast group present + on each port and IGMP snooping is used to avoid sending the frame out + of ports without receivers. IGMP snooping, with proxy reporting or report suppression, actively - filters IGMP packets in order to reduce load on the multicast router. - Joins and leaves heading upstream to the router are filtered so that - only the minimal quantity of information is sent. The switch is - trying to ensure the router only has a single entry for the group, - regardless of how many active listeners there are. If there are two - active listeners in a group and the first one leaves, then the switch - determines that the router does not need this information since it - does not affect the status of the group from the router's point of - view. However the next time there is a routine query from the router - the switch will forward the reply from the remaining host, to prevent - the router from believing there are no active listeners. It follows - that in active IGMP snooping, the router will generally only know - about the most recently joined member of the group. + filters IGMP packets in order to reduce load on the multicast router + by ensuring only the minimal quantity of information is sent. The + switch is trying to ensure the router has only a single entry for the + group, regardless of the number of active listeners. If there are + two active listeners in a group and the first one leaves, then the + switch determines that the router does not need this information + since it does not affect the status of the group from the router's + point of view. However the next time there is a routine query from + the router the switch will forward the reply from the remaining host, + to prevent the router from believing there are no active listeners. + It follows that in active IGMP snooping, the router will generally + only know about the most recently joined member of the group. - In order for IGMP, and thus IGMP snooping, to function, a multicast + In order for IGMP and thus IGMP snooping to function, a multicast router must exist on the network and generate IGMP queries. The tables (holding the member ports for each multicast group) created for snooping are associated with the querier. Without a querier the - tables are not created and snooping will not work. Furthermore IGMP + tables are not created and snooping will not work. Furthermore, IGMP general queries must be unconditionally forwarded by all switches involved in IGMP snooping. Some IGMP snooping implementations include full querier capability. Others are able to proxy and retransmit queries from the multicast router. - In source-only networks, however, which presumably describes most - data center networks, there are no IGMP hosts on switch ports to - generate IGMP packets. Switch ports are connected to multicast - source ports and multicast router ports. The switch typically learns - about multicast groups from the multicast data stream by using a type - of source only learning (when only receiving multicast data on the - port, no IGMP packets). The switch forwards traffic only to the - multicast router ports. When the switch receives traffic for new IP - multicast groups, it will typically flood the packets to all ports in - the same VLAN. 
This unnecessary flooding can impact switch - performance. + Multicast Listener Discovery (MLD) [RFC 2710] [RFC 3810] is used by + IPv6 routers for discovering multicast listeners on a directly + attached link, performing a similar function to IGMP in IPv4 + networks. MLDv1 [RFC 2710] is similar to IGMPv2 and MLDv2 [RFC 3810] + [RFC 4604] similar to IGMPv3. However, in contrast to IGMP, MLD does + not send its own distinct protocol messages. Rather, MLD is a + subprotocol of ICMPv6 [RFC 4443] and so MLD messages are a subset of + ICMPv6 messages. MLD snooping works similarly to IGMP snooping, + described earlier. -4. L3 Multicast Protocols in the Data Center +3.3. Example use cases - There are three flavors of PIM used for Multicast Routing in the Data - Center: PIM-SM [RFC4601], PIM-SSM [RFC4607], and PIM-BIDIR [RFC5015]. - SSM provides the most efficient forwarding between sources and - receivers and is most suitable for one to many types of multicast - applications. State is built for each S,G channel therefore the more - sources and groups there are, the more state there is in the network. - BIDIR is the most efficient shared tree solution as one tree is built - for all S,G's, therefore saving state. But it is not the most - efficient in forwarding path between sources and receivers. SSM and - BIDIR are optimizations of PIM-SM. PIM-SM is still the most widely - deployed multicast routing protocol. PIM-SM can also be the most - complex. PIM-SM relies upon a RP (Rendezvous Point) to set up the - multicast tree and then will either switch to the SPT (shortest path - tree), similar to SSM, or stay on the shared tree (similar to BIDIR). - For massive amounts of hosts sending (and receiving) multicast, the - shared tree (particularly with PIM-BIDIR) provides the best potential - scaling since no matter how many multicast sources exist within a - VLAN, the tree number stays the same. IGMP snooping, IGMP proxy, and - PIM-BIDIR have the potential to scale to the huge scaling numbers - required in a data center. + A use case where PIM and IGMP are currently used in data centers is + to support multicast in VXLAN deployments. In the original VXLAN + specification [RFC7348], a data-driven flood and learn control plane + was proposed, requiring the data center IP fabric to support + multicast routing. A multicast group is associated with each virtual + network, each uniquely identified by its VXLAN network identifiers + (VNI). VXLAN tunnel endpoints (VTEPs), typically located in the + hypervisor or ToR switch, with local VMs that belong to this VNI + would join the multicast group and use it for the exchange of BUM + traffic with the other VTEPs. Essentially, the VTEP would + encapsulate any BUM traffic from attached VMs in an IP multicast + packet, whose destination address is the associated multicast group + address, and transmit the packet to the data center fabric. Thus, + PIM must be running in the fabric to maintain a multicast + distribution tree per VNI. -5. Challenges of using multicast in the Data Center + Alternatively, rather than setting up a multicast distribution tree + per VNI, a tree can be set up whenever hosts within the VNI wish to + exchange multicast traffic. For example, whenever a VTEP receives an + IGMP report from a locally connected host, it would translate this + into a PIM join message which will be propagated into the IP fabric. 
+ In order to ensure this join message is sent to the IP fabric rather
+ than over the VXLAN interface (since the VTEP will have a route back
+ to the source of the multicast packet over the VXLAN interface and so
+ would naturally attempt to send the join over this interface), a more
+ specific route back to the source over the IP fabric must be
+ configured. In this approach PIM must be configured on the SVIs
+ associated with the VXLAN interface.

- Data Center environments may create unique challenges for IP
- Multicast. Data Center networks required a high amount of VM traffic
- and mobility within and between DC networks. DC networks have large
- numbers of servers. DC networks are often used with cloud
- orchestration software. DC networks often use IP Multicast in their
- unique environments. This section looks at the challenges of using
- multicast within the challenging data center environment.

+ Another use case of PIM and IGMP in data centers is when IPTV servers
+ use multicast to deliver content from the data center to end users.
+ IPTV is typically a one to many application where the hosts are
+ configured for IGMPv3, the switches are configured with IGMP
+ snooping, and the routers are running PIM-SSM mode. Often redundant
+ servers send multicast streams into the network and the network
+ forwards the data across diverse paths.

- When IGMP/MLD Snooping is not implemented, ethernet switches will
- flood multicast frames out of all switch-ports, which turns the
- traffic into something more like a broadcast.

+ Windows Media servers send multicast streams to clients. Windows
+ Media Services streams to an IP multicast address and all clients
+ subscribe to the IP address to receive the same stream. This allows
+ a single stream to be played simultaneously by multiple clients,
+ thus reducing bandwidth utilization.
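   Returning to the flood-and-learn VXLAN arrangement described at the
   start of this section, the Python sketch below shows one plausible
   way a VTEP could associate VNIs with underlay multicast groups. The
   group range, pool size and folding scheme are assumptions made purely
   for illustration; in practice the mapping is usually configured or
   orchestrated, and several VNIs may deliberately share one group, and
   therefore one distribution tree, to bound the amount of PIM state.

   # Sketch: associate each VXLAN VNI with an underlay IP multicast group
   # used for its BUM traffic (flood-and-learn model).  Illustrative only.
   import ipaddress

   def vni_to_group(vni, base="239.1.0.0", pool_size=2**16):
       # Fold the 24-bit VNI into a configured pool of ASM group addresses;
       # VNIs that collide share a group and hence a distribution tree.
       return str(ipaddress.IPv4Address(int(ipaddress.IPv4Address(base)) + (vni % pool_size)))

   for vni in (5001, 5002, 70537):                  # 70537 = 5001 + 65536
       print(vni, "->", vni_to_group(vni))
   # 5001 -> 239.1.19.137, 5002 -> 239.1.19.138, 70537 -> 239.1.19.137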
- VRRP uses multicast heartbeat to communicate between routers.
- The communication between the host and the default gateway is unicast.
- The multicast heartbeat can be very chatty when there are thousands
- of VRRP pairs with sub-second heartbeat calls back and forth.

+3.4. Advantages and disadvantages

- Link-local multicast should scale well within one IP subnet
- particularly with a large layer3 domain extending down to the access
- or aggregation switches. But if multicast traverses beyond one IP
- subnet, which is necessary for an overlay like VXLAN, you could
- potentially have scaling concerns. If using a VXLAN overlay, it is
- necessary to map the L2 multicast in the overlay to L3 multicast in
- the underlay or do head end replication in the overlay and receive
- duplicate frames on the first link from the router to the core
- switch. The solution could be to run potentially thousands of PIM
- messages to generate/maintain the required multicast state in the IP
- underlay. The behavior of the upper layer, with respect to
- broadcast/multicast, affects the choice of head end (*,G) or (S,G)
- replication in the underlay, which affects the opex and capex of the
- entire solution. A VXLAN, with thousands of logical groups, maps to
- head end replication in the hypervisor or to IGMP from the hypervisor
- and then PIM between the TOR and CORE 'switches' and the gateway
- router.

+ Arguably the biggest advantage of using PIM and IGMP to support one-
+ to-many communication in data centers is that these protocols are
+ relatively mature. Consequently, PIM is available in most routers
+ and IGMP is supported by most hosts and routers. As such, no
+ specialized hardware or relatively immature software is involved in
+ using them in data centers. Furthermore, the maturity of these
+ protocols means their behaviour and performance in operational
+ networks is well-understood, with widely available best-practices and
+ deployment guides for optimizing their performance.

- Requiring IP multicast (especially PIM BIDIR) from the network can
- prove challenging for data center operators especially at the kind of
- scale that the VXLAN/NVGRE proposals require. This is also true when
- the L2 topological domain is large and extended all the way to the L3
- core. In data centers with highly virtualized servers, even small L2
- domains may spread across many server racks (i.e. multiple switches
- and router ports).

+ However, somewhat ironically, the relative disadvantages of PIM and
+ IGMP usage in data centers also stem mostly from their maturity.
+ Specifically, these protocols were standardized and implemented long
+ before the highly-virtualized multi-tenant data centers of today
+ existed. Consequently, PIM and IGMP are neither optimally placed to
+ deal with the requirements of one-to-many communication in modern
+ data centers nor to exploit characteristics and idiosyncrasies of
+ data centers. For example, there may be thousands of VMs
+ participating in a multicast session, with some of these VMs
+ migrating to servers within the data center, new VMs being
+ continually spun up and wishing to join the sessions while all the
+ time other VMs are leaving. In such a scenario, the churn in the PIM
+ and IGMP state machines, the volume of control messages they would
+ generate and the amount of state they would necessitate within
+ routers, especially if they were deployed naively, would be
+ untenable.

- It's not uncommon for there to be 10-20 VMs per server in a
- virtualized environment. One vendor reported a customer requesting a
- scale to 400VM's per server. For multicast to be a viable solution
- in this environment, the network needs to be able to scale to these
- numbers when these VMs are sending/receiving multicast.

+4. Alternative options for handling one-to-many traffic

- A lot of switching/routing hardware has problems with IP Multicast,
- particularly with regards to hardware support of PIM-BIDIR.

+ Section 2 has shown that there is likely to be an increasing amount
+ of one-to-many communication in data centers, and Section 3 has
+ discussed how conventional multicast may be used to handle this
+ traffic. Having said that, there are a number of alternative options
+ for handling this traffic pattern in data centers, as discussed in the
+ following subsections. It should be noted that many of these techniques
+ are not mutually exclusive; in fact many deployments involve a
+ combination of more than one of these techniques. Furthermore, as
+ will be shown, introducing a centralized controller or a distributed
+ control plane makes these techniques more potent.

- Sending L2 multicast over a campus or data center backbone, in any
- sort of significant way, is a new challenge enabled for the first
- time by overlays. There are interesting challenges when pushing
- large amounts of multicast traffic through a network, and have thus
- far been dealt with using purpose-built networks. While the overlay
- proposals have been careful not to impose new protocol requirements,
- they have not addressed the issues of performance and scalability,
- nor the large-scale availability of these protocols.

- There is an unnecessary multicast stream flooding problem in the link
- layer switches between the multicast source and the PIM First Hop
- Router (FHR). The IGMP-Snooping Switch will forward multicast
- streams to router ports, and the PIM FHR must receive all multicast
- streams even if there is no request from receiver. This often leads
- to waste of switch cache and link bandwidth when the multicast
- streams are not actually required. [I-D.pim-umf-problem-statement]
- details the problem and defines design goals for a generic mechanism
- to restrain the unnecessary multicast stream flooding.

+4.1. Minimizing traffic volumes
+ If handling one-to-many traffic in data centers can be challenging,
+ then arguably the most intuitive solution is to aim to minimize the
+ volume of such traffic.

-6. Layer 3 / Layer 2 Topological Variations

+ It was previously mentioned in Section 2 that the three main causes
+ of one-to-many traffic in data centers are applications, overlays and
+ protocols. While, relatively speaking, little can be done about the
+ volume of one-to-many traffic generated by applications, there is
+ more scope for attempting to reduce the volume of such traffic
+ generated by overlays and protocols (and often by protocols within
+ overlays). This reduction is possible by exploiting certain
+ characteristics of data center networks: a fixed and regular
+ topology, ownership and exclusive control by a single organization,
+ well-known overlay encapsulation endpoints etc.

- As discussed in RFC6820, the ARMD problems statement, there are a
- variety of topological data center variations including L3 to Access
- Switches, L3 to Aggregation Switches, and L3 in the Core only.
- Further analysis is needed in order to understand how these
- variations affect IP Multicast scalability

+ A way of minimizing the amount of one-to-many traffic that traverses
+ the data center fabric is to use a centralized controller. For
+ example, whenever a new VM is instantiated, the hypervisor or
+ encapsulation endpoint can notify a centralized controller of this
+ new MAC address, the associated virtual network, IP address etc. The
+ controller could subsequently distribute this information to every
+ encapsulation endpoint. Consequently, when any endpoint receives an
+ ARP request from a locally attached VM, it could simply consult its
+ local copy of the information distributed by the controller and
+ reply. Thus, the ARP request is suppressed and does not result in
+ one-to-many traffic traversing the data center IP fabric.

-7. Address Resolution

+ Alternatively, the functionality supported by the controller can be
+ realized by a distributed control plane. BGP-EVPN [RFC7432, RFC8365]
+ is the most popular control plane used in data centers. Typically,
+ the encapsulation endpoints will exchange pertinent information with
+ each other by all peering with a BGP route reflector (RR). Thus,
+ information about local MAC addresses, MAC to IP address mappings,
+ virtual network identifiers etc can be disseminated. Consequently,
+ ARP requests from local VMs can be suppressed by the encapsulation
+ endpoint, as illustrated by the sketch below.
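   The following Python sketch illustrates the kind of ARP suppression
   described above. It is a simplified model under stated assumptions:
   the class, method and field names are invented for illustration, and
   a real endpoint would also handle ageing, VM mobility and table
   misses.

   # Sketch: ARP suppression at an encapsulation endpoint, using a
   # table of MAC/IP mappings distributed by a centralized controller
   # (or, equivalently, learned via BGP-EVPN routes).
   class Endpoint:
       def __init__(self):
           self.arp_table = {}                      # (vni, ip) -> mac

       def controller_update(self, vni, ip, mac):
           self.arp_table[(vni, ip)] = mac          # pushed by the controller

       def handle_arp_request(self, vni, target_ip):
           mac = self.arp_table.get((vni, target_ip))
           if mac is not None:
               return ("reply", mac)                # answered locally; nothing crosses the fabric
           return ("flood", None)                   # unknown target; fall back to BUM handling

   ep = Endpoint()
   ep.controller_update(vni=5001, ip="10.0.0.8", mac="52:54:00:12:34:56")
   print(ep.handle_arp_request(5001, "10.0.0.8"))   # ('reply', '52:54:00:12:34:56')
   print(ep.handle_arp_request(5001, "10.0.0.9"))   # ('flood', None)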
-7.1. Solicited-node Multicast Addresses for IPv6 address resolution

+4.2. Head end replication

- Solicited-node Multicast Addresses are used with IPv6 Neighbor
- Discovery to provide the same function as the Address Resolution
- Protocol (ARP) in IPv4. ARP uses broadcasts, to send an ARP
- Requests, which are received by all end hosts on the local link.
- Only the host being queried responds. However, the other hosts still
- have to process and discard the request. With IPv6, a host is
- required to join a Solicited-Node multicast group for each of its
- configured unicast or anycast addresses. Because a Solicited-node
- Multicast Address is a function of the last 24-bits of an IPv6
- unicast or anycast address, the number of hosts that are subscribed
- to each Solicited-node Multicast Address would typically be one
- (there could be more because the mapping function is not a 1:1
- mapping). Compared to ARP in IPv4, a host should not need to be
- interrupted as often to service Neighbor Solicitation requests.

+ A popular option for handling one-to-many traffic patterns in data
+ centers is head end replication (HER). HER means the traffic is
+ duplicated and sent to each end point individually using conventional
+ IP unicast. Obvious disadvantages of HER include traffic duplication
+ and the additional processing burden on the head end. Nevertheless,
+ HER is especially attractive when overlays are in use as the
+ replication can be carried out by the hypervisor or encapsulation end
+ point. Consequently, the VMs and IP fabric are unmodified and
+ unaware of how the traffic is delivered to the multiple end points.
+ Additionally, it is possible to use a number of approaches for
+ constructing and disseminating the list of which endpoints should
+ receive what traffic and so on.

-7.2. Direct Mapping for Multicast address resolution

+ For example, the reluctance of data center operators to enable PIM
+ and IGMP within the data center fabric means VXLAN is often used with
+ HER. Thus, BUM traffic from each VNI is replicated and sent using
+ unicast to remote VTEPs with VMs in that VNI. The list of remote
+ VTEPs to which the traffic should be sent may be configured manually
+ on the VTEP. Alternatively, the VTEPs may transmit appropriate state
+ to a centralized controller which in turn sends each VTEP the list of
+ remote VTEPs for each VNI. Lastly, HER also works well when a
+ distributed control plane is used instead of the centralized
+ controller. Again, BGP-EVPN may be used to distribute the
+ information needed to facilitate HER to the VTEPs.
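   As a concrete illustration of head end replication, the short Python
   sketch below shows an ingress endpoint copying a BUM frame once per
   remote endpoint in the same virtual network. It is a toy model under
   stated assumptions (the function names, VNI number and VTEP addresses
   are invented), not a description of any particular VTEP
   implementation; the per-VNI flood list could equally be configured
   manually, pushed by a controller or learned via BGP-EVPN.

   # Sketch of head end replication (HER): one BUM frame received from a
   # local VM is sent as an ordinary unicast copy to every remote endpoint
   # in the same VNI, so the IP fabric needs no multicast state at all.
   def head_end_replicate(frame, vni, flood_list, send_unicast):
       copies = 0
       for remote_vtep in flood_list.get(vni, []):
           send_unicast(remote_vtep, vni, frame)    # e.g. VXLAN-encapsulate and transmit
           copies += 1
       return copies

   flood_list = {5001: ["192.0.2.11", "192.0.2.12", "192.0.2.13"]}
   sent = head_end_replicate(b"<bum frame>", 5001, flood_list,
                             lambda vtep, vni, f: print("unicast copy to", vtep))
   print(sent, "copies sent")                       # 3 copies sent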
- With IPv4 unicast address resolution, the translation of an IP
- address to a MAC address is done dynamically by ARP. With multicast
- address resolution, the mapping from a multicast IP address to a
- multicast MAC address is derived from direct mapping. In IPv4, the
- mapping is done by assigning the low-order 23 bits of the multicast
- IP address to fill the low-order 23 bits of the multicast MAC
- address. When a host joins an IP multicast group, it instructs the
- data link layer to receive frames that match the MAC address that
- corresponds to the IP address of the multicast group. The data link
- layer filters the frames and passes frames with matching destination
- addresses to the IP module. Since the mapping from multicast IP
- address to a MAC address ignores 5 bits of the IP address, groups of
- 32 multicast IP addresses are mapped to the same MAC address. As a
- result a multicast MAC address cannot be uniquely mapped to a
- multicast IPv4 address. Planning is required within an organization
- to select IPv4 groups that are far enough away from each other as to
- not end up with the same L2 address used. Any multicast address in
- the [224-239].0.0.x and [224-239].128.0.x ranges should not be
- considered. When sending IPv6 multicast packets on an Ethernet link,
- the corresponding destination MAC address is a direct mapping of the
- last 32 bits of the 128 bit IPv6 multicast address into the 48 bit
- MAC address. It is possible for more than one IPv6 Multicast address
- to map to the same 48 bit MAC address.

+4.3. BIER

-8. IANA Considerations

+ As discussed in Section 3.4, PIM and IGMP face potential scalability
+ challenges when deployed in data centers. These challenges are
+ typically due to the requirement to build and maintain a distribution
+ tree and the requirement to hold per-flow state in routers. Bit
+ Index Explicit Replication (BIER) [RFC8279] is a new multicast
+ forwarding paradigm that avoids these two requirements.

+ When a multicast packet enters a BIER domain, the ingress router,
+ known as the Bit-Forwarding Ingress Router (BFIR), adds a BIER header
+ to the packet. This header contains a bit string in which each bit
+ maps to an egress router, known as a Bit-Forwarding Egress Router
+ (BFER). If a bit is set, then the packet should be forwarded to the
+ associated BFER. The routers within the BIER domain, Bit-Forwarding
+ Routers (BFRs), use the BIER header in the packet and information in
+ the Bit Index Forwarding Table (BIFT) to carry out simple bit-wise
+ operations to determine how the packet should be replicated optimally
+ so it reaches all the appropriate BFERs.

+ BIER is deemed to be attractive for facilitating one-to-many
+ communications in data centers [I-D.ietf-bier-use-cases]. The
+ deployment envisioned with overlay networks is that the
+ encapsulation endpoints would be the BFIRs. Thus, knowledge about the
+ actual multicast groups does not reside in the data center fabric,
+ improving the scalability compared to conventional IP multicast.
+ Additionally, a centralized controller or a BGP-EVPN control plane
+ may be used with BIER to ensure the BFIRs have the required
+ information. A challenge associated with using BIER is that, unlike
+ most of the other approaches discussed in this draft, it requires
+ changes to the forwarding behaviour of the routers used in the data
+ center IP fabric.

+4.4. Segment Routing

+ Segment Routing (SR) [I-D.ietf-spring-segment-routing] adopts the
+ source routing paradigm in which the manner in which a packet
+ traverses a network is determined by an ordered list of instructions.
+ These instructions are known as segments and may have a semantic that
+ is local to an SR node or global within an SR domain. SR allows
+ enforcing a flow through any topological path while maintaining
+ per-flow state only at the ingress node to the SR domain. Segment
+ Routing can be applied to the MPLS and IPv6 data-planes. In the
+ former, the list of segments is represented by the label stack and in
+ the latter it is represented as a routing extension header. Use-cases
+ are described in [I-D.ietf-spring-segment-routing] and are being
+ considered in the context of BGP-based large-scale data-center (DC)
+ design [RFC7938].

+ Multicast in SR continues to be discussed in a variety of drafts and
+ working groups. The SPRING WG has not yet been chartered to work on
+ multicast in SR. Multicast support can include locally allocating a
+ Segment Identifier (SID) to existing replication solutions, such as
+ PIM, mLDP, P2MP RSVP-TE and BIER. It may also be that a new way to
+ signal and install trees in SR is developed without creating state in
+ the network.

+5.
Conclusions + + As the volume and importance of one-to-many traffic in data centers + increases, conventional IP multicast is likely to become increasingly + unattractive for deployment in data centers for a number of reasons, + mostly pertaining its inherent relatively poor scalability and + inability to exploit characteristics of data center network + architectures. Hence, even though IGMP/MLD is likely to remain the + most popular manner in which end hosts signal interest in joining a + multicast group, it is unlikely that this multicast traffic will be + transported over the data center IP fabric using a multicast + distribution tree built by PIM. Rather, approaches which exploit + characteristics of data center network architectures (e.g. fixed and + regular topology, owned and exclusively controlled by single + organization, well-known overlay encapsulation endpoints etc.) are + better placed to deliver one-to-many traffic in data centers, + especially when judiciously combined with a centralized controller + and/or a distributed control plane (particularly one based on BGP- + EVPN). + +6. IANA Considerations This memo includes no request to IANA. -9. Security Considerations +7. Security Considerations No new security considerations result from this document -10. Acknowledgements - - The authors would like to thank the many individuals who contributed - opinions on the ARMD wg mailing list about this topic: Linda Dunbar, - Anoop Ghanwani, Peter Ashwoodsmith, David Allan, Aldrin Isaac, Igor - Gashinsky, Michael Smith, Patrick Frejborg, Joel Jaeggli and Thomas - Narten. +8. Acknowledgements -11. References +9. References -11.1. Normative References +9.1. Normative References [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March 1997, . -11.2. Informative References +9.2. Informative References + + [I-D.ietf-bier-use-cases] + Kumar, N., Asati, R., Chen, M., Xu, X., Dolganow, A., + Przygienda, T., Gulko, A., Robinson, D., Arya, V., and C. + Bestler, "BIER Use Cases", draft-ietf-bier-use-cases-06 + (work in progress), January 2018. + + [I-D.ietf-nvo3-geneve] + Gross, J., Ganga, I., and T. Sridhar, "Geneve: Generic + Network Virtualization Encapsulation", draft-ietf- + nvo3-geneve-06 (work in progress), March 2018. + + [I-D.ietf-nvo3-vxlan-gpe] + Maino, F., Kreeger, L., and U. Elzur, "Generic Protocol + Extension for VXLAN", draft-ietf-nvo3-vxlan-gpe-06 (work + in progress), April 2018. + + [I-D.ietf-spring-segment-routing] + Filsfils, C., Previdi, S., Ginsberg, L., Decraene, B., + Litkowski, S., and R. Shakir, "Segment Routing + Architecture", draft-ietf-spring-segment-routing-15 (work + in progress), January 2018. + + [RFC2236] Fenner, W., "Internet Group Management Protocol, Version + 2", RFC 2236, DOI 10.17487/RFC2236, November 1997, + . + + [RFC2710] Deering, S., Fenner, W., and B. Haberman, "Multicast + Listener Discovery (MLD) for IPv6", RFC 2710, + DOI 10.17487/RFC2710, October 1999, + . + + [RFC3376] Cain, B., Deering, S., Kouvelas, I., Fenner, B., and A. + Thyagarajan, "Internet Group Management Protocol, Version + 3", RFC 3376, DOI 10.17487/RFC3376, October 2002, + . + + [RFC4601] Fenner, B., Handley, M., Holbrook, H., and I. Kouvelas, + "Protocol Independent Multicast - Sparse Mode (PIM-SM): + Protocol Specification (Revised)", RFC 4601, + DOI 10.17487/RFC4601, August 2006, + . + + [RFC4607] Holbrook, H. and B. Cain, "Source-Specific Multicast for + IP", RFC 4607, DOI 10.17487/RFC4607, August 2006, + . 
+ + [RFC5015] Handley, M., Kouvelas, I., Speakman, T., and L. Vicisano, + "Bidirectional Protocol Independent Multicast (BIDIR- + PIM)", RFC 5015, DOI 10.17487/RFC5015, October 2007, + . [RFC6820] Narten, T., Karir, M., and I. Foo, "Address Resolution Problems in Large Data Center Networks", RFC 6820, DOI 10.17487/RFC6820, January 2013, . -Author's Address + [RFC7348] Mahalingam, M., Dutt, D., Duda, K., Agarwal, P., Kreeger, + L., Sridhar, T., Bursell, M., and C. Wright, "Virtual + eXtensible Local Area Network (VXLAN): A Framework for + Overlaying Virtualized Layer 2 Networks over Layer 3 + Networks", RFC 7348, DOI 10.17487/RFC7348, August 2014, + . + + [RFC7432] Sajassi, A., Ed., Aggarwal, R., Bitar, N., Isaac, A., + Uttaro, J., Drake, J., and W. Henderickx, "BGP MPLS-Based + Ethernet VPN", RFC 7432, DOI 10.17487/RFC7432, February + 2015, . + + [RFC7637] Garg, P., Ed. and Y. Wang, Ed., "NVGRE: Network + Virtualization Using Generic Routing Encapsulation", + RFC 7637, DOI 10.17487/RFC7637, September 2015, + . + + [RFC7938] Lapukhov, P., Premji, A., and J. Mitchell, Ed., "Use of + BGP for Routing in Large-Scale Data Centers", RFC 7938, + DOI 10.17487/RFC7938, August 2016, + . + + [RFC8014] Black, D., Hudson, J., Kreeger, L., Lasserre, M., and T. + Narten, "An Architecture for Data-Center Network + Virtualization over Layer 3 (NVO3)", RFC 8014, + DOI 10.17487/RFC8014, December 2016, + . + + [RFC8279] Wijnands, IJ., Ed., Rosen, E., Ed., Dolganow, A., + Przygienda, T., and S. Aldrin, "Multicast Using Bit Index + Explicit Replication (BIER)", RFC 8279, + DOI 10.17487/RFC8279, November 2017, + . + + [RFC8365] Sajassi, A., Ed., Drake, J., Ed., Bitar, N., Shekhar, R., + Uttaro, J., and W. Henderickx, "A Network Virtualization + Overlay Solution Using Ethernet VPN (EVPN)", RFC 8365, + DOI 10.17487/RFC8365, March 2018, + . + +Authors' Addresses Mike McBride Huawei Email: michael.mcbride@huawei.com + + Olufemi Komolafe + Arista Networks + + Email: femi@arista.com