MBONED                                                       M. McBride
Internet-Draft                                                   Huawei
Intended status: Informational                              O. Komolafe
Expires: December 31, 2018                               Arista Networks
                                                           June 29, 2018


                 Multicast in the Data Center Overview
                    draft-ietf-mboned-dc-deploy-03

Abstract

The volume and importance of one-to-many traffic patterns in data
centers are likely to increase significantly in the future.  Reasons
for this increase are discussed and then attention is paid to the
manner in which this traffic pattern may be judiciously handled in
data centers.  The intuitive solution of deploying conventional IP
multicast within data centers is explored and evaluated.  Thereafter,
a number of emerging innovative approaches are described before a
number of recommendations are made.

Status of This Memo

This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF).  Note that other groups may also distribute
working documents as Internet-Drafts.  The list of current Internet-
Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time.  It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."

This Internet-Draft will expire on December 31, 2018.

Copyright Notice

Copyright (c) 2018 IETF Trust and the persons identified as the
document authors.  All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(https://trustee.ietf.org/license-info) in effect on the date of
publication of this document.  Please review these documents
carefully, as they describe your rights and restrictions with respect
to this document.  Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  Requirements Language
   2.  Reasons for increasing one-to-many traffic patterns
     2.1.  Applications
     2.2.  Overlays
     2.3.  Protocols
   3.  Handling one-to-many traffic using conventional multicast
     3.1.  Layer 3 multicast
     3.2.  Layer 2 multicast
     3.3.  Example use cases
     3.4.  Advantages and disadvantages
   4.  Alternative options for handling one-to-many traffic
     4.1.  Minimizing traffic volumes
     4.2.  Head end replication
     4.3.  BIER
     4.4.  Segment Routing
   5.  Conclusions
   6.  IANA Considerations
   7.  Security Considerations
   8.  Acknowledgements
   9.  References
     9.1.  Normative References
     9.2.  Informative References
   Authors' Addresses
1.  Introduction

The volume and importance of one-to-many traffic patterns in data
centers are likely to increase significantly in the future.  Reasons
for this increase include the nature of the traffic generated by
applications hosted in the data center, the need to handle broadcast,
unknown unicast and multicast (BUM) traffic within the overlay
technologies used to support multi-tenancy at scale, and the use of
certain protocols that traditionally require one-to-many control
message exchanges.  These trends, allied with the expectation that
future highly virtualized data centers must support communication
between potentially thousands of participants, may lead to the
natural assumption that IP multicast will be widely used in data
centers, specifically given the bandwidth savings it potentially
offers.  However, such an assumption would be wrong.  In fact, there
is widespread reluctance to enable IP multicast in data centers for a
number of reasons, mostly pertaining to concerns about its
scalability and reliability.

This draft discusses some of the main drivers for the increasing
volume and importance of one-to-many traffic patterns in data
centers.  Thereafter, the manner in which conventional IP multicast
may be used to handle this traffic pattern is discussed and some of
the associated challenges are highlighted.
Following this discussion, a number of alternative emerging
approaches are introduced, before concluding by discussing key trends
and making a number of recommendations.

1.1.  Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119 [RFC2119].

2.  Reasons for increasing one-to-many traffic patterns

2.1.  Applications

Key trends suggest that the nature of the applications likely to
dominate future highly-virtualized multi-tenant data centers will
produce large volumes of one-to-many traffic.  For example, it is
well-known that traffic flows in data centers have evolved from being
predominantly North-South (e.g. client-server) to predominantly East-
West (e.g. distributed computation).  This change has led to the
consensus that topologies such as the Leaf/Spine, which are easier to
scale in the East-West direction, are better suited to the data
center of the future.  This increase in East-West traffic flows
results from VMs often having to exchange numerous messages between
themselves as part of executing a specific workload.  For example, a
computational workload could require data, or an executable, to be
disseminated to workers distributed throughout the data center, which
may subsequently be polled for status updates.  The emergence of such
applications means there is likely to be an increase in one-to-many
traffic flows with the increasing dominance of East-West traffic.
The TV broadcast industry is another potential future source of
applications with one-to-many traffic patterns in data centers.  The
requirement for robustness, stability and predictability has meant
the TV broadcast industry has traditionally used TV-specific
protocols, infrastructure and technologies for transmitting video
signals between cameras, studios, mixers, encoders, servers etc.
However, the growing cost and complexity of supporting this approach,
especially as the bit rates of the video signals increase due to
demand for formats such as 4K-UHD and 8K-UHD, means there is a
consensus that the TV broadcast industry will transition from
industry-specific transmission formats (e.g. SDI, HD-SDI) over TV-
specific infrastructure to using IP-based infrastructure.  The
development of pertinent standards by the SMPTE, along with the
increasing performance of IP routers, means this transition is
gathering pace.  A possible outcome of this transition will be the
building of IP data centers in broadcast plants.  Traffic flows in
the broadcast industry are frequently one-to-many and so if IP data
centers are deployed in broadcast plants, it is imperative that this
traffic pattern is supported efficiently in that infrastructure.  In
fact, a pivotal consideration for broadcasters considering
transitioning to IP is the manner in which these one-to-many traffic
flows will be managed and monitored in a data center with an IP
fabric.

Arguably one of the (few?) success stories in using conventional IP
multicast has been for disseminating market trading data.  For
example, IP multicast is commonly used today to deliver stock quotes
from the stock exchange to a financial services provider and then to
the stock analysts or brokerages.  The network must be designed with
no single point of failure and in such a way that the network can
respond in a deterministic manner to any failure.  Typically,
redundant servers (in a primary/backup or live-live mode) send
multicast streams into the network, with diverse paths being used
across the network.  Another critical requirement is reliability and
traceability; regulatory and legal requirements mean that the
producer of the market data must know exactly where the flow was sent
and be able to prove conclusively that the data was received within
agreed SLAs.  The stock exchange generating the one-to-many traffic
and the stock analysts/brokerages that receive the traffic will
typically have their own data centers.  Therefore, the manner in
which one-to-many traffic patterns are handled in these data centers
is extremely important, especially given the requirements and
constraints mentioned.

Many data center cloud providers provide publish and subscribe
applications.  There can be numerous publishers and subscribers and
many message channels within a data center.  With publish and
subscribe servers, a separate message is sent to each subscriber of a
publication.  With multicast publish/subscribe, only one message is
sent, regardless of the number of subscribers.  In a publish/
subscribe system, client applications, some of which are publishers
and some of which are subscribers, are connected to a network of
message brokers that receive publications on a number of topics, and
send the publications on to the subscribers for those topics.  The
more subscribers there are in the publish/subscribe system, the
greater the improvement to network utilization there might be with
multicast.
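By way of illustration, the following short Python sketch (using
purely hypothetical numbers, not taken from any measurement) compares
the number of copies of a publication a broker must transmit when
delivering to its subscribers with unicast versus multicast.

   # Hypothetical illustration: copies of a publication transmitted by
   # a broker for N subscribers, with unicast versus multicast delivery.

   def unicast_copies(num_subscribers: int) -> int:
       # One copy of the publication is sent to every subscriber.
       return num_subscribers

   def multicast_copies(num_subscribers: int) -> int:
       # A single copy is sent to the group, however many subscribers.
       return 1 if num_subscribers > 0 else 0

   for n in (1, 10, 1000):
       print(n, "subscribers:", unicast_copies(n), "vs", multicast_copies(n))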
2.2.  Overlays

The proposed architecture for supporting large-scale multi-tenancy in
highly virtualized data centers [RFC8014] consists of a tenant's VMs
distributed across the data center connected by a virtual network
known as the overlay network.  A number of different technologies
have been proposed for realizing the overlay network, including VXLAN
[RFC7348], VXLAN-GPE [I-D.ietf-nvo3-vxlan-gpe], NVGRE [RFC7637] and
GENEVE [I-D.ietf-nvo3-geneve].  The often fervent and arguably
partisan debate about the relative merits of these overlay
technologies belies the fact that, conceptually, these overlays
typically simply provide a means to encapsulate and tunnel Ethernet
frames from the VMs over the data center IP fabric, thus emulating a
layer 2 segment between the VMs.  Consequently, the VMs believe and
behave as if they are connected to the tenant's other VMs by a
conventional layer 2 segment, regardless of their physical location
within the data center.  Naturally, in a layer 2 segment, point to
multi-point traffic can result from handling BUM (broadcast, unknown
unicast and multicast) traffic.  And, compounding this issue within
data centers, since the tenant's VMs attached to the emulated segment
may be dispersed throughout the data center, the BUM traffic may need
to traverse the data center fabric.  Hence, regardless of the overlay
technology used, due consideration must be given to handling BUM
traffic, forcing the data center operator to consider the manner in
which one-to-many communication is handled within the IP fabric.

2.3.  Protocols

Conventionally, some key networking protocols used in data centers
require one-to-many communication.  For example, ARP and ND use
broadcast and multicast messages within IPv4 and IPv6 networks
respectively to discover MAC address to IP address mappings.
Furthermore, when these protocols are running within an overlay
network, it is essential to ensure the messages are delivered to all
the hosts on the emulated layer 2 segment, regardless of their
physical location within the data center.  The challenges associated
with optimally delivering ARP and ND messages in data centers have
attracted lots of attention [RFC6820].  Popular approaches in use
mostly seek to exploit characteristics of data center networks to
avoid having to broadcast/multicast these messages, as discussed in
Section 4.1.
3.  Handling one-to-many traffic using conventional multicast

3.1.  Layer 3 multicast

PIM is the most widely deployed multicast routing protocol and so,
unsurprisingly, is the primary multicast routing protocol considered
for use in the data center.  There are three potential popular
flavours of PIM that may be used: PIM-SM [RFC4601], PIM-SSM [RFC4607]
or PIM-BIDIR [RFC5015].  It may be said that these different modes of
PIM trade off the optimality of the multicast forwarding tree for the
amount of multicast forwarding state that must be maintained at
routers.  SSM provides the most efficient forwarding between sources
and receivers and thus is most suitable for applications with one-to-
many traffic patterns.  State is built and maintained for each (S,G)
flow.  Thus, the amount of multicast forwarding state held by routers
in the data center is proportional to the number of sources and
groups.  At the other end of the spectrum, BIDIR is the most
efficient shared tree solution as one tree is built for all (S,G)s,
therefore minimizing the amount of state.  This state reduction is at
the expense of an optimal forwarding path between sources and
receivers.  The use of a shared tree makes BIDIR particularly well-
suited for applications with many-to-many traffic patterns, given
that the amount of state is uncorrelated to the number of sources.
SSM and BIDIR are optimizations of PIM-SM.  PIM-SM is still the most
widely deployed multicast routing protocol.  PIM-SM can also be the
most complex.  PIM-SM relies upon an RP (Rendezvous Point) to set up
the multicast tree and subsequently there is the option of switching
to the SPT (shortest path tree), similar to SSM, or staying on the
shared tree, similar to BIDIR.
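To make the scaling trade-off just described more concrete, the
following minimal Python sketch (using purely hypothetical numbers of
groups and sources) contrasts the forwarding state a router would
hold with per-(S,G) trees (as in SSM) against shared per-group trees
(as in BIDIR).

   # Rough illustration (hypothetical figures) of multicast forwarding
   # state per router under the PIM modes discussed above.

   def ssm_state_entries(groups: int, sources_per_group: int) -> int:
       # PIM-SSM: one (S,G) entry per source, per group.
       return groups * sources_per_group

   def bidir_state_entries(groups: int) -> int:
       # PIM-BIDIR: one shared (*,G) tree per group, however many sources.
       return groups

   groups, sources_per_group = 1000, 20
   print("SSM entries:  ", ssm_state_entries(groups, sources_per_group))
   print("BIDIR entries:", bidir_state_entries(groups))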
3.2.  Layer 2 multicast

With IPv4 unicast address resolution, the translation of an IP
address to a MAC address is done dynamically by ARP.  With multicast
address resolution, the mapping from a multicast IPv4 address to a
multicast MAC address is done by assigning the low-order 23 bits of
the multicast IPv4 address to fill the low-order 23 bits of the
multicast MAC address.  Each IPv4 multicast address has 28 unique
bits (the multicast address range is 224.0.0.0/4), therefore mapping
a multicast IP address to a MAC address ignores 5 bits of the IP
address.  Hence, groups of 32 multicast IP addresses are mapped to
the same MAC address, meaning a multicast MAC address cannot be
uniquely mapped to a multicast IPv4 address.  Therefore, planning is
required within an organization to choose IPv4 multicast addresses
judiciously in order to avoid address aliasing.  When sending IPv6
multicast packets on an Ethernet link, the corresponding destination
MAC address is a direct mapping of the last 32 bits of the 128 bit
IPv6 multicast address into the 48 bit MAC address.  It is possible
for more than one IPv6 multicast address to map to the same 48 bit
MAC address.
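The following Python sketch illustrates these mappings, including the
resulting 32:1 aliasing of IPv4 groups onto a single MAC address; the
example group addresses are arbitrary.

   import ipaddress

   def ipv4_multicast_mac(group: str) -> str:
       # Copy the low-order 23 bits of the IPv4 group address into the
       # low-order 23 bits behind the fixed 01:00:5e prefix.
       addr = int(ipaddress.IPv4Address(group))
       mac = 0x01005E000000 | (addr & 0x7FFFFF)
       return ':'.join(f'{(mac >> s) & 0xFF:02x}' for s in range(40, -8, -8))

   def ipv6_multicast_mac(group: str) -> str:
       # Copy the last 32 bits of the IPv6 group address behind 33:33.
       addr = int(ipaddress.IPv6Address(group))
       mac = 0x333300000000 | (addr & 0xFFFFFFFF)
       return ':'.join(f'{(mac >> s) & 0xFF:02x}' for s in range(40, -8, -8))

   # 224.1.1.1 and 239.129.1.1 differ only in the 5 ignored bits and so
   # alias to the same MAC address, 01:00:5e:01:01:01.
   print(ipv4_multicast_mac('224.1.1.1'))
   print(ipv4_multicast_mac('239.129.1.1'))
   print(ipv6_multicast_mac('ff02::1:ff00:1'))   # 33:33:ff:00:00:01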
The default behaviour of many hosts (and, in fact, routers) is to
block multicast traffic.  Consequently, when a host wishes to join an
IPv4 multicast group, it sends an IGMP [RFC2236], [RFC3376] report to
the router attached to the layer 2 segment and also instructs its
data link layer to receive Ethernet frames that match the
corresponding MAC address.  The data link layer filters the frames,
passing those with matching destination addresses to the IP module.
Similarly, when sending, hosts simply hand the multicast packet to
the data link layer, which adds the layer 2 encapsulation using the
MAC address derived in the manner previously discussed.

When an Ethernet frame with a multicast MAC address is received by a
switch configured to forward multicast traffic, the default behaviour
is to flood it to all the ports in the layer 2 segment.  Clearly
there may not be a receiver for this multicast group present on each
port and IGMP snooping is used to avoid sending the frame out of
ports without receivers.  IGMP snooping, with proxy reporting or
report suppression, actively filters IGMP packets in order to reduce
load on the multicast router by ensuring only the minimal quantity of
information is sent.  The switch is trying to ensure the router has
only a single entry for the group, regardless of the number of active
listeners.  If there are two active listeners in a group and the
first one leaves, then the switch determines that the router does not
need this information since it does not affect the status of the
group from the router's point of view.  However, the next time there
is a routine query from the router the switch will forward the reply
from the remaining host, to prevent the router from believing there
are no active listeners.  It follows that in active IGMP snooping,
the router will generally only know about the most recently joined
member of the group.

In order for IGMP and thus IGMP snooping to function, a multicast
router must exist on the network and generate IGMP queries.  The
tables (holding the member ports for each multicast group) created
for snooping are associated with the querier.  Without a querier the
tables are not created and snooping will not work.  Furthermore, IGMP
general queries must be unconditionally forwarded by all switches
involved in IGMP snooping.  Some IGMP snooping implementations
include full querier capability.  Others are able to proxy and
retransmit queries from the multicast router.

Multicast Listener Discovery (MLD) [RFC2710], [RFC3810] is used by
IPv6 routers for discovering multicast listeners on a directly
attached link, performing a similar function to IGMP in IPv4
networks.  MLDv1 [RFC2710] is similar to IGMPv2 and MLDv2 [RFC3810],
[RFC4604] is similar to IGMPv3.  However, in contrast to IGMP, MLD
does not send its own distinct protocol messages.  Rather, MLD is a
subprotocol of ICMPv6 [RFC4443] and so MLD messages are a subset of
ICMPv6 messages.  MLD snooping works similarly to IGMP snooping,
described earlier.
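As a simplified sketch of the state an IGMP snooping switch
maintains, the Python fragment below models only the per-group member
ports and the router ports; real implementations also track queriers,
timers and per-VLAN state, and all names here are illustrative.

   # Simplified model of an IGMP snooping switch: per-group member
   # ports plus the set of ports leading to multicast routers.

   class SnoopingSwitch:
       def __init__(self, all_ports, router_ports):
           self.all_ports = set(all_ports)
           self.router_ports = set(router_ports)
           self.members = {}   # group address -> set of listener ports

       def igmp_report(self, group, port):
           self.members.setdefault(group, set()).add(port)

       def igmp_leave(self, group, port):
           self.members.get(group, set()).discard(port)

       def egress_ports(self, group, ingress_port):
           # Groups with known listeners go to member ports plus router
           # ports; unknown multicast is flooded to every port.
           members = self.members.get(group)
           out = (members | self.router_ports) if members else set(self.all_ports)
           return out - {ingress_port}

   sw = SnoopingSwitch(all_ports=range(1, 9), router_ports={1})
   sw.igmp_report("239.1.1.1", port=3)
   print(sw.egress_ports("239.1.1.1", ingress_port=5))   # {1, 3}
   print(sw.egress_ports("239.2.2.2", ingress_port=5))   # flooded to all but 5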
3.3.  Example use cases

A use case where PIM and IGMP are currently used in data centers is
to support multicast in VXLAN deployments.  In the original VXLAN
specification [RFC7348], a data-driven flood and learn control plane
was proposed, requiring the data center IP fabric to support
multicast routing.  A multicast group is associated with each virtual
network, each uniquely identified by its VXLAN network identifier
(VNI).  VXLAN tunnel endpoints (VTEPs), typically located in the
hypervisor or ToR switch, with local VMs that belong to this VNI
would join the multicast group and use it for the exchange of BUM
traffic with the other VTEPs.  Essentially, the VTEP would
encapsulate any BUM traffic from attached VMs in an IP multicast
packet, whose destination address is the associated multicast group
address, and transmit the packet to the data center fabric.  Thus,
PIM must be running in the fabric to maintain a multicast
distribution tree per VNI.

Alternatively, rather than setting up a multicast distribution tree
per VNI, a tree can be set up whenever hosts within the VNI wish to
exchange multicast traffic.  For example, whenever a VTEP receives an
IGMP report from a locally connected host, it would translate this
into a PIM join message which will be propagated into the IP fabric.
In order to ensure this join message is sent to the IP fabric rather
than over the VXLAN interface (since the VTEP will have a route back
to the source of the multicast packet over the VXLAN interface and so
would naturally attempt to send the join over this interface), a more
specific route back to the source over the IP fabric must be
configured.  In this approach PIM must be configured on the SVIs
associated with the VXLAN interface.

Another use case of PIM and IGMP in data centers is when IPTV servers
use multicast to deliver content from the data center to end users.
IPTV is typically a one-to-many application where the hosts are
configured for IGMPv3, the switches are configured with IGMP
snooping, and the routers are running PIM-SSM mode.  Often redundant
servers send multicast streams into the network and the network
forwards the data across diverse paths.  Similarly, Windows Media
servers send multicast streams to clients.  Windows Media Services
streams to an IP multicast address and all clients subscribe to the
IP address to receive the same stream.  This allows a single stream
to be played simultaneously by multiple clients, thus reducing
bandwidth utilization.
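The following Python sketch illustrates the forwarding decision a
VTEP makes under the flood-and-learn scheme described at the start of
this section; the VNI-to-group assignment, addresses and names are
illustrative only.

   # Illustrative VTEP forwarding decision for flood-and-learn VXLAN:
   # known unicast MACs are tunnelled to the learned remote VTEP, while
   # BUM frames are encapsulated towards the group assigned to the VNI.

   BROADCAST = "ff:ff:ff:ff:ff:ff"

   class Vtep:
       def __init__(self, vni_to_group):
           self.vni_to_group = vni_to_group   # VNI -> multicast group
           self.mac_table = {}                # (VNI, MAC) -> remote VTEP IP

       def learn(self, vni, mac, remote_vtep_ip):
           self.mac_table[(vni, mac)] = remote_vtep_ip

       def outer_destination(self, vni, dst_mac, dst_is_multicast=False):
           remote = self.mac_table.get((vni, dst_mac))
           if dst_mac == BROADCAST or dst_is_multicast or remote is None:
               # BUM traffic: encapsulate towards the group for this VNI.
               return self.vni_to_group[vni]
           # Known unicast: tunnel directly to the learned remote VTEP.
           return remote

   vtep = Vtep({10010: "239.1.1.10"})
   vtep.learn(10010, "00:11:22:33:44:55", "192.0.2.11")
   print(vtep.outer_destination(10010, "00:11:22:33:44:55"))   # 192.0.2.11
   print(vtep.outer_destination(10010, BROADCAST))             # 239.1.1.10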
3.4.  Advantages and disadvantages

Arguably the biggest advantage of using PIM and IGMP to support one-
to-many communication in data centers is that these protocols are
relatively mature.  Consequently, PIM is available in most routers
and IGMP is supported by most hosts and routers.  As such, no
specialized hardware or relatively immature software is involved in
using them in data centers.  Furthermore, the maturity of these
protocols means their behaviour and performance in operational
networks is well-understood, with widely available best-practices and
deployment guides for optimizing their performance.

However, somewhat ironically, the relative disadvantages of PIM and
IGMP usage in data centers also stem mostly from their maturity.
Specifically, these protocols were standardized and implemented long
before the highly-virtualized multi-tenant data centers of today
existed.  Consequently, PIM and IGMP are neither optimally placed to
deal with the requirements of one-to-many communication in modern
data centers nor to exploit the characteristics and idiosyncrasies of
data centers.  For example, there may be thousands of VMs
participating in a multicast session, with some of these VMs
migrating to servers within the data center, new VMs being
continually spun up and wishing to join the sessions while all the
time other VMs are leaving.  In such a scenario, the churn in the PIM
and IGMP state machines, the volume of control messages they would
generate and the amount of state they would necessitate within
routers, especially if they were deployed naively, would be
untenable.

4.  Alternative options for handling one-to-many traffic

Section 2 has shown that there is likely to be an increasing amount
of one-to-many communication in data centers.  And Section 3 has
discussed how conventional multicast may be used to handle this
traffic.  Having said that, there are a number of alternative options
for handling this traffic pattern in data centers, as discussed in
the following subsections.  It should be noted that many of these
techniques are not mutually exclusive; in fact many deployments
involve a combination of more than one of these techniques.
Furthermore, as will be shown, introducing a centralized controller
or a distributed control plane makes these techniques more potent.

4.1.  Minimizing traffic volumes

If handling one-to-many traffic in data centers can be challenging
then arguably the most intuitive solution is to aim to minimize the
volume of such traffic.  It was previously mentioned in Section 2
that the three main causes of one-to-many traffic in data centers are
applications, overlays and protocols.  While, relatively speaking,
little can be done about the volume of one-to-many traffic generated
by applications, there is more scope for attempting to reduce the
volume of such traffic generated by overlays and protocols (and often
by protocols within overlays).
This reduction is possible by exploiting certain characteristics of
data center networks: a fixed and regular topology, ownership and
exclusive control by a single organization, well-known overlay
encapsulation endpoints etc.  A way of minimizing the amount of one-
to-many traffic that traverses the data center fabric is to use a
centralized controller.  For example, whenever a new VM is
instantiated, the hypervisor or encapsulation endpoint can notify a
centralized controller of this new MAC address, the associated
virtual network, IP address etc.  The controller could subsequently
distribute this information to every encapsulation endpoint.
Consequently, when any endpoint receives an ARP request from a
locally attached VM, it could simply consult its local copy of the
information distributed by the controller and reply.  Thus, the ARP
request is suppressed and does not result in one-to-many traffic
traversing the data center IP fabric.

Alternatively, the functionality supported by the controller can be
realized by a distributed control plane.  BGP-EVPN [RFC7432],
[RFC8365] is the most popular control plane used in data centers.
Typically, the encapsulation endpoints will exchange pertinent
information with each other by all peering with a BGP route reflector
(RR).  Thus, information about local MAC addresses, MAC to IP address
mappings, virtual network identifiers etc can be disseminated.
Consequently, ARP requests from local VMs can be suppressed by the
encapsulation endpoint.
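A minimal Python sketch of the ARP suppression behaviour just
described is shown below, assuming the IP-to-MAC mappings have
already been populated by a centralized controller or a BGP-EVPN
control plane; all names are illustrative.

   # Simplified ARP suppression at an encapsulation endpoint: the table
   # is assumed to be filled by the control plane, not learned by
   # flooding.

   class Endpoint:
       def __init__(self):
           self.ip_to_mac = {}   # virtual network -> {IP -> MAC}

       def control_plane_update(self, vn, ip, mac):
           self.ip_to_mac.setdefault(vn, {})[ip] = mac

       def handle_arp_request(self, vn, target_ip):
           mac = self.ip_to_mac.get(vn, {}).get(target_ip)
           if mac is not None:
               return ("reply-locally", mac)   # ARP request suppressed
           return ("flood", None)              # fall back to BUM delivery

   ep = Endpoint()
   ep.control_plane_update("tenant-a", "10.0.0.5", "00:aa:bb:cc:dd:05")
   print(ep.handle_arp_request("tenant-a", "10.0.0.5"))
   print(ep.handle_arp_request("tenant-a", "10.0.0.9"))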
4.2.  Head end replication

A popular option for handling one-to-many traffic patterns in data
centers is head end replication (HER).  HER means the traffic is
duplicated and sent to each end point individually using conventional
IP unicast.  Obvious disadvantages of HER include traffic duplication
and the additional processing burden on the head end.  Nevertheless,
HER is especially attractive when overlays are in use, as the
replication can be carried out by the hypervisor or encapsulation end
point.  Consequently, the VMs and IP fabric are unmodified and
unaware of how the traffic is delivered to the multiple end points.

Additionally, it is possible to use a number of approaches for
constructing and disseminating the list of which endpoints should
receive what traffic and so on.  For example, the reluctance of data
center operators to enable PIM and IGMP within the data center fabric
means VXLAN is often used with HER.  Thus, BUM traffic from each VNI
is replicated and sent using unicast to the remote VTEPs with VMs in
that VNI.  The list of remote VTEPs to which the traffic should be
sent may be configured manually on the VTEP.  Alternatively, the
VTEPs may transmit appropriate state to a centralized controller
which in turn sends each VTEP the list of remote VTEPs for each VNI.
Lastly, HER also works well when a distributed control plane is used
instead of the centralized controller.  Again, BGP-EVPN may be used
to distribute the information needed to facilitate HER to the VTEPs.
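The sketch below, again in Python and purely illustrative, shows the
essence of HER at a VTEP: a BUM frame for a VNI is copied and sent by
unicast to each remote VTEP on the flood list for that VNI, however
that list was configured or learned.

   # Sketch of head end replication (HER): BUM traffic for a VNI is
   # copied and unicast to each remote VTEP on that VNI's flood list.

   def head_end_replicate(frame: bytes, vni: int, flood_list: dict) -> list:
       # flood_list maps a VNI to the remote VTEPs with VMs in that VNI.
       copies = []
       for remote_vtep in flood_list.get(vni, []):
           # A real implementation would VXLAN-encapsulate the frame
           # here; this sketch just records the unicast destination.
           copies.append((remote_vtep, frame))
       return copies

   flood_list = {10010: ["192.0.2.11", "192.0.2.12", "192.0.2.13"]}
   for dest, copy in head_end_replicate(b"bum-frame", 10010, flood_list):
       print("send copy to", dest)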
4.3.  BIER

As discussed in Section 3.4, PIM and IGMP face potential scalability
challenges when deployed in data centers.  These challenges are
typically due to the requirement to build and maintain a distribution
tree and the requirement to hold per-flow state in routers.  Bit
Index Explicit Replication (BIER) [RFC8279] is a new multicast
forwarding paradigm that avoids these two requirements.

When a multicast packet enters a BIER domain, the ingress router,
known as the Bit-Forwarding Ingress Router (BFIR), adds a BIER header
to the packet.  This header contains a bit string in which each bit
maps to an egress router, known as a Bit-Forwarding Egress Router
(BFER).  If a bit is set, then the packet should be forwarded to the
associated BFER.  The routers within the BIER domain, Bit-Forwarding
Routers (BFRs), use the BIER header in the packet and information in
the Bit Index Forwarding Table (BIFT) to carry out simple bit-wise
operations to determine how the packet should be replicated optimally
so it reaches all the appropriate BFERs.  BIER is deemed to be
attractive for facilitating one-to-many communications in data
centers [I-D.ietf-bier-use-cases].

The deployment envisioned with overlay networks is that the
encapsulation endpoints would be the BFIRs.  So knowledge about the
actual multicast groups does not reside in the data center fabric,
improving the scalability compared to conventional IP multicast.
Additionally, a centralized controller or a BGP-EVPN control plane
may be used with BIER to ensure the BFIRs have the required
information.  A challenge associated with using BIER is that, unlike
most of the other approaches discussed in this draft, it requires
changes to the forwarding behaviour of the routers used in the data
center IP fabric.
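To make the bit-wise forwarding operation more concrete, the
following much-simplified Python sketch follows the spirit of the
forwarding procedure in [RFC8279]; the BIFT contents are invented for
illustration and details such as BIER sets, sub-domains and the
actual encapsulation are omitted.

   # Much-simplified BIER forwarding: the packet's bit string names the
   # BFERs that must receive it; the BIFT maps each bit position to a
   # forwarding bit mask (F-BM) and a neighbor.

   def bier_forward(bitstring: int, bift: dict) -> list:
       copies = []
       while bitstring:
           bp = (bitstring & -bitstring).bit_length() - 1   # lowest set bit
           f_bm, neighbor = bift[bp]
           copies.append((neighbor, bitstring & f_bm))      # copy to neighbor
           bitstring &= ~f_bm                               # bits now served
       return copies

   # Illustrative BIFT: BFERs 0 and 1 are reached via neighbor A,
   # BFER 2 via neighbor B.
   bift = {
       0: (0b011, "A"),
       1: (0b011, "A"),
       2: (0b100, "B"),
   }
   print(bier_forward(0b111, bift))   # [('A', 3), ('B', 4)]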
4.4.  Segment Routing

Segment Routing (SR) [I-D.ietf-spring-segment-routing] adopts the
source routing paradigm, in which the manner in which a packet
traverses a network is determined by an ordered list of instructions.
These instructions, known as segments, may have a semantic local to
an SR node or global within an SR domain.  SR allows enforcing a flow
through any topological path while maintaining per-flow state only at
the ingress node to the SR domain.  Segment Routing can be applied to
the MPLS and IPv6 data planes.  In the former, the list of segments
is represented by the label stack and in the latter it is represented
as a routing extension header.  Use cases are described in
[I-D.ietf-spring-segment-routing] and are being considered in the
context of BGP-based large-scale data center (DC) design [RFC7938].

Multicast in SR continues to be discussed in a variety of drafts and
working groups.  The SPRING WG has not yet been chartered to work on
multicast in SR.  Multicast can include locally allocating a Segment
Identifier (SID) to existing replication solutions, such as PIM,
mLDP, P2MP RSVP-TE and BIER.  It may also be that a new way to signal
and install trees in SR is developed without creating state in the
network.

5.  Conclusions

As the volume and importance of one-to-many traffic in data centers
increase, conventional IP multicast is likely to become increasingly
unattractive for deployment in data centers for a number of reasons,
mostly pertaining to its inherently poor scalability and its
inability to exploit characteristics of data center network
architectures.  Hence, even though IGMP/MLD is likely to remain the
most popular manner in which end hosts signal interest in joining a
multicast group, it is unlikely that this multicast traffic will be
transported over the data center IP fabric using a multicast
distribution tree built by PIM.  Rather, approaches which exploit
characteristics of data center network architectures (e.g. fixed and
regular topology, ownership and exclusive control by a single
organization, well-known overlay encapsulation endpoints etc.) are
better placed to deliver one-to-many traffic in data centers,
especially when judiciously combined with a centralized controller
and/or a distributed control plane (particularly one based on BGP-
EVPN).

6.  IANA Considerations

This memo includes no request to IANA.

7.  Security Considerations

No new security considerations result from this document.

8.  Acknowledgements

The authors would like to thank the many individuals who contributed
opinions on the ARMD WG mailing list about this topic: Linda Dunbar,
Anoop Ghanwani, Peter Ashwoodsmith, David Allan, Aldrin Isaac, Igor
Gashinsky, Michael Smith, Patrick Frejborg, Joel Jaeggli and Thomas
Narten.

9.  References

9.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.

9.2.  Informative References

   [I-D.ietf-bier-use-cases]
              Kumar, N., Asati, R., Chen, M., Xu, X., Dolganow, A.,
              Przygienda, T., Gulko, A., Robinson, D., Arya, V., and C.
              Bestler, "BIER Use Cases", draft-ietf-bier-use-cases-06
              (work in progress), January 2018.

   [I-D.ietf-nvo3-geneve]
              Gross, J., Ganga, I., and T. Sridhar, "Geneve: Generic
              Network Virtualization Encapsulation", draft-ietf-
              nvo3-geneve-06 (work in progress), March 2018.

   [I-D.ietf-nvo3-vxlan-gpe]
              Maino, F., Kreeger, L., and U. Elzur, "Generic Protocol
              Extension for VXLAN", draft-ietf-nvo3-vxlan-gpe-06 (work
              in progress), April 2018.

   [I-D.ietf-spring-segment-routing]
              Filsfils, C., Previdi, S., Ginsberg, L., Decraene, B.,
              Litkowski, S., and R. Shakir, "Segment Routing
              Architecture", draft-ietf-spring-segment-routing-15 (work
              in progress), January 2018.

   [RFC2236]  Fenner, W., "Internet Group Management Protocol, Version
              2", RFC 2236, DOI 10.17487/RFC2236, November 1997,
              <https://www.rfc-editor.org/info/rfc2236>.

   [RFC2710]  Deering, S., Fenner, W., and B. Haberman, "Multicast
              Listener Discovery (MLD) for IPv6", RFC 2710,
              DOI 10.17487/RFC2710, October 1999,
              <https://www.rfc-editor.org/info/rfc2710>.

   [RFC3376]  Cain, B., Deering, S., Kouvelas, I., Fenner, B., and A.
              Thyagarajan, "Internet Group Management Protocol, Version
              3", RFC 3376, DOI 10.17487/RFC3376, October 2002,
              <https://www.rfc-editor.org/info/rfc3376>.
   [RFC4601]  Fenner, B., Handley, M., Holbrook, H., and I. Kouvelas,
              "Protocol Independent Multicast - Sparse Mode (PIM-SM):
              Protocol Specification (Revised)", RFC 4601,
              DOI 10.17487/RFC4601, August 2006,
              <https://www.rfc-editor.org/info/rfc4601>.

   [RFC4607]  Holbrook, H. and B. Cain, "Source-Specific Multicast for
              IP", RFC 4607, DOI 10.17487/RFC4607, August 2006,
              <https://www.rfc-editor.org/info/rfc4607>.

   [RFC5015]  Handley, M., Kouvelas, I., Speakman, T., and L. Vicisano,
              "Bidirectional Protocol Independent Multicast (BIDIR-
              PIM)", RFC 5015, DOI 10.17487/RFC5015, October 2007,
              <https://www.rfc-editor.org/info/rfc5015>.

   [RFC6820]  Narten, T., Karir, M., and I. Foo, "Address Resolution
              Problems in Large Data Center Networks", RFC 6820,
              DOI 10.17487/RFC6820, January 2013,
              <https://www.rfc-editor.org/info/rfc6820>.

   [RFC7348]  Mahalingam, M., Dutt, D., Duda, K., Agarwal, P., Kreeger,
              L., Sridhar, T., Bursell, M., and C. Wright, "Virtual
              eXtensible Local Area Network (VXLAN): A Framework for
              Overlaying Virtualized Layer 2 Networks over Layer 3
              Networks", RFC 7348, DOI 10.17487/RFC7348, August 2014,
              <https://www.rfc-editor.org/info/rfc7348>.

   [RFC7432]  Sajassi, A., Ed., Aggarwal, R., Bitar, N., Isaac, A.,
              Uttaro, J., Drake, J., and W. Henderickx, "BGP MPLS-Based
              Ethernet VPN", RFC 7432, DOI 10.17487/RFC7432, February
              2015, <https://www.rfc-editor.org/info/rfc7432>.

   [RFC7637]  Garg, P., Ed. and Y. Wang, Ed., "NVGRE: Network
              Virtualization Using Generic Routing Encapsulation",
              RFC 7637, DOI 10.17487/RFC7637, September 2015,
              <https://www.rfc-editor.org/info/rfc7637>.

   [RFC7938]  Lapukhov, P., Premji, A., and J. Mitchell, Ed., "Use of
              BGP for Routing in Large-Scale Data Centers", RFC 7938,
              DOI 10.17487/RFC7938, August 2016,
              <https://www.rfc-editor.org/info/rfc7938>.

   [RFC8014]  Black, D., Hudson, J., Kreeger, L., Lasserre, M., and T.
              Narten, "An Architecture for Data-Center Network
              Virtualization over Layer 3 (NVO3)", RFC 8014,
              DOI 10.17487/RFC8014, December 2016,
              <https://www.rfc-editor.org/info/rfc8014>.

   [RFC8279]  Wijnands, IJ., Ed., Rosen, E., Ed., Dolganow, A.,
              Przygienda, T., and S. Aldrin, "Multicast Using Bit Index
              Explicit Replication (BIER)", RFC 8279,
              DOI 10.17487/RFC8279, November 2017,
              <https://www.rfc-editor.org/info/rfc8279>.

   [RFC8365]  Sajassi, A., Ed., Drake, J., Ed., Bitar, N., Shekhar, R.,
              Uttaro, J., and W. Henderickx, "A Network Virtualization
              Overlay Solution Using Ethernet VPN (EVPN)", RFC 8365,
              DOI 10.17487/RFC8365, March 2018,
              <https://www.rfc-editor.org/info/rfc8365>.

Authors' Addresses

   Mike McBride
   Huawei

   Email: michael.mcbride@huawei.com


   Olufemi Komolafe
   Arista Networks

   Email: femi@arista.com