draft-ietf-v6ops-pmtud-ecmp-problem-00.txt | draft-ietf-v6ops-pmtud-ecmp-problem-01.txt | |||
---|---|---|---|---|
v6ops M. Byerly | v6ops M. Byerly | |||
Internet-Draft Fastly | Internet-Draft Fastly | |||
Intended status: Informational M. Hite | Intended status: Informational M. Hite | |||
Expires: September 2, 2015 Evernote | Expires: November 20, 2015 Evernote | |||
J. Jaeggli | J. Jaeggli | |||
Fastly | Fastly | |||
March 1, 2015 | May 19, 2015 | |||
Close encounters of the ICMP type 2 kind (near misses with ICMPv6 PTB) | Close encounters of the ICMP type 2 kind (near misses with ICMPv6 PTB) | |||
draft-ietf-v6ops-pmtud-ecmp-problem-00 | draft-ietf-v6ops-pmtud-ecmp-problem-01 | |||
Abstract | Abstract | |||
This document calls attention to the problem of delivering ICMPv6 | This document calls attention to the problem of delivering ICMPv6 | |||
type 2 "Packet Too Big" (PTB) messages to the intended destination in | type 2 "Packet Too Big" (PTB) messages to the intended destination in | |||
ECMP load balanced, or anycast network architectures. It discusses | ECMP load balanced or anycast network architectures. It discusses | |||
operational mitigations that can be employed to address this class of | operational mitigations that can be employed to address this class of | |||
failure. | failure. | |||
Status of This Memo | Status of This Memo | |||
This Internet-Draft is submitted in full conformance with the | This Internet-Draft is submitted in full conformance with the | |||
provisions of BCP 78 and BCP 79. | provisions of BCP 78 and BCP 79. | |||
Internet-Drafts are working documents of the Internet Engineering | Internet-Drafts are working documents of the Internet Engineering | |||
Task Force (IETF). Note that other groups may also distribute | Task Force (IETF). Note that other groups may also distribute | |||
working documents as Internet-Drafts. The list of current Internet- | working documents as Internet-Drafts. The list of current Internet- | |||
Drafts is at http://datatracker.ietf.org/drafts/current/. | Drafts is at http://datatracker.ietf.org/drafts/current/. | |||
Internet-Drafts are draft documents valid for a maximum of six months | Internet-Drafts are draft documents valid for a maximum of six months | |||
and may be updated, replaced, or obsoleted by other documents at any | and may be updated, replaced, or obsoleted by other documents at any | |||
time. It is inappropriate to use Internet-Drafts as reference | time. It is inappropriate to use Internet-Drafts as reference | |||
material or to cite them other than as "work in progress." | material or to cite them other than as "work in progress." | |||
This Internet-Draft will expire on September 2, 2015. | This Internet-Draft will expire on November 20, 2015. | |||
Copyright Notice | Copyright Notice | |||
Copyright (c) 2015 IETF Trust and the persons identified as the | Copyright (c) 2015 IETF Trust and the persons identified as the | |||
document authors. All rights reserved. | document authors. All rights reserved. | |||
This document is subject to BCP 78 and the IETF Trust's Legal | This document is subject to BCP 78 and the IETF Trust's Legal | |||
Provisions Relating to IETF Documents | Provisions Relating to IETF Documents | |||
(http://trustee.ietf.org/license-info) in effect on the date of | (http://trustee.ietf.org/license-info) in effect on the date of | |||
publication of this document. Please review these documents | publication of this document. Please review these documents | |||
skipping to change at page 2, line 15 | skipping to change at page 2, line 15 | |||
the Trust Legal Provisions and are provided without warranty as | the Trust Legal Provisions and are provided without warranty as | |||
described in the Simplified BSD License. | described in the Simplified BSD License. | |||
Table of Contents | Table of Contents | |||
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2 | 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2 | |||
2. Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 | 2. Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 | |||
3. Mitigation . . . . . . . . . . . . . . . . . . . . . . . . . 4 | 3. Mitigation . . . . . . . . . . . . . . . . . . . . . . . . . 4 | |||
3.1. Alternatives . . . . . . . . . . . . . . . . . . . . . . 5 | 3.1. Alternatives . . . . . . . . . . . . . . . . . . . . . . 5 | |||
3.2. Implementation . . . . . . . . . . . . . . . . . . . . . 5 | 3.2. Implementation . . . . . . . . . . . . . . . . . . . . . 5 | |||
4. Improvements . . . . . . . . . . . . . . . . . . . . . . . . 6 | 3.2.1. Alternatives . . . . . . . . . . . . . . . . . . . . 6 | |||
4. Improvements . . . . . . . . . . . . . . . . . . . . . . . . 7 | ||||
5. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 7 | 5. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 7 | |||
6. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 7 | 6. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 7 | |||
7. Security Considerations . . . . . . . . . . . . . . . . . . . 7 | 7. Security Considerations . . . . . . . . . . . . . . . . . . . 7 | |||
8. Informative References . . . . . . . . . . . . . . . . . . . 7 | 8. Informative References . . . . . . . . . . . . . . . . . . . 8 | |||
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 7 | Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 8 | |||
1. Introduction | 1. Introduction | |||
Operators of popular Internet services face complex challenges | Operators of popular Internet services face complex challenges | |||
associated with scaling their infrastructure. One approach is to | associated with scaling their infrastructure. One approach is to | |||
utilize equal-cost multi-path (ECMP) routing to perform stateless | utilize equal-cost multi-path (ECMP) routing to perform stateless | |||
distribution of incoming TCP or UDP sessions to multiple servers or | distribution of incoming TCP or UDP sessions to multiple servers or | |||
to middle boxes such as load balancers. Distribution of traffic in | to middle boxes such as load balancers. Distribution of traffic in | |||
this manner presents a problem when dealing with ICMP signaling. | this manner presents a problem when dealing with ICMP signaling. | |||
Specifically, an ICMP error is not guaranteed to hash via ECMP to the | Specifically, an ICMP error is not guaranteed to hash via ECMP to the | |||
same destination as its corresponding TCP or UDP session. A case | same destination as its corresponding TCP or UDP session. A case | |||
where this is particularly problematic operationally is path MTU | where this is particularly problematic operationally is path MTU | |||
discovery (PMTUD). | discovery (PMTUD). | |||
2. Problem | 2. Problem | |||
A common application for stateless load balancing of TCP or UDP flows | A common application for stateless load balancing of TCP or UDP flows | |||
is to perform an initial subdivision of flows in front of a stateful | is to perform an initial subdivision of flows in front of a stateful | |||
load balancer tier or multiple servers, so that the workload becomes | load balancer tier or multiple servers so that the workload becomes | |||
divided into manageable fractions of the total number of flows. The | divided into manageable fractions of the total number of flows. The | |||
flow division is performed using ECMP forwarding and a stateless but | flow division is performed using ECMP forwarding and a stateless but | |||
sticky algorithm for hashing across the available paths. This | sticky algorithm for hashing across the available paths. This | |||
nexthop selection for the purposes of flow distribution is a | nexthop selection for the purposes of flow distribution is a | |||
constrained form of anycast topology, where all anycast destinations | constrained form of anycast topology where all anycast destinations | |||
are equidistant from the upstream router responsible for making the | are equidistant from the upstream router responsible for making the | |||
last next-hop forwarding decision before the flow arrives on the | last next-hop forwarding decision before the flow arrives on the | |||
destination device. In this approach, the hash is performed across | destination device. In this approach, the hash is performed across | |||
some set of available protocol headers. Typically, these headers may | some set of available protocol headers. Typically, these headers may | |||
include all or a subset of (IPv6)Flow-Label, IP-source, IP- | include all or a subset of (IPv6) Flow-Label, IP-source, IP- | |||
destination, protocol, source-port, destination-port and potentially | destination, protocol, source-port, destination-port and potentially | |||
others such as ingress interface. | others such as ingress interface. | |||
A problem common to this approach of distribution through hashing is | A problem common to this approach of distribution through hashing is | |||
impact on path MTU discovery. An ICMPv6 type 2 PTB message generated | impact on path MTU discovery. An ICMPv6 type 2 PTB message generated | |||
on an intermediate device for a packet sent from an a server that is | on an intermediate device for a packet sent from a server that is | |||
part of an ECMP load balanced service to a client, will have the | part of an ECMP load balanced service to a client will have the load | |||
load-balanced anycast address as the destination and would be | balanced anycast address as the destination and hence will be | |||
statelessly load balanced to one of the servers. While the ICMPv6 | statelessly load balanced to one of the servers. While the ICMPv6 | |||
PTB message contains as much of the packet that could not be | PTB message contains as much of the packet that could not be | |||
forwarded as possible, the payload headers are not considered into | forwarded as possible, the payload headers are not considered in the | |||
the forwarding decision and are ignored. Because the PTB message is | forwarding decision and are ignored. Because the PTB message is not | |||
not identifiable as part of the original flow by the IP or upper | identifiable as part of the original flow by the IP or upper layer | |||
layer packet headers the results of the ICMPv6 ECMP hash are unlikely | packet headers, the results of the ICMPv6 ECMP hash are unlikely to | |||
to be hashed to the same nexthop as packets matching TCP or UDP ECMP | be hashed to the same nexthop as packets matching TCP or UDP ECMP | |||
hash. | hash. | |||
An example packet flow and topology follow. | An example packet flow and topology follow. | |||
ptb -> router ecmp -> nexthop L4/L7 load balancer -> destination | ptb -> router ecmp -> nexthop L4/L7 load balancer -> destination | |||
router --> load balancer 1 ---> | router --> load balancer 1 ---> | |||
\\--> load balancer 2 ---> load-balanced service | \\--> load balancer 2 ---> load-balanced service | |||
\--> load balancer N ---> | \--> load balancer N ---> | |||
Figure 1 | Figure 1 | |||
The router ECMP decision is used because it is part of the forwarding | The router ECMP decision is used because it is part of the forwarding | |||
architecture, can be performed at line rate, and does not depend on | architecture, can be performed at line rate, and does not depend on | |||
shared state or coordination across a distributed forwarding system | shared state or coordination across a distributed forwarding system | |||
which may include multiple linecards or routers. The ECMP routing | which may include multiple linecards or routers. The ECMP routing | |||
decision is deterministic with respect to packets having the same | decision is deterministic with respect to packets having the same | |||
computed hash. | computed hash. | |||
Atypical case where ICMPv6 PTB messages are received at the load | A typical case where ICMPv6 PTB messages are received at the load | |||
balancer is a case where the path MTU from the client to the load | balancer is a case where the path MTU from the client to the load | |||
balancer is limited by a tunnel of which the client itself is not | balancer is limited by a tunnel of which the client itself is not | |||
aware. In the common case of a TCP flow where TLS is employed, | aware. | |||
the first packet sent from the server that is likely to exceed a | ||||
tunnel MTU lower than that specified by the MSS on the client and the | ||||
load balancer/server is the TLS ServerHello and certificate. | ||||
Direct experience says that the frequency of PTB messages is small | Direct experience says that the frequency of PTB messages is small | |||
compared to total flows. One possible conclusion being that tunneled | compared to total flows. One possible conclusion being that tunneled | |||
IPv6 deployments that cannot carry 1500 mtu packets are relatively | IPv6 deployments that cannot carry 1500 MTU packets are relatively | |||
rare. Techniques employed by clients such as happy-eyeballs may | rare. Techniques employed by clients such as happy-eyeballs may | |||
actually contribute some amelioration to the IPv6 client experience | actually contribute some amelioration to the IPv6 client experience | |||
by preferring IPv4 in cases that might be identified as failures. | by preferring IPv4 in cases that might be identified as failures. | |||
Still, the expectation of operators is that PMTUD should work and | Still, the expectation of operators is that PMTUD should work and | |||
that unnecessary breakage of client traffic should be avoided. | that unnecessary breakage of client traffic should be avoided. | |||
A final observation regarding server tuning is that it is not always | A final observation regarding server tuning is that it is not always | |||
possible even if it is potentially desirable to be able to | possible even if it is potentially desirable to be able to | |||
independently set the TCP MSS for different address families on end- | independently set the TCP MSS for different address families on some | |||
systems. | end-systems. On Linux platforms, advmss may be set on a per route | |||
basis for selected destinations in cases where discrimination by | ||||
route is possible. | ||||
The problem as described does also impact IPv4; however, the ability | The problem as described does also impact IPv4; however, | |||
to fragment on wire at tunnel ingress points and the relative rarity | implementation of RFC 4821 [RFC4821] TCP MTU probing, the ability to | |||
of sub-1500 byte MTUs that are not coupled to changes in client | fragment on wire at tunnel ingress points and the relative rarity of | |||
behavior (for example, endpoint VPN clients set the tunnel interface | sub-1500 byte MTUs that are not coupled to changes in client behavior | |||
MTU accordingly for performance reasons) makes the problem | (for example, endpoint VPN clients set the tunnel interface MTU | |||
sufficiently rare that some existing deployments simply choose to | accordingly for performance reasons) makes the problem sufficiently | |||
ignore it. | rare that some existing deployments have chosen to ignore it. | |||
3. Mitigation | 3. Mitigation | |||
Mitigation of the potential for PTB messages to be mis-delivered | Mitigation of the potential for PTB messages to be mis-delivered | |||
involves ensuring that an ICMPv6 error message is distributed to the | involves ensuring that an ICMPv6 error message is distributed to the | |||
same anycast server responsible for the flow for which the error is | same anycast server responsible for the flow for which the error is | |||
generated. Ideally Mitigation could be done by the mechanism hosts | generated. Ideally, mitigation could be done by the mechanism hosts | |||
use to identify the flow, by looking into the payload of the ICMPv6 | use to identify the flow, by looking into the payload of the ICMPv6 | |||
message (to determine which TCP flow it was associated with) before | message (to determine which TCP flow it was associated with) before | |||
making a forwarding decision. Because the encapsulated IP header | making a forwarding decision. Because the encapsulated IP header | |||
occurs at a fixed offset in the icmp message it is not outside the | occurs at a fixed offset in the ICMP message it is not outside the | |||
realm of possibility that routers with sufficient header processing | realm of possibility that routers with sufficient header processing | |||
capability could parse that far into the payload. Employing a | capability could parse that far into the payload. Employing a | |||
mediation device that handles the parsing and distribution of PTB | mediation device that handles the parsing and distribution of PTB | |||
messages after policy routing or on each load-balancer/server is a | messages after policy routing or on each load-balancer/server is a | |||
possibility. | possibility. | |||
Another mitigation approach is predicated upon distributing the PTB | Another mitigation approach is predicated upon distributing the PTB | |||
message to all anycast servers under the assumption that the one for | message to all anycast servers under the assumption that the one for | |||
which the message was intended will be able to match it to the flow | which the message was intended will be able to match it to the flow | |||
and update the route cache with the new MTU, devices not able to | and update the route cache with the new MTU and that devices not able | |||
match the flow will discard these packets. Such distribution has | to match the flow will discard these packets. Such distribution has | |||
potentially significant implications for resource consumption and the | potentially significant implications for resource consumption and the | |||
potential for self-inflicted denial-of-service if not carefully | potential for self-inflicted denial-of-service if not carefully | |||
employed. Fortunately, in real-world-deployment we have observed | employed. Fortunately, in real-world deployments we have observed | |||
that, the number of flows for which this problem occurs is relatively | that the number of flows for which this problem occurs is relatively | |||
small (example, 10 or fewer pps on 1Gb/s or more worth of https | small (example, 10 or fewer pps on 1Gb/s or more worth of https | |||
traffic) and sensible ingress rate limiters which will discard | traffic) and sensible ingress rate limiters which will discard | |||
excessive message volume can be applied to protect even very large | excessive message volume can be applied to protect even very large | |||
anycast server tiers with the potential for fallout only under | anycast server tiers with the potential for fallout only under | |||
circumstances of deliberate duress. | circumstances of deliberate duress. | |||
3.1. Alternatives | 3.1. Alternatives | |||
As an alternative, it may be appropriate to lower the TCP MSS to 1220 | As an alternative, it may be appropriate to lower the TCP MSS to 1220 | |||
in order to accommodate 1280 byte MTU. We consider this undesirable | in order to accommodate 1280 byte MTU. We consider this undesirable | |||
as hosts may not be able to independently set TCP MSS by address- | as hosts may not be able to independently set TCP MSS by address- | |||
family thereby impacting IPv4, or alternatively that it relies on a | family thereby impacting IPv4, or alternatively that it relies on a | |||
middle-box to clamp the MSS independently from the end-systems. | middle-box to clamp the MSS independently from the end-systems. | |||
3.2. Implementation | 3.2. Implementation | |||
1. Filter-based-forwarding matches next-header ICMPv6 type-2 and | 1. Filter-based-forwarding matches next-header ICMPv6 type-2 and | |||
matches a next-hop on a particular subnet directly attached to | matches a next-hop on a particular subnet directly attached to | |||
both border routers. The filter is policed to reasonable limits | both border routers. The filter is policed to reasonable limits | |||
(we chose 1000pps). | (we chose 1000pps; more conservative rates might be required in | |||
other implementations). | ||||
2. Filter is applied on input side of all external interfaces | 2. Filter is applied on input side of all external interfaces | |||
3. A proxy located at the next-hop forwards ICMPv6 type-2 packets | 3. A proxy located at the next-hop forwards ICMPv6 type-2 packets | |||
received at the next-hop to an Ethernet broadcast address | received at the next-hop to an Ethernet broadcast address | |||
(example ff:ff:ff:ff:ff:ff) on all specified subnets. This was | (example ff:ff:ff:ff:ff:ff) on all specified subnets. This was | |||
necessitated by router inability (in IPv6) to forward the same | necessitated by router inability (in IPv6) to forward the same | |||
packet to multiple unicast next-hops. | packet to multiple unicast next-hops. | |||
4. Anycast servers receive the PTB error and process packet as | 4. Anycast servers receive the PTB error and process packet as | |||
skipping to change at page 6, line 33 | skipping to change at page 6, line 33 | |||
if __name__ == '__main__': | if __name__ == '__main__': | |||
main() | main() | |||
This example script listens on all interfaces for IPv6 PTB errors | This example script listens on all interfaces for IPv6 PTB errors | |||
being forwarded using filter-based-forwarding. It removes the | being forwarded using filter-based-forwarding. It removes the | |||
existing Ethernet source and rewrites a new Ethernet destination of | existing Ethernet source and rewrites a new Ethernet destination of | |||
the Ethernet broadcast address. It then sends the resulting frame | the Ethernet broadcast address. It then sends the resulting frame | |||
out the p2p1 and p2p2 interfaces where our anycast servers reside. | out the p2p1 and p2p2 interfaces where our anycast servers reside. | |||
Alternatively, network designs in which a common layer2 network | 3.2.1. Alternatives | |||
exists could rewrite the destination on the end system, for example | ||||
in using iptables before forwarding the packet back to the network | Alternatively, network designs in which a common layer 2 network | |||
containing all of the server or load balancer interfaces. | exists on the ECMP hop could distribute the proxy onto the end | |||
systems, eliminating the need for policy routing. They could then | ||||
rewrite the destination -- for example, using iptables before | ||||
forwarding the packet back to the network containing all of the | ||||
server or load balancer interfaces. This implementation can be done | ||||
entirely within the Linux iptables firewall. Because of the | ||||
distributed nature of the filter, more conservative rate limits are | ||||
required than when a global rate limit can be employed. | ||||
An example ip6tables / nftables rule to match icmp6 traffic, not | ||||
match broadcast traffic, impose a rate limit of 10 pps, and pass to a | ||||
target destination would resemble: | ||||
ip6tables -I INPUT -i lo -p icmpv6 -m icmpv6 --icmpv6-type 2/0 \ | ||||
-m pkttype ! --pkt-type broadcast -m limit --limit 10/second \ | ||||
-j TEE 2001:DB8::1 | ||||
As with the scapy example, once the destination has been rewritten | ||||
from a hardcoded ND entry to an Ethernet broadcast address -- in this | ||||
case to an IPv6 documentation address -- the traffic will be | ||||
reflected to all the hosts on the subnet. | ||||
4. Improvements | 4. Improvements | |||
There are several ways that improvements could be made to the | There are several ways that improvements could be made to the | |||
situation with respect to ECMP load balancing of ICMPv6 PTB. | situation with respect to ECMP load balancing of ICMPv6 PTB. | |||
1. Routers with sufficient capacity within the lookup process could | 1. Routers with sufficient capacity within the lookup process could | |||
parse all the way through the L3 or L4 header in the ICMPv6 | parse all the way through the L3 or L4 header in the ICMPv6 | |||
payload beginning at bit offset 32 of the ICMP header. By | payload beginning at bit offset 32 of the ICMP header. By | |||
reordering the elements of the hash to match the inward direction | reordering the elements of the hash to match the inward direction | |||
of the flow, the PTB error could be directed to the same next-hop | of the flow, the PTB error could be directed to the same next-hop | |||
as the incoming packets in the flow. | as the incoming packets in the flow. | |||
2. The FIB could be programmed with a multicast distribution tree | 2. The FIB (Forwarding Information Base) on the router could be | |||
that included all of the necessary next-hops. | programmed with a multicast distribution tree that included all | |||
of the necessary next-hops. | ||||
3. Ubiquitous implementation of RFC 4821 [RFC4821] Packetization | 3. Ubiquitous implementation of RFC 4821 [RFC4821] Packetization | |||
Layer Path MTU Discovery would probably go a long way towards | Layer Path MTU Discovery would probably go a long way towards | |||
reducing dependence on ICMPv6 PTB. | reducing dependence on ICMPv6 PTB. | |||
5. Acknowledgements | 5. Acknowledgements | |||
The authors would like to thank Mark Andrews, Brian Carpenter, Nick | The authors would like to thank Marek Majkowski for contributing | |||
Hilliard and Ray Hunter, for review. | text, examples, and a very close review. The authors would like to | |||
thank Mark Andrews, Brian Carpenter, Nick Hilliard and Ray Hunter, | ||||
for review. | ||||
6. IANA Considerations | 6. IANA Considerations | |||
This memo includes no request to IANA. | This memo includes no request to IANA. | |||
7. Security Considerations | 7. Security Considerations | |||
The employed mitigation has the potential to greatly amplify the | The employed mitigation has the potential to greatly amplify the | |||
impact of a deliberately malicious sending of ICMPv6 PTB messages. | impact of a deliberately malicious sending of ICMPv6 PTB messages. | |||
Sensible ingress rate limiting can reduce the potential for impact; | Sensible ingress rate limiting can reduce the potential for impact; | |||
skipping to change at page 7, line 44 | skipping to change at page 8, line 19 | |||
[RFC4821] Mathis, M. and J. Heffner, "Packetization Layer Path MTU | [RFC4821] Mathis, M. and J. Heffner, "Packetization Layer Path MTU | |||
Discovery", RFC 4821, March 2007. | Discovery", RFC 4821, March 2007. | |||
Authors' Addresses | Authors' Addresses | |||
Matt Byerly | Matt Byerly | |||
Fastly | Fastly | |||
Kapolei, HI | Kapolei, HI | |||
US | US | |||
Email: mbyerly@zynga.com | Email: suckawha@gmail.com | |||
Matt Hite | Matt Hite | |||
Evernote | Evernote | |||
Redwood City, CA | Redwood City, CA | |||
US | US | |||
Email: mhite@hotmail.com | Email: mhite@hotmail.com | |||
Joel Jaeggli | Joel Jaeggli | |||
Fastly | Fastly | |||
Mountain View, CA | Mountain View, CA | |||
US | US | |||
Email: joelja@gmail.com | Email: joelja@gmail.com | |||
End of changes. 27 change blocks. | ||||
50 lines changed or deleted | 75 lines changed or added | |||
This html diff was produced by rfcdiff 1.42. The latest version is available from http://tools.ietf.org/tools/rfcdiff/ |
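
The hashing problem described in Section 2, and the first improvement listed in Section 4, can be illustrated with a minimal sketch. This is not taken from either revision of the draft. The next-hop names, the SHA-256 based hash, and the function names are assumptions chosen for illustration only; real routers use vendor-specific hash functions and field selections.

```python
import hashlib
import ipaddress
import struct

# Hypothetical ECMP next hops (for example, a tier of load balancers).
NEXT_HOPS = ["lb1", "lb2", "lb3", "lb4"]


def ecmp_next_hop(src, dst, proto, sport=0, dport=0, hops=NEXT_HOPS):
    """Pick a next hop from a stateless hash over outer header fields.

    SHA-256 is only a stand-in for whatever hash a router implements.
    """
    key = (
        ipaddress.IPv6Address(src).packed
        + ipaddress.IPv6Address(dst).packed
        + struct.pack("!BHH", proto, sport, dport)
    )
    digest = hashlib.sha256(key).digest()
    return hops[int.from_bytes(digest[:4], "big") % len(hops)]


def ptb_next_hop_from_payload(inner_src, inner_dst, inner_proto,
                              inner_sport, inner_dport, hops=NEXT_HOPS):
    """Hash on the packet embedded in a PTB message instead of its outer header.

    The embedded packet travelled from server to client, so source and
    destination (addresses and ports) are swapped to match the inward
    direction of the flow before hashing.
    """
    return ecmp_next_hop(inner_dst, inner_src, inner_proto,
                         inner_dport, inner_sport, hops)


if __name__ == "__main__":
    client, anycast, router = "2001:db8:1::10", "2001:db8:ffff::1", "2001:db8:2::1"

    # Inbound TCP flow from the client to the anycast service address.
    flow_hop = ecmp_next_hop(client, anycast, 6, 51000, 443)

    # PTB hashed on its outer header: source is the intermediate router,
    # next header is ICMPv6 (58), and there are no ports. Nothing ties
    # this result to the next hop chosen for the TCP flow.
    outer_hop = ecmp_next_hop(router, anycast, 58)

    # PTB hashed on the embedded server-to-client packet, reordered.
    inner_hop = ptb_next_hop_from_payload(anycast, client, 6, 443, 51000)

    print("flow:", flow_hop, "ptb outer:", outer_hop, "ptb inner:", inner_hop)
```

Hashing on the outer header may or may not land on the same next hop as the flow, depending on the hash. Hashing on the reordered embedded header selects the same next hop by construction, because the hash input is then identical to that of the inbound flow.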