Customers sometimes report that TCP connections from On-premise to OCI hang for no apparent reason. This discussion introduces an important detail that is usually overlooked when troubleshooting hanging connections.
The case discussed here is when the TCP three-way handshake completes successfully but the connection hangs as soon as data starts to flow. Why? The handshake succeeds because the SYN and ACK packets carry no payload and are far smaller than any MTU on the path; only the full-size data segments that follow can exceed it. In modern TCP and UDP stack implementations the "Don't Fragment" (DF) bit is typically set in the IP header when encapsulating a TCP segment or a UDP datagram. This is used to implement PMTUD, which automatically discovers the lowest MTU on the path and avoids IP packet fragmentation between the sending and receiving hosts.
PMTUD stands for Path MTU Discovery, an automatic mechanism to discover the lowest MTU between two endpoints. PMTUD relies on ICMP Type 3 Code 4 messages sent by the routers along the path. When a router has to forward a packet that exceeds the MTU of the outgoing link but is not allowed to fragment it (because the "Don't Fragment" bit is set), it drops the packet and sends an ICMP message back to the sender host, announcing the MTU that must be used to avoid IP packet fragmentation. The sender host stores this value in the routing entry associated with the destination host for a period of time and uses it to size subsequent packets, avoiding fragmentation, which can impact performance.
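To make the message format concrete, here is a minimal Python sketch of how a sender could extract the announced MTU from an ICMP Type 3 Code 4 message. It assumes the 8-byte ICMP header layout described in RFC 1191; the function name `next_hop_mtu` and the sample message are illustrative only, and a real message also carries the IP header and the first bytes of the dropped packet after these 8 bytes.

```python
import struct

ICMP_DEST_UNREACHABLE = 3  # ICMP Type 3
ICMP_FRAG_NEEDED = 4       # Code 4: Fragmentation Needed and Don't Fragment was Set


def next_hop_mtu(icmp_message: bytes):
    """Return the MTU announced in an ICMP Type 3 Code 4 message, else None."""
    if len(icmp_message) < 8:
        return None
    # Layout: type (1 byte), code (1), checksum (2), unused (2), next-hop MTU (2)
    icmp_type, code, _checksum, _unused, mtu = struct.unpack(
        "!BBHHH", icmp_message[:8]
    )
    if icmp_type != ICMP_DEST_UNREACHABLE or code != ICMP_FRAG_NEEDED:
        return None
    return mtu


# Hypothetical message announcing an MTU of 1420 (checksum left at 0 for brevity)
sample = struct.pack("!BBHHH", 3, 4, 0, 0, 1420)
print(next_hop_mtu(sample))  # 1420
```

The checksum is not verified here; the operating system does that before the value ever reaches the routing entry.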
More details about PMTUD can be found in RFC 1191: https://www.ietf.org/rfc/rfc1191.txt
The full list of ICMP Types and Codes is available at https://www.iana.org/assignments/icmp-parameters/icmp-parameters.xhtml. We will focus on ICMP Type 3 (Destination Unreachable), Code 4 (Fragmentation Needed and Don't Fragment was Set).
Note: On the OCI side, PMTUD works automatically when stateful security rules are used: ICMP Type 3 Code 4 is allowed implicitly, so you do not need a security rule for it. When stateless security rules are used, ICMP Type 3 Code 4 must be explicitly allowed for PMTUD to work. More details can be found in the OCI public documentation: https://docs.cloud.oracle.com/iaas/Content/Network/Troubleshoot/connectionhang.htm
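As a rough illustration of the stateless case, the sketch below describes such a rule as a plain Python dictionary and checks whether a rule set lets ICMP Type 3 Code 4 through. The field names (`isStateless`, `protocol`, `icmpOptions`) are assumed to follow the JSON shape of OCI ingress security rules and should be verified against the OCI documentation; the helper `allows_frag_needed` and the source CIDR are purely hypothetical.

```python
# Hypothetical stateless ingress rule; field names are an assumption
# modeled on the OCI security rule JSON shape.
frag_needed_rule = {
    "isStateless": True,
    "protocol": "1",                         # IP protocol number 1 = ICMP
    "source": "0.0.0.0/0",                   # illustrative; narrow as needed
    "icmpOptions": {"type": 3, "code": 4},   # Fragmentation Needed, DF set
}


def allows_frag_needed(rules) -> bool:
    """Return True if any rule admits ICMP Type 3 Code 4.

    Simplification: the source CIDR of each rule is ignored.
    """
    for rule in rules:
        if rule.get("protocol") not in ("1", "all"):
            continue
        opts = rule.get("icmpOptions")
        if opts is None:
            return True  # no ICMP options: every type and code matches
        if opts.get("type") == 3 and opts.get("code") in (None, 4):
            return True  # a rule without a code matches every code of Type 3
    return False


print(allows_frag_needed([frag_needed_rule]))
```

A check like this is only a sanity aid for reviewing rule sets; it does not replace testing with real traffic.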
Traffic flows from On-premise to OCI from two subnets: Subnet1, with PMTUD active (ICMP Type 3 Code 4 is allowed), and Subnet2, with PMTUD inactive (ICMP Type 3 Code 4 is blocked and cannot reach 172.30.1.2). The On-premise Gateway (a Linux machine running LibreSwan for IPSec) has an MTU of 1500 bytes. The tunnel MTU on the OCI side is 1420 bytes, and the MTU on all hosts is set to 9000 bytes. The host MTU is intentionally set to 9000 so that the hosts send packets larger than 1500 bytes and trigger the PMTUD mechanism.
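The arithmetic behind this setup can be sketched as follows. The MTU values are the ones quoted above; the 20-byte IPv4 and TCP header sizes assume no IP or TCP options (TCP timestamps, for example, would reduce the payload further), and the MTU actually announced by the gateway may differ once IPSec overhead is taken into account.

```python
IP_HEADER = 20   # bytes, IPv4 header without options (assumption)
TCP_HEADER = 20  # bytes, TCP header without options (assumption)

HOST_MTU = 9000     # all hosts
GATEWAY_MTU = 1500  # On-premise Gateway
TUNNEL_MTU = 1420   # tunnel MTU on the OCI side


def tcp_payload(mtu: int) -> int:
    """Largest TCP payload that fits in a single IP packet of size `mtu`."""
    return mtu - IP_HEADER - TCP_HEADER


# The path MTU is the smallest MTU along the path.
path_mtu = min(HOST_MTU, GATEWAY_MTU, TUNNEL_MTU)

print(tcp_payload(HOST_MTU))  # 8960: what a host sends before PMTUD kicks in
print(tcp_payload(path_mtu))  # 1380: what actually fits through the tunnel
```

Any data segment between these two sizes leaves the host with the DF bit set and has to be dropped somewhere along the path.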
Traffic from Subnet1 (PMTUD active):
1. Start the iperf3 TCP traffic:
2. The LibreSwan VM sends an ICMP Type 3 Code 4 message back to the originating host, announcing that the IP packet is too big and needs to be fragmented but the DF bit is set. The message also includes the MTU that should be used:
3. The sender VM updates its route table with the MTU for this particular destination and uses it for about 600 seconds (Linux). The connection works fine. The same happens with ssh over the IPSec tunnel: the packets sent out by the sending host do not exceed the announced MTU:
Traffic from Subnet2 (PMTUD inactive):
1. We try to connect via ssh over the IPSec tunnel to 192.168.12.242 (the connection hangs):
2. The next hop (LibreSwan VM) sends the ICMP Type 3 Code 4 message back to 172.30.1.2:
3. Because ICMP Type 3 Code 4 is filtered in this subnet, tcpdump on the sending host confirms that the message is never received. The host therefore never learns the correct MTU and keeps sending packets larger than the path MTU with the DF bit set in the IP header. The next router drops each of these packets, and the connection hangs:
One solution is to manually set the correct MTU value on each and every host, but this can be very time consuming if it has to be done on hundreds of hosts. By allowing ICMP Type 3 Code 4 on the On-premise firewalls to reach the sending hosts, we can let PMTUD do the MTU signalling. Why not?
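The difference between the two subnets can be reproduced with a small, deliberately simplified Python simulation. It is only a sketch: the hop MTUs are the values from the setup described earlier, each hop is modeled as announcing its own MTU, IPSec overhead is ignored, and the names `forward` and `transfer` are hypothetical.

```python
HOP_MTUS = [9000, 1500, 1420]  # host link, On-premise Gateway, tunnel


def forward(packet_size, hop_mtus):
    """Forward a packet with the DF bit set.

    Returns None if the packet is delivered, otherwise the MTU of the hop
    that dropped it (the value carried in the ICMP Type 3 Code 4 message).
    """
    for hop_mtu in hop_mtus:
        if packet_size > hop_mtu:
            return hop_mtu
    return None


def transfer(icmp_allowed, host_mtu=9000, max_attempts=5):
    """Return (delivered, attempts, final packet size)."""
    size = host_mtu
    for attempt in range(1, max_attempts + 1):
        announced = forward(size, HOP_MTUS)
        if announced is None:
            return True, attempt, size
        if icmp_allowed:
            # The ICMP message reaches the sender, which shrinks its packets.
            size = announced
        # Otherwise the message is filtered and the sender simply
        # retransmits at the same size.
    return False, max_attempts, size


print(transfer(icmp_allowed=True))   # (True, 3, 1420)
print(transfer(icmp_allowed=False))  # (False, 5, 9000)
```

With ICMP allowed, the sender converges on a packet size that fits the path after a couple of drops; with ICMP filtered, every retransmission is dropped again, which is exactly the hang observed from Subnet2.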