eBGP On-Premise to OCI fast failover detection

Overview

Customers often move only a portion of their workloads to Oracle Cloud Infrastructure (OCI). The remaining workloads continue to run on-premises, which leaves applications dependent on network connectivity to the OCI VCNs. Many of these applications were written to run on-premises or are very sensitive to network latency. Their criticality to the customer’s business may also demand little or no application downtime.

While OCI FastConnect provides predictable latency, it does not inherently address resiliency at the customer network edge. As a network best practice, we recommend redundant on-premises edge routing and connectivity. This means there should be a preferred primary path from on-premises to the OCI VCN as well as a backup path, and that when a link or router fails, the delay in failing over from the primary path to the backup path must be acceptable.

Several factors affect how fast the network edge fails over. This blog article focuses specifically on eBGP peering between the on-premises edge and OCI. The objective is to detect a failure of the primary path quickly so that routing convergence to the backup path can begin. We will discuss three common options implemented in the BGP configuration.

BGP keepalive and hold timer tuning

The BGP hold time is negotiated through the OPEN messages exchanged when the peering session is first established. Each peer proposes a hold time, and the session uses the lower of the two values. The keepalive interval is not carried in the OPEN message; each router derives it locally, and it is typically one third of the negotiated hold time (a common default rather than a strict requirement). Keepalive messages monitor the health of the session to the neighbor. The hold timer resets each time a keepalive or update message is received from the neighbor. If no such message arrives before the hold timer expires, which at the usual ratio corresponds to about three missed keepalives, the peering session between the routers is closed.
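The negotiation described above can be sketched in a few lines of Python. This is a minimal illustration, not vendor code: the function name `negotiate_timers` is hypothetical, and the one-third keepalive ratio is assumed because it is the common default, even though implementations are free to choose differently.

```python
def negotiate_timers(local_hold: int, peer_hold: int) -> tuple[int, int]:
    """Return (hold_time, keepalive) in seconds for a BGP session.

    The session uses the smaller of the two proposed hold times.
    A hold time of 0 disables keepalives; values of 1 and 2 are invalid.
    The one-third keepalive ratio is an assumed common default.
    """
    hold = min(local_hold, peer_hold)
    if hold == 0:
        return 0, 0  # timers disabled, no keepalives are sent
    if hold < 3:
        raise ValueError("hold time must be 0 or at least 3 seconds")
    return hold, hold // 3


# One side proposes 18 seconds, the other 180 seconds.
print(negotiate_timers(18, 180))  # (18, 6)
print(negotiate_timers(180, 90))  # (90, 30)
```

Because the lower value wins, lowering the hold time on only one side of the peering is enough to shorten it for the whole session, provided the other side accepts the value.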

Decreasing these timer values can improve routing convergence when the primary link fails. The default values depend on your vendor’s implementation. Some vendor defaults and the OCI default are listed below as keepalive/hold time in seconds.

OCI – 60/180 (minimum 6/18 supported)

https://docs.cloud.oracle.com/en-us/iaas/Content/Network/Concepts/fastconnectrequirements.htm

Juniper – 30/90

Palo Alto – 30/90

Cisco – 60/180

Below we have a sample test bed using 2 Cisco routers peered together

Timers set at 60 seconds for keepalive and 180 seconds for hold time

Next we simulate a link failure by shutting down the interface

The BGP adjacency hold time expires about 175 seconds after the link is shut down

Next we restore the link and set the timers to the minimum value supported by OCI

The keepalive and hold timers have been reduced to 6 seconds and 18 seconds respectively

The hold time expires and brings the BGP adjacency down 11 seconds after the link is shut down
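The measured times can be compared against a simple idealized model, sketched below in Python. The function name is hypothetical, and the model assumes that the hold timer was last reset at some point between zero and one keepalive interval before the failure. Real measurements may land slightly outside this window because of log timestamp granularity and implementation details.

```python
def hold_timer_detection_window(keepalive: int, hold: int) -> tuple[int, int]:
    """Return (fastest, slowest) failure detection delay in seconds.

    Idealized model: the hold timer was last reset between 0 and
    `keepalive` seconds before the failure, so it still has between
    hold - keepalive and hold seconds left to run.
    """
    if keepalive <= 0 or hold <= keepalive:
        raise ValueError("expected 0 < keepalive < hold")
    return hold - keepalive, hold


print(hold_timer_detection_window(60, 180))  # (120, 180)
print(hold_timer_detection_window(6, 18))    # (12, 18)
```

Under this model the 175 seconds observed with 60/180 timers falls inside the 120 to 180 second window, and the 11 seconds observed with 6/18 timers sits close to the lower edge of the 12 to 18 second window.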

Link-based failure detection

Another approach is to avoid depending on timer expiry altogether by triggering on the state of the underlying link used to reach the peer. The fast external fallover feature closes the peering session immediately when the link goes down. This method works only with directly connected peers. Several vendors (e.g., Cisco and Juniper) offer it as an option, and on Cisco it is enabled by default at the BGP process level.

With default BGP Fast-external-fallover the BGP adjacency is brought down immediately after the link is shut down

Bidirectional Forwarding Detection (BFD)

OCI has enhanced its FastConnect service with Bidirectional Forwarding Detection (BFD), a feature designed to improve network reliability and reduce downtime. BFD continuously monitors the health and availability of the network path between the on-premises environment and OCI so that link failures are detected rapidly. With BFD, a FastConnect can identify and respond to connectivity issues in less than a second at the OCI timer values, minimizing the impact on mission-critical applications. This strengthens the resilience of hybrid cloud architectures and gives customers greater confidence in maintaining data transfers between their on-premises infrastructure and Oracle Cloud.

The BFD timers in OCI use a fixed interval of 300 ms and a detection multiplier of three, which gives a detection time of about 900 ms. These values can’t be modified in OCI; however, you can negotiate higher timers from the CPE side of the BFD configuration.
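To see how CPE settings change the detection time, here is a minimal Python sketch of the asynchronous-mode calculation defined in RFC 5880. The function and parameter names are hypothetical, the 1000 ms CPE value is an illustrative assumption, and the sketch assumes that OCI uses 300 ms for both its transmit and receive intervals.

```python
def bfd_detection_time_ms(remote_detect_mult: int,
                          remote_desired_min_tx_ms: int,
                          local_required_min_rx_ms: int) -> int:
    """Detection time as seen by the local system, in milliseconds.

    Asynchronous mode: the remote system transmits at the slower of
    its own desired TX interval and the local required RX interval.
    The local system declares the session down after
    `remote_detect_mult` such intervals pass with no packet received.
    """
    interval_ms = max(remote_desired_min_tx_ms, local_required_min_rx_ms)
    return remote_detect_mult * interval_ms


# CPE matches the OCI values (300 ms, multiplier 3).
print(bfd_detection_time_ms(3, 300, 300))   # 900

# Hypothetical CPE that only accepts packets every 1000 ms:
# OCI slows its transmit rate, so the CPE detects failures more slowly.
print(bfd_detection_time_ms(3, 300, 1000))  # 3000
```

The negotiated interval can only be raised above 300 ms, never lowered, so 900 ms is the fastest detection time available under these assumptions.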

BFD can be enabled during the initial provisioning of the FastConnect virtual circuit or afterwards. Note that BFD can be enabled on an existing FastConnect before the CPE side is configured without affecting its availability. Potential disruptions from BFD can occur only once the BFD session is established.

Create FC with BFD

Status of a FastConnect with BFD enabled:

BFD enabled

Solution summary

As demonstrated in this blog, the BGP routing convergence process can be influenced in a number of ways. The approach selected for this common issue depends on several factors:

Router/Firewall vendor support

Impact to router/firewall resource

Level of sensitivity of applications to a network link failure event


Shawn Moore

Principal Cloud Network Architect

Kevin Miles

Network Solutions Architect
