Best Practices from Oracle Development's A‑Team

eBGP On-Premise to OCI fast failover detection

Kevin Miles
Network Solutions Architect

Overview

Customers often intend to move only a portion of their workloads to Oracle Cloud Infrastructure (OCI). The remaining workloads continue to operate on-premise, which leaves applications dependent on network connectivity to the OCI VCNs. Many of these applications were written to run on-premise or are very sensitive to network latency. Their criticality to the customer's business may also mean that little or no application downtime can be tolerated.

While the OCI FastConnect option provides predictable latency, it does not inherently address resiliency at the customer network edge. As a network best practice, we recommend redundant on-premise edge routing and connectivity. This means there should be a preferred primary path from on-premise to the OCI VCN, plus a backup path. It also means that when a link or router fails, the failover from the primary path to the backup path must complete within an acceptable delay.

Several factors affect how fast the network edge fails over to support the applications. This blog article focuses specifically on eBGP peering between the on-premise edge and OCI. The objective is to detect a failure of the primary path quickly so that routing can begin converging on the backup path. We will discuss three common options implemented in the BGP configuration.

 

1. BGP keepalive and hold timer tuning

The BGP hold time is negotiated through the OPEN messages exchanged when the peering session is first established. Each peer advertises its configured hold time, and the session uses the lower of the two values. Keepalive messages confirm that the session to the neighbor is still healthy. The keepalive interval is conventionally one third of the hold time, so a lower hold time also means more frequent keepalives. The hold timer resets each time a keepalive (or update) message is received from the neighbor. If no such message arrives before the hold timer expires, which at the usual 1:3 ratio corresponds to three consecutive missed keepalives, the peering session between the routers is closed.
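The negotiation described above can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's implementation: the names `SessionTimers` and `negotiate_timers` are made up for this example, and the keepalive is assumed to be exactly one third of the negotiated hold time (real routers may also apply a separately configured keepalive value).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SessionTimers:
    hold_time: int  # seconds
    keepalive: int  # seconds


def negotiate_timers(local_hold: int, peer_hold: int) -> SessionTimers:
    """Use the smaller advertised hold time; derive keepalive as one third of it."""
    for value in (local_hold, peer_hold):
        # A hold time must be 0 (timers disabled) or at least 3 seconds.
        if value != 0 and value < 3:
            raise ValueError(f"invalid hold time: {value}")
    hold = min(local_hold, peer_hold)
    # Assumption for this sketch: keepalive is always hold / 3.
    return SessionTimers(hold_time=hold, keepalive=hold // 3)


if __name__ == "__main__":
    # One peer configured for 180 s, the other for 90 s.
    print(negotiate_timers(180, 90))
```

With one side at 180 seconds and the other at 90 seconds, the session runs with a 90-second hold time and a 30-second keepalive. Lowering the hold time on only one peer is therefore enough to shorten failure detection for the session.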

These timer values may be decreased to improve routing convergence when the primary link fails. The default values depend on your particular vendor implementation. Some vendor defaults, along with the OCI defaults, are listed below as keepalive/hold time in seconds.

OCI – 60/180 (minimum 6/18 supported)

https://docs.cloud.oracle.com/en-us/iaas/Content/Network/Concepts/fastconnectrequirements.htm

Juniper – 30/90

Palo Alto – 30/90

Cisco – 60/180
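To put these defaults in perspective, the short Python sketch below estimates the hold time in effect when each on-premise vendor peers with OCI. It assumes the session uses the lower of the two hold times and that both sides are left at the defaults listed above. The `DEFAULTS` table and the `effective_hold` helper are illustrative names only.

```python
# Default (keepalive, hold) pairs in seconds, taken from the list above.
DEFAULTS = {
    "OCI": (60, 180),
    "Juniper": (30, 90),
    "Palo Alto": (30, 90),
    "Cisco": (60, 180),
}


def effective_hold(vendor: str, oci_hold: int = DEFAULTS["OCI"][1]) -> int:
    """Hold time in effect when `vendor` peers with OCI: the lower of the two."""
    _, vendor_hold = DEFAULTS[vendor]
    return min(vendor_hold, oci_hold)


if __name__ == "__main__":
    for vendor in ("Juniper", "Palo Alto", "Cisco"):
        hold = effective_hold(vendor)
        print(f"{vendor}: peer declared down after about {hold} s of silence")
```

With default timers, a silent failure can go undetected for 90 to 180 seconds. Calling `effective_hold("Cisco", oci_hold=18)` models tuning one side down to the OCI minimum, which brings that window down to 18 seconds.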

Below is a sample test bed using two Cisco routers peered together.

Timers set at 60 seconds for keepalive and 180 seconds for hold time

 


Next, we simulate a link failure by shutting down the interface.