Hybrid DNS in OCI

A common requirement is DNS name resolution between OCI and your on-prem environment, or between multiple OCI VCNs. How do we make this happen today? By using a hybrid DNS solution, of course!

What is a hybrid DNS solution? It’s essentially a DNS overlay on top of the built-in OCI VCN DNS. It would be great if this were built into OCI, but it’s not. Fortunately, it’s easy to set up and can be working in a short period of time.

Several resources on hybrid DNS in OCI are already available (example: https://github.com/terraform-providers/terraform-provider-oci/tree/master/docs/examples/networking/hybrid_dns) and walk through sample topologies and scenarios. However, I thought a bit more context and color around how hybrid DNS is designed and implemented would be helpful. Without further ado, let’s dive in…

Sample Scenario

To highlight some considerations and how this is done, a sample environment has been created.  The description is below, then we’ll dive into the testing and behavior of the solution.

Keep in mind that while I’m using DNSmasq, you could just as easily use BIND or any other modern DNS forwarder…

Cloud Topology

Our example consists of two VCNs in the same region, each with two DNSmasq forwarders and a single DBSystem:

Hybrid DNS Cloud Topology

The following DHCP Options are configured:

Subnet    Type                   Resolvers
dns1      VCN/Internet Resolver  N/A
dns2      VCN/Internet Resolver  N/A
servers1  Custom Resolver        192.168.0.2, 192.168.0.3
servers2  Custom Resolver        10.0.0.2, 10.0.0.3
db1       VCN/Internet Resolver  N/A
db2       VCN/Internet Resolver  N/A

A DBSystem (DBaaS) cannot be deployed on a subnet that does not have the VCN/Internet Resolver configured. I’m not sure if you *should*, but you can change this after the fact… 🙂

Logical Topology

Here’s how the logical topology is built:

Hybrid DNS Cloud Logical Topology

DNS Forwarding Topology

DNSmasq forwarders are configured to forward requests for the other VCN’s namespace to the two DNSmasq forwarders in that remote VCN (the VCNs are connected via local peering gateways, or LPGs):

Hybrid DNS Cloud DNS Topology

All other requests are handled by the local 169.254.169.254 resolver accessible within the VCN.

Each DNSmasq instance forwards only the remote VCN’s domain to the DNSmasq instances in the other VCN. For everything else, DNSmasq falls back by default to the name servers listed in /etc/resolv.conf, which should point to 169.254.169.254, so no further configuration is necessary to resolve other OCI or public FQDNs.
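
The forwarding decision described above can be modelled in a few lines of Python. This is only a rough sketch of the behavior, not DNSmasq’s actual implementation: the function name `pick_upstreams` and the `rules` dictionary are made up for illustration, and the longest-suffix preference reflects my understanding of how DNSmasq treats per-domain `server=` entries.

```python
# Illustrative model only; pick_upstreams and vcn1_rules are hypothetical names.
VCN_RESOLVER = "169.254.169.254"


def pick_upstreams(qname, rules, default=(VCN_RESOLVER,)):
    """Return the upstream servers a forwarder would use for qname.

    rules maps a domain (e.g. "vcn2.oraclevcn.com") to a list of forwarder
    IPs. The longest matching domain suffix wins; any other name goes to
    the default resolver.
    """
    name = qname.rstrip(".").lower()
    best_domain, best_servers = "", None
    for domain, servers in rules.items():
        suffix = domain.strip(".").lower()
        # Match the domain itself or any name beneath it (label boundary).
        if name == suffix or name.endswith("." + suffix):
            if best_servers is None or len(suffix) > len(best_domain):
                best_domain, best_servers = suffix, servers
    return list(best_servers) if best_servers is not None else list(default)


# Rules as seen from a DNSmasq instance in vcn1.
vcn1_rules = {"vcn2.oraclevcn.com": ["10.0.0.2", "10.0.0.3"]}

print(pick_upstreams("server2.servers2.vcn2.oraclevcn.com", vcn1_rules))
print(pick_upstreams("www.oracle.com", vcn1_rules))
```

Names under vcn2.oraclevcn.com go to the remote forwarders, while everything else (including names in the local VCN) lands on the VCN resolver.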

On the vcn1 DNSmasq instances, /etc/dnsmasq.conf contains:

server=/vcn2.oraclevcn.com./10.0.0.2
server=/vcn2.oraclevcn.com./10.0.0.3
cache-size=0

On the vcn2 DNSmasq instances, /etc/dnsmasq.conf contains:

server=/vcn1.oraclevcn.com./192.168.0.2
server=/vcn1.oraclevcn.com./192.168.0.3
cache-size=0
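
Since the two configurations differ only in the remote domain and the forwarder addresses, they could be generated from a single helper rather than maintained by hand. The following is a minimal sketch under that assumption; `render_dnsmasq_conf` is a hypothetical name and not part of the original setup.

```python
# Hypothetical helper that renders the dnsmasq.conf content shown above.
def render_dnsmasq_conf(remote_domain, forwarders, cache_size=0):
    """Render per-domain forwarding rules plus the cache-size setting."""
    domain = remote_domain.rstrip(".")
    # One server= line per remote forwarder, in the same form used above.
    lines = [f"server=/{domain}./{ip}" for ip in forwarders]
    lines.append(f"cache-size={cache_size}")
    return "\n".join(lines) + "\n"


# Config for the forwarders in vcn1 (which forward vcn2 names) and vice versa.
vcn1_conf = render_dnsmasq_conf("vcn2.oraclevcn.com", ["10.0.0.2", "10.0.0.3"])
vcn2_conf = render_dnsmasq_conf("vcn1.oraclevcn.com", ["192.168.0.2", "192.168.0.3"])

print(vcn1_conf)
print(vcn2_conf)
```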

All of this is handled by the cloud-init metadata passed to the instances at launch: the firewall rule (permitting udp/53) and the DNSmasq installation and configuration are fully automated. Set it and forget it! 😉

Here’s the cloud-init metadata passed to the DNSmasq instances in vcn1:

#cloud-config

write_files:
  # create dnsmasq config
  - path: /etc/dnsmasq.conf
    content: |
      server=/vcn2.oraclevcn.com./10.0.0.2
      server=/vcn2.oraclevcn.com./10.0.0.3
      cache-size=0

runcmd:
  # Run firewall commands to open DNS (udp/53)
  - firewall-offline-cmd --zone=public --add-port=53/udp
  # install dnsmasq package
  - yum install dnsmasq -y
  # enable dnsmasq process
  - systemctl enable dnsmasq
  # restart dnsmasq process
  - systemctl restart dnsmasq
  # restart firewalld
  - systemctl restart firewalld

The above content is saved to a file (in this case, dns1.tpl) and referenced in the Terraform instance resource definition:

resource "oci_core_instance" "dns1a" {
  ... <removed for brevity>
  metadata = {
    user_data = base64encode(file("dns1.tpl"))
  }
  ... <removed for brevity>
}
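
To see what actually ends up in `user_data`, here is a small Python sketch of the same encoding step: standard Base64 over the UTF-8 text of the cloud-config file. The function name `encode_user_data` and the shortened sample content are assumptions for illustration; in the real setup Terraform does this for you.

```python
import base64


# Hypothetical stand-in for Terraform's base64encode(file("dns1.tpl")).
def encode_user_data(cloud_config_text):
    """Base64-encode cloud-config text for use as instance user_data."""
    # cloud-init only treats the payload as cloud-config if it starts
    # with the "#cloud-config" header line.
    if not cloud_config_text.startswith("#cloud-config"):
        raise ValueError("first line must be '#cloud-config'")
    return base64.b64encode(cloud_config_text.encode("utf-8")).decode("ascii")


# Shortened sample; the real dns1.tpl holds the full content shown earlier.
sample = "#cloud-config\nruncmd:\n  - systemctl enable dnsmasq\n"
user_data = encode_user_data(sample)

# Decoding gives back the original text, which is handy when debugging
# what an instance really received.
print(base64.b64decode(user_data).decode("utf-8"))
```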

The same is done for the DNSmasq instances in vcn2, except that their DNSmasq configuration forwards vcn1.oraclevcn.com to 192.168.0.2 and 192.168.0.3, as shown earlier.

Testing

Once we’ve set up the above environment, what kind of behavior do we expect from it? Here are some common (and maybe uncommon) questions and scenarios that might surface with this kind of solution…

Can I resolve FQDNs for subnets using custom resolvers?

Yes! We’ll test by resolving server2’s FQDN from server1 (both of which are using custom resolvers).

First off, let’s look at the config of `/etc/resolv.conf` on server1:

[opc@server1 ~]$ cat /etc/resolv.conf
; generated by /usr/sbin/dhclient-script
search servers1.oraclevcn.com
nameserver 192.168.0.2
nameserver 192.168.0.3
[opc@server1 ~]$

It’s clearly using the custom resolvers we set up.
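
If you’d rather check this across many instances than eyeball it, the same inspection can be scripted. Below is a minimal sketch that parses resolv.conf text and reports whether the VCN resolver is configured directly; `parse_resolv_conf` is a hypothetical helper, and the sample text is the output shown above.

```python
# Hypothetical helper for inspecting resolv.conf content.
def parse_resolv_conf(text):
    """Extract nameserver and search entries from resolv.conf text."""
    nameservers, search = [], []
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and comments (";" or "#").
        if not line or line[0] in ";#":
            continue
        fields = line.split()
        if fields[0] == "nameserver" and len(fields) >= 2:
            nameservers.append(fields[1])
        elif fields[0] == "search":
            search = fields[1:]
    return {"nameservers": nameservers, "search": search}


sample = """\
; generated by /usr/sbin/dhclient-script
search servers1.oraclevcn.com
nameserver 192.168.0.2
nameserver 192.168.0.3
"""

config = parse_resolv_conf(sample)
# True when the instance does not query the VCN resolver directly.
uses_custom = "169.254.169.254" not in config["nameservers"]
print(config, uses_custom)
```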

Back to our testing… server1 can resolve server2.servers2.vcn2.oraclevcn.com:

[opc@server1 ~]$ dig server2.servers2.vcn2.oraclevcn.com

; <<>> DiG 9.9.4-RedHat-9.9.4-61.el7 <<>> server2.servers2.vcn2.oraclevcn.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 40716
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 1, ADDITIONAL: 2

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;server2.servers2.vcn2.oraclevcn.com. IN A

;; ANSWER SECTION:
server2.servers2.vcn2.oraclevcn.com. 300 IN A 10.0.0.18

;; AUTHORITY SECTION:
servers2.vcn2.oraclevcn.com. 86400 IN NS vcn-dns.oraclevcn.com.

;; ADDITIONAL SECTION:
vcn-dns.oraclevcn.com. 86138 IN A 169.254.169.254

;; Query time: 3 msec
;; SERVER: 192.168.0.2#53(192.168.0.2)
;; WHEN: Wed Sep 26 22:04:34 GMT 2018
;; MSG SIZE rcvd: 118

[opc@server1 ~]$

We’re also able to resolve public FQDNs from server1:

[opc@server1 ~]$ dig www.oracle.com

; <<>> DiG 9.9.4-RedHat-9.9.4-61.el7 <<>> www.oracle.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 58598
;; flags: qr rd ra; QUERY: 1, ANSWER: 3, AUTHORITY: 13, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;www.oracle.com. IN A

;; ANSWER SECTION:
www.oracle.com. 290 IN CNAME ds-www.oracle.com.edgekey.net.
ds-www.oracle.com.edgekey.net. 280 IN CNAME e870.dscx.akamaiedge.net.
e870.dscx.akamaiedge.net. 10 IN A 23.35.173.32

;; AUTHORITY SECTION:
. 517378 IN NS d.root-servers.net.
. 517378 IN NS b.root-servers.net.
. 517378 IN NS e.root-servers.net.
. 517378 IN NS c.root-servers.net.
. 517378 IN NS a.root-servers.net.
. 517378 IN NS g.root-servers.net.
. 517378 IN NS i.root-servers.net.
. 517378 IN NS m.root-servers.net.
. 517378 IN NS h.root-servers.net.
. 517378 IN NS k.root-servers.net.
. 517378 IN NS j.root-servers.net.
. 517378 IN NS l.root-servers.net.
. 517378 IN NS f.root-servers.net.

;; Query time: 6 msec
;; SERVER: 192.168.0.2#53(192.168.0.2)
;; WHEN: Wed Sep 26 22:05:35 GMT 2018
;; MSG SIZE rcvd: 345

[opc@server1 ~]$

Can I use FQDNs for replication between DBaaS/DBsystems?

THIS IS PROVIDED ONLY AS FOOD FOR THOUGHT AND SHOULD NOT BE DONE, AS USING A NON-SUPPORTED DHCP OPTION COULD HAVE LONG-TERM NEGATIVE REPERCUSSIONS ON YOUR DBAAS/DBSYSTEM.

Maybe?  🙂

We’re restricted from deploying DBaaS/DBSystem to a subnet that is using custom DNS resolvers (only VCN/Internet Resolver is supported).

Of course, after the fact, there’s nothing stopping you from changing the DHCP Options for the DBaaS subnet and restarting the node.

Here’s what `/etc/resolv.conf` looks like before we change the DNS resolvers for dbaas1’s subnet (db1):

[opc@dbaas1 ~]$ cat /etc/resolv.conf
; generated by /sbin/dhclient-script
search db1.oraclevcn.com
nameserver 169.254.169.254
[opc@dbaas1 ~]$

Next, the db1 subnet’s DHCP Options are changed to a custom resolver (pointing to 192.168.0.2 and 192.168.0.3) and the DBSystem node is restarted.

Here’s what `/etc/resolv.conf` looks like on dbaas1 afterwards:

[opc@dbaas1 ~]$ cat /etc/resolv.conf
; generated by /sbin/dhclient-script
search db1.oraclevcn.com
nameserver 192.168.0.2
nameserver 192.168.0.3
[opc@dbaas1 ~]$

From dbaas1, let’s try resolving dbaas2.db2.vcn2.oraclevcn.com:

[opc@dbaas1 ~]$ dig dbaas2.db2.vcn2.oraclevcn.com

; <<>> DiG 9.8.2rc1-RedHat-9.8.2-0.62.rc1.el6_9.5 <<>> dbaas2.db2.vcn2.oraclevcn.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 1675
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 1, ADDITIONAL: 1

;; QUESTION SECTION:
;dbaas2.db2.vcn2.oraclevcn.com. IN A

;; ANSWER SECTION:
dbaas2.db2.vcn2.oraclevcn.com. 300 IN A 10.0.0.34

;; AUTHORITY SECTION:
db2.vcn2.oraclevcn.com. 86400 IN NS vcn-dns.oraclevcn.com.

;; ADDITIONAL SECTION:
vcn-dns.oraclevcn.com. 85738 IN A 169.254.169.254

;; Query time: 4 msec
;; SERVER: 192.168.0.2#53(192.168.0.2)
;; WHEN: Wed Sep 26 22:15:03 2018
;; MSG SIZE rcvd: 101

[opc@dbaas1 ~]$

It works! Voilà! Imagine needing to reach a SCAN listener by FQDN from a node in the other VCN. Name resolution between VCNs clearly works for DBaaS/DBSystem just as it does for other OCI instances in this scenario.

Now for full disclosure, this probably shouldn’t be done. I am not sure whether this change has any long-term negative impact on the DBSystem. Maybe some management, patching or other functionality will break? I suspect not, as all FQDN requests (except those in the vcn2.oraclevcn.com namespace) are still resolved by the VCN/Internet resolver (169.254.169.254), just via an extra hop (our local DNSmasq forwarders). With that disclosure and warning out of the way (don’t do this; it’s only an example of what not to do), you can make your own decision. The point is that FQDN resolution between VCNs is achievable on DBaaS/DBSystem.

Why can I not resolve FQDNs that are within my VCN from the subnets using custom resolvers?

Great observation! This is normal behavior on OCI. Using a custom resolver for a subnet effectively turns on a “DNS firewall” (my own term, for lack of a better phrase): when the VCN resolver (169.254.169.254) is queried from a subnet configured to use custom resolvers, only public FQDNs are resolvable.

Let’s see this in action. On server1, let’s try to query dbaas1.db1.vcn1.oraclevcn.com:

[opc@server1 ~]$ dig dbaas1.db1.vcn1.oraclevcn.com

; <<>> DiG 9.9.4-RedHat-9.9.4-61.el7 <<>> dbaas1.db1.vcn1.oraclevcn.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 27902
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 1, ADDITIONAL: 2

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;dbaas1.db1.vcn1.oraclevcn.com. IN A

;; ANSWER SECTION:
dbaas1.db1.vcn1.oraclevcn.com. 300 IN A 192.168.0.34

;; AUTHORITY SECTION:
db1.vcn1.oraclevcn.com. 86400 IN NS vcn-dns.oraclevcn.com.

;; ADDITIONAL SECTION:
vcn-dns.oraclevcn.com. 85032 IN A 169.254.169.254

;; Query time: 4 msec
;; SERVER: 192.168.0.2#53(192.168.0.2)
;; WHEN: Wed Sep 26 22:31:37 GMT 2018
;; MSG SIZE rcvd: 112

[opc@server1 ~]$

It works! No surprises here… but wait, this is using our DNSmasq forwarders, right? Sure enough, it is (note the SERVER line: 192.168.0.2). So let’s manually specify the VCN resolver and retry:

[opc@server1 ~]$ dig @169.254.169.254 dbaas1.db1.vcn1.oraclevcn.com

; <<>> DiG 9.9.4-RedHat-9.9.4-61.el7 <<>> @169.254.169.254 dbaas1.db1.vcn1.oraclevcn.com
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 48868
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;dbaas1.db1.vcn1.oraclevcn.com. IN A

;; Query time: 166 msec
;; SERVER: 169.254.169.254#53(169.254.169.254)
;; WHEN: Wed Sep 26 22:31:28 GMT 2018
;; MSG SIZE rcvd: 58

[opc@server1 ~]$

Oh, there it is: SERVFAIL. That’s the behavior we were expecting (or, if we didn’t know about it, the behavior that would have surprised us).

Let’s double-check the assumption that public FQDNs are resolvable by the VCN resolver:

[opc@server1 ~]$ dig @169.254.169.254 www.oracle.com

; <<>> DiG 9.9.4-RedHat-9.9.4-61.el7 <<>> @169.254.169.254 www.oracle.com
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 57881
;; flags: qr rd ra; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;www.oracle.com. IN A

;; ANSWER SECTION:
www.oracle.com. 215 IN CNAME ds-www.oracle.com.edgekey.net.
ds-www.oracle.com.edgekey.net. 215 IN CNAME e870.dscx.akamaiedge.net.
e870.dscx.akamaiedge.net. 20 IN A 23.35.173.32

;; Query time: 77 msec
;; SERVER: 169.254.169.254#53(169.254.169.254)
;; WHEN: Wed Sep 26 22:33:15 GMT 2018
;; MSG SIZE rcvd: 137

[opc@server1 ~]$

Yes indeed… it’s working!

Rule of thumb: any name resolution in a subnet using custom DNS resolvers should go through those resolvers. The VCN resolver (169.254.169.254) is still reachable from these subnets, but querying it directly will not resolve names in the larger VCN DNS namespace; only public FQDNs will resolve.
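
The dig comparison above can also be scripted if you want to check the status from several subnets. What follows is a rough sketch using only the Python standard library and the basic DNS wire format (RFC 1035). All function names are hypothetical, and it deliberately skips EDNS, TCP fallback and response-ID validation, so treat it as an illustration rather than a replacement for dig.

```python
import socket
import struct

# Common response codes carried in the low 4 bits of the DNS header flags.
RCODES = {0: "NOERROR", 1: "FORMERR", 2: "SERVFAIL",
          3: "NXDOMAIN", 4: "NOTIMP", 5: "REFUSED"}


def build_query(qname, query_id=0x1234):
    """Build a minimal DNS query packet for an A record."""
    # Header: ID, flags (RD set), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Name: each label prefixed by its length, terminated by a zero byte.
    labels = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in qname.rstrip(".").split(".")
    )
    question = labels + b"\x00" + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN
    return header + question


def response_status(packet):
    """Return the status name encoded in a DNS response header."""
    if len(packet) < 12:
        raise ValueError("a DNS header is 12 bytes long")
    flags = struct.unpack("!H", packet[2:4])[0]
    rcode = flags & 0x000F
    return RCODES.get(rcode, "RCODE%d" % rcode)


def query_status(server, qname, timeout=2.0):
    """Send the query over UDP to server:53 and return the status name."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(qname), (server, 53))
        packet, _ = sock.recvfrom(512)
    return response_status(packet)
```

On server1, calling `query_status("169.254.169.254", "dbaas1.db1.vcn1.oraclevcn.com")` and then the same name against `192.168.0.2` would be the scripted equivalent of the two dig runs shown earlier.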

Conclusion

While a hybrid DNS scenario might not be your ideal picture of how this should work, it does work quite well, especially as it’s the best way to get name resolution between on-prem and OCI or between multiple OCI VCNs.

Hopefully the above sheds a bit more light and context on this common need and will be helpful as you work to architect, implement and manage applications on OCI!
