This article continues our discussion of Multi-master DNS Servers in VitalQIP and highlights some of the problems and concerns with running multi-master DNS under VitalQIP.

Issue #1 - Dynamic DNS

A BIND DNS server does not use the same replication mechanism for Dynamic DNS (DDNS) updates as Microsoft Active Directory-integrated DNS servers do. We could write an entire article on DDNS, but that is outside the scope of this one.

There are two main strategies for implementing Dynamic DNS updates when using VitalQIP:

  • qip-dnsupdated - allow the VitalQIP product to propagate DDNS Updates to all authoritative servers (master(s) and slaves)
  • notify - allow the VitalQIP product to propagate DDNS Updates to a single authoritative master and rely on BIND DNS NOTIFY and incremental zone transfers (IXFR) from the master to its slaves

The reasons why a multi-master DNS should not be used on Dynamic DNS zones are:

  1. DDNS Updates are sent over UDP, so their delivery is not guaranteed. This means that there is no mechanism for keeping the masters in lock-step between DNS File Generation cycles.
  2. DDNS Updates could cause significant "thrashing" of NOTIFY and IXFR packets carrying inconsistent data.

If we know that the zones are static and do not accept DDNS Updates, then we know that the data is not changing between DNS File Generation cycles, and keeping the masters in sync is more predictable. We also know from our prior article that the serial numbers of the two masters should differ by only two during and after DNS File Generations. For a zone that is dynamically updatable, we cannot guarantee that DDNS Updates reach both servers, because UDP is best effort. Additional monitoring and visibility would be required to see how "in sync" the masters are with one another between pushes. For this reason, we don't recommend using multi-master DNS with dynamically updatable zones. When using VitalQIP to manage dynamic DNS updates, it is best to configure a single master nameserver and configure the rest of the authoritative nameservers as slaves to that server.
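The kind of monitoring mentioned above can be sketched as a simple serial-number comparison. This is a minimal illustration, not a VitalQIP feature: the function name, the zone names, and the serial values are all hypothetical, and it assumes the SOA serials have already been collected from each master by some other means. It also ignores serial number wraparound.

```python
from typing import Dict, List, Tuple

# From the article: after a push, the two masters should differ by at most 2.
EXPECTED_MAX_GAP = 2


def drifted_zones(
    ns1_serials: Dict[str, int],
    ns2_serials: Dict[str, int],
    max_gap: int = EXPECTED_MAX_GAP,
) -> List[Tuple[str, int, int]]:
    """Return (zone, ns1_serial, ns2_serial) for zones whose serial gap exceeds max_gap.

    Only zones present on both masters are compared. Wraparound is ignored.
    """
    drifted = []
    for zone in sorted(set(ns1_serials) & set(ns2_serials)):
        a, b = ns1_serials[zone], ns2_serials[zone]
        if abs(a - b) > max_gap:
            drifted.append((zone, a, b))
    return drifted


# Hypothetical serials: the static zone is within the expected gap,
# while the dynamic zone has drifted because updates reached only one master.
ns1 = {"acme.com": 100, "dyn.acme.com": 240}
ns2 = {"acme.com": 102, "dyn.acme.com": 257}
for zone, a, b in drifted_zones(ns1, ns2):
    print(f"{zone}: ns1={a} ns2={b} gap={abs(a - b)}")
```

A gap larger than two between pushes suggests that dynamic updates have been applied unevenly, which is exactly the situation a single-master design avoids.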

Issue #2 - Loss of one or more of the master DNS nameservers

Given our configuration of two masters for the same zone, ns1.acme.com and ns2.acme.com, we have three different failure scenarios: we could lose ns1, ns2, or both masters.

Failure of ns1.acme.com

Assuming that our servers are "in sync" and we perform a DNS Files Generation to ns1 followed by ns2, we know that the data should be the same, except that the serial numbers on ns2 will be 2 greater than those on ns1 for each zone. This means that our slaves will be actively pulling from ns2.acme.com. Suppose ns1.acme.com were to fail or be taken offline. Zone transfers from ns2.acme.com to our slaves would continue normally. This is shown in the figure on the right.

When ns1.acme.com is restored, we recommend performing a DNS Files Generation to "re-sync" its data with the VitalQIP database. If we push to both servers, ns2.acme.com will have identical data, but its serial number for each zone will be 2 greater than that of ns1.acme.com. Slaves will continue to perform zone transfers from ns2.acme.com.

Failure of ns2.acme.com

If ns2.acme.com were to fail or be taken offline after all the slaves had refreshed their data, then we'd have a slightly different scenario. Let's assume the serial numbers for acme.com on both masters and on all of our slaves are as follows:

Server          Serial # for acme.com
ns1.acme.com    100
ns2.acme.com    102
ALL SLAVES      102

It would be desirable for all the slaves to start performing zone transfers from ns1.acme.com. But this won't occur naturally, because the serial number of acme.com on ns1 is 100, which is lower than the serial number on all our slaves. The slaves won't pull from ns1 because they consider a lower serial number to be out of date or stale. If ns2.acme.com is down for a prolonged period that extends past the zone's expiration time, the slaves will expire the zone and stop answering queries for it.
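The slave's decision can be sketched with the serial number comparison defined in RFC 1982, which is what lets 32-bit SOA serials wrap around. This is an illustrative model only, not BIND's source code, and the function names are hypothetical.

```python
SERIAL_BITS = 32
HALF_RANGE = 2 ** (SERIAL_BITS - 1)  # 2**31, per RFC 1982


def serial_gt(s1: int, s2: int) -> bool:
    """Return True if serial s1 is newer than s2 under RFC 1982 arithmetic."""
    return (s1 < s2 and s2 - s1 > HALF_RANGE) or (s1 > s2 and s1 - s2 < HALF_RANGE)


def slave_would_transfer(master_serial: int, slave_serial: int) -> bool:
    """A slave refreshes only when the master's serial is newer than its own."""
    return serial_gt(master_serial, slave_serial)


# Serial numbers from the table above.
print(slave_would_transfer(100, 102))  # ns1 looks stale to the slaves
print(slave_would_transfer(104, 102))  # a newer serial would be accepted
```

With the values from the table, the first call returns False and the second returns True: ns1 at 100 is ignored by slaves holding 102 until its serial is pushed past theirs.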

So, what's required when ns2 fails?

There are two possible fixes to this problem:

  1. Run rndc retransfer for the zone (for example, rndc retransfer acme.com) on every slave to force it to pull the zone from ns1.acme.com, even though the serial number there is lower and the data is "out-of-date".
  2. Perform at least one (preferably two) DNS Files Generations from VitalQIP to ns1.acme.com so that its serial number for acme.com is incremented to a number greater than 102.
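For the first fix, a short helper can generate the command to run against each slave. This is a sketch under stated assumptions: the slave hostnames are hypothetical, and it assumes each slave's rndc control channel is reachable from the admin host with a working key, which may not hold in your environment. The helper only builds the command lines; it does not execute anything.

```python
from typing import List


def retransfer_commands(slaves: List[str], zone: str) -> List[List[str]]:
    """Build one 'rndc -s <slave> retransfer <zone>' command per slave.

    Nothing is executed here; pass each list to subprocess.run() if desired.
    """
    return [["rndc", "-s", slave, "retransfer", zone] for slave in slaves]


# Hypothetical slave names, for illustration only.
for cmd in retransfer_commands(["ns3.acme.com", "ns4.acme.com"], "acme.com"):
    print(" ".join(cmd))
```

Printing the commands first makes it easy to review them before running them by hand or from a script.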

Why two (2) pushes?

If you recall from our prior article, Multi-master DNS Servers in VitalQIP, we explained that serial numbers are derived by taking the highest active serial number and incrementing it by 2. The first push will increment ns1's serial number from 100 to 102, which is still not greater than the serial number on the slaves. The second DNS File Generation increments it to 104. The slaves then begin performing zone transfers from ns1 because its serial number is now greater than theirs.
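The arithmetic can be checked with a few lines of code. This is a simplified model that assumes each push raises the surviving master's serial by exactly 2, as described above. The function name is hypothetical, and serial wraparound is ignored.

```python
SERIAL_STEP = 2  # from the article: each generation increments the serial by 2


def pushes_needed(master_serial: int, slave_serial: int, step: int = SERIAL_STEP) -> int:
    """Count the pushes required before the master's serial exceeds the slaves'."""
    pushes = 0
    while master_serial <= slave_serial:
        master_serial += step
        pushes += 1
    return pushes


# ns1 is at 100 and the slaves hold 102: 100 -> 102 -> 104.
print(pushes_needed(100, 102))
```

For the example values this returns 2, matching the walk-through: the first push only ties the slaves' serial, and the second exceeds it.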

Failure of both ns1 and ns2

In the unlikely event that both ns1 and ns2 fail or are taken offline, the slaves have no source from which to perform zone transfers. You have until the zone expiration time for acme.com (set in the SOA record) to bring ns1 and/or ns2 back online before the slaves expire the zone and stop answering queries for acme.com.
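To make that window concrete, the deadline can be estimated from the SOA expire value and the time of the slave's last successful refresh. This is a minimal sketch: the function name, the timestamp, and the one-week expire value are hypothetical, so read the real expire field from your zone's SOA record.

```python
from datetime import datetime, timedelta, timezone


def zone_expiry_deadline(last_successful_refresh: datetime, soa_expire_seconds: int) -> datetime:
    """Return when a slave stops answering for the zone if no master is reachable.

    The expire timer runs from the slave's last successful refresh.
    """
    return last_successful_refresh + timedelta(seconds=soa_expire_seconds)


# Hypothetical values: last refresh at midnight UTC, expire of 604800 s (7 days).
last_refresh = datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc)
print(zone_expiry_deadline(last_refresh, 604800).isoformat())
```

Note that the clock starts at each slave's last successful refresh, not at the moment the masters failed, so the usable window can be shorter than the full expire interval.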

Conclusion

In conclusion, we don't recommend running multi-master DNS configurations in VitalQIP because of the added complexity and the lack of a native replication mechanism to guarantee full synchronization of data and serial numbers between master nameservers. It's better to configure VitalQIP to use a single master with multiple slaves to that master, and to focus on monitoring the master-slave relationship so that you can detect when a master is non-functional and zone transfers are failing. Additionally, High Availability (HA) and Anycast DNS are proven technologies that can improve the resiliency of your DNS better than multi-master DNS can.