by Bill Manning
The Domain Name System (DNS) specification calls for the use of caching. Caching is expected to improve the overall responsiveness of the system by ensuring that answers to questions are known and stored locally and that the query load placed on the authoritative servers is minimized. Certain presumptions associated with caches may no longer hold. This article looks at some of these presumptions and explores some of the problems that emerge when they are violated. Based on our observations, we offer some recommendations on DNS cache best practices and present the results of testing those practices.
A DNS resolver can no longer trust the data it gets, because the data generally comes from nonauthoritative nodes or caches operated by third parties, most of whom have no vested interest in providing accurate data. Removing or bypassing caching in the DNS and going directly to the authoritative servers is widely considered fatally flawed, because authoritative servers are presumed to have neither the bandwidth nor the processing power to accommodate the perceived demand from a cacheless service. This article looks at the bandwidth and processing capabilities of modern authoritative servers to ascertain the validity of these presumptions. We start by looking briefly at the DNS.
The DNS namespace is made visible and useful by nodes that publish authoritative information about the namespace and by resolvers that send queries about the namespace to these servers. As an optimization, other nodes may act as intermediaries or proxies for the authoritative servers on behalf of one or many resolvers. These intermediate nodes are called caching nameservers or iterative-mode resolvers. This flow is shown in Figure 1.
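To make the flow concrete, the following sketch builds the wire-format query a stub resolver would hand to its cache. It is a minimal illustration of the RFC 1035 message format, assuming UDP transport; the query name is hypothetical, and a real resolver adds retries, TCP fallback, and response matching.

```python
import struct

def build_query(qname, qtype=1, qid=0x1234):
    """Build a minimal DNS query packet (RFC 1035 wire format).

    qtype=1 is an A-record query. The RD flag asks a caching
    server to recurse; an authoritative server ignores it.
    """
    header = struct.pack(">HHHHHH",
                         qid,         # transaction ID
                         0x0100,      # flags: standard query, RD=1
                         1, 0, 0, 0)  # QDCOUNT=1, AN/NS/AR counts=0
    # Encode the name as length-prefixed labels, ending at the root.
    question = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in qname.split(".")
    ) + b"\x00"
    question += struct.pack(">HH", qtype, 1)  # QTYPE, QCLASS=IN
    return header + question

# A stub resolver sends this datagram over UDP port 53 to its
# configured cache, which iterates toward the authoritative
# servers on the resolver's behalf.
pkt = build_query("example.com")
```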
Several assumptions about the use and placement of caches have been questioned recently. The simplest is one of placement. A cache works best when the Round-Trip Time (RTT) between the resolver and the cache is low. Historically, a cache was placed at traffic aggregation points such as an Internet Service Provider (ISP) operating a cache for its clients. With increased mobility of nodes, this presumption is no longer as firm. There are reported cases where resolvers continue to use caches 300 ms away, while an authoritative server is 15 ms away. So if the intent is to reduce network bandwidth, a cache that presumes all its client resolvers are "local" may be operating on a mistaken premise.
Fixing a resolver to a specific cache does have the benefit of being tied to a known business relationship; for example, using your ISP's caching service. In contrast, mobile nodes often get an IP address from a provider's Dynamic Host Configuration Protocol (DHCP) servers, which also hand out the addresses of more "local" caching servers for the mobile node to use.
This scenario would be fine as long as the DNS namespace were in fact a coherent, single space. Unfortunately, it is not. So-called Walled-Garden networks that have their own versions of the DNS namespace have been and remain common. In the Internet, there are more and more alternate root hierarchies that diverge from what most think of as "the" root namespace in either subtle or wildly divergent ways. To date, there is no deployed way for a resolver to determine the origin of the data stored in a cache. A resolver thus has no way, other than verifying the data itself, to know that the locally assigned cache is in fact using the desired namespace. This situation is one important reason for going back to a well-known cache, even if it is topologically remote. But even that trust may no longer be warranted.
ISPs and even some caching service providers are starting to manipulate caches as a means to monetize their operations. Numerous techniques are in use, from the nominally benign method of using wildcards to more insidious capture and rewriting of NXDOMAIN replies, to outright intentional cache pollution.
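One way a resolver can spot NXDOMAIN rewriting is to query a name that should not exist and inspect the reply's response code and answer count. The sketch below uses hand-built example headers rather than live responses; it illustrates the heuristic, not a complete detector.

```python
import struct

def looks_rewritten(response):
    """Heuristic check of a DNS response for NXDOMAIN rewriting.

    A reply for a name that should not exist ought to carry
    RCODE 3 (NXDOMAIN) and no answer records; RCODE 0 with
    answers present suggests the cache substituted data.
    """
    flags, = struct.unpack(">H", response[2:4])
    ancount, = struct.unpack(">H", response[6:8])
    rcode = flags & 0x000F
    return rcode == 0 and ancount > 0

# Header of an honest reply: QR/RD/RA set, RCODE=3 (NXDOMAIN),
# one question, zero answers (ID 0x1234; later sections elided).
honest = struct.pack(">HHHHHH", 0x1234, 0x8183, 1, 0, 0, 0)
# A rewritten reply: RCODE=0 and a fabricated answer record.
rewritten = struct.pack(">HHHHHH", 0x1234, 0x8180, 1, 1, 0, 0)
```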
In this climate, a resolver should choose its cache carefully. We argue that it is reasonable, in many of today's environments, to place the cache within 1 ms of the resolver; for example, run a cache on the local node. This argument extends the assertion that caches are effective for client populations of about 10 or fewer.
This technique has the added advantage of reducing the "attack surface": cache poisoning or rewritten replies affect only a small handful of nodes. The perceived disadvantage is the increased network bandwidth and query load on authoritative servers as the number of caches increases.
Our experiment has two parts: first we looked at authoritative server processing capabilities and then at the bandwidth effects of a larger number of caches.
Authoritative service is generally run on systems with modern software, supporting threading or precomputed responses. Independent testing shows that these stock software solutions can, on current hardware, support query rates in the hundreds of thousands of queries per second. 
On the surface, this result would indicate that there is enough headroom to process more queries, regardless of how they are originated. Regarding bandwidth, a survey of Top-Level Domain (TLD) operators has shown that 92 percent of the delegations have two or more authoritative servers for that data on networks with a minimum uplink bandwidth of 100 Mbps. Selected path characterization from clients to target authoritative servers seems to support our presumption that bandwidth is not a concern.
The DNS was designed to function as a roughly symmetrical transfer of information: a request or query is sent, and the reply reflects the query and supplies the answer and additional data. Historically, the request and reply were within the same order of magnitude in size. Going forward, this model may no longer be valid. With Domain Name System Security Extensions (DNSSEC), IP Version 6 (IPv6), and Naming Authority Pointer (NAPTR) records being possible candidates in the Resource Record set (RRset), the traffic profile more closely resembles an HTTP request/response, with a significant amount of data returned in answer to a simple question.
With this information, we can project from a worst case in today's environment, where a query/reply exchange is about 260 bytes, to a worst case in a future environment, where a query/reply exchange approaches 9 KB. Clearly, the amount of bandwidth available to authoritative servers needs to grow as new DNS capabilities are deployed, but for the nonce, most have sufficient bandwidth overhead to absorb a modest increase in the number of queries presented.
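A back-of-envelope calculation makes the projection concrete. The 260-byte and 9-KB exchange sizes are the figures above; the 10,000 queries-per-second rate is a hypothetical illustration, not a measured load.

```python
# Back-of-envelope bandwidth projection for an authoritative server.
# Exchange sizes come from the text; the query rate is hypothetical.

def uplink_mbps(queries_per_sec, bytes_per_exchange):
    """Bandwidth consumed, in megabits per second."""
    return queries_per_sec * bytes_per_exchange * 8 / 1_000_000

today = uplink_mbps(10_000, 260)     # plain DNS, ~260 B per exchange
future = uplink_mbps(10_000, 9_000)  # DNSSEC/IPv6/NAPTR, ~9 KB

print(f"today:  {today:.1f} Mbps")   # ~21 Mbps fits a 100 Mbps uplink
print(f"future: {future:.1f} Mbps")  # ~720 Mbps exceeds it
```

At the same query rate, the future-case replies consume roughly 35 times the bandwidth, which is why the uplinks must grow even if query volume does not.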
Modification of the Number of Caching Servers
We began with a cache that serviced 140 stub resolvers on the University of Southern California’s Information Sciences Institute (USC/ISI) campus in a "normal" dense cache mode (Figure 2).
Traffic traces show a distribution of priming queries to 534 authoritative servers in the first 15 minutes of clearing the cache.
We then added nine new caches and redistributed the 140 stub resolvers among the 10 caches in a sparse cache mode (Figure 3), then restarted all the caches. In the first 15 minutes, the number of priming queries from each of the caches averaged 61, with a total of 622 unique priming queries across all caches. The number of "duplicate" queries between caches averaged 45. Although the number of queries to the authoritative servers was slightly higher, the results indicate a small but significant difference in the contents of the individual caches.
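The overlap measurement can be sketched with set arithmetic: each cache records the names it queried, the union gives the unique priming queries, and pairwise intersections give the "duplicate" queries between caches. The names below are toy data, not the traces from our experiment.

```python
from itertools import combinations

# Toy sketch of the sparse-cache measurement: each cache records
# the set of names it primed with. Illustrative data only.
cache_queries = {
    "cache-a": {"com.", "net.", "isi.edu.", "example.com."},
    "cache-b": {"com.", "org.", "isi.edu.", "example.net."},
    "cache-c": {"com.", "net.", "arpa.", "example.org."},
}

# Unique priming queries seen across all caches (the union).
unique_total = set().union(*cache_queries.values())

# Average pairwise overlap: "duplicate" queries between caches.
pairs = list(combinations(cache_queries.values(), 2))
avg_dup = sum(len(a & b) for a, b in pairs) / len(pairs)

print(len(unique_total), round(avg_dup, 2))
```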
Reducing the size of the user population for each cache reduces the attack surface for the DNS overall because we have effectively compartmentalized the threat to a small number of nodes. Generally, restarting a cache for a small number of nodes is considered acceptable, whereas restarting a cache for 10,000 or 100,000 nodes would significantly affect operations.
Moving the cache closer to the resolver improves overall response time and may better support mobility of the node. If validation is also placed with the cache, it is possible to increase confidence in validation, because validation data need not travel over untrusted, open networks using DNS protocols.
The concept of supporting larger numbers of full DNS servers on more nodes raises concerns, but most systems these days have enough processing power and bandwidth to support this application. Administrative and management processes can be fully automated. Overall, this design complements other, protocol-based attempts to increase DNS integrity.