How can I resolve DNS problems somewhere in the middle of recursion?

Solution 1:

How can I debug this problem and find the offending nameserver?

daxd5 offered some good starting advice, but the only real answer here is that you need to know how to think like a recursive DNS server. Because numerous misconfigurations at the authoritative layer can produce an inconsistent SERVFAIL, you either need a DNS professional or online validation tools to track them down.

The goal isn't to cop out of helping you; I just want to make sure you understand that there is no single conclusive answer to that question.


In your particular case, I noticed that strugee.net appears to be a zone signed with DNSSEC. This is evident from the presence of the DS and RRSIG records in the referral chain:

# dig +trace +additional strugee.net
<snip>
strugee.net.            172800  IN      NS      dns2.registrar-servers.com.
strugee.net.            172800  IN      NS      dns1.registrar-servers.com.
strugee.net.            172800  IN      NS      dns3.registrar-servers.com.
strugee.net.            172800  IN      NS      dns4.registrar-servers.com.
strugee.net.            172800  IN      NS      dns5.registrar-servers.com.
strugee.net.            86400   IN      DS      16517 8 1 B08CDBF73B89CCEB2FD3280087D880F062A454C2
strugee.net.            86400   IN      RRSIG   DS 8 2 86400 20160423051619 20160416040619 50762 net. w76PbsjxgmKAIzJmklqKN2rofq1e+TfzorN+LBQVO4+1Qs9Gadu1OrPf XXgt/AmelameSMkEOQTVqzriGSB21azTjY/lLXBa553C7fSgNNaEXVaZ xyQ1W/K5OALXzkDLmjcljyEt4GLfcA+M3VsQyuWI4tJOng184rGuVvJO RuI=
dns2.registrar-servers.com. 172800 IN   A       216.87.152.33
dns1.registrar-servers.com. 172800 IN   A       216.87.155.33
dns3.registrar-servers.com. 172800 IN   A       216.87.155.33
dns4.registrar-servers.com. 172800 IN   A       216.87.152.33
dns5.registrar-servers.com. 172800 IN   A       216.87.155.33
;; Received 435 bytes from 192.41.162.30#53(l.gtld-servers.net) in 30 ms

Before we go any further, we need to check whether or not the signing is valid. DNSViz is a tool frequently used for this purpose, and it confirms that there are indeed problems. The angry red in the picture is suggesting that you have a problem, but rather than mousing over everything we can just expand Notices on the left sidebar:

RRSIG strugee.net/A alg 8, id 10636: The Signature Expiration field of the RRSIG RR (2016-04-14 00:00:00+00:00) is 2 days in the past.
RRSIG strugee.net/DNSKEY alg 8, id 16517: The Signature Expiration field of the RRSIG RR (2016-04-14 00:00:00+00:00) is 2 days in the past.
RRSIG strugee.net/DNSKEY alg 8, id 16517: The Signature Expiration field of the RRSIG RR (2016-04-14 00:00:00+00:00) is 2 days in the past.
RRSIG strugee.net/MX alg 8, id 10636: The Signature Expiration field of the RRSIG RR (2016-04-14 00:00:00+00:00) is 2 days in the past.
RRSIG strugee.net/NS alg 8, id 10636: The Signature Expiration field of the RRSIG RR (2016-04-14 00:00:00+00:00) is 2 days in the past.
RRSIG strugee.net/SOA alg 8, id 10636: The Signature Expiration field of the RRSIG RR (2016-04-14 00:00:00+00:00) is 2 days in the past.
RRSIG strugee.net/TXT alg 8, id 10636: The Signature Expiration field of the RRSIG RR (2016-04-14 00:00:00+00:00) is 2 days in the past.
net to strugee.net: No valid RRSIGs made by a key corresponding to a DS RR were found covering the DNSKEY RRset, resulting in no secure entry point (SEP) into the zone. (216.87.152.33, 216.87.155.33, UDP_0_EDNS0_32768_4096)

The problem is clear: the signatures on your zone have expired and the zone needs to be re-signed. You are seeing inconsistent results because not all recursive servers have DNSSEC validation enabled. Those that validate reject your domain and return SERVFAIL, while for those that do not it is business as usual.
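If you would rather check signature expiry from the command line than rely on DNSViz, here is a minimal sketch. It assumes dig's usual presentation format, in which the ninth whitespace-separated field of an RRSIG line is the expiration time as YYYYMMDDHHMMSS in UTC. The function name rrsig_expiry is made up for this example.

```shell
#!/bin/sh
# Read dig answer lines on stdin and report the expiration of each RRSIG.
# $1 is the reference time as YYYYMMDDHHMMSS (UTC); it defaults to now.
# rrsig_expiry is a hypothetical helper name, not part of dig.
rrsig_expiry() {
    now="${1:-$(date -u +%Y%m%d%H%M%S)}"
    awk -v now="$now" '
        $4 == "RRSIG" {
            # $5 = type covered, $9 = expiration, $11 = key tag
            state = ($9 + 0 < now + 0) ? "EXPIRED" : "ok"
            printf "%s %s keytag=%s expires=%s %s\n", $1, $5, $11, $9, state
        }'
}

# Usage (needs network access):
#   dig +dnssec +cd strugee.net A | rrsig_expiry
```

Against a validating resolver you will probably need +cd (checking disabled), as in the usage line; otherwise the resolver answers SERVFAIL and there are no records to inspect.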


Edit: Comcast's DNS infrastructure is known to implement DNSSEC validation, and as one of their customers I can confirm that I'm seeing a SERVFAIL as well.

$ dig @75.75.75.75 strugee.net | grep status
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 2011
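One way to tell a validation failure apart from other causes of SERVFAIL is to repeat the query with the CD (checking disabled) bit set, which dig exposes as +cd. If the plain query fails and the +cd query succeeds, DNSSEC validation is the likely culprit. A rough sketch follows; dig_status and classify are hypothetical helper names, and the classification is only a heuristic.

```shell
#!/bin/sh
# Extract the response code (the "status:" value) from dig output on stdin.
dig_status() {
    sed -n 's/.*status: \([A-Z]*\),.*/\1/p'
}

# Compare two statuses: $1 from a normal query, $2 from a query sent with +cd.
classify() {
    if [ "$1" = "SERVFAIL" ] && [ "$2" = "NOERROR" ]; then
        echo "likely DNSSEC validation failure"
    elif [ "$1" = "$2" ]; then
        echo "same result with and without validation ($1)"
    else
        echo "inconclusive ($1 vs $2)"
    fi
}

# Usage (needs network access); 75.75.75.75 is the resolver queried above:
#   normal=$(dig @75.75.75.75 strugee.net | dig_status)
#   nocheck=$(dig +cd @75.75.75.75 strugee.net | dig_status)
#   classify "$normal" "$nocheck"
```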

Solution 2:

While you are indeed seeing that the authoritative name servers are responding correctly, you need to follow the entire chain of DNS resolution. That is, walk down the whole DNS hierarchy, starting from the root servers.

$ dig net NS
;; ANSWER SECTION:
net.            172800  IN  NS  c.gtld-servers.net.
net.            172800  IN  NS  f.gtld-servers.net.
net.            172800  IN  NS  k.gtld-servers.net.
;; snipped extra servers given
$ dig @c.gtld-servers.net strugee.net NS
;; AUTHORITY SECTION:
strugee.net.        172800  IN  NS  dns2.registrar-servers.com.
strugee.net.        172800  IN  NS  dns1.registrar-servers.com.
;; snipped extra servers again

This basically checks that the public DNS servers are working, and it repeats what your DNS resolver should be doing on your behalf. So you should get the same answers as above on your DigitalOcean server unless something is wrong with their DNS resolver:

$ dig net NS
$ dig strugee.net NS
$ dig strugee.net

If the first two queries fail, the DNS on DigitalOcean's side is failing. Check your /etc/resolv.conf and try querying the secondary DNS server directly. If the secondary one works, switch the order of the resolvers and try again.
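To test each configured resolver individually, something like the following could work. It is a sketch that assumes a conventional resolv.conf with one "nameserver ADDRESS" entry per line; list_resolvers is a hypothetical helper name.

```shell
#!/bin/sh
# Print the nameserver addresses listed in a resolv.conf-style file.
# $1 is the file to read; it defaults to /etc/resolv.conf.
list_resolvers() {
    awk '$1 == "nameserver" { print $2 }' "${1:-/etc/resolv.conf}"
}

# Usage (needs network access): query each configured resolver in turn
# and print only the status field of its reply.
#   for ns in $(list_resolvers); do
#       printf '%s: ' "$ns"
#       dig @"$ns" strugee.net NS | grep -o 'status: [A-Z]*'
#   done
```

A resolver that prints nothing after its address did not answer at all, which points to a reachability problem rather than a bad answer.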