DNS failing to propagate worldwide

Solution 1:

This is not directly a DNS problem; it's a network routing problem between some parts of the internet and the DNS servers for serverfault.com. Since the nameservers can't be reached, the domain stops resolving.

As far as I can tell the routing problem is on the (Global Crossing?) router with IP address 204.245.39.50.

As shown by @radius, packets to ns52 (as used by stackoverflow.com) pass from here to 208.109.115.121 and from there work correctly. However packets to ns22 go instead to 208.109.115.201.

Since the two name server addresses are both in the same /24, and the corresponding BGP announcement is also for a /24, this shouldn't happen.
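As a quick sanity check of the "same /24" claim, here is a minimal Python sketch using the standard `ipaddress` module. The addresses are the ones quoted in the traceroutes in this thread (ns52 = 208.109.255.26, ns22 = 208.109.255.11, plus the two differing next hops); the helper name `same_prefix` is only an illustration.

```python
import ipaddress

def same_prefix(addr_a: str, addr_b: str, prefix_len: int = 24) -> bool:
    """Return True if both addresses fall inside the same prefix of the given length."""
    # strict=False lets us derive the covering network from a host address.
    net_a = ipaddress.ip_network(f"{addr_a}/{prefix_len}", strict=False)
    return ipaddress.ip_address(addr_b) in net_a

# Name server addresses (ns52 and ns22) as quoted in this thread.
print(same_prefix("208.109.255.26", "208.109.255.11"))       # True: same /24
# The two different next hops seen after 204.245.39.50.
print(same_prefix("208.109.115.121", "208.109.115.201"))     # True: same /24
# At /32 granularity the name servers are of course distinct.
print(same_prefix("208.109.255.26", "208.109.255.11", 32))   # False
```

So a router holding only the /24 route has no basis for treating the two name servers differently; something more specific than the /24 must be involved.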

I've done traceroutes via my network which ultimately uses MFN Above.net instead of Global Crossing to get to GoDaddy and there's no sign of any routing trickery below the /24 level - both name servers have identical traceroutes from here.

The only times I've ever seen something like this, it was caused by broken Cisco Express Forwarding (CEF). This is a hardware-level cache used to accelerate packet routing. Unfortunately, it occasionally gets out of sync with the real routing table and tries to forward packets via the wrong interface. CEF entries can go down to the /32 level even if the underlying routing table entry is for a /24. These sorts of problems are tricky to find, but once identified they're normally easy to fix.
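To illustrate why a single stale /32 entry is enough to break one name server while its neighbours keep working, here is a toy longest-prefix-match lookup in Python. It is only a sketch of the general principle, not of how CEF is actually implemented, and the table contents (including which entry is "stale") are hypothetical, loosely modelled on the traceroutes in this thread.

```python
import ipaddress

def lookup(table, dest):
    """Return the next hop of the most specific (longest) prefix containing dest."""
    addr = ipaddress.ip_address(dest)
    matches = [(net, hop) for net, hop in table.items() if addr in net]
    if not matches:
        return None
    # Longest prefix wins, so a /32 overrides the covering /24.
    _, hop = max(matches, key=lambda item: item[0].prefixlen)
    return hop

# Hypothetical forwarding table: a correct /24 plus one stale host entry.
table = {
    ipaddress.ip_network("208.109.255.0/24"): "208.109.115.121",
    ipaddress.ip_network("208.109.255.11/32"): "208.109.115.201",  # assumed stale
}

print(lookup(table, "208.109.255.26"))  # 208.109.115.121 (follows the /24)
print(lookup(table, "208.109.255.11"))  # 208.109.115.201 (caught by the /32)
```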

I've e-mailed GC and also tried to speak to them, but they won't create a ticket for non-customers. If any of you are a customer of GC, please try and report this...

UPDATE at 10:38 UTC: As Jeff has noted, the problem has now cleared. Traceroutes to both servers mentioned above now go via the 208.109.115.121 next hop.

Solution 2:

Your DNS servers for serverfault.com [ns21.domaincontrol.com, ns22.domaincontrol.com] have been unreachable for the last ~20h, at least from a couple of major ISPs in Sweden [Telia, Tele2, Bredband2].

At the same time, the 'neighbor' DNS servers for stackoverflow.com & superuser.com [ns51.domaincontrol.com, ns52.domaincontrol.com] are reachable.

sample traceroute to ns52.domaincontrol.com:

 1. xxxxxxxxxxx
 2. 83.233.28.193           
 3. 83.233.79.81            
 4. 213.200.72.5            
 5. 64.208.110.129          
 6. 204.245.39.50           
 7. 208.109.115.121         
 8. 208.109.115.162         
 9. 208.109.113.62          
10. 208.109.255.26          

and to ns21.domaincontrol.com

 1. xxxxxxxxxxxx
 2. 83.233.28.193      
 3. 83.233.79.81       
 4. 213.200.72.5       
 5. 64.208.110.129     
 6. 204.245.39.50      
 7. 208.109.115.201    
 8. ???

Maybe filtering is misconfigured, or someone triggered unwanted DDoS protection that blacklisted some parts of the internet. You should probably contact your DNS service provider, Go Daddy.

You can verify whether the problem is [partially] solved by:

  1. checking whether Go Daddy has reacted and changed the name servers, e.g. look up serverfault.com at http://www.squish.net/dnscheck/ using record type: ANY
  2. checking whether the listed name servers respond to ping from Telia via a looking glass [not very scientific, since name servers can work fine and still block ICMP, but in this case ICMP seems to be allowed to the other servers]
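Since ping only tests ICMP, a more direct check is to send the name server an actual DNS query and see whether anything comes back. The following is a minimal sketch using only the Python standard library; the function names are illustrative, and it probes from the machine it runs on, so it only tells you something when run from an affected network.

```python
import socket
import struct

def build_query(name: str, qtype: int = 2, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query packet (qtype 2 = NS, class IN)."""
    # Header: ID, flags (0 = standard query, no recursion), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0000, 1, 0, 0, 0)
    # The name is encoded as length-prefixed labels terminated by a zero byte.
    labels = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    )
    question = labels + b"\x00" + struct.pack("!HH", qtype, 1)
    return header + question

def server_answers(server: str, name: str, timeout: float = 3.0) -> bool:
    """Return True if the server sends back a UDP reply with a matching query ID."""
    query = build_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(query, (server, 53))
            reply, _ = sock.recvfrom(512)
        except OSError:  # covers timeouts and failures to resolve the server name
            return False
    return len(reply) >= 12 and reply[:2] == query[:2]
```

For example, `server_answers("ns22.domaincontrol.com", "serverfault.com")` returning False from an affected ISP while the same call against ns52.domaincontrol.com returns True would match the traceroutes above. Note that the server's own hostname must still be resolvable for the probe to be sent at all.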

edit: traceroutes from working places

poland

 1. xxxxxxxxxxxxxxx
 2. 153.19.40.254               
 3. ???
 4. 153.19.254.236              
 5. 212.191.224.205             
 6. 213.248.83.129              
 7. 80.91.254.171               
 8. 80.91.249.105               
    80.91.251.230
    80.91.254.93
    80.91.251.52
 9. 213.248.89.182              
10. 204.245.39.50               
11. 208.109.115.121             
12. 208.109.115.162             
13. 208.109.113.62              
14. 208.109.255.26              

germany

 1. xxxxxxxxxxxx
 2. 89.149.218.181       
 3. 89.149.218.2         
 4. 134.222.105.249      
 5. 134.222.231.205      
 6. 134.222.227.146      
 7. 80.81.194.26         
 8. 64.125.24.6          
 9. 64.125.31.249        
10. 64.125.27.165        
11. 64.125.26.178        
12. 64.125.26.242        
13. 209.249.175.170      
14. 208.109.113.58       
15. 208.109.255.26       

edit: everything works fine now indeed.


Solution 3:

My suggestions: as explained by Alnitak, the problem is not DNS but routing (probably BGP). The fact that nothing was changed in the DNS setup is normal, since the problem was not in the DNS.

serverfault.com currently has a very poor DNS setup, certainly insufficient for an important site like this:

  • only two name servers
  • all the eggs in the same basket (both are in the same AS)

We've just seen the result: a routing glitch (something which is quite common on the Internet) is sufficient to make serverfault.com disappear for some users (depending on their operators, not on their countries).

I suggest adding more name servers, located in other ASes. This would provide resilience against this kind of failure. You can either rent them from private companies or ask serverfault users to offer secondary DNS hosting (maybe only if the user has > 1000 rep :-)


Solution 4:

I confirm that NS21.DOMAINCONTROL.COM and NS22.DOMAINCONTROL.COM are also unreachable from the ISP Free.fr in France.
Like pQd's traceroute, mine also ends after 208.109.115.201 for both ns21 and ns22.

traceroute to NS22.DOMAINCONTROL.COM (208.109.255.11), 64 hops max, 40 byte packets
 1  x.x.x.x (x.x.x.x)  2.526 ms  0.799 ms  0.798 ms
 2  78.224.126.254 (78.224.126.254)  6.313 ms  6.063 ms  6.589 ms
 3  213.228.5.254 (213.228.5.254)  6.099 ms  6.776 ms *
 4  212.27.50.170 (212.27.50.170)  6.943 ms  6.866 ms  6.842 ms
 5  212.27.50.190 (212.27.50.190)  8.308 ms  6.641 ms  6.866 ms
 6  212.27.38.226 (212.27.38.226)  68.660 ms  185.527 ms  14.123 ms
 7  204.245.39.50 (204.245.39.50)  48.544 ms  19.391 ms  19.753 ms
 8  208.109.115.201 (208.109.115.201)  19.315 ms  19.668 ms  34.110 ms
 9  * * *
10  * * *
11  * * *
12  * * *

But ns52.domaincontrol.com (208.109.255.26) does work, and it is in the same subnet as ns22.domaincontrol.com (208.109.255.11):

traceroute to ns52.domaincontrol.com (208.109.255.26), 64 hops max, 40 byte packets
 1  x.x.x.x (x.x.x.x)  1.229 ms  0.816 ms  0.808 ms
 2  78.224.126.254 (78.224.126.254)  12.127 ms  5.623 ms  6.068 ms
 3  * * *
 4  212.27.50.170 (212.27.50.170)  13.824 ms  6.683 ms  6.828 ms
 5  212.27.50.190 (212.27.50.190)  6.962 ms *  7.085 ms
 6  212.27.38.226 (212.27.38.226)  35.379 ms  7.105 ms  7.830 ms
 7  204.245.39.50 (204.245.39.50)  19.896 ms  19.426 ms  19.355 ms
 8  208.109.115.121 (208.109.115.121)  37.931 ms  19.665 ms  19.814 ms
 9  208.109.115.162 (208.109.115.162)  19.663 ms  19.395 ms  29.670 ms
10  208.109.113.62 (208.109.113.62)  19.398 ms  19.220 ms  19.158 ms
11  * * *
12  * * *
13  * * *

As you can see, this time after 204.245.39.50 we go to 208.109.115.121 instead of 208.109.115.201. pQd has the same traceroute. From a working location, I did not cross this 204.245.39.50 router (Global Crossing).
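When comparing many traceroutes, the divergence point can be found mechanically. Here is a small Python sketch; the hop lists are transcribed from the two traceroutes above (with `None` for hops that did not answer), and the function name `first_divergence` is just illustrative.

```python
def first_divergence(path_a, path_b):
    """Return (hop_number, last_common_hop, hop_a, hop_b) at the first differing hop.

    Hop numbers are 1-based. Hops that did not answer (None) are skipped,
    since they cannot be compared. Returns None if no difference is found.
    """
    last_common = None
    for hop_number, (a, b) in enumerate(zip(path_a, path_b), start=1):
        if a is None or b is None:
            continue
        if a != b:
            return hop_number, last_common, a, b
        last_common = a
    return None

# Responding hops from the traceroutes above (None = "* * *").
to_ns22 = ["x.x.x.x", "78.224.126.254", "213.228.5.254", "212.27.50.170",
           "212.27.50.190", "212.27.38.226", "204.245.39.50", "208.109.115.201"]
to_ns52 = ["x.x.x.x", "78.224.126.254", None, "212.27.50.170",
           "212.27.50.190", "212.27.38.226", "204.245.39.50", "208.109.115.121",
           "208.109.115.162", "208.109.113.62"]

print(first_divergence(to_ns22, to_ns52))
# (8, '204.245.39.50', '208.109.115.201', '208.109.115.121')
```

The last common hop is the router making the differing forwarding decision, which here is 204.245.39.50.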

More traceroutes from working and non-working locations would help, but it's highly probable that Global Crossing has a bogus routing entry for 208.109.255.11/32 and 216.69.185.11/32, since 208.109.255.10, 208.109.255.12, 216.69.185.10 and 216.69.185.12 are working well.

Why it has a bogus routing entry is hard to know. Probably 208.109.115.201 (Go Daddy) is advertising a non-working route for 208.109.255.11/32 and 216.69.185.11/32.

EDIT: You can telnet to route-server.eu.gblx.net to connect to the Global Crossing route server and run traceroutes from within the Global Crossing network.

EDIT: It seems that the same problem already occurred with other name servers a few days ago; see: http://www.newtondynamics.com/forum/viewtopic.php?f=9&t=5277&start=0


Solution 5:

It would be handy to see a detailed resolution trace from the locations that are failing, to see which layer of the resolution path it's failing on. I'm not familiar with the service you're using, but perhaps it's an option somewhere.

Failing that, it's most likely that the problems are "lower down" in the tree, as failures at the root or TLDs would affect more domains (you'd hope). To increase resilience, you can delegate to a second DNS service to ensure better redundancy in resolution if there are problems with domaincontrol's network(s).