Horrible DNS latencies since yesterday - family is not happy.

Hi Team: Long-time NextDNS user with an ASUS Merlin router. NO changes on the router; I do a manual setup using stubby.yml, which has not changed.

Starting yesterday, DNS latencies became horrible and queries barely resolve. Normally lookups are instant, all in the low 20 ms range. Family is screaming about how horrible DNS is. What's going on?

Oh, I already tried to download the "diag" script, and 2+ antivirus/malware programs wiped it out immediately without my even opening it. So I doubt that's going to get past those scanners.

Also, I've already rebooted the router and checked my stubby.yml file for any changes.  

ping.nextdns.io yields the following (tried multiple times). If I'm lucky, I might get one endpoint to respond at 25-50 ms.

  hydron-clt                error
  tier-clt                  error
  anexia-mnz                error
  zepto-xrs                 error
  zepto-iad                 error
  wlvrz-was                 error
  teraswitch-pit            error
  router-pit                error
  anexia-atl                error
  vultr-atl                 error

anycast.dns1.nextdns.io error (anycast1)
anycast.dns2.nextdns.io error (anycast2)
dns1.nextdns.io error (ultralow1)
dns2.nextdns.io error (ultralow2)

58 replies

    • G_Mobley
    • 3 yrs ago

    Still been having erratic behavior. Dropped back to QUAD9/Cloudflare for about a week and the erratic, slow DNS seemed to clear up... Switched back to NextDNS on Saturday and things seemed to get noticeably slower. I'm still digging. I do not use the client, as I manually configure stubby.yml with the few changes NextDNS wants. Thanks.

      • John_DeCarlo
      • 3 yrs ago

      G Mobley Try setting DoH in your browser. Then also set up the default DNS 45.90.28.xxx and 45.90.30.xxx on your router. Then test your browser and let me know if that helps.

    • G_Mobley
    • 3 yrs ago

    Thanks. I've got DoH enabled on the ASUS and all DNS is forced through the router's NextDNS setup. I've also reverified that all the "checkboxes" are selected correctly for the NextDNS setup. Been running NextDNS for more than a year without issues until my first posting here. My setup did not change; my firmware and settings were the same when this started. I gotta believe it's my ISP struggling with loads. Is there something extra you think I need now? That's why I was asking about the "dns rules". I've never set up any DNS rules. THANKS!

      • John_DeCarlo
      • 3 yrs ago

      G Mobley I set up the NextDNS CLI on the router. I also set up YogaDNS on every Microsoft Windows 10 machine, and the DoH setting in every browser. NextDNS works great. Very fast DNS lookups. No lag at all.

    • G_Mobley
    • 3 yrs ago

    Thanks!   I may try the client again.  I think my issues are really ISP related b/c up until ~ 3 weeks ago, the setup had been rock solid screaming.   I'll keep watching the ISP.  Stay safe, stay alive!  Peace. 

      • John_DeCarlo
      • 3 yrs ago

      G Mobley My ISP is Spectrum. What is your ISP?

      Can you do a traceroute to 45.76.16.236 and 191.96.51.196 and post the results here? Thank you.

    • G_Mobley
    • 3 yrs ago

    Got up this AM, after switching back to the NextDNS setup yesterday AM, and it appears NextDNS became "unreachable" sometime between 02:00 and 03:00 AM EDT.

    10-4, I'm a long time Spectrum customer with a generally reliable 300/20 service. 

    I restarted dnsmasq on the router (Merlin) just to be sure it was not something lurking in there. Nope, still very dead. There was nothing in the syslog indicating issues other than speed-test failure messages, which are a clue to when it died.

    Switching the DNS resolver to QUAD9 immediately revived my DNS resolution.

    I'll keep trying to figure out the root cause b/c I like the NextDNS service, but I have a feeling it's not my router/setup b/c it has been rock solid for more than a year using the NextDNS service. The past 3-4 weeks, however, have been awful, with the family standing in my door or yelling, "The internet is down again!" The best I've gotten is 1-2 days with NextDNS working before it stops again.

    Here's the fresh tracert from a Windows box.  I think this is the root-cause of what some customers are seeing.
    >tracert 45.76.16.236

    Tracing route to dns.nextdns.io [45.76.16.236]
    over a maximum of 30 hops:

      1    36 ms    <1 ms    <1 ms  AC1900-FA38 [192.168.100.99]
      2     1 ms    <1 ms    <1 ms  192.168.111.99
      3    11 ms    16 ms    10 ms  65.190.80.1
      4    11 ms    17 ms    14 ms  174.111.102.224
      5    17 ms    14 ms    14 ms  cpe-024-025-062-048.ec.res.rr.com [24.25.62.48]
      6    20 ms    14 ms    14 ms  be31.drhmncev01r.southeast.rr.com [24.93.64.184]
      7    27 ms    22 ms    25 ms  66.109.6.224
      8    17 ms    20 ms    16 ms  66.109.5.117
      9    18 ms    22 ms    23 ms  be-206-pe07.ashburn.va.ibone.comcast.net [50.242.149.253]
     10    20 ms    19 ms    21 ms  be-2207-cs02.ashburn.va.ibone.comcast.net [96.110.32.189]
     11    22 ms    22 ms    19 ms  be-1212-cr12.ashburn.va.ibone.comcast.net [96.110.32.206]
     12    25 ms    23 ms    26 ms  be-301-cr11.pittsburgh.pa.ibone.comcast.net [96.110.39.166]
     13    36 ms    25 ms    29 ms  be-1211-cs02.pittsburgh.pa.ibone.comcast.net [96.110.38.133]
     14    23 ms    27 ms    27 ms  be-1212-cr12.pittsburgh.pa.ibone.comcast.net [96.110.38.150]
     15    34 ms    43 ms    35 ms  be-301-cr14.350ecermak.il.ibone.comcast.net [96.110.39.157]
     16    40 ms    42 ms    39 ms  be-1314-cs03.350ecermak.il.ibone.comcast.net [96.110.35.57]
     17    38 ms    38 ms    37 ms  be-2311-pe11.350ecermak.il.ibone.comcast.net [96.110.33.202]
     18    41 ms    39 ms    59 ms  96-87-9-182-static.hfc.comcastbusiness.net [96.87.9.182]
     19     *        *        *     Request timed out.
     20     *        *        *     Request timed out.
     21     *        *        *     Request timed out.
     22    35 ms    36 ms    35 ms  dns.nextdns.io [45.76.16.236]

    Trace complete.

    > tracert 191.96.51.196

    Tracing route to dns.nextdns.io [191.96.51.196]
    over a maximum of 30 hops:

      1    39 ms    <1 ms    <1 ms  AC1900-FA38 [192.168.100.99]
      2     1 ms     1 ms    <1 ms  192.168.111.99
      3    14 ms    11 ms    12 ms  065-190-080-001.inf.spectrum.com [65.190.80.1]
      4    13 ms    14 ms    40 ms  174.111.102.226
      5     8 ms    13 ms    14 ms  cpe-024-025-062-050.ec.res.rr.com [24.25.62.50]
      6    18 ms    14 ms    22 ms  be31.chrcnctr01r.southeast.rr.com [24.93.64.186]
      7    32 ms    19 ms    20 ms  bu-ether11.atlngamq46w-bcr00.tbone.rr.com [66.109.6.34]
      8    19 ms    17 ms    18 ms  66.109.5.125
      9    35 ms    44 ms    24 ms  ae14.cr4-atl2.ip4.gtt.net [208.116.217.29]
     10    37 ms    45 ms    38 ms  ae13.cr10-chi1.ip4.gtt.net [213.254.230.165]
     11    39 ms    39 ms    48 ms  ip4.gtt.net [208.116.128.54]
     12    36 ms    38 ms    37 ms  0.ae1.ar4.ord6.scnet.net [204.93.204.113]
     13    38 ms    34 ms    41 ms  unknown.servercentral.net [50.31.158.46]
     14    39 ms    37 ms    40 ms  dns.nextdns.io [191.96.51.196]

    Trace complete.

    And I think the trace below is dead on for why my links to NextDNS stopped working!

    >tracert 45.90.28.114

    Tracing route to dns1.nextdns.io [45.90.28.114]
    over a maximum of 30 hops:

      1    29 ms    <1 ms    <1 ms  AC1900-FA38 [192.168.100.99]
      2    <1 ms    <1 ms    <1 ms  192.168.111.99
      3    12 ms    13 ms    13 ms  065-190-080-001.inf.spectrum.com [65.190.80.1]
      4    12 ms    17 ms    19 ms  174.111.102.224
      5    13 ms    10 ms    15 ms  cpe-024-025-062-048.ec.res.rr.com [24.25.62.48]
      6    16 ms    14 ms    16 ms  be31.drhmncev01r.southeast.rr.com [24.93.64.184]
      7    23 ms    22 ms    22 ms  66.109.6.224
      8   243 ms   238 ms   253 ms  bu-ether12.vinnva0510w-bcr00.tbone.rr.com [66.109.6.31]
      9   223 ms   258 ms   256 ms  ae-11.edge5.WashintonDC12.Level3.net [4.68.37.213]
     10     *       23 ms    25 ms  ae-1-3501.ear3.NewYork1.Level3.net [4.69.150.202]
     11    26 ms    31 ms    29 ms  CHOOPA-LLC.ear3.NewYork1.Level3.net [4.15.213.214]
     12     *        *        *     Request timed out.
     13     *        *        *     Request timed out.
     14     *        *        *     Request timed out.
     15    24 ms    28 ms    27 ms  dns1.nextdns.io [45.90.28.114]

    Trace complete.

    >tracert 45.90.30.114

    Tracing route to dns2.nextdns.io [45.90.30.114]
    over a maximum of 30 hops:

      1    17 ms    <1 ms    <1 ms  AC1900-FA38 [192.168.100.99]
      2     1 ms     1 ms    <1 ms  192.168.111.99
      3    18 ms    13 ms    14 ms  065-190-080-001.inf.spectrum.com [65.190.80.1]
      4    14 ms    12 ms    13 ms  174.111.102.224
      5    12 ms    10 ms    21 ms  cpe-024-025-062-048.ec.res.rr.com [24.25.62.48]
      6    21 ms    14 ms    14 ms  be31.drhmncev01r.southeast.rr.com [24.93.64.184]
      7    21 ms    22 ms    30 ms  66.109.10.176
      8    17 ms    19 ms    22 ms  66.109.5.117
      9    16 ms    24 ms    23 ms  ash-b2-link.ip.twelve99.net [62.115.188.210]
     10    23 ms    18 ms    18 ms  voxility-svc071266-ic357612.ip.twelve99-cust.net [195.12.254.137]
     11     *        *        *     Request timed out.
     12     *        *        *     Request timed out.
     13    20 ms    19 ms    22 ms  45.11.106.10
     14    18 ms    28 ms    19 ms  dns2.nextdns.io [45.90.30.114]

    Trace complete.

      • olivier
      • 3 yrs ago

      G Mobley I'm not sure I see why this traceroute would show the root cause. You have between 18 and 28 ms latency to the primary and secondary anycast and 35 ms to the ultralow endpoint, which is pretty good.
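
A minimal sketch of that reading, assuming Windows `tracert` output like the traces posted in this thread. It sets aside hops that answer every probe with `*` (a router can decline to reply to traceroute probes while still forwarding traffic) and averages only the final hop, which is the latency that matters for DNS. The names `parse_hops` and `summarize` are illustrative and not part of any NextDNS tool.

```python
import re

# One hop line from Windows `tracert`: hop number, three probe results, host.
HOP = re.compile(r"^\s*(\d+)\s+((?:(?:<?\d+\s+ms|\*)\s+){3})(.*)$")

def parse_hops(text):
    """Return a list of (hop, [rtt_ms or None, ...], host) tuples."""
    hops = []
    for line in text.splitlines():
        m = HOP.match(line)
        if not m:
            continue  # header, footer, or blank line
        probes = re.findall(r"(<?\d+)\s+ms|(\*)", m.group(2))
        rtts = [None if star else int(ms.lstrip("<")) for ms, star in probes]
        hops.append((int(m.group(1)), rtts, m.group(3).strip()))
    return hops

def summarize(text):
    """Report silent hops and the average latency of the last hop."""
    hops = parse_hops(text)
    silent = [hop for hop, rtts, _ in hops if all(r is None for r in rtts)]
    _, last_rtts, last_host = hops[-1]
    answered = [r for r in last_rtts if r is not None]
    avg = sum(answered) / len(answered) if answered else None
    return {"silent_hops": silent, "destination": last_host,
            "destination_avg_ms": avg}

# Last five hops of the trace to 45.90.28.114 posted in this thread.
SAMPLE = """
 11    26 ms    31 ms    29 ms  CHOOPA-LLC.ear3.NewYork1.Level3.net [4.15.213.214]
 12     *        *        *     Request timed out.
 13     *        *        *     Request timed out.
 14     *        *        *     Request timed out.
 15    24 ms    28 ms    27 ms  dns1.nextdns.io [45.90.28.114]
"""

print(summarize(SAMPLE))
```

In this sample, hops 12 to 14 are silent, yet the destination still answers all three probes.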

      How did you configure nextdns? Using dnsmasq with UDP IPs and link IP or something else?

      • Hans_Geiblinger
      • 3 yrs ago

      G Mobley I had a lot of similar issues for weeks, until we disabled DNSSEC on the router. Since then I've been rock solid for about a month now.

      • G_Mobley
      • 3 yrs ago

      Hans Geiblinger   THANKS! I double checked and I already have DNSSEC, Rebind, and Forward unchecked.  I'll keep digging. 

      • G_Mobley
      • 3 yrs ago

      Olivier Poitrey Maybe I'm reading the tracert incorrectly. I view all the hops with timeouts as a red flag. Until a few weeks ago, I was not seeing any timeouts on either tracert reaching back to the NextDNS infrastructure. I do not recall the number of hops though.

      Yes sir, no NextDNS client. I set up NextDNS manually as I've done for the past year+... in fact my renewal is coming up shortly.

      1) make sure dnsmasq.conf has the correct entries (note: IPV6 is disabled)

      no-resolv
      bogus-priv
      strict-order
      server=45.90.30.0  (btw, the generated page says "0", but I know it's really "114" per an earlier issue where you said it would never be "0")
      server=45.90.28.0  (ditto)
      add-cpe-id=XXXXXXXXX

      2) Alter stubby.yml to make sure it has the correct NextDNS entries:

      Set -> round_robin_upstreams: 0

      resolution_type: GETDNS_RESOLUTION_STUB
      dns_transport_list:
        - GETDNS_TRANSPORT_TLS
      tls_authentication: GETDNS_AUTHENTICATION_REQUIRED
      tls_query_padding_blocksize: 128
      appdata_dir: "/var/lib/misc"
      resolvconf: "/tmp/resolv.conf"
      edns_client_subnet_private: 1
      round_robin_upstreams: 0
      idle_timeout: 9000
      tls_connection_retries: 2
      tls_backoff_time: 900
      timeout: 3000
      listen_addresses:
        - 127.0.1.1@53
      upstream_recursive_servers:
        - address_data: 45.90.28.114
          tls_auth_name: "XXXXX.dns1.nextdns.io"
        - address_data: 45.90.30.114
          tls_auth_name: "XXXXX.dns2.nextdns.io"

      > restart dnsmasq..

      I've restarted the dnsmasq service a few times just to make sure it's not lost, and it never recovers until I drop the NextDNS entries and replace them with QUAD9, Cloudflare, or Google. Then usually no screams until I switch the router back to NextDNS and try again.

      Thanks for taking a look!  Let me know what I'm missing.  I've been following the NextDNS discussion in the ASUS Merlin forums for more than a year.. so this is really a mystery. 

      • Hans_Geiblinger
      • 3 yrs ago

      G Mobley When I ran Asus a while ago, I thought you had to handle round_robin via a start script?

      1. Create a start script:

      /jffs/scripts/stubby.postconf

      2. Add:

      #!/bin/sh
      CONFIG=$1
      source /usr/sbin/helper.sh
      pc_replace "round_robin_upstreams: 1" "round_robin_upstreams: 0" $CONFIG

      • G_Mobley
      • 3 yrs ago

      Hans Geiblinger   Hi - Correct.  Works perfectly.

      #!/bin/sh
      #
      # Used by NextDNS to fix the /etc/stubby/stubby.yml AUTOMATICALLY to have "0" for round_robin_upstreams
      #
      CONFIG=$1
      source /usr/sbin/helper.sh
      pc_replace "round_robin_upstreams: 1" "round_robin_upstreams: 0" $CONFIG
      #
      # <EOF>

      • olivier
      • 2 yrs ago

      G Mobley routers along a traceroute path can choose to drop ICMP packets; that's common and a non-issue.

      For 1), you can use .0 if you have the add-cpe-id option. Using 114 won't change anything.

      I'm not sure why you have a stubby config if you configured dnsmasq to go to NextDNS IPs directly. It should be either one or the other.
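
To check which of the two paths a given dnsmasq.conf actually uses, something like the following hedged sketch could help. It assumes plain `server=IP` lines, treats loopback addresses such as stubby's 127.0.1.1 as a local proxy, and treats 45.90.28.0/24 and 45.90.30.0/24 as NextDNS ranges, which is an assumption inferred only from the addresses quoted in this thread. `classify_upstreams` is a hypothetical helper name.

```python
import ipaddress

# Assumption: ranges inferred from the addresses quoted in this thread.
NEXTDNS_NETS = [
    ipaddress.ip_network("45.90.28.0/24"),
    ipaddress.ip_network("45.90.30.0/24"),
]

def classify_upstreams(dnsmasq_conf):
    """Map each plain `server=IP` entry to a rough category."""
    result = {}
    for raw in dnsmasq_conf.splitlines():
        line = raw.strip()
        if not line.startswith("server="):
            continue
        parts = line[len("server="):].split()
        # Skip empty values and domain-specific entries (server=/domain/ip).
        if not parts or parts[0].startswith("/"):
            continue
        # Drop an optional #port or @interface suffix.
        host = parts[0].split("#")[0].split("@")[0]
        try:
            ip = ipaddress.ip_address(host)
        except ValueError:
            continue
        if ip.is_loopback:
            kind = "local-proxy"      # e.g. stubby listening on 127.0.1.1
        elif any(ip in net for net in NEXTDNS_NETS):
            kind = "nextdns-direct"   # plain DNS straight to NextDNS
        else:
            kind = "other"
        result[host] = kind
    return result

DNSMASQ_CONF = """\
no-resolv
bogus-priv
strict-order
server=45.90.30.0
server=45.90.28.0
add-cpe-id=XXXXXXXXX
"""

print(classify_upstreams(DNSMASQ_CONF))
```

If the result shows only "nextdns-direct" entries, dnsmasq is bypassing stubby, so the stubby settings would not affect those queries.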

      Could you simplify the config and remove stubby? If you run ASUS Merlin, why not try our CLI? It should make things more stable.

      • G_Mobley
      • 2 yrs ago

      Olivier Poitrey TY so much for the clarifications on the ICMP packets and "timeouts". I didn't know that was the case with tracert. Still seems like an awful lot of hops to me. As a performance guy, that number of hops would be a nightmare.

      OK on 0 or 114 and yes I have the add-cpe-id option set properly.

      As for using both dnsmasq and stubby: on ASUS Merlin, I run several other AMTM tools which (as I understand it) leverage both stubby and dnsmasq to implement their features: skynet, diversion.

      I realize that diversion duplicates some NextDNS features, but it has useful features you do not have, such as experimental blocking of certain PITA sites. I've run with diversion both ON and OFF and it does not seem to matter for the recent failures I've been reporting. For the months where I had no issues, I ran diversion ON + NextDNS.

      In the very early threads where you were working with the Merlin developers, changes to both stubby.yml and dnsmasq.conf were listed by the SME on Merlin as required for the "manual" config - way before the client arrived.

      I do not run the client b/c it does not integrate well with the other added Merlin AMTM tooling; that was the last thing I saw in those threads.

      Now that I know this is maybe not my ISP's many hops etc.. I'll keep experimenting with adding NextDNS back in.  For all I know it could be a problem caused by something in the entware updates too as they have been known to break things.

      Thanks for your guidance! 

    • G_Mobley
    • 2 yrs ago

    Just an update. To be fair to NextDNS, I had to restart dnsmasq this AM with it connected to QUAD9... so at this point, I think something's up with the setup on my ASUS and maybe it's not totally NextDNS. My apologies. I'll keep digging into the setup. I'd not be surprised if all those recent entware updates are involved. Cheers! Stay safe, stay alive!

    • G_Mobley
    • 2 yrs ago

    Updating this issue with these items:

    1. Switched to QUAD9 and had no DNS issues for 3 weeks.

    2. Switched back to NextDNS today and within 1 hour had DNS resolution issues.

    I caught STUBBY doing this:  Does this help with clues for why NextDNS is not behaving?

    Thanks!

      • olivier
      • 2 yrs ago

      G Mobley what is your stubby config and version?

      • G_Mobley
      • 2 yrs ago

      Hi Olivier:  Running most current ASUS Merlin

      [00:24:38.751471] STUBBY: Stubby version: Stubby 0.3.0
      [00:24:38.754325] STUBBY: Read config from file /etc/stubby/stubby.yml
      [00:24:38.754632] STUBBY: DNSSEC Validation is OFF
      [00:24:38.754664] STUBBY: Transport list is:
      [00:24:38.754690] STUBBY:   - TLS
      [00:24:38.754716] STUBBY: Privacy Usage Profile is Strict (Authentication required)

      I use a script to change round_robin_upstreams from 1 to 0.

      ... /stubby/stubby.yml

      resolution_type: GETDNS_RESOLUTION_STUB
      dns_transport_list:
        - GETDNS_TRANSPORT_TLS
      tls_authentication: GETDNS_AUTHENTICATION_REQUIRED
      tls_query_padding_blocksize: 128
      appdata_dir: "/var/lib/misc"
      resolvconf: "/tmp/resolv.conf"
      edns_client_subnet_private: 1
      round_robin_upstreams: 0
      idle_timeout: 9000
      tls_connection_retries: 2
      tls_backoff_time: 900
      timeout: 3000
      listen_addresses:
        - 127.0.1.1@53
      upstream_recursive_servers:
        - address_data: 45.90.28.0
          tls_auth_name: "XXXXXXX.dns1.nextdns.io"
        - address_data: 45.90.30.0
          tls_auth_name: "XXXXXXX.dns2.nextdns.io"

      Thanks for taking another look.  I monitor stubby -l now.. trying to catch whatever's causing my issues.

      G. Mobley

      • olivier
      • 2 yrs ago

      G Mobley Seems like stubby's fallback algorithm is pretty meh: https://github.com/getdnsapi/stubby/issues/105

      Does it fix the issue if you set round_robin_upstreams to 1?

      Any reason for not wanting to use the CLI? It should be much more stable.

      • G_Mobley
      • 2 yrs ago

      Olivier Poitrey Thank you for the guidance. I have never tried setting RRU to 1 b/c the setup instructions state that the value must be "0". I will try "1" tomorrow AM. I cannot play anymore tonight.

      WRT the NextDNS CLI, yes sir: I have ~6 IoT devices and cameras which do not function well when their DNS is messed with, so I list them on the "DNSFilter" page so they go directly to QUAD9. They work fine when that is set up. FWIW, when I have this exact same setup hitting 2 x QUAD9 hosts + round_robin_upstreams: 1 (or Cloudflare; I used them both to test), DNS worked for 3 weeks without a hiccup or a single name failing to resolve. I was watching it with "stubby -l". I'll change it tomorrow and see how it goes.

      I'm reviewing that stubby link you posted above.  They are talking a lot about the timeouts...   What's the best timeout for NextDNS?  Quote>  "I am wondering if it would be worthwhile adding a note in stubby.yml.example explaining that stubby will cycle servers when round_robin_upstreams: 0 is set and idle_timeout is set to a value longer than a given upstream server has configured on the backend."  The default idle_timeout in my current stubby.yml is 9000ms.   Could that be triggering the problem that post is referencing?   Thank you sir.

      • G_Mobley
      • 2 yrs ago

      Olivier Poitrey Morning! ~24 hours with RRU:1 and I've not seen (nor have I heard screams about) any DNS failing to resolve. TY! This <IS> progress! I'll keep the router running this way for a week, continue to monitor, and report in again. Do you have suggestions on the "proper idle_timeout" for NextDNS? It is still set at 9000ms, the default delivered in ASUS Merlin. Many people cannot change it b/c that value is not exposed in the ASUS Merlin GUI. I think they settled on 9000ms as a good value for QUAD9, Cloudflare, and the others they built into the GUI to select for DoT. You made my day!

      • olivier
      • 2 yrs ago

      G Mobley our keep-alive is around 30s, so 9s should be fine. This is frustrating because this issue is not easily reproducible (I never managed to), which makes it hard to debug...
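
As a quick sanity check of that arithmetic, here is a small sketch that reads `idle_timeout` from a stubby.yml and compares it with a server-side keep-alive. The 30 s figure is only the approximate value mentioned in this reply, the sample config is trimmed from the one posted earlier in the thread, and `read_scalar` and `idle_timeout_ok` are hypothetical helper names.

```python
import re

def read_scalar(config_text, key):
    """Return the integer value of a top-level `key: <int>` line, or None."""
    pattern = rf"^{re.escape(key)}:[ \t]*(\d+)[ \t]*$"
    match = re.search(pattern, config_text, re.MULTILINE)
    return int(match.group(1)) if match else None

def idle_timeout_ok(config_text, server_keepalive_s):
    """True if stubby would drop an idle connection before the server does."""
    idle_ms = read_scalar(config_text, "idle_timeout")
    return idle_ms is not None and idle_ms < server_keepalive_s * 1000

# Trimmed from the stubby.yml posted earlier in this thread.
STUBBY_YML = """\
round_robin_upstreams: 0
idle_timeout: 9000
tls_connection_retries: 2
tls_backoff_time: 900
timeout: 3000
"""

# Assumption: ~30 s server keep-alive, as stated approximately above.
print(read_scalar(STUBBY_YML, "idle_timeout"))
print(idle_timeout_ok(STUBBY_YML, 30))
```

With these values, 9000 ms is well under 30 000 ms, so the idle timeout alone would not be expected to trigger the server cycling described in the stubby issue.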

      • G_Mobley
      • 2 yrs ago

      Olivier Poitrey Morning sir. Reporting on the DNS issues above on my ASUS/AX86U/Merlin/386.2_2. Since I changed stubby.yml to use the default of RRU:1 instead of RRU:0, the router has not lost DNS resolution. No family standing in my door! I know that's not the configuration as documented, but it has actually returned to being reliable.

      My gut says many ASUS/Merlin users simply fill in the WAN GUI page and never bother editing stubby.yml, which on Merlin defaults to RRU:1. I'll continue monitoring the logs.

      I also found your posting explaining when to use the X.Y.Z.0 vs the X.Y.Z.### setups. I'd never read or understood that before that post. Maybe a good addition to the generated setups?

      See -> https://help.nextdns.io/t/p8htq2y/dnsmasq-setting-clarification-ipv4-address-and-strict-order  

      At this point I feel the issue lies in stubby handling the RR setup on a RRU:0 setup - especially based on the link you posted earlier.  I'm betting it's not a common default or code-path. 

      If I can, maybe I'll put QUAD9 back in and manually set stubby to RRU:0 to see if DNS starts crapping out with that too! I've never tried that one.

      Thank you sir!

      • G_Mobley
      • 2 yrs ago

      Hi Olivier, reporting back in. My NextDNS + Merlin 386.2_2 setup has been very stable since the above change to use RR:1. I too believe there is something wrong in dnsmasq when set up to use RR:0. I'm keeping the config running this way and will report back in another week. Since January 2021, with the same config and RR:0, the router never stayed up more than about 2 days (meaning DNS working and family not screaming). Thanks!
