
TLS Connection Failures - Stubby

I’m seeing connection failures between Stubby and NextDNS that I haven’t seen before, causing lookup timeouts and excessive connections to the service. Plain DNS works very well. Cloudflare and other DoT providers work well on Stubby, which leads me to think it’s a NextDNS issue. The diagnostic tool cannot look up nextdns.io while Stubby is in use, but it runs successfully when Stubby is bypassed.

Looking for any insight or assistance. 

Version: Stubby 0.4.0 on FreshTomato

daemon.info stubby[20713]: 45.90.28.0 : Upstream : TLS - Resps= 26, Timeouts = 10, Best_auth =Success

(with occasional SERVFAIL from dnsmasq)

config

resolution_type: GETDNS_RESOLUTION_STUB
dns_transport_list:
  - GETDNS_TRANSPORT_TLS
tls_authentication: GETDNS_AUTHENTICATION_REQUIRED
tls_query_padding_blocksize: 256
edns_client_subnet_private: 0
idle_timeout: 9000
tls_connection_retries: 5
tls_backoff_time: 900
timeout: 2000
round_robin_upstreams: 1
tls_min_version: GETDNS_TLS1_3
listen_addresses:
  - 127.0.0.1@5453
  - 0::1@5453
upstream_recursive_servers:
  - address_data: 45.90.28.0
    tls_auth_name: "xxxxxx.dns1.nextdns.io" etc

Will message diag privately on request. 

  • We found why stubby is not happy. We will push a workaround in production ASAP.

    Like 4
      • BS
      • teal_rabbit
      • 1 mth ago
      • Reported - view

      NextDNS THANK YOU for fixing this. Can confirm that NextDNS is behaving well when it previously did not.

      Like
      • Dan
      • Dan.3
      • 1 mth ago
      • Reported - view

      NextDNS amazing! Thank you!

      Like
      • firstlast
      • firstlast
      • 1 mth ago
      • Reported - view

      NextDNS THANK YOU SO MUCH!

      Back to using NextDNS now, so glad to see the ads disappearing from my devices again.

      Like
    • NextDNS Working! Thanks! What was the issue?

      Like
      • Dan
      • Dan.3
      • 2 wk ago
      • Reported - view

      NextDNS sorry to bother you again but: 

      daemon.debug stubby[12925]: 45.90.28.0                               : Conn closed: TLS - *Failure*

      I’m seeing regression on the behaviour previously fixed. 

      Like
  • For the sake of testing, I spun up Stubby on a Debian instance with the config above and can’t resolve lookups:

    $ nslookup eff.org 127.0.0.1
    Server:         127.0.0.1
    Address:        127.0.0.1#53

    ** server can't find eff.org: SERVFAIL

    With Cloudflare dropped into the config, I can resolve addresses. Any ideas?

    Like
    • Dan you made stubby listen on port 5453. To test it, use dig -p 5453 @127.0.0.1 test.com instead.
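
A port mismatch like this can be caught by reading the port out of each listen_addresses entry before querying. The following is a minimal sketch, assuming the address@port form used in the config at the top of the thread and port 53 when no port is given. parse_listen_address is a hypothetical helper for illustration, not part of Stubby.

```python
def parse_listen_address(entry, default_port=53):
    """Split a Stubby-style 'address@port' entry into (address, port).

    Assumption: entries without '@' listen on the standard DNS port 53.
    """
    address, separator, port = entry.partition("@")
    return address, (int(port) if separator else default_port)


# Entries taken from the two configs posted in this thread.
for entry in ["127.0.0.1@5453", "0::1@5453", "127.0.0.1"]:
    address, port = parse_listen_address(entry)
    print(f"query {address} on port {port}")
```

The port returned here is the one to pass to the query tool, so a listener on 5453 will not answer a query sent to 53.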

      Like
      • Dan
      • Dan.3
      • 1 mth ago
      • Reported - view

      NextDNS 

      Sorry, I did see that and modified the config. I was watching the verbose log from Stubby. DNS requests would arrive, the TLS connection would open, and then nothing would happen until the connection closed shortly after. Stubby reported a request timeout, as in the previous example. With the servers swapped to Cloudflare, everything works. Do you see something similar on a Stubby instance? 

      Like
      • Dan
      • Dan.3
      • 1 mth ago
      • Reported - view

      NextDNS 

      Thanks in advance for your help! Stubby logs for example follow (sorry for the wall of text - how do you write code blocks here?)

      dig test.com @127.0.0.1

      ; <<>> DiG 9.11.5-P4-5.1+deb10u5-Debian <<>> test.com @127.0.0.1
      ;; global options: +cmd
      ;; Got answer:
      ;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 15093
      ;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0
      ;; WARNING: recursion requested but not available

      ;; QUESTION SECTION:
      ;test.com.                      IN      A

      ;; Query time: 0 msec
      ;; SERVER: 127.0.0.1#53(127.0.0.1)
      ;; WHEN: Thu Aug 26 18:42:33 AWST 2021
      ;; MSG SIZE  rcvd: 26


      [10:38:00.995746] STUBBY: Read config from file stubby.yml
      [10:38:00.996627] STUBBY: DNSSEC Validation is OFF
      [10:38:00.996663] STUBBY: Transport list is:
      [10:38:00.996678] STUBBY:   - TLS
      [10:38:00.996693] STUBBY: Privacy Usage Profile is Strict (Authentication required)
      [10:38:00.996708] STUBBY: (NOTE a Strict Profile only applies when TLS is the ONLY transport!!)
      [10:38:00.996722] STUBBY: Starting DAEMON....
      [10:38:28.460227] STUBBY: 45.90.28.0                               : Conn opened: TLS - Strict Profile
      [10:38:28.576709] STUBBY: 45.90.28.0                               : Verify passed : TLS
      [10:38:33.458539] STUBBY: 2a07:a8c0::                              : Conn opened: TLS - Strict Profile
      [10:38:34.018680] STUBBY: 2a07:a8c0::                              : Verify passed : TLS
      [10:38:38.458883] STUBBY: 45.90.30.0                               : Conn opened: TLS - Strict Profile
      [10:38:38.460499] STUBBY: 45.90.28.0                               : Conn closed: TLS - Resps=     0, Timeouts  =     1, Curr_auth =Success, Keepalive(ms)=     0
      [10:38:38.463042] STUBBY: 45.90.28.0                               : Upstream   : TLS - Resps=     0, Timeouts  =     1, Best_auth =Success
      [10:38:38.463065] STUBBY: 45.90.28.0                               : Upstream   : TLS - Conns=     1, Conn_fails=     0, Conn_shuts=      0, Backoffs     =     0
      [10:38:38.483120] STUBBY: 45.90.30.0                               : Verify passed : TLS
      [10:38:43.463608] STUBBY: 2a07:a8c0::                              : Conn closed: TLS - Resps=     0, Timeouts  =     1, Curr_auth =Success, Keepalive(ms)=     0
      [10:38:43.463769] STUBBY: 2a07:a8c0::                              : Upstream   : TLS - Resps=     0, Timeouts  =     1, Best_auth =Success
      [10:38:43.463789] STUBBY: 2a07:a8c0::                              : Upstream   : TLS - Conns=     1, Conn_fails=     0, Conn_shuts=      0, Backoffs     =     0
      [10:38:48.464297] STUBBY: 45.90.30.0                               : Conn closed: TLS - Resps=     0, Timeouts  =     1, Curr_auth =Success, Keepalive(ms)=     0
      [10:38:48.464377] STUBBY: 45.90.30.0                               : Upstream   : TLS - Resps=     0, Timeouts  =     1, Best_auth =Success
      [10:38:48.464395] STUBBY: 45.90.30.0                               : Upstream   : TLS - Conns=     1, Conn_fails=     0, Conn_shuts=      0, Backoffs     =     0

      Like
    • Dan please send a diag

      Like
    • Dan your logs show IPv6 addresses but your configuration has only one IPv4 address. Is the config shown above complete? If you have v6 IPs, please try again without them.

      Like
      • Dan
      • Dan.3
      • 1 mth ago
      • Reported - view

      NextDNS I’ve sent a message to you with the diag

      Like
      • Dan
      • Dan.3
      • 1 mth ago
      • Reported - view

      NextDNS thanks for checking. Yes, normal config is the complete output on the NextDNS setup page (4+6). I’ve also tested with just 45.90.28.0 with no configuration specific info. 

      Like
    • Dan please try the full config with ipv6 removed

      Like
      • Dan
      • Dan.3
      • 1 mth ago
      • Reported - view

      NextDNS 

      dig example.com @127.0.0.1

      ; <<>> DiG 9.11.5-P4-5.1+deb10u5-Debian <<>> example.com @127.0.0.1
      ;; global options: +cmd
      ;; Got answer:
      ;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 13951
      ;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0
      ;; WARNING: recursion requested but not available

      ;; QUESTION SECTION:
      ;example.com.                   IN      A

      ;; Query time: 2003 msec
      ;; SERVER: 127.0.0.1#53(127.0.0.1)
      ;; WHEN: Thu Aug 26 22:34:19 AWST 2021
      ;; MSG SIZE  rcvd: 29
       

      Config

      resolution_type: GETDNS_RESOLUTION_STUB
      dns_transport_list:
        - GETDNS_TRANSPORT_TLS
      tls_authentication: GETDNS_AUTHENTICATION_REQUIRED
      tls_query_padding_blocksize: 128
      edns_client_subnet_private: 0
      idle_timeout: 5000
      tls_connection_retries: 5
      tls_backoff_time: 900
      timeout: 2000
      round_robin_upstreams: 1
      #tls_min_version: GETDNS_TLS1_3
      listen_addresses:
        - 127.0.0.1
        - 0::1
      upstream_recursive_servers:
        - address_data: 45.90.28.0
          tls_auth_name: "xxxxxx.dns1.nextdns.io"
        - address_data: 45.90.30.0
          tls_auth_name: "xxxxxx.dns2.nextdns.io"
       

      Stubby log

      [14:34:12.911360] STUBBY: Read config from file stubby_noipv6.yml
      [14:34:12.912172] STUBBY: DNSSEC Validation is OFF
      [14:34:12.912192] STUBBY: Transport list is:
      [14:34:12.912200] STUBBY:   - TLS
      [14:34:12.912208] STUBBY: Privacy Usage Profile is Strict (Authentication required)
      [14:34:12.912215] STUBBY: (NOTE a Strict Profile only applies when TLS is the ONLY transport!!)
      [14:34:12.912223] STUBBY: Starting DAEMON....
      [14:34:17.436308] STUBBY: 45.90.28.0                               : Conn opened: TLS - Strict Profile
      [14:34:17.551961] STUBBY: 45.90.28.0                               : Verify passed : TLS
      [14:34:19.437698] STUBBY: 45.90.28.0                               : Conn closed: TLS - Resps=     0, Timeouts  =     1, Curr_auth =Success, Keepalive(ms)=     0
      [14:34:19.437771] STUBBY: 45.90.28.0                               : Upstream   : TLS - Resps=     0, Timeouts  =     1, Best_auth =Success
      [14:34:19.437787] STUBBY: 45.90.28.0                               : Upstream   : TLS - Conns=     1, Conn_fails=     0, Conn_shuts=      0, Backoffs     =     0


      In contrast, using 1.1.1.1:

      dig example.com @127.0.0.1

      ; <<>> DiG 9.11.5-P4-5.1+deb10u5-Debian <<>> example.com @127.0.0.1
      ;; global options: +cmd
      ;; Got answer:
      ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 46405
      ;; flags: qr rd ra ad; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

      ;; OPT PSEUDOSECTION:
      ; EDNS: version: 0, flags:; udp: 1232
      ;; QUESTION SECTION:
      ;example.com.                   IN      A

      ;; ANSWER SECTION:
      example.com. 71169 IN A 93.184.216.34

      ;; Query time: 34 msec
      ;; SERVER: 127.0.0.1#53(127.0.0.1)
      ;; WHEN: Thu Aug 26 22:44:20 AWST 2021
      ;; MSG SIZE  rcvd: 67

      Like
      • Dan
      • Dan.3
      • 1 mth ago
      • Reported - view

      NextDNS any thoughts?

      Like
    • Dan can you please turn on debug logs?

      Like
      • Dan
      • Dan.3
      • 1 mth ago
      • Reported - view

      NextDNS debug logs for Stubby? Those are in my previous message. 

      Like
    • Dan did you start stubby with the debug log level?

      Like
      • Dan
      • Dan.3
      • 1 mth ago
      • Reported - view

      NextDNS yes. 

      stubby -v 7 -C stubby_noipv6.yml

      The logs you see above immediately follow. 

      Like
      • Dan
      • Dan.3
      • 1 mth ago
      • 1
      • Reported - view

      NextDNS have you had a chance to test this config on an instance of Stubby you control? Unfortunately I have no other test sites, other than another FreshTomato router, which exhibits the same symptoms (but is a different internet provider). 

      If I know it’s my end, I can start down another path - just let me know :)

      Could this have anything to do with the TLS cert changes in June? Thanks again. 

      Like 1
    • Dan we have indeed already tested stubby. Here it seems to be a timeout. Judging from your diag, anycast routing for IPv6 isn’t right from where you are, but v4 should be fine.

      Would you be able to use our CLI instead of stubby?

      Like 1
      • BS
      • teal_rabbit
      • 1 mth ago
      • Reported - view

      NextDNS Not the OP but this doesn't really seem like a fair solution. If your product is meant to work outside of the app you've developed, then it should. Whatever recent changes were made to cause this issue are clearly affecting more than just one person. Otherwise NextDNS shouldn't advertise their DNS IPs for any other solution (DoT/DoH) if the only way you expect customers to use the product is via your CLI app. 🤔

      Like
      • Dan
      • Dan.3
      • 1 mth ago
      • Reported - view

      NextDNS I would like to! But Tomato or Entware CLI isn’t ready yet :(

      I could configure another host to run CLI for the network, but I would rather have it all on the router. I’ll continue running DNS over 53 for now. 

      So Stubby is working okay for you? What are your thoughts on the timeouts? If it was a routing issue, I would be having issues establishing a connection at all, right? DNS over 53 works really well.

      Like
    • Dan stubby is working for many people, but it has always had issues with certain versions and is generally less robust than many other clients. Why it does not work in your case is unclear. The timeout error does not make much sense and the logs do not show much more to debug.

      Like
      • Dan
      • Dan.3
      • 1 mth ago
      • 1
      • Reported - view

      NextDNS thank you. I’m hoping to use the CLI soon. Cloudflare and other resolvers work well with Stubby - what do you think the difference is with NextDNS? DoT should work if it’s functioning with other providers?

      Like 1
      • Dan
      • Dan.3
      • 1 mth ago
      • Reported - view

      NextDNS what additional steps can I take to debug this issue? Unfortunately Stubby’s debug logs can be limited. I’m hoping you might be able to test an instance on a system you control and observe NextDNS logs to see if the servers get hit, with what, and if they respond? That would be very helpful :)

      I’m not able to do additional troubleshooting right now, so I hope you can help!

      Like
    • Dan we tested stubby on many systems and it works. The only known issue with stubby is when it is linked with an old version of openssl, but the error would be different. Some people have also reported stubby randomly falling back and then no longer working, but again, the errors would be different and the fix is easy.

      Please try with another DoT client or CLI to see if you are also getting timeout errors. That is the only next step we can advise.

      Like 1
  • I use AsusWRT-Merlin with NextDNS and DoT. I believe it uses Stubby under the hood. For the past week or so, I've had terrible Internet on all my devices. I was able to pin it down to DNS today. Lots of slow DNS replies or total failures.

    Switching to Cloudflare fixes the issue.

    This may be anecdotal, but perhaps there is some wider issue here.

    Like 3
      • Dan
      • Dan.3
      • 1 mth ago
      • Reported - view

      firstlast do you think you could turn on verbose logging for Stubby and post some snippets here?

      Like
      • BS
      • teal_rabbit
      • 1 mth ago
      • Reported - view

      firstlast I'm in a similar situation... thought it was my IPv6, but it continues to misbehave even when disabled... I've tried everything simple to fix it, because all I have is the DoT setup on my ASUS Merlin router and yeah... nothing fixes it, so I'm glad to hear other people were having issues... I was losing my mind thinking it was something in the configurations I'd messed up.

      Like
  • Here is someone else with the same issue on AsusWRT-Merlin: https://www.snbforums.com/threads/dns-over-tls-and-chroot-nextdns-dot-issue.74466

    It's annoying because it was working for months and now all of a sudden it is an issue. :(

    Like 1
  • Same problem with OpenWrt 19.07 running Stubby 0.3.0 and Debian Buster running Stubby 0.2.5. No problem if I change to Cloudflare or Quad9 DoT servers.

    Like 1
  • @firstlast @goodvibes please provide https://nextdns.io/diag

    Like
  • @NextDNS

    I have just begun (in the last 3 or 4 days) experiencing the same thing with Stubby after it was running fine for months and no changes to my config.  [I am surprised by seeing IPv6 addresses, traceroutes and pings seemingly working.  I have never had IPv6 before and not sure what to make of it -- ISP has not announced it.  Not sure when that started.]

    I have basically the same config as Dan.

    I have sent a diag.

    [EDIT]  Oops.  diag didn't go. 

    Post unsuccessful: Post "https://api.nextdns.io/diagnostic": dial tcp: lookup api.nextdns.io on 127.0.0.1:53: server misbehaving
    Please report this issue on https://github.com/nextdns/diag
     

    Like
    • freeson please run the diag with stubby disabled. Some stubby logs in debug level would also be helpful.

      Like
      • freeson
      • freeson
      • 1 mth ago
      • Reported - view

      NextDNS 

      Here is a stubby log:

      dig lk-case.com
      [20:07:26.040584] STUBBY: 45.90.30.0                               : Conn opened: TLS - Strict Profile
      [20:07:26.078276] STUBBY: 45.90.30.0                               : Verify passed : TLS
      [20:07:32.050409] STUBBY: 45.90.28.0                               : Conn opened: TLS - Strict Profile
      [20:07:32.077409] STUBBY: 45.90.28.0                               : Verify passed : TLS
      
      ; <<>> DiG 9.16.18 <<>> lk-case.com
      ;; global options: +cmd
      ;; Got answer:
      ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 30527
      ;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
      
      ;; OPT PSEUDOSECTION:
      ; EDNS: version: 0, flags:; udp: 1232
      ;; QUESTION SECTION:
      ;lk-case.com.            IN    A
      
      ;; ANSWER SECTION:
      lk-case.com.        600    IN    A    23.227.38.32
      
      ;; Query time: 249 msec
      ;; SERVER: 127.0.0.1#53(127.0.0.1)
      ;; WHEN: Fri Sep 03 16:07:38 EDT 2021
      ;; MSG SIZE  rcvd: 67
      
      [20:07:41.043523] STUBBY: 45.90.30.0                               : Conn closed: TLS - Resps=     2, Timeouts  =     1, Curr_auth =Success, Keepalive(ms)=     0
      [20:07:41.043600] STUBBY: 45.90.30.0                               : Upstream   : TLS - Resps=     2, Timeouts  =     2, Best_auth =Success
      [20:07:41.043646] STUBBY: 45.90.30.0                               : Upstream   : TLS - Conns=     2, Conn_fails=     0, Conn_shuts=      0, Backoffs     =     0
      [20:07:47.050639] STUBBY: 45.90.28.0                               : Conn closed: TLS - Resps=     1, Timeouts  =     1, Curr_auth =Success, Keepalive(ms)=     0
      [20:07:47.050713] STUBBY: 45.90.28.0                               : Upstream   : TLS - Resps=     2, Timeouts  =     2, Best_auth =Success
      [20:07:47.050762] STUBBY: 45.90.28.0                               : Upstream   : TLS - Conns=     2, Conn_fails=     0, Conn_shuts=      0, Backoffs     =     0
      


      With round robin on, it takes 12 seconds to respond.  With it off, 6 seconds.  If the timeout is less than 6 seconds it fails (SERVFAIL) very consistently.  With the timeout greater than 6 seconds it usually succeeds (NOERROR) with the response coming in at 6 seconds.
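
To reproduce timings like these without dig, a small probe can time a single lookup against the local listener. This is a sketch using only the Python standard library. build_query and probe are hypothetical names, the query is a bare A-record question over UDP with no EDNS, and 127.0.0.1 port 53 is assumed from the dig output above.

```python
import socket
import struct
import time


def build_query(name, query_id=0x1234):
    """Build a minimal DNS query for an A record with recursion desired."""
    # Header: ID, flags (RD bit set), QDCOUNT=1, ANCOUNT=NSCOUNT=ARCOUNT=0.
    header = struct.pack(">HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = b"".join(
        bytes([len(part)]) + part.encode("ascii")
        for part in name.rstrip(".").split(".")
    )
    question = labels + b"\x00" + struct.pack(">HH", 1, 1)  # QTYPE=A, QCLASS=IN
    return header + question


def probe(name, server="127.0.0.1", port=53, wait=15.0):
    """Send one UDP query; return (rcode, elapsed_seconds).

    rcode is None when no usable reply arrives within `wait` seconds.
    """
    packet = build_query(name)
    started = time.monotonic()
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(wait)
        try:
            sock.sendto(packet, (server, port))
            reply, _ = sock.recvfrom(4096)
        except OSError:  # covers socket timeouts and other socket errors
            return None, time.monotonic() - started
    elapsed = time.monotonic() - started
    if len(reply) < 4:
        return None, elapsed
    return reply[3] & 0x0F, elapsed  # RCODE is the low 4 bits of flags byte 2


if __name__ == "__main__":
    rcode, elapsed = probe("example.com")
    print(f"rcode={rcode} elapsed={elapsed:.3f}s")
```

An RCODE of 0 is NOERROR and 2 is SERVFAIL. Running the probe a few times with round_robin_upstreams set to 1 and then to 0 should show whether replies cluster around the 12 and 6 second marks described above.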

      Like
  • For everybody having an issue with stubby, please provide the version of stubby you are running and on what OS (the router firmware name and version if it is a router).

    Like 2
      • freeson
      • freeson
      • 1 mth ago
      • Reported - view

      NextDNS OpenWRT 19.07.8, Stubby 0.3.0-1

      Like
      • BS
      • teal_rabbit
      • 1 mth ago
      • Reported - view

      NextDNS Asuswrt-Merlin 386.3_2,  Stubby 0.4.0

      Like
      • Dan
      • Dan.3
      • 1 mth ago
      • Reported - view

      NextDNS FreshTomato 2021.5, Stubby 0.4.0

      Like
    • NextDNS OpenWrt 19.07  Stubby 0.3.0 and Debian Buster running Stubby 0.2.5

      Like
  • I'm back to seeing similar behaviour now. Are other stubby users experiencing a regression?

     

    Thanks!

    Like
      • Dan
      • Dan.3
      • 1 mth ago
      • Reported - view

      firstlast all still looks okay from my end. No timeouts or TLS issues. 

      Like
      • firstlast
      • firstlast
      • 1 mth ago
      • Reported - view

      Dan Thanks for checking!

      Like
    • firstlast My Stubby (OpenWrt 19.07) was behaving erratically (lot of SERVFAIL errors) but was fixed with a service restart.

      Like
    • firstlast as a matter of fact, I still get a lot of those errors. It almost seems random when the problem occurs and when not 

      Like
      • firstlast
      • firstlast
      • 3 wk ago
      • Reported - view

      Gordon Freeman Yep, same. I stopped using NextDNS a few days ago as I don't have time to keep troubleshooting it.

      I'll give it another shot eventually and hope that whatever this issue is has been sorted out.

      Like
    • firstlast there is also the chance that stubby is at fault. On their GitHub page there is one issue opened, but it also only links to here

       

      https://github.com/getdnsapi/stubby/issues/297

      Like
      • Dan
      • Dan.3
      • 3 wk ago
      • Reported - view

      Gordon Freeman very interesting. Okay, I have a task that restarts stubby every two hours on my router. I’ll stop this and see if the issue returns. The random connection failures were an issue prior to me opening this ticket, I just ran out of troubleshooting steam, and then it became unusable. It was periods of around five to ten minutes every couple of days where I could see the DNS requests hit the NextDNS logs, but dnsmasq would return SERVFAIL. Enabling round-robin in stubby also helped with this. 
       

      The issue described in the stubby issue is eerily similar. I’ll come back with results in the next day or two. 

      Like
      • Dan
      • Dan.3
      • 3 wk ago
      • Reported - view

      Okay: 24 hours in and I’m not seeing any major issues:

      Sep 30 10:55:17 daemon.info stubby[26616]: 45.90.28.0 : Upstream : TLS - Resps= 4382, Timeouts = 1, Best_auth =Success
      Sep 30 10:55:17 mary73 daemon.info stubby[26616]: 45.90.28.0 : Upstream : TLS - Conns= 1754, Conn_fails= 0, Conn_shuts= 1, Backoffs = 9

      The back offs are from my flaky DSL resyncing, so only the single connection shut is interesting. As I mentioned, I run round-robin, so I’ve got log entries for each, but they’re all similar. 
      Will update tomorrow. Has anyone else had issues during the last 24 hours? Which version of stubby?

      Like
      • Dan
      • Dan.3
      • 3 wk ago
      • Reported - view

      Update: I started seeing issues again. I had to restart stubby to stop the SERVFAILs. This is the same issue I had before and the workaround was setting a schedule to “service stubby restart” every two hours. 
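
A fixed two-hour restart can be replaced with a restart that only fires when lookups are actually failing. The sketch below is one hypothetical way to make that decision. It assumes some external probe supplies a list of recent DNS response codes (None for no reply), and it reuses the "service stubby restart" command quoted above. The names and the threshold of three are illustrative assumptions, not anything Stubby or NextDNS provides.

```python
import subprocess

SERVFAIL = 2  # DNS RCODE for "server failure"


def should_restart(recent_rcodes, threshold=3):
    """True when the last `threshold` probes all failed (SERVFAIL or no reply)."""
    if len(recent_rcodes) < threshold:
        return False
    return all(
        code is None or code == SERVFAIL
        for code in recent_rcodes[-threshold:]
    )


def restart_stubby():
    """Run the restart command from the post; adjust for your init system."""
    subprocess.run(["service", "stubby", "restart"], check=False)


# Example decision on a hypothetical probe history (0 = NOERROR).
history = [0, 0, SERVFAIL, None, SERVFAIL]
if should_restart(history):
    print("three consecutive failures: a restart would be triggered here")
```

Requiring several consecutive failures avoids restarting on a single slow lookup, which matters on a link where backoffs already occur for unrelated reasons.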

      Like
      • firstlast
      • firstlast
      • 2 wk ago
      • Reported - view

      Dan I just tried again yesterday switching to NextDNS DoT servers and once again my home network came crawling to a halt. Same issues, cannot resolve queries.

      Sigh, I'm back to using Cloudflare DoT with the exact same config and have absolutely no issues. The problem is definitely unique to NextDNS somehow.

      Oh well.

      Like
    • firstlast seems to be running pretty well the last few days, I don't trust it 🤔🤨

      Like
  • Status Fixed
  • 1 Likes
  • Last active 13 days ago
  • 55 Replies
  • 509 Views
  • 9 Following