NGINX keeps sending requests to offline upstream

What's keepalive?

The idea behind keepalive is to address the latency of establishing TCP connections over high-latency networks. It takes a 3-way handshake to establish a TCP connection, so, when there is perceivable latency between the client and the server, keepalive can greatly speed things up by reusing existing connections instead of opening new ones.

Why do folks put nginx in front of their backends?

Nginx is very efficient at juggling thousands of connections, whereas many backends are not. Hence, folks often put nginx in front of their real web servers, so that the connections between the cloud and the user are kept alive and cached for subsequent reuse.
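
For illustration only, a minimal sketch of that kind of setup in nginx configuration; the pool name, addresses and ports are hypothetical:

    # Hypothetical pool of application servers on the local network.
    upstream backend {
        server 192.168.0.10:8080;
        server 192.168.0.11:8080;
    }

    server {
        listen 80;

        location / {
            # nginx absorbs the slow, long-lived client connections and
            # proxies the requests to the backend over the fast local network.
            proxy_pass http://backend;
        }
    }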

Note that nginx didn't even support upstream keepalive until 1.1.4, as per http://nginx.org/r/keepalive, since, as per the above, it's more likely to consume extra resources on your upstream than to speed up any processing, provided that you have sub-millisecond latency between all your hosts (e.g., between nginx and the upstream servers).
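
For reference, enabling upstream keepalive looks roughly like the sketch below, building on the hypothetical pool above; the value of 16 is just a placeholder, not a recommendation. As the linked documentation notes, it also requires HTTP/1.1 towards the upstream and a cleared Connection header:

    upstream backend {
        server 192.168.0.10:8080;
        # Cache at most 16 idle connections to this pool per worker process.
        keepalive 16;
    }

    location / {
        proxy_pass http://backend;
        proxy_http_version 1.1;          # upstream keepalive requires HTTP/1.1
        proxy_set_header Connection "";  # and a cleared Connection header
    }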

Do you see where this is going?

By using an excessive number of keepalive connections on a LAN (a few hundred per upstream server), you're likely only making things slower, not faster, even if you hadn't been experiencing the issue you describe.

What happens when a service/port is down?

Normally, when a given port is unavailable on a host, the host immediately returns a TCP reset packet, known as RST, which tells the client (e.g., nginx) what is up, letting it take appropriate action promptly. (Packets other than RST may be used for the same effect, for example, an ICMP Destination Unreachable message when the route to the host is unavailable.)

"If we stop the service on the backend, nginx handles it correctly. The issue only reproduces when stopping the entire VM." – Ramiro Berrelleza, Oct 27 at 22:48

Your comment above indicates that it's the lack of timely connection-denied packets that confuses nginx: your setup is most likely simply dropping the packets that nginx sends. And without any response to its requests, how could anyone possibly know whether your backend service is unavailable, or simply exhibiting enterprise-level behaviour?

What should one do?

  • First, as already mentioned, even if you haven't been experiencing the problems you describe, you're likely only making things slower by using the upstream keepalive functionality on a LAN, especially with such a high number of cached connections.

  • Otherwise, you might want to configure your setup (firewall, router, virtualisation environment etc.) to return appropriate packets for unavailable hosts, which should certainly make nginx work as expected, as you've already seen for yourself what happens when TCP RST packets are, in fact, returned by the host.

  • Another option is to adjust the various timeouts within nginx, to account for the possibility of your upstreams disappearing without a trace, and to compensate for your network not being capable of generating the appropriate control packets.

    This may include the connect(2) timeout for establishing new TCP connections, via http://nginx.org/r/proxy_connect_timeout, which can go into the millisecond range (e.g., if all your servers are on a local network, and you're NOT running enterprise-grade software with the enterprise-level multisecond delays). It also includes the timeouts for ongoing read(2) and send(2) operations, via http://nginx.org/r/proxy_read_timeout and http://nginx.org/r/proxy_send_timeout, respectively, which depend on how fast your backend application is supposed to reply to requests. You might also want to increase the fail_timeout parameter of the server directive within the upstream context, as per http://nginx.org/en/docs/http/ngx_http_upstream_module.html#server. A rough config sketch follows this list.
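
For illustration, here's a hedged sketch of that kind of tuning in nginx configuration; the addresses and all timeout values are placeholders to be adapted to your own network and application:

    upstream backend {
        # Take a failing server out of rotation for longer than the default 10s;
        # the max_fails/fail_timeout values here are illustrative only.
        server 192.168.0.10:8080 max_fails=3 fail_timeout=30s;
    }

    server {
        location / {
            proxy_pass http://backend;

            # Fail fast on a LAN when SYN packets go unanswered (default is 60s).
            proxy_connect_timeout 100ms;

            # How long to wait for the backend to accept data and to reply;
            # depends on how fast the application is supposed to respond.
            proxy_send_timeout 5s;
            proxy_read_timeout 5s;
        }
    }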


Try setting keepalive 16 and test again. 1k cached connections per worker could be too much for your case.
