AWS ELB Apache2 503 Service Unavailable: Back-end server is at capacity

Solution 1:

You will get a "Back-end server is at capacity" error when the ELB performs its health checks and receives a "page not found" (or other simple error) response due to a misconfiguration (typically with the NameVirtualHost directive).

Try grepping the Apache log directory for the "ELB-HealthChecker" user agent, e.g.

grep ELB-HealthChecker /var/log/httpd/*

This will typically turn up a 4xx or 5xx error that is easily fixed. Explanations like flooding or MaxClients give the problem way too much credit.

FYI Amazon: why not show the response returned by the health check request? Even a status code would help.
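
Since the ELB won't show you the response itself, a quick workaround is to reproduce the health check by hand from the instance. A minimal sketch; the path and port here are assumptions, so substitute the Target configured on your ELB's health check:

curl -s -o /dev/null -w "%{http_code}\n" -A "ELB-HealthChecker/1.0" http://localhost:80/index.html

If this prints anything other than 200, you have found what the health checker is seeing.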

Solution 2:

I just ran into this issue myself. The Amazon ELB will return this error if there are no healthy instances. Our sites were misconfigured, so the ELB health check was failing, which caused the ELB to take both servers out of rotation. With zero healthy instances, the ELB returned 503 Service Unavailable: Back-end server is at capacity.
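
To confirm this is what is happening, the AWS CLI can show the health state of each registered instance. A minimal sketch, assuming a Classic ELB; the name (my-load-balancer) is a placeholder:

aws elb describe-instance-health --load-balancer-name my-load-balancer

Instances reported as OutOfService, with a description mentioning consecutive failed health checks, confirm the ELB has pulled them from rotation.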


Solution 3:

[EDIT after understanding the question better] Not having any experience with the ELB, I still think this sounds suspiciously like the 503 error that can be thrown when Apache fronts Tomcat and the connection is flooded.

The effect is that if Apache delivers more connection requests than the backend can process, the backend's input queues fill up until no more connections can be accepted. When that happens, the corresponding output queues of Apache start filling up. When the queues are full, Apache throws a 503. It follows that the same could happen when Apache is the backend and the frontend delivers requests at a rate that makes the queues fill up.

The (hypothetical) solution is to size the input connectors of the backend and the output connectors of the frontend accordingly. This turns into a balancing act between the anticipated flooding level and the available RAM of the machines involved.
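
One way to see whether a listen queue is actually filling up is to watch the socket backlog directly. A rough sketch, assuming a Linux box with Apache listening on port 80 (for listening sockets, ss reports the current accept-queue depth in Recv-Q and the configured backlog in Send-Q):

watch -n 2 "ss -ltn 'sport = :80'"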

So when this happens, check your MaxClients setting and monitor your busy workers in Apache (mod_status). Do the same, if possible, with whatever the ELB has that corresponds to Tomcat's connector backlog, maxThreads, etc. In short, look at everything concerning the input queues of Apache and the output queues of the ELB.
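
For the Apache side, something like the following is a starting point. A rough sketch, assuming mod_status is enabled with ExtendedStatus On and reachable at /server-status on localhost:

# watch worker usage; BusyWorkers climbing toward the limit means the queue is about to fill
watch -n 5 'curl -s "http://localhost/server-status?auto" | grep -E "BusyWorkers|IdleWorkers"'

# find where the worker limits are currently set
grep -riE "MaxClients|MaxRequestWorkers|ServerLimit" /etc/httpd/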

Although I fully understand it is not directly applicable, this link contains a sizing guide for the Apache connector. You would need to research the corresponding ELB queue technicalities, then do the math: http://www.cubrid.org/blog/dev-platform/maxclients-in-apache-and-its-effect-on-tomcat-during-full-gc/

As observed in the comments below, a traffic spike is not the only way to overwhelm the Apache connector. If some requests are served more slowly than others, a higher ratio of those can also make the connector queues fill up. This was true in my case.

Also, when this happened to me, I was baffled that I had to restart the Apache service to stop being served 503s. Simply waiting out the connector flooding was not enough. I never figured that out, but one could speculate that Apache was serving from its cache, perhaps?

After increasing the number of workers and the corresponding MaxClients setting (this was multi-threaded Apache on Windows, which if I remember correctly has a couple of other directives for the queues), the 503 problem disappeared. I never actually did the math; I just tweaked the values upward until I could observe a wide margin between peak consumption and the queue resources, and I left it at that.
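
For reference, on a prefork-style Apache 2.2 on Linux the relevant knobs would look something like the following (the mpm_winnt directives on Windows differ, and on Apache 2.4 MaxClients is called MaxRequestWorkers). The values are purely illustrative and should come from your own peak measurements:

cat >> /etc/httpd/conf/httpd.conf <<'EOF'
# illustrative values only -- size these from observed peak queue consumption
ServerLimit    512
MaxClients     512
ListenBacklog  1024
EOF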

Hope this was of some help.


Solution 4:

You can raise the values on the ELB health checker so that a single slow response won't pull a server out of the ELB. Better to have a few users get "service unavailable" than the site being down for everyone.

EDIT: We are able to get away without pre-warming the cache by upping the health check timeout to 25 seconds... after 1-2 minutes the site is responsive as hell.
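
If you want to script that change, the AWS CLI call for a Classic ELB looks roughly like this. The load balancer name and target path are placeholders, and note the timeout must stay below the interval:

aws elb configure-health-check --load-balancer-name my-load-balancer \
    --health-check Target=HTTP:80/index.html,Interval=30,Timeout=25,UnhealthyThreshold=5,HealthyThreshold=2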

EDIT: Just launch a bunch of On-Demand instances, and when your monitoring tools show management just how fast you are, prepay for Amazon Reserved Instances :P

EDIT: It is possible that a single backend instance registered with the ELB is not enough. Just launch a few more and register them with the ELB; that will help you narrow down your problem.
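
Registering the extra instances can also be scripted. A minimal sketch for a Classic ELB; the name and instance IDs are placeholders:

aws elb register-instances-with-load-balancer --load-balancer-name my-load-balancer \
    --instances i-0123456789abcdef0 i-0fedcba9876543210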