nginx: how do I track down a random 500 from nginx itself (not my application)? It may have something to do with load.

We use a combination of nginx log formats and LMON to catch things like this. An nginx log format like:

log_format main '$status:$request_time:$upstream_response_time:$pipe:$body_bytes_sent $connection $remote_addr $host $remote_user [$time_local] "$request" "$http_referer" "$http_user_agent" "$http_x_forwarded_for" $upstream_addr $upstream_cache_status "in: $http_cookie"';

This captures a lot of helpful diagnostic info, such as the upstream server that handled the request, and it puts the status at the front so it is easy to read even when the logs are scrolling by quickly.
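
A format only takes effect once an access_log directive references it by name. A minimal sketch, assuming a shortened format called diag and a log path of /var/log/nginx/access.log (both are illustrative, not taken from the setup described here):

```nginx
http {
    # Hypothetical, shortened variant of the format above; status goes first.
    log_format diag '$status:$request_time:$upstream_response_time '
                    '$remote_addr "$request" $upstream_addr';

    # The format is unused until an access_log directive names it.
    access_log /var/log/nginx/access.log diag;
}
```

With a status-first layout like this, something as simple as grepping for lines that start with 5 is enough to pull out the server errors.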

We use LMON to watch these logs and alert us (pager/email) when it sees errors such as 500s, 503s, or 400s:

http://www.bsdconsulting.no/tools/lmon-README

This helps you get alerted to an issue while it's happening, which is the easiest time to debug it.

The other thing to consider, if you haven't already, is that by default nginx only moves on to another upstream after a connection error or timeout; a 500 returned by an upstream is passed straight back to the client. If you have multiple upstreams, you can configure proxy_next_upstream so that nginx tries another one when it gets a 500, hopefully hiding the failure from the user:

http://wiki.nginx.org/NginxHttpProxyModule#proxy_next_upstream
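
As a rough sketch of what that could look like (the upstream name app_backend and the two addresses are made up for illustration, and this fragment is assumed to live inside the http block):

```nginx
# Hypothetical pool of two application servers.
upstream app_backend {
    server 10.0.0.11:8080;
    server 10.0.0.12:8080;
}

server {
    listen 80;

    location / {
        proxy_pass http://app_backend;

        # The default is "error timeout". Adding http_500 makes nginx
        # retry the request on the next server when an upstream answers 500.
        proxy_next_upstream error timeout http_500;
    }
}
```

Keep in mind that a retry is only possible if nothing has been sent to the client yet, and that newer nginx versions do not retry non-idempotent requests such as POST unless non_idempotent is added to the list. This also only helps when the 500 comes from an upstream; it does nothing for a 500 that nginx generates itself.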


error_log $filename debug; (with $filename standing in for the path to your log file) turns on debug-level logging in the error log. This gives you lots and lots of detail about nginx's internal state at the time of the error. Note that the debug messages themselves are only written if nginx was compiled with --with-debug (which several distros do by default; nginx -V shows the configure arguments).

Be warned that the "debug" level really does generate lots of output, to the point that you may want to watch your disk space...
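
If the volume is a problem, one option is to keep the global level low and enable debug logging only for selected client addresses with debug_connection. A minimal sketch, assuming a binary built with --with-debug; the log path and the addresses are placeholders:

```nginx
# Normal logging for everyone else.
error_log /var/log/nginx/error.log warn;

events {
    # Debug-level logging only for connections from these clients
    # (placeholder addresses; requires nginx built with --with-debug).
    debug_connection 192.0.2.10;
    debug_connection 192.0.2.0/24;
}
```

This is only useful if you can reproduce the 500 from a known address. For a truly random error under load you may still need the global debug level for a short window.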