System reboots automatically when TensorFlow model is too large

Changing the GPU power settings will work if your PSU has enough power (in watts). I limited my GPU's (TITAN X) power draw to a maximum of 200 W using:

sudo nvidia-smi -pl 200

NOTE: Each GPU has its own power limit range. For example, the TITAN X accepts values between 125 W and 300 W, so make sure the value you pass falls within that range.
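
As a rough sketch of the range check described in the note, the snippet below clamps a requested limit to the supported range before it is passed to nvidia-smi. The helper name clamp_power_limit is made up for illustration, and the 125 and 300 values are simply the TITAN X limits quoted above; read the real range for your own card from the query output.

```shell
# Hypothetical helper (not part of nvidia-smi): clamp a requested power
# limit to the range the GPU supports. All arguments are whole watts.
#   $1 = requested limit, $2 = minimum allowed, $3 = maximum allowed
clamp_power_limit() {
    if [ "$1" -lt "$2" ]; then
        echo "$2"
    elif [ "$1" -gt "$3" ]; then
        echo "$3"
    else
        echo "$1"
    fi
}

# Read-only query: print the supported range and the current limit
# for every GPU. Skipped when nvidia-smi is not installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=name,power.min_limit,power.max_limit,power.limit \
        --format=csv || true
fi

# Example use with the TITAN X range from the note (125 W to 300 W):
#   sudo nvidia-smi -pl "$(clamp_power_limit 200 125 300)"
```

A request outside the range is pulled back to the nearest bound instead of being passed to nvidia-smi as an out-of-range value.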


I tracked the issue down to a faulty power supply. It had enough capacity according to its spec, and limiting GPU power consumption by running "nvidia-smi -pl 150" didn't help at all. It probably couldn't handle bursts in power consumption.
Anyway, after I replaced the "Corsair CX750 Builder Series ATX 80 PLUS" power supply with a "Cooler Master V1000", the issue went away. See the details of my investigation in the TensorFlow GitHub issue.


Make sure it is not a power supply unit (PSU) problem. I was seeing strange occasional reboots on my development machine. As I increased the size of the input (batch size, larger NN), the reboots became more frequent as well. It turned out to be a PSU problem. A quick check is to limit GPU power consumption and see whether the behavior goes away. For instance, you can limit power to about 150 watts with this command (you'll need sudo rights):

sudo nvidia-smi -pl 150
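
To go with the quick check above, it can help to log the GPU's power draw while the model trains, so you can see how close it was to the limit before a reboot. This is only a sketch: the function names and the gpu_power.csv path are made up for illustration, and the last samples before a hard reboot may not reach the disk.

```shell
# Hypothetical log path; override it by setting LOG_FILE beforehand.
LOG_FILE="${LOG_FILE:-gpu_power.csv}"

# Append one "timestamp, watts" sample per second to the log.
# This runs until it is killed, so start it in the background.
log_gpu_power() {
    nvidia-smi --query-gpu=timestamp,power.draw \
        --format=csv,noheader,nounits -l 1 >> "$LOG_FILE"
}

# Print the highest power draw (in watts) recorded in a log file.
#   $1 = path to a log written by log_gpu_power
max_power_draw() {
    awk -F', ' '$2 + 0 > max { max = $2 + 0 } END { printf "%.2f\n", max }' "$1"
}

# Example use:
#   log_gpu_power &                # start logging in the background
#   python train.py                # hypothetical training script
#   max_power_draw "$LOG_FILE"     # after the run, or after the reboot
```

If the peak stays well under the limit you set and the machine still reboots, that points away from sustained draw and toward the PSU itself, as in the answer above where limiting to 150 W did not help.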

Tags:

Tensorflow