Spark cluster full of heartbeat timeouts, executors exiting on their own

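If you are using PySpark, changing the Spark context's configuration should solve this problem. Set it as follows. Note that spark.executor.heartbeatInterval (default 10s) should be lower than spark.network.timeout (default 120s). Give each value an explicit unit: a bare number is read as milliseconds for heartbeatInterval but as seconds for network.timeout.

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

# The heartbeat interval (200s) stays below the network timeout (300s).
conf = SparkConf().setAppName("application") \
    .set("spark.executor.heartbeatInterval", "200s") \
    .set("spark.network.timeout", "300s")
sc = SparkContext.getOrCreate(conf)
sqlcontext = SQLContext(sc)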

Hope this solves your problem. If you run into further errors, see the Spark configuration documentation.
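
The rule that the heartbeat interval must stay below the network timeout can be checked before the job is submitted. The following is a minimal sketch in plain Python, not part of Spark: the helper names to_ms and check_heartbeat are hypothetical, and only a subset of Spark's time suffixes is handled. It deliberately rejects bare numbers so that the unit is always explicit.

```python
import re

# Multipliers to milliseconds for a subset of Spark's time suffixes (assumption:
# only these four are needed here).
_UNIT_MS = {"ms": 1, "s": 1_000, "m": 60_000, "min": 60_000, "h": 3_600_000}


def to_ms(value):
    """Convert a string such as '300s' or '200000ms' to milliseconds."""
    match = re.fullmatch(r"(\d+)\s*([a-z]+)", value.strip().lower())
    if match is None or match.group(2) not in _UNIT_MS:
        raise ValueError(f"expected a number with an explicit unit, got {value!r}")
    return int(match.group(1)) * _UNIT_MS[match.group(2)]


def check_heartbeat(heartbeat_interval, network_timeout):
    """Return True if the heartbeat interval is shorter than the network timeout."""
    return to_ms(heartbeat_interval) < to_ms(network_timeout)


print(check_heartbeat("10s", "120s"))    # True
print(check_heartbeat("600s", "300s"))   # False
```

Calling check_heartbeat with the two strings you intend to pass to SparkConf.set gives a quick sanity check before the cluster ever sees them.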


The answer was rather simple. In my spark-defaults.conf I set the spark.network.timeout to a higher value. Heartbeat interval was somewhat irrelevant to the problem (though tuning is handy).

When using spark-submit I was also able to set the timeout as follows:

$SPARK_HOME/bin/spark-submit \
  --conf spark.network.timeout=10000000 \
  --class myclass.neuralnet.TrainNetSpark \
  --master spark://master.cluster:7077 \
  --driver-memory 30G \
  --executor-memory 14G \
  --num-executors 7 \
  --executor-cores 8 \
  --conf spark.driver.maxResultSize=4g \
  --conf spark.executor.heartbeatInterval=10000000 \
  path/to/my.jar

Missing heartbeats and executors being killed by YARN are nearly always caused by out-of-memory errors (OOMs). Inspect the logs on the individual executors and look for the text "running beyond physical memory". If you have many executors and find it cumbersome to inspect all of the logs manually, I recommend monitoring your job in the Spark UI while it runs. As soon as a task fails, the UI reports the cause, so it is easy to spot. Note that some tasks will report failure only because their executor has already been killed, so make sure you look at the cause for each individual failing task.
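
If you have already collected the executor logs into a local directory, the search for that message can be scripted. This is a rough sketch in plain Python; the function name find_oom_logs and the directory layout are assumptions, and it only does a substring match on the text quoted above.

```python
from pathlib import Path

# The message quoted above, as it appears when YARN kills a container.
OOM_MARKER = "running beyond physical memory"


def find_oom_logs(log_dir, marker=OOM_MARKER):
    """Return the files under log_dir (searched recursively) that contain marker."""
    hits = []
    for path in sorted(Path(log_dir).rglob("*")):
        if not path.is_file():
            continue
        # errors="replace" keeps the scan going if a log contains odd bytes.
        with path.open("r", errors="replace") as handle:
            if any(marker in line for line in handle):
                hits.append(path)
    return hits


# Usage (the directory name is hypothetical):
#   for log_file in find_oom_logs("collected-executor-logs"):
#       print(log_file)
```

Each path returned points at an executor log worth reading in full, which narrows down where to look when there are many executors.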

Note also that most OOM problems can be solved quickly by simply repartitioning your data at appropriate places in your code (again look at the Spark UI for hints as to where there might be a need for a call to repartition). Otherwise, you might want to scale up your machines to accommodate the need for memory.
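
As a starting point for choosing the argument to repartition, here is a small sketch that derives a partition count from the input size. The helper suggest_num_partitions is hypothetical, and the 128 MiB target per partition is only a rule-of-thumb assumption, not something Spark mandates; tune it against what the Spark UI shows for your job.

```python
def suggest_num_partitions(total_bytes, target_partition_bytes=128 * 1024 * 1024):
    """Return how many partitions keep each one at or below the target size."""
    if total_bytes < 0 or target_partition_bytes <= 0:
        raise ValueError("sizes must be non-negative, and the target must be positive")
    # Integer ceiling division; always return at least one partition.
    return max(1, -(-total_bytes // target_partition_bytes))


# Example: roughly 50 GiB of input split into partitions of about 128 MiB.
num_partitions = suggest_num_partitions(50 * 1024**3)
print(num_partitions)  # 400

# With a PySpark DataFrame named df (hypothetical), the result would be used as:
#   df = df.repartition(num_partitions)
```

Smaller partitions mean each task holds less data in memory at once, which is why repartitioning often relieves the OOM pressure described above.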