Broken Pipe Error causes streaming Elastic MapReduce job on AWS to fail

Your streaming process (your Python script) is terminating prematurely. This may be due to it thinking input is complete (e.g. interpreting an EOF) or to a swallowed exception. Either way, Hadoop is trying to feed data to your script via STDIN, but since the application has terminated (and thus STDIN is no longer a valid file descriptor), you're getting a BrokenPipe error. I would suggest adding stderr traces in your script to see which line of input is causing the problem. Happy coding,

-Geoff
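
For example, a minimal sketch of that kind of stderr tracing; the mapper logic here is just a placeholder, and the message format is up to you:

#!/usr/bin/python
import sys

for lineno, line in enumerate(sys.stdin, 1):
    try:
        # your real mapper logic goes here; echoing is only a stand-in
        sys.stdout.write(line)
    except Exception as e:
        # stderr ends up in the task attempt logs, so you can see which
        # input line blew up; keep reading so the pipe stays open
        sys.stderr.write("failed on input line %d: %r (%s)\n" % (lineno, line, e))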


This is said in the accepted answer, but let me attempt to clarify: you must block on stdin, even if you don't need it! This is not the same as Linux pipes, so don't let that fool you. What happens, intuitively, is that Streaming stands up your executable, then says, "wait here while I go get input for you". If your executable stops for any reason before Streaming sends you 100% of the input, Streaming says, "Hey, where did that executable go that I stood up?...Hmmmm...the pipe is broken, let me raise that exception!" So here is some Python code; all it does is what cat does, but you'll note that this code won't exit until all the input is processed, and that is the key point:

#!/usr/bin/python
import sys

while True:
    s = sys.stdin.readline()   # blocks until the next line of input arrives
    if not s:                  # empty string means EOF: all input has been delivered
        break
    sys.stdout.write(s)        # echo the line, just like cat
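
And if your script genuinely only needs part of the input, the same rule applies: drain the rest of stdin before exiting rather than returning early. A sketch, where the "first 100 lines" cutoff is just an example:

#!/usr/bin/python
import sys

for i, line in enumerate(sys.stdin):
    if i < 100:
        sys.stdout.write(line)   # only the first 100 lines are emitted
    # no break here -- keep reading until EOF so Streaming never sees a broken pipe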

I have no experience with Hadoop on AWS, but I had the same error on a regular Hadoop cluster, and in my case the problem was how I invoked Python: -mapper ./mapper.py -reducer ./reducer.py worked, but -mapper python mapper.py didn't.
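
If you go the ./mapper.py route, the script itself has to be directly executable; a minimal sketch of what that implies (the invocation in the comment uses placeholder paths):

#!/usr/bin/env python
# mapper.py -- chmod +x this file and ship it with the job, e.g.:
#   hadoop jar hadoop-streaming.jar \
#       -input /your/input -output /your/output \
#       -mapper ./mapper.py -reducer ./reducer.py \
#       -file mapper.py -file reducer.py
import sys

for line in sys.stdin:
    sys.stdout.write(line)   # identity mapper; replace with real logic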

You also seem to use a non-standard Python package, warc. Do you submit the necessary files to the streaming job? -cacheFile or -cacheArchive could be helpful.
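
A sketch of the Python side, assuming warc has been zipped up (with the warc package at the top level of the archive) and shipped with something like -cacheArchive s3://your-bucket/warc.zip#warclib; the bucket path and the warclib symlink name are placeholders:

import sys

# The archive is unpacked and symlinked into the task's working directory
# under the name after '#', so put it on sys.path before importing.
sys.path.insert(0, 'warclib')
import warc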