Hadoop Streaming API with Python mapper script - File Not Found

February 09, 2017, at 04:05 AM

I'm having the same issue as the users in this question: Hadoop Streaming - Unable to find file error

As in this other question, hadoop streaming with python modules, I am using a zip file containing additional Python code that I import from my mapper. In the script file posted below, you can see the zip file on line 21, which is referenced in the call to the Hadoop Streaming API jar file on line 26. Unlike the aforementioned StackOverflow question, I am not using a pickle file.

I decided to post my problem in a new thread, with additional details that didn't seem appropriate for a comment on that page.

The Hadoop Streaming API is throwing a Java file-not-found exception when running my script. The interesting thing is that it works in pseudo-distributed mode, but it does not work on a cluster of a few nodes (I have a 4-node cluster on AWS).

I do have rwx permissions on the mapper file, and the deploy.sh script called on line 7 below puts rwx permissions on the generated zip file as well.
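For reference, here is roughly how I check and set those permissions (paths are relative to my project root):

    # ensure the mapper and the generated zip are readable and executable
    chmod +rx feature_extractor_mapper.py deploy/pyimagesearch.zip
    ls -l feature_extractor_mapper.py deploy/pyimagesearch.zip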

Is there something wrong in my call to the Hadoop Streaming API, or is the problem somewhere in my Python code? (Note: the code is from http://gurus.pyimagesearch.com, and I have tested it successfully in pseudo-distributed mode.)
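For what it's worth, a quick way to rule out the Python side is to run the mapper locally against a few input lines, since a streaming mapper just reads stdin and writes stdout (local_dataset.txt below is a placeholder for a local copy of the input file):

    # local smoke test: feed a few dataset lines into the mapper on stdin
    head -3 local_dataset.txt | ./feature_extractor_mapper.py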

Here is my script file that I am running:

 1 #!/bin/sh
 2
 3 # grab the current working directory
 4 BASE=$(pwd)
 5
 6 # create the latest deployable package
 7 sbin/deploy.sh
 8
 9 # change directory to where Hadoop lives
10 cd $HADOOP_HOME
11
12 # (potentially optional): turn off safe mode
13 bin/hdfs dfsadmin -safemode leave
14
15 # remove the previous output directory
16 bin/hdfs dfs -rm -r /user/ubuntu/ukbench/output
17
18 # define the set of local files that need to be present to run the Hadoop
19 # job -- comma separate each file path
20 FILES="${BASE}/feature_extractor_mapper.py,\
21 ${BASE}/deploy/pyimagesearch.zip"
22
23 # run the job on Hadoop
24 bin/hadoop jar share/hadoop/tools/lib/hadoop-streaming-*.jar \
25     -D mapreduce.job.reduces=0 \
26     -files  ${FILES} \
27     -mapper ${BASE}/feature_extractor_mapper.py \
28     -input /user/ubuntu/ukbench/input/ukbench_dataset.txt \
29     -output /user/ubuntu/ukbench/output

And this is the stacktrace from executing the script:

ubuntu@ip-172-31-39-231:~/high_throughput_feature_extraction$ jobs/feature_extractor_mapper.sh
Safe mode is OFF
17/02/08 18:10:46 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
Deleted /user/ubuntu/ukbench/output
packageJobJar: [/tmp/hadoop-unjar2327603386373063535/] [] /tmp/streamjob380494102161319103.jar tmpDir=null
17/02/08 18:10:48 INFO client.RMProxy: Connecting to ResourceManager at *I REMOVED THIS*
17/02/08 18:10:48 INFO client.RMProxy: Connecting to ResourceManager at *I REMOVED THIS*
17/02/08 18:10:49 INFO mapred.FileInputFormat: Total input paths to process : 1
17/02/08 18:10:49 INFO mapreduce.JobSubmitter: number of splits:10
17/02/08 18:10:49 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1486574928548_0004
17/02/08 18:10:50 INFO impl.YarnClientImpl: Submitted application application_1486574928548_0004
17/02/08 18:10:50 INFO mapreduce.Job: The url to track the job: http://*I REMOVED THIS*.compute.amazonaws.com:8088/proxy/application_1486574928548_0004/
17/02/08 18:10:50 INFO mapreduce.Job: Running job: job_1486574928548_0004
17/02/08 18:10:57 INFO mapreduce.Job: Job job_1486574928548_0004 running in uber mode : false
17/02/08 18:10:57 INFO mapreduce.Job:  map 0% reduce 0%
17/02/08 18:11:12 INFO mapreduce.Job: Task Id : attempt_1486574928548_0004_m_000009_0, Status : FAILED
Error: java.lang.RuntimeException: Error in configuring object
        at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:112)
        at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:78)
        at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:136)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:449)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:109)
        ... 9 more
Caused by: java.lang.RuntimeException: Error in configuring object
        at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:112)
        at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:78)
        at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:136)
        at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:38)
        ... 14 more
Caused by: java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:109)
        ... 17 more
Caused by: java.lang.RuntimeException: configuration exception
        at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:222)
        at org.apache.hadoop.streaming.PipeMapper.configure(PipeMapper.java:66)
        ... 22 more
Caused by: java.io.IOException: Cannot run program "/home/ubuntu/high_throughput_feature_extraction/feature_extractor_mapper.py": error=2, No such file or directory
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
        at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:209)
        ... 23 more
Caused by: java.io.IOException: error=2, No such file or directory
        at java.lang.UNIXProcess.forkAndExec(Native Method)
        at java.lang.UNIXProcess.<init>(UNIXProcess.java:247)
        at java.lang.ProcessImpl.start(ProcessImpl.java:134)
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
        ... 24 more
Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Answer 1

I figured out the solution to my own problem, which had been plaguing me for quite some time.

The key is line 27 below: the -files option ships the mapper into each task's working directory on the cluster nodes, so the mapper has to be referenced by its bare filename, in quotes, rather than by an absolute path. The absolute path only exists on the node that submits the job, which is why the original script worked in pseudo-distributed mode but failed on the cluster.

I made a few other changes as well; here is a summary of all of them:

- Comment out line 10, which changes into the Hadoop installation directory.
- Remove the full-path references on lines 20 and 21 (since I'm no longer in the Hadoop directory; see the previous bullet).
- Reference the $HADOOP_HOME directory on line 24. If you are on Cloudera, your path to the streaming jar file will be different, so keep that in mind (see the snippet after this list).
- Line 27: remove the full path (the mapper is referenced from the task's working directory, as described above) and put the .py file in quotes.
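If you're not sure where your distribution keeps the streaming jar, something like this will locate it (assuming $HADOOP_HOME is set):

    # the streaming jar location varies by distribution (e.g. Cloudera)
    find ${HADOOP_HOME} -name 'hadoop-streaming*.jar'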

I hope this helps other people!

 1 #!/bin/sh
 2
 3 # grab the current working directory
 4 BASE=$(pwd)
 5
 6 # create the latest deployable package
 7 sbin/deploy.sh
 8
 9 # change directory to where Hadoop lives
10 #cd $HADOOP_HOME
11
12 # (potentially optional): turn off safe mode
13 hdfs dfsadmin -safemode leave
14
15 # remove the previous output directory
16 hdfs dfs -rm -r /user/ubuntu/ukbench/output
17
18 # define the set of local files that need to be present to run the Hadoop
19 # job -- comma separate each file path
20 FILES="feature_extractor_mapper.py,\
21 deploy/pyimagesearch.zip"
22
23 # run the job on Hadoop
24 ${HADOOP_HOME}/bin/hadoop jar ${HADOOP_HOME}/share/hadoop/tools/lib/hadoop-streaming-*.jar \
25     -D mapreduce.job.reduces=0 \
26     -files  ${FILES} \
27     -mapper "feature_extractor_mapper.py" \
28     -input /user/ubuntu/ukbench/input/ukbench_dataset.txt \
29     -output /user/ubuntu/ukbench/output
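Once the job completes, you can sanity-check the output with something like this (the part file name may vary on your cluster):

    # print the first few lines of the map-only output
    hdfs dfs -cat /user/ubuntu/ukbench/output/part-00000 | head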