Hadoop-Streaming
"Test::More" : "0",
"strict" : "0",
"warnings" : "0"
}
}
},
"release_status" : "stable",
"resources" : {
"repository" : {
"type" : "git",
"url" : "git://github.com/spazm/hadoop-streaming-frontend",
"web" : "http://github.com/spazm/hadoop-streaming-frontend"
}
},
"version" : "0.143060"
}
url: http://module-build.sourceforge.net/META-spec-v1.4.html
version: '1.4'
name: Hadoop-Streaming
requires:
IO::Handle: '0'
Moo: '0'
Moo::Role: '0'
Params::Validate: '0'
Safe::Isa: '0'
resources:
repository: git://github.com/spazm/hadoop-streaming-frontend
version: '0.143060'
[@Git]
changelog = Changes ; this is the default
allow_dirty = dist.ini ; see Git::Check...
allow_dirty = Changes ; ... and Git::Commit
commit_msg = v%v%n%n%c ; see Git::Commit
tag_format = v%v ; see Git::Tag
tag_message = %v ; see Git::Tag
push_to = origin ; see Git::Push
[MetaResources]
repository.web = http://github.com/spazm/hadoop-streaming-frontend
examples/wordcount/example_hadoop.sh
#!/bin/sh -x
#1) copy the input file to your hadoop dfs
#2) build the hadoop job (using the 0.20 file layout)
hadoop dfs -copyFromLocal input ./
hadoop \
jar /usr/lib/hadoop/contrib/streaming/hadoop-0.20.1+152-streaming.jar \
-input input \
-output myoutput \
-mapper map.pl \
-reducer reduce.pl \
-file map.pl \
-file reduce.pl
lib/Hadoop/Streaming.pm
=item Map/Reduce at wikipedia
http://en.wikipedia.org/wiki/MapReduce
=item Hadoop
http://hadoop.apache.org
=item Hadoop Streaming interface:
http://hadoop.apache.org/common/docs/r0.20.1/streaming.html
=item PAR::Packer
http://search.cpan.org/perldoc?PAR::Packer
=back
=head1 EXAMPLES
=over 4
lib/Hadoop/Streaming.pm
my_mapper < test_input_file > output.map && \
sort output.map > output.mapsort && \
my_combiner < output.mapsort > output.combine && \
my_reducer < output.combine > output.reduce
=item hadoop commandline
Run this in hadoop from the shell:
hadoop \
jar $streaming_jar_name \
-D mapred.job.name="my hadoop example" \
-input my_input_file \
-output my_output_hdfs_path \
-mapper my_mapper \
-combiner my_combiner \
-reducer my_reducer
$streaming_jar_name is the full path to the streaming jar provided by your hadoop installation. For my 0.20 install the path is:
/usr/lib/hadoop-0.20/contrib/streaming/hadoop-0.20.1+152-streaming.jar
The -D line is optional. If included, -D lines must come directly after the jar name and before other options.
For this hadoop job to work, the mapper, combiner and reducer must be full paths that are valid on each box in the hadoop cluster. There are a few ways to make this work.
=item hadoop jar -file option
Additional files may be bundled into the job via the '-file' option to hadoop jar (as in the example below). These files are included in the jar that is distributed to each host, and are visible in the current working directory of the process. Subdire...
example:
hadoop \
jar $streaming_jar_name \
-D mapred.job.name="my hadoop example" \
-input my_input_file \
-output my_output_hdfs_path \
-mapper my_mapper \
-combiner my_combiner \
-reducer my_reducer \
-file /path/to/my_mapper \
-file /path/to/my_combiner \
-file /path/to/my_reducer
lib/Hadoop/Streaming.pm
use strict; use warnings;
use lib qw(/apps/perl5);
use My::Example::Job;
My::Example::Job::Mapper->run();
* The mapper/reducer/combiner files can be included with the job via -file options to hadoop jar or they can be referenced directly if they are in the shared environment.
=item full path of shared file
hadoop \
jar $streaming_jar_name \
-input my_input_file \
-output my_output_hdfs_path \
-mapper /apps/perl5/bin/my_mapper \
-combiner /apps/perl5/bin/my_combiner \
-reducer /apps/perl5/bin/my_reducer
=item local path of included -file file
hadoop \
jar $streaming_jar_name \
-input my_input_file \
-output my_output_hdfs_path \
-file /apps/perl5/bin/my_mapper \
-file /apps/perl5/bin/my_combiner \
-file /apps/perl5/bin/my_reducer \
-mapper ./my_mapper \
-combiner ./my_combiner \
-reducer ./my_reducer
=back
lib/Hadoop/Streaming.pm
=item PAR::Packer / pp
Deprecated.
Use pp (installed via PAR::Packer) to produce a perl file that needs only a perl interpreter to execute. I use the -x option to run the my_mapper script on blank input, as this forces all of the necessary modules to be loaded and thus tracked in my PAR ...
mkdir packed
pp my_mapper -B -P -Ilib -o packed/my_mapper -x my_mapper < /dev/null
hadoop \
jar $streaming_jar_name \
-input my_input_file \
-output my_output_hdfs_path \
-file packed/my_mapper \
-mapper ./my_mapper
To simplify this process and reduce errors, I use make to produce the packed binaries. Recipe lines after each "name :" target line must be indented with a literal tab, as Makefiles require.
#Makefile for PAR packed apps
PERLTOPACK = \
region-dma-mapper.pl \
lib/Hadoop/Streaming/Mapper.pm
Package->run();
This method starts the Hadoop::Streaming::Mapper instance.
After creating a new object instance, it reads from STDIN and calls
$object->map() on each line of input. Subclasses need only implement map()
to produce a complete Hadoop Streaming compatible mapper.
=head1 INTERFACE DETAILS
The default inputformat for streaming jobs is TextInputFormat, which returns lines without keys in the streaming context. Because of this, map() is not given a key/value pair; it receives only the value (the input line).
If you change your jar options to use a different JavaClassName as the inputformat, you may need to deal with both key and value. TBD.
quoting from: http://hadoop.apache.org/common/docs/r0.20.2/streaming.html#Specifying+Other+Plugins+for+Jobs
=over 4
Specifying Other Plugins for Jobs
Just as with a normal Map/Reduce job, you can specify other plugins for a streaming job:
-inputformat JavaClassName
-outputformat JavaClassName
-partitioner JavaClassName
-combiner JavaClassName
The class you supply for the input format should return key/value pairs of Text class. If you do not specify an input format class, the TextInputFormat is used as the default. Since the TextInputFormat returns keys of LongWritable class, which are actually not part of the input data, the keys will be discarded; only the values will be piped to the streaming mapper.
The class you supply for the output format is expected to take key/value pairs of Text class. If you do not specify an output format class, the TextOutputFormat is used as the default.
=back
=head1 AUTHORS
=over 4
=item *