Hadoop-Streaming


META.json

            "Test::More" : "0",
            "strict" : "0",
            "warnings" : "0"
         }
      }
   },
   "release_status" : "stable",
   "resources" : {
      "repository" : {
         "type" : "git",
         "url" : "git://github.com/spazm/hadoop-streaming-frontend",
         "web" : "http://github.com/spazm/hadoop-streaming-frontend"
      }
   },
   "version" : "0.143060"
}

META.yml

  url: http://module-build.sourceforge.net/META-spec-v1.4.html
  version: '1.4'
name: Hadoop-Streaming
requires:
  IO::Handle: '0'
  Moo: '0'
  Moo::Role: '0'
  Params::Validate: '0'
  Safe::Isa: '0'
resources:
  repository: git://github.com/spazm/hadoop-streaming-frontend
version: '0.143060'

dist.ini

[@Git]
changelog   = Changes             ; this is the default
allow_dirty = dist.ini            ; see Git::Check...
allow_dirty = Changes             ; ... and Git::Commit
commit_msg  = v%v%n%n%c           ; see Git::Commit
tag_format  = v%v                 ; see Git::Tag
tag_message = %v                  ; see Git::Tag
push_to     = origin              ; see Git::Push

[MetaResources]
repository.web  = http://github.com/spazm/hadoop-streaming-frontend

examples/wordcount/example_hadoop.sh

#!/bin/sh -x

#1) copy the input file to your hadoop dfs
#2) run the hadoop streaming job (using the 0.20 file layout)

hadoop dfs -copyFromLocal input ./

hadoop                     \
    jar /usr/lib/hadoop/contrib/streaming/hadoop-0.20.1+152-streaming.jar \
    -input   input          \
    -output  myoutput       \
    -mapper  map.pl         \
    -reducer reduce.pl      \
    -file    map.pl         \
    -file    reduce.pl 
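
For reference, a minimal reduce.pl for this wordcount example could be sketched on top of the Hadoop::Streaming::Reducer role; the emit() helper and the has_next()/next() value iterator are assumptions taken from the distribution's synopsis, and the package name is illustrative:

  #!/usr/bin/env perl
  package Wordcount::Reducer;
  use Moo;
  with 'Hadoop::Streaming::Reducer';

  # reduce() is called once per key with an iterator over that key's values.
  sub reduce {
      my ( $self, $key, $values ) = @_;
      my $count = 0;
      while ( $values->has_next ) {
          $count++;          # each value is a 1 emitted by the mapper
          $values->next;     # advance the iterator
      }
      $self->emit( $key => $count );
  }

  package main;
  Wordcount::Reducer->run();

map.pl follows the same pattern with the Mapper role; see the example under lib/Hadoop/Streaming/Mapper.pm below.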

lib/Hadoop/Streaming.pm

=item Map/Reduce at wikipedia

http://en.wikipedia.org/wiki/MapReduce

=item Hadoop

http://hadoop.apache.org

=item Hadoop Streaming interface

http://hadoop.apache.org/common/docs/r0.20.1/streaming.html

=item PAR::Packer

http://search.cpan.org/perldoc?PAR::Packer

=back

=head1 EXAMPLES

=over 4

lib/Hadoop/Streaming.pm

  my_mapper < test_input_file > output.map        && \
  sort output.map > output.mapsort                && \
  my_combiner < output.mapsort > output.combine   && \
  my_reducer < output.combine > output.reduce

=item hadoop commandline

Run this in hadoop from the shell:

  hadoop                                     \
      jar $streaming_jar_name                \
      -D mapred.job.name="my hadoop example" \
      -input    my_input_file                \
      -output   my_output_hdfs_path          \
      -mapper   my_mapper                    \
      -combiner my_combiner                  \
      -reducer  my_reducer

$streaming_jar_name is the full path to the streaming jar provided by your hadoop installation.  For my 0.20 install, the path is:

  /usr/lib/hadoop-0.20/contrib/streaming/hadoop-0.20.1+152-streaming.jar

The -D line is optional.  If included, -D lines must come directly after the jar name and before other options.

For this hadoop job to work, the mapper, combiner and reducer must be full paths that are valid on each box in the hadoop cluster.  There are a few ways to make this work.

=item hadoop jar -file option

Additional files may be bundled into the hadoop jar via the '-file' option to hadoop jar.  These files will be included in the jar that is distributed to each host.  The files will be visible in the current working directory of the process.  Subdire...

Example:

  hadoop                                     \
      jar $streaming_jar_name                \
      -D mapred.job.name="my hadoop example" \
      -input    my_input_file                \
      -output   my_output_hdfs_path          \
      -mapper   my_mapper                    \
      -combiner my_combiner                  \
      -reducer  my_reducer                   \
      -file     /path/to/my_mapper           \
      -file     /path/to/my_combiner         \
      -file     /path/to/my_reducer

lib/Hadoop/Streaming.pm  view on Meta::CPAN

  use strict; use warnings;
  use lib qw(/apps/perl5);          # shared perl library path on every node
  use My::Example::Job;             # loads the file defining My::Example::Job::Mapper
  My::Example::Job::Mapper->run();

The mapper/reducer/combiner files can be included with the job via -file options to hadoop jar, or they can be referenced directly if they are available in the shared environment.

=item full path of shared file

  hadoop                                     \
      jar $streaming_jar_name                \
      -input    my_input_file                \
      -output   my_output_hdfs_path          \
      -mapper   /apps/perl5/bin/my_mapper    \
      -combiner /apps/perl5/bin/my_combiner  \
      -reducer  /apps/perl5/bin/my_reducer

=item local path of a file included via -file

  hadoop                                     \
      jar $streaming_jar_name                \
      -input    my_input_file                \
      -output   my_output_hdfs_path          \
      -file     /apps/perl5/bin/my_mapper    \
      -file     /apps/perl5/bin/my_combiner  \
      -file     /apps/perl5/bin/my_reducer   \
      -mapper   ./my_mapper                  \
      -combiner ./my_combiner                \
      -reducer  ./my_reducer

=back

lib/Hadoop/Streaming.pm


=item PAR::Packer / pp

Deprecated.

Use pp (installed via PAR::Packer) to produce a perl file that needs only a perl interpreter to execute.  I use the -x option to run the my_mapper script on blank input, as this forces all of the necessary modules to be loaded and thus tracked in my PAR archive.

  mkdir packed
  pp my_mapper -B -P -Ilib -o packed/my_mapper -x my_mapper < /dev/null
  hadoop                                     \
      jar $streaming_jar_name                \
      -input    my_input_file                \
      -output   my_output_hdfs_path          \
      -file     packed/my_mapper             \
      -mapper   ./my_mapper

To simplify this process and reduce errors, I use make to produce the packed binaries.  The recipe lines under each "name :" target are indented with a literal tab, as Makefiles require.

    #Makefile for PAR packed apps
    PERLTOPACK     =                              \
        region-dma-mapper.pl                      \

lib/Hadoop/Streaming/Mapper.pm

    Package->run();

This method starts the Hadoop::Streaming::Mapper instance.  

After creating a new object instance, it reads from STDIN and calls 
$object->map() on each line of input.  Subclasses need only implement map() 
to produce a complete Hadoop Streaming compatible mapper.
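
For example, a minimal word-count mapper built on this interface might look like the following sketch; the Moo role composition and the emit() helper are assumptions taken from the distribution's synopsis, and the package name is illustrative:

  #!/usr/bin/env perl
  package Wordcount::Mapper;
  use Moo;
  with 'Hadoop::Streaming::Mapper';

  # map() is called once per input line; emit a (word, 1) pair per token.
  sub map {
      my ( $self, $line ) = @_;
      $self->emit( $_ => 1 ) for split /\s+/, $line;
  }

  package main;
  Wordcount::Mapper->run();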

=head1 INTERFACE DETAILS

The default inputformat for streaming jobs is TextInputFormat, which returns lines without keys in the streaming context.  Because of this, map() is not given a key/value pair; instead it receives only the value (the input line).

If you change your jar options to use a different JavaClassName as the inputformat, you may need to deal with the key and value yourself; see the sketch after the quoted documentation below. TBD.

Quoting from: http://hadoop.apache.org/common/docs/r0.20.2/streaming.html#Specifying+Other+Plugins+for+Jobs

=over 4
Specifying Other Plugins for Jobs

Just as with a normal Map/Reduce job, you can specify other plugins for a streaming job:

   -inputformat JavaClassName
   -outputformat JavaClassName
   -partitioner JavaClassName
   -combiner JavaClassName

The class you supply for the input format should return key/value pairs of Text class. If you do not specify an input format class, the TextInputFormat is used as the default. Since the TextInputFormat returns keys of LongWritable class, which are actually not part of the input data, the keys will be discarded; only the values will be piped to the streaming mapper.

The class you supply for the output format is expected to take key/value pairs of Text class. If you do not specify an output format class, the TextOutputFormat is used as the default. 

=back
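
As an illustration only (not part of the distribution's current API): if an alternate inputformat hands the streaming mapper lines of the form "key<TAB>value", the mapper still receives plain text and can split the pair itself. A hypothetical map() inside a Hadoop::Streaming::Mapper subclass might read:

  # Hypothetical: handle keyed input lines of the form "key\tvalue".
  sub map {
      my ( $self, $line ) = @_;
      my ( $key, $value ) = split /\t/, $line, 2;
      return unless defined $value;      # skip keyless/malformed lines
      $self->emit( $key => $value );     # pass the pair through unchanged
  }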

=head1 AUTHORS

=over 4

=item *


