Net-Amazon-EMR
    my $emr = Net::Amazon::EMR->new(
        AWSAccessKeyId  => $AWS_ACCESS_KEY_ID,
        SecretAccessKey => $SECRET_ACCESS_KEY,
        ssl             => 1,
    );
    # start a job flow
    my $id = $emr->run_job_flow(
        Name      => "Example Job",
        Instances => {
            Ec2KeyName                  => 'myKeyId',
            InstanceCount               => 10,
            KeepJobFlowAliveWhenNoSteps => 1,
            MasterInstanceType          => 'm1.small',
            Placement                   => { AvailabilityZone => 'us-east-1a' },
            SlaveInstanceType           => 'm1.small',
        },
        BootstrapActions => [ {
            Name                  => 'Bootstrap-configure',
            ScriptBootstrapAction => {
                Path => 's3://elasticmapreduce/bootstrap-actions/configure-hadoop',
                Args => [ '-m', 'mapred.compress.map.output=true' ],
            },
        } ],
        Steps => [ {
            ActionOnFailure => 'TERMINATE_JOB_FLOWS',
            Name            => "Set up debugging",
            HadoopJarStep   => {
                Jar  => 's3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar',
                Args => [ 's3://us-east-1.elasticmapreduce/libs/state-pusher/0.1/fetch' ],
            },
        } ],
    );

    print "Job flow id = " . $id->JobFlowId . "\n";
    # Get details of the just-launched job
    my $result = $emr->describe_job_flows(JobFlowIds => [ $id->JobFlowId ]);

    # or get details of all jobs created after a given time
    $result = $emr->describe_job_flows(CreatedAfter => '2012-12-17T07:19:57Z');

    # or use DateTime
    $result = $emr->describe_job_flows(
        CreatedAfter => DateTime->new(year => 2012, month => 12, day => 17));

    # See the details of the typed result
    use Data::Dumper; print Dumper($result);

    # or dispense with types and see the details as a plain Perl hash
    use Data::Dumper; print Dumper($result->as_hash);
    # Flexible booleans - 1, 0, undef, 'true' and 'false' are all accepted
    $emr->set_visible_to_all_users(JobFlowIds => [ $id->JobFlowId ],
                                   VisibleToAllUsers => 1);
    $emr->set_termination_protection(JobFlowIds => [ $id->JobFlowId ],
                                     TerminationProtected => 'false');
    # Add map-reduce steps and execute
    $emr->add_job_flow_steps(
        JobFlowId => $job_id,
        Steps     => [ {
            ActionOnFailure => 'CANCEL_AND_WAIT',
            Name            => "Example",
            HadoopJarStep   => {
                Jar  => '/home/hadoop/contrib/streaming/hadoop-streaming.jar',
                Args => [ '-input',   's3://my-bucket/my-input',
                          '-output',  's3://my-bucket/my-output',
                          '-mapper',  '/path/to/mapper-script',
                          '-reducer', '/path/to/reducer-script',
                ],
                Properties => [ { Key   => 'reduce_tasks_speculative_execution',
                                  Value => 'false' } ],
            },
        }, ... ] );
=head1 DESCRIPTION
This is an implementation of the Amazon Elastic Map-Reduce API.
=head1 CONSTRUCTOR
=head2 new(%options)
This is the constructor. Options are as follows:
=over 4
=item * AWSAccessKeyId (required)
Your AWS access key.
=item * SecretAccessKey (required)
Your secret key.
=item * base_url (optional)
The base URL for your chosen Amazon region; see L<http://docs.aws.amazon.com/general/latest/gr/rande.html#emr_region>. If not specified, the default URL is used (which implies region us-east-1).
    my $emr = Net::Amazon::EMR->new(
        AWSAccessKeyId  => $AWS_ACCESS_KEY_ID,
        SecretAccessKey => $SECRET_ACCESS_KEY,
        base_url        => 'https://elasticmapreduce.us-west-2.amazonaws.com',
    );
=item * ssl (optional)
If set to a true value, the default base_url will use https:// instead of http://. Defaults to true.
The ssl flag is not used if base_url is set explicitly.
=item * max_failures (optional)
Number of times to retry if a communications failure occurs, before raising an exception. Defaults to 5.
=back
=head1 METHODS
Detailed information on each of the methods can be found in the Amazon EMR API documentation. Each method takes a hash of parameters using the names given in the documentation. Parameter passing uses the following rules:
=over 4
=item * Array inputs such as InstanceGroups.member.N use their primary name and a Perl ArrayRef, i.e. InstanceGroups => [ ... ] in this example.
    log4perl.rootLogger = DEBUG, Screen, Logfile
    log4perl.appender.Logfile = Log::Log4perl::Appender::File
    log4perl.appender.Logfile.filename = debug.log
    log4perl.appender.Logfile.layout = Log::Log4perl::Layout::PatternLayout
    log4perl.appender.Logfile.layout.ConversionPattern = "%d %-5p %c - %m%n"
    log4perl.appender.Screen = Log::Log4perl::Appender::ScreenColoredLevels
    log4perl.appender.Screen.stderr = 1
    log4perl.appender.Screen.layout = Log::Log4perl::Layout::PatternLayout
    log4perl.appender.Screen.layout.ConversionPattern = "[%d] [%p] %c %m%n"
</log4perl>
=head2 Logging Verbosity
At DEBUG level, the output can be very lengthy. To see only important messages from Net::Amazon::EMR whilst debugging other parts of your code, you can raise the threshold just for Net::Amazon::EMR by adding the following to your Log4perl configuration:

    log4perl.logger.Net.Amazon.EMR = WARN
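The configuration above only takes effect once Log4perl has been initialised in the calling program. A minimal sketch, assuming the configuration is saved in a file named C<log4perl.conf> (the filename here is an arbitrary choice for illustration):

```perl
use Log::Log4perl;
use Net::Amazon::EMR;

# Initialise logging before constructing Net::Amazon::EMR so that its
# messages are routed according to the configuration file.
Log::Log4perl->init('log4perl.conf');

my $emr = Net::Amazon::EMR->new(
    AWSAccessKeyId  => $AWS_ACCESS_KEY_ID,
    SecretAccessKey => $SECRET_ACCESS_KEY,
);
```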
=head1 Map-Reduce Notes
This is somewhat beyond the scope of the documentation for using Net::Amazon::EMR. Nevertheless, here are a few notes about using EMR with Perl.
=head2 Installing Perl Libraries
Undoubtedly, to run any serious processing, you will need to install additional libraries on the map-reduce servers. A practical way to do this is to pre-configure all of the libraries using L<local::lib> and use a bootstrap action to install them when each instance starts:
=over 4
=item * Start an interactive EMR job on a single instance, using the same machine architecture (e.g. m1.large) that you plan to use for running your jobs.

=item * ssh to the instance.

=item * Set up CPAN, then fetch and install L<local::lib>.

=item * Set up .bashrc to contain the environment variables required to use L<local::lib>.

=item * Install all of the other modules you need via cpan.

=item * Clean up files from .cpan that you don't need, such as build and source directories.

=item * Create a tar file, e.g. tar cfz local-perl5.tar.gz perl5 .cpan .bashrc

=item * Copy the tar file to your bucket on S3.

=item * Set up a bootstrap script to copy the tar file back from S3 and untar it into the hadoop home directory, e.g.
    #!/bin/bash
    set -e
    bucket=mybucketname
    tarfile=local-perl5.tar.gz
    arch=large
    cd $HOME
    hadoop fs -get s3://$bucket/$arch/$tarfile .
    tar xfz $tarfile
=item * Put the bootstrap script on S3 and use it when creating a new job flow.
=back
=head2 Mappers and Reducers
Assuming the reader is familiar with the basic principles of map-reduce, in terms of implementation in Perl with hadoop-streaming.jar, a mapper/reducer is simply a script that reads from STDIN and writes to STDOUT, typically line by line using a tab-separated key/value format, e.g.
    while (my $line = <>) {
        chomp $line;
        my ($key, $value) = split /\t/, $line;
        # ... do something with key and value
        print "$newkey\t$newvalue\n";
    }
Scripts can be uploaded to S3 using the web interface, or placed in the bootstrap bundle described above, or uploaded to the master instance using scp and distributed using the hadoop-streaming.jar -file option, or no doubt by many other mechanisms. ...
    Args => [ '-mapper', '"perl -e MyClass->new->mapper"', ... ]
=head1 AUTHOR
Jon Schutz
L<http://notes.jschutz.net>
=head1 BUGS
Please report any bugs or feature requests to C<bug-net-amazon-emr at rt.cpan.org>, or through
the web interface at L<http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Net-Amazon-EMR>. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.
=head1 SUPPORT
You can find documentation for this module with the perldoc command.
    perldoc Net::Amazon::EMR
You can also look for information at:
=over 4
=item * RT: CPAN's request tracker (report bugs here)
L<http://rt.cpan.org/NoAuth/Bugs.html?Dist=Net-Amazon-EMR>
=item * AnnoCPAN: Annotated CPAN documentation
L<http://annocpan.org/dist/Net-Amazon-EMR>
=item * CPAN Ratings
L<http://cpanratings.perl.org/d/Net-Amazon-EMR>
=item * Search CPAN
L<http://search.cpan.org/dist/Net-Amazon-EMR/>
=back
=head1 ACKNOWLEDGEMENTS
The core interface code was adapted from L<Net::Amazon::EC2>.
=head1 LICENSE AND COPYRIGHT
Copyright 2012 Jon Schutz.
This program is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.
See http://dev.perl.org/licenses/ for more information.
=head1 SEE ALSO