#!/usr/bin/perl -w

#
# dbmapreduce.pm
# Copyright (C) 1991-2024 by John Heidemann <johnh@isi.edu>
#
# This program is distributed under terms of the GNU general
# public license, version 2.  See the file COPYING
# in $dblibdir for details.
#


package Fsdb::Filter::dbmapreduce;

=head1 NAME

dbmapreduce - reduce all input rows with the same key

=head1 SYNOPSIS

    dbmapreduce [-dMS] [-k KeyField] [-f CodeFile] [-C Filtercode] [--] [ReduceCommand [ReduceArguments...]]

=head1 DESCRIPTION

Group input data by KeyField,
then apply a function (the "reducer") to each group.
The reduce function can be an external program
given by ReduceCommand and ReduceArguments,
or a Perl subroutine given in CodeFile or FilterCode.

If a "--" appears before the reduce command,
arguments after the "--" are passed to the command.


=head2 Grouping (The Mapper)

By default the KeyField is the first field in the row.
Unlike Hadoop streaming, the C<-k KeyField> option can explicitly
name the key, which may be any column of the input row.

By default, we sort the data to make sure data is grouped by key.
If the input is already grouped, the C<-S> option avoids this cost.
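
As a rough illustration of the grouping step (not this module's actual
implementation), the sketch below sorts rows by a chosen key column
and then collects adjacent rows with equal keys into groups.
The helper name C<group_by_key>, its C<$presorted> flag (standing in for
the intent of C<-S>), and the sample rows are all hypothetical.

```perl

    use strict;
    use warnings;

    # Hypothetical helper: group rows (array refs) by the value in $key_col.
    # When $presorted is true we skip the sort and trust the input grouping.
    sub group_by_key {
        my ($rows, $key_col, $presorted) = @_;
        my @ordered = $presorted
            ? @$rows
            : sort { $a->[$key_col] cmp $b->[$key_col] } @$rows;
        my @groups;    # list of [key, [rows...]]
        for my $row (@ordered) {
            my $key = $row->[$key_col];
            # Start a new group whenever the key differs from the previous row.
            if (!@groups || $groups[-1][0] ne $key) {
                push @groups, [$key, []];
            }
            push @{ $groups[-1][1] }, $row;
        }
        return \@groups;
    }

    my @sample_rows = (['a', 'x', 1], ['b', 'y', 2], ['c', 'x', 3]);
    my $groups = group_by_key(\@sample_rows, 1, 0);    # key is the second column
    printf "%s: %d row(s)\n", $_->[0], scalar @{ $_->[1] } for @$groups;
    # prints "x: 2 row(s)" then "y: 1 row(s)"

```

If the sort is skipped on input that is not actually grouped, this sketch
produces more than one group for the same key, which is why skipping the
sort is only safe for input that is already grouped.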


=head2 The Reducer

Reduce functions default to shell commands.
However, with C<-C>, one can use arbitrary Perl code
(see the C<-C> option below for details).
The C<-f> option is useful to specify complex Perl code
somewhere other than the command line.

Finally, as a special case, if there are no rows of input,
the reducer will be invoked once with the empty value (if it's an external 
reducer) or with undef (if it's a subroutine).
It is expected to generate the output header,
and it may generate no data rows itself, or a null data row
of its choosing.
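
The empty-input case can be sketched as follows.
This is only an illustration under assumed calling conventions:
the driver C<run_reducer>, the reducer factory C<make_count_reducer>,
and the placeholder header line are hypothetical and are not the
interface that C<-C> code actually sees.

```perl

    use strict;
    use warnings;

    # Hypothetical reducer factory: the reducer counts rows per key and
    # appends its output lines to the array referenced by $out.
    sub make_count_reducer {
        my ($out) = @_;
        my $header_done = 0;
        return sub {
            my ($key, $rows) = @_;
            # Emit a (placeholder) header exactly once, even with no input.
            push @$out, "key\tcount" unless $header_done++;
            return unless defined $key;    # empty input: header only
            push @$out, join("\t", $key, scalar @$rows);
            return;
        };
    }

    # Hypothetical driver: one call per group, or a single call with undef
    # when there are no groups at all.
    sub run_reducer {
        my ($reducer, $groups) = @_;
        if (!@$groups) {
            $reducer->(undef, []);
            return;
        }
        $reducer->($_->[0], $_->[1]) for @$groups;
        return;
    }

    my (@empty_out, @count_out);
    run_reducer(make_count_reducer(\@empty_out), []);
    run_reducer(make_count_reducer(\@count_out),
        [['x', [[1], [2]]], ['y', [[3]]]]);
    print "$_\n" for @count_out;

```

In this sketch the empty run leaves only the header line in C<@empty_out>,
while the second run yields the header followed by one line per key.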

=head2 Output

For non-multi-key-aware reducers,
we make sure that the KeyField used for each Reduce
is in the output stream.
(If the reducer passes the key we trust that it gives a correct value.)
We also ensure that the output field separator is the
same as the input field separator.

Adding the key and adjusting the output field separator
is not possible for
multi-key-aware reducers.
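
A minimal sketch of the key-adding step for a reducer that handles one
key at a time is shown here.
The helper C<add_key_if_missing>, the column names, and the choice to
place the key in the first column are assumptions made for illustration.

```perl

    use strict;
    use warnings;

    # Hypothetical post-processing step: make sure every output row carries
    # the key.  $out_cols lists the reducer's output column names and
    # $rows holds its output rows as array refs.
    sub add_key_if_missing {
        my ($key_name, $key_value, $out_cols, $rows) = @_;
        # If the reducer already emits the key column, trust its values.
        return ($out_cols, $rows) if grep { $_ eq $key_name } @$out_cols;
        my @cols  = ($key_name, @$out_cols);
        my @fixed = map { [$key_value, @$_] } @$rows;
        return (\@cols, \@fixed);
    }

    my ($cols, $fixed) = add_key_if_missing('host', 'alpha', ['mean'], [[3.5]]);
    my $fs = "\t";    # write output with the same separator as the input
    print join($fs, @$cols), "\n";
    print join($fs, @$_), "\n" for @$fixed;

```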


=head2 Comparison to Related Work

This program thus implements Google-style map/reduce,
but executed sequentially.

For input, these systems include a map function and apply it to input data
to generate the key.
We assume this key generation (the map function)
has occurred ahead of time.

We also allow the grouping key to be in any column.  
Hadoop Streaming requires it to be in the first column.

By default, the reducer gets exactly (and only) one key.
This invariant is stronger than that of Google and Hadoop.
They both pass multiple keys to the
reducer, ensuring that each key is grouped together.
With the C<-M> option, we also pass multiple groups to the reducer.
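
To illustrate what being multi-key-aware asks of a reducer, the sketch
below handles several groups in one pass by watching for key changes.
It relies only on rows with equal keys being adjacent, not on any
sort order.
The function C<sum_by_adjacent_key> and its two-column row layout are
hypothetical.

```perl

    use strict;
    use warnings;

    # Hypothetical multi-key-aware reducer: sums the value column per key.
    # Each row is [key, value]; rows with the same key must be adjacent.
    sub sum_by_adjacent_key {
        my ($rows) = @_;
        my @result;
        my $cur_key;
        my $sum = 0;
        for my $row (@$rows) {
            my ($key, $value) = @$row;
            if (defined $cur_key && $key ne $cur_key) {
                push @result, [$cur_key, $sum];    # key changed: flush the group
                $sum = 0;
            }
            $cur_key = $key;
            $sum += $value;
        }
        push @result, [$cur_key, $sum] if defined $cur_key;    # flush last group
        return \@result;
    }

    my $sums = sum_by_adjacent_key([['b', 1], ['b', 2], ['a', 5]]);
    print join(" ", map { "$_->[0]=$_->[1]" } @$sums), "\n";
    # prints "b=3 a=5"

```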

Unlike those systems, with the C<-S> option
we do not require that the groups arrive in any particular
order, just that they be grouped together.
(Those systems guarantee that groups arrive in lexically sorted order.)
However, without C<-S> we sort the input and so create lexical ordering.