process results from the CPAN

process
Algorithm-Numerical-Sample
view release on metacpan or search on metacpan
lib/Algorithm/Numerical/Sample.pm view on Meta::CPAN
}

sub data {
    my $self   = shift;

    foreach my $sample (@_) {
        if ($self -> {seen} < $self -> {sample_size}) {
            # Initialize reservoir.
            $self -> {reservoir} -> [$self -> {seen}] =
                                    [$self -> {seen}, $sample];
        }
        else {
            # Draw number.
            my $U = int rand ($self -> {seen} + 1);
            if ($U < $self -> {sample_size}) {
                $self -> {reservoir} -> [$U] = [$self -> {seen}, $sample];
            }
        }

        $self -> {seen} ++;
    }

    return;
}

sub extract {
    my $self = shift;

    my @result = map {$_ -> [1]}
                 sort {$a -> [0] <=> $b -> [0]} @{$self -> {reservoir}};

    $self -> {seen}      = 0;
    $self -> {reservoir} = [(undef) x $self -> {sample_size}];

    wantarray ? @result : $result [0];
}


__END__

=head1 NAME

Algorithm::Numerical::Sample - Draw samples from a set

=head1 SYNOPSIS

    use Algorithm::Numerical::Sample  qw /sample/;

    @sample = sample (-set         => [1 .. 10000],
                      -sample_size => 100);

    $sampler = Algorithm::Numerical::Sample::Stream -> new;
    while (<>) {$sampler -> data ($_)}
    $random_line = $sampler -> extract;

=head1 DESCRIPTION

This package gives two methods to draw fair, random samples from a set.
There is a procedural interface for the case the entire set is known,
and an object oriented interface when the a set with unknown size has
to be processed. 

=head2 B<A>: C<sample (set =E<gt> ARRAYREF [,sample_size =E<gt> EXPR])>

The C<sample> function takes a set and a sample size as arguments.
If the sample size is omitted, a sample of C<1> is taken. The keywords
C<set> and C<sample_size> may be preceeded with an optional C<->.
The function returns the sample list, or a reference to the sample
list, depending on the context.

=head2 B<B>: C<Algorithm::Numerical::Sample::Stream>

The class C<Algorithm::Numerical::Sample::Stream> has the following
methods:

=over

=item C<new>

This function returns an object of the
C<Algorithm::Numerical::Sample::Stream> class.
It will take an optional argument of the form
C<sample_size =E<gt> EXPR>, where C<EXPR> evaluates to the
sample size to be taken. If this argument is missing,
a sample of size C<1> will be taken.
The keyword C<sample_size> may be preceeded by an optional dash.

=item C<data (LIST)>

The method C<data> takes a list of parameters which are elements
of the set we are sampling. Any number of arguments can be given.

=item C<extract>

This method will extract the sample from the object, and reset it
to a fresh state, such that a sample of the same size but from a
different set, can be taken. C<extract> will return a list in list
context, or the first element of the sample in scalar context.

=back

=head1 CORRECTNESS PROOFS

=head2 Algorithm A.

Crucial to see that the C<sample> algorithm is correct is the
fact that when we sample C<n> elements from a set of size C<N>
that the C<t + 1>st element is choosen with probability
C<(n - m)/(N - t)>, when already C<m> elements have been
choosen. We can immediately see that we will never pick too
many elements (as the probability is 0 as soon as C<n == m>),
nor too few, as the probability will be 1 if we have C<k>
elements to choose from the remaining C<k> elements, for some
C<k>. For the proof that the sampling is unbiased, we refer to [3].
(Section 3.4.2, Exercise 3).

=head2 Algorithm B.

It is easy to see that the second algorithm returns the correct
number of elements. For a sample of size C<n>, the first C<n>
elements go into the reservoir, and after that, the reservoir
( run in 0.631 second using v1.01-cache-2.11-cpan-600a1bdf6e4 )