Algorithm-Numerical-Sample
view release on metacpan or search on metacpan
lib/Algorithm/Numerical/Sample.pm view on Meta::CPAN
}
sub data {
my $self = shift;
foreach my $sample (@_) {
if ($self -> {seen} < $self -> {sample_size}) {
# Initialize reservoir.
$self -> {reservoir} -> [$self -> {seen}] =
[$self -> {seen}, $sample];
}
else {
# Draw number.
my $U = int rand ($self -> {seen} + 1);
if ($U < $self -> {sample_size}) {
$self -> {reservoir} -> [$U] = [$self -> {seen}, $sample];
}
}
$self -> {seen} ++;
}
return;
}
sub extract {
my $self = shift;
my @result = map {$_ -> [1]}
sort {$a -> [0] <=> $b -> [0]} @{$self -> {reservoir}};
$self -> {seen} = 0;
$self -> {reservoir} = [(undef) x $self -> {sample_size}];
wantarray ? @result : $result [0];
}
__END__
=head1 NAME
Algorithm::Numerical::Sample - Draw samples from a set
=head1 SYNOPSIS
use Algorithm::Numerical::Sample qw /sample/;
@sample = sample (-set => [1 .. 10000],
-sample_size => 100);
$sampler = Algorithm::Numerical::Sample::Stream -> new;
while (<>) {$sampler -> data ($_)}
$random_line = $sampler -> extract;
=head1 DESCRIPTION
This package gives two methods to draw fair, random samples from a set.
There is a procedural interface for the case the entire set is known,
and an object oriented interface when the a set with unknown size has
to be processed.
=head2 B<A>: C<sample (set =E<gt> ARRAYREF [,sample_size =E<gt> EXPR])>
The C<sample> function takes a set and a sample size as arguments.
If the sample size is omitted, a sample of C<1> is taken. The keywords
C<set> and C<sample_size> may be preceeded with an optional C<->.
The function returns the sample list, or a reference to the sample
list, depending on the context.
=head2 B<B>: C<Algorithm::Numerical::Sample::Stream>
The class C<Algorithm::Numerical::Sample::Stream> has the following
methods:
=over
=item C<new>
This function returns an object of the
C<Algorithm::Numerical::Sample::Stream> class.
It will take an optional argument of the form
C<sample_size =E<gt> EXPR>, where C<EXPR> evaluates to the
sample size to be taken. If this argument is missing,
a sample of size C<1> will be taken.
The keyword C<sample_size> may be preceeded by an optional dash.
=item C<data (LIST)>
The method C<data> takes a list of parameters which are elements
of the set we are sampling. Any number of arguments can be given.
=item C<extract>
This method will extract the sample from the object, and reset it
to a fresh state, such that a sample of the same size but from a
different set, can be taken. C<extract> will return a list in list
context, or the first element of the sample in scalar context.
=back
=head1 CORRECTNESS PROOFS
=head2 Algorithm A.
Crucial to see that the C<sample> algorithm is correct is the
fact that when we sample C<n> elements from a set of size C<N>
that the C<t + 1>st element is choosen with probability
C<(n - m)/(N - t)>, when already C<m> elements have been
choosen. We can immediately see that we will never pick too
many elements (as the probability is 0 as soon as C<n == m>),
nor too few, as the probability will be 1 if we have C<k>
elements to choose from the remaining C<k> elements, for some
C<k>. For the proof that the sampling is unbiased, we refer to [3].
(Section 3.4.2, Exercise 3).
=head2 Algorithm B.
It is easy to see that the second algorithm returns the correct
number of elements. For a sample of size C<n>, the first C<n>
elements go into the reservoir, and after that, the reservoir
( run in 1.083 second using v1.01-cache-2.11-cpan-efa8479b9fe )