App-BloomUtils


script/bloomgen

);

$cmdline->run;

# ABSTRACT: Shorter alias for gen-bloom-filter
# PODNAME: bloomgen

__END__

=pod

=encoding UTF-8

=head1 NAME

bloomgen - Shorter alias for gen-bloom-filter

=head1 VERSION

This document describes version 0.007 of bloomgen (from Perl distribution App-BloomUtils), released on 2020-05-24.

=head1 SYNOPSIS

Usage:

 % bloomgen [--debug] [--false-positive-rate=s] [--fp-rate=s] [-k=s]
     [--log-level=level] [-m=s] [-n=s] [--num-bits=s] [--num-hashes=s]
     [--num-items=s] [-p=s] [--page-result[=program]] [--quiet] [--trace]
     [--verbose]

Examples:

Create a bloom filter for 100k items and 0.1% maximum false-positive rate (actual bloom size and false-positive rate will be shown on stderr):

 % bloomgen --num-items 100000 --fp-rate 0.1%

=head1 DESCRIPTION

Supply lines of text on STDIN and the bloom filter bits will be written to
STDOUT. You can customize C<num_bits> (C<m>) and C<num_hashes> (C<k>) directly or,
more easily, specify C<num_items> and C<fp_rate>. Some rules of thumb to remember:

=over

=item * One byte per item in the input set gives about a 2% false positive rate. So if
you expect to have 1024 elements, a 1KB bloom filter gives about a 2%
false positive rate. For other false positive rates:

 10%    -  4.8 bits per item
  1%    -  9.6 bits per item
  0.1%  - 14.4 bits per item
  0.01% - 19.2 bits per item

=item * The optimal number of hash functions is 0.7 times the number of bits per
item. Note that the number of hashes dominates performance. If you want higher
performance, pick a smaller number of hashes. But for most cases, use the
optimal number of hash functions.

=item * What is an acceptable false positive rate? This depends on your needs. 1% (1
in 100) or 0.1% (1 in 1,000) is a good start. If you want to make sure that a
user's chosen password is not in a known wordlist, a higher false positive
rate will annoy your users more by rejecting their passwords more often, while
a lower false positive rate will require more memory.

=back

Ref: https://corte.si/posts/code/bloom-filter-rules-of-thumb/index.html
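
The rules of thumb above follow from the standard bloom filter sizing formulas.
The sketch below is only an illustration of those formulas; the subroutine names
are hypothetical and are not part of App-BloomUtils, and the tool itself may
compute its parameters differently.

```perl
use strict;
use warnings;

# NOTE: hypothetical helper names, for illustration only.

# Bits per item needed for a target false positive rate p:
#   m/n = -ln(p) / (ln 2)^2
sub bits_per_item {
    my ($p) = @_;
    return -log($p) / (log(2) ** 2);
}

# Optimal number of hash functions for a given bits-per-item ratio:
#   k = (m/n) * ln 2, i.e. about 0.7 times the bits per item
sub optimal_num_hashes {
    my ($bits_per_item) = @_;
    return $bits_per_item * log(2);
}

# Expected false positive rate for m bits, n items and k hashes:
#   p = (1 - exp(-k*n/m)) ** k
sub fp_rate {
    my ($m, $n, $k) = @_;
    return (1 - exp(-$k * $n / $m)) ** $k;
}

# Print the bits-per-item table and the matching optimal k.
for my $p (0.1, 0.01, 0.001, 0.0001) {
    my $bpi = bits_per_item($p);
    printf "p=%-7g %5.1f bits/item, k ~ %.1f\n",
        $p, $bpi, optimal_num_hashes($bpi);
}
```

Calling C<fp_rate(8, 1, 6)> (one byte per item, six hashes) gives a value close
to the 2% quoted in the first rule of thumb.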

B<FAQ>

=over

=item * Why do two different false positive rates (e.g. 1% and 0.1%) give the same bloom filter size?

The parameter C<m> is rounded up to the next power of 2 (e.g. 1024*8 bits
stays 1024*8 bits, but 1025*8 bits becomes 2048*8 bits), so two false positive
rates that require different values of C<m> can end up rounded to the same
value. Use the C<bloom_filter_calculator> routine to see the C<actual_m> and
C<actual_p> (actual false-positive rate).

=back
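
The rounding effect described in the FAQ can be reproduced with a few lines of
Perl. This is a minimal sketch under the assumption that C<m> is first computed
with the standard formula and then rounded up to a power of 2; C<next_pow2> and
C<required_bits> are hypothetical names, not routines from this distribution.

```perl
use strict;
use warnings;

# NOTE: hypothetical helpers, for illustration only.

# Round a bit count up to the next power of 2 (exact powers are unchanged).
sub next_pow2 {
    my ($m) = @_;
    my $pow = 1;
    $pow *= 2 while $pow < $m;
    return $pow;
}

# Standard sizing formula: m = -n * ln(p) / (ln 2)^2
sub required_bits {
    my ($n, $p) = @_;
    return -$n * log($p) / (log(2) ** 2);
}

# For 1000 items, 1% and 0.1% need different m, but both values fall
# between 8192 and 16384 bits, so they round up to the same size.
my $n = 1000;
for my $p (0.01, 0.001) {
    my $m = required_bits($n, $p);
    printf "p=%g: m=%.0f bits, rounded up to %d bits\n",
        $p, $m, next_pow2($m);
}
```

Because the rounded filter is larger than strictly required, the actual false
positive rate is lower than the requested one.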

=head1 OPTIONS

C<*> marks required options.

=head2 Main options

=over

=item B<--false-positive-rate>=I<s>, B<-p>, B<--fp-rate>

=item B<--num-bits>=I<s>, B<-m>

The default is 16384*8 bits (generates a ~16KB bloom filter). If you supply 16k
items (meaning 1 byte per item), the false positive rate will be ~2%. If you
supply fewer items, the false positive rate will be lower; if you supply more
than 16k items, it will be higher.


=item B<--num-hashes>=I<s>, B<-k>

=item B<--num-items>=I<s>, B<-n>

=back

=head2 Logging options

=over

=item B<--debug>

Shortcut for --log-level=debug.

=item B<--log-level>=I<s>

Set log level.

=item B<--quiet>

Shortcut for --log-level=error.


