App-BloomUtils

 view release on metacpan or  search on metacpan

lib/App/BloomUtils.pm  view on Meta::CPAN

  in 100) or 0.1% (1 in 1,000) is a good start. If you want to make sure that
  user's chosen password is not in a known wordlist, a higher false positive
  rates will annoy your user more by rejecting her password more often, while
  lower false positive rates will require a higher memory usage.

Ref: https://corte.si/posts/code/bloom-filter-rules-of-thumb/index.html

**FAQ**

* Why does two different false positive rates (e.g. 1% and 0.1%) give the same bloom filter size?

  The parameter `m` is rounded upwards to the nearest power of 2 (e.g. 1024*8
  bits becomes 1024*8 bits but 1025*8 becomes 2048*8 bits), so sometimes two
  false positive rates with different `m` get rounded to the same value of `m`.
  Use the `bloom_filter_calculator` routine to see the `actual_m` and `actual_p`
  (actual false-positive rate).

_

$SPEC{gen_bloom_filter} = {
    v => 1.1,
    summary => 'Generate bloom filter',
    description => $desc1,
    args => {
        num_bits => {
            description => <<'_',

The default is 16384*8 bits (generates a ~16KB bloom filter). If you supply 16k
items (meaning 1 byte per 1 item) then the false positive rate will be ~2%. If
you supply fewer items the false positive rate is smaller and if you supply more
than 16k items the false positive rate will be higher.

_
            schema => 'posint*',
            #default => 8*16384,
            cmdline_aliases => {m=>{}},
        },
        num_hashes => {
            schema => 'posint*',
            cmdline_aliases => {k=>{}},
            #default => 6,
        },
        num_items => {
            schema => 'posint*',
            cmdline_aliases => {n=>{}},
        },
        false_positive_rate => {
            schema => ['float*', max=>0.5],
            cmdline_aliases => {
                fp_rate => {},
                p => {},
            },
        },
    },
    'cmdline.skip_format' => 1,
    args_rels => {
    },
    examples => [
        {
            summary => 'Create a bloom filter for 100k items and 0.1% maximum false-positive rate '.
                '(actual bloom size and false-positive rate will be shown on stderr)',
            argv => [qw/--num-items 100000 --fp-rate 0.1%/],
            'x.doc.show_result' => 0,
            test => 0,
        },
    ],
    links => [
        {url=>'prog:bloom-filter-calculator'},
    ],
};
sub gen_bloom_filter {
    require Algorithm::BloomFilter;

    my %args = @_;

    my $res;
    if (defined $args{num_items}) {
        $res = bloom_filter_calculator(
            num_items => $args{num_items},
            num_bits => $args{num_bits},
            num_hashes => $args{num_hashes},
            false_positive_rate => $args{false_positive_rate},
            num_hashes_to_bits_per_item_ratio => 0.7,
        );
    } else {
        $res = bloom_filter_calculator(
            num_bits => $args{num_bits} // 16384*8,
            num_hashes => $args{num_hashes} // 6,

            num_items => int($args{num_bits} / 8),
        );
    }
    return $res unless $res->[0] == 200;
    my $m = $args{num_bits} // $res->[2]{actual_m};
    my $k = $args{num_hashes} // $res->[2]{actual_k};
    log_info "Will be creating bloom filter with num_bits (m)=%d (actual %d), num_hashes (k)=%d, actual false-positive rate=%.5f%% (when num_items=%d), actual bloom filter size=%d bytes",
        $m, $res->[2]{actual_m}, $k, $res->[2]{actual_p}*100, $res->[2]{n}, $res->[2]{actual_bloom_size};

    my $bf = Algorithm::BloomFilter->new($m, $k);
    my $i = 0;
    while (defined(my $line = <STDIN>)) {
        chomp $line;
        $bf->add($line);
        $i++;
        if (defined $args{num_items} && $i == $args{num_items}+1) {
            log_warn "You created bloom filter for num_items=%d, but now have added more than that", $args{num_items};
        }
    }

    print $bf->serialize;

    [200];
}

$SPEC{check_with_bloom_filter} = {
    v => 1.1,
    summary => 'Check with bloom filter',
    description => <<'_',

You supply the bloom filter in STDIN, items to check as arguments, and this
utility will print lines containing 0 or 1 depending on whether items in the

lib/App/BloomUtils.pm  view on Meta::CPAN

(msg) is a string containing error message, or 'OK' if status is
200. Third element (payload) is optional, the actual result. Fourth
element (meta) is called result metadata and is optional, a hash
that contains extra information.

Return value:  (any)



=head2 check_with_bloom_filter

Usage:

 check_with_bloom_filter(%args) -> [status, msg, payload, meta]

Check with bloom filter.

You supply the bloom filter in STDIN, items to check as arguments, and this
utility will print lines containing 0 or 1 depending on whether items in the
arguments are tested to be, respectively, not in the set (0) or probably in the
set (1).

This function is not exported.

Arguments ('*' denotes required arguments):

=over 4

=item * B<items>* => I<array[str]>

Items to check.


=back

Returns an enveloped result (an array).

First element (status) is an integer containing HTTP status code
(200 means OK, 4xx caller error, 5xx function error). Second element
(msg) is a string containing error message, or 'OK' if status is
200. Third element (payload) is optional, the actual result. Fourth
element (meta) is called result metadata and is optional, a hash
that contains extra information.

Return value:  (any)



=head2 gen_bloom_filter

Usage:

 gen_bloom_filter(%args) -> [status, msg, payload, meta]

Generate bloom filter.

Examples:

=over

=item * Create a bloom filter for 100k items and 0.1% maximum false-positive rate (actual bloom size and false-positive rate will be shown on stderr):

 gen_bloom_filter( false_positive_rate => "0.1%", num_items => 100000);

=back

You supply lines of text from STDIN and it will output the bloom filter bits on
STDOUT. You can also customize C<num_bits> (C<m>) and C<num_hashes> (C<k>), or, more
easily, C<num_items> and C<fp_rate>. Some rules of thumb to remember:

=over

=item * One byte per item in the input set gives about a 2% false positive rate. So if
you expect two have 1024 elements, create a 1KB bloom filter with about 2%
false positive rate. For other false positive rates:

10%    -  4.8 bits per item
 1%    -  9.6 bits per item
 0.1%  - 14.4 bits per item
 0.01% - 19.2 bits per item

=item * Optimal number of hash functions is 0.7 times number of bits per item. Note
that the number of hashes dominate performance. If you want higher
performance, pick a smaller number of hashes. But for most cases, use the the
optimal number of hash functions.

=item * What is an acceptable false positive rate? This depends on your needs. 1% (1
in 100) or 0.1% (1 in 1,000) is a good start. If you want to make sure that
user's chosen password is not in a known wordlist, a higher false positive
rates will annoy your user more by rejecting her password more often, while
lower false positive rates will require a higher memory usage.

=back

Ref: https://corte.si/posts/code/bloom-filter-rules-of-thumb/index.html

B<FAQ>

=over

=item * Why does two different false positive rates (e.g. 1% and 0.1%) give the same bloom filter size?

The parameter C<m> is rounded upwards to the nearest power of 2 (e.g. 1024*8
bits becomes 1024*8 bits but 1025*8 becomes 2048*8 bits), so sometimes two
false positive rates with different C<m> get rounded to the same value of C<m>.
Use the C<bloom_filter_calculator> routine to see the C<actual_m> and C<actual_p>
(actual false-positive rate).

=back

This function is not exported.

Arguments ('*' denotes required arguments):

=over 4

=item * B<false_positive_rate> => I<float>

=item * B<num_bits> => I<posint>

The default is 16384*8 bits (generates a ~16KB bloom filter). If you supply 16k



( run in 2.871 seconds using v1.01-cache-2.11-cpan-39bf76dae61 )