App-BloomUtils
view release on metacpan or search on metacpan
lib/App/BloomUtils.pm view on Meta::CPAN
package App::BloomUtils;
our $AUTHORITY = 'cpan:PERLANCAR'; # AUTHORITY
our $DATE = '2020-05-24'; # DATE
our $DIST = 'App-BloomUtils'; # DIST
our $VERSION = '0.007'; # VERSION
use 5.010001;
use strict;
use warnings;
use Log::ger;
use POSIX qw(ceil);
our %SPEC;
my $desc1 = <<'_';
You supply lines of text from STDIN and it will output the bloom filter bits on
STDOUT. You can also customize `num_bits` (`m`) and `num_hashes` (`k`), or, more
easily, `num_items` and `fp_rate`. Some rules of thumb to remember:
* One byte per item in the input set gives about a 2% false positive rate. So if
you expect two have 1024 elements, create a 1KB bloom filter with about 2%
false positive rate. For other false positive rates:
10% - 4.8 bits per item
1% - 9.6 bits per item
0.1% - 14.4 bits per item
0.01% - 19.2 bits per item
* Optimal number of hash functions is 0.7 times number of bits per item. Note
that the number of hashes dominate performance. If you want higher
performance, pick a smaller number of hashes. But for most cases, use the the
optimal number of hash functions.
* What is an acceptable false positive rate? This depends on your needs. 1% (1
in 100) or 0.1% (1 in 1,000) is a good start. If you want to make sure that
user's chosen password is not in a known wordlist, a higher false positive
rates will annoy your user more by rejecting her password more often, while
lower false positive rates will require a higher memory usage.
Ref: https://corte.si/posts/code/bloom-filter-rules-of-thumb/index.html
**FAQ**
* Why does two different false positive rates (e.g. 1% and 0.1%) give the same bloom filter size?
The parameter `m` is rounded upwards to the nearest power of 2 (e.g. 1024*8
bits becomes 1024*8 bits but 1025*8 becomes 2048*8 bits), so sometimes two
false positive rates with different `m` get rounded to the same value of `m`.
Use the `bloom_filter_calculator` routine to see the `actual_m` and `actual_p`
(actual false-positive rate).
_
$SPEC{gen_bloom_filter} = {
v => 1.1,
summary => 'Generate bloom filter',
description => $desc1,
args => {
num_bits => {
description => <<'_',
The default is 16384*8 bits (generates a ~16KB bloom filter). If you supply 16k
items (meaning 1 byte per 1 item) then the false positive rate will be ~2%. If
you supply fewer items the false positive rate is smaller and if you supply more
than 16k items the false positive rate will be higher.
_
schema => 'posint*',
#default => 8*16384,
cmdline_aliases => {m=>{}},
},
num_hashes => {
schema => 'posint*',
cmdline_aliases => {k=>{}},
#default => 6,
},
num_items => {
schema => 'posint*',
cmdline_aliases => {n=>{}},
},
false_positive_rate => {
schema => ['float*', max=>0.5],
cmdline_aliases => {
fp_rate => {},
p => {},
},
},
},
'cmdline.skip_format' => 1,
args_rels => {
},
examples => [
{
summary => 'Create a bloom filter for 100k items and 0.1% maximum false-positive rate '.
'(actual bloom size and false-positive rate will be shown on stderr)',
argv => [qw/--num-items 100000 --fp-rate 0.1%/],
'x.doc.show_result' => 0,
lib/App/BloomUtils.pm view on Meta::CPAN
App::BloomUtils - Utilities related to bloom filters
=head1 VERSION
This document describes version 0.007 of App::BloomUtils (from Perl distribution App-BloomUtils), released on 2020-05-24.
=head1 DESCRIPTION
This distributions provides the following command-line utilities:
=over
=item * L<bloom-filter-calculator>
=item * L<bloomcalc>
=item * L<bloomchk>
=item * L<bloomgen>
=item * L<check-with-bloom-filter>
=item * L<gen-bloom-filter>
=back
=head1 FUNCTIONS
=head2 bloom_filter_calculator
Usage:
bloom_filter_calculator(%args) -> [status, msg, payload, meta]
Help calculate num_bits (m) and num_hashes (k).
You supply lines of text from STDIN and it will output the bloom filter bits on
STDOUT. You can also customize C<num_bits> (C<m>) and C<num_hashes> (C<k>), or, more
easily, C<num_items> and C<fp_rate>. Some rules of thumb to remember:
=over
=item * One byte per item in the input set gives about a 2% false positive rate. So if
you expect two have 1024 elements, create a 1KB bloom filter with about 2%
false positive rate. For other false positive rates:
10% - 4.8 bits per item
1% - 9.6 bits per item
0.1% - 14.4 bits per item
0.01% - 19.2 bits per item
=item * Optimal number of hash functions is 0.7 times number of bits per item. Note
that the number of hashes dominate performance. If you want higher
performance, pick a smaller number of hashes. But for most cases, use the the
optimal number of hash functions.
=item * What is an acceptable false positive rate? This depends on your needs. 1% (1
in 100) or 0.1% (1 in 1,000) is a good start. If you want to make sure that
user's chosen password is not in a known wordlist, a higher false positive
rates will annoy your user more by rejecting her password more often, while
lower false positive rates will require a higher memory usage.
=back
Ref: https://corte.si/posts/code/bloom-filter-rules-of-thumb/index.html
B<FAQ>
=over
=item * Why does two different false positive rates (e.g. 1% and 0.1%) give the same bloom filter size?
The parameter C<m> is rounded upwards to the nearest power of 2 (e.g. 1024*8
bits becomes 1024*8 bits but 1025*8 becomes 2048*8 bits), so sometimes two
false positive rates with different C<m> get rounded to the same value of C<m>.
Use the C<bloom_filter_calculator> routine to see the C<actual_m> and C<actual_p>
(actual false-positive rate).
=back
This function is not exported.
Arguments ('*' denotes required arguments):
=over 4
=item * B<false_positive_rate> => I<float> (default: 0.02)
=item * B<num_bits> => I<posint>
Number of bits to set for the bloom filter.
=item * B<num_hashes> => I<posint>
=item * B<num_hashes_to_bits_per_item_ratio> => I<num>
0.7 (the default) is optimal.
=item * B<num_items>* => I<posint>
Expected number of items to add to bloom filter.
=back
Returns an enveloped result (an array).
First element (status) is an integer containing HTTP status code
(200 means OK, 4xx caller error, 5xx function error). Second element
(msg) is a string containing error message, or 'OK' if status is
200. Third element (payload) is optional, the actual result. Fourth
element (meta) is called result metadata and is optional, a hash
that contains extra information.
Return value: (any)
=head2 check_with_bloom_filter
lib/App/BloomUtils.pm view on Meta::CPAN
=item * B<items>* => I<array[str]>
Items to check.
=back
Returns an enveloped result (an array).
First element (status) is an integer containing HTTP status code
(200 means OK, 4xx caller error, 5xx function error). Second element
(msg) is a string containing error message, or 'OK' if status is
200. Third element (payload) is optional, the actual result. Fourth
element (meta) is called result metadata and is optional, a hash
that contains extra information.
Return value: (any)
=head2 gen_bloom_filter
Usage:
gen_bloom_filter(%args) -> [status, msg, payload, meta]
Generate bloom filter.
Examples:
=over
=item * Create a bloom filter for 100k items and 0.1% maximum false-positive rate (actual bloom size and false-positive rate will be shown on stderr):
gen_bloom_filter( false_positive_rate => "0.1%", num_items => 100000);
=back
You supply lines of text from STDIN and it will output the bloom filter bits on
STDOUT. You can also customize C<num_bits> (C<m>) and C<num_hashes> (C<k>), or, more
easily, C<num_items> and C<fp_rate>. Some rules of thumb to remember:
=over
=item * One byte per item in the input set gives about a 2% false positive rate. So if
you expect two have 1024 elements, create a 1KB bloom filter with about 2%
false positive rate. For other false positive rates:
10% - 4.8 bits per item
1% - 9.6 bits per item
0.1% - 14.4 bits per item
0.01% - 19.2 bits per item
=item * Optimal number of hash functions is 0.7 times number of bits per item. Note
that the number of hashes dominate performance. If you want higher
performance, pick a smaller number of hashes. But for most cases, use the the
optimal number of hash functions.
=item * What is an acceptable false positive rate? This depends on your needs. 1% (1
in 100) or 0.1% (1 in 1,000) is a good start. If you want to make sure that
user's chosen password is not in a known wordlist, a higher false positive
rates will annoy your user more by rejecting her password more often, while
lower false positive rates will require a higher memory usage.
=back
Ref: https://corte.si/posts/code/bloom-filter-rules-of-thumb/index.html
B<FAQ>
=over
=item * Why does two different false positive rates (e.g. 1% and 0.1%) give the same bloom filter size?
The parameter C<m> is rounded upwards to the nearest power of 2 (e.g. 1024*8
bits becomes 1024*8 bits but 1025*8 becomes 2048*8 bits), so sometimes two
false positive rates with different C<m> get rounded to the same value of C<m>.
Use the C<bloom_filter_calculator> routine to see the C<actual_m> and C<actual_p>
(actual false-positive rate).
=back
This function is not exported.
Arguments ('*' denotes required arguments):
=over 4
=item * B<false_positive_rate> => I<float>
=item * B<num_bits> => I<posint>
The default is 16384*8 bits (generates a ~16KB bloom filter). If you supply 16k
items (meaning 1 byte per 1 item) then the false positive rate will be ~2%. If
you supply fewer items the false positive rate is smaller and if you supply more
than 16k items the false positive rate will be higher.
=item * B<num_hashes> => I<posint>
=item * B<num_items> => I<posint>
=back
Returns an enveloped result (an array).
First element (status) is an integer containing HTTP status code
(200 means OK, 4xx caller error, 5xx function error). Second element
(msg) is a string containing error message, or 'OK' if status is
200. Third element (payload) is optional, the actual result. Fourth
element (meta) is called result metadata and is optional, a hash
that contains extra information.
Return value: (any)
=head1 HOMEPAGE
Please visit the project's homepage at L<https://metacpan.org/release/App-BloomUtils>.
=head1 SOURCE
Source repository is at L<https://github.com/perlancar/perl-App-BloomUtils>.
( run in 0.480 second using v1.01-cache-2.11-cpan-39bf76dae61 )