App-BloomUtils
view release on metacpan or search on metacpan
lib/App/BloomUtils.pm view on Meta::CPAN
in 100) or 0.1% (1 in 1,000) is a good start. If you want to make sure that
user's chosen password is not in a known wordlist, a higher false positive
rates will annoy your user more by rejecting her password more often, while
lower false positive rates will require a higher memory usage.
Ref: https://corte.si/posts/code/bloom-filter-rules-of-thumb/index.html
**FAQ**
* Why does two different false positive rates (e.g. 1% and 0.1%) give the same bloom filter size?
The parameter `m` is rounded upwards to the nearest power of 2 (e.g. 1024*8
bits becomes 1024*8 bits but 1025*8 becomes 2048*8 bits), so sometimes two
false positive rates with different `m` get rounded to the same value of `m`.
Use the `bloom_filter_calculator` routine to see the `actual_m` and `actual_p`
(actual false-positive rate).
_
$SPEC{gen_bloom_filter} = {
v => 1.1,
summary => 'Generate bloom filter',
description => $desc1,
args => {
num_bits => {
description => <<'_',
The default is 16384*8 bits (generates a ~16KB bloom filter). If you supply 16k
items (meaning 1 byte per 1 item) then the false positive rate will be ~2%. If
you supply fewer items the false positive rate is smaller and if you supply more
than 16k items the false positive rate will be higher.
_
schema => 'posint*',
#default => 8*16384,
cmdline_aliases => {m=>{}},
},
num_hashes => {
schema => 'posint*',
cmdline_aliases => {k=>{}},
#default => 6,
},
num_items => {
schema => 'posint*',
cmdline_aliases => {n=>{}},
},
false_positive_rate => {
schema => ['float*', max=>0.5],
cmdline_aliases => {
fp_rate => {},
p => {},
},
},
},
'cmdline.skip_format' => 1,
args_rels => {
},
examples => [
{
summary => 'Create a bloom filter for 100k items and 0.1% maximum false-positive rate '.
'(actual bloom size and false-positive rate will be shown on stderr)',
argv => [qw/--num-items 100000 --fp-rate 0.1%/],
'x.doc.show_result' => 0,
test => 0,
},
],
links => [
{url=>'prog:bloom-filter-calculator'},
],
};
sub gen_bloom_filter {
require Algorithm::BloomFilter;
my %args = @_;
my $res;
if (defined $args{num_items}) {
$res = bloom_filter_calculator(
num_items => $args{num_items},
num_bits => $args{num_bits},
num_hashes => $args{num_hashes},
false_positive_rate => $args{false_positive_rate},
num_hashes_to_bits_per_item_ratio => 0.7,
);
} else {
$res = bloom_filter_calculator(
num_bits => $args{num_bits} // 16384*8,
num_hashes => $args{num_hashes} // 6,
num_items => int($args{num_bits} / 8),
);
}
return $res unless $res->[0] == 200;
my $m = $args{num_bits} // $res->[2]{actual_m};
my $k = $args{num_hashes} // $res->[2]{actual_k};
log_info "Will be creating bloom filter with num_bits (m)=%d (actual %d), num_hashes (k)=%d, actual false-positive rate=%.5f%% (when num_items=%d), actual bloom filter size=%d bytes",
$m, $res->[2]{actual_m}, $k, $res->[2]{actual_p}*100, $res->[2]{n}, $res->[2]{actual_bloom_size};
my $bf = Algorithm::BloomFilter->new($m, $k);
my $i = 0;
while (defined(my $line = <STDIN>)) {
chomp $line;
$bf->add($line);
$i++;
if (defined $args{num_items} && $i == $args{num_items}+1) {
log_warn "You created bloom filter for num_items=%d, but now have added more than that", $args{num_items};
}
}
print $bf->serialize;
[200];
}
$SPEC{check_with_bloom_filter} = {
v => 1.1,
summary => 'Check with bloom filter',
description => <<'_',
You supply the bloom filter in STDIN, items to check as arguments, and this
utility will print lines containing 0 or 1 depending on whether items in the
lib/App/BloomUtils.pm view on Meta::CPAN
(msg) is a string containing error message, or 'OK' if status is
200. Third element (payload) is optional, the actual result. Fourth
element (meta) is called result metadata and is optional, a hash
that contains extra information.
Return value: (any)
=head2 check_with_bloom_filter
Usage:
check_with_bloom_filter(%args) -> [status, msg, payload, meta]
Check with bloom filter.
You supply the bloom filter in STDIN, items to check as arguments, and this
utility will print lines containing 0 or 1 depending on whether items in the
arguments are tested to be, respectively, not in the set (0) or probably in the
set (1).
This function is not exported.
Arguments ('*' denotes required arguments):
=over 4
=item * B<items>* => I<array[str]>
Items to check.
=back
Returns an enveloped result (an array).
First element (status) is an integer containing HTTP status code
(200 means OK, 4xx caller error, 5xx function error). Second element
(msg) is a string containing error message, or 'OK' if status is
200. Third element (payload) is optional, the actual result. Fourth
element (meta) is called result metadata and is optional, a hash
that contains extra information.
Return value: (any)
=head2 gen_bloom_filter
Usage:
gen_bloom_filter(%args) -> [status, msg, payload, meta]
Generate bloom filter.
Examples:
=over
=item * Create a bloom filter for 100k items and 0.1% maximum false-positive rate (actual bloom size and false-positive rate will be shown on stderr):
gen_bloom_filter( false_positive_rate => "0.1%", num_items => 100000);
=back
You supply lines of text from STDIN and it will output the bloom filter bits on
STDOUT. You can also customize C<num_bits> (C<m>) and C<num_hashes> (C<k>), or, more
easily, C<num_items> and C<fp_rate>. Some rules of thumb to remember:
=over
=item * One byte per item in the input set gives about a 2% false positive rate. So if
you expect two have 1024 elements, create a 1KB bloom filter with about 2%
false positive rate. For other false positive rates:
10% - 4.8 bits per item
1% - 9.6 bits per item
0.1% - 14.4 bits per item
0.01% - 19.2 bits per item
=item * Optimal number of hash functions is 0.7 times number of bits per item. Note
that the number of hashes dominate performance. If you want higher
performance, pick a smaller number of hashes. But for most cases, use the the
optimal number of hash functions.
=item * What is an acceptable false positive rate? This depends on your needs. 1% (1
in 100) or 0.1% (1 in 1,000) is a good start. If you want to make sure that
user's chosen password is not in a known wordlist, a higher false positive
rates will annoy your user more by rejecting her password more often, while
lower false positive rates will require a higher memory usage.
=back
Ref: https://corte.si/posts/code/bloom-filter-rules-of-thumb/index.html
B<FAQ>
=over
=item * Why does two different false positive rates (e.g. 1% and 0.1%) give the same bloom filter size?
The parameter C<m> is rounded upwards to the nearest power of 2 (e.g. 1024*8
bits becomes 1024*8 bits but 1025*8 becomes 2048*8 bits), so sometimes two
false positive rates with different C<m> get rounded to the same value of C<m>.
Use the C<bloom_filter_calculator> routine to see the C<actual_m> and C<actual_p>
(actual false-positive rate).
=back
This function is not exported.
Arguments ('*' denotes required arguments):
=over 4
=item * B<false_positive_rate> => I<float>
=item * B<num_bits> => I<posint>
The default is 16384*8 bits (generates a ~16KB bloom filter). If you supply 16k
( run in 2.871 seconds using v1.01-cache-2.11-cpan-39bf76dae61 )