Algorithm-MinPerfHashTwoLevel
MinPerfHashTwoLevel.xs
/* this is used to quickly tell if we have used a particular bucket yet */
Newxz(is_used,bucket_count,char);
SAVEFREEPV(is_used);
/* used to keep track of the indexes that a set of keys map into -
 * we treat it as an array of U32 */
Newxz(idx2_start, av_top_index(by_length_av)+1, U32);
SAVEFREEPV(idx2_start);
/* now loop through and process the keysets from most collisions to least */
for (len_idx= av_top_index(by_length_av); len_idx > 0 && !bad_idx; len_idx--) {
SV **idx1_packed_sv= av_fetch(by_length_av, len_idx, 0);
/* deal with the possibility that there are gaps in the length grouping,
 * for instance we might have some 13-way collisions and some 11-way collisions
 * without any 12-way collisions. (this should be rare - but is possible) */
if (!idx1_packed_sv || !SvPOK(*idx1_packed_sv))
continue;
if (len_idx == 1) {
bad_idx= place_singletons(aTHX_ bucket_count, *idx1_packed_sv, keybuckets_av,
} else {
croak("panic: out of memory in lvalue fetch for 'buckets' in self");
}
/**** build an array of hashes in keys_av based on the normalized contents of source_hv */
keys_av= (AV *)sv_2mortal((SV*)newAV());
bucket_count= normalize_source_hash(aTHX_ source_hv, keys_av, compute_flags, buf_length_sv, state_pv);
max_xor_val= INT32_MAX;
/* if the caller wants deterministic results we sort the keys_av
* before we compute collisions - depending on the order we process
* the keys we might resolve the collisions differently */
if (compute_flags & MPH_F_DETERMINISTIC)
sortsv(AvARRAY(keys_av),bucket_count,_compare);
/**** find the collisions from the data we just computed, build an AoAoH and AoS of the
**** collision data */
keybuckets_av= (AV*)sv_2mortal((SV*)newAV()); /* AoAoH - hashes from keys_av */
h2_packed_av= (AV*)sv_2mortal((SV*)newAV()); /* AoS - packed h1 */
find_first_level_collisions(aTHX_ bucket_count, keys_av, keybuckets_av, h2_packed_av);
/* Sort the buckets by size by constructing an AoS, with the outer array indexed by length,
 * and the inner string being the list of items of that length. (Thus the contents of index
 * 0 is empty/undef.)
 * The end result is that we can process the collisions from the bucket with the most keys
 * to the one with the least in O(N) rather than O(N log2 N).
 *
 * The length of the array (av_top_index+1) reflects the number of items in the bucket
 * with the most collisions - we use this later to size some of our data structures.
 */
by_length_av= idx_by_length(aTHX_ keybuckets_av);
RETVAL= solve_collisions_by_length(aTHX_ bucket_count, max_xor_val, by_length_av, h2_packed_av, keybuckets_av,
variant, buckets_av);
}
Algorithm-MinPerfHashTwoLevel version 0.01
==========================================
This distribution contains two modules: Algorithm::MinPerfHashTwoLevel,
which can construct a "two level" minimal perfect hash in an abstract
form, and Tie::Hash::MinPerfHashTwoLevel::OnDisk, which uses the
abstract representation to create an on-disk image of the hash that
can be efficiently mounted via mmap, allowing memory-efficient access
to the hash by multiple processes at the same time.
Requires a 64-bit operating environment and a little-endian CPU.
INSTALLATION
To install these modules type the following:
perl Makefile.PL
make
make test
lib/Algorithm/MinPerfHashTwoLevel.pm
hash value to determine the actual bucket to read from. This module computes
the appropriate mask values.
In this implementation only one 64 bit hash value is computed, but the high
and low 32 bits are used as if there were two hash values. The hash function
used is SipHash 1-3. The full 64 bit hash is called h0, the high 32 bits are
called h1, and the low 32 bits are called h2.
Computing the hash and mask is done in C (via XS).
The process for looking up a value in a two level hash with n buckets is
as follows:
0. compute the h0 for the key. (giving: h1 = h0 >> 32; h2 = h0 & 0xFFFFFFFF;)
1. compute idx1 = h1 % n;
2. find the xor_val for bucket[idx1]
3. if the xor_val is zero we are done, the key is not in the hash
4. compute idx2:
if int(xor_val) < 0
idx2 = -xor_val-1
else
lib/Tie/Hash/MinPerfHashTwoLevel/OnDisk.pm
debug => 0,
);
my %hash;
tie %hash, "Tie::Hash::MinPerfHashTwoLevel::OnDisk", $some_file;
=head1 DESCRIPTION
This module allows one to either construct, or use, a precomputed minimal
perfect hash on disk via a tied interface. The disk image of the hash is
loaded using mmap, which means that multiple processes may use the
same disk image at the same time without memory duplication. The hash
is read-only, and may only contain string values.
=head2 METHODS
=over 4
=item make_file
Construct a new file from a given 'source_hash' argument. Takes the following arguments:
=item comment
An arbitrary piece of text of your choosing. Can be extracted from
the file later if needed. The only practical restriction on the value is
that it cannot contain a null byte.
=item seed
A 16 byte string (or a reference to one) to use as the seed for
the hashing and generation process. If this is omitted a standard default
is chosen.
If it should prove impossible to construct a solution using the seed chosen
then a new one will be constructed deterministically from the old until a
solution is found (see L<max_tries>). (Prior to version v0.10 this used rand().)
Should you wish to access the seed actually used for the final solution
then you can pass in a reference to a scalar containing your chosen seed.
The reference scalar will be updated after successful construction.
Thus both of the following are valid:
Tie::Hash::MinPerfHashTwoLevel::OnDisk->make_file(seed => "1234567812345678", ...);
Tie::Hash::MinPerfHashTwoLevel::OnDisk->make_file(seed => \my $seed= "1234567812345678", ...);
=item compute_flags
This is an integer containing various bit-flags that control construction.
They are as follows:
MPH_F_FILTER_UNDEF => 1 - filter keys with undef values
MPH_F_DETERMINISTIC => 2 - repeatable results (sort keys during processing)
MPH_F_NO_DEDUPE => 4 - do not dedupe strings in final buffer
These constants can be imported via the ":flags" tag, but there are also options
that have the equivalent result; see C<no_dedupe>, C<deterministic> and C<filter_undef>.
=item no_dedupe
Speed up construction at the cost of a larger string buffer by disabling
deduplication of values and keys. Same as setting the MPH_F_NO_DEDUPE bit
in compute_flags.
=back
=head2 TIED INTERFACE
my %hash;
tie %hash, "Tie::Hash::MinPerfHashTwoLevel::OnDisk", $some_file, $flags;
will set up %hash to read from the mmapped image on disk as created by make_file().
The underlying image is never altered, and copies of the keys and values are made
when necessary. The flags field is an integer which contains bit-flags to control
the reading process; currently only one flag is supported, MPH_F_VALIDATE, which enables
a full file checksum before returning (forcing the data to be loaded and then read).
By default this validation is disabled, however basic checks that the header is
sane will be performed on loading (or "mounting") the image. The tie operation may
die if the file is not found or if any of these checks fail.
As this is somewhat cumbersome to type you may want to look at the mph2l_tied_hashref()
function, which is a wrapper around this function.
=head2 FILE FORMAT