Algorithm-AhoCorasick-XS
view release on metacpan or search on metacpan
# DESCRIPTION
Implements Aho-Corasick, which given an input string and a set of substrings, will
tell you which of those substrings are present in the input, and where.
Aho-Corasick matches all substrings at once, so no matter how many you have, it
runs in roughly linear time (proportional to the size of the input string + the sum
of sizes of all substrings + the number of matches).
# MOTIVATION
The excellent [Algorithm::AhoCorasick](https://metacpan.org/pod/Algorithm::AhoCorasick) is pure Perl, and roughly 100 times slower.
Other modules suffer from bugs (false negatives) when given overlapping substrings,
segfault, or won't tell you precisely which substrings matched.
# FUNCTIONS
- new ( ARRAYREF\_OF\_SUBSTRINGS )
Constructs a matcher object given an arrayref of substrings. Builds the internal
automaton.
- matches ( INPUT )
Given a string, returns a list of the substrings which are present in the input.
There may be duplicates if a substring occurs more than once.
- unique\_matches ( INPUT )
As above but runs `uniq` on the list for you.
- first\_match ( INPUT )
Returns the first match only (or undef if none). This is efficient - the matcher
will stop once it encounters the first match, and the rest of the string will be
ignored.
- match\_details ( INPUT )
Returns a list of hashrefs, containing the keys `word`, `start` and `end`.
These correspond to an occurence of a substring - the word, start and end offset
within the string.
# ENCODING SUPPORT
The matcher runs at the byte level, so you can use any encoding you like. If you
want to match strings regardless of encoding, I recommend that you encode everything
into UTF-8 and apply NFC normalization (or perhaps NFD).
## Passing Unicode strings
If you pass Unicode strings to the matcher, they will be interpreted as a sequence
of UTF-8 bytes. This means the output of `matches`, `match_details` etc. will also
be in terms of bytes.
You can simply call ` decode('UTF-8', ...) ` on the substrings to get their
Unicode versions. The offsets will be in bytes though; converting them to character
offsets in the Unicode string is a little more tricky:
use Encode qw(decode);
my $unicode_start = length(decode('UTF-8', bytes::substr($string, 0, $start)));
my $unicode_end = $start + length(decode('UTF-8', $word)) - 1;
This will be handled for you in a future version.
# CAVEATS
This is an early release and has not been tested thoroughly, use at your own risk.
The API is subject to change until version 1.0.
If your keyword list contains duplicates, you will get duplicate matches.
# COPYRIGHT & LICENSE
Copyright 2017 Richard Harris. This library is free software; you may redistribute
it and/or modify it under the same terms as Perl itself.
# AUTHOR
Richard Harris <richardjharris@gmail.com>
# SEE ALSO
[Algorithm::AhoCorasick](https://metacpan.org/pod/Algorithm::AhoCorasick)
( run in 1.200 second using v1.01-cache-2.11-cpan-39bf76dae61 )