Algorithm-AhoCorasick-XS
view release on metacpan or search on metacpan
If you pass Unicode strings to the matcher, they will be interpreted as a sequence
of UTF-8 bytes. This means the output of `matches`, `match_details` etc. will also
be in terms of bytes.
You can simply call ` decode('UTF-8', ...) ` on the substrings to get their
Unicode versions. The offsets will be in bytes though; converting them to character
offsets in the Unicode string is a little more tricky:
use Encode qw(decode);
my $unicode_start = length(decode('UTF-8', bytes::substr($string, 0, $start)));
my $unicode_end = $start + length(decode('UTF-8', $word)) - 1;
This will be handled for you in a future version.
# CAVEATS
This is an early release and has not been tested thoroughly, use at your own risk.
The API is subject to change until version 1.0.
If your keyword list contains duplicates, you will get duplicate matches.
lib/Algorithm/AhoCorasick/XS.pm view on Meta::CPAN
If you pass Unicode strings to the matcher, they will be interpreted as a sequence
of UTF-8 bytes. This means the output of C<matches>, C<match_details> etc. will also
be in terms of bytes.
You can simply call C< decode('UTF-8', ...) > on the substrings to get their
Unicode versions. The offsets will be in bytes though; converting them to character
offsets in the Unicode string is a little more tricky:
use Encode qw(decode);
my $unicode_start = length(decode('UTF-8', bytes::substr($string, 0, $start)));
my $unicode_end = $start + length(decode('UTF-8', $word)) - 1;
This will be handled for you in a future version.
=head1 CAVEATS
This is an early release and has not been tested thoroughly, use at your own risk.
The API is subject to change until version 1.0.
If your keyword list contains duplicates, you will get duplicate matches.
parse_fullexpr||5.013008|
parse_fullstmt||5.013005|
parse_gv_stash_name|||
parse_ident|||
parse_label||5.013007|
parse_listexpr||5.013008|
parse_lparen_question_flags|||
parse_stmtseq||5.013006|
parse_subsignature|||
parse_termexpr||5.013008|
parse_unicode_opts|||
parser_dup|||
parser_free_nexttoke_ops|||
parser_free|||
path_is_searchable|||n
peep|||
pending_ident|||
perl_alloc_using|||n
perl_alloc|||n
perl_clone_using|||n
perl_clone|||n
#endif
#ifndef PERL_PV_PRETTY_DUMP
# define PERL_PV_PRETTY_DUMP PERL_PV_PRETTY_ELLIPSES|PERL_PV_PRETTY_QUOTE
#endif
#ifndef PERL_PV_PRETTY_REGPROP
# define PERL_PV_PRETTY_REGPROP PERL_PV_PRETTY_ELLIPSES|PERL_PV_PRETTY_LTGT|PERL_PV_ESCAPE_RE
#endif
/* Hint: pv_escape
* Note that unicode functionality is only backported to
* those perl versions that support it. For older perl
* versions, the implementation will fall back to bytes.
*/
#ifndef pv_escape
#if defined(NEED_pv_escape)
static char * DPPP_(my_pv_escape)(pTHX_ SV * dsv, char const * const str, const STRLEN count, const STRLEN max, STRLEN * const escaped, const U32 flags);
static
#else
extern char * DPPP_(my_pv_escape)(pTHX_ SV * dsv, char const * const str, const STRLEN count, const STRLEN max, STRLEN * const escaped, const U32 flags);
( run in 0.274 second using v1.01-cache-2.11-cpan-88abd93f124 )