DBD-SQLcipher
lib/DBD/SQLcipher/Fulltext_search.pod
A former version, called the "standard query syntax", used to
support tokens prefixed with '+' or '-' signs (for token inclusion
or exclusion); if your application needs to support this old
syntax, use L<DBD::SQLcipher::FTS3Transitional> (published
in a separate distribution) to perform the conversion.
=head1 TOKENIZERS
=head2 Concept
The behaviour of full-text indexes strongly depends on how
documents are split into I<tokens>; therefore FTS table
declarations can explicitly specify how to perform
tokenization:
    CREATE ... USING fts4(<columns>, tokenize=<tokenizer>)

where C<< <tokenizer> >> is a sequence of space-separated
words that selects a specific tokenizer. Tokenizers can be
SQLcipher builtins, written in C code, or tokenizers written
in Perl. Both kinds are explained below.
=head2 SQLcipher builtin tokenizers
SQLcipher comes with some builtin tokenizers (see
L<http://www.sqlite.org/fts3.html#tokenizer>):
=over
=item simple
Under the I<simple> tokenizer, a term is a contiguous sequence of
eligible characters, where eligible characters are all alphanumeric
characters, the "_" character, and all characters with UTF codepoints
greater than or equal to 128. All other characters are discarded when
splitting a document into terms. They serve only to separate adjacent
terms.
All uppercase characters within the ASCII range (UTF codepoints less
than 128), are transformed to their lowercase equivalents as part of
the tokenization process. Thus, full-text queries are case-insensitive
when using the simple tokenizer.
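As an illustration of this case-insensitivity, the queries below
should both match the inserted row (a sketch; the table and column
names are invented for this example):

    CREATE VIRTUAL TABLE docs USING fts4(content, tokenize=simple);
    INSERT INTO docs(content) VALUES ('Hello World');

    -- both queries match the row above, because the simple tokenizer
    -- folds ASCII characters to lowercase in documents and in queries
    SELECT * FROM docs WHERE docs MATCH 'hello';
    SELECT * FROM docs WHERE docs MATCH 'HELLO';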
=item porter
The I<porter> tokenizer uses the same rules to separate the input
document into terms, but as well as folding all terms to lower case it
uses the Porter Stemming algorithm to reduce related English language
words to a common root.
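For instance, the Porter algorithm reduces "running" and "runs" to
the common stem "run", so a query on one form matches documents
containing the other (a sketch; the table and column names are
invented for this example):

    CREATE VIRTUAL TABLE docs USING fts4(content, tokenize=porter);
    INSERT INTO docs(content) VALUES ('He was running fast');

    -- matches: 'running' and 'runs' are both stemmed to 'run'
    SELECT * FROM docs WHERE docs MATCH 'runs';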
=item icu
The I<icu> tokenizer uses the ICU library to decide how to
identify word characters in different languages; however, this
requires SQLcipher to be compiled with the C<SQLITE_ENABLE_ICU>
pre-processor symbol defined. So, to use this tokenizer, you need
to edit F<Makefile.PL> to add this flag in C<@CC_DEFINE>, and then
recompile C<DBD::SQLcipher>; of course, the prerequisite is to have
an ICU library available on your system.
=item unicode61
The I<unicode61> tokenizer works very much like "simple" except that it
does full unicode case folding according to rules in Unicode Version
6.1 and it recognizes unicode space and punctuation characters and
uses those to separate tokens. By contrast, the simple tokenizer only
does case folding of ASCII characters and only recognizes ASCII space
and punctuation characters as token separators.
By default, "unicode61" also removes all diacritics from Latin script
characters. This behaviour can be overridden by adding the tokenizer
argument C<"remove_diacritics=0">. For example:
    -- Create tables that remove diacritics from Latin script characters
    -- as part of tokenization.
    CREATE VIRTUAL TABLE txt1 USING fts4(tokenize=unicode61);
    CREATE VIRTUAL TABLE txt2 USING fts4(tokenize=unicode61 "remove_diacritics=1");

    -- Create a table that does not remove diacritics from Latin script
    -- characters as part of tokenization.
    CREATE VIRTUAL TABLE txt3 USING fts4(tokenize=unicode61 "remove_diacritics=0");
Additional options can customize the set of codepoints that unicode61
treats as separator characters or as token characters -- see the
documentation in L<http://www.sqlite.org/fts3.html#unicode61>.
=back
If a more complex tokenizing algorithm is required, for example to
implement stemming, discard punctuation, or to recognize compound words,
use the perl tokenizer to implement your own logic, as explained below.
=head2 Perl tokenizers
=head3 Declaring a perl tokenizer
In addition to the builtin SQLcipher tokenizers, C<DBD::SQLcipher>
implements a I<perl> tokenizer, which can hook into any tokenizing
algorithm written in Perl. It is specified as follows:

    CREATE ... USING fts4(<columns>, tokenize=perl '<perl_function>')
where C<< <perl_function> >> is a fully qualified Perl function name
(i.e. prefixed by the name of the package in which that function is
declared). So, for example, if the function is C<my_func> in the
main program, write

    CREATE ... USING fts4(<columns>, tokenize=perl 'main::my_func')
=head3 Writing a perl tokenizer by hand
That function should return a code reference that takes a string as
single argument, and returns an iterator (another function), which
returns a tuple C<< ($term, $len, $start, $end, $index) >> for each
term. Here is a simple example that tokenizes on words according to
the current perl locale:

    sub locale_tokenizer {
      return sub {
        my $string = shift;

        use locale;
        my $regex      = qr/\w+/;
        my $term_index = 0;

        return sub { # closure
          $string =~ /$regex/g or return; # either match, or no more token
          my ($start, $end) = ($-[0], $+[0]);
          my $len           = $end - $start;
          my $term          = substr($string, $start, $len);
          return ($term, $len, $start, $end, $term_index++);
        }
      };
    }
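Assuming a C<$dbh> database handle obtained through L<DBI>, such a
tokenizer could then be declared and queried like this (a sketch; the
table and column names are invented for this example):

    $dbh->do(q{
      CREATE VIRTUAL TABLE docs
        USING fts4(content, tokenize=perl 'main::locale_tokenizer')
    });

    $dbh->do("INSERT INTO docs(content) VALUES (?)", undef, $text);

    my $rows = $dbh->selectall_arrayref(
      "SELECT content FROM docs WHERE docs MATCH ?", undef, $query,
    );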
There must be three levels of subs, in a kind of "Russian dolls" structure,
because:
=over
=item *
the external, named sub is called whenever accessing an FTS table
with that tokenizer