DBIx-FullTextSearch
lib/DBIx/FullTextSearch.pm
=head2 Frontends
From the user's (the application's) point of view, the DBIx::FullTextSearch index
stores documents that are named in a certain way, allows adding new
documents, and provides methods to ask: "give me the list of names of
documents that contain these words". The DBIx::FullTextSearch index doesn't store
the documents themselves. Instead, it stores information about the words
in the documents in a structure that makes it easy and fast to look up
which documents contain certain words, and to return the names of those
documents.
DBIx::FullTextSearch provides a couple of predefined frontend classes that specify
various types of documents (and the way they relate to their names).
=over 4
=item default
By default, the user specifies an integer number for the document and the
content (body) of the document. The code would for example read

    $fts->index_document(53, 'zastavujeme vyplaty vkladu');

and DBIx::FullTextSearch will remember that document 53 contains those three
words. When looking for all documents containing words starting with the
string vklad, a call

    my @docs = $fts->contains('vklad*');

would return the numbers of all documents containing words starting with
'vklad', 53 among them.

So here it is the user's responsibility to maintain the relation between
the document numbers and their content, to know that document 53 is
about vklady. Perhaps the documents are already stored somewhere and
have a unique numeric id.

Note that the numeric id must be no larger than 2^C<doc_id_bits>.
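Putting the default frontend together, a minimal round trip might look as
follows. This is a sketch only: it assumes a reachable MySQL database, and
the connection parameters and the index name 'my_fts' are invented for the
example.

    use DBI;
    use DBIx::FullTextSearch;

    my $dbh = DBI->connect('dbi:mysql:database=test', 'user', 'password')
        or die $DBI::errstr;

    # create the index once; afterwards, use ->open to reach it again
    my $fts = DBIx::FullTextSearch->create($dbh, 'my_fts')
        or die $DBIx::FullTextSearch::errstr;

    $fts->index_document(53, 'zastavujeme vyplaty vkladu');
    my @docs = $fts->contains('vklad*');   # 53 should be among @docs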
=item string
Frontend B<string> allows the user to name the documents with arbitrary
strings instead of numbers. The user still has to specify both the name
of the document and its content:

    $fts->index_document('foobar',
        'the quick brown fox jumped over lazy dog!');

After that,

    $fts->contains('dog')

will return 'foobar' as one of the names of the documents that contain
the word 'dog'.
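The string frontend is selected when the index is created. A minimal
sketch, assuming a connected database handle in $dbh (the index name
'fts_names' is made up for the example):

    my $fts = DBIx::FullTextSearch->create($dbh, 'fts_names',
        frontend => 'string');
    $fts->index_document('foobar',
        'the quick brown fox jumped over lazy dog!');
    my @docs = $fts->contains('dog');   # 'foobar' among the names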
=item file
To index files, use the frontend B<file>. Here the content of the document
is the content of the file specified by the filename, so in a call to
index_document only the name is needed; the content of the file is read
by DBIx::FullTextSearch transparently:

    $fts->index_document('/usr/doc/FAQ/Linux-FAQ');
    my @files = $fts->contains('penguin');
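As with the other frontends, B<file> is chosen when the index is created;
a sketch, assuming a connected $dbh (the index name 'fts_files' is made
up):

    my $fts = DBIx::FullTextSearch->create($dbh, 'fts_files',
        frontend => 'file');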
=item url
Web documents can be indexed with the frontend B<url>. DBIx::FullTextSearch uses
L<LWP> to fetch the document and then parses it normally:

    $fts->index_document('http://www.perl.com/');

Note that the HTML tags themselves are indexed along with the text.
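Again, the frontend is chosen at create time; a sketch, assuming a
connected $dbh (the index name 'fts_web' is made up for the example):

    my $fts = DBIx::FullTextSearch->create($dbh, 'fts_web',
        frontend => 'url');
    $fts->index_document('http://www.perl.com/');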
=item table
You can have a DBIx::FullTextSearch index that indexes char or blob fields in a
MySQL table. Since MySQL doesn't support triggers, you have to call the
C<index_document> method of DBIx::FullTextSearch any time something changes in
the table. So the sequence will probably be

    $dbh->do('insert into the_table (id, data, other_fields)
        values (?, ?, ?)', {}, $name, $data, $date_or_something);
    $fts->index_document($name);

When calling C<contains>, the id (name) of the record is returned. If
the id in the_table is numeric, it is used directly as the internal
numeric id; otherwise it is converted to numeric form the same way as in
the string frontend.

When creating this index, you have to pass it three additional options,
C<table_name>, C<column_name>, and C<column_id_name>. You may use the
optional C<column_process> option to pre-process the data in the
specified columns.
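A create call for the table frontend might then look like the following
sketch; the table and column names are invented for the example and must
match your actual schema:

    my $fts = DBIx::FullTextSearch->create($dbh, 'fts_table_idx',
        frontend       => 'table',
        table_name     => 'the_table',
        column_name    => 'data',
        column_id_name => 'id');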
=back
The structure of DBIx::FullTextSearch is very flexible, and adding a new
frontend (defining what will be indexed) is easy.
=head2 Backends
While the frontend specifies what is indexed and how the user sees the
collection of documents, the backend determines the low-level way the
information is actually stored in the database tables. Three types are
available:
=over 4
=item blob
For each word, a blob holding the list of all documents containing that
word is stored in the table, together with the count (number of
occurrences) of the word in each document. That makes for very compact
storage. Since the document names (for example URLs) are internally
converted to numbers, storing and fetching the data is fast. However,
updating the index is very slow, because the information concerning one
document is spread across the whole table and cannot be accessed
directly. Updating a document (or merely reindexing it) requires
updating all the blobs, which is slow.
The list of documents is stored sorted by document name so that