package CWB::Web::Cache;
# -*-cperl-*-

##
## Short overview of module and cache architecture:
##
## The CWB::Web::Cache module uses a cache directory to keep selected named queries
## persistent between CQP sessions.  It is mainly intended for use in simple Web front-ends.
## Therefore, it is neither completely safe in heavy-duty CGI applications (where race
## conditions may occur) nor optimised for speed.  First, a CWB::Web::Cache object is created
## for an existing CQP process (using the CWB::CQP module) and must be initialised with settings
## for the cache directory path and caching strategy.  In order to make a named query result
## (of the running CQP process) persistent, it is stored in the disk cache directory, and a unique
## identifier is returned to the calling program.  This identifier can then be used to recover
## the persistent named query in a subsequent session (unless the result has already expired from
## the cache, a case that must be handled by the caller).
##
## The CWB::Web::Cache module can also execute simple CQP queries and make their results
## persistent.  The query results are identified by corpus, query string, and an optional sort
## (stored as metadata) rather than a single unique identifier, and can be shared among different
## processes using the same cache directory.  When a persistent query result has expired from the
## cache, it is re-created in a way transparent to the calling program (by re-executing the query
## expression in the CQP process).
##
## The cache directory contains two subdirectories and an optional CONFIG file:
##   index/  ...  text files as 'markers' for cached queries (may contain 'metadata' about the cached query)
##   data/   ...  named query results stored in CQP's internal format
## A persistent named query is stored in a file with the name <corpus>:<query_name> in the data/
## subdirectory (e.g. DICKENS:ResultA-1121, where the numerical suffix is used to create a unique filename
## if necessary).  A text file with the same name is created in the index/ directory and may hold
## meta-information about the cached query result.  Storing a named query proceeds in the following steps:
##   1. create cache directory and subdirectories if they do not exist
##   1a recover cache settings (size, expiration time, ...) from CONFIG file unless initialised by program (OPTIONAL)
##   2. if cache directory size exceeds allowed maximum (checked with "du -k"), delete oldest query results
##      from cache until size has been reduced sufficiently (index file is always deleted first to keep other
##      processes from trying to read data while it is being erased);  if expiration time is set, all files
##      older than the specified limit are removed first
##   3. extend query name with numerical suffix to unique filename if necessary (checking files in index directory)
##   4. create empty file with unique name in index directory to "lock" data file for other processes
##   5. set CQP DataDirectory to data subdirectory, copy named query to unique name and save to disk
##   6. overwrite index file with meta-information (OPTIONAL)
## Recovering the named query result requires the following steps:
##   1. if cache directory and subdirectories do not exist, return "expired" status
##   2. check index directory for unique filename specified by caller, return "expired" if not found
##   3. touch index file so it won't be deleted by another process cleaning up the cache while we're reading it;
##      this also ensures that frequently accessed query results do not expire from the cache (or makes it
##      very unlikely, at least)
##   4. set CQP DataDirectory to data subdirectory and load cached query results from disk into CQP process
##      (force loading with "size <unique_name>;" command)
##   5. return internal name of restored query result to calling process
##
## Running a persistent query involves the following steps:
##   1. create a numeric hash key from the query expression (perhaps also the optional sort clause)
##   2. find all index files that match the given corpus and the hash key (possibly extended with numerical suffix)
##   3. check metadata stored in index files for exact query expression and sort clause
##   4. if found, recover query result with the retrieve() method (which touches the index file)
##      and return unique query name to caller (which may copy it to a simple name if desired)
##   5. otherwise, execute the specified query in the CQP process;  if an index file matching the query
##      string but not the sort clause has been found, the corresponding query result may be loaded instead,
##      which is presumably faster than re-running the query (OPTIONAL, unless sort clause included in hash key)
##   6. sort the query result according to the specified sort clause (otherwise, unsort the result)
##   7. use the store() method to create unique filename and cache query result, storing query expression
##      and sort clause as metadata in index file;  note that it is important to create an empty index file first,
##      to keep other processes (executing the same query) from loading the data file while it is being written
##   8. return name of query result to caller (need not be the unique identifier)
##
## Note that there is no support for a user-defined error handler, as all error conditions are serious
## internal faults and deserve to die (or rather croak).  A named query that has expired from the cache
## is a normal result rather than an error condition and must be expected by the caller.

use Carp;
use DirHandle;
use CWB;
use CWB::CQP;

## CWB::Web::Cache object struct:
##   'cqp'    =>  CWB::CQP object
##   'dir'    =>  cache directory
##   'index'  =>  index subdirectory
##   'data'   =>  data subdirectory
##   'size'   =>  maximum size of data directory (in MBytes, default = 5MB)
##   'expire' =>  time after which query results expire from cache (in hours, default = 24h)

## [...] (lib/CWB/Web/Cache.pm: intervening code not shown in this excerpt)

    $cqp->exec("$name = $unique");
  }
  else {                                        # ... or execute query in query lock mode (highly recommended for CGI scripts)
    if ($subquery) {
      $cqp->exec_query("$name-TEMP = $query");       # run primary query, assign to temporary name, then run subquery
      my ($N) = $cqp->exec("size $name-TEMP");
      if ($N > 0) {
        $cqp->exec("$name-TEMP");
        $cqp->exec_query("$name = $subquery");
        $cqp->exec("$corpus");
      }
      else {                                    # don't run subquery if primary returns no matches
        $cqp->exec("$name = $name-TEMP");
      }
      $cqp->exec("discard $name-TEMP");
    }
    else {
      $cqp->exec_query("$name = $query");            # run primary query only
    }
  }
  my ($N) = $cqp->exec("size $name");
  if ($N > 0) {
    if ($matching_lines < 3) {                  # no match or partial match without appropriate keyword
      if ($keyword eq "") {
        ## if $keyword eq "", make sure to delete keywords that may have been loaded from a partial match,
        ## but only if the result has been loaded from cache with a non-empty set keyword clause
        ## ($matching_lines >= 2), in order to avoid overwriting keyword anchors that have been set
        ## directly in the query
        $cqp->exec("set $name keyword NULL")
          if $matching_lines >= 2;
      }
      else {
        $cqp->exec("set $name keyword $keyword"); # otherwise execute keyword command
      }
    }
    ## sort command cannot have matched in partial match -> execute
    $cqp->exec("sort $name $sort");           # "unsorts" (i.e. sorts in cpos order) if $sort == ""
  }

  # make query result persistent and return its unique name
  return $self->store("$corpus:$name", $query, $subquery, $keyword, $sort);
}


return 1;

__END__


=head1 NAME

CWB::Web::Cache - A simple shared cache for CQP query results

=head1 SYNOPSIS

  use CWB::CQP;
  use CWB::Web::Cache;

  $cqp = CWB::CQP->new;
  $cache = CWB::Web::Cache->new(
    -cqp => $cqp, -cachedir => $dir,
    -cachesize => $cache_size,        # optional
    -cachetime => $expiration_time,   # optional
  );

  # transparently execute and cache simple CQP queries
  $id = $cache->query(-corpus => "DICKENS", -query => '[pos="NN"] "of" "England"');
  ($size) = $cqp->exec("size $id");

  # optional features: sort clause, set keyword, subquery, and maximal number of matches
  $id = $cache->query(
    -corpus => "DICKENS", -query => $query,
    -sort => $sort_clause,
    -keyword => $set_keyword_command,
    -subquery => $subquery,
    -cut => $max_nr_of_matches  # reasonable default calculated from cache size
  );


  ## The functions below are for internal use only and subject to change in future releases!
  $id = $cache->store("DICKENS:Query1");        # activates DICKENS corpus
  $id = $cache->store("DICKENS:Query1", "Metadata line #1", ...);

  $size = $cache->retrieve($id);                # (re-)activates DICKENS corpus
  die 'Sorry, named query has expired from the cache.'
    unless defined $size;
  $cqp->exec("Query1 = $id");                   # copy query result to desired name

  $id = $cache->retrieve("DICKENS:Query", "Metadata line #1", ...);
  die 'Sorry, no named query matching your metadata found in cache.'
    unless defined $id;
  $cqp->exec("Query = $id");


=head1 DESCRIPTION

The B<CWB::Web::Cache> module provides a simple shared caching mechanism
for CQP query results, making them persistent across multiple CQP sessions.
Old data files are deleted automatically, either when they exceed the specified
I<$expiration_time> or when space is needed to keep the cache within the specified
I<$cache_size> limit.

Note that a B<CWB::Web::Cache> handle must be created with a pre-initialised CQP backend (i.e.
a B<CWB::CQP> object), which will be used to access the cache and (re-)run a query when necessary.

Most scripts will access the cache through the B<query()> method, which executes and caches CQP queries
in a fully transparent way (with optional C<sort> clause, C<set keyword> command, subquery,
and C<cut> to limit the maximal number of matches).  After successful execution, the query result is
loaded into the CQP backend, the appropriate corpus is activated, and the I<$id> of the named query is
returned.

The C<sort> clause is executed I<after> a C<set keyword> command 
so that C<keyword> anchors can be used in sorting.

Direct access to cache entries is provided by the low-level methods B<store()> and B<retrieve()>.
Note that these are intended for internal use only and may change in future releases.
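
Because a cached result may expire at any time, a script that keeps an I<$id>
from an earlier session has to be prepared for B<retrieve()> to return C<undef>.
The following is a minimal sketch of one way to handle this, not part of the module:
C<fetch_or_rerun> is a hypothetical helper name, and the code assumes that B<retrieve()>
and B<query()> behave as shown in the L</SYNOPSIS>. Since B<retrieve()> is an internal
method, the sketch may need adjusting in future releases.

  use strict;
  use warnings;

  # Hypothetical helper (not part of CWB::Web::Cache): reuse a cached result
  # if its $id is still valid, otherwise (re-)run the query through the cache.
  sub fetch_or_rerun {
    my ($cache, $id, %query_args) = @_;
    if (defined $id) {
      my $size = $cache->retrieve($id);   # undef means the result has expired
      return $id if defined $size;        # still cached and now loaded into CQP
    }
    return $cache->query(%query_args);    # execute and cache, returns a new $id
  }

  # Usage, assuming $cqp and $cache were created as in the SYNOPSIS and
  # $old_id was saved by an earlier session:
  #   my $id = fetch_or_rerun($cache, $old_id,
  #     -corpus => "DICKENS", -query => '[pos="NN"] "of" "England"');
  #   my ($size) = $cqp->exec("size $id");

In most scripts, calling B<query()> directly is simpler, because it already looks up
matching cache entries and re-executes the query when they have expired.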


=head1 METHODS

B<TODO>


=head1 COPYRIGHT

Copyright (C) 1999-2022 Stephanie Evert [http://purl.org/stephanie.evert]

This software is provided AS IS and the author makes no warranty as to
its use and performance. You may use the software, redistribute and
modify it under the same terms as Perl itself.

=cut
