validate results from the CPAN

Alt-CWB-ambs

lib/CWB/CEQL/Parser.pm view on Meta::CPAN

called a B<D>eterministic B<P>erl B<P>arser (B<DPP>).  This parsing algorithm
was designed specifically for automatic translation of simplified, user-friendly
query and scripting languages (such as the B<C>ommon B<E>lementary B<Q>uery
B<L>anguage provided by B<CWB::CEQL>) into low-level code (e.g. B<CQP> syntax).

The DPP architecture was motivated by the observation that simplified queries
are often very similar in structure to the corresponding low-level queries,
and that many authors use cascaded regular expression substitutions to
transform one into the other.  While such cascades are very easy to write in
Perl and perform efficiently, they have two important limitations: they cannot
easily (i) validate and transform recursive structures, or (ii) restrict a
particular transformation to a certain scope.  Because of these
limitations, incorrect user input -- and sometimes even correct input -- leads
to malformed low-level queries.  Without an intimate knowledge of the
implementation, it is often impossible to guess the true location of the
problem from the cryptic error messages generated by the backend processor.
Moreover, simplified query languages based on regular expression substitution
typically have rather limited expressiveness and flexibility (because the
substitutions are applied unconditionally, so symbols cannot have different
meanings in different contexts).
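
The limitations described above can be illustrated with a minimal sketch.  The
toy wildcard language and the C<cascade> function below are hypothetical and
are not part of B<CWB::CEQL>; they only show how unconditionally applied
substitutions let problematic input pass through without any validation.

```perl
use strict;
use warnings;

# Hypothetical toy cascade (not taken from CWB::CEQL): translate simple
# wildcards into regular expression syntax by unconditional substitution.
sub cascade {
  my ($query) = @_;
  $query =~ s/\?/./g;     # ? stands for any single character
  $query =~ s/\*/.*/g;    # * stands for any string
  return $query;
}

print cascade('w?nt*'), "\n";       # intended use of both wildcards
print cascade('why\?'), "\n";       # escaped ? silently becomes an escaped dot
print cascade('(go|went*'), "\n";   # unbalanced bracket is passed on unchecked
```

In the second call the backslash was meant to protect the question mark, but
the substitution has no notion of context and rewrites it anyway.  In the third
call the missing closing bracket is only noticed by whatever backend receives
the translated string.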

lib/CWB/CEQL/Parser.pm view on Meta::CPAN

is either a wildcard expression or regular expression, according to the DPP
rules defined above), and returns an equivalent CQP query.

  sub wordform_sequence {
    my ($self, $input) = @_;
    my @items = split " ", $input;
    my @cqp_patterns = $self->Apply("wordform_pattern", @items);
    return "@cqp_patterns";
  }

Recall that the list returned by B<Apply> does not have to be validated: if
any error occurs, the respective subrule will B<die> and abort the complete
parse.
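
The error propagation described above can be sketched in a self-contained
way.  The functions C<apply_rule> and C<wordform_pattern> below are
hypothetical stand-ins written for this illustration; they only mimic the
behaviour described for B<Apply> and its subrules and are not the
B<CWB::CEQL::Parser> implementation.

```perl
use strict;
use warnings;

# Hypothetical subrule: translate one wildcard token into a CQP-style
# pattern, or die if the token contains unsupported characters.
sub wordform_pattern {
  my ($item) = @_;
  die "invalid wordform: $item\n" unless $item =~ /^[\w*?]+$/;
  (my $regexp = $item) =~ s/([*?])/$1 eq "*" ? ".*" : "."/ge;
  return qq{[word="$regexp"]};
}

# Hypothetical stand-in for Apply: run the rule on every item.  There is no
# error checking here because a failing subrule dies and aborts everything.
sub apply_rule {
  my ($rule, @items) = @_;
  return map { $rule->($_) } @items;
}

sub wordform_sequence {
  my ($input) = @_;
  my @items = split " ", $input;
  my @cqp_patterns = apply_rule(\&wordform_pattern, @items);
  return "@cqp_patterns";
}

print wordform_sequence("the ?at s*"), "\n";

# The caller only needs a single eval around the complete parse.
my $ok = eval { wordform_sequence("the c[at"); 1 };
print "parse aborted: $@" unless $ok;
```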

=head2 The shift-reduce parser for nested bracketing

The B<Apply> method is more than a convenient shorthand for parsing lists of
constituents.  Its main purpose is to parse nested bracketing structures,
which are very common in the syntax of formal languages (examples include
arithmetical formulae, regular expressions and most computer programming
languages).  When parsing the constituents of a list with nested bracketing,

lib/CWB/CEQL/Parser.pm view on Meta::CPAN


The obvious drawback of this approach is the difficulty of signaling the
precise location of a syntax error to the user (in the example grammar above,
the parser will simply print C<syntax error> if there is any problem in a
sequence of terms and operators).  By the time the error is detected, all
items in the active group have already been pre-processed and subexpressions
have been collapsed.  Printing the current list of terms and operators would
only add to the user's confusion.

In order to signal errors immediately where they occur, each item should be
validated before it is added to the result list (e.g. an operator may not be
pushed as first item on a result list), and the reduce operation (C<< Term Op
Term => Term >>) should be applied as soon as possible.  The rule
C<arithmetic_item> needs direct access to the currently active result list for
this purpose: (1) to check how many items have already been pushed when
validating a new item, and (2) to reduce a sequence C<Term Op Term> to a single
C<Term> in the result list.
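
The two steps above, validating each item against the current length of the
result list and reducing as early as possible, can be sketched as follows.
This is a standalone illustration under stated assumptions: the functions
C<push_item>, C<reduce> and C<evaluate> are hypothetical, terms are restricted
to non-negative integers, and operators are evaluated strictly from left to
right without precedence.  It does not use B<currentGroup> or any other part
of the B<CWB::CEQL::Parser> API.

```perl
use strict;
use warnings;

# Reduce the sequence Term Op Term on the result list to a single Term.
sub reduce {
  my ($group) = @_;
  my ($x, $op, $y) = splice @$group, 0, 3;
  my %calc = (
    "+" => sub { $_[0] + $_[1] },
    "-" => sub { $_[0] - $_[1] },
    "*" => sub { $_[0] * $_[1] },
    "/" => sub { $_[0] / $_[1] },
  );
  push @$group, $calc{$op}->($x, $y);
}

# Validate one item before it is pushed, then reduce as soon as possible.
# With immediate reduction the list holds either nothing, a single Term,
# or a Term followed by an operator.
sub push_item {
  my ($group, $item) = @_;
  my $n = @$group;                      # number of items already pushed
  if ($item =~ m{^[-+*/]$}) {
    die "syntax error: operator $item must follow a term\n" unless $n == 1;
    push @$group, $item;
  }
  elsif ($item =~ /^\d+$/) {
    die "syntax error: term $item must follow an operator\n" if $n == 1;
    push @$group, $item;
    reduce($group) if @$group == 3;
  }
  else {
    die "syntax error: unexpected item $item\n";
  }
}

sub evaluate {
  my ($input) = @_;
  my @group;                            # the active result list
  push_item(\@group, $_) for split " ", $input;
  die "syntax error: incomplete expression\n" unless @group == 1;
  return $group[0];
}

print evaluate("2 + 3 * 4"), "\n";      # evaluated as (2 + 3) * 4
eval { evaluate("2 + * 4") };
print "error: $@";                      # names the offending operator
```

Because every item is checked at the moment it is pushed, the error message
can name the offending item instead of reporting a generic failure after the
whole sequence has been collected.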

A pointer to the currently active result list is obtained with the internal
B<currentGroup> method, allowing a grammar rule to manipulate the result list.
The B<proximity queries> in the B<CWB::CEQL> grammar illustrate this advanced

lib/CWB/Encoder.pm view on Meta::CPAN

  use CWB::Encoder;


  $bnc = new CWB::Indexer "BNC";
  $bnc = new CWB::Indexer "/path/to/registry:BNC";

  $bnc->group("corpora");     # optional: group and access
  $bnc->perm("640");          # permissions for newly created files

  $bnc->memory(400);          # use up to 400 MB of RAM (default: 75)
  $bnc->validate(0);          # disable validation for faster indexing
  $bnc->debug(1);             # enable debugging output

  $bnc->make("word", "pos");  # build index & compress
  $bnc->makeall;              # process all p-attributes


  $bnc = new CWB::Encoder "BNC";

  $bnc->registry("/path/to/registry");  # will try to guess otherwise
  $bnc->dir("/path/to/data/directory"); # directory for corpus data files

lib/CWB/Encoder.pm view on Meta::CPAN

  $bnc->p_attributes(qw<pos lemma>);  # may be called repeatedly
  $bnc->null_attributes("teiHeader"); # declare null atts (ignored)
  $bnc->s_attributes("s");    # s-attributes in cwb-encode syntax
  $bnc->s_attributes(qw<div0* div1*>);# * = store annotations (-V)
  $bnc->s_attributes("bncDoc:0+id");  # recursion & XML attributes

  $bnc->decode_entities(0);        # don't decode XML entities (with -x flag)
  $bnc->undef_symbol("__UNDEF__"); # mark missing values like cwb-encode

  $bnc->memory(400);          # use up to 400 MB of RAM (default: 75)
  $bnc->validate(0);          # disable validation for faster indexing
  $bnc->verbose(1);           # print some progress information
  $bnc->debug(1);             # enable debugging output

  $bnc->encode(@files);       # encoding, indexing, and compression

  $pipe = $bnc->encode_pipe;  # can also feed input text from Perl script
  while (...) {
    print $pipe "$line\n";
  }
  $bnc->close_pipe;

lib/CWB/Encoder.pm view on Meta::CPAN


use CWB;
use Carp;

# makefile-like rules for creating / updating components
#   TRIGGER .. update component when one of these comps exists & is newer
#   NEEDED  .. components required by command below
#   CREATES .. these files will be created by COMMAND
#   COMMAND .. shell command to create this component 
#              interpolates '#C' (corpus id), '#A' (attribute name), '#R' (registry flag), 
#                           '#M' (memory limit), '#T' (no validate), '#V' (validate)
#              (issues "can't create" error message if COMMAND starts with "ERROR")
#   DELETE  .. delete these components when target exists or has been created
our %RULES =
  (
   DIR => {
           TRIGGER => [],
           NEEDED  => [],
           CREATES => [],
           COMMAND => "ERROR: Corpus data directory must be created manually.",
           DELETE  => [],

lib/CWB/Encoder.pm view on Meta::CPAN


=cut

sub memory {
  my ($self, $mem) = @_;
  croak "CWB::Indexer:  memory limit ($mem) must be positive integer number (aborted).\n"
    unless $mem =~ /^[1-9][0-9]*$/;
  $self->{MEMORY} = $mem;
}

=item $idx->validate(0);

Turn off validation of index and compressed files, which may give 
substantial speed improvements for larger corpora.

=cut

sub validate {
  my ($self, $yesno) = @_;
  $self->{VALIDATE} = $yesno;
}

=item $idx->debug(1);

Activate debugging output (on STDERR). 

=cut

lib/CWB/Encoder.pm view on Meta::CPAN


=cut

sub memory {
  my ($self, $mem) = @_;
  croak "CWB::Encoder: memory limit ($mem) must be positive integer number (aborted).\n"
    unless $mem =~ /^[1-9][0-9]*$/;
  $self->{MEMORY} = $mem;
}

=item $enc->validate(0);

Turn off validation of index and compressed files, which may give 
substantial speed improvements for larger corpora.

=cut

sub validate {
  my ($self, $yesno) = @_;
  $self->{VALIDATE} = $yesno;
}

=item $enc->decode_entities(0);

Whether B<cwb-encode> is allowed to decode XML entities and skip XML 
comments (with the C<-x> option).  Set this option to false if you
want an HTML-compatible encoding of the CWB corpus that does not need
to be converted before display in a Web browser.

lib/CWB/Encoder.pm view on Meta::CPAN

    if $perm;
  CWB::Shell::Cmd("chgrp $group '$regfile'")
    if $group;

  my $idx = new CWB::Indexer "$reg:".(uc $name); # build indices and compress p-attributes
  $idx->group($group)
    if $group;
  $idx->perm($perm)
    if $perm;
  $idx->memory($self->{MEMORY});
  $idx->validate($self->{VALIDATE});
  $idx->debug($self->{DEBUG});
  print "Building indices and compressing p-attributes ...\n"
    if $self->{VERBOSE};
  $idx->makeall;

}

=back

=cut

script/cwb-make view on Meta::CPAN

##
$| = 1;
use strict;
use warnings;

use CWB;
use CWB::Encoder;
use Getopt::Long;

our $Debug = 0;                     # -D, --debug
our $Validate = 0;                  # -V, --validate
our $Memory = 0;                    # -M, --memory  [uses CWB::Indexer default]
our $Registry = undef;              # -r, --registry
our $Group = undef;                 # -g, --group
our $Permissions = undef;           # -p, --permissions
our $Help = 0;                      # -h, --help

my $ok = GetOptions("D|debug" => \$Debug,
                    "V|validate" => \$Validate,
                    "M|memory=i" => \$Memory,
                    "r|registry=s" => \$Registry,
                    "g|group=s" => \$Group,
                    "p|permissions=s" => \$Permissions,
                    "h|help" => \$Help,
                    );

die "\nUsage:  cwb-make [options] CORPUS [<attributes>]\n\n",
  "Options:\n",
  "  -r <dir>  use registry directory <dir> [system default]\n",
  "     --registry=<dir>\n",
  "  -M <n>    use <n> MBytes of RAM for indexing [default: 75]\n",
  "     --memory=<n>\n",
  "  -V        validate newly created files\n",
  "     --validate\n",
  "  -g <name> put newly created files into group <name>\n",
  "     --group=<name>\n",
  "  -p <nnn>  set access permissions of created files to <nnn>\n",
  "     --permissions=<nnn>\n",
  "  -D        activate debugging output\n",
  "     --debug\n",
  "  -h        show this help page\n",
  "     --help\n",
  "\n"
  if $Help or @ARGV == 0 or not $ok;

script/cwb-make view on Meta::CPAN

  $indexer = new CWB::Indexer $Corpus;
}

$indexer->group($Group)
  if defined $Group;
$indexer->perm($Permissions)
  if defined $Permissions;
$indexer->debug($Debug);
$indexer->memory($Memory)
  if $Memory > 0;
$indexer->validate($Validate);

if (@ARGV) {
  $indexer->make(@ARGV);
}
else {
  $indexer->makeall;
}

__END__

script/cwb-make view on Meta::CPAN

cwb-make - Automated indexing and compression for CWB corpora

=head1 SYNOPSIS

  cwb-make [options] CORPUS [<attributes>]

Options:

  -r <dir>   use registry directory <dir> [system default]
  -M <n>     use <n> MBytes of RAM for indexing [default: 75]
  -V         validate newly created files
  -g <name>  put newly created files into group <name>
  -p <nnn>   set access permissions of created files to <nnn>
  -D         activate debugging output
  -h         show help page

Long forms of command-line options are listed below.


=head1 DESCRIPTION

script/cwb-make view on Meta::CPAN

specified by the C<CORPUS_REGISTRY> environment variable).

=item B<--memory>=I<n>, B<-M> I<n>

Use approx. I<n> megabytes (MiB) of RAM for indexing.  The default of 75 MiB
is safe even for computers with a small amount of memory or many concurrent users.
If more RAM is available, indexing can be sped up considerably by setting a
higher memory limit.  For instance, C<-M 500> or C<-M 1000> is a good choice on
a machine with 2 GiB of RAM and a low workload.

=item B<--validate>, B<-V>

Validate newly created data files (index files and compressed corpus data).
This is normally not required, as the CWB indexing and compression algorithms
have been tested thoroughly by a large user community.

=item B<--group>=I<name>, B<-g> I<name>

=item B<--permissions>=I<ddd>, B<-p> I<ddd>

Set group membership (I<name>) and access permissions (octal code I<ddd>) of

t/20_encode_vss.t view on Meta::CPAN

$enc->charset("latin1");
$enc->language("en");

$enc->perm("640");              # set non-standard access permissions (but not group)

$enc->p_attributes(qw(word pos lemma)); # declare attributes
$enc->null_attributes("collection");
$enc->s_attributes(qw(story:0+num+title+author+year chapter:0+num p:0 s:0));

$enc->memory(100);              # corpus is very small and should use little memory
$enc->validate(1);              # validate all generated files
$enc->verbose(0);               # don't show any progress messages when running as self test
$enc->debug(0);

our $T0 = time;
eval { $enc->encode($vrt_file) };
ok(! $@, "corpus encoding and indexing"); # T2
our $elapsed = time - $T0;
diag(sprintf "VSS corpus encoded in %.1f seconds", $elapsed);

## now compare all created data files against reference corpus
