Alt-CWB-ambs

 view release on metacpan or  search on metacpan

lib/CWB/CQP.pm  view on Meta::CPAN

  ## error handling (messages on stderr)
  $self->{'error_handler'} = undef; # set to subref for user-defined error handler
  $self->{'status'} = 'ok';         # status of last executed command ('ok' or 'error')
  $self->{'error_message'} = [];    # arrayref to array containing message produced by last command (if any)

  ## handling of CQP progress messages
  $self->{'progress'} = 0;             # whether progress messages are activated
  $self->{'progress_handler'} = undef; # optional callback for progress messages
  $self->{'progress_info'} = [];       # contains last available progress information: [$total_percent, $pass, $n_passes, $message, $percent]

  ## debugging (prints more or less everything on stdout)
  $self->{'debug'} = 0;

  ## select vectors for CQP output (stdout, stderr, stdout|stderr)
  $self->{'select_err'} = new IO::Select($err);
  $self->{'select_out'} = new IO::Select($out);
  $self->{'select_any'} = new IO::Select($err, $out);

  ## CQP object setup complete
  bless($self, $class);

  ## the following command will collect and ignore any output which may have been produced during startup

lib/CWB/CQP.pm  view on Meta::CPAN

It is an error to B<run> a new command before the output of the previous command has completely
been processed.

=cut

sub run {
  croak 'USAGE:  $cqp->run($cmd]);'
    unless @_ == 2;
  my $self = shift;
  my $cmd = shift;
  my $debug = $self->{'debug'};

  $cmd =~ s/\n+/ /g;            # make sure there are no newline characters (to be on the safe side)
  $cmd =~ s/(;\s*)+$//;         # ";" will be added when $cmd is sent to CQP

  my $active_cmd = $self->{'command'};
  croak "Error: new CQP command issued while '$active_cmd' is still being processed"
    if $active_cmd;

  $self->{'command'} = "$cmd;";
  $self->{'status'} = 'ok';
  $self->{'buffer'} = "";
  $self->{'lines'} = [];
  $self->{'error_message'} = [];

  print "CQP << $cmd;\n"
    if $debug;
  $self->{'in'}->print("$cmd;\n .EOL.;\n"); # append .EOL. command to mark end of CQP output
}

=item I<$num_of_lines> = I<$cqp>->B<ready>;

=item I<$num_of_lines> = I<$cqp>->B<ready>(I<$timeout>);

Check if output from current CQP command is available for reading with B<getline> etc.,
returning the number of lines currently held in the input buffer (possibly including an
end-of-output marker line that will not be returned by B<getline> etc.).  If there is no

lib/CWB/CQP.pm  view on Meta::CPAN

Returns B<undef> when all output from the current CQP command has been read.

=cut

sub getline {
  croak 'USAGE:  $line = $cqp->getline;'
    unless @_ == 1;
  my $self = shift;
  croak 'CWB::CQP:  $cqp->getline called without active CQP command'
    unless $self->{'command'};
  my $debug = $self->{'debug'};

  $self->_update(-1)            # fill line buffer if necessary (blocking mode)
    unless @{$self->{'lines'}} > 0;

  my $line = shift @{$self->{'lines'}};
  if ($line eq '-::-EOL-::-') { 
    ## special line printed by ".EOL.;" marks end of CQP output
    print "CQP ", "-" x 60, "\n"
      if $debug;
    $self->_eol;
    return undef;               # undef return value marks end of output
  }
  else {
    print "CQP >> $line\n"
      if $debug;
    return $line;
  }
}

=item I<@lines> = I<$cqp>->B<getlines>(I<$n>);

Read I<$n> lines of output from the CQP process, blocking as long as necessary.  An explicit B<undef> element is included at the end of the output of a CQP command.  Note that B<getlines> may return fewer than I<$n> lines if the end of output is reac...

Set C<I<$n> = 0> to read all complete lines currently held in the input buffer (as indicated by the B<ready> method), or specify a negative value to read the complete output of the active CQP command.

lib/CWB/CQP.pm  view on Meta::CPAN

  $cqp->run($cmd);
  @lines = $cqp->getlines(-1);

=cut

sub exec {
  croak 'USAGE:  $cqp->exec($cmd);'
    unless @_ == 2;
  my $self = shift;
  my $cmd = shift;
  my $debug = $self->{'debug'};

  $self->run($cmd);
  my @result = $self->getlines(-1);
  my $eol = pop @result;
  if (defined $eol) {
    die "CWB::CQP:  INTERNAL ERROR in _exec() -- missing 'undef' at end of command output (ignored)";
    push @result, $eol; # seems to be regular line, so push it back onto result list
  }
  return @result;
}

lib/CWB/CQP.pm  view on Meta::CPAN


## INTERNAL: built in error handlers for 'die' and 'ignore' modes
sub _error_handler_die {
  croak "\n", "=+===CWB::CQP ERROR=====\n", (map {" | $_\n"} @_), "=+== occurred";
}

sub _error_handler_ignore {
  # do nothing
}

=item I<$cqp>->B<debug>(1);

=item I<$cqp>->B<debug>(0);

Activate/deactivate debugging mode, which logs all executed commands and their complete output on STDOUT.  The B<debug> method returns the previous status for convenience.

=cut

sub debug {
  croak 'USAGE:  $prev_status = $cqp->debug( 1 | 0 ) ;'
    unless @_ == 2;
  my $self = shift;
  my $on = shift;
  my $prev = $self->{'debug'};
  $self->{'debug'} = $on;
  return $prev;
}

=item I<$cqp>->B<progress_on>;

=item I<$cqp>->B<progress_off>;

=item I<$message> = I<$cqp>->B<progress>;

=item (I<$total>, I<$pass>, I<$n_passes>, I<$msg>, I<$percent>) = I<$cqp>->B<progress_info>;

lib/CWB/Encoder.pm  view on Meta::CPAN



  $bnc = new CWB::Indexer "BNC";
  $bnc = new CWB::Indexer "/path/to/registry:BNC";

  $bnc->group("corpora");     # optional: group and access
  $bnc->perm("640");          # permissions for newly created files

  $bnc->memory(400);          # use up to 400 MB of RAM (default: 75)
  $bnc->validate(0);          # disable validation for faster indexing
  $bnc->debug(1);             # enable debugging output

  $bnc->make("word", "pos");  # build index & compress
  $bnc->makeall;              # process all p-attributes


  $bnc = new CWB::Encoder "BNC";

  $bnc->registry("/path/to/registry");  # will try to guess otherwise
  $bnc->dir("/path/to/data/directory"); # directory for corpus data files
  $bnc->overwrite(1);         # may overwrite existing files / directories

lib/CWB/Encoder.pm  view on Meta::CPAN

  $bnc->s_attributes("s");    # s-attributes in cwb-encode syntax
  $bnc->s_attributes(qw<div0* div1*>);# * = store annotations (-V)
  $bnc->s_attributes("bncDoc:0+id");  # recursion & XML attributes

  $bnc->decode_entities(0);        # don't decode XML entities (with -x flag)
  $bnc->undef_symbol("__UNDEF__"); # mark missing values like cwb-encode

  $bnc->memory(400);          # use up to 400 MB of RAM (default: 75)
  $bnc->validate(0);          # disable validation for faster indexing
  $bnc->verbose(1);           # print some progress information
  $bnc->debug(1);             # enable debugging output

  $bnc->encode(@files);       # encoding, indexing, and compression

  $pipe = $bnc->encode_pipe;  # can also feed input text from Perl script
  while (...) {
    print $pipe "$line\n";
  }
  $bnc->close_pipe;

=head1 DESCRIPTION

lib/CWB/Encoder.pm  view on Meta::CPAN

  my $self = {
              NAME => undef,    # name of the corpus (CWB corpus ID)
              REGISTRY => "",   # -r flag for non-default registry
              FILES => {},      # lookup hash for component filenames
              # $self->{FILES}->{$att}->{$comp} = $pathname;
              TYPES => {},      # attribute types: P / S
              GROUP => undef,   # optional: set group for new files
              PERM => undef,    # optional: set permissions for new files
              MEMORY => 75,     # memory limit for index creation
              VALIDATE => 1,    # enable/disable validation
              DEBUG => 0,       # enable/disable debugging output
             };
  croak 'USAGE:  $c = new CWB::Indexer $corpus_id;'
    unless @_ == 1;
  my $name = shift;
  if ($name =~ /^\s*(.+)\s*:\s*([^:]+)$/) {
    $self->{REGISTRY} = "-r '$1'";
    $name = $2;
  }
  $self->{NAME} = $name;

lib/CWB/Encoder.pm  view on Meta::CPAN

Turn off validation of index and compressed files, which may give 
substantial speed improvements for larger corpora.

=cut

sub validate {
  my ($self, $yesno) = @_;
  $self->{VALIDATE} = $yesno;
}

=item $idx->debug(1);

Activate debugging output (on STDERR). 

=cut

sub debug {
  my ($self, $yesno) = @_;
  $self->{DEBUG} = $yesno;
}

# internal method: get full pathname of a component file
sub filename {
  my ($self, $att, $comp) = @_;
  my $path = $self->{FILES}->{$att}->{$comp};
  croak "CWB::Indexer: can't determine filename for component $att/$comp (aborted).\n"
    unless defined $path;

lib/CWB/Encoder.pm  view on Meta::CPAN


Print some progress information (on STDOUT).

=cut

sub verbose {
  my ($self, $yesno) = @_;
  $self->{VERBOSE} = $yesno;
}

=item $enc->debug(1);

Activate debugging output (on STDERR).

=cut

sub debug {
  my ($self, $yesno) = @_;
  $self->{DEBUG} = $yesno;
  $self->{VERBOSE} = 1          # debugging also activates verbose output
    if $yesno;
}

# internal method: called _before_ running cwb-encode
sub prepare_encode {
  my $self = shift;
  my $overwrite = $self->{OVERWRITE};
  
  my $name = $self->{NAME};     # check that setup is complete
  croak "CWB::Encoder: Corpus ID hasn't been specified (with name() method)\n"

lib/CWB/Encoder.pm  view on Meta::CPAN

  CWB::Shell::Cmd("chgrp $group '$regfile'")
    if $group;

  my $idx = new CWB::Indexer "$reg:".(uc $name); # build indices and compress p-attributes
  $idx->group($group)
    if $group;
  $idx->perm($perm)
    if $perm;
  $idx->memory($self->{MEMORY});
  $idx->validate($self->{VALIDATE});
  $idx->debug($self->{DEBUG});
  print "Building indices and compressing p-attributes ...\n"
    if $self->{VERBOSE};
  $idx->makeall;

}

=back

=cut

script/cwb-make  view on Meta::CPAN

## Purpose: front-end to CWB::Indexer (improved substitute for cwb-makeall)
##
$| = 1;
use strict;
use warnings;

use CWB;
use CWB::Encoder;
use Getopt::Long;

our $Debug = 0;                     # -D, --debug
our $Validate = 0;                  # -V, --validate
our $Memory = 0;                    # -M, --memory  [uses CWB::Indexer default]
our $Registry = undef;              # -r, --registry
our $Group = undef;                 # -g, --group
our $Permissions = undef;           # -p, --permissions
our $Help = 0;                      # -h, --help

my $ok = GetOptions("D|debug" => \$Debug,
                    "V|validate" => \$Validate,
                    "M|memory=i" => \$Memory,
                    "r|registry=s" => \$Registry,
                    "g|group=s" => \$Group,
                    "p|permissions=s" => \$Permissions,
                    "h|help" => \$Help,
                    );

die "\nUsage:  cwb-make [options] CORPUS [<attributes>]\n\n",
  "Options:\n",
  "  -r <dir>  use registry directory <dir> [system default]\n",
  "     --registry=<dir>\n",
  "  -M <n>    use <n> MBytes of RAM for indexing [default: 75]\n",
  "     --memory=<n>\n",
  "  -V        validate newly created files\n",
  "     --validate\n",
  "  -g <name> put newly created files into group <name>\n",
  "     --group=<name>\n",
  "  -p <nnn>  set access permissions of created files to <nnn>\n",
  "     --permissions=<nnn>\n",
  "  -D        activate debugging output\n",
  "     --debug\n",
  "  -h        show this help page\n",
  "     --help\n",
  "\n"
  if $Help or @ARGV == 0 or not $ok;

our $Corpus = shift @ARGV;

our $indexer;
if ($Registry) {
  $indexer = new CWB::Indexer "$Registry:$Corpus";
}
else {
  $indexer = new CWB::Indexer $Corpus;
}

$indexer->group($Group)
  if defined $Group;
$indexer->perm($Permissions)
  if defined $Permissions;
$indexer->debug($Debug);
$indexer->memory($Memory)
  if $Memory > 0;
$indexer->validate($Validate);

if (@ARGV) {
  $indexer->make(@ARGV);
}
else {
  $indexer->makeall;
}

script/cwb-make  view on Meta::CPAN


  cwb-make [options] CORPUS [<attributes>]

Options:

  -r <dir>   use registry directory <dir> [system default]
  -M <n>     use <n> MBytes of RAM for indexing [default: 75]
  -V         validate newly created files
  -g <name>  put newly created files into group <name>
  -p <nnn>   set access permissions of created files to <nnn>
  -D         activate debugging output
  -h         show help page

Long forms of command-line options are listed below.


=head1 DESCRIPTION

The B<cwb-make> utility automates index building and compression for a CWB corpus,
calling B<cwb-makeall>, B<cwb-huffcode> and B<cwb-compress-rdx> as needed.
Main advantages over the manual procedure are:

script/cwb-make  view on Meta::CPAN

have been tested thoroughly by a large user community.

=item B<--group>=I<name>, B<-g> I<name>

=item B<--permissions>=I<ddd>, B<-p> I<ddd>

Set group membership (I<name>) and access permissions (octal code I<ddd>) of
new data files.  If these options are not specified, the system defaults
for newly created files are used.

=item B<--debug>, B<-D>

Activate debugging output.  Note that this is the only form of progress
information provided by B<cwb-make>, so you may want to specify C<-D>
simply in order to get some feedback during the indexing process.

=item B<--help>, B<-h>

Display help page with short usage summary (similar to SYNOPSIS above).

=back


t/20_encode_vss.t  view on Meta::CPAN


$enc->perm("640");              # set non-standard access permissions (but not group)

$enc->p_attributes(qw(word pos lemma)); # declare attributes
$enc->null_attributes("collection");
$enc->s_attributes(qw(story:0+num+title+author+year chapter:0+num p:0 s:0));

$enc->memory(100);              # corpus is very small and should use little memory
$enc->validate(1);              # validate all generated files
$enc->verbose(0);               # don't show any progress messages when running as self test
$enc->debug(0);

our $T0 = time;
eval { $enc->encode($vrt_file) };
ok(! $@, "corpus encoding and indexing"); # T2
our $elapsed = time - $T0;
diag(sprintf "VSS corpus encoded in %.1f seconds", $elapsed);

## now compare all created data files against reference corpus
our $ref_dir = "data/vss";
our $ref_regfile = "data/registry/vss";



( run in 1.160 second using v1.01-cache-2.11-cpan-49f99fa48dc )