view release on metacpan or search on metacpan
lib/CWB/CQP.pm view on Meta::CPAN
## error handling (messages on stderr)
$self->{'error_handler'} = undef; # set to subref for user-defined error handler
$self->{'status'} = 'ok'; # status of last executed command ('ok' or 'error')
$self->{'error_message'} = []; # arrayref to array containing message produced by last command (if any)
## handling of CQP progress messages
$self->{'progress'} = 0; # whether progress messages are activated
$self->{'progress_handler'} = undef; # optional callback for progress messages
$self->{'progress_info'} = []; # contains last available progress information: [$total_percent, $pass, $n_passes, $message, $percent]
## debugging (prints more or less everything on stdout)
$self->{'debug'} = 0;
## select vectors for CQP output (stdout, stderr, stdout|stderr)
$self->{'select_err'} = new IO::Select($err);
$self->{'select_out'} = new IO::Select($out);
$self->{'select_any'} = new IO::Select($err, $out);
## CQP object setup complete
bless($self, $class);
## the following command will collect and ignore any output which may have been produced during startup
lib/CWB/CQP.pm view on Meta::CPAN
It is an error to B<run> a new command before the output of the previous command has completely
been processed.
=cut
sub run {
croak 'USAGE: $cqp->run($cmd]);'
unless @_ == 2;
my $self = shift;
my $cmd = shift;
my $debug = $self->{'debug'};
$cmd =~ s/\n+/ /g; # make sure there are no newline characters (to be on the safe side)
$cmd =~ s/(;\s*)+$//; # ";" will be added when $cmd is sent to CQP
my $active_cmd = $self->{'command'};
croak "Error: new CQP command issued while '$active_cmd' is still being processed"
if $active_cmd;
$self->{'command'} = "$cmd;";
$self->{'status'} = 'ok';
$self->{'buffer'} = "";
$self->{'lines'} = [];
$self->{'error_message'} = [];
print "CQP << $cmd;\n"
if $debug;
$self->{'in'}->print("$cmd;\n .EOL.;\n"); # append .EOL. command to mark end of CQP output
}
=item I<$num_of_lines> = I<$cqp>->B<ready>;
=item I<$num_of_lines> = I<$cqp>->B<ready>(I<$timeout>);
Check if output from current CQP command is available for reading with B<getline> etc.,
returning the number of lines currently held in the input buffer (possibly including an
end-of-output marker line that will not be returned by B<getline> etc.). If there is no
lib/CWB/CQP.pm view on Meta::CPAN
Returns B<undef> when all output from the current CQP command has been read.
=cut
sub getline {
croak 'USAGE: $line = $cqp->getline;'
unless @_ == 1;
my $self = shift;
croak 'CWB::CQP: $cqp->getline called without active CQP command'
unless $self->{'command'};
my $debug = $self->{'debug'};
$self->_update(-1) # fill line buffer if necessary (blocking mode)
unless @{$self->{'lines'}} > 0;
my $line = shift @{$self->{'lines'}};
if ($line eq '-::-EOL-::-') {
## special line printed by ".EOL.;" marks end of CQP output
print "CQP ", "-" x 60, "\n"
if $debug;
$self->_eol;
return undef; # undef return value marks end of output
}
else {
print "CQP >> $line\n"
if $debug;
return $line;
}
}
=item I<@lines> = I<$cqp>->B<getlines>(I<$n>);
Read I<$n> lines of output from the CQP process, blocking as long as necessary. An explicit B<undef> element is included at the end of the output of a CQP command. Note that B<getlines> may return fewer than I<$n> lines if the end of output is reac...
Set C<I<$n> = 0> to read all complete lines currently held in the input buffer (as indicated by the B<ready> method), or specify a negative value to read the complete output of the active CQP command.
lib/CWB/CQP.pm view on Meta::CPAN
$cqp->run($cmd);
@lines = $cqp->getlines(-1);
=cut
sub exec {
croak 'USAGE: $cqp->exec($cmd);'
unless @_ == 2;
my $self = shift;
my $cmd = shift;
my $debug = $self->{'debug'};
$self->run($cmd);
my @result = $self->getlines(-1);
my $eol = pop @result;
if (defined $eol) {
die "CWB::CQP: INTERNAL ERROR in _exec() -- missing 'undef' at end of command output (ignored)";
push @result, $eol; # seems to be regular line, so push it back onto result list
}
return @result;
}
lib/CWB/CQP.pm view on Meta::CPAN
## INTERNAL: built in error handlers for 'die' and 'ignore' modes
sub _error_handler_die {
croak "\n", "=+===CWB::CQP ERROR=====\n", (map {" | $_\n"} @_), "=+== occurred";
}
sub _error_handler_ignore {
# do nothing
}
=item I<$cqp>->B<debug>(1);
=item I<$cqp>->B<debug>(0);
Activate/deactivate debugging mode, which logs all executed commands and their complete output on STDOUT. The B<debug> method returns the previous status for convenience.
=cut
sub debug {
croak 'USAGE: $prev_status = $cqp->debug( 1 | 0 ) ;'
unless @_ == 2;
my $self = shift;
my $on = shift;
my $prev = $self->{'debug'};
$self->{'debug'} = $on;
return $prev;
}
=item I<$cqp>->B<progress_on>;
=item I<$cqp>->B<progress_off>;
=item I<$message> = I<$cqp>->B<progress>;
=item (I<$total>, I<$pass>, I<$n_passes>, I<$msg>, I<$percent>) = I<$cqp>->B<progress_info>;
lib/CWB/Encoder.pm view on Meta::CPAN
$bnc = new CWB::Indexer "BNC";
$bnc = new CWB::Indexer "/path/to/registry:BNC";
$bnc->group("corpora"); # optional: group and access
$bnc->perm("640"); # permissions for newly created files
$bnc->memory(400); # use up to 400 MB of RAM (default: 75)
$bnc->validate(0); # disable validation for faster indexing
$bnc->debug(1); # enable debugging output
$bnc->make("word", "pos"); # build index & compress
$bnc->makeall; # process all p-attributes
$bnc = new CWB::Encoder "BNC";
$bnc->registry("/path/to/registry"); # will try to guess otherwise
$bnc->dir("/path/to/data/directory"); # directory for corpus data files
$bnc->overwrite(1); # may overwrite existing files / directories
lib/CWB/Encoder.pm view on Meta::CPAN
$bnc->s_attributes("s"); # s-attributes in cwb-encode syntax
$bnc->s_attributes(qw<div0* div1*>);# * = store annotations (-V)
$bnc->s_attributes("bncDoc:0+id"); # recursion & XML attributes
$bnc->decode_entities(0); # don't decode XML entities (with -x flag)
$bnc->undef_symbol("__UNDEF__"); # mark missing values like cwb-encode
$bnc->memory(400); # use up to 400 MB of RAM (default: 75)
$bnc->validate(0); # disable validation for faster indexing
$bnc->verbose(1); # print some progress information
$bnc->debug(1); # enable debugging output
$bnc->encode(@files); # encoding, indexing, and compression
$pipe = $bnc->encode_pipe; # can also feed input text from Perl script
while (...) {
print $pipe "$line\n";
}
$bnc->close_pipe;
=head1 DESCRIPTION
lib/CWB/Encoder.pm view on Meta::CPAN
my $self = {
NAME => undef, # name of the corpus (CWB corpus ID)
REGISTRY => "", # -r flag for non-default registry
FILES => {}, # lookup hash for component filenames
# $self->{FILES}->{$att}->{$comp} = $pathname;
TYPES => {}, # attribute types: P / S
GROUP => undef, # optional: set group for new files
PERM => undef, # optional: set permissions for new files
MEMORY => 75, # memory limit for index creation
VALIDATE => 1, # enable/disable validation
DEBUG => 0, # enable/disable debugging output
};
croak 'USAGE: $c = new CWB::Indexer $corpus_id;'
unless @_ == 1;
my $name = shift;
if ($name =~ /^\s*(.+)\s*:\s*([^:]+)$/) {
$self->{REGISTRY} = "-r '$1'";
$name = $2;
}
$self->{NAME} = $name;
lib/CWB/Encoder.pm view on Meta::CPAN
Turn off validation of index and compressed files, which may give
substantial speed improvements for larger corpora.
=cut
sub validate {
my ($self, $yesno) = @_;
$self->{VALIDATE} = $yesno;
}
=item $idx->debug(1);
Activate debugging output (on STDERR).
=cut
sub debug {
my ($self, $yesno) = @_;
$self->{DEBUG} = $yesno;
}
# internal method: get full pathname of a component file
sub filename {
my ($self, $att, $comp) = @_;
my $path = $self->{FILES}->{$att}->{$comp};
croak "CWB::Indexer: can't determine filename for component $att/$comp (aborted).\n"
unless defined $path;
lib/CWB/Encoder.pm view on Meta::CPAN
Print some progress information (on STDOUT).
=cut
sub verbose {
my ($self, $yesno) = @_;
$self->{VERBOSE} = $yesno;
}
=item $enc->debug(1);
Activate debugging output (on STDERR).
=cut
sub debug {
my ($self, $yesno) = @_;
$self->{DEBUG} = $yesno;
$self->{VERBOSE} = 1 # debugging also activates verbose output
if $yesno;
}
# internal method: called _before_ running cwb-encode
sub prepare_encode {
my $self = shift;
my $overwrite = $self->{OVERWRITE};
my $name = $self->{NAME}; # check that setup is complete
croak "CWB::Encoder: Corpus ID hasn't been specified (with name() method)\n"
lib/CWB/Encoder.pm view on Meta::CPAN
CWB::Shell::Cmd("chgrp $group '$regfile'")
if $group;
my $idx = new CWB::Indexer "$reg:".(uc $name); # build indices and compress p-attributes
$idx->group($group)
if $group;
$idx->perm($perm)
if $perm;
$idx->memory($self->{MEMORY});
$idx->validate($self->{VALIDATE});
$idx->debug($self->{DEBUG});
print "Building indices and compressing p-attributes ...\n"
if $self->{VERBOSE};
$idx->makeall;
}
=back
=cut
script/cwb-make view on Meta::CPAN
## Purpose: front-end to CWB::Indexer (improved substitute for cwb-makeall)
##
$| = 1;
use strict;
use warnings;
use CWB;
use CWB::Encoder;
use Getopt::Long;
our $Debug = 0; # -D, --debug
our $Validate = 0; # -V, --validate
our $Memory = 0; # -M, --memory [uses CWB::Indexer default]
our $Registry = undef; # -r, --registry
our $Group = undef; # -g, --group
our $Permissions = undef; # -p, --permissions
our $Help = 0; # -h, --help
my $ok = GetOptions("D|debug" => \$Debug,
"V|validate" => \$Validate,
"M|memory=i" => \$Memory,
"r|registry=s" => \$Registry,
"g|group=s" => \$Group,
"p|permissions=s" => \$Permissions,
"h|help" => \$Help,
);
die "\nUsage: cwb-make [options] CORPUS [<attributes>]\n\n",
"Options:\n",
" -r <dir> use registry directory <dir> [system default]\n",
" --registry=<dir>\n",
" -M <n> use <n> MBytes of RAM for indexing [default: 75]\n",
" --memory=<n>\n",
" -V validate newly created files\n",
" --validate\n",
" -g <name> put newly created files into group <name>\n",
" --group=<name>\n",
" -p <nnn> set access permissions of created files to <nnn>\n",
" --permissions=<nnn>\n",
" -D activate debugging output\n",
" --debug\n",
" -h show this help page\n",
" --help\n",
"\n"
if $Help or @ARGV == 0 or not $ok;
our $Corpus = shift @ARGV;
our $indexer;
if ($Registry) {
$indexer = new CWB::Indexer "$Registry:$Corpus";
}
else {
$indexer = new CWB::Indexer $Corpus;
}
$indexer->group($Group)
if defined $Group;
$indexer->perm($Permissions)
if defined $Permissions;
$indexer->debug($Debug);
$indexer->memory($Memory)
if $Memory > 0;
$indexer->validate($Validate);
if (@ARGV) {
$indexer->make(@ARGV);
}
else {
$indexer->makeall;
}
script/cwb-make view on Meta::CPAN
cwb-make [options] CORPUS [<attributes>]
Options:
-r <dir> use registry directory <dir> [system default]
-M <n> use <n> MBytes of RAM for indexing [default: 75]
-V validate newly created files
-g <name> put newly created files into group <name>
-p <nnn> set access permissions of created files to <nnn>
-D activate debugging output
-h show help page
Long forms of command-line options are listed below.
=head1 DESCRIPTION
The B<cwb-make> utility automates index building and compression for a CWB corpus,
calling B<cwb-makeall>, B<cwb-huffcode> and B<cwb-compress-rdx> as needed.
Main advantages over the manual procedure are:
script/cwb-make view on Meta::CPAN
have been tested thoroughly by a large user community.
=item B<--group>=I<name>, B<-g> I<name>
=item B<--permissions>=I<ddd>, B<-p> I<ddd>
Set group membership (I<name>) and access permissions (octal code I<ddd>) of
new data files. If these options are not specified, the system defaults
for newly created files are used.
=item B<--debug>, B<-D>
Activate debugging output. Note that this is the only form of progress
information provided by B<cwb-make>, so you may want to specify C<-D>
simply in order to get some feedback during the indexing process.
=item B<--help>, B<-h>
Display help page with short usage summary (similar to SYNOPSIS above).
=back
t/20_encode_vss.t view on Meta::CPAN
$enc->perm("640"); # set non-standard access permissions (but not group)
$enc->p_attributes(qw(word pos lemma)); # declare attributes
$enc->null_attributes("collection");
$enc->s_attributes(qw(story:0+num+title+author+year chapter:0+num p:0 s:0));
$enc->memory(100); # corpus is very small and should use little memory
$enc->validate(1); # validate all generated files
$enc->verbose(0); # don't show any progress messages when running as self test
$enc->debug(0);
our $T0 = time;
eval { $enc->encode($vrt_file) };
ok(! $@, "corpus encoding and indexing"); # T2
our $elapsed = time - $T0;
diag(sprintf "VSS corpus encoded in %.1f seconds", $elapsed);
## now compare all created data files against reference corpus
our $ref_dir = "data/vss";
our $ref_regfile = "data/registry/vss";