Alt-CWB-ambs

script/cwb-make


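# first non-option argument is the corpus ID; any remaining ones are attribute names
our $Corpus = shift @ARGV;

# create the indexer, prefixing the corpus ID with the registry directory if one was given
our $indexer;
if ($Registry) {
  $indexer = CWB::Indexer->new("$Registry:$Corpus");
}
else {
  $indexer = CWB::Indexer->new($Corpus);
}

# pass command-line settings on to the indexer
$indexer->group($Group)
  if defined $Group;
$indexer->perm($Permissions)
  if defined $Permissions;
$indexer->debug($Debug);
$indexer->memory($Memory)
  if $Memory > 0;
$indexer->validate($Validate);

# process only the listed attributes, or all attributes if none were given
if (@ARGV) {
  $indexer->make(@ARGV);
}
else {
  $indexer->makeall;
}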

__END__

=head1 NAME

cwb-make - Automated indexing and compression for CWB corpora

=head1 SYNOPSIS

  cwb-make [options] CORPUS [<attributes>]

Options:

  -r <dir>   use registry directory <dir> [system default]
  -M <n>     use <n> MBytes of RAM for indexing [default: 75]
  -V         validate newly created files
  -g <name>  put newly created files into group <name>
  -p <nnn>   set access permissions of created files to <nnn>
  -D         activate debugging output
  -h         show help page

Long forms of command-line options are listed below.


=head1 DESCRIPTION

The B<cwb-make> utility automates index building and compression for a CWB corpus,
calling B<cwb-makeall>, B<cwb-huffcode> and B<cwb-compress-rdx> as needed.
The main advantages over the manual procedure are:

=over 4

=item *

Old index files are updated automatically (unlike B<cwb-makeall>, which does
not check the age of index files), and it is safe to call B<cwb-make> on an
indexed and compressed corpus (again, unlike B<cwb-makeall>).

=item *

Data files that are no longer needed after compression are immediately deleted.

=item *

The build process is optimised to reduce the amount of temporary disk space and
memory needed.  This is particularly important when indexing large corpora on
32-bit platforms, where B<cwb-makeall> might easily run out of address space when
called directly.

=back

The basic usage pattern is

S< > B<cwb-make> [I<options>] CORPUS [I<attribute> ...]

where CORPUS is the CWB name (ID) of the corpus to be indexed (after encoding
with B<cwb-encode>) and should be written in upper case.  If positional attributes
are added at a later time, they can be indexed separately by specifying the
attribute names after the corpus ID.  Note that it is always safe simply to call
B<cwb-make>: existing indexed and compressed attributes will be ignored.
Further command-line options are detailed below.

B<cwb-make> is a minimal front-end to the B<CWB::Indexer> functionality provided
by the B<CWB::Encoder> module, which can also be used directly from a Perl script.
See the L<CWB::Encoder/"CWB::Indexer METHODS"> manpage for further information.
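
As a rough sketch of such direct use, the following script mirrors what
B<cwb-make> itself does, using only the B<CWB::Indexer> methods called by this
script.  The corpus ID C<EXAMPLE>, the attribute name C<lemma> and the memory
limit of 500 MiB are hypothetical placeholders, not values taken from any real
corpus.

  use strict;
  use warnings;
  use CWB::Encoder;            # provides the CWB::Indexer class

  # EXAMPLE is a placeholder corpus ID; write "registry_dir:EXAMPLE"
  # to select a non-default registry, as cwb-make does for -r
  my $indexer = CWB::Indexer->new("EXAMPLE");

  $indexer->memory(500);       # approx. RAM limit in MiB (assumed value)
  $indexer->validate(1);       # validate newly created files, like -V
  $indexer->makeall;           # process all attributes of the corpus

  # to process selected attributes only (hypothetical attribute name):
  # $indexer->make("lemma");

The group and permission settings (C<group> and C<perm> methods) are omitted
here and can be added in the same way if needed.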


=head1 COMMAND-LINE OPTIONS

=over 4

=item B<--registry>=I<dir>, B<-r> I<dir>

Use registry directory I<dir> instead of the standard registry (CWB default or
specified by the C<CORPUS_REGISTRY> environment variable).

=item B<--memory>=I<n>, B<-M> I<n>

Use approx. I<n> megabytes (MiB) of RAM for indexing.  The default of 75 MiB
is safe even for computers with a small amount of memory or many concurrent users.
If more RAM is available, indexing can be sped up considerably by setting a
higher memory limit.  For instance, C<-M 500> or C<-M 1000> is a good choice on
a machine with 2 GiB of RAM and a low workload.

=item B<--validate>, B<-V>

Validate newly created data files (index files and compressed corpus data).
This is normally not required, as the CWB indexing and compression algorithms
have been tested thoroughly by a large user community.

=item B<--group>=I<name>, B<-g> I<name>

=item B<--permissions>=I<ddd>, B<-p> I<ddd>

Set group membership (I<name>) and access permissions (octal code I<ddd>) of


