Alt-CWB-ambs

 view release on metacpan or  search on metacpan

lib/CWB.pm  view on Meta::CPAN

      unless $comm =~ /^(\#|\s)/;
    print $fh "\t#$comm";
  }
  print $fh "\n";
}

# helper function: write HOME or INFO path as double-quoted string if it is not a simple ID
sub _quote_path ( $ ) {
  my $path = shift;
  if ($path !~ /^[A-Za-z0-9_\-\/][A-Za-z0-9_\.\-\/]+$/) {
    $path =~ s/"/\\"/g; # escape all literal double quotes
    $path = "\"$path\"";
  }
  return $path;
}

=item $reg->write($filename);

Write registry file to disk in canonical format. I<$filename> has to be a full
absolute or relative path.  For safety reasons, the B<write()> method does
I<not> automatically save a file in the default registry directory.  Make sure

lib/CWB/CEQL.pm  view on Meta::CPAN

Evaluate complete token expression with word form (or lemma) constraint and or
part-of-speech (or simple POS) constraint.  The two parts of the token
expression are passed on to C<word_or_lemma_constraint> and C<pos_constraint>,
respectively.  This rule returns a CQP token expression enclosed in square
brackets.

=cut

sub token_expression {
  my ($self, $input) = @_;
  my @parts = split /(?<!\\)_/, $input; # split input on unescaped underscores
  die "only a single ''_'' separator allowed between word form and POS constraint (use ''\\_'' to match literal underscore)\n"
    if @parts > 2;
  my ($word, $pos) = @parts;
  $word = "" unless defined $word;
  $pos = "" unless defined $pos;
  my ($cqp_word, $cqp_pos) = (undef, undef);
  if ($word ne "" and           # optimise *_ITJ to _ITJ (to avoid word form constraint matching all words)
      not ($word =~ /^[+*]$/ and $pos ne "")) {
    $cqp_word = $self->Call("word_or_lemma_constraint", $word);
  }

lib/CWB/CEQL.pm  view on Meta::CPAN

rule.

=cut

sub wildcard_pattern {
  my ($self, $input) = @_;
  die "literal backslash ''\\\\'' is not allowed in wildcard pattern '' $input '')\n"
    if $input =~ /\\\\/; # / (temporary workaround: TextMate is confused by the regexp)
  die "wildcard pattern must not end in a backslash ('' $input '')\n"
    if $input =~ /\\$/;
  ## add whitespace around (unescaped) wildcard metacharacters
  $input =~ s/(?<!\\)([?*+\[,\]])/ $1 /g;
  $input =~ s/(\\[aAlLuUdDwW])/ $1 /g;
  ## trim whitespace, then split wildcard pattern on whitespace into items
  $input =~ s/^\s+//;
  $input =~ s/\s+$//;
  my @items = split " ", $input;
  die "empty wildcard pattern '' $_[1] '' is not allowed\n"
    unless @items > 0;
  my @regexp_comps = $self->Apply("wildcard_item", @items);
  return '"'.join("", @regexp_comps).'"';

lib/CWB/CEQL.pm  view on Meta::CPAN

    return $self->Call("literal_string", $item);
  }
}

=item C<literal_string>

Translate literal string into regular expression, escaping all metacharacters
with backslashes (backslashes in the input string are removed first).

Note that escaping of C<^> and C<"> isn't fully reliable because CQP might
interpret the resulting escape sequences as latex-style accents if they are
followed by certain letters.  Future versions of CQP should provide a safer
escaping mechanism and/or allow interpretation of latex-style accents to be
turned off.

=cut

sub literal_string {
  my ($self, $input) = @_;
  $input =~ s/\\//g; # remove backslashes (used to escape CEQL metacharacters)
  $input =~ s/([.?*+|(){}\[\]"\^\$])/\\$1/g; # " needs to be escaped because CQP regexp will be enclosed in double quotes
  return $input;
}

=back


=head2 Internal Subroutines

=over 4

lib/CWB/CEQL/Parser.pm  view on Meta::CPAN

Technically, a B<DPP grammar> is a subclass of B<CWB::CEQL::Parser>, which
defines B<DPP rules> in the form of Perl B<methods>, and inherits parsing and
housekeeping methods from the base class.  Instantiating such a grammar class
yields an independent parser object.

By convention, the names of B<rule methods> are written in lowercase with
underscores (e.g., C<word_and_pos>), B<methods for users and grammar writers>
are written in mixed case (e.g., C<Parse> or C<SetParam>), and B<internal
methods> are written in mixed case starting with a lowercase letter (e.g.,
B<formatHtmlString>).  If you need to define helper subroutines in your grammar
class, their names should begin with an underscore (e.g., C<_escape_regexp>)
to avoid confusion with grammar rules.  The C<default> rule has to be
implemented by all grammars and will be applied to an input string if no
constituent type is specified.  The basic skeleton of a DPP grammar therefore
looks like this:

  package MyGrammar;
  use base 'CWB::CEQL::Parser';

  sub some_rule {
    ## body of grammar rule "some_rule" goes here

lib/CWB/CEQL/Parser.pm  view on Meta::CPAN

  sub wildcard_expression {
    my ($self, $input) = @_;
    die "the wildcard expression ''$input'' contains invalid characters\n"
      unless $input =~ /^[A-Za-z0-9?*-]+$/;
    my $regexp = $input;
    $regexp =~ s/\?/./g;
    $regexp =~ s/\*/.*/g;
    return $regexp;
  }

Alternatively, the rule could escape all regular expression metacharacters in
the input string so they are matched literally by the regular expression.
This version of the grammar might use an internal subroutine for translating
strings with wildcards safely into regular expressions:

  sub wildcard_expression {
    my ($self, $input) = @_;
    return _wildcard_to_regexp($input);
  }

  # note leading underscore for internal subroutine (this is not a method!)
  sub _wildcard_to_regexp {
    my $s = quotemeta(shift);
    $s =~ s/\\[?]/./g;  # wildcards will also have been escaped with a backslash
    $s =~ s/\\([*+])/.$1/g;  # works for wildcards * and +
    return $s;
  }

=head2 Handling parse errors

DPP rules should always carry out strict checks to ensure that their input is
a well-formed constituent of the required type, and B<die> with a clear and
informative error message otherwise.  This helps users to locate and correct
syntax errors in their input.  If errors are caught too late, i.e. in deeply

t/40_ceql_parser.t  view on Meta::CPAN

use base 'CWB::CEQL::Parser';

sub wildcard_expression {
  my ($self, $input) = @_;
  return _wildcard_to_regexp($input);
}

# note leading underscore for internal subroutine (this is not a method!)
sub _wildcard_to_regexp {
  my $s = quotemeta(shift);
  $s =~ s/\\[?]/./g;  # wildcards will also have been escaped with a backslash
  $s =~ s/\\([*+])/.$1/g;  # works for wildcards * and +
  return $s;
}

sub wordform_pattern {
  my ($self, $input) = @_;
  die "the wordform pattern ''$input'' must not contain whitespace or double quotes\n"
    if $input =~ /\s|\"/;
  if ($input =~ /^\/(.+)\/$/) {
    my $regexp = $1; # regular expression query: simply wrap in double quotes

t/44_ceql_bncweb.t  view on Meta::CPAN

  $unicode =~ s/\&hearts;/\x{2665}/g;
  $unicode =~ s/\&delta;/\x{03B4}/g;
  # end of dummy rules
  my $cqp_unicode = $self->Call("ceql_query", $unicode);
  return encode("iso-8859-1", $cqp_unicode, Encode::FB_CROAK);
}

# override literal_string rule to insert HTML entities (for non-Latin-1 characters and special treatment of ")
sub literal_string {
  my ($self, $input) = @_;
  $input =~ s/\\//g; # remove backslashes (used to escape CEQL metacharacters)
  ##-- # the real implementation uses the HTML::Entities module to insert HTML entities
  ##-- encode_entities($input, '<>&');            # unsafe characters <, >, & are HTML entities in the canonical BNCweb encoding
  ##-- encode_entities($input, '^\x{00}-\x{FF}'); # encode non-Latin-1 characters as HTML entities (but keep $input in Unicode for now)
  # here, dummy rules covering all entities in the test suite help us to avoid a dependency on the non-standard HTML::Entities module
  $input =~ s/&/&amp;/g;
  $input =~ s/</&lt;/g;
  $input =~ s/>/&gt;/g;
  $input =~ s/\x{03B1}/\&alpha;/g;
  $input =~ s/\x{2665}/\&hearts;/g;
  $input =~ s/\x{03B4}/\&delta;/g;
  # end of dummy rules
  $input =~ s/([.?*+|(){}\[\]\^\$])/\\$1/g;  # escape CQP regexp metacharacters (" is treated separately below)
  $input =~ s/"/&(lsquo|rsquo);/g;           # special handling of " to match both left and right quotes
  return $input;
}

# override lemma_pattern rule to provide support for {book/V} notation
sub lemma_pattern {
  my ($self, $lemma) = @_;
  my $simple_pos = $self->GetParam("simple_pos");
  die "simplified POST tags are not available (internal error)\n"
    unless ref($simple_pos) eq "HASH";



( run in 0.589 second using v1.01-cache-2.11-cpan-c21f80fb71c )