validate results from the CPAN

validate

Alt-CWB-ambs

view release on metacpan or search on metacpan

lib/CWB/CEQL/Parser.pm view on Meta::CPAN

package CWB::CEQL::Parser;

use Carp;
use Encode;

use warnings;
use strict;

=head1 NAME

CWB::CEQL::Parser - Deterministic Perl parser for simplified query languages

=head1 SYNOPSIS

  ##### DPP grammar in separate module #####
  package MyGrammar;
  use base 'CWB::CEQL::Parser';

  sub my_rule {
    my ($self, $input) = @_;
    ## body of grammar rule "my_rule" goes here, transforming $input => $output
    return $output;
  }

  sub default {
    ## default rule will be called if parser is applied to string
    my ($self, $input) = @_;
    return $self->Call("my_rule", $input);        # recursively call other grammar rules
  }

  1;

  ##### main program #####
  use MyGrammar;
  our $grammar = new MyGrammar;

  $result = $grammar->Parse($string);             # applies 'default' rule
  $result = $grammar->Parse($string, "my_rule");  # parse as given constituent

=head1 DESCRIPTION

B<CWB::CEQL::Parser> implements a B<heuristic>-driven, B<deterministic>,
B<top-down> parser for extended context-free grammars written in B<Perl>,
called a B<D>eterministic B<P>erl B<P>arser (B<DPP>).  This parsing algorithm
was designed specifically for automatic translation of simplified, user-friendly
query and scripting languages (such as the B<C>ommon B<E>lementary B<Q>uery
B<L>anguage provided by B<CWB::CEQL>) into low-level code (e.g. B<CQP> syntax).

The DPP architecture was motivated by the observation that simplified queries
are often very similar in structure to the corresponding low-level queries,
and that many authors use cascaded regular expression substitutions to
transform one into the other.  While such cascades are very easy to write in
Perl and perform efficiently, there are two important limitations: it would
often be useful (i) to validate and transform recursive structures, and (ii)
to restrict a particular transformation to a certain scope.  Because of these
limitations, incorrect user input -- and sometimes even correct input -- leads
to malformed low-level queries.  Without an intimate knowledge of the
implementation, it is often impossible to guess the true location of the
problem from the cryptic error messages generated by the backend processor.
Moreover, simplified query languages based on regular expression substitution
typically have rather limited expressiveness and flexibility (because the
substitutions are applied unconditionally, so symbols cannot have different
meanings in different contexts).

B<CWB::CEQL::Parser> aims to overcome these limitations by combining
regexp-based matching and substitution with a simple top-down parser for
context-free grammars, as well as a shift-reduce-style parser for nested
bracketing.  Parsing complexity is limited by enforcing a B<fully
deterministic> parsing algorithm: a B<DPP rule> (= constituent type,
corresponding to the LHS of a traditional CFG rule) may have multiple
expansions, but the Perl code implementing the rule has to employ heuristics
to choose a single expansion for a given input.  If the selected expansion
does not match the input string, the entire parsing process fails.  Again,
this decision was motivated by the observation that, in the case of simplified
query languages, it is often very easy to make such deterministic decisions
with a regular expression and/or a few lines of Perl code.

Each B<DPP rule> is implemented as a B<Perl subroutine> (or, more precisely,
B<method>).  It is invoked by the parser with an input string that is expected
to be a constituent of the respective type, and returns its analysis of this
constituent.  In the typical application of DPP grammars, the return value is
a string representing (part of) a low-level query expression, but grammar
authors may also decide to return arbitrary data structures.  In the B<rule
body>, other grammar rules can be applied to the full input string, a
substring, or an arbitrarily transformed substring using the B<Call> method.
It is also possible to invoke the shift-reduce-type parser with the B<Apply>
method.  Both methods return an analysis of the given substring, which can
then be integrated with the analyses of other substrings and the parsing
performed by the rule body itself.

The section on L<"WRITING GRAMMARS"> below explains in detail how to write new
DPP grammars from scratch; L<"GRAMMAR RULES"> presents some typical design
patterns for grammar rules and lists the methods available to grammar writers;
L<"EXTENDING GRAMMARS"> shows how to extend and modify existing grammars (such
as the standard CEQL implementation provided by the B<CWB::CEQL> module);
L<"USER-VISIBLE METHODS"> documents methods aimed at grammar users; L<"METHODS
USED BY GRAMMAR AUTHORS"> documents methods for grammar writers; and
L<"INTERNAL METHODS"> contains short descriptions of methods used internally
by the parser.


=head1 WRITING GRAMMARS

Technically, a B<DPP grammar> is a subclass of B<CWB::CEQL::Parser>, which
defines B<DPP rules> in the form of Perl B<methods>, and inherits parsing and
housekeeping methods from the base class.  Instantiating such a grammar class
yields an independent parser object.

By convention, the names of B<rule methods> are written in lowercase with
underscores (e.g., C<word_and_pos>), B<methods for users and grammar writers>
are written in mixed case (e.g., C<Parse> or C<SetParam>), and B<internal
methods> are written in mixed case starting with a lowercase letter (e.g.,
B<formatHtmlString>).  If you need to define helper subroutines in your grammar
class, their names should begin with an underscore (e.g., C<_escape_regexp>)

lib/CWB/CEQL/Parser.pm view on Meta::CPAN

are not mistaken as formatting instructions.

=head2 Calling subrules

Most DPP rules divide the input string into one or more subconstituents,
similar to the rules of a standard context-free grammar.  The main difference
is that a DPP rule has to settle on the specific positions and categories
of the subconstituents, rather than just listing possible category sequences.
Many DPP rules will also remove syntactic operators and delimiters, so that
only complex subconstituents are passed to other rules for parsing with the
B<Call> method.

The following example allows users to search for a word form using either a
wildcard pattern or a regular expression enclosed in C</.../>.  The return
value is a CQP query.  As an additional optimisation, wildcard patterns that
do not contain any wildcards are matched literally (which is faster than a
regular expression and avoids possible conflicts with regexp metacharacters).

  sub wordform_pattern {
    my ($self, $input) = @_;
    die "the wordform pattern ''$input'' must not contain whitespace or double quotes\n"
      if $input =~ /\s|\"/;
    if ($input =~ /^\/(.+)\/$/) {
      my $regexp = $1; # regular expression query: simply wrap in double quotes
      return "\"$regexp\"";
    }
    else {
      if ($input =~ /[?*+]/) {
        my $regexp = $self->Call("wildcard_expression", $input); # call subrule
        return "\"$regexp\"";
      }
      else {
        return "\"$input\"\%l";
      }
    }
  }

It would probably be a good idea to signal an error if the wordform pattern
starts or ends with a slash (C</>) but is not enclosed in C</.../> as a
regular expression query.  This is likely to be a typing mistake and the user
will be confused if the input is silently interpreted as a wildcard
expression.

=head2 Parsing sequences

If the input string consists of a variable number of subconstituents of the
same type, the B<Apply> method provides a convenient alternative to repeated
subrule calls.  It parses all specified subconstituents, collects the parse
results and returns them as a list.  The following example processes queries
that consist of a sequence wordform patterns separated by blanks (each pattern
is either a wildcard expression or regular expression, according to the DPP
rules defined above), and returns an equivalent CQP query.

  sub wordform_sequence {
    my ($self, $input) = @_;
    my @items = split " ", $input;
    my @cqp_patterns = $self->Apply("wordform_pattern", @items);
    return "@cqp_patterns";
  }

Recall that the list returned by B<Apply> does not have to be validated: if
any error occurs, the respective subrule will B<die> and abort the complete
parse.

=head2 The shift-reduce parser for nested bracketing

The B<Apply> method is more than a convenient shorthand for parsing lists of
constituents.  Its main purpose is to parse nested bracketing structures,
which are very common in the syntax of formal languages (examples include
arithmetical formulae, regular expressions and most computer programming
languages).  When parsing the constituents of a list with nested bracketing,
two special methods, B<BeginGroup> and B<EndGroup>, are called to mark opening
and closing delimiters.  Proper nesting will then automatically be verified by
the DPP parser.  If the syntax allows different types of groups to be mixed,
optional names can be passed to the B<BeginGroup> and B<EndGroup> calls in
order to ensure that the different group types match properly.

The output generated by the items of a bracketing group is collected
separately and returned when B<EndGroup> is called.  From this list, the rule
processing the closing delimiter has to construct a single expression for the
entire group.  Note that the return value of the DPP rule calling
B<BeginGroup> becomes part of the bracketing group output.  If this is not
desired, the rule must return an empty string (C<"">).  Rules can also check
whether they are in a nested group with the help of the B<NestingLevel> method
(which returns 0 at the top level).

The example below extends our simple query language with regexp-style
parenthesised groups, quantifiers (C<?>, C<*>, C<+>) and alternatives (C<|>).
In order to simplify the implementation, metacharacters must be separated from
wordform patterns and from other metacharacters by blanks; and quantifiers
must be attached directly to a closing parenthesis (otherwise, the question
mark in C<) ?> would be ambiguous between a quantifier and a wildcard pattern
matching a single character).  Note that the C<simple_query> rule is
practically identical to C<wordform_sequence> above, but has been renamed to
reflect its new semantics.

  sub simple_query {
    my ($self, $input) = @_;
    my @items = split " ", $input;
    my @cqp_tokens = $self->Apply("simple_query_item", @items);
    return "@cqp_tokens";
  }

  # need to define single rule to parse all items of a list with nested bracketing
  sub simple_query_item {
    my ($self, $item) = @_;
    # opening delimiter: (
    if ($item eq "(") {
      $self->BeginGroup();
      return "";  # opening delimiter should not become part of group output
    }
    # alternatives separator: | (only within nested group)
    elsif ($item eq "|") {
      die "a group of alternatives (|) must be enclosed in parentheses\n"
        unless $self->NestingLevel > 0; # | metacharacter is not allowed at top level
      return "|";
    }
    # closing delimiter: ) with optional quantifier
    elsif ($item =~ /^\)([?*+]?)$/) {
      my $quantifier = $1;
      my @cqp_tokens = $self->EndGroup();

lib/CWB/CEQL/Parser.pm view on Meta::CPAN

sequence returned by the B<Apply> method.

  package Arithmetic;
  use base 'CWB::CEQL::Parser';
  use CWB::CEQL::String;

  sub default {
    my ($self, $input) = @_;
    return $self->Call("arithmetic_expression", $input);
  }

  sub arithmetic_expression {
    my ($self, $input) = @_;
    $input =~ s/([()+-])/ $1 /g;            # insert whitespace around metacharacters
    $input =~ s/^\s+//; $input =~ s/\s+$//; # strip leading/trailing whitespace
    my @items = split " ", $input;          # split on whitespace into items (numbers, operators, parentheses)
    my @terms_ops = $self->Apply("arithmetic_item", @items); # returns list of Term's and Op's
    return $self->_shift_reduce(@terms_ops);
  }

  sub arithmetic_item {
    my ($self, $item) = @_;
    if ($item eq "+")    { return new CWB::CEQL::String "add", "Op" }
    elsif ($item eq "-") { return new CWB::CEQL::String "sub", "Op" }
    elsif ($item eq "(") { $self->BeginGroup("subexpression"); return "" }
    elsif ($item eq ")") {
      my @terms_ops = $self->EndGroup("subexpression");
      return $self->_shift_reduce(@terms_ops);
    }
    elsif ($item =~ /^[0-9]+$/) { return new CWB::CEQL::String $item, "Term" }
    else { die "invalid element '' $item '' in arithmetic expression\n" }
  }

  sub _shift_reduce {
    my ($self, @terms_ops) = @_;
    while (@terms_ops >= 3) {
      # reduce first three items (which must be Term Op Term) to single Term
      my @types = map {$_->type} @terms_ops;
      die "syntax error in arithmetic expression\n"
        unless "@types" =~ /^Term Op Term/; # wrong sequence of terms and operators
      my $t1 = shift @terms_ops;
      my $op = shift @terms_ops;
      my $t2 = shift @terms_ops;
      my $new_term = new CWB::CEQL::String "$op($t1, $t2)", "Term";
      unshift @terms_ops, $new_term;
    }
    die "syntax error in arithmetic expression\n"
      unless @terms_ops == 1;     # wrong number of items
    return shift @terms_ops;
  }

The obvious drawback of this approach is the difficulty of signaling the
precise location of a syntax error to the user (in the example grammar above,
the parser will simply print C<syntax error> if there is any problem in a
sequence of terms and operators).  By the time the error is detected, all
items in the active group have already been pre-processed and subexpressions
have been collapsed.  Printing the current list of terms and operators would
only add to the user's confusion.

In order to signal errors immediately where they occur, each item should be
validated before it is added to the result list (e.g. an operator may not be
pushed as first item on a result list), and the reduce operation (C<< Term Op
Term => Term >>) should be applied as soon as possible.  The rule
C<arithmetic_item> needs direct access to the currently active result list for
this purpose: (1) to check how many items have already been pushed when
validating a new item, and (2) to reduce a sequence C<Term Op Term> to a single
C<Term> in the result list.

A pointer to the currently active result list is obtained with the internal
B<currentGroup> method, allowing a grammar rule to manipulate the result list.
The B<proximity queries> in the B<CWB::CEQL> grammar illustrate this advanced
form of shift-reduce parsing.


=head2 Backtracking with Try()

B<** TODO **>


=head1 EXTENDING GRAMMARS

B<** TODO **>


=head1 USER-VISIBLE METHODS

Methods that are called by the "end users" of a grammar.

=over 4

=item I<$grammar> = B<new> MyGrammar;

Create parser object I<$grammar> for the specified grammar (which must be a
class derived from B<CWB::CEQL::Parser>).  Note that the parser itself is not
reentrant, but multiple parsers for the same grammar can be run in parallel.
The return value I<$grammar> is an object of class B<MyGrammar>.

=cut

sub new {
  my $class = shift;
  my $self = {
              'PARAM_DEFAULTS' => {},  # globally set default values for parameters
              'PARAM' => undef,        # working copies of parameters during parse
              'INPUT' => undef,        # input string (defined while parsing)
              'ERROR' => undef,        # error message generated by last parse (undef = no error)
              'CALLSTACK' => [],       # call stack for backtrace in case of error
              'GROUPS' => undef,       # group structure for shift-reduce parser (undef if not active)
              'GROUPSTACK' => undef,   # stack of nested bracketing groups (undef if not active)
             };
  bless($self, $class);
}

=item I<$result> = I<$grammar>->B<Parse>(I<$string> [, I<$rule>]);

Parse input string I<$string> as a constituent of type I<$rule> (if
unspecified, the C<default> rule will be used).  The return value I<$result>
is typically a string containing the transformed query, but may also be an
arbitrary data structure or object (such as a parse tree for I<$input>).
Consult the relevant grammar documentation for details.  If parsing fails,
B<undef> is returned.

( run in 2.083 seconds using v1.01-cache-2.11-cpan-56fb94df46f )