Alt-CWB-ambs

 view release on metacpan or  search on metacpan

lib/CWB/CEQL/Parser.pm  view on Meta::CPAN

rules do not need to check the return values of subrules invoked with B<Call>
or B<Apply>.

Sometimes, it may be unavoidable to try different analyses of an input string
in sequence.  In such exceptional cases, grammar writers can use the B<Try>
method to perform a simple type of backtracking.  B<Try> works exactly like
B<Call>, but will catch any exception raised due to a parse failure and return
B<undef> in this case.  Grammar writers are strongly advised to avoid
backtracking whenever possible, though: the deterministic nature of DPP is
essential for efficient parsing, and repeated backtracking will greatly
increase its computational complexity.

DPP grammars can be B<customised> in two ways.  One possibility is to
B<override existing rules> by subclassing the grammar, as described in the
section on L<"EXTENDING GRAMMARS">.  This offers an extremely flexible way of
changing grammar behaviour, but requires a detailed knowledge of the
B<CWB::CEQL::Parser> module and the internal design of the grammar.

A much easier customisation strategy is for grammar writers to define named
B<parameters>, which can then be set by end users in order to control certain
features of the grammar.  Typical applications of parameters include the
following:

=over 4

=item *

customisation of corpus attribute names (e.g., a parameter C<pos_attribute>
might specify the appropriate positional attribute for part-of-speech tags,
such as C<pos> or C<tag>)

=item *

activating or deactivating certain grammar rules (e.g., a parameter
C<has_lemma> might indicate whether a corpus includes lemmatisation
information or not; if it is FALSE, then input strings including lemma
constraints will raise parse errors in the respective grammar rules)

=item *

definition of lookup tables for simplified part-of-speech tags (which have to
be adapted to the tagset used by a particular corpus)

=back

Named parameters have to be defined in the constructor (i.e. the B<new>
method) of a grammar by calling the B<NewParam> method, which also sets a
default value for the new parameter.  They can then be modified or read out at
any time using the B<SetParam> and B<GetParam> methods.  It is an error to
set or read the value of a parameter that hasn't previously been defined.

A typical skeletion of a DPP grammar with parameters looks as follows:

  package MyGrammar;
  use base 'CWB::CEQL::Parser';

  sub new {
    my $class = shift;
    my $self = new CWB::CEQL::Parser;
    $self->NewParam("pos_attribute", "pos");
    return bless($self, $class);
  }

  sub pos_tag {
    my ($self, $input) = @_;
    my $pos_att = $self->GetParam("pos_attribute");
    die "'$input' does not appear to be a valid POS tag\n"
      unless $input =~ /^[A-Z0-9]+$/;
    return "$pos_att = '$input'"; # CQP constraint for POS tag
  }

  # ... other grammar rules, including "default" ...

  1;

If your grammar does not define its own parameters, it is not necessary to
provide an explicit implementation of the B<new> method (unless some other
initialisation has to be performed).

A user can now apply B<MyGrammar> to a corpus that stores POS tags in
a p-attribute named C<tag>:

  use MyGrammar;

  our $grammar = new MyGrammar;
  $grammar->SetParam("pos_attribute", "tag");

  $cqp_query = $grammar->Parse($simple_query);

The following section presents some typical design patterns for DPP rules and
explains the use of B<Call>, B<Apply> and B<Try>.  Complete function
references are found in the sections L<"USER-VISIBLE METHODS"> and L<"METHODS
USED BY GRAMMAR AUTHORS">.  If you want to see an example of a complete DPP
grammar, it is a good idea to take a look at the implementation of the
standard CEQL grammar in the B<CWB::CEQL> module.  Knowledge of this grammar
implementation is essential if you want to build your own custom CEQL
extensions.


=head1 GRAMMAR RULES

=head2 Stand-alone rules

The simplest DPP rules are stand-alone rules that transform their input string
directly without invoking any subrules.  These rules typically make use of regular
expression substitutions and correspond to one part of the substitution cascade
in a traditional implementation of simple query languages.  In contrast to such
cascades, DPP rules apply only to relevant parts of the input string and cannot
accidentally modify other parts of the simple query.  The example below transforms
a search term with shell-style wildcards (C<?> and C<*>) into a regular expression.
Note how the input string is first checked to make sure it does not contain any
other metacharacters that might have a special meaning in the generated regular
expression, and B<die>s with an informative error message otherwise.

  sub wildcard_expression {
    my ($self, $input) = @_;
    die "the wildcard expression ''$input'' contains invalid characters\n"
      unless $input =~ /^[A-Za-z0-9?*-]+$/;
    my $regexp = $input;
    $regexp =~ s/\?/./g;
    $regexp =~ s/\*/.*/g;

lib/CWB/CEQL/Parser.pm  view on Meta::CPAN


The obvious drawback of this approach is the difficulty of signaling the
precise location of a syntax error to the user (in the example grammar above,
the parser will simply print C<syntax error> if there is any problem in a
sequence of terms and operators).  By the time the error is detected, all
items in the active group have already been pre-processed and subexpressions
have been collapsed.  Printing the current list of terms and operators would
only add to the user's confusion.

In order to signal errors immediately where they occur, each item should be
validated before it is added to the result list (e.g. an operator may not be
pushed as first item on a result list), and the reduce operation (C<< Term Op
Term => Term >>) should be applied as soon as possible.  The rule
C<arithmetic_item> needs direct access to the currently active result list for
this purpose: (1) to check how many items have already been pushed when
validating a new item, and (2) to reduce a sequence C<Term Op Term> to a single
C<Term> in the result list.

A pointer to the currently active result list is obtained with the internal
B<currentGroup> method, allowing a grammar rule to manipulate the result list.
The B<proximity queries> in the B<CWB::CEQL> grammar illustrate this advanced
form of shift-reduce parsing.


=head2 Backtracking with Try()

B<** TODO **>


=head1 EXTENDING GRAMMARS

B<** TODO **>


=head1 USER-VISIBLE METHODS

Methods that are called by the "end users" of a grammar.

=over 4

=item I<$grammar> = B<new> MyGrammar;

Create parser object I<$grammar> for the specified grammar (which must be a
class derived from B<CWB::CEQL::Parser>).  Note that the parser itself is not
reentrant, but multiple parsers for the same grammar can be run in parallel.
The return value I<$grammar> is an object of class B<MyGrammar>.

=cut

sub new {
  my $class = shift;
  my $self = {
              'PARAM_DEFAULTS' => {},  # globally set default values for parameters
              'PARAM' => undef,        # working copies of parameters during parse
              'INPUT' => undef,        # input string (defined while parsing)
              'ERROR' => undef,        # error message generated by last parse (undef = no error)
              'CALLSTACK' => [],       # call stack for backtrace in case of error
              'GROUPS' => undef,       # group structure for shift-reduce parser (undef if not active)
              'GROUPSTACK' => undef,   # stack of nested bracketing groups (undef if not active)
             };
  bless($self, $class);
}

=item I<$result> = I<$grammar>->B<Parse>(I<$string> [, I<$rule>]);

Parse input string I<$string> as a constituent of type I<$rule> (if
unspecified, the C<default> rule will be used).  The return value I<$result>
is typically a string containing the transformed query, but may also be an
arbitrary data structure or object (such as a parse tree for I<$input>).
Consult the relevant grammar documentation for details.  If parsing fails,
B<undef> is returned.

=cut

sub Parse {
  croak 'Usage:  $result = $grammar->Parse($string [, $rule]);'
    unless @_ == 2 or @_ == 3;
  my ($self, $input, $rule) = @_;
  $rule = "default"
    unless defined $rule;
  confess "CWB::CEQL::Parser: Parse() method is not re-entrant\n(tried to parse '$input' while parsing '".$self->{INPUT}."')"
    if defined $self->{INPUT};

  $self->{INPUT} = $input;
  %{$self->{PARAM}} = %{$self->{PARAM_DEFAULTS}}; # shallow copy of hash
  $self->{CALLSTACK} = [];              # re-initialise call stack (should destroy information from last parse)
  $self->{GROUPS} = undef;              # indicate that shift-reduce parser is not active
  $self->{CURRENT_GROUP} = undef;
  $self->{GROUPSTACK} = undef;
  $self->{ERROR} = undef;               # clear previous errors

  my $result = eval { $self->Call($rule, $input) }; # catch exceptions from parse errors
  if (not defined $result) {
    my $error = $@;
    chomp($error);                      # remove trailing newline
    $error = "parse of '' ".$self->{INPUT}." '' returned no result (reason unknown)"
      if $error eq "";
    $error =~ s/\s*\n\s*/ **::** /g;
    $self->{ERROR} = $error;
  }

  $self->{INPUT} = undef;               # no active parse
  $self->{PARAM} = undef;               # restore global parameter values (PARAM_DEFAULTS)
  return $result;                       # undef if parse failed
}

=item I<@lines_of_text> = I<$grammar>->B<ErrorMessage>;

If the last parse failed, returns a detailed error message and backtrace of
the callstack as a list of text lines (without newlines).  Otherwise, returns
empty list.

=cut

sub ErrorMessage {
  my $self = shift;
  my $error = $self->{ERROR};
  return ()
    unless defined $error;

  my @lines = "**Error:** $error";

lib/CWB/CEQL/Parser.pm  view on Meta::CPAN


=cut

sub SetParam {
  croak 'Usage:  $grammar->SetParam($name, $value)'
    unless @_ == 3;
  my ($self, $name, $value) = @_;
  ## select either global parameter values (user level) or working copy (during parse)
  my $param_set = (defined $self->{INPUT}) ? $self->{PARAM} : $self->{PARAM_DEFAULTS};
  croak "CWB::CEQL::Parser: parameter '$name' does not exist"
    unless exists $param_set->{$name};
  $param_set->{$name} = $value;
}

sub GetParam {
  croak 'Usage:  $grammar->GetParam($name)'
    unless @_ == 2;
  my ($self, $name) = @_;
  my $param_set = (defined $self->{INPUT}) ? $self->{PARAM} : $self->{PARAM_DEFAULTS};
  croak "CWB::CEQL::Parser: parameter '$name' does not exist"
    unless exists $param_set->{$name};
  return $param_set->{$name};
}

=back


=head1 METHODS USED BY GRAMMAR AUTHORS

Methods for grammar authors.  Since these methods are intended for use in the
rules of a DPP grammar, they are typically applied to the object I<$self>.

=over 4

=item I<$self>->B<NewParam>(I<$name>, I<$default_value>);

Define new parameter I<$name> with default value I<$default_value>.  This
method is normally called in the constructor (method B<new>) of a
parameterized grammar.  If it is used in a rule body, the new parameter
will be created in the working copy of the parameter set and will only be
available during the current parse.

=cut

sub NewParam {
  confess 'Usage:  $self->NewParam($name, $default_value)'
    unless @_ == 3;
  my ($self, $name, $value) = @_;
  ## select either global parameter values (user level) or working copy (during parse)
  my $param_set = (defined $self->{INPUT}) ? $self->{PARAM} : $self->{PARAM_DEFAULTS};
  confess "CWB::CEQL::Parser: parameter '$name' already exists, cannot create with NewParam()"
    if exists $param_set->{$name};
  $param_set->{$name} = $value;
}

=item I<$result> = I<$self>->B<Call>(I<$rule>, I<$input>);

Apply rule I<$rule> to input string I<$input>.  The return value I<$result>
depends on the grammar rule, but is usually a string containing a translated
version of I<$input>.  Grammar rules may also annotate this string with
B<attributes> or by B<bless>ing it into a custom class, or return a complex
data structure such as a parse tree for I<$input>.  The caller has to be aware
what kind of value I<$rule> returns.

Note that B<Call> never returns B<undef>.  In case of an error, the entire
parse is aborted.

=cut

sub Call {
  confess 'Usage:  $result = $self->Call($rule, $input);'
    unless @_ == 3;
  my ($self, $rule, $input) = @_;
  confess "Sorry, we're not parsing yet"
    unless defined $self->{INPUT};
  my $method = $self->can("$rule");
  confess "the rule **$rule** does not exist in grammar **".ref($self)."** (internal error)\n"
    unless defined $method;
  my $frame = {RULE => $rule,
               INPUT => $input};
  push @{$self->{CALLSTACK}}, $frame; 
  my $result = $method->($self, $input);
  die "rule **$rule** failed to return a result (internal error)\n"
    unless defined $result;
  my $return_frame = pop @{$self->{CALLSTACK}};
  die "call stack has been corrupted (internal error)"
    unless $return_frame eq $frame;
  return $result;
}

=item I<$result> = I<$self>->B<Try>(I<$rule>, I<$input>);

Tentatively apply rule I<$rule> to the input string.  If I<$input> is parsed
successfully, B<Try> returns the translated version I<$result> (or an
arbitrary data structure such as a parse tree for I<$input>) just as B<Call>
would.  If parsing fails, B<Try> does not abort but simply returns B<undef>,
ignoring any error messages generated during the attempt.  In addition, the
call stack is restored and all parameters are reset to their previous values,
so that parsing can continue as if nothing had happened (note, however, that
this is based on flat backup copies, so complex data structures may have been
altered destructively).

=cut

sub Try {
  confess 'Usage:  $result = $self->Try($rule, $input);'
    unless @_ == 3;
  my ($self, $rule, $input) = @_;
  confess "Sorry, we're not parsing yet"
    unless defined $self->{INPUT};

  ## make flat backup copies of important data structures and ensure they are restored upon return
  ## (this is not completely safe, but should undo most changes that a failed parse may have made)
  my $back_param = [ @{$self->{PARAM}} ];
  my $back_callstack = [ @{$self->{CALLSTACK}} ]; 
  my ($back_groups, $back_current_group, $back_groupstack) = (undef, undef, undef);
  if (defined $self->{GROUPS}) {
    $back_groups = [ @{$self->{GROUPS}} ];
    $back_current_group = [ @{$back_groups->[0]} ]
      if @$back_groups > 0;
  }



( run in 1.865 second using v1.01-cache-2.11-cpan-39bf76dae61 )