Alt-CWB-ambs

 view release on metacpan or  search on metacpan

lib/CWB/CEQL/Parser.pm  view on Meta::CPAN


=head1 SYNOPSIS

  ##### DPP grammar in separate module #####
  package MyGrammar;
  use base 'CWB::CEQL::Parser';

  sub my_rule {
    my ($self, $input) = @_;
    ## body of grammar rule "my_rule" goes here, transforming $input => $output
    return $output;
  }

  sub default {
    ## default rule will be called if parser is applied to string
    my ($self, $input) = @_;
    return $self->Call("my_rule", $input);        # recursively call other grammar rules
  }

  1;

  ##### main program #####
  use MyGrammar;
  our $grammar = new MyGrammar;

  $result = $grammar->Parse($string);             # applies 'default' rule
  $result = $grammar->Parse($string, "my_rule");  # parse as given constituent

=head1 DESCRIPTION

B<CWB::CEQL::Parser> implements a B<heuristic>-driven, B<deterministic>,
B<top-down> parser for extended context-free grammars written in B<Perl>,
called a B<D>eterministic B<P>erl B<P>arser (B<DPP>).  This parsing algorithm
was designed specifically for automatic translation of simplified, user-friendly
query and scripting languages (such as the B<C>ommon B<E>lementary B<Q>uery
B<L>anguage provided by B<CWB::CEQL>) into low-level code (e.g. B<CQP> syntax).

The DPP architecture was motivated by the observation that simplified queries
are often very similar in structure to the corresponding low-level queries,
and that many authors use cascaded regular expression substitutions to
transform one into the other.  While such cascades are very easy to write in
Perl and perform efficiently, there are two important limitations: it would
often be useful (i) to validate and transform recursive structures, and (ii)
to restrict a particular transformation to a certain scope.  Because of these
limitations, incorrect user input -- and sometimes even correct input -- leads
to malformed low-level queries.  Without an intimate knowledge of the
implementation, it is often impossible to guess the true location of the
problem from the cryptic error messages generated by the backend processor.
Moreover, simplified query languages based on regular expression substitution
typically have rather limited expressiveness and flexibility (because the
substitutions are applied unconditionally, so symbols cannot have different
meanings in different contexts).

B<CWB::CEQL::Parser> aims to overcome these limitations by combining
regexp-based matching and substitution with a simple top-down parser for
context-free grammars, as well as a shift-reduce-style parser for nested
bracketing.  Parsing complexity is limited by enforcing a B<fully
deterministic> parsing algorithm: a B<DPP rule> (= constituent type,
corresponding to the LHS of a traditional CFG rule) may have multiple
expansions, but the Perl code implementing the rule has to employ heuristics
to choose a single expansion for a given input.  If the selected expansion
does not match the input string, the entire parsing process fails.  Again,
this decision was motivated by the observation that, in the case of simplified
query languages, it is often very easy to make such deterministic decisions
with a regular expression and/or a few lines of Perl code.

Each B<DPP rule> is implemented as a B<Perl subroutine> (or, more precisely,
B<method>).  It is invoked by the parser with an input string that is expected
to be a constituent of the respective type, and returns its analysis of this
constituent.  In the typical application of DPP grammars, the return value is
a string representing (part of) a low-level query expression, but grammar
authors may also decide to return arbitrary data structures.  In the B<rule
body>, other grammar rules can be applied to the full input string, a
substring, or an arbitrarily transformed substring using the B<Call> method.
It is also possible to invoke the shift-reduce-type parser with the B<Apply>
method.  Both methods return an analysis of the given substring, which can
then be integrated with the analyses of other substrings and the parsing
performed by the rule body itself.

The section on L<"WRITING GRAMMARS"> below explains in detail how to write new
DPP grammars from scratch; L<"GRAMMAR RULES"> presents some typical design
patterns for grammar rules and lists the methods available to grammar writers;
L<"EXTENDING GRAMMARS"> shows how to extend and modify existing grammars (such
as the standard CEQL implementation provided by the B<CWB::CEQL> module);
L<"USER-VISIBLE METHODS"> documents methods aimed at grammar users; L<"METHODS
USED BY GRAMMAR AUTHORS"> documents methods for grammar writers; and
L<"INTERNAL METHODS"> contains short descriptions of methods used internally
by the parser.


=head1 WRITING GRAMMARS

Technically, a B<DPP grammar> is a subclass of B<CWB::CEQL::Parser>, which
defines B<DPP rules> in the form of Perl B<methods>, and inherits parsing and
housekeeping methods from the base class.  Instantiating such a grammar class
yields an independent parser object.

By convention, the names of B<rule methods> are written in lowercase with
underscores (e.g., C<word_and_pos>), B<methods for users and grammar writers>
are written in mixed case (e.g., C<Parse> or C<SetParam>), and B<internal
methods> are written in mixed case starting with a lowercase letter (e.g.,
B<formatHtmlString>).  If you need to define helper subroutines in your grammar
class, their names should begin with an underscore (e.g., C<_escape_regexp>)
to avoid confusion with grammar rules.  The C<default> rule has to be
implemented by all grammars and will be applied to an input string if no
constituent type is specified.  The basic skeleton of a DPP grammar therefore
looks like this:

  package MyGrammar;
  use base 'CWB::CEQL::Parser';

  sub some_rule {
    ## body of grammar rule "some_rule" goes here
  }

  sub default {
    ## default rule will be called if parser is applied to string
  }

  1; # usually, a grammar is implemented as a separate module file

lib/CWB/CEQL/Parser.pm  view on Meta::CPAN

      my @remain = @{$frame->{APPLY_ITEMS}};
      push @lines, " - at this location: '' @done ''**<==**'' @remain ''";
    }
    else {
      my $input = $frame->{INPUT};
      my $previous_input = $previous_frame->{INPUT} || "";
      if (($previous_input eq $input) and ($previous_frame->{RULE} ne "APPLY")) {
        $lines[-1] .= ", **$rule**";
      }
      else {
        push @lines, " - when parsing '' $input '' as **$rule**";
      }
    }
    $previous_frame = $frame;
  }
  return @lines;
}

=item I<$html_code> = I<$grammar>->B<HtmlErrorMessage>;

If the last parse failed, returns HTML-formatted error message and backtrace
of the callstack.  The string I<$html_code> is valid HTML and can directly be
included in a generated Web page.  In particular, unsafe and non-ASCII
characters have been encoded as HTML entities.  Simple, non-recursive
wiki-style markup in an error message is interpreted in the following way:

  **<text>**    <text> is shown in bold font (<b> ... </b>)
  //<text>//    <text> is displayed in italics (<i> ... </i>)
  ''<text>''    <text> is shown in typewriter font (<code> ... </code>)

Lines starting with C< - > (note the two blanks) are converted into list items.

=cut

sub HtmlErrorMessage {
  my $self = shift;
  my @text_lines = $self->ErrorMessage();
  if (@text_lines > 0) {
    return $self->formatHtmlText(@text_lines);
  }
  else {
    return undef;
  }
}

=item I<$grammar>->B<SetParam>(I<$name>, I<$value>);

=item I<$value> = I<$grammar>->B<GetParam>(I<$name>);

Set the value of parameter I<$name> (B<SetParam>), or read its current value
(B<GetParam>).  The parameter I<$name> must have been defined by the grammar
class (which I<$grammar> is an instance of) and should be described in the
grammar's documentation.

=cut

sub SetParam {
  croak 'Usage:  $grammar->SetParam($name, $value)'
    unless @_ == 3;
  my ($self, $name, $value) = @_;
  ## select either global parameter values (user level) or working copy (during parse)
  my $param_set = (defined $self->{INPUT}) ? $self->{PARAM} : $self->{PARAM_DEFAULTS};
  croak "CWB::CEQL::Parser: parameter '$name' does not exist"
    unless exists $param_set->{$name};
  $param_set->{$name} = $value;
}

sub GetParam {
  croak 'Usage:  $grammar->GetParam($name)'
    unless @_ == 2;
  my ($self, $name) = @_;
  my $param_set = (defined $self->{INPUT}) ? $self->{PARAM} : $self->{PARAM_DEFAULTS};
  croak "CWB::CEQL::Parser: parameter '$name' does not exist"
    unless exists $param_set->{$name};
  return $param_set->{$name};
}

=back


=head1 METHODS USED BY GRAMMAR AUTHORS

Methods for grammar authors.  Since these methods are intended for use in the
rules of a DPP grammar, they are typically applied to the object I<$self>.

=over 4

=item I<$self>->B<NewParam>(I<$name>, I<$default_value>);

Define new parameter I<$name> with default value I<$default_value>.  This
method is normally called in the constructor (method B<new>) of a
parameterized grammar.  If it is used in a rule body, the new parameter
will be created in the working copy of the parameter set and will only be
available during the current parse.

=cut

sub NewParam {
  confess 'Usage:  $self->NewParam($name, $default_value)'
    unless @_ == 3;
  my ($self, $name, $value) = @_;
  ## select either global parameter values (user level) or working copy (during parse)
  my $param_set = (defined $self->{INPUT}) ? $self->{PARAM} : $self->{PARAM_DEFAULTS};
  confess "CWB::CEQL::Parser: parameter '$name' already exists, cannot create with NewParam()"
    if exists $param_set->{$name};
  $param_set->{$name} = $value;
}

=item I<$result> = I<$self>->B<Call>(I<$rule>, I<$input>);

Apply rule I<$rule> to input string I<$input>.  The return value I<$result>
depends on the grammar rule, but is usually a string containing a translated
version of I<$input>.  Grammar rules may also annotate this string with
B<attributes> or by B<bless>ing it into a custom class, or return a complex
data structure such as a parse tree for I<$input>.  The caller has to be aware
what kind of value I<$rule> returns.

Note that B<Call> never returns B<undef>.  In case of an error, the entire
parse is aborted.

=cut

sub Call {
  confess 'Usage:  $result = $self->Call($rule, $input);'
    unless @_ == 3;
  my ($self, $rule, $input) = @_;
  confess "Sorry, we're not parsing yet"
    unless defined $self->{INPUT};
  my $method = $self->can("$rule");
  confess "the rule **$rule** does not exist in grammar **".ref($self)."** (internal error)\n"
    unless defined $method;
  my $frame = {RULE => $rule,
               INPUT => $input};
  push @{$self->{CALLSTACK}}, $frame; 
  my $result = $method->($self, $input);
  die "rule **$rule** failed to return a result (internal error)\n"
    unless defined $result;
  my $return_frame = pop @{$self->{CALLSTACK}};
  die "call stack has been corrupted (internal error)"
    unless $return_frame eq $frame;
  return $result;
}

=item I<$result> = I<$self>->B<Try>(I<$rule>, I<$input>);

Tentatively apply rule I<$rule> to the input string.  If I<$input> is parsed
successfully, B<Try> returns the translated version I<$result> (or an
arbitrary data structure such as a parse tree for I<$input>) just as B<Call>
would.  If parsing fails, B<Try> does not abort but simply returns B<undef>,
ignoring any error messages generated during the attempt.  In addition, the
call stack is restored and all parameters are reset to their previous values,
so that parsing can continue as if nothing had happened (note, however, that
this is based on flat backup copies, so complex data structures may have been
altered destructively).

=cut

sub Try {
  confess 'Usage:  $result = $self->Try($rule, $input);'
    unless @_ == 3;
  my ($self, $rule, $input) = @_;
  confess "Sorry, we're not parsing yet"



( run in 1.090 second using v1.01-cache-2.11-cpan-39bf76dae61 )