Alt-CWB-ambs

 view release on metacpan or  search on metacpan

lib/CWB/CEQL/Parser.pm  view on Meta::CPAN

package CWB::CEQL::Parser;

use Carp;
use Encode;

use warnings;
use strict;

=head1 NAME

CWB::CEQL::Parser - Deterministic Perl parser for simplified query languages

=head1 SYNOPSIS

  ##### DPP grammar in separate module #####
  package MyGrammar;
  use base 'CWB::CEQL::Parser';

  sub my_rule {
    my ($self, $input) = @_;
    ## body of grammar rule "my_rule" goes here, transforming $input => $output
    return $output;
  }

  sub default {
    ## default rule will be called if parser is applied to string
    my ($self, $input) = @_;
    return $self->Call("my_rule", $input);        # recursively call other grammar rules
  }

  1;

  ##### main program #####
  use MyGrammar;
  our $grammar = new MyGrammar;

  $result = $grammar->Parse($string);             # applies 'default' rule
  $result = $grammar->Parse($string, "my_rule");  # parse as given constituent

=head1 DESCRIPTION

B<CWB::CEQL::Parser> implements a B<heuristic>-driven, B<deterministic>,
B<top-down> parser for extended context-free grammars written in B<Perl>,
called a B<D>eterministic B<P>erl B<P>arser (B<DPP>).  This parsing algorithm
was designed specifically for automatic translation of simplified, user-friendly
query and scripting languages (such as the B<C>ommon B<E>lementary B<Q>uery
B<L>anguage provided by B<CWB::CEQL>) into low-level code (e.g. B<CQP> syntax).

The DPP architecture was motivated by the observation that simplified queries
are often very similar in structure to the corresponding low-level queries,
and that many authors use cascaded regular expression substitutions to
transform one into the other.  While such cascades are very easy to write in
Perl and perform efficiently, there are two important limitations: it would
often be useful (i) to validate and transform recursive structures, and (ii)
to restrict a particular transformation to a certain scope.  Because of these
limitations, incorrect user input -- and sometimes even correct input -- leads
to malformed low-level queries.  Without an intimate knowledge of the
implementation, it is often impossible to guess the true location of the
problem from the cryptic error messages generated by the backend processor.
Moreover, simplified query languages based on regular expression substitution
typically have rather limited expressiveness and flexibility (because the
substitutions are applied unconditionally, so symbols cannot have different
meanings in different contexts).

B<CWB::CEQL::Parser> aims to overcome these limitations by combining
regexp-based matching and substitution with a simple top-down parser for
context-free grammars, as well as a shift-reduce-style parser for nested
bracketing.  Parsing complexity is limited by enforcing a B<fully
deterministic> parsing algorithm: a B<DPP rule> (= constituent type,
corresponding to the LHS of a traditional CFG rule) may have multiple
expansions, but the Perl code implementing the rule has to employ heuristics
to choose a single expansion for a given input.  If the selected expansion
does not match the input string, the entire parsing process fails.  Again,
this decision was motivated by the observation that, in the case of simplified
query languages, it is often very easy to make such deterministic decisions
with a regular expression and/or a few lines of Perl code.

Each B<DPP rule> is implemented as a B<Perl subroutine> (or, more precisely,
B<method>).  It is invoked by the parser with an input string that is expected
to be a constituent of the respective type, and returns its analysis of this
constituent.  In the typical application of DPP grammars, the return value is
a string representing (part of) a low-level query expression, but grammar
authors may also decide to return arbitrary data structures.  In the B<rule
body>, other grammar rules can be applied to the full input string, a
substring, or an arbitrarily transformed substring using the B<Call> method.
It is also possible to invoke the shift-reduce-type parser with the B<Apply>
method.  Both methods return an analysis of the given substring, which can
then be integrated with the analyses of other substrings and the parsing
performed by the rule body itself.

The section on L<"WRITING GRAMMARS"> below explains in detail how to write new
DPP grammars from scratch; L<"GRAMMAR RULES"> presents some typical design
patterns for grammar rules and lists the methods available to grammar writers;
L<"EXTENDING GRAMMARS"> shows how to extend and modify existing grammars (such
as the standard CEQL implementation provided by the B<CWB::CEQL> module);
L<"USER-VISIBLE METHODS"> documents methods aimed at grammar users; L<"METHODS
USED BY GRAMMAR AUTHORS"> documents methods for grammar writers; and
L<"INTERNAL METHODS"> contains short descriptions of methods used internally
by the parser.


=head1 WRITING GRAMMARS

Technically, a B<DPP grammar> is a subclass of B<CWB::CEQL::Parser>, which
defines B<DPP rules> in the form of Perl B<methods>, and inherits parsing and
housekeeping methods from the base class.  Instantiating such a grammar class
yields an independent parser object.

By convention, the names of B<rule methods> are written in lowercase with
underscores (e.g., C<word_and_pos>), B<methods for users and grammar writers>
are written in mixed case (e.g., C<Parse> or C<SetParam>), and B<internal
methods> are written in mixed case starting with a lowercase letter (e.g.,

lib/CWB/CEQL/Parser.pm  view on Meta::CPAN

Named parameters have to be defined in the constructor (i.e. the B<new>
method) of a grammar by calling the B<NewParam> method, which also sets a
default value for the new parameter.  They can then be modified or read out at
any time using the B<SetParam> and B<GetParam> methods.  It is an error to
set or read the value of a parameter that hasn't previously been defined.

A typical skeletion of a DPP grammar with parameters looks as follows:

  package MyGrammar;
  use base 'CWB::CEQL::Parser';

  sub new {
    my $class = shift;
    my $self = new CWB::CEQL::Parser;
    $self->NewParam("pos_attribute", "pos");
    return bless($self, $class);
  }

  sub pos_tag {
    my ($self, $input) = @_;
    my $pos_att = $self->GetParam("pos_attribute");
    die "'$input' does not appear to be a valid POS tag\n"
      unless $input =~ /^[A-Z0-9]+$/;
    return "$pos_att = '$input'"; # CQP constraint for POS tag
  }

  # ... other grammar rules, including "default" ...

  1;

If your grammar does not define its own parameters, it is not necessary to
provide an explicit implementation of the B<new> method (unless some other
initialisation has to be performed).

A user can now apply B<MyGrammar> to a corpus that stores POS tags in
a p-attribute named C<tag>:

  use MyGrammar;

  our $grammar = new MyGrammar;
  $grammar->SetParam("pos_attribute", "tag");

  $cqp_query = $grammar->Parse($simple_query);

The following section presents some typical design patterns for DPP rules and
explains the use of B<Call>, B<Apply> and B<Try>.  Complete function
references are found in the sections L<"USER-VISIBLE METHODS"> and L<"METHODS
USED BY GRAMMAR AUTHORS">.  If you want to see an example of a complete DPP
grammar, it is a good idea to take a look at the implementation of the
standard CEQL grammar in the B<CWB::CEQL> module.  Knowledge of this grammar
implementation is essential if you want to build your own custom CEQL
extensions.


=head1 GRAMMAR RULES

=head2 Stand-alone rules

The simplest DPP rules are stand-alone rules that transform their input string
directly without invoking any subrules.  These rules typically make use of regular
expression substitutions and correspond to one part of the substitution cascade
in a traditional implementation of simple query languages.  In contrast to such
cascades, DPP rules apply only to relevant parts of the input string and cannot
accidentally modify other parts of the simple query.  The example below transforms
a search term with shell-style wildcards (C<?> and C<*>) into a regular expression.
Note how the input string is first checked to make sure it does not contain any
other metacharacters that might have a special meaning in the generated regular
expression, and B<die>s with an informative error message otherwise.

  sub wildcard_expression {
    my ($self, $input) = @_;
    die "the wildcard expression ''$input'' contains invalid characters\n"
      unless $input =~ /^[A-Za-z0-9?*-]+$/;
    my $regexp = $input;
    $regexp =~ s/\?/./g;
    $regexp =~ s/\*/.*/g;
    return $regexp;
  }

Alternatively, the rule could escape all regular expression metacharacters in
the input string so they are matched literally by the regular expression.
This version of the grammar might use an internal subroutine for translating
strings with wildcards safely into regular expressions:

  sub wildcard_expression {
    my ($self, $input) = @_;
    return _wildcard_to_regexp($input);
  }

  # note leading underscore for internal subroutine (this is not a method!)
  sub _wildcard_to_regexp {
    my $s = quotemeta(shift);
    $s =~ s/\\[?]/./g;  # wildcards will also have been escaped with a backslash
    $s =~ s/\\([*+])/.$1/g;  # works for wildcards * and +
    return $s;
  }

=head2 Handling parse errors

DPP rules should always carry out strict checks to ensure that their input is
a well-formed constituent of the required type, and B<die> with a clear and
informative error message otherwise.  This helps users to locate and correct
syntax errors in their input.  If errors are caught too late, i.e. in deeply
nested subrules, it may be difficult to recognise the true cause of the
problem.

The error message passed to B<die> should be limited to a single line of text
if possible.  Always append a newline character (C<\n>) in order to suppress
the automatic Perl stack trace, which provides no useful information for
grammar users and is likely to be confusing.  B<CWB::CEQL::Parser> will add
its own stack trace of subrule invocations so that users can pinpoint the
precise location of the syntax error.  In order to make this stack trace
readable and informative, DPP rules should always be given descriptive names: use
C<wildcard_expression> or C<part_of_speech> rather than C<rule1723a>.

The B<HtmlErrorMessage> method will automatically convert HTML metacharacters
and non-ASCII characters to entities, so it is safe to include the returned
HTML code directly in a Web page.  Error messages may use basic wiki-style
formatting: C<''...''> for typewriter font, C<//...//> for italics and
C<**...**> for bold font.  Note that such markup is non-recursive and nested
formatting will be ignored.  User input should always be enclosed in
C<''...''> in error messages so that C<//> and C<**> sequences in the input
are not mistaken as formatting instructions.



( run in 1.394 second using v1.01-cache-2.11-cpan-39bf76dae61 )