Alt-CWB-ambs
view release on metacpan or search on metacpan
lib/CWB/CEQL/Parser.pm view on Meta::CPAN
package CWB::CEQL::Parser;
use Carp;
use Encode;
use warnings;
use strict;
=head1 NAME
CWB::CEQL::Parser - Deterministic Perl parser for simplified query languages
=head1 SYNOPSIS
##### DPP grammar in separate module #####
package MyGrammar;
use base 'CWB::CEQL::Parser';
sub my_rule {
my ($self, $input) = @_;
## body of grammar rule "my_rule" goes here, transforming $input => $output
return $output;
}
sub default {
## default rule will be called if parser is applied to string
my ($self, $input) = @_;
return $self->Call("my_rule", $input); # recursively call other grammar rules
}
1;
##### main program #####
use MyGrammar;
our $grammar = new MyGrammar;
$result = $grammar->Parse($string); # applies 'default' rule
$result = $grammar->Parse($string, "my_rule"); # parse as given constituent
=head1 DESCRIPTION
B<CWB::CEQL::Parser> implements a B<heuristic>-driven, B<deterministic>,
B<top-down> parser for extended context-free grammars written in B<Perl>,
called a B<D>eterministic B<P>erl B<P>arser (B<DPP>). This parsing algorithm
was designed specifically for automatic translation of simplified, user-friendly
query and scripting languages (such as the B<C>ommon B<E>lementary B<Q>uery
B<L>anguage provided by B<CWB::CEQL>) into low-level code (e.g. B<CQP> syntax).
The DPP architecture was motivated by the observation that simplified queries
are often very similar in structure to the corresponding low-level queries,
and that many authors use cascaded regular expression substitutions to
transform one into the other. While such cascades are very easy to write in
Perl and perform efficiently, there are two important limitations: it would
often be useful (i) to validate and transform recursive structures, and (ii)
to restrict a particular transformation to a certain scope. Because of these
limitations, incorrect user input -- and sometimes even correct input -- leads
to malformed low-level queries. Without an intimate knowledge of the
implementation, it is often impossible to guess the true location of the
problem from the cryptic error messages generated by the backend processor.
Moreover, simplified query languages based on regular expression substitution
typically have rather limited expressiveness and flexibility (because the
substitutions are applied unconditionally, so symbols cannot have different
meanings in different contexts).
B<CWB::CEQL::Parser> aims to overcome these limitations by combining
regexp-based matching and substitution with a simple top-down parser for
context-free grammars, as well as a shift-reduce-style parser for nested
bracketing. Parsing complexity is limited by enforcing a B<fully
deterministic> parsing algorithm: a B<DPP rule> (= constituent type,
corresponding to the LHS of a traditional CFG rule) may have multiple
expansions, but the Perl code implementing the rule has to employ heuristics
to choose a single expansion for a given input. If the selected expansion
does not match the input string, the entire parsing process fails. Again,
this decision was motivated by the observation that, in the case of simplified
query languages, it is often very easy to make such deterministic decisions
with a regular expression and/or a few lines of Perl code.
Each B<DPP rule> is implemented as a B<Perl subroutine> (or, more precisely,
B<method>). It is invoked by the parser with an input string that is expected
to be a constituent of the respective type, and returns its analysis of this
constituent. In the typical application of DPP grammars, the return value is
a string representing (part of) a low-level query expression, but grammar
authors may also decide to return arbitrary data structures. In the B<rule
body>, other grammar rules can be applied to the full input string, a
substring, or an arbitrarily transformed substring using the B<Call> method.
It is also possible to invoke the shift-reduce-type parser with the B<Apply>
method. Both methods return an analysis of the given substring, which can
then be integrated with the analyses of other substrings and the parsing
performed by the rule body itself.
The section on L<"WRITING GRAMMARS"> below explains in detail how to write new
DPP grammars from scratch; L<"GRAMMAR RULES"> presents some typical design
patterns for grammar rules and lists the methods available to grammar writers;
L<"EXTENDING GRAMMARS"> shows how to extend and modify existing grammars (such
as the standard CEQL implementation provided by the B<CWB::CEQL> module);
L<"USER-VISIBLE METHODS"> documents methods aimed at grammar users; L<"METHODS
USED BY GRAMMAR AUTHORS"> documents methods for grammar writers; and
L<"INTERNAL METHODS"> contains short descriptions of methods used internally
by the parser.
=head1 WRITING GRAMMARS
Technically, a B<DPP grammar> is a subclass of B<CWB::CEQL::Parser>, which
defines B<DPP rules> in the form of Perl B<methods>, and inherits parsing and
housekeeping methods from the base class. Instantiating such a grammar class
yields an independent parser object.
By convention, the names of B<rule methods> are written in lowercase with
underscores (e.g., C<word_and_pos>), B<methods for users and grammar writers>
are written in mixed case (e.g., C<Parse> or C<SetParam>), and B<internal
methods> are written in mixed case starting with a lowercase letter (e.g.,
B<formatHtmlString>). If you need to define helper subroutines in your grammar
class, their names should begin with an underscore (e.g., C<_escape_regexp>)
to avoid confusion with grammar rules. The C<default> rule has to be
implemented by all grammars and will be applied to an input string if no
constituent type is specified. The basic skeleton of a DPP grammar therefore
looks like this:
( run in 0.471 second using v1.01-cache-2.11-cpan-e1769b4cff6 )