Alt-CWB-ambs
lib/CWB/CEQL/Parser.pm view on Meta::CPAN
implementation, it is often impossible to guess the true location of the
problem from the cryptic error messages generated by the backend processor.
Moreover, simplified query languages based on regular expression substitution
typically have rather limited expressiveness and flexibility (because the
substitutions are applied unconditionally, so symbols cannot have different
meanings in different contexts).
B<CWB::CEQL::Parser> aims to overcome these limitations by combining
regexp-based matching and substitution with a simple top-down parser for
context-free grammars, as well as a shift-reduce-style parser for nested
bracketing. Parsing complexity is limited by enforcing a B<fully
deterministic> parsing algorithm: a B<DPP rule> (= constituent type,
corresponding to the LHS of a traditional CFG rule) may have multiple
expansions, but the Perl code implementing the rule has to employ heuristics
to choose a single expansion for a given input. If the selected expansion
does not match the input string, the entire parsing process fails. Again,
this decision was motivated by the observation that, in the case of simplified
query languages, it is often very easy to make such deterministic decisions
with a regular expression and/or a few lines of Perl code.
Each B<DPP rule> is implemented as a B<Perl subroutine> (or, more precisely,
B<method>). It is invoked by the parser with an input string that is expected
to be a constituent of the respective type, and returns its analysis of this
constituent. In the typical application of DPP grammars, the return value is
a string representing (part of) a low-level query expression, but grammar
authors may also decide to return arbitrary data structures. In the B<rule
body>, other grammar rules can be applied to the full input string, a
substring, or an arbitrarily transformed substring using the B<Call> method.
It is also possible to invoke the shift-reduce-type parser with the B<Apply>
method. Both methods return an analysis of the given substring, which can
then be integrated with the analyses of other substrings and the parsing
performed by the rule body itself.
The section on L<"WRITING GRAMMARS"> below explains in detail how to write new
DPP grammars from scratch; L<"GRAMMAR RULES"> presents some typical design
patterns for grammar rules and lists the methods available to grammar writers;
L<"EXTENDING GRAMMARS"> shows how to extend and modify existing grammars (such
as the standard CEQL implementation provided by the B<CWB::CEQL> module);
L<"USER-VISIBLE METHODS"> documents methods aimed at grammar users; L<"METHODS
USED BY GRAMMAR AUTHORS"> documents methods for grammar writers; and
L<"INTERNAL METHODS"> contains short descriptions of methods used internally
by the parser.
=head1 WRITING GRAMMARS
Technically, a B<DPP grammar> is a subclass of B<CWB::CEQL::Parser>, which
defines B<DPP rules> in the form of Perl B<methods>, and inherits parsing and
housekeeping methods from the base class. Instantiating such a grammar class
yields an independent parser object.
By convention, the names of B<rule methods> are written in lowercase with
underscores (e.g., C<word_and_pos>), B<methods for users and grammar writers>
are written in mixed case (e.g., C<Parse> or C<SetParam>), and B<internal
methods> are written in mixed case starting with a lowercase letter (e.g.,
C<formatHtmlString>). If you need to define helper subroutines in your grammar
class, their names should begin with an underscore (e.g., C<_escape_regexp>)
to avoid confusion with grammar rules. The C<default> rule has to be
implemented by all grammars and will be applied to an input string if no
constituent type is specified. The basic skeleton of a DPP grammar therefore
looks like this:
  package MyGrammar;
  use base 'CWB::CEQL::Parser';

  sub some_rule {
    ## body of grammar rule "some_rule" goes here
  }

  sub default {
    ## default rule will be called if parser is applied to string
  }

  1; # usually, a grammar is implemented as a separate module file
The user instantiates a parser for this grammar as an object of type
B<MyGrammar>, then calls the B<Parse> method to analyse an input string
(optionally specifying the expected constituent type if it is different from
C<default>). B<Parse> returns an analysis of the input string in the form
chosen by the grammar author. In most cases, including the standard
B<CWB::CEQL> grammar, this will simply be a string containing a low-level
query expression. Additional information can be returned by using objects of
class B<CWB::CEQL::String>, which behave like strings in most contexts
(through operator overloading) but can also be assigned a user-specified type
(see the L<CWB::CEQL::String> manpage for details). Alternatively, an
arbitrary data structure or object can be returned instead of a string. We
will assume in the following that DPP rules always return plain strings.
  use MyGrammar;
  our $grammar = MyGrammar->new;

  $result = $grammar->Parse($string); # applies 'default' rule
  $result = $grammar->Parse($string, "some_rule"); # parse as given constituent type
If parsing fails, the B<Parse> method returns B<undef>. A full description of
the error and where it occurred can then be obtained with the B<ErrorMessage>
and B<HtmlErrorMessage> methods:
  @lines_of_text = $grammar->ErrorMessage;
  $html_code = $grammar->HtmlErrorMessage;
The latter takes care of encoding special characters as HTML entities where
necessary and has been included to simplify the integration of DPP grammars
into Web interfaces.
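As a minimal sketch of this error-handling pattern, the script below defines a tiny grammar inline and prints the error report to STDERR when a parse fails. The grammar C<NumberGrammar>, its C<default> rule, the generated expression and the helper C<parse_or_report> are invented for illustration; only B<Parse> and B<ErrorMessage> are taken from the module, which is assumed to be installed.

```perl
use strict;
use warnings;

# Hypothetical one-rule grammar, defined inline to keep the sketch self-contained.
package NumberGrammar;
use base 'CWB::CEQL::Parser';

sub default {
  my ($self, $input) = @_;
  die "'$input' is not a sequence of digits\n"
    unless $input =~ /^[0-9]+$/;
  return "[num = '$input']";   # invented target expression
}

package main;

# Hypothetical helper: returns the parse result, or undef after
# printing the error report obtained from ErrorMessage.
sub parse_or_report {
  my ($grammar, $string) = @_;
  my $result = $grammar->Parse($string);
  return $result if defined $result;
  print STDERR "$_\n" foreach $grammar->ErrorMessage;
  return;
}

my $grammar = NumberGrammar->new;
my $ok  = parse_or_report($grammar, "42");    # rule succeeds
my $bad = parse_or_report($grammar, "4x2");   # rule dies, Parse returns undef
```

In a Web interface, the same check would presumably send the string returned by B<HtmlErrorMessage> to the browser instead of printing plain text.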
Internally, B<Parse> will invoke appropriate grammar rules. In the first
example above, the B<default> method would be called with argument I<$string>;
in the second example, B<some_rule> would be called. A typical DPP rule
performs the following operations:
=over 4
=item 1.
examine the input string to decide whether it appears to be a suitable
constituent, and to determine its internal structure
=item 2.
if the test in Step 1 fails, B<die> with a meaningful error message;
the B<Parse> method will catch this exception and report it to the user
=back
Note that DPP rules always return an analysis or transformation of their
input; they are I<not allowed> to return B<undef> in order to show that the
input string failed to parse. This is a consequence of the deterministic
nature of the DPP approach: the caller guarantees that the input is a
constituent of the specified type -- anything else is an error condition and
causes the rule to B<die>. The two main advantages of the DPP approach are
that (i) the parser does not have to perform any backtracking and (ii) grammar
rules do not need to check the return values of subrules invoked with B<Call>
or B<Apply>.
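The second point can be illustrated with a small sketch, assuming the module is installed. The grammar C<TokenGrammar>, its rules C<wordform> and C<pos_tag>, and the C<WORD_TAG> input notation are hypothetical; B<Call> is used as described above, and none of its return values is checked.

```perl
use strict;
use warnings;

package TokenGrammar;   # hypothetical grammar for input such as "walk_VB"
use base 'CWB::CEQL::Parser';

# word form: letters only (an arbitrary restriction for this sketch)
sub wordform {
  my ($self, $input) = @_;
  die "'$input' is not a valid word form\n"
    unless $input =~ /^[A-Za-z]+$/;
  return "word = '$input'";
}

# part-of-speech tag: uppercase letters and digits
sub pos_tag {
  my ($self, $input) = @_;
  die "'$input' does not appear to be a valid POS tag\n"
    unless $input =~ /^[A-Z0-9]+$/;
  return "pos = '$input'";
}

# token := wordform "_" pos_tag
sub default {
  my ($self, $input) = @_;
  my ($word, $tag) = ($input =~ /^([^_]*)_([^_]*)$/)
    or die "expected input of the form WORD_TAG\n";
  # No checks on the return values: a failing subrule dies, so both
  # calls are known to have succeeded when the return statement runs.
  my $w = $self->Call("wordform", $word);
  my $p = $self->Call("pos_tag", $tag);
  return "[$w & $p]";
}

package main;

my $grammar = TokenGrammar->new;
my $query   = $grammar->Parse("walk_VB");   # [word = 'walk' & pos = 'VB']
my $failed  = $grammar->Parse("walk_vb");   # pos_tag dies, so Parse returns undef
```

The C<default> rule decides on the structure of its input with a single regular expression and then commits to it, which is the kind of deterministic choice the DPP approach relies on.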
Sometimes, it may be unavoidable to try different analyses of an input string
in sequence. In such exceptional cases, grammar writers can use the B<Try>
method to perform a simple type of backtracking. B<Try> works exactly like
B<Call>, but will catch any exception raised due to a parse failure and return
B<undef> in this case. Grammar writers are strongly advised to avoid
backtracking whenever possible, though: the deterministic nature of DPP is
essential for efficient parsing, and repeated backtracking will greatly
increase its computational complexity.
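A possible fallback pattern with B<Try> is sketched below, assuming the module is installed. The grammar C<ValueGrammar> and its rules C<number> and C<word> are hypothetical. In a case this simple the choice could of course be made with one regular expression; the alternatives are kept trivial so that the control flow stays visible.

```perl
use strict;
use warnings;

package ValueGrammar;   # hypothetical grammar: a value is a number or a word
use base 'CWB::CEQL::Parser';

sub number {
  my ($self, $input) = @_;
  die "'$input' is not a number\n" unless $input =~ /^[0-9]+$/;
  return "num = '$input'";
}

sub word {
  my ($self, $input) = @_;
  die "'$input' is not a word\n" unless $input =~ /^[a-z]+$/;
  return "word = '$input'";
}

sub default {
  my ($self, $input) = @_;
  # First alternative: Try returns undef instead of dying if "number" fails.
  my $result = $self->Try("number", $input);
  # Last alternative: use Call, so that a failure aborts the parse as usual.
  $result = $self->Call("word", $input) unless defined $result;
  return "[$result]";
}

package main;

my $grammar = ValueGrammar->new;
my $num  = $grammar->Parse("42");     # [num = '42']
my $word = $grammar->Parse("walk");   # [word = 'walk']
my $bad  = $grammar->Parse("4x");     # both alternatives fail, Parse returns undef
```

Only the earlier alternatives are wrapped in B<Try>; the final one goes through B<Call>, so the user still receives an error message when no analysis succeeds.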
DPP grammars can be B<customised> in two ways. One possibility is to
B<override existing rules> by subclassing the grammar, as described in the
section on L<"EXTENDING GRAMMARS">. This offers an extremely flexible way of
changing grammar behaviour, but requires a detailed knowledge of the
B<CWB::CEQL::Parser> module and the internal design of the grammar.
A much easier customisation strategy is for grammar writers to define named
B<parameters>, which can then be set by end users in order to control certain
features of the grammar. Typical applications of parameters include the
following:
=over 4
=item *
customisation of corpus attribute names (e.g., a parameter C<pos_attribute>
might specify the appropriate positional attribute for part-of-speech tags,
such as C<pos> or C<tag>)
=item *
activating or deactivating certain grammar rules (e.g., a parameter
C<has_lemma> might indicate whether a corpus includes lemmatisation
information or not; if it is FALSE, then input strings including lemma
constraints will raise parse errors in the respective grammar rules)
=item *
definition of lookup tables for simplified part-of-speech tags (which have to
be adapted to the tagset used by a particular corpus)
=back
Named parameters have to be defined in the constructor (i.e. the B<new>
method) of a grammar by calling the B<NewParam> method, which also sets a
default value for the new parameter. They can then be modified or read out at
any time using the B<SetParam> and B<GetParam> methods. It is an error to
set or read the value of a parameter that hasn't previously been defined.
A typical skeleton of a DPP grammar with parameters looks as follows:
  package MyGrammar;
  use base 'CWB::CEQL::Parser';

  sub new {
    my $class = shift;
    my $self = CWB::CEQL::Parser->new;
    $self->NewParam("pos_attribute", "pos");
    return bless($self, $class);
  }

  sub pos_tag {
    my ($self, $input) = @_;
    my $pos_att = $self->GetParam("pos_attribute");
    die "'$input' does not appear to be a valid POS tag\n"
      unless $input =~ /^[A-Z0-9]+$/;
    return "$pos_att = '$input'"; # CQP constraint for POS tag
  }

  # ... other grammar rules, including "default" ...

  1;
If your grammar does not define its own parameters, it is not necessary to
provide an explicit implementation of the B<new> method (unless some other
initialisation has to be performed).
A user can now apply B<MyGrammar> to a corpus that stores POS tags in
a p-attribute named C<tag>:
  use MyGrammar;
  our $grammar = MyGrammar->new;
  $grammar->SetParam("pos_attribute", "tag");
  $cqp_query = $grammar->Parse($simple_query);
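Parameters can also switch grammar rules on and off, as mentioned in the list of typical applications. The sketch below assumes the module is installed; the grammar C<LemmaGrammar>, the C<{lemma}> input notation and the generated expression are hypothetical, while the C<has_lemma> parameter follows the example given earlier.

```perl
use strict;
use warnings;

package LemmaGrammar;   # hypothetical grammar with an on/off switch
use base 'CWB::CEQL::Parser';

sub new {
  my $class = shift;
  my $self = CWB::CEQL::Parser->new;
  $self->NewParam("has_lemma", 1);   # lemma constraints enabled by default
  return bless($self, $class);
}

# lemma constraint, written as {lemma} in this invented notation
sub default {
  my ($self, $input) = @_;
  die "lemma constraints are not available for this corpus\n"
    unless $self->GetParam("has_lemma");
  my ($lemma) = ($input =~ /^\{([a-z]+)\}$/)
    or die "expected a lemma constraint of the form {lemma}\n";
  return "[lemma = '$lemma']";
}

package main;

my $grammar = LemmaGrammar->new;
my $with    = $grammar->Parse("{go}");   # [lemma = 'go']
$grammar->SetParam("has_lemma", 0);      # corpus without lemmatisation
my $without = $grammar->Parse("{go}");   # rule dies, Parse returns undef
```

Raising the error inside the rule means that users of a corpus without lemmatisation get an explicit message rather than a query that refers to a missing attribute.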
The following section presents some typical design patterns for DPP rules and
explains the use of B<Call>, B<Apply> and B<Try>. Complete function
references are found in the sections L<"USER-VISIBLE METHODS"> and L<"METHODS
USED BY GRAMMAR AUTHORS">. If you want to see an example of a complete DPP
grammar, it is a good idea to take a look at the implementation of the
standard CEQL grammar in the B<CWB::CEQL> module. Knowledge of this grammar
implementation is essential if you want to build your own custom CEQL
extensions.
=head1 GRAMMAR RULES
=head2 Stand-alone rules
The simplest DPP rules are stand-alone rules that transform their input string
directly without invoking any subrules. These rules typically make use of regular
expression substitutions and correspond to one part of the substitution cascade
in a traditional implementation of simple query languages. In contrast to such
cascades, DPP rules apply only to relevant parts of the input string and cannot
accidentally modify other parts of the simple query. The example below transforms
a search term with shell-style wildcards (C<?> and C<*>) into a regular expression.
Note how the input string is first checked to make sure it does not contain any
other metacharacters that might have a special meaning in the generated regular