escape results from the CPAN

escape

Alt-CWB-ambs

view release on metacpan or search on metacpan

lib/CWB/CEQL/Parser.pm view on Meta::CPAN

often be useful (i) to validate and transform recursive structures, and (ii)
to restrict a particular transformation to a certain scope.  Because of these
limitations, incorrect user input -- and sometimes even correct input -- leads
to malformed low-level queries.  Without an intimate knowledge of the
implementation, it is often impossible to guess the true location of the
problem from the cryptic error messages generated by the backend processor.
Moreover, simplified query languages based on regular expression substitution
typically have rather limited expressiveness and flexibility (because the
substitutions are applied unconditionally, so symbols cannot have different
meanings in different contexts).

B<CWB::CEQL::Parser> aims to overcome these limitations by combining
regexp-based matching and substitution with a simple top-down parser for
context-free grammars, as well as a shift-reduce-style parser for nested
bracketing.  Parsing complexity is limited by enforcing a B<fully
deterministic> parsing algorithm: a B<DPP rule> (= constituent type,
corresponding to the LHS of a traditional CFG rule) may have multiple
expansions, but the Perl code implementing the rule has to employ heuristics
to choose a single expansion for a given input.  If the selected expansion
does not match the input string, the entire parsing process fails.  Again,
this decision was motivated by the observation that, in the case of simplified
query languages, it is often very easy to make such deterministic decisions
with a regular expression and/or a few lines of Perl code.

Each B<DPP rule> is implemented as a B<Perl subroutine> (or, more precisely,
B<method>).  It is invoked by the parser with an input string that is expected
to be a constituent of the respective type, and returns its analysis of this
constituent.  In the typical application of DPP grammars, the return value is
a string representing (part of) a low-level query expression, but grammar
authors may also decide to return arbitrary data structures.  In the B<rule
body>, other grammar rules can be applied to the full input string, a
substring, or an arbitrarily transformed substring using the B<Call> method.
It is also possible to invoke the shift-reduce-type parser with the B<Apply>
method.  Both methods return an analysis of the given substring, which can
then be integrated with the analyses of other substrings and the parsing
performed by the rule body itself.

The section on L<"WRITING GRAMMARS"> below explains in detail how to write new
DPP grammars from scratch; L<"GRAMMAR RULES"> presents some typical design
patterns for grammar rules and lists the methods available to grammar writers;
L<"EXTENDING GRAMMARS"> shows how to extend and modify existing grammars (such
as the standard CEQL implementation provided by the B<CWB::CEQL> module);
L<"USER-VISIBLE METHODS"> documents methods aimed at grammar users; L<"METHODS
USED BY GRAMMAR AUTHORS"> documents methods for grammar writers; and
L<"INTERNAL METHODS"> contains short descriptions of methods used internally
by the parser.


=head1 WRITING GRAMMARS

Technically, a B<DPP grammar> is a subclass of B<CWB::CEQL::Parser>, which
defines B<DPP rules> in the form of Perl B<methods>, and inherits parsing and
housekeeping methods from the base class.  Instantiating such a grammar class
yields an independent parser object.

By convention, the names of B<rule methods> are written in lowercase with
underscores (e.g., C<word_and_pos>), B<methods for users and grammar writers>
are written in mixed case (e.g., C<Parse> or C<SetParam>), and B<internal
methods> are written in mixed case starting with a lowercase letter (e.g.,
B<formatHtmlString>).  If you need to define helper subroutines in your grammar
class, their names should begin with an underscore (e.g., C<_escape_regexp>)
to avoid confusion with grammar rules.  The C<default> rule has to be
implemented by all grammars and will be applied to an input string if no
constituent type is specified.  The basic skeleton of a DPP grammar therefore
looks like this:

  package MyGrammar;
  use base 'CWB::CEQL::Parser';

  sub some_rule {
    ## body of grammar rule "some_rule" goes here
  }

  sub default {
    ## default rule will be called if parser is applied to string
  }

  1; # usually, a grammar is implemented as a separate module file

The user instantiates a parser for this grammar as an object of type
B<MyGrammar>, then calls the B<Parse> method to analyse an input string
(optionally specifying the expected constituent type if it is different from
C<default>).  B<Parse> returns an analysis of the input string in the form
chosen by the grammar author.  In most cases, including the standard
B<CWB::CEQL> grammar, this will simply be a string containing a low-level
query expression.  Additional information can be returned by using objects of
class B<CWB::CEQL::String>, which behave like strings in most contexts
(through operator overloading) but can also be assigned a user-specified type
(see the L<CWB::CEQL::String> manpage for details).  Alternatively, an
arbitrary data structure or object can be returned instead of a string.  We
will assume in the following that DPP rules always return plain strings.

  use MyGrammar;
  our $grammar = new MyGrammar;

  $result = $grammar->Parse($string);              # applies 'default' rule
  $result = $grammar->Parse($string, "some_rule"); # parse as given constituent type

If parsing fails, the B<Parse> method returns B<undef>.  A full description of
the error and where it occurred can then be obtained with the B<ErrorMessage>
and B<HtmlErrorMessage> methods:

  @lines_of_text = $grammar->ErrorMessage;
  $html_code = $grammar->HtmlErrorMessage;

The latter takes care of encoding special characters as HTML entities where
necessary and has been included to simplify the integration of DPP grammars
into Web interfaces.

Internally, B<Parse> will invoke appropriate grammar rules.  In the first
example above, the B<default> method would be called with argument I<$string>;
in the second example, B<some_rule> would be called.  A typical DPP rule
performs the following operations:

=over 4

=item 1.

examine the input string to decide whether it appears to be a suitable
constituent, and to determine its internal structure

lib/CWB/CEQL/Parser.pm view on Meta::CPAN

    my ($self, $input) = @_;
    my $pos_att = $self->GetParam("pos_attribute");
    die "'$input' does not appear to be a valid POS tag\n"
      unless $input =~ /^[A-Z0-9]+$/;
    return "$pos_att = '$input'"; # CQP constraint for POS tag
  }

  # ... other grammar rules, including "default" ...

  1;

If your grammar does not define its own parameters, it is not necessary to
provide an explicit implementation of the B<new> method (unless some other
initialisation has to be performed).

A user can now apply B<MyGrammar> to a corpus that stores POS tags in
a p-attribute named C<tag>:

  use MyGrammar;

  our $grammar = new MyGrammar;
  $grammar->SetParam("pos_attribute", "tag");

  $cqp_query = $grammar->Parse($simple_query);

The following section presents some typical design patterns for DPP rules and
explains the use of B<Call>, B<Apply> and B<Try>.  Complete function
references are found in the sections L<"USER-VISIBLE METHODS"> and L<"METHODS
USED BY GRAMMAR AUTHORS">.  If you want to see an example of a complete DPP
grammar, it is a good idea to take a look at the implementation of the
standard CEQL grammar in the B<CWB::CEQL> module.  Knowledge of this grammar
implementation is essential if you want to build your own custom CEQL
extensions.


=head1 GRAMMAR RULES

=head2 Stand-alone rules

The simplest DPP rules are stand-alone rules that transform their input string
directly without invoking any subrules.  These rules typically make use of regular
expression substitutions and correspond to one part of the substitution cascade
in a traditional implementation of simple query languages.  In contrast to such
cascades, DPP rules apply only to relevant parts of the input string and cannot
accidentally modify other parts of the simple query.  The example below transforms
a search term with shell-style wildcards (C<?> and C<*>) into a regular expression.
Note how the input string is first checked to make sure it does not contain any
other metacharacters that might have a special meaning in the generated regular
expression, and B<die>s with an informative error message otherwise.

  sub wildcard_expression {
    my ($self, $input) = @_;
    die "the wildcard expression ''$input'' contains invalid characters\n"
      unless $input =~ /^[A-Za-z0-9?*-]+$/;
    my $regexp = $input;
    $regexp =~ s/\?/./g;
    $regexp =~ s/\*/.*/g;
    return $regexp;
  }

Alternatively, the rule could escape all regular expression metacharacters in
the input string so they are matched literally by the regular expression.
This version of the grammar might use an internal subroutine for translating
strings with wildcards safely into regular expressions:

  sub wildcard_expression {
    my ($self, $input) = @_;
    return _wildcard_to_regexp($input);
  }

  # note leading underscore for internal subroutine (this is not a method!)
  sub _wildcard_to_regexp {
    my $s = quotemeta(shift);
    $s =~ s/\\[?]/./g;  # wildcards will also have been escaped with a backslash
    $s =~ s/\\([*+])/.$1/g;  # works for wildcards * and +
    return $s;
  }

=head2 Handling parse errors

DPP rules should always carry out strict checks to ensure that their input is
a well-formed constituent of the required type, and B<die> with a clear and
informative error message otherwise.  This helps users to locate and correct
syntax errors in their input.  If errors are caught too late, i.e. in deeply
nested subrules, it may be difficult to recognise the true cause of the
problem.

The error message passed to B<die> should be limited to a single line of text
if possible.  Always append a newline character (C<\n>) in order to suppress
the automatic Perl stack trace, which provides no useful information for
grammar users and is likely to be confusing.  B<CWB::CEQL::Parser> will add
its own stack trace of subrule invocations so that users can pinpoint the
precise location of the syntax error.  In order to make this stack trace
readable and informative, DPP rules should always be given descriptive names: use
C<wildcard_expression> or C<part_of_speech> rather than C<rule1723a>.

The B<HtmlErrorMessage> method will automatically convert HTML metacharacters
and non-ASCII characters to entities, so it is safe to include the returned
HTML code directly in a Web page.  Error messages may use basic wiki-style
formatting: C<''...''> for typewriter font, C<//...//> for italics and
C<**...**> for bold font.  Note that such markup is non-recursive and nested
formatting will be ignored.  User input should always be enclosed in
C<''...''> in error messages so that C<//> and C<**> sequences in the input
are not mistaken as formatting instructions.

=head2 Calling subrules

Most DPP rules divide the input string into one or more subconstituents,
similar to the rules of a standard context-free grammar.  The main difference
is that a DPP rule has to settle on the specific positions and categories
of the subconstituents, rather than just listing possible category sequences.
Many DPP rules will also remove syntactic operators and delimiters, so that
only complex subconstituents are passed to other rules for parsing with the
B<Call> method.

The following example allows users to search for a word form using either a
wildcard pattern or a regular expression enclosed in C</.../>.  The return
value is a CQP query.  As an additional optimisation, wildcard patterns that
do not contain any wildcards are matched literally (which is faster than a
regular expression and avoids possible conflicts with regexp metacharacters).

  sub wordform_pattern {
    my ($self, $input) = @_;
    die "the wordform pattern ''$input'' must not contain whitespace or double quotes\n"
      if $input =~ /\s|\"/;
    if ($input =~ /^\/(.+)\/$/) {
      my $regexp = $1; # regular expression query: simply wrap in double quotes
      return "\"$regexp\"";
    }
    else {
      if ($input =~ /[?*+]/) {
        my $regexp = $self->Call("wildcard_expression", $input); # call subrule
        return "\"$regexp\"";
      }

( run in 0.714 second using v1.01-cache-2.11-cpan-524268b4103 )