Alt-CWB-ambs
lib/CWB/CEQL/Parser.pm
only complex subconstituents are passed to other rules for parsing with the
B<Call> method.
The following example allows users to search for a word form using either a
wildcard pattern or a regular expression enclosed in C</.../>. The return
value is a CQP query. As an additional optimisation, wildcard patterns that
do not contain any wildcards are matched literally (which is faster than a
regular expression and avoids possible conflicts with regexp metacharacters).
  sub wordform_pattern {
    my ($self, $input) = @_;
    die "the wordform pattern ''$input'' must not contain whitespace or double quotes\n"
      if $input =~ /\s|\"/;
    if ($input =~ /^\/(.+)\/$/) {
      my $regexp = $1; # regular expression query: simply wrap in double quotes
      return "\"$regexp\"";
    }
    else {
      if ($input =~ /[?*+]/) {
        my $regexp = $self->Call("wildcard_expression", $input); # call subrule
        return "\"$regexp\"";
      }
      else {
        return "\"$input\"\%l";
      }
    }
  }
It would probably be a good idea to signal an error if the wordform pattern
starts or ends with a slash (C</>) but is not enclosed in C</.../> as a
regular expression query. This is likely to be a typing mistake and the user
will be confused if the input is silently interpreted as a wildcard
expression.
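The check suggested above could be sketched as follows. This is only an illustration: C<check_stray_slash> is a hypothetical helper that is not part of B<CWB::CEQL::Parser>, and the wording of its error message is an assumption. It accepts exactly the inputs that C<wordform_pattern> treats as regular expression queries and rejects any other pattern with a leading or trailing slash.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of CWB::CEQL::Parser): reject a wordform
# pattern that starts or ends with a slash unless it is enclosed in /.../
sub check_stray_slash {
  my ($input) = @_;
  # well-formed regular expression query, same test as in wordform_pattern
  return if $input =~ m{^/.+/$};
  die "the wordform pattern ''$input'' starts or ends with a slash, "
    . "but is not enclosed in /.../\n"
    if $input =~ m{^/|/$};
  return;
}
```

In C<wordform_pattern>, such a check would be called right after the test for whitespace and double quotes, so that the stray slash is reported before the input is interpreted as a wildcard expression.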
=head2 Parsing sequences
If the input string consists of a variable number of subconstituents of the
same type, the B<Apply> method provides a convenient alternative to repeated
subrule calls. It parses all specified subconstituents, collects the parse
results and returns them as a list. The following example processes queries
that consist of a sequence of wordform patterns separated by blanks (each pattern
is either a wildcard expression or regular expression, according to the DPP
rules defined above), and returns an equivalent CQP query.
  sub wordform_sequence {
    my ($self, $input) = @_;
    my @items = split " ", $input;
    my @cqp_patterns = $self->Apply("wordform_pattern", @items);
    return "@cqp_patterns";
  }
Recall that the list returned by B<Apply> does not have to be validated: if
any error occurs, the respective subrule will B<die> and abort the complete
parse.
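The error behaviour described above can be illustrated with a self-contained toy model that does not use the parser class at all. All names below (C<toy_wordform_pattern>, C<toy_apply>, C<toy_wordform_sequence>) are hypothetical stand-ins, and C<toy_apply> only imitates the simplest case of B<Apply>, namely calling one rule on every item and collecting the results.

```perl
use strict;
use warnings;

# Stand-in for a DPP rule (assumption: a much simplified wordform_pattern)
sub toy_wordform_pattern {
  my ($input) = @_;
  die "the wordform pattern ''$input'' must not contain double quotes\n"
    if $input =~ /"/;
  return "\"$input\"";
}

# Stand-in for Apply: call the rule on every item and collect the results;
# an exception raised by the rule propagates to the caller
sub toy_apply {
  my ($rule, @items) = @_;
  return map { $rule->($_) } @items;
}

sub toy_wordform_sequence {
  my ($input) = @_;
  my @items = split " ", $input;
  my @cqp_patterns = toy_apply(\&toy_wordform_pattern, @items);
  return "@cqp_patterns"; # no validation needed: every item was parsed
}

my $ok     = eval { toy_wordform_sequence('the old man') };  # complete parse
my $failed = eval { toy_wordform_sequence('the "old man') }; # aborted parse
my $error  = $@;                                             # message from die
```

Because the second call dies inside the rule, C<toy_wordform_sequence> never reaches its C<return> statement, so no partial result is produced.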
=head2 The shift-reduce parser for nested bracketing
The B<Apply> method is more than a convenient shorthand for parsing lists of
constituents. Its main purpose is to parse nested bracketing structures,
which are very common in the syntax of formal languages (examples include
arithmetical formulae, regular expressions and most computer programming
languages). When parsing the constituents of a list with nested bracketing,
two special methods, B<BeginGroup> and B<EndGroup>, are called to mark opening
and closing delimiters. Proper nesting will then automatically be verified by
the DPP parser. If the syntax allows different types of groups to be mixed,
optional names can be passed to the B<BeginGroup> and B<EndGroup> calls in
order to ensure that the different group types match properly.
The output generated by the items of a bracketing group is collected
separately and returned when B<EndGroup> is called. From this list, the rule
processing the closing delimiter has to construct a single expression for the
entire group. Note that the return value of the DPP rule calling
B<BeginGroup> becomes part of the bracketing group output. If this is not
desired, the rule must return an empty string (C<"">). Rules can also check
whether they are in a nested group with the help of the B<NestingLevel> method
(which returns 0 at the top level).
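The bookkeeping described above can be pictured as a stack of output lists, one for each open group. The following toy function is only a sketch of that idea under simplifying assumptions (a single unnamed group type, parentheses as the only delimiters); C<toy_parse> is a hypothetical name and the code does not reproduce the internals of B<Apply>, B<BeginGroup> or B<EndGroup>.

```perl
use strict;
use warnings;

# Toy model of nested bracketing: one output list per open group,
# with $stack[0] collecting the output of the top level
sub toy_parse {
  my (@items) = @_;
  my @stack = ([]);
  for my $item (@items) {
    my $out;
    if ($item eq "(") {
      push @stack, [];         # like BeginGroup: start collecting a new group
      $out = "";               # empty string: nothing is added to the output
    }
    elsif ($item eq ")") {
      die "unbalanced closing parenthesis\n" unless @stack > 1;
      my $group = pop @stack;  # like EndGroup: fetch the collected group output
      $out = "(@$group)";      # build a single expression for the whole group
    }
    else {
      $out = "\"$item\"";      # ordinary item
    }
    # the result of each item becomes part of the innermost open group
    push @{ $stack[-1] }, $out unless $out eq "";
  }
  die "missing closing parenthesis\n" if @stack > 1;
  return "@{ $stack[0] }";
}

my $result = toy_parse(qw/a ( b c ) d/);
```

In this model, the nesting level at any point is the number of lists on the stack minus one, so it is 0 at the top level.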
The example below extends our simple query language with regexp-style
parenthesised groups, quantifiers (C<?>, C<*>, C<+>) and alternatives (C<|>).
In order to simplify the implementation, metacharacters must be separated from
wordform patterns and from other metacharacters by blanks; and quantifiers
must be attached directly to a closing parenthesis (otherwise, the question
mark in C<) ?> would be ambiguous between a quantifier and a wildcard pattern
matching a single character). Note that the C<simple_query> rule is
practically identical to C<wordform_sequence> above, but has been renamed to
reflect its new semantics.
  sub simple_query {
    my ($self, $input) = @_;
    my @items = split " ", $input;
    my @cqp_tokens = $self->Apply("simple_query_item", @items);
    return "@cqp_tokens";
  }

  # need to define single rule to parse all items of a list with nested bracketing
  sub simple_query_item {
    my ($self, $item) = @_;
    # opening delimiter: (
    if ($item eq "(") {
      $self->BeginGroup();
      return "";  # opening delimiter should not become part of group output
    }
    # alternatives separator: | (only within nested group)
    elsif ($item eq "|") {
      die "a group of alternatives (|) must be enclosed in parentheses\n"
        unless $self->NestingLevel > 0; # | metacharacter is not allowed at top level
      return "|";
    }
    # closing delimiter: ) with optional quantifier
    elsif ($item =~ /^\)([?*+]?)$/) {
      my $quantifier = $1;
      my @cqp_tokens = $self->EndGroup();
      die "empty groups '( )' are not allowed\n"
        unless @cqp_tokens > 0;
      return "(@cqp_tokens)$quantifier";
    }
    # all other tokens should be wordform patterns
    else {
      return $self->Call("wordform_pattern", $item);
    }
  }
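To see why quantifiers have to be attached to the closing parenthesis, it can help to look at how a query is split into items and which branch of C<simple_query_item> each item would take. The helper C<classify_item> below is hypothetical and only mirrors the order of the tests in that rule; it performs no parsing.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the branch order of simple_query_item
sub classify_item {
  my ($item) = @_;
  return "open"      if $item eq "(";
  return "separator" if $item eq "|";
  return "close[$1]" if $item =~ /^\)([?*+]?)$/; # quantifier captured in $1
  return "wordform";
}

# quantifier attached to the closing parenthesis
my @attached = map { classify_item($_) } split " ", "( very | quite )? good";

# quantifier separated by a blank: the lone ? is taken as a wordform pattern
my @detached = map { classify_item($_) } split " ", "( very | quite ) ? good";
```

In the first query the item C<)?> is recognised as a closing delimiter with quantifier C<?>, whereas in the second query the separate C<?> falls through to the wordform branch and would be treated as a wildcard pattern.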