Alt-CWB-ambs

 view release on metacpan or  search on metacpan

lib/CWB/CEQL/Parser.pm  view on Meta::CPAN

The simplest DPP rules are stand-alone rules that transform their input string
directly without invoking any subrules.  These rules typically make use of regular
expression substitutions and correspond to one part of the substitution cascade
in a traditional implementation of simple query languages.  In contrast to such
cascades, DPP rules apply only to relevant parts of the input string and cannot
accidentally modify other parts of the simple query.  The example below transforms
a search term with shell-style wildcards (C<?> and C<*>) into a regular expression.
Note how the input string is first checked to make sure it does not contain any
other metacharacters that might have a special meaning in the generated regular
expression, and B<die>s with an informative error message otherwise.

  sub wildcard_expression {
    my ($self, $input) = @_;
    die "the wildcard expression ''$input'' contains invalid characters\n"
      unless $input =~ /^[A-Za-z0-9?*-]+$/;
    my $regexp = $input;
    $regexp =~ s/\?/./g;
    $regexp =~ s/\*/.*/g;
    return $regexp;
  }

Alternatively, the rule could escape all regular expression metacharacters in
the input string so they are matched literally by the regular expression.
This version of the grammar might use an internal subroutine for translating
strings with wildcards safely into regular expressions:

  sub wildcard_expression {
    my ($self, $input) = @_;
    return _wildcard_to_regexp($input);
  }

  # note leading underscore for internal subroutine (this is not a method!)
  sub _wildcard_to_regexp {
    my $s = quotemeta(shift);
    $s =~ s/\\[?]/./g;  # wildcards will also have been escaped with a backslash
    $s =~ s/\\([*+])/.$1/g;  # works for wildcards * and +
    return $s;
  }

=head2 Handling parse errors

DPP rules should always carry out strict checks to ensure that their input is
a well-formed constituent of the required type, and B<die> with a clear and
informative error message otherwise.  This helps users to locate and correct
syntax errors in their input.  If errors are caught too late, i.e. in deeply
nested subrules, it may be difficult to recognise the true cause of the
problem.

The error message passed to B<die> should be limited to a single line of text
if possible.  Always append a newline character (C<\n>) in order to suppress
the automatic Perl stack trace, which provides no useful information for
grammar users and is likely to be confusing.  B<CWB::CEQL::Parser> will add
its own stack trace of subrule invocations so that users can pinpoint the
precise location of the syntax error.  In order to make this stack trace
readable and informative, DPP rules should always be given descriptive names: use
C<wildcard_expression> or C<part_of_speech> rather than C<rule1723a>.

The B<HtmlErrorMessage> method will automatically convert HTML metacharacters
and non-ASCII characters to entities, so it is safe to include the returned
HTML code directly in a Web page.  Error messages may use basic wiki-style
formatting: C<''...''> for typewriter font, C<//...//> for italics and
C<**...**> for bold font.  Note that such markup is non-recursive and nested
formatting will be ignored.  User input should always be enclosed in
C<''...''> in error messages so that C<//> and C<**> sequences in the input
are not mistaken as formatting instructions.

=head2 Calling subrules

Most DPP rules divide the input string into one or more subconstituents,
similar to the rules of a standard context-free grammar.  The main difference
is that a DPP rule has to settle on the specific positions and categories
of the subconstituents, rather than just listing possible category sequences.
Many DPP rules will also remove syntactic operators and delimiters, so that
only complex subconstituents are passed to other rules for parsing with the
B<Call> method.

The following example allows users to search for a word form using either a
wildcard pattern or a regular expression enclosed in C</.../>.  The return
value is a CQP query.  As an additional optimisation, wildcard patterns that
do not contain any wildcards are matched literally (which is faster than a
regular expression and avoids possible conflicts with regexp metacharacters).

  sub wordform_pattern {
    my ($self, $input) = @_;
    die "the wordform pattern ''$input'' must not contain whitespace or double quotes\n"
      if $input =~ /\s|\"/;
    if ($input =~ /^\/(.+)\/$/) {
      my $regexp = $1; # regular expression query: simply wrap in double quotes
      return "\"$regexp\"";
    }
    else {
      if ($input =~ /[?*+]/) {
        my $regexp = $self->Call("wildcard_expression", $input); # call subrule
        return "\"$regexp\"";
      }
      else {
        return "\"$input\"\%l";
      }
    }
  }

It would probably be a good idea to signal an error if the wordform pattern
starts or ends with a slash (C</>) but is not enclosed in C</.../> as a
regular expression query.  This is likely to be a typing mistake and the user
will be confused if the input is silently interpreted as a wildcard
expression.

=head2 Parsing sequences

If the input string consists of a variable number of subconstituents of the
same type, the B<Apply> method provides a convenient alternative to repeated
subrule calls.  It parses all specified subconstituents, collects the parse
results and returns them as a list.  The following example processes queries
that consist of a sequence wordform patterns separated by blanks (each pattern
is either a wildcard expression or regular expression, according to the DPP
rules defined above), and returns an equivalent CQP query.

  sub wordform_sequence {
    my ($self, $input) = @_;
    my @items = split " ", $input;
    my @cqp_patterns = $self->Apply("wordform_pattern", @items);
    return "@cqp_patterns";

lib/CWB/CEQL/Parser.pm  view on Meta::CPAN

  if (not defined $result) {
    my $error = $@;
    chomp($error);                      # remove trailing newline
    $error = "parse of '' ".$self->{INPUT}." '' returned no result (reason unknown)"
      if $error eq "";
    $error =~ s/\s*\n\s*/ **::** /g;
    $self->{ERROR} = $error;
  }

  $self->{INPUT} = undef;               # no active parse
  $self->{PARAM} = undef;               # restore global parameter values (PARAM_DEFAULTS)
  return $result;                       # undef if parse failed
}

=item I<@lines_of_text> = I<$grammar>->B<ErrorMessage>;

If the last parse failed, returns a detailed error message and backtrace of
the callstack as a list of text lines (without newlines).  Otherwise, returns
empty list.

=cut

sub ErrorMessage {
  my $self = shift;
  my $error = $self->{ERROR};
  return ()
    unless defined $error;

  my @lines = "**Error:** $error";
  my $previous_frame = { RULE => "", INPUT => "" }; # init do dummy frame to avoid special case below
  foreach my $frame (reverse @{$self->{CALLSTACK}}) {
    my $rule = $frame->{RULE};
    if ($rule eq "APPLY") {
      my @done = @{$frame->{APPLY_DONE}};
      my @remain = @{$frame->{APPLY_ITEMS}};
      push @lines, " - at this location: '' @done ''**<==**'' @remain ''";
    }
    else {
      my $input = $frame->{INPUT};
      my $previous_input = $previous_frame->{INPUT} || "";
      if (($previous_input eq $input) and ($previous_frame->{RULE} ne "APPLY")) {
        $lines[-1] .= ", **$rule**";
      }
      else {
        push @lines, " - when parsing '' $input '' as **$rule**";
      }
    }
    $previous_frame = $frame;
  }
  return @lines;
}

=item I<$html_code> = I<$grammar>->B<HtmlErrorMessage>;

If the last parse failed, returns HTML-formatted error message and backtrace
of the callstack.  The string I<$html_code> is valid HTML and can directly be
included in a generated Web page.  In particular, unsafe and non-ASCII
characters have been encoded as HTML entities.  Simple, non-recursive
wiki-style markup in an error message is interpreted in the following way:

  **<text>**    <text> is shown in bold font (<b> ... </b>)
  //<text>//    <text> is displayed in italics (<i> ... </i>)
  ''<text>''    <text> is shown in typewriter font (<code> ... </code>)

Lines starting with C< - > (note the two blanks) are converted into list items.

=cut

sub HtmlErrorMessage {
  my $self = shift;
  my @text_lines = $self->ErrorMessage();
  if (@text_lines > 0) {
    return $self->formatHtmlText(@text_lines);
  }
  else {
    return undef;
  }
}

=item I<$grammar>->B<SetParam>(I<$name>, I<$value>);

=item I<$value> = I<$grammar>->B<GetParam>(I<$name>);

Set the value of parameter I<$name> (B<SetParam>), or read its current value
(B<GetParam>).  The parameter I<$name> must have been defined by the grammar
class (which I<$grammar> is an instance of) and should be described in the
grammar's documentation.

=cut

sub SetParam {
  croak 'Usage:  $grammar->SetParam($name, $value)'
    unless @_ == 3;
  my ($self, $name, $value) = @_;
  ## select either global parameter values (user level) or working copy (during parse)
  my $param_set = (defined $self->{INPUT}) ? $self->{PARAM} : $self->{PARAM_DEFAULTS};
  croak "CWB::CEQL::Parser: parameter '$name' does not exist"
    unless exists $param_set->{$name};
  $param_set->{$name} = $value;
}

sub GetParam {
  croak 'Usage:  $grammar->GetParam($name)'
    unless @_ == 2;
  my ($self, $name) = @_;
  my $param_set = (defined $self->{INPUT}) ? $self->{PARAM} : $self->{PARAM_DEFAULTS};
  croak "CWB::CEQL::Parser: parameter '$name' does not exist"
    unless exists $param_set->{$name};
  return $param_set->{$name};
}

=back


=head1 METHODS USED BY GRAMMAR AUTHORS

Methods for grammar authors.  Since these methods are intended for use in the
rules of a DPP grammar, they are typically applied to the object I<$self>.

=over 4

=item I<$self>->B<NewParam>(I<$name>, I<$default_value>);

lib/CWB/CEQL/Parser.pm  view on Meta::CPAN

will B<die> otherwise.

=cut

sub NestingLevel {
  my $self = shift;
  my $group_stack = $self->{GROUPSTACK};
  confess "NestingLevel() called outside Apply() operation (internal error)"
    unless defined $group_stack;
  return scalar @{$group_stack};
}

=back


=head1 INTERNAL METHODS

Internal methods of B<CWB::CEQL::Parser>.

=over 4

=item I<$array_ref> = I<$self>->B<currentGroup>;

Returns a pointer to the currently active result list during an B<Apply>
operation (either the top-level result list, or the local result list in a
nested group).  This pointer can be used to access previously collected
return values (before B<EndGroup> is called), and to manipulate the result
list (e.g. to perform advanced shift-reduce parsing).

It is an error to call this method while the shift-reduce parser is not
active.

=cut

sub currentGroup {
  my $self = shift;
  confess "currentGroup() called outside Apply() operation (internal error)"
    unless defined $self->{GROUPSTACK};
  return $self->{GROUPS}->[0];
}

=item I<$html_code> = I<$grammar>->B<formatHtmlText>(I<@lines_of_text>);

Format one or more text lines with simple wiki-style markup as HTML.  The
string I<$html_code> is valid HTML and can directly be included in a generated
Web page.  In particular, unsafe and non-ASCII characters are automatically
encoded as HTML entities.  The following typographic markup is supported:

=over 4

=item *

C<< **<text>** >> - <text> is displayed in bold face (C<< <b> ... </b> >>)

=item *

C<< //<text>// >> - <text> is displayed in italics (C<< <i> ... </i> >>)

=item *

C<< ''<text>'' >> - <text> is shown in typewriter font (C<< <code> ... </code> >>)

=item *

lines starting with C< - > (note the two blanks before and after the
hyphen) are converted into list items

=item *

all other lines are formatted as separate paragraphs (C<< <p> ... </p> >>)

=back

The wiki markup is non-recursive, i.e. no substitutions will be applied to
the text wrapped in C<''...''> etc.  This behaviour is intentional, so that
e.g. B<**> in a query expression will not be mistaken for a bold face marker,
(as long as the query is displayed in typewriter font, i.e. as C<''<query>''>).

=cut

sub formatHtmlText {
  my $self = shift;
  my @html_lines = ();
  my $in_list = 0;
  while (@_) {
    my $line = shift;
    my $list_item = ($line =~ s{^ -\s+}{}) ? 1 : 0;
    $line = $self->encodeEntities($line); # does not affect **, // and ''
    $line =~ s{(\*\*|//|'')(.*?)(\1)}{
      if ($1 eq "**")    { "<b>$2</b>" }
      elsif ($1 eq "//") { "<i>$2</i>" }
      else               { "<code>$2</code>" }
    }ge;
    if ($list_item) {
      push @html_lines, "<ul>"
        unless $in_list;
      push @html_lines, "<li>$line</li>";
      $in_list = 1;
    }
    else {
      push @html_lines, "</ul>"
        if $in_list;
      push @html_lines, "<p>$line</p>";
      $in_list = 0;
    }
  }
  push @html_lines, "</ul>"
    if $in_list;
  return join("\n", @html_lines);
}

=item I<$html> = I<$grammar>->B<encodeEntities>(I<$string>);

Replacement for B<encode_entities> function from B<HTML::Entities>, to avoid
dependency on this package (which is not part of the standard library).
Transforms unsafe characters C<E<lt>>, C<E<gt>>, C<&> and C<"> into HTML
entities, normalises whitespace and removes other control characters.

If I<$string> is a Unicode string, all non-ASCII characters are replaced
by numerical entities (otherwise, an unknown 8-bit character set is assumed,
so no substitutions can be made).

=cut

sub encodeEntities {
  my ($self, $s) = @_;
  my %entity = ( '<' => '&lt;', '>' => '&gt;', '&' => '&amp;', '"' => '&quot;' );
  $s =~ s/([<>&"])/$entity{$1}/ge;  # unsafe characters => entities
  $s =~ s/[ \t]+/ /g;               # normalise whitespace (but not line breaks)
  $s =~ s/[\x00-\x09\x0b\x0c\x0e-\x1f]+//g; # remove other control characters except LF and CR
  if (Encode::is_utf8($s)) {
    $s =~ s/([^\x00-\x7f])/sprintf "&#x%X;", ord($1)/ge;
  }
  return $s;
}

=back



( run in 1.540 second using v1.01-cache-2.11-cpan-5735350b133 )