PUT results from the CPAN

Alt-CWB-ambs

view release on metacpan or search on metacpan

lib/CWB/CEQL/Parser.pm view on Meta::CPAN

      unshift @terms_ops, $new_term;
    }
    die "syntax error in arithmetic expression\n"
      unless @terms_ops == 1;     # wrong number of items
    return shift @terms_ops;
  }

The obvious drawback of this approach is the difficulty of signaling the
precise location of a syntax error to the user (in the example grammar above,
the parser will simply print C<syntax error> if there is any problem in a
sequence of terms and operators).  By the time the error is detected, all
items in the active group have already been pre-processed and subexpressions
have been collapsed.  Printing the current list of terms and operators would
only add to the user's confusion.

In order to signal errors immediately where they occur, each item should be
validated before it is added to the result list (e.g. an operator may not be
pushed as first item on a result list), and the reduce operation (C<< Term Op
Term => Term >>) should be applied as soon as possible.  The rule
C<arithmetic_item> needs direct access to the currently active result list for
this purpose: (1) to check how many items have already been pushed when
validating a new item, and (2) to reduce a sequence C<Term Op Term> to a single
C<Term> in the result list.

A pointer to the currently active result list is obtained with the internal
B<currentGroup> method, allowing a grammar rule to manipulate the result list.
The B<proximity queries> in the B<CWB::CEQL> grammar illustrate this advanced
form of shift-reduce parsing.


=head2 Backtracking with Try()

B<** TODO **>


=head1 EXTENDING GRAMMARS

B<** TODO **>


=head1 USER-VISIBLE METHODS

Methods that are called by the "end users" of a grammar.

=over 4

=item I<$grammar> = B<new> MyGrammar;

Create parser object I<$grammar> for the specified grammar (which must be a
class derived from B<CWB::CEQL::Parser>).  Note that the parser itself is not
reentrant, but multiple parsers for the same grammar can be run in parallel.
The return value I<$grammar> is an object of class B<MyGrammar>.

=cut

sub new {
  my $class = shift;
  my $self = {
              'PARAM_DEFAULTS' => {},  # globally set default values for parameters
              'PARAM' => undef,        # working copies of parameters during parse
              'INPUT' => undef,        # input string (defined while parsing)
              'ERROR' => undef,        # error message generated by last parse (undef = no error)
              'CALLSTACK' => [],       # call stack for backtrace in case of error
              'GROUPS' => undef,       # group structure for shift-reduce parser (undef if not active)
              'GROUPSTACK' => undef,   # stack of nested bracketing groups (undef if not active)
             };
  bless($self, $class);
}

=item I<$result> = I<$grammar>->B<Parse>(I<$string> [, I<$rule>]);

Parse input string I<$string> as a constituent of type I<$rule> (if
unspecified, the C<default> rule will be used).  The return value I<$result>
is typically a string containing the transformed query, but may also be an
arbitrary data structure or object (such as a parse tree for I<$input>).
Consult the relevant grammar documentation for details.  If parsing fails,
B<undef> is returned.

=cut

sub Parse {
  croak 'Usage:  $result = $grammar->Parse($string [, $rule]);'
    unless @_ == 2 or @_ == 3;
  my ($self, $input, $rule) = @_;
  $rule = "default"
    unless defined $rule;
  confess "CWB::CEQL::Parser: Parse() method is not re-entrant\n(tried to parse '$input' while parsing '".$self->{INPUT}."')"
    if defined $self->{INPUT};

  $self->{INPUT} = $input;
  %{$self->{PARAM}} = %{$self->{PARAM_DEFAULTS}}; # shallow copy of hash
  $self->{CALLSTACK} = [];              # re-initialise call stack (should destroy information from last parse)
  $self->{GROUPS} = undef;              # indicate that shift-reduce parser is not active
  $self->{CURRENT_GROUP} = undef;
  $self->{GROUPSTACK} = undef;
  $self->{ERROR} = undef;               # clear previous errors

  my $result = eval { $self->Call($rule, $input) }; # catch exceptions from parse errors
  if (not defined $result) {
    my $error = $@;
    chomp($error);                      # remove trailing newline
    $error = "parse of '' ".$self->{INPUT}." '' returned no result (reason unknown)"
      if $error eq "";
    $error =~ s/\s*\n\s*/ **::** /g;
    $self->{ERROR} = $error;
  }

  $self->{INPUT} = undef;               # no active parse
  $self->{PARAM} = undef;               # restore global parameter values (PARAM_DEFAULTS)
  return $result;                       # undef if parse failed
}

=item I<@lines_of_text> = I<$grammar>->B<ErrorMessage>;

If the last parse failed, returns a detailed error message and backtrace of
the callstack as a list of text lines (without newlines).  Otherwise, returns
empty list.

=cut

sub ErrorMessage {
  my $self = shift;
  my $error = $self->{ERROR};
  return ()
    unless defined $error;

  my @lines = "**Error:** $error";
  my $previous_frame = { RULE => "", INPUT => "" }; # init do dummy frame to avoid special case below
  foreach my $frame (reverse @{$self->{CALLSTACK}}) {
    my $rule = $frame->{RULE};
    if ($rule eq "APPLY") {
      my @done = @{$frame->{APPLY_DONE}};
      my @remain = @{$frame->{APPLY_ITEMS}};
      push @lines, " - at this location: '' @done ''**<==**'' @remain ''";
    }
    else {
      my $input = $frame->{INPUT};
      my $previous_input = $previous_frame->{INPUT} || "";
      if (($previous_input eq $input) and ($previous_frame->{RULE} ne "APPLY")) {
        $lines[-1] .= ", **$rule**";
      }
      else {
        push @lines, " - when parsing '' $input '' as **$rule**";
      }
    }
    $previous_frame = $frame;
  }
  return @lines;
}

=item I<$html_code> = I<$grammar>->B<HtmlErrorMessage>;

If the last parse failed, returns HTML-formatted error message and backtrace
of the callstack.  The string I<$html_code> is valid HTML and can directly be
included in a generated Web page.  In particular, unsafe and non-ASCII
characters have been encoded as HTML entities.  Simple, non-recursive
wiki-style markup in an error message is interpreted in the following way:

  **<text>**    <text> is shown in bold font (<b> ... </b>)
  //<text>//    <text> is displayed in italics (<i> ... </i>)
  ''<text>''    <text> is shown in typewriter font (<code> ... </code>)

Lines starting with C< - > (note the two blanks) are converted into list items.

=cut

sub HtmlErrorMessage {
  my $self = shift;
  my @text_lines = $self->ErrorMessage();
  if (@text_lines > 0) {
    return $self->formatHtmlText(@text_lines);
  }
  else {
    return undef;
  }
}

=item I<$grammar>->B<SetParam>(I<$name>, I<$value>);

=item I<$value> = I<$grammar>->B<GetParam>(I<$name>);

Set the value of parameter I<$name> (B<SetParam>), or read its current value
(B<GetParam>).  The parameter I<$name> must have been defined by the grammar
class (which I<$grammar> is an instance of) and should be described in the
grammar's documentation.

=cut

sub SetParam {
  croak 'Usage:  $grammar->SetParam($name, $value)'
    unless @_ == 3;
  my ($self, $name, $value) = @_;
  ## select either global parameter values (user level) or working copy (during parse)
  my $param_set = (defined $self->{INPUT}) ? $self->{PARAM} : $self->{PARAM_DEFAULTS};
  croak "CWB::CEQL::Parser: parameter '$name' does not exist"
    unless exists $param_set->{$name};
  $param_set->{$name} = $value;
}

sub GetParam {
  croak 'Usage:  $grammar->GetParam($name)'
    unless @_ == 2;
  my ($self, $name) = @_;
  my $param_set = (defined $self->{INPUT}) ? $self->{PARAM} : $self->{PARAM_DEFAULTS};
  croak "CWB::CEQL::Parser: parameter '$name' does not exist"
    unless exists $param_set->{$name};
  return $param_set->{$name};
}

=back


=head1 METHODS USED BY GRAMMAR AUTHORS

Methods for grammar authors.  Since these methods are intended for use in the
rules of a DPP grammar, they are typically applied to the object I<$self>.

=over 4

=item I<$self>->B<NewParam>(I<$name>, I<$default_value>);

Define new parameter I<$name> with default value I<$default_value>.  This
method is normally called in the constructor (method B<new>) of a
parameterized grammar.  If it is used in a rule body, the new parameter
will be created in the working copy of the parameter set and will only be
available during the current parse.

=cut

sub NewParam {
  confess 'Usage:  $self->NewParam($name, $default_value)'
    unless @_ == 3;
  my ($self, $name, $value) = @_;
  ## select either global parameter values (user level) or working copy (during parse)
  my $param_set = (defined $self->{INPUT}) ? $self->{PARAM} : $self->{PARAM_DEFAULTS};
  confess "CWB::CEQL::Parser: parameter '$name' already exists, cannot create with NewParam()"
    if exists $param_set->{$name};
  $param_set->{$name} = $value;
}

=item I<$result> = I<$self>->B<Call>(I<$rule>, I<$input>);

Apply rule I<$rule> to input string I<$input>.  The return value I<$result>
depends on the grammar rule, but is usually a string containing a translated
version of I<$input>.  Grammar rules may also annotate this string with
B<attributes> or by B<bless>ing it into a custom class, or return a complex
data structure such as a parse tree for I<$input>.  The caller has to be aware
what kind of value I<$rule> returns.

Note that B<Call> never returns B<undef>.  In case of an error, the entire
parse is aborted.

=cut

sub Call {
  confess 'Usage:  $result = $self->Call($rule, $input);'
    unless @_ == 3;
  my ($self, $rule, $input) = @_;
  confess "Sorry, we're not parsing yet"
    unless defined $self->{INPUT};
  my $method = $self->can("$rule");
  confess "the rule **$rule** does not exist in grammar **".ref($self)."** (internal error)\n"
    unless defined $method;
  my $frame = {RULE => $rule,
               INPUT => $input};
  push @{$self->{CALLSTACK}}, $frame; 
  my $result = $method->($self, $input);
  die "rule **$rule** failed to return a result (internal error)\n"
    unless defined $result;
  my $return_frame = pop @{$self->{CALLSTACK}};
  die "call stack has been corrupted (internal error)"
    unless $return_frame eq $frame;
  return $result;
}

=item I<$result> = I<$self>->B<Try>(I<$rule>, I<$input>);

Tentatively apply rule I<$rule> to the input string.  If I<$input> is parsed
successfully, B<Try> returns the translated version I<$result> (or an
arbitrary data structure such as a parse tree for I<$input>) just as B<Call>
would.  If parsing fails, B<Try> does not abort but simply returns B<undef>,
ignoring any error messages generated during the attempt.  In addition, the
call stack is restored and all parameters are reset to their previous values,
so that parsing can continue as if nothing had happened (note, however, that
this is based on flat backup copies, so complex data structures may have been
altered destructively).

=cut

sub Try {
  confess 'Usage:  $result = $self->Try($rule, $input);'
    unless @_ == 3;
  my ($self, $rule, $input) = @_;
  confess "Sorry, we're not parsing yet"
    unless defined $self->{INPUT};

  ## make flat backup copies of important data structures and ensure they are restored upon return
  ## (this is not completely safe, but should undo most changes that a failed parse may have made)
  my $back_param = [ @{$self->{PARAM}} ];
  my $back_callstack = [ @{$self->{CALLSTACK}} ]; 
  my ($back_groups, $back_current_group, $back_groupstack) = (undef, undef, undef);
  if (defined $self->{GROUPS}) {
    $back_groups = [ @{$self->{GROUPS}} ];
    $back_current_group = [ @{$back_groups->[0]} ]
      if @$back_groups > 0;
  }
  if (defined $self->{GROUPSTACK}) {
    $back_groupstack = [ @{$self->{GROUPSTACK}} ];
  }

  my $result = eval { $self->Call($rule, $input) };

  ## if parsing failed, restore internal data structures from backup copies
  if (not defined $result) {
    $self->{PARAM} = $back_param;
    $self->{CALLSTACK} = $back_callstack;
    $self->{GROUPS} = $back_groups;
    if (defined $back_groups and defined $back_current_group) {
      $self->{GROUPS}->[0] = $back_current_group;
    }
    $self->{GROUPSTACK} = $back_groupstack;
  }

  return $result;
}

=item I<@results> = I<$self>->B<Apply>(I<$rule>, I<@items>);

Apply rule I<$rule> to each input string in the list I<@items>.  The return
values are collected and returned as a list I<@results>, which has to be
further processed by the caller.  Note that empty strings (C<"">) are
automatically removed from the list of return values.

=cut

sub Apply {
  confess 'Usage:  @results = $self->Apply($rule, @items);'
    unless @_ >= 2;
  my $self = shift;
  my $rule = shift;
  my $frame = {RULE => "APPLY",
               INPUT => undef,
               APPLY_ITEMS => [ @_ ],
               APPLY_DONE => []};
  push @{$self->{CALLSTACK}}, $frame;

  ## data structures for nested groups and result values must be restored on exit (in case of nested Apply())
  local $self->{GROUPS} = [ [] ];   # set up data structure to collect result values of nested groups
  local $self->{GROUPSTACK} = [];   # stack of nested groups (keeps track of nesting depth and ensures proper nesting)

  ## process each input item in turn
  while (@{$frame->{APPLY_ITEMS}}) {
    my $input = shift @{$frame->{APPLY_ITEMS}};
    push @{$frame->{APPLY_DONE}}, $input;
    my $result = $self->Call($rule, $input);
    push @{$self->{GROUPS}->[0]}, $result
      unless $result eq "" and not ref $result;  # plain empty string indicates that this item does not generate output
  }

  ## check that nested bracketing was balanced and that data structure for result values is sane
  if (@{$self->{GROUPSTACK}} > 0) {
    my $next_type = pop @{$self->{GROUPSTACK}};
    my $type_msg = ($next_type eq "*") ? "" : "of type ''$next_type''";
    die "bracketing is not balanced: too many opening delimiters $type_msg\n";
  }
  confess "data structure for result values is corrupt in Apply() call (internal error)"
    unless @{$self->{GROUPS}} == 1;
  my @results = @{$self->{GROUPS}->[0]};

  my $return_frame = pop @{$self->{CALLSTACK}};
  die "call stack has been corrupted (internal error)"
    unless $return_frame eq $frame;
  return @results;
}

=item I<$self>->B<BeginGroup>([I<$name>]);

Marks the start of a nested group, when an opening delimiter is encountered.
B<BeginGroup> may only be called while the shift-reduce parser is active
during an B<Apply> operation.  The optional parameter I<$name> can be used to
ensure proper nesting of different types of groups; the default group name is
C<*>.  After calling B<BeginGroup>, a DPP rule will often return C<""> since
the opening determiner has a purely syntactic function and is not generate
output directly.

=cut

sub BeginGroup {
  my ($self, $name) = @_;
  $name = "*"
    unless defined $name;
  my $group_stack = $self->{GROUPSTACK};
  confess "BeginGroup() called outside Apply() operation (internal error)"
    unless defined $group_stack;
  push @$group_stack, $name;
  unshift @{$self->{GROUPS}}, [];
}

=item I<@group_results> = I<$self>->B<EndGroup>([I<$name>]);

Marks the end of a nested group, when a closing delimiter is encountered.  The
optional parameter I<$name> (or the default name C<*>) must be identical to

lib/CWB/CEQL/Parser.pm view on Meta::CPAN

  push @html_lines, "</ul>"
    if $in_list;
  return join("\n", @html_lines);
}

=item I<$html> = I<$grammar>->B<encodeEntities>(I<$string>);

Replacement for B<encode_entities> function from B<HTML::Entities>, to avoid
dependency on this package (which is not part of the standard library).
Transforms unsafe characters C<E<lt>>, C<E<gt>>, C<&> and C<"> into HTML
entities, normalises whitespace and removes other control characters.

If I<$string> is a Unicode string, all non-ASCII characters are replaced
by numerical entities (otherwise, an unknown 8-bit character set is assumed,
so no substitutions can be made).

=cut

sub encodeEntities {
  my ($self, $s) = @_;
  my %entity = ( '<' => '&lt;', '>' => '&gt;', '&' => '&amp;', '"' => '&quot;' );
  $s =~ s/([<>&"])/$entity{$1}/ge;  # unsafe characters => entities
  $s =~ s/[ \t]+/ /g;               # normalise whitespace (but not line breaks)
  $s =~ s/[\x00-\x09\x0b\x0c\x0e-\x1f]+//g; # remove other control characters except LF and CR
  if (Encode::is_utf8($s)) {
    $s =~ s/([^\x00-\x7f])/sprintf "&#x%X;", ord($1)/ge;
  }
  return $s;
}

=back

=head2 Internal structure of CWB::CEQL::Parser objects

A DPP parser object (i.e. an object that belongs to B<CWB::CEQL::Parser> or
one of its subclasses) is a data structure (hashref) with the following
variables:

=over 4

=item PARAM_DEFAULTS

A hashref containing the global values of grammar parameters, i.e. values set
by the main program for this parser object or the default values defined by
the grammar class.

=item PARAM

Working copy of the grammar parameters, which is used while parsing and may be
modified by grammar rules without affecting the global values.  During a
parse, the B<NewParam>, B<SetParam> and B<GetParam> methods operate on this
working copy.

The C<PARAM> variable is re-initialised before each parse with a flat copy of
the C<PARAM_DEFAULTS> hashref.  Therefore, care has to be taken when modifying
complex parameter values within grammar rules, as the changes will affect the
global values in C<PARAM_DEFAULTS>.  If complex values need to be changed
internally, the grammar rule should always update the parameter with
B<SetParam> and a deep copy of the previous parameter value.

=item INPUT

The current input string passed to the B<Parse> method.  This variable is
mostly used to indicate whether the parser is currently active or not (e.g. in
order to avoid nested B<Parse> calls).

=item ERROR

Error message generated by the last parse, or B<undef> if the parse was
successful.  This error message is returned by B<ErrorMessage> and
B<HtmlErrorMessage> together with a backtrace of the parser's call stack.

=item CALLSTACK

The C<CALLSTACK> variable is an arrayref with information about the nested
calls of grammar rules and their input strings.  Each array element
corresponds to a nested rule invocation and is a hashref with the following
fields:

=over 4

=item RULE 

Name of the grammar rule (i.e. Perl B<method>) invoked.  When the shift-reduce
parser is called with B<Apply>, a special rule named C<APPLY> is pushed on the
stack.

=item INPUT

Input string for the grammar rule (which should be a constituent of the
respective type).

=item APPLY_ITEMS (optional, "APPLY" rule only)

List (arrayref) of items passed to B<Apply> for processing by the shift-reduce
parser.  This field is only present in the call stack entry for the special
C<APPLY> rule.  Items are shifted from this list to C<APPLY_DONE> as they are
processed by the shift-reduce parser.

=item APPLY_DONE (optiona, "APPLY" rule only) 

Items from the list passed to B<Apply> that have already been handled by the
shift-reduce parser.  The main purpose of C<APPLY_ITEMS> and C<APPLY_DONE> is
to narrow down the location of parse errors in a nested bracketing structure.

=back

=item GROUPS

List (arrayref) of arrayrefs collecting parse results for nested bracketing
groups.  The first element of this list corresponds to the currently active
bracketing group.  The C<GROUPS> variable is only defined while the
shift-reduce parser is active.

=item GROUPSTACK

Stack (arrayref) of nested bracketing groups.  Each stack element corresponds
to one level of nesting and is a string giving the type of the respective
group.  If no type has been specified by the user, the default value C<*> is
used.  The length of this array can be used to determine the current nesting
depth.

=back


=head1 COPYRIGHT

Copyright (C) 1999-2010 Stefan Evert [http::/purl.org/stefan.evert]

This software is provided AS IS and the author makes no warranty as to
its use and performance. You may use the software, redistribute and
modify it under the same terms as Perl itself.

=cut

1;

( run in 2.128 seconds using v1.01-cache-2.11-cpan-13bb782fe5a )