CodeGen-Cpppp
lib/CodeGen/Cpppp/CParser.pm
         |  # all other characters
            (.) (?{ $_type= 'unknown'; $_error= q{parse error} })
         )
      }xcg
   ) {
      my @token= ($_type, $_value // $1, $-[0], $+[0] - $-[0], defined $_error? ($_error) : ());
      # disambiguate negative number from minus operator
      if (($_type eq 'integer' || $_type eq 'real')
         && @tokens && $tokens[-1][0] eq '-'
         && (@tokens == 1 || !$tokens_before_infix_minus{$tokens[-2]->type})
      ) {
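         # fold the preceding '-' token into this constant: negate the value
         # and extend the source span back to the minus sign
         $token[1]= -$token[1];
         $token[2]= $tokens[-1][2];
         $token[3]= $+[0] - $tokens[-1][2];
         # overwrite the '-' token in place; it is already blessed
         @{$tokens[-1]}= @token;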
      } else {
         push @tokens, bless \@token, 'CodeGen::Cpppp::CParser::Token';
      }
      ($_error, $_value)= (undef, undef);
   }
   return @tokens;
}

1;

__END__

=pod

=encoding UTF-8

=head1 NAME

CodeGen::Cpppp::CParser - C Parser Utility Library

=head1 METHODS

=head2 tokenize

  @tokens= $class->tokenize($string);
  @tokens= $class->tokenize(\$string);
  @tokens= $class->tokenize(\$string, $max_tokens);

Parse some number of C language tokens from the input string, and update the
regex C<pos()> of the string so that you can resume parsing more tokens later.
Since this updates the C<pos()> of the string, you can pass the string as a
reference to make it clearer to readers what is happening.

If C<$max_tokens> is given, at most that many tokens will be returned.
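
The resume behavior described above relies on Perl's C<\G> anchor and the C</gc> modifiers. The following is a minimal sketch of that mechanism only; C<toy_tokenize> is a hypothetical stand-in written for this illustration, not the module's implementation, and it recognizes only words and single punctuation characters.

```perl
use strict;
use warnings;

# Hypothetical stand-in for tokenize(): reads up to $max word or punctuation
# tokens from $$ref and leaves pos($$ref) just past the last token read.
sub toy_tokenize {
   my ($ref, $max)= @_;
   my @tokens;
   while ((!defined $max || @tokens < $max)
      && $$ref =~ /\G \s* ( \w+ | [^\w\s] ) /xgc
   ) {
      # [ text, offset, length ] of the captured token
      push @tokens, [ $1, $-[1], $+[1] - $-[1] ];
   }
   return @tokens;
}

my $src= "int x = 42;";
my @first= toy_tokenize(\$src, 2);   # stops after two tokens
my $resume_at= pos($src);            # where the next call will continue
my @rest= toy_tokenize(\$src);       # picks up the remaining tokens
```

Because the second call starts at C<pos($src)>, no token is read twice, and the C</c> modifier keeps C<pos()> intact when the match finally fails at the end of the string.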

Whitespace is ignored (not returned as a token) except for whitespace contained
in a 'directive' token.  The body of a directive needs to be tokenized further.

Each token is an arrayref of the form:

  [ $type, $value, $offset, $length, $error=undef ]
  
  $type:   'directive', 'comment', 'string', 'char', 'real', 'integer',
           'keyword', 'ident', 'unknown', or any punctuation character
  
  $value:  for constants, this is the decoded string or numeric value
           for directives and comments, it is the body text
           for punctuation, it is a copy of $type
           for unknown, it is the exact character that didn't parse
  
  $offset: the character offset within the source $string
  
  $length: the number of characters occupied in the source $string
  
  $error:  if the token is invalid in some way, but still unambiguously that
           type of token (e.g. unclosed string or unclosed comment) it will be
           returned with a 5th element containing the error message.
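
As a sketch of how the optional 5th element might be consumed, the loop below collects a message for every token that carries an error. The tokens are written by hand in the documented shape, and the error text is invented for the example; it is not actual C<tokenize> output.

```perl
use strict;
use warnings;

# Hand-written tokens for the source text: puts("oops
# The error message is an assumption made up for this illustration.
my @tokens= (
   [ 'ident',  'puts', 0, 4 ],
   [ '(',      '(',    4, 1 ],
   [ 'string', 'oops', 5, 5, 'unterminated string' ],
);

my @problems;
for my $t (@tokens) {
   my ($type, $value, $offset, $length, $error)= @$t;
   # valid tokens have only four elements, so $error is undef for them
   push @problems, "$type at offset $offset: $error" if defined $error;
}
```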

For some tokens, you will need to inspect C<< substr($string, $offset, $length) >>
to get the full details, like the suffixes on integer constants.
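
For instance, recovering an integer suffix could look like the sketch below. The token is written by hand in the documented shape rather than produced by C<tokenize>, and the suffix pattern covers only the C<u>/C<l> letters, which is an assumption made for this example.

```perl
use strict;
use warnings;

my $src= "long n= 10UL;";

# Hand-written token in the documented shape; not actual tokenize() output.
my $token= [ 'integer', 10, 8, 4 ];
my ($type, $value, $offset, $length, $error)= @$token;

# The decoded value is just the number; the source text keeps the suffix.
my $text= substr($src, $offset, $length);
my ($suffix)= $text =~ /([uUlL]+)\z/;   # undef when there is no suffix
```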

Consecutive string tokens are not merged, since the parser needs to handle
that step after preprocessor macros are substituted.
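
A later parsing stage could merge adjacent string tokens along these lines. This is a minimal sketch under stated assumptions: C<merge_adjacent_strings> is a hypothetical helper that is not part of this module, and the input tokens are written by hand in the documented shape.

```perl
use strict;
use warnings;

# Hypothetical helper: concatenates the values of consecutive 'string'
# tokens and widens the source span to cover all merged literals.
sub merge_adjacent_strings {
   my @out;
   for my $t (@_) {
      if ($t->[0] eq 'string' && @out && $out[-1][0] eq 'string') {
         my $prev= $out[-1];
         $prev->[1] .= $t->[1];
         $prev->[3]= $t->[2] + $t->[3] - $prev->[2];
      } else {
         push @out, [ @$t ];   # copy, so the caller's tokens are untouched
      }
   }
   return @out;
}

# Hand-written tokens for the source text: "Hello, " "world";
my @tokens= (
   [ 'string', 'Hello, ', 0, 9 ],
   [ 'string', 'world',  10, 7 ],
   [ ';',      ';',      17, 1 ],
);
my @merged= merge_adjacent_strings(@tokens);
```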

=head1 AUTHOR

Michael Conrad <mike@nrdvana.net>

=head1 VERSION

version 0.005

=head1 COPYRIGHT AND LICENSE

This software is copyright (c) 2024 by Michael Conrad.

This is free software; you can redistribute it and/or modify it under
the same terms as the Perl 5 programming language system itself.

=cut