Plack-App-MCCS


local/lib/perl5/x86_64-linux-thread-multi/HTML/Parser.pm

package HTML::Parser;

use strict;

our $VERSION = '3.81';

require HTML::Entities;

require XSLoader;
XSLoader::load('HTML::Parser', $VERSION);

sub new
{
    my $class = shift;
    my $self = bless {}, $class;
    return $self->init(@_);
}


sub init
{
    my $self = shift;
    $self->_alloc_pstate;

    my %arg = @_;
    my $api_version = delete $arg{api_version} || (@_ ? 3 : 2);
    if ($api_version >= 4) {
	require Carp;
	Carp::croak("API version $api_version not supported " .
		    "by HTML::Parser $VERSION");
    }

    if ($api_version < 3) {
	# Set up method callbacks compatible with HTML-Parser-2.xx
	$self->handler(text    => "text",    "self,text,is_cdata");
	$self->handler(end     => "end",     "self,tagname,text");
	$self->handler(process => "process", "self,token0,text");
	$self->handler(start   => "start",
		                  "self,tagname,attr,attrseq,text");

	$self->handler(comment =>
		       sub {
			   my($self, $tokens) = @_;
			   for (@$tokens) {
			       $self->comment($_);
			   }
		       }, "self,tokens");

	$self->handler(declaration =>
		       sub {
			   my $self = shift;
			   $self->declaration(substr($_[0], 2, -1));
		       }, "self,text");
    }

    if (my $h = delete $arg{handlers}) {
	$h = {@$h} if ref($h) eq "ARRAY";
	while (my($event, $cb) = each %$h) {
	    $self->handler($event => @$cb);
	}
    }

    # Any remaining option is either an "<event>_h" handler spec or the
    # name of a parser attribute method that is called with the value
    while (my($option, $val) = each %arg) {
        if ($option =~ /^(\w+)_h$/) {
            $self->handler($1 => @$val);
        }
        elsif ($option =~ /^(text|start|end|process|declaration|comment)$/) {
            require Carp;
            Carp::croak("Bad constructor option '$option'");
        }
        else {
            $self->$option($val);
        }
    }

    return $self;
}


sub parse_file
{
    my($self, $file) = @_;
    my $opened;
    if (!ref($file) && ref(\$file) ne "GLOB") {
        # Assume $file is a filename
        local(*F);
        open(F, "<", $file) || return undef;
        binmode(F);  # should we? good for byte counts
        $opened++;
        $file = *F;
    }
    my $chunk = '';
    while (read($file, $chunk, 512)) {
        $self->parse($chunk) || last;

local/lib/perl5/x86_64-linux-thread-multi/HTML/Parser.pm

This passes undef except for C<start> events.

Unless C<xml_mode> or C<case_sensitive> is enabled, the attribute
names are forced to lower case.

General entities are decoded in the attribute values and
one layer of matching quotes enclosing the attribute values is removed.

The Unicode character set is assumed for entity decoding.

=item C<@attr>

Basically the same as C<attr>, but keys and values are passed as
individual arguments and the original sequence of the attributes is
kept.  The parameters passed will be the same as the @attr calculated
here:

   @attr = map { $_ => $attr->{$_} } @$attrseq;

assuming $attr and $attrseq here are the hash and array passed as the
result of C<attr> and C<attrseq> argspecs.

This passes no values for events besides C<start>.

=item C<attrseq>

Attrseq causes a reference to an array of attribute names to be
passed.  This can be useful if you want to walk the C<attr> hash in
the original sequence.

This passes undef except for C<start> events.

Unless C<xml_mode> or C<case_sensitive> is enabled, the attribute
names are forced to lower case.

=item C<column>

Column causes the column number of the start of the event to be passed.
The first column on a line is 0.

=item C<dtext>

Dtext causes the decoded text to be passed.  General entities are
automatically decoded unless the event was inside a CDATA section or
was between literal start and end tags (C<script>, C<style>,
C<xmp>, C<iframe>, C<title>, C<textarea> and C<plaintext>).

The Unicode character set is assumed for entity decoding.  With Perl
version 5.6 or earlier only the Latin-1 range is supported, and
entities for characters outside the range 0..255 are left unchanged.

This passes undef except for C<text> events.

=item C<event>

Event causes the event name to be passed.

The event name is one of C<text>, C<start>, C<end>, C<declaration>,
C<comment>, C<process>, C<start_document> or C<end_document>.

=item C<is_cdata>

Is_cdata causes a TRUE value to be passed if the event is inside a CDATA
section or between literal start and end tags (C<script>,
C<style>, C<xmp>, C<iframe>, C<title>, C<textarea> and C<plaintext>).

If the flag is FALSE for a text event, then you should normally
either use C<dtext> or decode the entities yourself before the text is
processed further.

=item C<length>

Length causes the number of bytes of the source text of the event to
be passed.

=item C<line>

Line causes the line number of the start of the event to be passed.
The first line in the document is 1.  Line counting doesn't start
until at least one handler requests this value to be reported.

=item C<offset>

Offset causes the byte position in the HTML document of the start of
the event to be passed.  The first byte in the document has offset 0.

=item C<offset_end>

Offset_end causes the byte position in the HTML document of the end of
the event to be passed.  This is the same as C<offset> + C<length>.

=item C<self>

Self causes the current object to be passed to the handler.  If the
handler is a method, this must be the first element in the argspec.

An alternative to passing self as an argspec is to register closures
that capture $self by themselves as handlers.  Unfortunately this
creates circular references which prevent the HTML::Parser object
from being garbage collected.  Using the C<self> argspec avoids this
problem.

=item C<skipped_text>

Skipped_text returns the concatenated text of all the events that have
been skipped since the last time an event was reported.  Events might
be skipped because no handler is registered for them or because some
filter applies.  Skipped text also includes marked section markup,
since there are no events that can catch it.

If an C<"">-handler is registered for an event, then the text for this
event is not included in C<skipped_text>.  Skipped text both before
and after the C<"">-event is included in the next reported
C<skipped_text>.

=item C<tag>

Same as C<tagname>, but prefixed with "/" if it belongs to an C<end>
event and "!" for a declaration.  The C<tag> does not have any prefix
for C<start> events, and is in this case identical to C<tagname>.

=item C<tagname>

local/lib/perl5/x86_64-linux-thread-multi/HTML/Parser.pm


This event is triggered when a start tag is recognized.

Example:

  <A HREF="http://www.perl.com/">

=item C<start_document>

This event is triggered before any other events for a new document.  A
handler for it can be used to perform initialization.  There is no document
text associated with this event.

=item C<text>

This event is triggered when plain text (characters) is recognized.
The text may contain multiple lines.  A sequence of text may be broken
between several text events unless $p->unbroken_text is enabled.

The parser will make sure that it does not break a word or a sequence
of whitespace between two text events.

=back
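
As a minimal sketch of how the events above reach handlers, the
following registers C<start_document>, C<start> and C<text> handlers
and records what each one receives.  The C<@seen> array and the sample
markup are assumptions made for illustration; they are not part of the
module.

    use strict;
    use warnings;
    use HTML::Parser ();

    my @seen;    # hypothetical log of what the handlers received

    my $p = HTML::Parser->new(
        api_version      => 3,
        start_document_h => [sub { push @seen, $_[0] }, "event"],
        start_h          => [sub {
                my ($tagname, $attr, $attrseq) = @_;
                # walk the attribute hash in the original order
                push @seen, "start $tagname: "
                    . join(", ", map { "$_=$attr->{$_}" } @$attrseq);
            }, "tagname, attr, attrseq"],
        text_h           => [sub { push @seen, "text: $_[0]" }, "dtext"],
    );
    $p->unbroken_text(1);    # deliver each run of text as one event

    $p->parse('<A HREF="http://www.perl.com/" ID=x>Perl &amp; more</A>');
    $p->eof;                 # flush any text still buffered

    print "$_\n" for @seen;

No handler is registered for C<end> events here, so the closing tag is
skipped without a callback.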

=head2 Unicode

C<HTML::Parser> can parse Unicode strings when running under
Perl 5.8 or later.  If Unicode is passed to $p->parse() then chunks
of Unicode will be reported to the handlers.  The offset and length
argspecs will also report their values in terms of characters.

It is safe to parse raw undecoded UTF-8 if you either avoid decoding
entities and make sure to not use I<argspecs> that do, or enable the
C<utf8_mode> for the parser.  Parsing of undecoded UTF-8 might be
useful when parsing from a file where you need the reported offsets
and lengths to match the byte offsets in the file.

If a filename is passed to $p->parse_file() then the file will be read
in binary mode.  This will be fine if the file contains only ASCII or
Latin-1 characters.  If the file contains UTF-8 encoded text then care
must be taken when decoding entities as described in the previous
paragraph, but it is better to open the file with the UTF-8 layer so that
it is decoded properly:

   open(my $fh, "<:utf8", "index.html") || die "...: $!";
   $p->parse_file($fh);

If the file contains text encoded in a charset besides ASCII, Latin-1
or UTF-8 then decoding will always be needed.
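
One way to do that decoding, sketched here under the assumption that
the input is ISO-8859-2, is to convert the bytes to Perl characters
with the core C<Encode> module before handing them to the parser.  The
sample bytes and the C<$text> accumulator are made up for this
illustration.

    use strict;
    use warnings;
    use Encode qw(decode encode);
    use HTML::Parser ();

    # Simulated input: a byte string in ISO-8859-2 (assumed charset).
    my $bytes = encode("iso-8859-2", "<p>\x{141}\x{f3}d\x{17a} &amp; co</p>");

    my $text = "";
    my $p = HTML::Parser->new(
        api_version => 3,
        text_h      => [sub { $text .= shift }, "dtext"],
    );

    # Decode to characters first, then parse the decoded string.
    $p->parse(decode("iso-8859-2", $bytes));
    $p->eof;

    # $text now holds the characters with the entity decoded as well.

When reading from a file, opening it with an
C<:encoding(iso-8859-2)> layer and passing the filehandle to
$p->parse_file() should have the same effect.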

=head1 VERSION 2 COMPATIBILITY

When an C<HTML::Parser> object is constructed with no arguments, a set
of handlers is automatically provided that is compatible with the old
HTML::Parser version 2 callback methods.

This is equivalent to the following method calls:

    $p->handler(start   => "start",   "self, tagname, attr, attrseq, text");
    $p->handler(end     => "end",     "self, tagname, text");
    $p->handler(text    => "text",    "self, text, is_cdata");
    $p->handler(process => "process", "self, token0, text");
    $p->handler(
        comment => sub {
            my ($self, $tokens) = @_;
            for (@$tokens) { $self->comment($_); }
        },
        "self, tokens"
    );
    $p->handler(
        declaration => sub {
            my $self = shift;
            $self->declaration(substr($_[0], 2, -1));
        },
        "self, text"
    );

Setting up these handlers can also be requested with the
C<< api_version => 2 >> constructor option.
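
A sketch of what this compatibility mode allows: a subclass overrides
the version 2 callback methods directly and constructs the parser with
no arguments.  The class name C<LinkLister> and the C<links> hash key
are hypothetical.

    package LinkLister;          # hypothetical class name
    use strict;
    use warnings;
    use parent "HTML::Parser";

    # Version 2 style callback: a plain method named after the event.
    sub start {
        my ($self, $tagname, $attr, $attrseq, $text) = @_;
        push @{ $self->{links} }, $attr->{href}
            if $tagname eq "a" && defined $attr->{href};
    }

    package main;

    my $p = LinkLister->new;     # no arguments: version 2 handlers are set up
    $p->parse('<a href="one.html">1</a> <!-- x --> <a href="two.html">2</a>');
    $p->eof;

    print "$_\n" for @{ $p->{links} };

The other events fall through to the empty callback methods inherited
from C<HTML::Parser>.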

=head1 SUBCLASSING

The C<HTML::Parser> class can be subclassed.  Parser objects are plain
hashes and C<HTML::Parser> reserves only hash keys that start with
"_hparser".  The parser state can be set up by invoking the init()
method, which takes the same arguments as new().
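
The following is a minimal sketch of that pattern, not a prescribed
layout: a subclass blesses its own hash, keeps its state under a key
that does not start with "_hparser", and calls init() with the usual
constructor options.  The class name C<WordCounter>, the C<words> key
and the C<count_words> method are hypothetical.

    package WordCounter;         # hypothetical class name
    use strict;
    use warnings;
    use parent "HTML::Parser";

    sub new {
        my ($class, %opt) = @_;
        my $self = bless { words => 0 }, $class;   # our own hash key
        # init() accepts the same arguments as new()
        $self->init(
            api_version => 3,
            text_h      => ["count_words", "self, dtext"],
            %opt,
        );
        return $self;
    }

    # Method handler; "self" is the first element of its argspec.
    sub count_words {
        my ($self, $text) = @_;
        my @words = $text =~ /\S+/g;
        $self->{words} += @words;
    }

    package main;

    my $p = WordCounter->new;
    $p->parse("<p>three short words</p><p>and two</p>");
    $p->eof;

    print $p->{words}, "\n";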

=head1 EXAMPLES

The first simple example shows how you might strip out comments from
an HTML document.  We achieve this by setting up a comment handler that
does nothing and a default handler that will print out anything else:

    use HTML::Parser ();
    HTML::Parser->new(
        default_h => [sub { print shift }, 'text'],
        comment_h => [""],
    )->parse_file(shift || die)
        || die $!;

An alternative implementation is:

    use HTML::Parser ();
    HTML::Parser->new(
        end_document_h => [sub { print shift }, 'skipped_text'],
        comment_h      => [""],
    )->parse_file(shift || die)
        || die $!;

This will in most cases be much more efficient since only a single
callback will be made.
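
To compare the two approaches without a file on disk, the sketch below
runs both over the same string and collects the output in variables.
The sample markup and the C<$out1>, C<$out2> and C<$calls> variables
are assumptions made for this illustration.

    use strict;
    use warnings;
    use HTML::Parser ();

    my $html = "<p>keep<!-- drop me --> this</p>\n";

    # Variant 1: one callback for every event except comments.
    my $out1 = "";
    my $p1 = HTML::Parser->new(
        api_version => 3,
        default_h   => [sub { $out1 .= $_[0] if defined $_[0] }, "text"],
        comment_h   => [""],
    );
    $p1->parse($html);
    $p1->eof;

    # Variant 2: a single callback at the end of the document.
    my ($out2, $calls) = ("", 0);
    my $p2 = HTML::Parser->new(
        api_version    => 3,
        end_document_h => [sub { $out2 .= $_[0]; $calls++ }, "skipped_text"],
        comment_h      => [""],
    );
    $p2->parse($html);
    $p2->eof;

    print $out1 eq $out2 ? "same output\n" : "different output\n";

The C<defined> check in the first variant guards against events that
carry no document text, such as C<start_document>.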

The next example prints out the text that is inside the <title>
element of an HTML document.  Here we start by setting up a start
handler.  When it sees the title start tag it enables a text handler
that prints any text found and an end handler that will terminate
parsing as soon as the title end tag is seen:

    use HTML::Parser ();

    sub start_handler {


