HTML-Parser

README  view on Meta::CPAN

        Attr causes a reference to a hash of attribute name/value pairs to
        be passed.

        Boolean attributes' values are either the value set by
        $p->boolean_attribute_value, or the attribute name if no value has
        been set by $p->boolean_attribute_value.

        This passes undef except for "start" events.

        Unless "xml_mode" or "case_sensitive" is enabled, the attribute
        names are forced to lower case.

        General entities are decoded in the attribute values and one layer
        of matching quotes enclosing the attribute values is removed.

        The Unicode character set is assumed for entity decoding.

    @attr
        Basically the same as "attr", but keys and values are passed as
        individual arguments and the original sequence of the attributes is
        kept. The parameters passed will be the same as the @attr calculated
        here:

           @attr = map { $_ => $attr->{$_} } @$attrseq;

        assuming $attr and $attrseq here are the hash and array passed as
        the result of "attr" and "attrseq" argspecs.

        This passes no values for events besides "start".

    "attrseq"
        Attrseq causes a reference to an array of attribute names to be
        passed. This can be useful if you want to walk the "attr" hash in
        the original sequence.

        This passes undef except for "start" events.

        Unless "xml_mode" or "case_sensitive" is enabled, the attribute
        names are forced to lower case.

    "column"
        Column causes the column number of the start of the event to be
        passed. The first column on a line is 0.

    "dtext"
        Dtext causes the decoded text to be passed. General entities are
        automatically decoded unless the event was inside a CDATA section or
        was between literal start and end tags ("script", "style", "xmp",
        "iframe", "title", "textarea" and "plaintext").

        The Unicode character set is assumed for entity decoding.

        This passes undef except for "text" events.

    "event"
        Event causes the event name to be passed.

        The event name is one of "text", "start", "end", "declaration",
        "comment", "process", "start_document" or "end_document".

    "is_cdata"
        Is_cdata causes a TRUE value to be passed if the event is inside a
        CDATA section or between literal start and end tags ("script",
        "style", "xmp", "iframe", "title", "textarea" and "plaintext").

        If the flag is FALSE for a text event, then you should normally
        either use "dtext" or decode the entities yourself before the text
        is processed further.

    "length"
        Length causes the number of bytes of the source text of the event to
        be passed.

    "line"
        Line causes the line number of the start of the event to be passed.
        The first line in the document is 1. Line counting doesn't start
        until at least one handler requests this value to be reported.

    "offset"
        Offset causes the byte position in the HTML document of the start of
        the event to be passed. The first byte in the document has offset 0.

    "offset_end"
        Offset_end causes the byte position in the HTML document of the end
        of the event to be passed. This is the same as "offset" + "length".

    "self"
        Self causes the current object to be passed to the handler. If the
        handler is a method, this must be the first element in the argspec.

        An alternative to passing self as an argspec is to register closures
        that capture $self by themselves as handlers. Unfortunately this
        creates circular references which prevent the HTML::Parser object
        from being garbage collected. Using the "self" argspec avoids this
        problem.

    "skipped_text"
        Skipped_text returns the concatenated text of all the events that
        have been skipped since the last time an event was reported. Events
        might be skipped because no handler is registered for them or
        because some filter applies. Skipped text also includes marked
        section markup, since there are no events that can catch it.

        If an ""-handler is registered for an event, then the text for this
        event is not included in "skipped_text". Skipped text both before
        and after the ""-event is included in the next reported
        "skipped_text".

    "tag"
        Same as "tagname", but prefixed with "/" if it belongs to an "end"
        event and "!" for a declaration. The "tag" does not have any prefix
        for "start" events, and is in this case identical to "tagname".

    "tagname"
        This is the element name (or *generic identifier* in SGML jargon)
        for start and end tags. Since HTML is case-insensitive, this name is
        forced to lower case to ease string matching.

        Since XML is case sensitive, the tagname case is not changed when
        "xml_mode" is enabled. The same happens if the "case_sensitive"
        attribute is set.
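
    Several of the argspecs above can be combined in a single argspec
    string. The following is a minimal sketch, not part of the
    distribution: the sample markup and the names %start, %end, @flat and
    @rebuilt are made up for illustration. It records what a start and an
    end handler receive so that the relationships described above (lower
    casing, "tag" versus "tagname", "@attr" versus "attr" plus "attrseq",
    and "offset_end" versus "offset" + "length") can be inspected.

```perl
use strict;
use warnings;
use HTML::Parser ();

my (%start, %end);   # illustrative storage for what each handler received

my $p = HTML::Parser->new(
    api_version => 3,
    start_h => [
        sub {
            # Argument order follows the argspec string below; the
            # flattened @attr pairs come last and soak up the rest.
            my ($tag, $tagname, $attr, $attrseq,
                $offset, $length, $offset_end, @flat) = @_;
            %start = (
                tag        => $tag,
                tagname    => $tagname,
                attr       => $attr,
                attrseq    => $attrseq,
                offset     => $offset,
                length     => $length,
                offset_end => $offset_end,
                flat       => [@flat],
            );
        },
        'tag, tagname, attr, attrseq, offset, length, offset_end, @attr',
    ],
    end_h => [
        sub { @end{qw(tag tagname)} = @_ },
        'tag, tagname',
    ],
);

$p->parse(q{<A HREF="a.html" Title='Home'>x</a>});
$p->eof;

# Rebuild the ordered key/value list the way the "@attr" entry describes.
my @rebuilt = map { $_ => $start{attr}{$_} } @{ $start{attrseq} };

print "start tag: $start{tag}, end tag: $end{tag}\n";
print "attributes in source order: @rebuilt\n";
```

    Placing "@attr" last in the argspec keeps the fixed arguments at known
    positions, since the number of values it contributes varies per tag.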


        Examples:

          <? HTML processing instructions >
          <? XML processing instructions ?>

    "start"
        This event is triggered when a start tag is recognized.

        Example:

          <A HREF="http://www.perl.com/">

    "start_document"
        This event is triggered before any other events for a new document.
        A handler for it can be used to initialize per-document state.
        There is no document text associated with this event.

    "text"
        This event is triggered when plain text (characters) is recognized.
        The text may contain multiple lines. A sequence of text may be
        broken between several text events unless $p->unbroken_text is
        enabled.

        The parser will make sure that it does not break a word or a
        sequence of whitespace between two text events.
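
    The effect of $p->unbroken_text on "text" events can be observed by
    feeding the same document in several chunks. This is a minimal sketch
    under assumed input: the helper collect_text() and the sample chunks
    are hypothetical and exist only for this illustration.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper: parse the given chunks one at a time and return
# the list of strings delivered to the text handler.
sub collect_text {
    my ($unbroken, @chunks) = @_;
    my @texts;
    my $p = HTML::Parser->new(
        api_version => 3,
        text_h      => [sub { push @texts, shift }, 'dtext'],
    );
    $p->unbroken_text($unbroken);
    $p->parse($_) for @chunks;
    $p->eof;
    return @texts;
}

# The chunk boundaries deliberately fall inside a word and before a space.
my @chunks = ('<p>Hello wor', 'ld, how are', ' you?</p>');

my @default  = collect_text(0, @chunks);
my @unbroken = collect_text(1, @chunks);

print scalar(@default),  " text event(s) by default\n";
print scalar(@unbroken), " text event(s) with unbroken_text\n";
```

    In both cases the concatenation of the reported pieces is the full
    text; only the number of callbacks differs.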

  Unicode
    If Unicode is passed to $p->parse() then chunks of Unicode will be
    reported to the handlers. The offset and length argspecs will also
    report their position in terms of characters.

    It is safe to parse raw undecoded UTF-8 if you either avoid decoding
    entities and make sure to not use *argspecs* that do, or enable the
    "utf8_mode" for the parser. Parsing of undecoded UTF-8 might be useful
    when parsing from a file where you need the reported offsets and lengths
    to match the byte offsets in the file.

    If a filename is passed to $p->parse_file() then the file will be read
    in binary mode. This will be fine if the file contains only ASCII or
    Latin-1 characters. If the file contains UTF-8 encoded text then care
    must be taken when decoding entities as described in the previous
    paragraph, but better is to open the file with the UTF-8 layer so that
    it is decoded properly:

       open(my $fh, "<:utf8", "index.html") || die "...: $!";
       $p->parse_file($fh);

    If the file contains text encoded in a charset besides ASCII, Latin-1 or
    UTF-8 then decoding will always be needed.
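
    The difference between parsing raw UTF-8 octets and parsing a decoded
    Unicode string shows up in the "length" argspec. The following is a
    small sketch, assuming the core Encode module is used for decoding;
    the helper text_events() and the sample markup are invented for this
    illustration. It uses the "text" argspec rather than "dtext" so that
    no entity decoding is involved.

```perl
use strict;
use warnings;
use Encode qw(decode);
use HTML::Parser ();

# Hypothetical helper: return one [text, offset, length] record per
# text event found in the given document string.
sub text_events {
    my ($html) = @_;
    my @seen;
    my $p = HTML::Parser->new(
        api_version => 3,
        text_h      => [sub { push @seen, [@_] }, 'text, offset, length'],
    );
    $p->parse($html);
    $p->eof;
    return @seen;
}

my $bytes = "<p>caf\xC3\xA9</p>";      # raw UTF-8 octets, as read in binary mode
my $chars = decode('UTF-8', $bytes);   # the same document as Unicode characters

my ($byte_event) = text_events($bytes);
my ($char_event) = text_events($chars);

print "length when parsing octets:     $byte_event->[2]\n";
print "length when parsing characters: $char_event->[2]\n";
```

    Parsing the octets keeps the reported positions aligned with byte
    positions in the file, while parsing the decoded string reports them
    in characters.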

VERSION 2 COMPATIBILITY
    When an "HTML::Parser" object is constructed with no arguments, a set of
    handlers is automatically provided that is compatible with the old
    HTML::Parser version 2 callback methods.

    This is equivalent to the following method calls:

        $p->handler(start   => "start",   "self, tagname, attr, attrseq, text");
        $p->handler(end     => "end",     "self, tagname, text");
        $p->handler(text    => "text",    "self, text, is_cdata");
        $p->handler(process => "process", "self, token0, text");
        $p->handler(
            comment => sub {
                my ($self, $tokens) = @_;
                for (@$tokens) { $self->comment($_); }
            },
            "self, tokens"
        );
        $p->handler(
            declaration => sub {
                my $self = shift;
                $self->declaration(substr($_[0], 2, -1));
            },
            "self, text"
        );

    Setting up these handlers can also be requested with the "api_version =>
    2" constructor option.
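
    A version 2 style parser is therefore written as a subclass that
    overrides the callback methods. This is a minimal sketch: the class
    name OutlineParser and the "outline" hash key are hypothetical, and
    only the three callbacks needed for the sample input are defined.

```perl
use strict;
use warnings;

package OutlineParser;   # hypothetical subclass name
use HTML::Parser ();
our @ISA = ('HTML::Parser');

# Version 2 style callbacks; the argument lists match the argspecs
# registered by the compatibility handlers shown above.
sub start {
    my ($self, $tagname, $attr, $attrseq, $text) = @_;
    push @{ $self->{outline} }, "start:$tagname";
}

sub end {
    my ($self, $tagname, $text) = @_;
    push @{ $self->{outline} }, "end:$tagname";
}

sub text {
    my ($self, $text, $is_cdata) = @_;
    push @{ $self->{outline} }, "text:$text";
}

package main;

my $p = OutlineParser->new;   # no arguments, so version 2 handlers are set up
$p->parse('<b>Hi</b>');
$p->eof;

my @outline = @{ $p->{outline} };
print "$_\n" for @outline;
```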

SUBCLASSING
    The "HTML::Parser" class can be subclassed. Parser objects are plain
    hashes and "HTML::Parser" reserves only hash keys that start with
    "_hparser". The parser state can be set up by invoking the init()
    method, which takes the same arguments as new().
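
    As an illustration, a subclass can bless its own hash, keep its state
    under a key that does not start with "_hparser", and call init() to
    set up the parser. This is a sketch under assumptions: the class name
    LinkCounter, the method count_link and the key "link_count" are
    hypothetical. The handler is given as a method name, so "self" is the
    first element of the argspec.

```perl
use strict;
use warnings;

package LinkCounter;   # hypothetical subclass name
use HTML::Parser ();
our @ISA = ('HTML::Parser');

sub new {
    my ($class) = @_;
    # Our own state lives in a key outside the reserved "_hparser" prefix.
    my $self = bless { link_count => 0 }, $class;
    # init() accepts the same arguments as new().
    $self->init(
        api_version => 3,
        start_h     => ['count_link', 'self, tagname'],
    );
    return $self;
}

# Method handler: receives the parser object first because the
# argspec starts with "self".
sub count_link {
    my ($self, $tagname) = @_;
    $self->{link_count}++ if $tagname eq 'a';
}

package main;

my $counter = LinkCounter->new;
$counter->parse('<a href="1">x</a><p><A href="2">y</A>');
$counter->eof;

my $links = $counter->{link_count};
print "links seen: $links\n";
```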

EXAMPLES
    The first simple example shows how you might strip out comments from an
    HTML document. We achieve this by setting up a comment handler that does
    nothing and a default handler that will print out anything else:

        use HTML::Parser ();
        HTML::Parser->new(
            default_h => [sub { print shift }, 'text'],
            comment_h => [""],
        )->parse_file(shift || die)
            || die $!;

    An alternative implementation is:

        use HTML::Parser ();
        HTML::Parser->new(
            end_document_h => [sub { print shift }, 'skipped_text'],
            comment_h      => [""],
        )->parse_file(shift || die)
            || die $!;

    This will in most cases be much more efficient since only a single
    callback will be made.
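
    The two approaches can be compared on an in-memory string instead of
    a file. This is a minimal sketch: the sample markup and the variables
    $via_default and $via_skipped are made up for illustration, and the
    output is collected into strings rather than printed by the handlers.

```perl
use strict;
use warnings;
use HTML::Parser ();

my $html = '<p>Hi<!-- secret --> there</p>';

# Variant 1: a default handler copies every event's text, while the
# ""-handler for comments drops them.
my $via_default = '';
my $p1 = HTML::Parser->new(
    api_version => 3,
    default_h   => [
        sub {
            my ($text) = @_;
            $via_default .= $text if defined $text;
        },
        'text',
    ],
    comment_h   => [""],
);
$p1->parse($html);
$p1->eof;

# Variant 2: nothing is handled except the end of the document, which
# receives everything that was skipped along the way.
my $via_skipped = '';
my $p2 = HTML::Parser->new(
    api_version    => 3,
    end_document_h => [sub { $via_skipped .= shift }, 'skipped_text'],
    comment_h      => [""],
);
$p2->parse($html);
$p2->eof;

print "$via_default\n";
print "$via_skipped\n";
```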

    The next example prints out the text that is inside the <title> element
    of an HTML document. Here we start by setting up a start handler. When
    it sees the title start tag it enables a text handler that prints any
    text found and an end handler that will terminate parsing as soon as the
    title end tag is seen:

        use HTML::Parser ();

        sub start_handler {
            return if shift ne "title";
            my $self = shift;


