HTML-Parser


Changes  view on Meta::CPAN

  * HTML::Parser doesn't compile with perl 5.8.0. (Zefram)

3.59     2008-11-24
  * Restore perl-5.6 compatibility for HTML::HeadParser.
  * Improved META.yml

3.58     2008-11-17
  * Suppress "Parsing of undecoded UTF-8 will give garbage" warning
     with attr_encoded [RT#29089]
  * HTML::HeadParser:
       - Recognize the Unicode BOM in utf8_mode as well [RT#27522]
       - Avoid ending up with a '/' attribute key in Link headers.

3.57     2008-11-16
  * The <iframe> element content is now parsed in literal mode.
  * Parsing of <script> and <style> content ends on the first end tag
     even when that tag was in a quoted string.  That seems to be the
     behaviour of all modern browsers.
  * Implement backquote() attribute as requested by Alex Kapranoff.
  * Test and documentation tweaks from Alex Kapranoff.

Changes  view on Meta::CPAN

     This makes sure we never lose old header values.

3.41     2004-11-30
  * Fix unresolved symbol error with perl-5.005.

3.40     2004-11-29
  * Make utf8_mode only available on perl-5.8 or better.  It produced
     garbage with older versions of perl.
  * Emit warning if entities are decoded and something in the first
     chunk looks like hi-bit UTF-8.  Previously this warning was only
     triggered for documents with BOM.

3.39_92     2004-11-23
  * More documentation of the Unicode issues.  Moved around HTML::Parser
     documentation a bit.
  * New boolean option, $p->utf8_mode, to allow parsing of raw UTF-8.
  * Documented that HTML::Entities::decode_entities() can take multiple
     arguments.
  * Unterminated entities are now decoded in text (compatibility
     with MSIE misfeature).
  * Document HTML::Entities::_decode_entities(); this variation of the
     decode_entities() function has been available for a long time, but
     has not been documented until now.
  * HTML::Entities::_decode_entities() can now be told to try to
     expand unterminated entities.
  * Simplified Makefile.PL

3.39_91     2004-11-23
  * The HTML::HeadParser will skip Unicode BOM.  Previously it
     would consider the <head> section done when it saw the BOM.
  * The parser will look for Unicode BOM and give appropriate
     warnings if the form found indicates trouble.
  * If no matching end tag is found for <script>, <style>, <xmp>,
     <title>, <textarea> then generate one where the next tag
     starts.
  * For <script> and <style> recognize quoted strings and don't
     consider the element ended if the corresponding end tag is found
     inside such a string.

3.39_90     2004-11-17
  * The <title> element is now parsed in literal mode, which

README  view on Meta::CPAN

        data before feeding it to the $p->parse(). For $p->parse_file() pass
        a file that has been opened in ":utf8" mode.

        The alternative solution is to enable the "utf8_mode" and not decode
        before passing strings to $p->parse(). The parser can process raw
        undecoded UTF-8 sanely if the "utf8_mode" is enabled, or if the
        "attr", @attr or "dtext" argspecs are avoided.

    Parsing string decoded with wrong endian selection
        (W) The first character in the document is U+FFFE. This is not a
        legal Unicode character but a byte swapped "BOM". The result of
        parsing will likely be garbage.

    Parsing of undecoded UTF-32
        (W) The parser found the Unicode UTF-32 "BOM" signature at the start
        of the document. The result of parsing will likely be garbage.

    Parsing of undecoded UTF-16
        (W) The parser found the Unicode UTF-16 "BOM" signature at the start
        of the document. The result of parsing will likely be garbage.
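
    The byte-level BOM checks behind the UTF-16 and UTF-32 warnings above
    can be sketched in C roughly as follows. This is a minimal sketch, not
    the code in hparser.c: the enum and the function name classify_bom are
    hypothetical, the signatures are the standard Unicode BOM byte
    sequences, and the order of the checks is an assumption. The U+FFFE
    (wrong endian) case applies to already decoded strings and is not
    covered here.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical names for illustration; not part of HTML::Parser. */
enum bom_kind {
    BOM_NONE,
    BOM_UTF8_RAW,   /* EF BB BF: undecoded UTF-8 */
    BOM_UTF16,      /* FF FE (little endian) or FE FF (big endian) */
    BOM_UTF32       /* FF FE 00 00 (little endian) or 00 00 FE FF (big endian) */
};

/* Classify the BOM signature, if any, at the start of a raw byte buffer. */
enum bom_kind classify_bom(const char *beg, size_t len)
{
    /* UTF-32 must be tested first: its little-endian BOM begins with
       the same two bytes as the UTF-16 little-endian BOM. */
    if (len >= 4 && (memcmp(beg, "\xFF\xFE\x00\x00", 4) == 0 ||
                     memcmp(beg, "\x00\x00\xFE\xFF", 4) == 0))
        return BOM_UTF32;
    if (len >= 2 && (memcmp(beg, "\xFF\xFE", 2) == 0 ||
                     memcmp(beg, "\xFE\xFF", 2) == 0))
        return BOM_UTF16;
    if (len >= 3 && memcmp(beg, "\xEF\xBB\xBF", 3) == 0)
        return BOM_UTF8_RAW;
    return BOM_NONE;
}
```

    A caller would pass the first chunk of the document and its length,
    for example classify_bom(buf, n), and warn on any result other than
    BOM_NONE. With this check order, a UTF-16 little-endian document whose
    first character is U+0000 would be reported as UTF-32, an ambiguity
    that is inherent in the signatures themselves.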

SEE ALSO
    HTML::Entities, HTML::PullParser, HTML::TokeParser, HTML::HeadParser,
    HTML::LinkExtor, HTML::Form

    HTML::TreeBuilder (part of the *HTML-Tree* distribution)

    <http://www.w3.org/TR/html4/>

hparser.c  view on Meta::CPAN


    if (p_state->buf && SvOK(p_state->buf)) {
	sv_catsv(p_state->buf, chunk);
	beg = SvPV(p_state->buf, len);
	utf8 = SvUTF8(p_state->buf);
    }
    else {
	beg = SvPV(chunk, len);
	utf8 = SvUTF8(chunk);
	if (p_state->offset == 0 && DOWARN) {
	    /* Print warnings if we find unexpected Unicode BOM forms */
	    if (p_state->argspec_entity_decode &&
		!(p_state->attr_encoded && p_state->argspec_entity_decode == ARG_ATTR) &&
		!p_state->utf8_mode && (
                 (!utf8 && len >= 3 && strnEQ(beg, "\xEF\xBB\xBF", 3)) ||
		 (utf8 && len >= 6 && strnEQ(beg, "\xC3\xAF\xC2\xBB\xC2\xBF", 6)) ||
		 (!utf8 && probably_utf8_chunk(aTHX_ beg, len))
		)
	       )
	    {
		warn("Parsing of undecoded UTF-8 will give garbage when decoding entities");

lib/HTML/HeadParser.pm  view on Meta::CPAN

    print "END[$tag]\n" if $DEBUG;
    $self->flush_text if $self->{'tag'};
    $self->eof if $tag eq 'head';
}

sub text
{
    my($self, $text) = @_;
    print "TEXT[$text]\n" if $DEBUG;
    unless ($self->{first_chunk}) {
	# drop Unicode BOM if found
	if ($self->utf8_mode) {
	    $text =~ s/^\xEF\xBB\xBF//;
	}
	else {
	    $text =~ s/^\x{FEFF}//;
	}
	$self->{first_chunk}++;
    }
    my $tag = $self->{tag};
    if (!$tag && $text =~ /\S/) {

lib/HTML/Parser.pm  view on Meta::CPAN

opened in ":utf8" mode.

The alternative solution is to enable the C<utf8_mode> and not decode before
passing strings to $p->parse().  The parser can process raw undecoded UTF-8
sanely if the C<utf8_mode> is enabled, or if the C<attr>, C<@attr> or C<dtext>
argspecs are avoided.

=item Parsing string decoded with wrong endian selection

(W) The first character in the document is U+FFFE.  This is not a
legal Unicode character but a byte swapped C<BOM>.  The result of parsing
will likely be garbage.

=item Parsing of undecoded UTF-32

(W) The parser found the Unicode UTF-32 C<BOM> signature at the start
of the document.  The result of parsing will likely be garbage.

=item Parsing of undecoded UTF-16

(W) The parser found the Unicode UTF-16 C<BOM> signature at the start of
the document.  The result of parsing will likely be garbage.

=back

=head1 SEE ALSO

L<HTML::Entities>, L<HTML::PullParser>, L<HTML::TokeParser>, L<HTML::HeadParser>,
L<HTML::LinkExtor>, L<HTML::Form>

L<HTML::TreeBuilder> (part of the I<HTML-Tree> distribution)

ppport.h  view on Meta::CPAN

#endif
#if defined(is_utf8_string) && defined(UTF8SKIP)
#ifndef isUTF8_CHAR
#define isUTF8_CHAR(s, e) ( \
(e) <= (s) || ! is_utf8_string(s, UTF8_SAFE_SKIP(s, e)) \
? 0 \
: UTF8SKIP(s))
#endif
#endif
#if 'A' == 65
#ifndef BOM_UTF8
#define BOM_UTF8 "\xEF\xBB\xBF"
#endif
#ifndef REPLACEMENT_CHARACTER_UTF8
#define REPLACEMENT_CHARACTER_UTF8 "\xEF\xBF\xBD"
#endif
#elif '^' == 95
#ifndef BOM_UTF8
#define BOM_UTF8 "\xDD\x73\x66\x73"
#endif
#ifndef REPLACEMENT_CHARACTER_UTF8
#define REPLACEMENT_CHARACTER_UTF8 "\xDD\x73\x73\x71"
#endif
#elif '^' == 176
#ifndef BOM_UTF8
#define BOM_UTF8 "\xDD\x72\x65\x72"
#endif
#ifndef REPLACEMENT_CHARACTER_UTF8
#define REPLACEMENT_CHARACTER_UTF8 "\xDD\x72\x72\x70"
#endif
#else
#error Unknown character set
#endif
#if (PERL_BCDVERSION < 0x5035010)
#undef utf8_to_uvchr_buf
#endif

t/headparser.t  view on Meta::CPAN

    close($fh);
}

$p = HTML::HeadParser->new(H->new);
$p->parse_file($file);
unlink($file) or warn "Can't unlink $file: $!";

ok(!$p->as_string);

SKIP: {
    # Test that the Unicode BOM does not confuse us
    $p = HTML::HeadParser->new(H->new);
    ok($p->parse("\x{FEFF}\n<title>Hi <foo></title>"));
    $p->eof;

    is($p->header("title"), "Hi <foo>");

    $p = HTML::HeadParser->new(H->new);
    $p->utf8_mode(1);
    $p->parse(
        <<"EOT"); # example from http://rt.cpan.org/Ticket/Display.html?id=27522


