HTML-Parser
* HTML::Parser doesn't compile with perl 5.8.0. (Zefram)
3.59 2008-11-24
* Restore perl-5.6 compatibility for HTML::HeadParser.
* Improved META.yml
3.58 2008-11-17
* Suppress "Parsing of undecoded UTF-8 will give garbage" warning
with attr_encoded [RT#29089]
* HTML::HeadParser:
- Recognize the Unicode BOM in utf8_mode as well [RT#27522]
- Avoid ending up with a '/' key as an attribute in Link headers.
3.57 2008-11-16
* The <iframe> element content is now parsed in literal mode.
* Parsing of <script> and <style> content ends on the first end tag
even when that tag was in a quoted string. That seems to be the
behaviour of all modern browsers.
* Implement backquote() attribute as requested by Alex Kapranoff.
* Test and documentation tweaks from Alex Kapranoff.
This makes sure we never lose old header values.
3.41 2004-11-30
* Fix unresolved symbol error with perl-5.005.
3.40 2004-11-29
* Make utf8_mode only available on perl-5.8 or better. It produced
garbage with older versions of perl.
* Emit warning if entities are decoded and something in the first
chunk looks like hi-bit UTF-8. Previously this warning was only
triggered for documents with BOM.
3.39_92 2004-11-23
* More documentation of the Unicode issues. Moved around HTML::Parser
documentation a bit.
* New boolean option, $p->utf8_mode, to allow parsing of raw UTF-8.
* Documented that HTML::Entities::decode_entities() can take multiple
arguments.
* Unterminated entities are now decoded in text (compatibility
with MSIE misfeature).
* Document HTML::Entities::_decode_entities(); this variation of the
decode_entities() function has been available for a long time, but
has not been documented until now.
* HTML::Entities::_decode_entities() can now be told to try to
expand unterminated entities.
* Simplified Makefile.PL
3.39_91 2004-11-23
* The HTML::HeadParser will skip Unicode BOM. Previously it
would consider the <head> section done when it saw the BOM.
* The parser will look for Unicode BOM and give appropriate
warnings if the form found indicates trouble.
* If no matching end tag is found for <script>, <style>, <xmp>,
<title>, or <textarea>, then generate one where the next tag
starts.
* For <script> and <style>, recognize quoted strings and don't
end the element if the corresponding end tag is found
inside such a string.
3.39_90 2004-11-17
* The <title> element is now parsed in literal mode, which
data before feeding it to the $p->parse(). For $p->parse_file() pass
a file that has been opened in ":utf8" mode.
The alternative solution is to enable the "utf8_mode" and not decode
before passing strings to $p->parse(). The parser can process raw
undecoded UTF-8 sanely if the "utf8_mode" is enabled, or if the
"attr", @attr or "dtext" argspecs are avoided.
Parsing string decoded with wrong endian selection
(W) The first character in the document is U+FFFE. This is not a
legal Unicode character but a byte swapped "BOM". The result of
parsing will likely be garbage.
Parsing of undecoded UTF-32
(W) The parser found the Unicode UTF-32 "BOM" signature at the start
of the document. The result of parsing will likely be garbage.
Parsing of undecoded UTF-16
(W) The parser found the Unicode UTF-16 "BOM" signature at the start
of the document. The result of parsing will likely be garbage.
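The BOM signatures behind these warnings can be checked by hand before a document is handed to the parser. The following is a minimal sketch using core Perl only; `sniff_bom` and `starts_with_swapped_bom` are hypothetical helper names and are not part of HTML::Parser, and the parser's own detection logic may differ in detail.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::Parser): classify an undecoded
# byte string by the BOM signature at its start, or return nothing.
sub sniff_bom {
    my ($bytes) = @_;
    # UTF-32 is tested before UTF-16 because FF FE 00 00 also begins
    # with the UTF-16LE signature FF FE.
    return 'UTF-32BE' if $bytes =~ /\A\x00\x00\xFE\xFF/;
    return 'UTF-32LE' if $bytes =~ /\A\xFF\xFE\x00\x00/;
    return 'UTF-8'    if $bytes =~ /\A\xEF\xBB\xBF/;
    return 'UTF-16BE' if $bytes =~ /\A\xFE\xFF/;
    return 'UTF-16LE' if $bytes =~ /\A\xFF\xFE/;
    return;
}

# Hypothetical helper: true if an already decoded string starts with
# U+FFFE, which is what a BOM looks like after decoding with the wrong
# byte order.
sub starts_with_swapped_bom {
    my ($chars) = @_;
    return length($chars) && ord($chars) == 0xFFFE ? 1 : 0;
}

my $encoding = sniff_bom("\xFF\xFE<\x00h\x00");
print "BOM suggests: ", (defined $encoding ? $encoding : 'none'), "\n";
```

A caller could use the returned name to pick a decoder before calling $p->parse(). Note that a UTF-16LE document whose first character after the BOM is U+0000 is indistinguishable from UTF-32LE with this check.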
SEE ALSO
HTML::Entities, HTML::PullParser, HTML::TokeParser, HTML::HeadParser,
HTML::LinkExtor, HTML::Form
HTML::TreeBuilder (part of the *HTML-Tree* distribution)
<http://www.w3.org/TR/html4/>
if (p_state->buf && SvOK(p_state->buf)) {
sv_catsv(p_state->buf, chunk);
beg = SvPV(p_state->buf, len);
utf8 = SvUTF8(p_state->buf);
}
else {
beg = SvPV(chunk, len);
utf8 = SvUTF8(chunk);
if (p_state->offset == 0 && DOWARN) {
/* Print warnings if we find unexpected Unicode BOM forms */
if (p_state->argspec_entity_decode &&
!(p_state->attr_encoded && p_state->argspec_entity_decode == ARG_ATTR) &&
!p_state->utf8_mode && (
(!utf8 && len >= 3 && strnEQ(beg, "\xEF\xBB\xBF", 3)) ||
(utf8 && len >= 6 && strnEQ(beg, "\xC3\xAF\xC2\xBB\xC2\xBF", 6)) ||
(!utf8 && probably_utf8_chunk(aTHX_ beg, len))
)
)
{
warn("Parsing of undecoded UTF-8 will give garbage when decoding entities");
lib/HTML/HeadParser.pm
print "END[$tag]\n" if $DEBUG;
$self->flush_text if $self->{'tag'};
$self->eof if $tag eq 'head';
}
sub text
{
my($self, $text) = @_;
print "TEXT[$text]\n" if $DEBUG;
unless ($self->{first_chunk}) {
# drop Unicode BOM if found
if ($self->utf8_mode) {
$text =~ s/^\xEF\xBB\xBF//;
}
else {
$text =~ s/^\x{FEFF}//;
}
$self->{first_chunk}++;
}
my $tag = $self->{tag};
if (!$tag && $text =~ /\S/) {
lib/HTML/Parser.pm
opened in ":utf8" mode.
The alternative solution is to enable the C<utf8_mode> and not decode before
passing strings to $p->parse(). The parser can process raw undecoded UTF-8
sanely if the C<utf8_mode> is enabled, or if the C<attr>, C<@attr> or C<dtext>
argspecs are avoided.
=item Parsing string decoded with wrong endian selection
(W) The first character in the document is U+FFFE. This is not a
legal Unicode character but a byte swapped C<BOM>. The result of parsing
will likely be garbage.
=item Parsing of undecoded UTF-32
(W) The parser found the Unicode UTF-32 C<BOM> signature at the start
of the document. The result of parsing will likely be garbage.
=item Parsing of undecoded UTF-16
(W) The parser found the Unicode UTF-16 C<BOM> signature at the start of
the document. The result of parsing will likely be garbage.
=back
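The two ways of handling UTF-8 input described in the diagnostics above (decode before parsing, or enable C<utf8_mode> and pass raw bytes) can be compared side by side. This is only a sketch, assuming Perl 5.8 or later with the core Encode module; the helper name `collect_text` and the sample markup are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);
use HTML::Parser ();

# Raw UTF-8 bytes: e-acute is C3 A9, e-grave is C3 A8.
my $bytes = "<p>caf\xC3\xA9 &amp; cr\xC3\xA8me</p>";

# Hypothetical helper: parse $input and return all decoded text ("dtext").
sub collect_text {
    my ($input, %opt) = @_;
    my $out = '';
    my $p = HTML::Parser->new(
        api_version => 3,
        text_h      => [ sub { $out .= shift }, 'dtext' ],
    );
    $p->utf8_mode(1) if $opt{utf8_mode};
    $p->parse($input);
    $p->eof;
    return $out;
}

# Option 1: decode to Perl characters first, then parse.
# The result is a character string.
my $chars = collect_text(decode('UTF-8', $bytes));

# Option 2: parse the raw bytes with utf8_mode enabled.
# The result is still UTF-8 encoded bytes.
my $raw = collect_text($bytes, utf8_mode => 1);

printf "decoded first: %d characters\n", length $chars;
printf "utf8_mode:     %d bytes\n",      length $raw;
```

Which option fits depends on what the rest of the program expects: option 1 suits code that works with character strings throughout, while option 2 avoids a decode step when the output is written back out as UTF-8 anyway.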
=head1 SEE ALSO
L<HTML::Entities>, L<HTML::PullParser>, L<HTML::TokeParser>, L<HTML::HeadParser>,
L<HTML::LinkExtor>, L<HTML::Form>
L<HTML::TreeBuilder> (part of the I<HTML-Tree> distribution)
#endif
#if defined(is_utf8_string) && defined(UTF8SKIP)
#ifndef isUTF8_CHAR
#define isUTF8_CHAR(s, e) ( \
(e) <= (s) || ! is_utf8_string(s, UTF8_SAFE_SKIP(s, e)) \
? 0 \
: UTF8SKIP(s))
#endif
#endif
#if 'A' == 65
#ifndef BOM_UTF8
#define BOM_UTF8 "\xEF\xBB\xBF"
#endif
#ifndef REPLACEMENT_CHARACTER_UTF8
#define REPLACEMENT_CHARACTER_UTF8 "\xEF\xBF\xBD"
#endif
#elif '^' == 95
#ifndef BOM_UTF8
#define BOM_UTF8 "\xDD\x73\x66\x73"
#endif
#ifndef REPLACEMENT_CHARACTER_UTF8
#define REPLACEMENT_CHARACTER_UTF8 "\xDD\x73\x73\x71"
#endif
#elif '^' == 176
#ifndef BOM_UTF8
#define BOM_UTF8 "\xDD\x72\x65\x72"
#endif
#ifndef REPLACEMENT_CHARACTER_UTF8
#define REPLACEMENT_CHARACTER_UTF8 "\xDD\x72\x72\x70"
#endif
#else
#error Unknown character set
#endif
#if (PERL_BCDVERSION < 0x5035010)
#undef utf8_to_uvchr_buf
#endif
t/headparser.t
close($fh);
}
$p = HTML::HeadParser->new(H->new);
$p->parse_file($file);
unlink($file) or warn "Can't unlink $file: $!";
ok(!$p->as_string);
SKIP: {
# Test that the Unicode BOM does not confuse us?
$p = HTML::HeadParser->new(H->new);
ok($p->parse("\x{FEFF}\n<title>Hi <foo></title>"));
$p->eof;
is($p->header("title"), "Hi <foo>");
$p = HTML::HeadParser->new(H->new);
$p->utf8_mode(1);
$p->parse(
<<"EOT"); # example from http://rt.cpan.org/Ticket/Display.html?id=27522