HTML-Parser
* Fix to TokeParser to correctly handle option configuration (Barbie)
* Aesthetic change: remove extra ; (Jon Jensen)
* Trim surrounding whitespace from extracted URLs. (Ville Skyttä)
3.68 2010-09-01
* Declare the encoding of the POD to be utf8
3.67 2010-08-17
* bleadperl 2154eca7 breaks HTML::Parser 3.66 [RT#60368] (Nicholas Clark)
3.66 2010-07-09
* Fix entity decoding in utf8_mode for the title header
3.65 2010-04-04
* Eliminate buggy entities_decode_old
* Fixed endianness typo [RT#50811] (Salvatore Bonaccorso)
* Documentation Fixes. (Ville Skyttä)
3.64 2009-10-25
* Convert files to UTF-8
* Don't allow decode_entities() to generate illegal Unicode chars
* Copyright 2009
* Remove redundant (repeated) test
* Make parse_file() method use 3-arg open [RT#49434]
3.63 2009-10-22
* Take more care to prepare the char range for encode_entities [RT#50170]
* decode_entities confused by trailing incomplete entity
3.62 2009-08-13
* Doc patch: Make it clearer what the return value from ->parse is
* HTTP::Header doc typo fix. (Ville Skyttä)
* Do not bother tracking style or script, they're ignored. (Ville Skyttä)
* Bring HTML 5 head elements up to date with WD-html5-20090423. (Ville Skyttä)
* Improve HeadParser performance. (Ville Skyttä)
3.61 2009-06-20
* Test that triggers the crash that Chip fixed
* Complete documented list of literal tags
* Avoid crash (referenced pend_text instead of skipped_text) (Chip Salzenberg)
* Reference HTML::LinkExtor [RT#43164] (Antonio Radici)
3.60 2009-02-09
* Spelling fixes. (Ville Skyttä)
* Test multi-value headers. (Ville Skyttä)
* Documentation improvements. (Ville Skyttä)
* Do not terminate head parsing on the <object> element (added in HTML 4.0). (Ville Skyttä)
* Add support for HTML 5 <meta charset> and new HEAD elements. (Ville Skyttä)
* Short description of the htextsub example (Damyan Ivanov)
* Suppress warning when encode_entities is called with undef [RT#27567] (Mike South)
* HTML::Parser doesn't compile with perl 5.8.0. (Zefram)
3.59 2008-11-24
* Restore perl-5.6 compatibility for HTML::HeadParser.
* Improved META.yml
3.58 2008-11-17
* Suppress "Parsing of undecoded UTF-8 will give garbage" warning
with attr_encoded [RT#29089]
* HTML::HeadParser:
- Recognize the Unicode BOM in utf8_mode as well [RT#27522]
- Avoid ending up with '/' as an attribute key in Link headers.
3.57 2008-11-16
* The <iframe> element content is now parsed in literal mode.
* Parsing of <script> and <style> content ends on the first end tag
even when that tag was in a quoted string. That seems to be the
behaviour of all modern browsers.
* Implement backquote() attribute as requested by Alex Kapranoff.
* Test and documentation tweaks from Alex Kapranoff.
3.56 2007-01-12
* Cloning of parser state for compatibility with threads.
Fixed by Bo Lindbergh <blgl@hagernas.com>.
* Don't require whitespace between declaration tokens.
<http://rt.cpan.org/Ticket/Display.html?id=20864>
3.55 2006-07-10
* Treat <> at the end of document as text. Used to be
reported as a comment.
* Improved Firefox compatibility for bad HTML:
- Unclosed <script>, <style> are now treated as empty tags.
- Unclosed <textarea>, <xmp> and <plaintext> treat rest as text.
- Unclosed <title> closes at next tag.
* Make <!a'b> a comment by itself.
3.54 2006-04-28
* Yaakov Belch discovered yet another issue with <script> parsing.
Enabling of 'empty_element_tags' got the parser confused
if it found such a tag for elements that are normally parsed
in literal mode. Of these <script src="..."/> is the only
one likely to be found in documents.
<http://rt.cpan.org//Ticket/Display.html?id=18965>
3.53 2006-04-27
* When ignore_element was enabled it got confused if the
corresponding tags did not nest properly; the end tag
was treated as if it was a start tag.
Found and fixed by Yaakov Belch <code@yaakovnet.net>.
<http://rt.cpan.org/Ticket/Display.html?id=18936>
3.52 2006-04-26
* Make sure the 'start_document' fires exactly once for
each document parsed. For earlier releases it did not
fire at all for empty documents and could fire multiple
times if parse was called with empty chunks.
* Documentation tweaks and typo fixes.
3.51 2006-03-22
* Named entities outside the Latin-1 range are now only expanded
when properly terminated with ";". This makes HTML::Parser
compatible with Firefox/Konqueror/MSIE when it comes to how these
entities are expanded in attribute values. Firefox does expand
unterminated non-Latin-1 entities in plain text, so here
HTML::Parser only stays compatible with Konqueror/MSIE.
Fixes <http://rt.cpan.org/Ticket/Display.html?id=17962>.
* Fixed some documentation typos spotted by <william@knowmad.com>.
<http://rt.cpan.org/Ticket/Display.html?id=18062>
3.50 2006-02-14
* The 3.49 release didn't compile with VC++ because it mixed code
* Enabling empty_element_tags by default for HTML::TokeParser
was a mistake. Reverted that change.
<http://rt.cpan.org/Ticket/Display.html?id=16164>
* When processing a document with "marked_sections => 1", the
skipped text missed the first 3 bytes "<![".
<http://rt.cpan.org/Ticket/Display.html?id=16207>
3.47 2005-11-22
* Added empty_element_tags and xml_pic configuration
options. These make it possible to enable these XML
features without enabling the full XML-mode.
* The empty_element_tags option is now enabled by default for
HTML::TokeParser.
3.46 2005-10-24
* Don't try to treat a literal non-breaking space as space.
This breaks Unicode parsing.
<http://rt.cpan.org/Ticket/Display.html?id=15068>
* The unbroken_text option is now on by default
for HTML::TokeParser.
* HTML::Entities::encode will now encode "'" by default.
* Improved report/ignore_tags documentation by
Norbert Kiesel <nkiesel@tbdnetworks.com>.
* Test suite now uses Test::More, by
Norbert Kiesel <nkiesel@tbdnetworks.com>.
* Fix HTML::Entities typo spotted by
Stefan Funke <bundy@adm.arcor.net>.
* Faster load time with XSLoader (perl-5.6 or better now required).
* Fixed POD markup errors in some of the modules.
3.45 2005-01-06
* Fix stack memory leak caused by missing PUTBACK. Only
code that used $p->parse(\&cb) form was affected.
Fix provided by Gurusamy Sarathy <gsar@sophos.com>.
3.44 2004-12-28
* Fix confusion about nested quotes in <script> and <style> text.
3.43 2004-12-06
* The SvUTF8 flag was not propagated correctly when replacing
unterminated entities.
* Fixed test failure because of missing binmode on Windows.
3.42 2004-12-04
* Avoid sv_catpvn_utf8_upgrade() as that macro was not
available in perl-5.8.0.
Patch by Reed Russell <Russell.Reed@acxiom.com>.
* Add casts to suppress compilation warnings for char/U8
mismatches.
* HTML::HeadParser will always push new header values.
This makes sure we never lose old header values.
3.41 2004-11-30
* Fix unresolved symbol error with perl-5.005.
3.40 2004-11-29
* Make utf8_mode only available on perl-5.8 or better. It produced
garbage with older versions of perl.
* Emit warning if entities are decoded and something in the first
chunk looks like hi-bit UTF-8. Previously this warning was only
triggered for documents with BOM.
3.39_92 2004-11-23
* More documentation of the Unicode issues. Moved around HTML::Parser
documentation a bit.
* New boolean option; $p->utf8_mode to allow parsing of raw UTF-8.
* Documented that HTML::Entities::decode_entities() can take multiple
arguments.
* Unterminated entities are now decoded in text (compatibility
with MSIE misfeature).
* Document HTML::Entities::_decode_entities(); this variation of the
decode_entities() function has been available for a long time, but
has not been documented until now.
* HTML::Entities::_decode_entities() can now be told to try to
expand unterminated entities.
* Simplified Makefile.PL
3.39_91 2004-11-23
* The HTML::HeadParser will skip Unicode BOM. Previously it
would consider the <head> section done when it saw the BOM.
* The parser will look for Unicode BOM and give appropriate
warnings if the form found indicates trouble.
* If no matching end tag is found for <script>, <style>, <xmp>,
<title>, <textarea> then generate one where the next tag
starts.
* For <script> and <style> recognize quoted strings and don't
consider end element if the corresponding end tag is found
inside such a string.
3.39_90 2004-11-17
* The <title> element is now parsed in literal mode, which
means that other tags are not recognized until </title> has
been seen.
* Unicode support for perl-5.8 and better.
* Decoding Unicode entities always enabled; no longer a compile
time option.
* Propagation of UTF8 state on strings.
Patch contributed by John Gardiner Myers <jgmyers@proofpoint.com>.
* Calculate offsets and lengths in chars for Unicode strings.
* Fixed link typo in the HTML::TokeParser documentation.
3.38 2004-11-11
* New boolean option; $p->closing_plaintext
Contributed by Alex Kapranoff <alex@kapranoff.ru>
3.37 2004-11-10
* Improved handling of HTML encoded surrogate pairs and illegally
encoded Unicode; <http://rt.cpan.org/Ticket/Display.html?id=7785>.
Patch by John Gardiner Myers <jgmyers@proofpoint.com>.
* Avoid generating bad UTF8 strings when decoding entities
representing chars beyond #255 in 8-bit strings. Such bad
UTF8 sometimes made perl-5.8.5 and older segfault.
* Undocument v2 style subclassing in synopsis section.
* Internal cleanup: Make 'gcc -Wall' happier.
* Avoid modification of PVs during parsing of attrspec.
Another patch by John Gardiner Myers.
3.36 2004-04-01
* Improved MSIE/Mozilla compatibility. If the same attribute
name repeats for a start tag, use the first value instead
of the last. Patch by Nick Duffek <html-parser@duffek.com>.
<http://rt.cpan.org/Ticket/Display.html?id=5472>
3.35 2003-12-12
* Documentation fixes by Paul Croome <Paul.Croome@softwareag.com>.
* Removed redundant dSP.
3.34 2003-10-27
* Fix segfault that happened when the parse callback caused
the stack to get reallocated. The original bug report was
<http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=217616>
3.33 2003-10-14
* Perl 5.005 or better is now required. For some reason we get
a test failure with perl-5.004 and I don't really feel like
debugging that perl any more. Details about this failure can
be found at <http://rt.cpan.org/Ticket/Display.html?id=4065>.
* New HTML::TokeParser method called 'get_phrase'. It returns
all current text while ignoring any phrase-level markup.
* The HTML::TokeParser method 'get_text' now expands skipped
non-phrase-level tags as a single space.