HTML-Parser

Changes  view on Meta::CPAN

  * fix to TokeParser to correctly handle option configuration (Barbie)
  * Aesthetic change: remove extra ; (Jon Jensen)
  * Trim surrounding whitespace from extracted URLs. (Ville Skyttä)

3.68     2010-09-01
  * Declare the encoding of the POD to be utf8

3.67     2010-08-17
  * bleadperl 2154eca7 breaks HTML::Parser 3.66 [RT#60368] (Nicholas Clark)

3.66     2010-07-09
  * Fix entity decoding in utf8_mode for the title header

3.65     2010-04-04
  * Eliminate buggy entities_decode_old
  * Fixed endianness typo [RT#50811] (Salvatore Bonaccorso)
  * Documentation Fixes. (Ville Skyttä)

3.64     2009-10-25
  * Convert files to UTF-8
  * Don't allow decode_entities() to generate illegal Unicode chars
  * Copyright 2009
  * Remove redundant (repeated) test
  * Make parse_file() method use 3-arg open [RT#49434]

3.63     2009-10-22
  * Take more care to prepare the char range for encode_entities [RT#50170]
  * decode_entities confused by trailing incomplete entity

3.62     2009-08-13
  * Doc patch: Make it clearer what the return value from ->parse is
  * HTTP::Header doc typo fix. (Ville Skyttä)
  * Do not bother tracking style or script, they're ignored. (Ville Skyttä)
  * Bring HTML 5 head elements up to date with WD-html5-20090423. (Ville Skyttä)
  * Improve HeadParser performance. (Ville Skyttä)

3.61     2009-06-20
  * Test that triggers the crash that Chip fixed
  * Complete documented list of literal tags
  * Avoid crash (referenced pend_text instead of skipped_text) (Chip Salzenberg)
  * Reference HTML::LinkExtor [RT#43164] (Antonio Radici)

3.60     2009-02-09
  * Spelling fixes. (Ville Skyttä)
  * Test multi-value headers. (Ville Skyttä)
  * Documentation improvements. (Ville Skyttä)
  * Do not terminate head parsing on the <object> element (added in HTML 4.0). (Ville Skyttä)
  * Add support for HTML 5 <meta charset> and new HEAD elements. (Ville Skyttä)
  * Short description of the htextsub example (Damyan Ivanov)
  * Suppress warning when encode_entities is called with undef [RT#27567] (Mike South)
  * HTML::Parser doesn't compile with perl 5.8.0. (Zefram)

3.59     2008-11-24
  * Restore perl-5.6 compatibility for HTML::HeadParser.
  * Improved META.yml

3.58     2008-11-17
  * Suppress "Parsing of undecoded UTF-8 will give garbage" warning
     with attr_encoded [RT#29089]
  * HTML::HeadParser:
       - Recognize the Unicode BOM in utf8_mode as well [RT#27522]
       - Avoid ending up with '/' attribute keys in Link headers.

3.57     2008-11-16
  * The <iframe> element content is now parsed in literal mode.
  * Parsing of <script> and <style> content ends on the first end tag
     even when that tag was in a quoted string.  That seems to be the
     behaviour of all modern browsers.
  * Implement backquote() attribute as requested by Alex Kapranoff.
  * Test and documentation tweaks from Alex Kapranoff.

3.56     2007-01-12
  * Cloning of parser state for compatibility with threads.
     Fixed by Bo Lindbergh <blgl@hagernas.com>.
  * Don't require whitespace between declaration tokens.
     <http://rt.cpan.org/Ticket/Display.html?id=20864>

3.55     2006-07-10
  * Treat <> at the end of document as text.  Used to be
     reported as a comment.
  * Improved Firefox compatibility for bad HTML:
      - Unclosed <script>, <style> are now treated as empty tags.
      - Unclosed <textarea>, <xmp> and <plaintext> treat rest as text.
      - Unclosed <title> closes at next tag.
  * Make <!a'b> a comment by itself.

3.54     2006-04-28
  * Yaakov Belch discovered yet another issue with <script> parsing.
     Enabling of 'empty_element_tags' got the parser confused
     if it found such a tag for elements that are normally parsed
     in literal mode.  Of these <script src="..."/> is the only
     one likely to be found in documents.
     <http://rt.cpan.org//Ticket/Display.html?id=18965>

3.53     2006-04-27
  * When ignore_element was enabled it got confused if the
     corresponding tags did not nest properly; the end tag
     was treated as if it was a start tag.
     Found and fixed by Yaakov Belch <code@yaakovnet.net>.
     <http://rt.cpan.org/Ticket/Display.html?id=18936>

3.52     2006-04-26
  * Make sure the 'start_document' fires exactly once for
     each document parsed.  For earlier releases it did not
     fire at all for empty documents and could fire multiple
     times if parse was called with empty chunks.
  * Documentation tweaks and typo fixes.
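
The 'start_document' behaviour described for 3.52 can be checked with a
small sketch like the one below.  The counter variable and the sample
markup are made up for illustration; only the 'start_document' event
and the parse/eof calls come from HTML::Parser itself.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Count how often the start_document event fires (illustrative counter).
my $starts = 0;
my $p = HTML::Parser->new(
    api_version      => 3,
    start_document_h => [ sub { $starts++ }, 'self' ],
);

$p->parse('');               # empty chunks should not trigger extra events
$p->parse('');
$p->parse('<p>hello</p>');
$p->eof;

print "start_document fired $starts time(s)\n";
```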

3.51     2006-03-22
  * Named entities outside the Latin-1 range are now only expanded
     when properly terminated with ";".  This makes HTML::Parser
     compatible with Firefox/Konqueror/MSIE when it comes to how these
     entities are expanded in attribute values.  Firefox does expand
     unterminated non-Latin-1 entities in plain text, so here
     HTML::Parser only stays compatible with Konqueror/MSIE.
     Fixes <http://rt.cpan.org/Ticket/Display.html?id=17962>.
  * Fixed some documentation typos spotted by <william@knowmad.com>.
     <http://rt.cpan.org/Ticket/Display.html?id=18062>

3.50     2006-02-14
  * The 3.49 release didn't compile with VC++ because it mixed code

  * Enabling empty_element_tags by default for HTML::TokeParser
     was a mistake.  Reverted that change.
     <http://rt.cpan.org/Ticket/Display.html?id=16164>
  * When processing a document with "marked_sections => 1", the
     skipped text missed the first 3 bytes "<![".
     <http://rt.cpan.org/Ticket/Display.html?id=16207>

3.47     2005-11-22
  * Added empty_element_tags and xml_pic configuration
     options.  These make it possible to enable these XML
     features without enabling the full XML-mode.
  * The empty_element_tags is enabled by default for
     HTML::TokeParser.
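
A minimal sketch of what empty_element_tags does, assuming a parser
built with the version 3 API.  The @events array and the sample markup
are illustrative only.  With the option enabled, a tag written as
<br/> should produce an end event in addition to the start event.

```perl
use strict;
use warnings;
use HTML::Parser ();

my @events;    # illustrative log of the events seen
my $p = HTML::Parser->new(
    api_version        => 3,
    empty_element_tags => 1,
    start_h => [ sub { push @events, "start:$_[0]" }, 'tagname' ],
    end_h   => [ sub { push @events, "end:$_[0]" },   'tagname' ],
);

$p->parse('<br/><p>hi</p>');
$p->eof;

print join(',', @events), "\n";
```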

3.46     2005-10-24
  * Don't try to treat a literal &nbsp; as space.
     This breaks Unicode parsing.
     <http://rt.cpan.org/Ticket/Display.html?id=15068>
  * The unbroken_text option is now on by default
     for HTML::TokeParser.
  * HTML::Entities::encode will now encode "'" by default.
  * Improved report/ignore_tags documentation by
     Norbert Kiesel <nkiesel@tbdnetworks.com>.
  * Test suite now uses Test::More, by
     Norbert Kiesel <nkiesel@tbdnetworks.com>.
  * Fix HTML::Entities typo spotted by
     Stefan Funke <bundy@adm.arcor.net>.
  * Faster load time with XSLoader (perl-5.6 or better now required).
  * Fixed POD markup errors in some of the modules.
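
The quoting change in 3.46 can be seen with the exported
encode_entities() function, as in the sketch below.  The input strings
are made up.  When a second argument names the characters to encode,
the single quote is left alone unless it is listed there.

```perl
use strict;
use warnings;
use HTML::Entities qw(encode_entities);

# Default set of unsafe characters: the single quote is encoded too.
my $all = encode_entities(q{It's <b>"bold"</b> & more});
# $all is now: It&#39;s &lt;b&gt;&quot;bold&quot;&lt;/b&gt; &amp; more

# Explicit set: only <, > and & are encoded.
my $some = encode_entities(q{It's <b>}, '<>&');
# $some is now: It's &lt;b&gt;

print "$all\n$some\n";
```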

3.45     2005-01-06
  * Fix stack memory leak caused by missing PUTBACK.  Only
     code that used $p->parse(\&cb) form was affected.
     Fix provided by Gurusamy Sarathy <gsar@sophos.com>.

3.44     2004-12-28
  * Fix confusion about nested quotes in <script> and <style> text.

3.43     2004-12-06
  * The SvUTF8 flag was not propagated correctly when replacing
     unterminated entities.
  * Fixed test failure because of missing binmode on Windows.

3.42     2004-12-04
  * Avoid sv_catpvn_utf8_upgrade() as that macro was not
     available in perl-5.8.0.
     Patch by Reed Russell <Russell.Reed@acxiom.com>.
  * Add casts to suppress compilation warnings for char/U8
     mismatches.
  * HTML::HeadParser will always push new header values.
     This makes sure we never lose old header values.

3.41     2004-11-30
  * Fix unresolved symbol error with perl-5.005.

3.40     2004-11-29
  * Make utf8_mode only available on perl-5.8 or better.  It produced
     garbage with older versions of perl.
  * Emit warning if entities are decoded and something in the first
     chunk looks like hi-bit UTF-8.  Previously this warning was only
     triggered for documents with BOM.

3.39_92     2004-11-23
  * More documentation of the Unicode issues.  Moved around HTML::Parser
     documentation a bit.
  * New boolean option; $p->utf8_mode to allow parsing of raw UTF-8.
  * Documented that HTML::Entities::decode_entities() can take multiple
     arguments.
  * Unterminated entities are now decoded in text (compatibility
     with MSIE misfeature).
  * Document HTML::Entities::_decode_entities(); this variation of the
     decode_entities() function has been available for a long time, but
     has not been documented until now.
  * HTML::Entities::_decode_entities() can now be told to try to
     expand unterminated entities.
  * Simplified Makefile.PL
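
A short sketch of decode_entities() with more than one argument.  The
variable names and strings are illustrative.  In void context the
arguments are decoded in place; in list context decoded copies are
returned and the originals are left untouched.

```perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities);

my $title = 'Fish &amp; Chips';
my $body  = '&lt;p&gt;';

# List context: one decoded copy per argument.
my @copies = decode_entities($title, $body);

# Void context: both variables are modified in place.
decode_entities($title, $body);

print "$title\n$body\n";
```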

3.39_91     2004-11-23
  * The HTML::HeadParser will skip Unicode BOM.  Previously it
     would consider the <head> section done when it saw the BOM.
  * The parser will look for Unicode BOM and give appropriate
     warnings if the form found indicates trouble.
  * If no matching end tag is found for <script>, <style>, <xmp>,
     <title>, <textarea> then generate one where the next tag
     starts.
  * For <script> and <style> recognize quoted strings and don't
     consider the element ended if the corresponding end tag is
     found inside such a string.

3.39_90     2004-11-17
  * The <title> element is now parsed in literal mode, which
     means that other tags are not recognized until </title> has
     been seen.
  * Unicode support for perl-5.8 and better.
  * Decoding Unicode entities always enabled; no longer a compile
     time option.
  * Propagation of UTF8 state on strings.
     Patch contributed by John Gardiner Myers <jgmyers@proofpoint.com>.
  * Calculate offsets and lengths in chars for Unicode strings.
  * Fixed link typo in the HTML::TokeParser documentation.

3.38     2004-11-11
  * New boolean option; $p->closing_plaintext
     Contributed by Alex Kapranoff <alex@kapranoff.ru>

3.37     2004-11-10
  * Improved handling of HTML encoded surrogate pairs and illegally
     encoded Unicode; <http://rt.cpan.org/Ticket/Display.html?id=7785>.
     Patch by John Gardiner Myers <jgmyers@proofpoint.com>.
  * Avoid generating bad UTF8 strings when decoding entities
     representing chars beyond #255 in 8-bit strings.  Such bad
     UTF8 sometimes made perl-5.8.5 and older segfault.
  * Undocument v2 style subclassing in synopsis section.
  * Internal cleanup: Make 'gcc -Wall' happier.
  * Avoid modification of PVs during parsing of attrspec.
    Another patch by John Gardiner Myers.

3.36     2004-04-01
  * Improved MSIE/Mozilla compatibility.  If the same attribute
     name repeats for a start tag, use the first value instead
     of the last.  Patch by Nick Duffek <html-parser@duffek.com>.
     <http://rt.cpan.org/Ticket/Display.html?id=5472>
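
The repeated-attribute rule from 3.36 can be observed through the
'attr' argspec, as sketched below.  The %seen hash and the sample tag
are illustrative.  According to the entry above, the first href value
should be the one that ends up in the attribute hash.

```perl
use strict;
use warnings;
use HTML::Parser ();

my %seen;    # illustrative copy of the attribute hash from the start event
my $p = HTML::Parser->new(
    api_version => 3,
    start_h     => [
        sub {
            my ($tagname, $attr) = @_;
            %seen = %$attr;
        },
        'tagname, attr',
    ],
);

$p->parse('<a href="first.html" href="second.html">link</a>');
$p->eof;

print "href = $seen{href}\n";
```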

3.35     2003-12-12
  * Documentation fixes by Paul Croome <Paul.Croome@softwareag.com>.
  * Removed redundant dSP.

3.34     2003-10-27
  * Fix segfault that happened when the parse callback caused
     the stack to get reallocated.  The original bug report was
     <http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=217616>

3.33     2003-10-14
  * Perl 5.005 or better is now required.  For some reason we get
     a test failure with perl-5.004 and I don't really feel like
     debugging that perl any more.  Details about this failure can
     be found at <http://rt.cpan.org/Ticket/Display.html?id=4065>.
  * New HTML::TokeParser method called 'get_phrase'.  It returns
     all current text while ignoring any phrase-level markup.
  * The HTML::TokeParser method 'get_text' now expands skipped
     non-phrase-level tags as a single space.
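
A minimal sketch of 'get_phrase', assuming the document is passed to
HTML::TokeParser as a reference to a string.  The markup is made up.
The call collects text up to the next tag that is not phrase-level
and ignores the <b> markup on the way.

```perl
use strict;
use warnings;
use HTML::TokeParser ();

my $html = '<p>Hello <b>bold</b> world</p><p>next</p>';
my $tp   = HTML::TokeParser->new(\$html);

$tp->get_tag('p');                 # position just after the first <p>
my $phrase = $tp->get_phrase;      # text up to the closing </p>

print "$phrase\n";
```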


