HTML-Parser
be reported if requested via the 'attr' or 'tokens' argspecs
for the 'end' handler.
- Parse '</:comment>' and '</ comment>' as comments unless
strict_comment is enabled. Previous versions of the parser
would report these as text. If these comments contain
quoted words prefixed by space or '=' these words can
contain '>' without terminating the comment.
- Parse '<! "<>" foo>' as comment containing ' "<>" foo'.
Previous versions of the parser would terminate the comment
at the first '>' and report the rest as text.
- Legacy comment mode: Parse with comments terminated with a
lone '>' if no '-->' is found before eof.
- Incomplete tag at eof is reported as a 'comment' instead
of 'text' unless strict_comment is enabled.
3.28 2003-04-16
* When 'strict_comment' is off (which it is by default)
treat anything that matches <!...> as a comment.
* Should now be more efficient on threaded perls.
3.27 2003-01-18
* Typo fixes to the documentation.
* HTML::Entities::escape_entities_numeric contributed
by Sean M. Burke <sburke@cpan.org>.
* Included one more example program 'hlc' that shows
how to downcase all tags in an HTML file.
3.26 2002-03-17
* Avoid core dump in some cases where the callback croaks.
The perl_call_method and perl_call_sv calls need the G_EVAL flag
to be safe.
* New parser attributes; 'attr_encoded' and 'case_sensitive'.
Contributed by Guy Albertelli II <guy@albertelli.com>.
* HTML::Entities
- don't encode \r by default as suggested by Sean M. Burke.
* HTML::HeadParser
- ignore empty http-equiv
- allow multiple <link> elements. Patch by
Timur I. Bakeyev <timur@gnu.org>
* Avoid warnings from bleadperl on the uentities test.
3.25 2001-05-11
* Minor tweaks for build failures on perl5.004_04, perl-5.6.0,
and for macro clash under Windows.
* Improved parsing of <plaintext>... :-)
3.24 2001-05-09
* $p->parse(CODE)
* New events: start_document, end_document
* New argspecs: skipped_text, offset_end
* The offset/line/column counters were not properly reset
after eof.
3.23 2001-05-01
* The $p->ignore_elements filter did not work as it should if
handlers for start/end events were not registered.
3.22 2001-04-17
* The <textarea> element is now parsed in literal mode, i.e. no other tags
recognized until the </textarea> tag is seen. Unlike other literal elements,
the text content is not 'cdata'.
* The XML &apos; entity is decoded. The apos-char itself is still encoded as
&#39; as &apos; is not really an HTML entity, and not recognized by many HTML
browsers.
3.21 2001-04-10
* Fix a memory leak which occurred when using filter methods.
* Avoid a few compiler warnings (DEC C):
- Trailing comma found in enumerator list
- "unsigned char" is not compatible with "const char".
* Doc update.
3.20 2001-04-02
* Some minor documentation updates.
3.19_94 2001-03-30
* Implemented 'tag', 'line', 'column' argspecs.
* HTML::PullParser doc update.
eg/hform is an example of HTML::PullParser usage.
3.19_93 2001-03-27
* Shorten 'report_only_tags' to 'report_tags'.
I think it reads better.
* Bleadperl portability fixes.
3.19_92 2001-03-25
* HTML::HeadParser made more efficient by using 'ignore_elements'.
* HTML::LinkExtor made more efficient by using 'report_only_tags'.
* HTML::TokeParser generalized into HTML::PullParser. HTML::PullParser
only supports the get_token/unget_token interface of HTML::TokeParser,
but is more flexible because the information that makes up a token
is customisable. HTML::TokeParser is made into an HTML::PullParser
subclass.
3.19_91 2001-03-19
* Array references can be passed to the filter methods. Makes it easier
to use them as constructor options.
* Example programs updated to use filters.
* Reset ignored_element state on EOF.
* Documentation updates.
* The netscape_buggy_comment() method now generates a mandatory warning
about its deprecation.
3.19_90 2001-03-13
* This is a developer-only release. It contains some new
experimental features. The interface to these might still change.
* Implemented filters to reduce the numbers of callbacks generated:
- $p->ignore_tags()
- $p->report_only_tags()
- $p->ignore_elements()
* New @attr argspec. Less overhead than 'attr' and allows
compatibility with XML::Parser style start events.
* The whole argspec can be wrapped up in @{...} to signal
flattening. Only makes a difference when the target is an
array.
3.19 2001-03-09
* Avoid the entity2char global. That should make the module
more thread safe. Patch by Gurusamy Sarathy <gsar@ActiveState.com>.
3.18 2001-02-24
3.04 2000-01-15
* Backed out 3.03-patch that checked for legal handler and attribute
names in the HTML::Parser constructor.
* Documentation typo fixed by Michael.
3.03 2000-01-14
* We did not get out of comment mode for comments ending with an
odd number of "-" before ">". Patch by la mouton <kero@3sheep.com>
* Documentation patch by Michael.
3.02 1999-12-21
* Hide ~-magic IV-pointer to 'struct p_state' behind a reference.
This allows copying of the internal _hparser_xs_state element, and
will make HTML-Tree-0.61 work again.
* Introduced $p->init() which might be useful for subclasses that
only want the initialization part of the constructor.
* Filled out DIAGNOSTICS section of the HTML::Parser POD.
3.01 1999-12-19
* Rely on ~-magic instead of a DESTROY method to deallocate
the internal 'struct p_state'. This avoids memory leaks
when people simply wipe out the content of the object hash.
* One of the assertions in hparser.c had opposite logic. This made
the parser fail when compiled with a -DDEBUGGING perl.
* Don't assume any specific order of hash keys in the t/cases.t.
This test failed with some newer development releases of perl.
3.00 1999-12-14
* Documentation update (most of it from Michael)
* Minor patch to eg/hstrip so that it uses a "" handler
instead of &ignore.
* Test suite patches from Michael
2.99_96 1999-12-13
* Patches from Michael:
- A handler of "" means that the event will be ignored.
More efficient than using 'sub {}' as handler.
- Don't use a perl hash for looking up argspec keywords.
- Documentation tweaks.
2.99_95 1999-12-09
* (this is a 3.00 candidate)
* Fixed core dump when "<" was followed by an 8-bit character.
Spotted and test case provided by Doug MacEachern. Doug had
been running HTML-Parser-XS through more than 1 million URLs that
had been downloaded via LWP.
* Handlers can now invoke $p->eof to request the parsing to terminate.
HTML::HeadParser has been simplified by taking advantage of this.
Also added a title-extraction example that uses this.
* Michael once again fixed my bad English in the HTML::Parser
documentation.
* netscape_buggy_comment will carp instead of warn
* updated TODO/README
* Documented that HTML::Filter is deprecated.
* Made backslash reserved in literal argspec strings.
* Added several new test scripts.
2.99_94 1999-12-08
* (should almost be a 3.00 candidate)
* Renamed 'cdata_flag' as 'is_cdata'.
* Dropped support for wrapping callback handler and argspec
in an array and passing a reference to $p->handler. It
created ambiguities when you want to pass an array as
handler destination and not update argspec. The wrapping
for constructor arguments is unchanged.
* Reworked the documentation after updates from Michael.
* Simplified internal check_handler(). It should probably simply
be inlined in handler() again.
* Added argspec 'length' and 'undef'
* Fix statement-less label. Fix suggested by Matthew Langford
<langfml@Eng.Auburn.EDU>.
* Added two more example programs: eg/hstrip and eg/htext.
* Various minor patches from Michael.
2.99_93 1999-12-07
* Documentation update
* $p->bool_attr_value renamed as $p->boolean_attribute_value
* Internal renaming: attrspec --> argspec
* Introduced internal 'enum argcode' in hparser.c
* Added eg/hrefsub
2.99_92 1999-12-05
* More documentation patches from Michael
* Renamed 'token1' as 'token0' as suggested by Michael
* For artificial end tags we now report 'tokens', but not 'tokenpos'.
* Boolean attribute values show up as (0, 0) in 'tokenpos' now.
* If $p->bool_attr_value is set it will influence 'tokens'
* Fix for core dump when parsing <a "> when $p->strict_names(0).
Based on fix by Michael.
* Will av_extend() the tokens/tokenpos arrays.
* New test suite script by Michael: t/attrspec.t
2.99_91 1999-12-04
* Implemented attrspec 'offset'
* Documentation patch from Michael
* Some more cleanup/updated TODO
2.99_90 1999-12-03
* (first beta for 3.00)
* Using "realloc" as a parameter name in grow_tokens created
problems for some people. Fix by Paul Schinder <schinder@pobox.com>
* Patch by Michael that makes array handler destinations really work.
* Patch by Michael that makes HTML::TokeParser use this. This gave
a speedup of about 80%.
* Patch by Michael that makes t/cases into a real test.
* Small HTML::Parser documentation patch by Michael.
* Renamed attrspec 'origtext' to 'text' and 'decoded_text' to 'dtext'
* Split up Parser.xs. Moved stuff into hparser.c and util.c
* Dropped html_ prefix from internal parser functions.
* Renamed internal function html_handle() as report_event().
2.99_17 1999-12-02
* HTML::Parser documentation patch from Michael.
* Fix memory leaks in html_handler()
* Patch that makes an array legal as handler destination.
Also from Michael.
* The end of marked sections does not eat the successive newline
any more.
* The artificial end event for empty tag in xml_mode did not
report an empty origtext.