HTML-Parser-Simple
Changelog.ini
EOT
[V 1.02]
Date=2009-02-26T11:24:00
Comments= <<EOT
- Rename scripts/parse.file.pl to scripts/parse.html.pl
- Ship scripts/parse.xhtml.pl
- Ship t/Data.pm to read in test data from t/data/
- Rewrite t/*.t to use t/Data.pm
- Patch Simple.pm to accept xhtml as a parameter to new
- Patch Simple.pm to use xhtml in a few places. XHTML support is not finished!
- Patch Simple.pm to use accessors for object attributes as per PBP. Specifically:
get/set_*() for current_node, depth, input_dir, node_type, output_dir, root, verbose, xhtml
- Hence, rename root() to get_root()
- Hence, rename verbose() to get_verbose()
- Rename new_node() to create_new_node(), since that makes more sense when using get/set_*()
- There are no methods get_result() and set_result(). The reason is efficiency. If we had
$self -> set_result($self -> get_result() . '<tag>') it would mean duplicating the result so far
each time a few chars were added
- Ship various tests, with data, for XHTML
- Add depth to the hashref of data for each tag's node in the tree
1.03 Fri Jun 12 11:49:00 2009
- Improved tests and documentation (Mark Stosberg)
- Added attribute parsing via HTML::Parser::Simple::Attributes (Mark Stosberg)
1.02 Thu Feb 26 11:24:00 2009
- Rename scripts/parse.file.pl to scripts/parse.html.pl
- Ship scripts/parse.xhtml.pl
- Ship t/Data.pm to read in test data from t/data/
- Rewrite t/*.t to use t/Data.pm
- Patch Simple.pm to accept xhtml as a parameter to new
- Patch Simple.pm to use xhtml in a few places. XHTML support is not finished!
- Patch Simple.pm to use accessors for object attributes as per PBP. Specifically:
get/set_*() for current_node, depth, input_dir, node_type, output_dir, root, verbose, xhtml
- Hence, rename root() to get_root()
- Hence, rename verbose() to get_verbose()
- Rename new_node() to create_new_node(), since that makes more sense when using get/set_*()
- There are no methods get_result() and set_result(). The reason is efficiency. If we had
$self -> set_result($self -> get_result() . '<tag>') it would mean duplicating the result so far
each time a few chars were added
- Ship various tests, with data, for XHTML
- Add depth to the hashref of data for each tag's node in the tree
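The efficiency argument in the 1.02 entry can be illustrated with a minimal sketch. Demo::Result is a hypothetical stand-in class, not part of HTML::Parser::Simple, and append_result() is an invented name used only to show an in-place append. Both routes build the same string; the accessor route copies the whole accumulated result on every call, which is the cost the changelog describes.

```perl
use strict;
use warnings;

# Hypothetical stand-in class for illustration only.
# It is not the real HTML::Parser::Simple.
package Demo::Result;

sub new
{
	my($class) = @_;

	return bless({result => ''}, $class);
}

# In-place append: only the new fragment is copied.
sub append_result
{
	my($self, $text) = @_;

	$self -> {result} .= $text;

	return $self;
}

# Accessor pair of the kind the changelog decided against.
sub get_result
{
	my($self) = @_;

	return $self -> {result};
}

sub set_result
{
	my($self, $text) = @_;

	$self -> {result} = $text;

	return $self;
}

package main;

my(@fragment) = ('<html>', '<body>', '</body>', '</html>');

# Route 1: append directly to the stored string.
my($direct) = Demo::Result -> new;

$direct -> append_result($_) for @fragment;

# Route 2: round trip through get/set. Each step returns a copy of
# the result so far, concatenates, and stores the new string.
my($via_accessors) = Demo::Result -> new;

$via_accessors -> set_result($via_accessors -> get_result . $_) for @fragment;

print $direct -> get_result, "\n";
print $via_accessors -> get_result, "\n";
```

Note that string concatenation in Perl uses '.', so the round trip is written with '.' here.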
data/90.xml.declaration.xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>Virtual Library</title>
</head>
<body>
<p>Moved to <a href="http://example.org/">example.org</a>.</p>
</body>
</html>
lib/HTML/Parser/Simple.pm
th => 1,
thead => 1,
'tr' => 1,
});
$self -> current_node($self -> create_new_node('root', '', Tree::Simple -> ROOT) );
$self -> root($self -> current_node);
if ($self -> xhtml)
{
# Compared to the non-XHTML re, this has an extra ':' in the first [].
$self -> tagged_attribute
(
q#^(<(\w+)((?:\s+[-:\w]+(?:\s*=\s*(?:(?:"[^"]*")|(?:'[^']*')|[^>\s]+))?)*)\s*(\/?)>)#
);
}
else
{
$self -> tagged_attribute
(
lib/HTML/Parser/Simple.pm
# -----------------------------------------------
sub _set_tagged_attribute
{
my($self, $new, $old) = @_;
if ($new)
{
$self -> tagged_attribute
(
# Compared to the non-XHTML re, this has an extra ':' in the first [].
q#^(<(\w+)((?:\s+[-:\w]+(?:\s*=\s*(?:(?:"[^"]*")|(?:'[^']*')|[^>\s]+))?)*)\s*(\/?)>)#
);
}
else
{
$self -> tagged_attribute
(
q#^(<(\w+)((?:\s+[-\w]+(?:\s*=\s*(?:(?:"[^"]*")|(?:'[^']*')|[^>\s]+))?)*)\s*(\/?)>)#
);
lib/HTML/Parser/Simple.pm
Write more or less progress messages.
Default: 0.
=item o xhtml => $Boolean
This takes either a 0 or a 1.
0 means do not accept an XML declaration, such as <?xml version="1.0" encoding="UTF-8"?>,
at the start of the input file, nor the other XHTML features explained next.
1 means accept XHTML input.
Default: 0.
The only XHTML changes to this code, so far, are:
=over 4
=item o Accept the XML declaration
E.g.: <?xml version="1.0" standalone='yes'?>.
=item o Accept attribute names containing the ':' char
E.g.: <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">.
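The effect of the extra ':' can be checked with a short script. This is a sketch only: the two patterns are copied from the Simple.pm excerpts above and compiled with qr, while the variable names and the standalone script are assumptions made for illustration and are not part of the module.

```perl
use strict;
use warnings;

# Patterns copied from the Simple.pm excerpts. The only difference is
# the ':' inside the attribute-name character class.
my($xhtml_re) = qr#^(<(\w+)((?:\s+[-:\w]+(?:\s*=\s*(?:(?:"[^"]*")|(?:'[^']*')|[^>\s]+))?)*)\s*(\/?)>)#;
my($html_re)  = qr#^(<(\w+)((?:\s+[-\w]+(?:\s*=\s*(?:(?:"[^"]*")|(?:'[^']*')|[^>\s]+))?)*)\s*(\/?)>)#;

# A tag with a namespaced attribute name, and one without.
my($xhtml_tag) = q{<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">};
my($plain_tag) = q{<a href="http://example.org/">};

# Record which pattern accepts which tag.
my(%match) =
(
	xhtml_re_on_xhtml_tag => ($xhtml_tag =~ $xhtml_re ? 1 : 0),
	html_re_on_xhtml_tag  => ($xhtml_tag =~ $html_re  ? 1 : 0),
	xhtml_re_on_plain_tag => ($plain_tag =~ $xhtml_re ? 1 : 0),
	html_re_on_plain_tag  => ($plain_tag =~ $html_re  ? 1 : 0),
);

print "$_ => $match{$_}\n" for sort keys %match;
```

Only the non-XHTML pattern applied to the tag carrying xml:lang fails to match, because an attribute name may not contain ':' there and nothing else in the pattern can consume it before the closing '>'.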
t/data/90.xml.declaration.xhtml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>Virtual Library</title>
</head>
<body>
<p>Moved to <a href="http://example.org/">example.org</a>.</p>
</body>
</html>
t/parse.xhtml.t
open(my $fh, '<', $p -> input_file) || BAIL_OUT("Can't read t/data/90.xml.declaration.xhtml");
my($html);
read($fh, $html, -s $fh);
close $fh;
my(@got) = split(/\n/, $p -> parse($html) -> traverse($p -> root) -> result);
my($expected) = <<EOS;
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>Virtual Library</title>
</head>
<body>
<p>Moved to <a href="http://example.org/">example.org</a>.</p>
</body>
</html>
EOS