HTML-Parser-Simple


Changelog.ini  view on Meta::CPAN

EOT

[V 1.02]
Date=2009-02-26T11:24:00
Comments= <<EOT
- Rename scripts/parse.file.pl to scripts/parse.html.pl
- Ship scripts/parse.xhtml.pl
- Ship t/Data.pm to read in test data from t/data/
- Rewrite t/*.t to use t/Data.pm
- Patch Simple.pm to accept xhtml as a parameter to new
- Patch Simple.pm to use xhtml in a few places. XHTML support is not finished!
- Patch Simple.pm to use accessors for object attributes as per PBP. Specifically:
get/set_*() for current_node, depth, input_dir, node_type, output_dir, root, verbose, xhtml
- Hence, rename root() to get_root()
- Hence, rename verbose() to get_verbose()
- Rename new_node() to create_new_node(), since that makes more sense when using get/set_*()
- There are no methods get_result() and set_result(). The reason is efficiency. If we had
$self -> set_result($self -> get_result() . '<tag>') it would mean duplicating the result so far
each time a few chars were added
- Ship various tests, with data, for XHTML
- Add depth to the hashref of data for each tag's node in the tree

Changes  view on Meta::CPAN

1.03  Fri Jun 12 11:49:00 2009
	- Improved tests and documentation (Mark Stosberg)
	- Added attribute parsing via HTML::Parser::Simple::Attributes (Mark Stosberg)

1.02  Thu Feb 26 11:24:00 2009
	- Rename scripts/parse.file.pl to scripts/parse.html.pl
	- Ship scripts/parse.xhtml.pl
	- Ship t/Data.pm to read in test data from t/data/
	- Rewrite t/*.t to use t/Data.pm
	- Patch Simple.pm to accept xhtml as a parameter to new
	- Patch Simple.pm to use xhtml in a few places. XHTML support is not finished!
	- Patch Simple.pm to use accessors for object attributes as per PBP. Specifically:
	get/set_*() for current_node, depth, input_dir, node_type, output_dir, root, verbose, xhtml
	- Hence, rename root() to get_root()
	- Hence, rename verbose() to get_verbose()
	- Rename new_node() to create_new_node(), since that makes more sense when using get/set_*()
	- There are no methods get_result() and set_result(). The reason is efficiency. If we had
	$self -> set_result($self -> get_result() . '<tag>') it would mean duplicating the result so far
	each time a few chars were added
	- Ship various tests, with data, for XHTML
	- Add depth to the hashref of data for each tag's node in the tree

data/90.xml.declaration.xml  view on Meta::CPAN

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html
     PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head>
    <title>Virtual Library</title>
  </head>
  <body>
    <p>Moved to <a href="http://example.org/">example.org</a>.</p>
  </body>
</html>

lib/HTML/Parser/Simple.pm  view on Meta::CPAN

	 th => 1,
	 thead => 1,
	 'tr' => 1,
	});

	$self -> current_node($self -> create_new_node('root', '', Tree::Simple -> ROOT) );
	$self -> root($self -> current_node);

	if ($self -> xhtml)
	{
		# Compared to the non-XHTML re, this has an extra  ':' in the first [].

		$self -> tagged_attribute
		(
			q#^(<(\w+)((?:\s+[-:\w]+(?:\s*=\s*(?:(?:"[^"]*")|(?:'[^']*')|[^>\s]+))?)*)\s*(\/?)>)#
		);
	}
	else
	{
		$self -> tagged_attribute
		(

lib/HTML/Parser/Simple.pm  view on Meta::CPAN

# -----------------------------------------------

sub _set_tagged_attribute
{
	my($self, $new, $old) = @_;

	if ($new)
	{
		$self -> tagged_attribute
		(
			# Compared to the non-XHTML re, this has an extra  ':' in the first [].

			q#^(<(\w+)((?:\s+[-:\w]+(?:\s*=\s*(?:(?:"[^"]*")|(?:'[^']*')|[^>\s]+))?)*)\s*(\/?)>)#
		);
	}
	else
	{
		$self -> tagged_attribute
		(
			q#^(<(\w+)((?:\s+[-\w]+(?:\s*=\s*(?:(?:"[^"]*")|(?:'[^']*')|[^>\s]+))?)*)\s*(\/?)>)#
		);

lib/HTML/Parser/Simple.pm  view on Meta::CPAN


Write more or less progress messages.

Default: 0.

=item o xhtml => $Boolean

This takes either a 0 or a 1.

0 means do not accept an XML declaration, such as <?xml version="1.0" encoding="UTF-8"?>,
at the start of the input file, nor the other XHTML features explained below.

1 means accept XHTML input.

Default: 0.

The only XHTML changes to this code, so far, are:

=over 4

=item o Accept the XML declaration

E.g.: <?xml version="1.0" standalone='yes'?>.

=item o Accept attribute names containing the ':' char

E.g.: <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">.
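
The effect of allowing ':' in attribute names can be checked in isolation. The following is a minimal sketch, not code from the distribution: it copies the two patterns from the Simple.pm excerpt into standalone qr## objects and tries them on the <html> tag from the example. The helper tag_matches() is hypothetical and exists only for this illustration.

```perl
use strict;
use warnings;

# Both patterns are copied from the Simple.pm excerpt.
# XHTML variant: attribute names may contain ':'.
my $xhtml_re = qr#^(<(\w+)((?:\s+[-:\w]+(?:\s*=\s*(?:(?:"[^"]*")|(?:'[^']*')|[^>\s]+))?)*)\s*(\/?)>)#;

# Non-XHTML variant: no ':' in the attribute-name character class.
my $html_re = qr#^(<(\w+)((?:\s+[-\w]+(?:\s*=\s*(?:(?:"[^"]*")|(?:'[^']*')|[^>\s]+))?)*)\s*(\/?)>)#;

my $tag = '<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">';

# tag_matches() is a hypothetical helper, not part of HTML::Parser::Simple.
# It returns 1 if the pattern matches at the start of the input, else 0.
sub tag_matches
{
	my($re, $input) = @_;

	return $input =~ $re ? 1 : 0;
}

print 'XHTML pattern: ', tag_matches($xhtml_re, $tag), "\n";
print 'HTML pattern:  ', tag_matches($html_re, $tag), "\n";
```

Under these assumptions, the XHTML pattern accepts the tag, while the non-XHTML pattern rejects it because it cannot get past the ':' in xml:lang.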

t/data/90.xml.declaration.xhtml  view on Meta::CPAN

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html
     PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head>
    <title>Virtual Library</title>
  </head>
  <body>
    <p>Moved to <a href="http://example.org/">example.org</a>.</p>
  </body>
</html>

t/parse.xhtml.t  view on Meta::CPAN


open(my $fh, '<', $p -> input_file) || BAIL_OUT("Can't read t/data/90.xml.declaration.xhtml: $!");
my($html);
read($fh, $html, -s $fh);
close $fh;

my(@got)      = split(/\n/, $p -> parse($html) -> traverse($p -> root) -> result);
my($expected) = <<EOS;
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html
     PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head>
    <title>Virtual Library</title>
  </head>
  <body>
    <p>Moved to <a href="http://example.org/">example.org</a>.</p>
  </body>
</html>
EOS


