HTML-Normalize
=head1 VERSION
Version 1.0003
=head1 SYNOPSIS
    my $norm = HTML::Normalize->new ();
    my $cleanHtml = $norm->cleanup (-html => $dirtyHtml);
=head1 DESCRIPTION
HTML::Normalize uses HTML::TreeBuilder to parse an HTML string then processes
the resultant tree to clean up various structural issues in the original HTML.
The result is then rendered using HTML::Element's as_HTML method.
Key structural clean ups fix tag soup (C<< <b><i>foo</b></i> >> becomes C<<
<b><i>foo</i></b> >>) and inline/block element nesting (C<<
<span><p>foo</p></span> >> becomes C<< <p><span>foo</span></p> >>). C<< <br> >>
tags at the start or end of a link element are migrated out of the element.
Note that HTML::Normalize's approach to cleaning up tag soup differs from that
used by HTML::Tidy. HTML::Tidy tends to enforce nesting and swaps end tags to
achieve it. HTML::Normalize inserts extra tags so that overlapped markup is
correctly tagged.
HTML::Normalize can also remove attributes set to default values and empty
elements. For example, if Verdana size 1 is set as the default font, a C<< <font
face="Verdana" size="1" color="#FF0000"> >> element would become C<< <font
color="#FF0000"> >> and a C<< <font face="Verdana" size="1"> >> element would be
removed.
=head1 Methods
C<new> creates an HTML::Normalize instance and performs parameter validation.
C<cleanup> validates any further parameters, checks parameter consistency, then
parses the HTML to generate the internal representation. It then edits the
internal representation and renders the result back into HTML.
Note that I<cleanup> may be called multiple times with different HTML strings to
process.
Generally errors are handled by carping and may be detected in both I<new> and
I<cleanup>.
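The reuse of one instance across several HTML strings can be sketched as follows.
This is a minimal illustration, not code from the module: C<clean_all> is a
hypothetical helper, the sample fragments are arbitrary, and the C<require>
guard is only there so the script still runs where HTML::Normalize is not
installed. Because errors are carped, a C<$SIG{__WARN__}> handler is one way to
collect them.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::Normalize): clean several HTML
# fragments with a single normalizer instance, one cleanup call per fragment.
sub clean_all {
    my ($norm, @fragments) = @_;
    return map { scalar $norm->cleanup (-html => $_) } @fragments;
}

if (eval { require HTML::Normalize; 1 }) {
    # carp reports through warn, so a __WARN__ handler can collect messages.
    my @warnings;
    local $SIG{__WARN__} = sub { push @warnings, @_ };

    my $norm = HTML::Normalize->new ();
    my @clean = clean_all ($norm, '<b><i>foo</b></i>', '<span><p>foo</p></span>');

    print "$_\n" for grep { defined } @clean;
    print STDERR "warning: $_" for @warnings;
}
else {
    print "HTML::Normalize is not installed; skipping\n";
}
```

Only C<new>, C<cleanup> and C<-html> are taken from this documentation;
everything else is plain Perl.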
=cut
=head2 new
Create a new C<HTML::Normalize> instance.
    my $norm = HTML::Normalize->new ();
=over 4
=item I<-compact>: optional
Setting C<< -compact => 1 >> suppresses generation of 'optional' close tags.
This reduces the size of the output slightly at the expense of breaking any hope
of XHTML compliance.
=item I<-default>: optional - multiple
Define a default attribute for an element. Default attributes are removed if the
attribute value has not been overridden in a parent node. For elements such as
'font' this may result in the element being removed if no attributes remain.
C<-default> takes a string of the form 'tag attribute=value' as an argument.
For example:
    -default => 'font face="Verdana"'
would specify that the face "Verdana" is the default face attribute for font
elements.
I<value> may be a constant or a regular expression. A regular expression
matches:
    /(~|qr)\s*(.).*\1\s*$/
except that the paired delimiters [], {}, () and <> are also accepted as pattern
delimiters.
Literal match values should not encode entities, but remember that quotes around
attribute values are optional for some values, so the outer pair of quote
characters will be removed if present. The match value extends to the end of the
line and is not bounded by quote characters (except as noted earlier), so no
quoting of "special" characters is required: there are no special characters.
Multiple default attributes may be provided but only one default value is
allowed for any one tag/attribute pair.
Default values are case sensitive. However, you can use the regular expression
form to overcome this limitation.
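The rules above for C<-default> strings can be illustrated with a small
stand-alone sketch. It is not the module's own parser: C<parse_default> is a
hypothetical helper, and it assumes that the closing delimiter of a regular
expression value must match the opening one (or its paired counterpart), which
is one reading of the pattern given above. The C<(?i)> inline modifier is used
for the case-insensitive example because the documented pattern ends at the
closing delimiter.

```perl
use strict;
use warnings;

# Hypothetical helper, not the module's own code: split a -default string of
# the form 'tag attribute=value' and classify the value as regex or literal.
sub parse_default {
    my ($spec) = @_;
    my ($tag, $attr, $value) = $spec =~ /^\s*(\w+)\s+([\w:-]+)\s*=\s*(.*?)\s*$/
        or return;

    my %close = ('[' => ']', '{' => '}', '(' => ')', '<' => '>');

    # Assumed rule: '~' or 'qr', an opening delimiter, the pattern body, then
    # the matching (or paired) closing delimiter at the end of the value.
    if ($value =~ /^(?:~|qr)\s*(.)(.*)(.)$/s) {
        my ($open, $body, $end) = ($1, $2, $3);
        my $want = exists $close{$open} ? $close{$open} : $open;
        return {tag => $tag, attr => $attr, regex => qr/$body/}
            if $end eq $want;
    }

    # Literal value: drop one outer pair of matching quote characters.
    $value =~ s/^(["'])(.*)\1$/$2/s;
    return {tag => $tag, attr => $attr, literal => $value};
}

for my $spec (
    'font face="Verdana"',
    'font face=~/(?i)^verdana$/',
    'font size=qr{^[12]$}',
    )
{
    my $default = parse_default ($spec) or next;
    my $kind = exists $default->{regex} ? 'regex' : 'literal';
    print "$default->{tag} $default->{attr}: $kind\n";
}
```

The first specification yields the literal value C<Verdana> with its outer
quotes removed; the other two are treated as regular expressions.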
=item I<-distribute>: optional - default true
Distribute inline elements over children if the children are block level
elements. For example:
    <span boo="foo"><p>foo</p><p>bar</p></span>
becomes:
    <p><span boo="foo">foo</span></p><p><span boo="foo">bar</span></p>
This action is only taken if all the child elements are block level elements.
=item I<-expelbr>: optional - default true
If C<-expelbr> is true (the default), break elements at the edges of link
elements are expelled from the link element. Thus:
    <a href="linkto"><br>link text<br></a>
becomes:
    <br><a href="linkto">link text</a><br>
=item I<-html>: required