# combine the %LINKS table into a single XPath union expression
my $LINKXP = join('|', map {
    sprintf('//html:%s[%s]', $_, join('|', map { "\@$_" } keys %{$LINKS{$_}} ))
} keys %LINKS);

=head1 NAME

HTML::Detergent - Clean the gunk off an HTML document

=head1 VERSION

Version 0.06

=cut

our $VERSION = '0.06';

=head1 SYNOPSIS

    use HTML::Detergent;

    my $scrubber = HTML::Detergent->new($config);

    # $input can be a string, GLOB reference, or XML::LibXML::Document

    my $doc = $scrubber->process($input, $uri);

=head1 DESCRIPTION

L<HTML::Detergent> is for isolating the main content of an HTML page,
stripping it of navigation, visual design, and other ancillary content.

The main purpose of this module is to aid in the migration of web
content from one content management system to another. It is also
useful for preparing HTML resources for automated content inventories.

The module currently has no heuristics for determining the main
content of a page. It works instead by assuming prior knowledge of the
layout, given in the configuration by an XPath expression that
uniquely isolates the container node. That node is then lifted into a
new document, along with the contents of the C<E<lt>headE<gt>>, and
returned by the L</process> method. To accommodate multiple layouts on
a site, the module can be initialized to match multiple XPath
expressions. If further processing is necessary, an expression can be
associated with an XSLT stylesheet, which is assumed to produce an
entire document, thus overriding the default behaviour.

After the new document is generated and before it is returned by
L</process>, it is possible to inject C<E<lt>linkE<gt>> and
C<E<lt>metaE<gt>> elements into the C<E<lt>headE<gt>>. This enables
the inclusion of metadata and the re-association of the main content
with links that represent aspects of the page which have been removed
(e.g. navigation, copyright statement, etc.). In addition, if the
page's URI is supplied to the L</process> method, the
C<E<lt>baseE<gt>> element is either added or rewritten to reflect it,
and the URI attributes in the body are rewritten relative to the base.
Otherwise they are left alone.

The document returned is an L<XML::LibXML::Document> object using the
XHTML namespace, C<http://www.w3.org/1999/xhtml>, but does not profess
to validate against any particular schema. If DTD declarations
(including the empty C<E<lt>!DOCTYPE htmlE<gt>> recommended in HTML5)
are desired, they can be added on afterward. Likewise, the object can
be converted from XML into HTML using L<XML::LibXML::Document/toStringHTML>.
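
As a minimal sketch of the workflow described above, the following
configures a scrubber, processes a page, and serializes the result as
HTML. The page markup, the C<id> values, the URI, and the C<link> and
C<meta> values are all hypothetical. The C<html:> prefix in the XPath
expression is an assumption about how the XHTML namespace is bound;
adjust it if your expressions do not match.

    use strict;
    use warnings;
    use HTML::Detergent;

    # Hypothetical layout: the main content lives in <div id="content">.
    # The html: prefix is an assumption, not a documented requirement.
    my $scrubber = HTML::Detergent->new(
        match => [ '//html:div[@id="content"]' ],
        link  => { contents => '/sitemap' },    # hypothetical rel/href
        meta  => { author   => 'Jane Doe' },    # hypothetical name/content
    );

    # A tiny inline page keeps the sketch self-contained.
    my $html = q{<html><head><title>Example</title></head>}
             . q{<body><div id="nav">Menu</div>}
             . q{<div id="content"><p>Hello</p></div></body></html>};

    # Supplying the URI lets the module set <base> and rewrite links.
    my $doc = $scrubber->process($html, 'http://example.com/page');

    # Guard in case nothing comes back for unmatched input
    # (an assumption; the behaviour is not described here).
    print $doc->toStringHTML if $doc;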

=head1 METHODS

=head2 new %CONFIG | \%CONFIG | $CONFIG

Initialize the processor with a list of configuration parameters, a
HASH reference thereof, or an L<HTML::Detergent::Config> object. The
valid parameters are:

=over 4

=item match

This is an ARRAY reference of XPath expressions to try against the
document, in order of preference. Entries may optionally be
two-element ARRAY references themselves, in which case the second
element is a URL where an XSLT stylesheet may be found.

    match => [ '/some/xpath/expression',
               [ '/other/expr', '/url/of/transform.xsl' ],
             ],

=item link

This is a HASH reference where the keys correspond to C<rel>
attributes and the values to C<href> attributes of C<E<lt>linkE<gt>>
elements. If a value is an ARRAY reference, its elements are processed
in the order given. C<rel> attributes are sorted lexically. If a
callback is supplied instead, it is expected to return a result of the
same form.

    link => { rel1 => 'href1', rel2 => [ qw(href2 href3) ] },

    # or

    link => \&_link_cb,

=item meta

This is a HASH reference where the keys correspond to C<name>
attributes and the values to C<content> attributes of
C<E<lt>metaE<gt>> elements. If a value is an ARRAY reference, its
elements are processed in the order given. C<name> attributes are
sorted lexically. If a callback is supplied instead, it is expected to
return a result of the same form.

    meta => { name1 => 'content1',
              name2 => [ qw(content2 content3) ] },

    # or

    meta => \&_meta_cb,

=item callback

These callbacks will be passed into the internal L<XML::LibXSLT>


