HTML-ExtractMain
lib/HTML/ExtractMain.pm
our $VERSION = '0.63';
=head1 SYNOPSIS
    use HTML::ExtractMain qw( extract_main_html );

    my $html = <<'END';
    <div id="header">Header</div>
    <div id="nav"><a href="/">Home</a></div>
    <div id="body">
        <p>Foo</p>
        <p>Baz</p>
    </div>
    <div id="footer">Footer</div>
    END

    my $main_html = extract_main_html($html, output_type => 'xhtml');
    if (defined $main_html) {
        # do something with $main_html here
        # $main_html is '<div id="body"><p>Foo</p><p>Baz</p></div>'
    }
=head1 EXPORT
C<extract_main_html> is optionally exported; it is not imported unless
you request it by name.
=head1 FUNCTIONS
=head2 extract_main_html
C<extract_main_html> takes HTML content and uses the Readability
algorithm to detect the main body of the page, usually skipping
headers, footers, navigation, etc.
The first argument is either an HTML string, or an
HTML::TreeBuilder tree. (If passed a tree, the tree will be modified
and destroyed.)
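As a minimal sketch of the tree form, the following builds an
L<HTML::TreeBuilder> tree with C<new_from_content> and passes it to
C<extract_main_html>. The sample markup and variable names are
illustrative only. Because the call modifies and destroys the tree, the
sketch drops its reference afterwards instead of reusing it.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;
use HTML::ExtractMain qw( extract_main_html );

# Illustrative sample page (not taken from the module's test suite)
my $html = <<'END';
<div id="header">Header</div>
<div id="body">
<p>First paragraph of the article, with enough text to look like content.</p>
<p>Second paragraph of the article, continuing the same discussion.</p>
</div>
<div id="footer">Footer</div>
END

# Parse the page ourselves, then hand over the tree instead of the string
my $tree = HTML::TreeBuilder->new_from_content($html);

# No output_type given, so the documented default (xhtml) applies
my $main_html = extract_main_html($tree);

# The tree has been modified and destroyed; do not use it again
undef $tree;

if ( defined $main_html ) {
    print $main_html, "\n";
}
else {
    print "no main content found\n";
}
```

Passing a tree is mostly useful when the page has already been parsed
for another purpose. If the tree is still needed after extraction, pass
the original HTML string instead.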
Remaining arguments are optional and represent key/value options. The
available options are:
=head3 output_type
This determines the format of the returned data. If not specified,
C<xhtml> is used. Valid formats are:
=over 4
=item C<xhtml>
=item C<html>
=item C<tree>
=back
If C<tree> is selected, then an L<HTML::Element> object will be
returned instead of a string.
If the HTML's main content is found, it is returned in the chosen
output format. The returned HTML/XHTML will I<not> look like what you put
in. (Source formatting, e.g. indentation, will be removed.)
If no main block of content can be identified, C<extract_main_html>
returns undef.
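The following sketch shows one way to use the C<tree> output type and
to handle the undef case. It assumes only what is documented above: the
call returns either undef or an L<HTML::Element> object. The sample
markup is illustrative, and C<as_text> is a standard C<HTML::Element>
method rather than part of this module.

```perl
use strict;
use warnings;
use HTML::ExtractMain qw( extract_main_html );

# Illustrative sample page
my $html = <<'END';
<div id="nav"><a href="/">Home</a></div>
<div id="body">
<p>First paragraph of the article, with enough text to look like content.</p>
<p>Second paragraph of the article, continuing the same discussion.</p>
</div>
<div id="footer">Footer</div>
END

# Ask for an HTML::Element object instead of a serialized string
my $main = extract_main_html( $html, output_type => 'tree' );

if ( defined $main ) {
    # Work with the element directly, e.g. pull out its plain text
    my $text = $main->as_text;
    print "Main text: $text\n";
}
else {
    # No main block of content was detected
    print "No main content found\n";
}
```

Requesting a tree avoids re-parsing when the extracted content needs
further processing, such as collecting links or stripping attributes.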
=cut
=head1 AUTHOR
Anirvan Chatterjee, C<< <anirvan at cpan.org> >>
=head1 BUGS
Please report any bugs or feature requests to
C<bug-html-extractmain at rt.cpan.org>, or through the web interface
at L<http://rt.cpan.org/NoAuth/ReportBug.html?Queue=HTML-ExtractMain>.
I will be notified, and then you'll automatically be notified of
progress on your bug as I make changes.
=head1 SUPPORT
You can find documentation for this module with the perldoc command.
    perldoc HTML::ExtractMain
You can also look for information at:
=over 4
=item * RT: CPAN's request tracker
L<http://rt.cpan.org/NoAuth/Bugs.html?Dist=HTML-ExtractMain>
=item * AnnoCPAN: Annotated CPAN documentation
L<http://annocpan.org/dist/HTML-ExtractMain>
=item * CPAN Ratings
L<http://cpanratings.perl.org/d/HTML-ExtractMain>
=item * Search CPAN
L<http://search.cpan.org/dist/HTML-ExtractMain/>
=back
=head1 SEE ALSO
=over 4
=item * C<HTML::Feature>
=item * C<HTML::ExtractContent>
=back
=head1 ACKNOWLEDGEMENTS
The Readability algorithm is ported from Arc90's JavaScript original,