HTML-ExtractMain


lib/HTML/ExtractMain.pm

our $VERSION = '0.63';

=head1 SYNOPSIS

    use HTML::ExtractMain qw( extract_main_html );

    my $html = <<'END';
    <div id="header">Header</div>
    <div id="nav"><a href="/">Home</a></div>
    <div id="body">
        <p>Foo</p>
        <p>Baz</p>
    </div>
    <div id="footer">Footer</div>
    END

    my $main_html = extract_main_html($html, output_type => 'xhtml');
    if (defined $main_html) {
        # do something with $main_html here
        # $main_html is '<div id="body"><p>Foo</p><p>Baz</p></div>'
    }

=head1 EXPORT

C<extract_main_html> is exported only on request.

=head1 FUNCTIONS

=head2 extract_main_html

C<extract_main_html> takes HTML content and uses the Readability
algorithm to detect the main body of the page, usually skipping
headers, footers, navigation, etc.

The first argument is either an HTML string, or an
HTML::TreeBuilder tree. (If passed a tree, the tree will be modified
and destroyed.)
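
Because a tree passed in is consumed, anything else you need from it
should be read before the call. The following is a minimal sketch of
that pattern, not part of the module itself; the sample markup and the
variable names are illustrative assumptions.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;
use HTML::ExtractMain qw( extract_main_html );

# Illustrative input; any HTML document would do.
my $html = '<div id="nav"><a href="/">Home</a></div>'
         . '<div id="body"><p>Foo</p><p>Baz</p></div>';

my $tree = HTML::TreeBuilder->new_from_content($html);

# Read whatever else is needed from the tree BEFORE extraction,
# since extract_main_html modifies and destroys a tree it is given.
my @links      = $tree->find_by_tag_name('a');
my $link_count = scalar @links;

my $main_html = extract_main_html($tree);

# The tree must not be reused after this point.
undef $tree;

print "Links seen before extraction: $link_count\n";
print defined $main_html
    ? "Main content: $main_html\n"
    : "No main content found\n";
```

If the tree is still needed afterwards, passing the original HTML
string instead avoids the problem entirely.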

Remaining arguments are optional and represent key/value options. The
available options are:

=head3 output_type

This determines the format in which data is returned. If not specified,
C<xhtml> is used. Valid formats are:

=over 4

=item C<xhtml>

=item C<html>

=item C<tree>

=back

If C<tree> is selected, then an L<HTML::Element> object will be
returned instead of a string.

If the HTML's main content is found, it's returned in the chosen
output format. The returned HTML/XHTML will I<not> look like what you put
in. (Source formatting, e.g. indentation, will be removed.)

If no sufficiently relevant block of content is found,
C<extract_main_html> returns undef.
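
Callers should therefore check the return value before using it. The
sketch below combines that check with the C<tree> output type;
C<main_text> is a hypothetical helper written for this example, not
something the module provides, and C<as_text> is the standard
L<HTML::Element> method.

```perl
use strict;
use warnings;
use HTML::ExtractMain qw( extract_main_html );

# Hypothetical helper (not part of HTML::ExtractMain): returns the
# plain text of the detected main content, or undef if none is found.
sub main_text {
    my ($html) = @_;

    # With output_type => 'tree' the result is an HTML::Element
    # object rather than a string.
    my $element = extract_main_html( $html, output_type => 'tree' );
    return unless defined $element;

    return $element->as_text;
}

my $text = main_text('<div id="body"><p>Foo</p><p>Baz</p></div>');
print defined $text
    ? "Main text: $text\n"
    : "No main content found\n";
```

Returning undef from the helper as well keeps the "nothing found" case
distinguishable from an empty string.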

=cut

=head1 AUTHOR

Anirvan Chatterjee, C<< <anirvan at cpan.org> >>

=head1 BUGS

Please report any bugs or feature requests to
C<bug-html-extractmain at rt.cpan.org>, or through the web interface
at L<http://rt.cpan.org/NoAuth/ReportBug.html?Queue=HTML-ExtractMain>.
I will be notified, and then you'll automatically be notified of
progress on your bug as I make changes.

=head1 SUPPORT

You can find documentation for this module with the perldoc command.

    perldoc HTML::ExtractMain

You can also look for information at:

=over 4

=item * RT: CPAN's request tracker

L<http://rt.cpan.org/NoAuth/Bugs.html?Dist=HTML-ExtractMain>

=item * AnnoCPAN: Annotated CPAN documentation

L<http://annocpan.org/dist/HTML-ExtractMain>

=item * CPAN Ratings

L<http://cpanratings.perl.org/d/HTML-ExtractMain>

=item * Search CPAN

L<http://search.cpan.org/dist/HTML-ExtractMain/>

=back

=head1 SEE ALSO

=over 4

=item * C<HTML::Feature>

=item * C<HTML::ExtractContent>

=back

=head1 ACKNOWLEDGEMENTS

The Readability algorithm is ported from Arc90's JavaScript original,


