HTML-ExtractMain

lib/HTML/ExtractMain.pm

=item C<html>

=item C<tree>

=back

If C<tree> is selected, an L<HTML::Element> object is returned
instead of a string.

If the HTML's main content is found, it is returned in the chosen
output format. The returned HTML/XHTML will I<not> match the input
exactly: source formatting, such as indentation, is removed.

If no most relevant block of content can be identified,
C<extract_main_html> returns C<undef>.
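
The return-value rules above can be sketched as follows. This is a minimal illustration, not code from the distribution: it assumes the format option is passed as C<output_type>, and the helpers C<main_content_or> and C<main_content_text> are hypothetical names introduced only for this example.

```perl
use strict;
use warnings;
use HTML::ExtractMain qw( extract_main_html );

# Hypothetical helper (not part of HTML::ExtractMain): return the main
# content as an HTML string, or a caller-supplied fallback when
# extract_main_html finds nothing and returns undef.
sub main_content_or {
    my ( $html, $fallback ) = @_;

    # Assumption: the output format is selected with 'output_type'.
    my $main = extract_main_html( $html, output_type => 'html' );
    return defined $main ? $main : $fallback;
}

# Hypothetical helper: with 'tree', an HTML::Element object comes back
# instead of a string, so HTML::Element methods such as as_text apply.
sub main_content_text {
    my ($html) = @_;

    my $tree = extract_main_html( $html, output_type => 'tree' );
    return unless defined $tree;    # no main content block was found
    return $tree->as_text;
}
```

Checking C<defined> before using the result matters in both cases, since the C<undef> return is the only signal that no main content block was found.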

=cut

=head1 AUTHOR

Anirvan Chatterjee, C<< <anirvan at cpan.org> >>

t/test_case_data/google_blogger.html

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html dir='ltr' xmlns='http://www.w3.org/1999/xhtml' xmlns:b='http://www.google.com/2005/gml/b' xmlns:data='http://www.google.com/2005/gml/data' xmlns:expr='http://www.google.com/2005/gml/expr'>
<head>
<script type="text/javascript">(function() { var a=window;function c(b){this.t={};this.tick=function(d,i,e){e=e?e:(new Date).getTime();this.t[d]=[e,i]};this.tick("start",null,b)}var f=new c;a.jstiming={Timer:c,load:f};try{var g=null;if(a.chrome&&a.ch...
<meta content='text/html; charset=UTF-8' http-equiv='Content-Type'/>
<meta content='true' name='MSSmartTagsPreventParsing'/>
<meta content='blogger' name='generator'/>
<link href='http://www.blogger.com/favicon.ico' rel='icon' type='image/vnd.microsoft.icon'/>
<link href='http://googlegeodevelopers.blogspot.com/2008/05/introducing-our-geo-developers-blog.html' rel='canonical'/>
<link rel="alternate" type="application/atom+xml" title="Google Geo Developers Blog - Atom" href="http://googlegeodevelopers.blogspot.com/feeds/posts/default" />
<link rel="alternate" type="application/rss+xml" title="Google Geo Developers Blog - RSS" href="http://googlegeodevelopers.blogspot.com/feeds/posts/default?alt=rss" />

t/test_case_data/google_short_blog.html

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns='http://www.w3.org/1999/xhtml' xmlns:b='http://www.google.com/2005/gml/b' xmlns:data='http://www.google.com/2005/gml/data' xmlns:expr='http://www.google.com/2005/gml/expr'>
<head>
<meta content='iitJxuWLjtoK2cUdZtHd8yn6yWLcf5HRPezdIAwXW50=' name='verify-v1'/>
<meta content='text/html; charset=UTF-8' http-equiv='Content-Type'/>
<script type="text/javascript">(function() { var a=window;function c(b){this.t={};this.tick=function(d,i,e){e=e?e:(new Date).getTime();this.t[d]=[e,i]};this.tick("start",null,b)}var f=new c;a.jstiming={Timer:c,load:f};try{var g=null;if(a.chrome&&a.ch...
<meta content='true' name='MSSmartTagsPreventParsing'/>
<meta content='blogger' name='generator'/>
<link href='http://www.blogger.com/favicon.ico' rel='icon' type='image/vnd.microsoft.icon'/>
<link href='http://googleblog.blogspot.com/2010/03/introducing-google-ad-innovations.html' rel='canonical'/>
<link rel="alternate" type="application/atom+xml" title="Official Google Blog - Atom" href="http://googleblog.blogspot.com/feeds/posts/default" />

t/test_case_data/lessig_blog.html

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
   <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
   <meta name="generator" content="Movable Type 4.32-en" />
   <title>Announcing the hibernation of lessig.org/blog (from the blogs-deserve-a-sabbatical-too department) (Lessig Blog)</title>

   <link rel="alternate" type="application/atom+xml" title="Atom" href="http://lessig.org/blog/atom.xml" />
   <link rel="alternate" type="application/rss+xml" title="RSS 2.0" href="http://lessig.org/blog/index.xml" />

<link href="/style/lessig.css" rel="stylesheet" type="text/css" />


