HTML-ExtractMain
lib/HTML/ExtractMain.pm
=item C<html>
=item C<tree>
=back
If C<tree> is selected, then an L<HTML::Element> object will be
returned instead of a string.
If the HTML's main content is found, it's returned in the chosen
output format. The returned HTML/XHTML will I<not> match the input
exactly: source formatting, e.g. indentation, is removed.
If no most-relevant block of content is found, C<extract_main_html>
returns undef.
=cut
=head1 AUTHOR
Anirvan Chatterjee, C<< <anirvan at cpan.org> >>
t/test_case_data/google_blogger.html
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html dir='ltr' xmlns='http://www.w3.org/1999/xhtml' xmlns:b='http://www.google.com/2005/gml/b' xmlns:data='http://www.google.com/2005/gml/data' xmlns:expr='http://www.google.com/2005/gml/expr'>
<head>
<script type="text/javascript">(function() { var a=window;function c(b){this.t={};this.tick=function(d,i,e){e=e?e:(new Date).getTime();this.t[d]=[e,i]};this.tick("start",null,b)}var f=new c;a.jstiming={Timer:c,load:f};try{var g=null;if(a.chrome&&a.ch...
<meta content='text/html; charset=UTF-8' http-equiv='Content-Type'/>
<meta content='true' name='MSSmartTagsPreventParsing'/>
<meta content='blogger' name='generator'/>
<link href='http://www.blogger.com/favicon.ico' rel='icon' type='image/vnd.microsoft.icon'/>
<link href='http://googlegeodevelopers.blogspot.com/2008/05/introducing-our-geo-developers-blog.html' rel='canonical'/>
<link rel="alternate" type="application/atom+xml" title="Google Geo Developers Blog - Atom" href="http://googlegeodevelopers.blogspot.com/feeds/posts/default" />
<link rel="alternate" type="application/rss+xml" title="Google Geo Developers Blog - RSS" href="http://googlegeodevelopers.blogspot.com/feeds/posts/default?alt=rss" />
t/test_case_data/google_short_blog.html
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns='http://www.w3.org/1999/xhtml' xmlns:b='http://www.google.com/2005/gml/b' xmlns:data='http://www.google.com/2005/gml/data' xmlns:expr='http://www.google.com/2005/gml/expr'>
<head>
<meta content='iitJxuWLjtoK2cUdZtHd8yn6yWLcf5HRPezdIAwXW50=' name='verify-v1'/>
<meta content='text/html; charset=UTF-8' http-equiv='Content-Type'/>
<script type="text/javascript">(function() { var a=window;function c(b){this.t={};this.tick=function(d,i,e){e=e?e:(new Date).getTime();this.t[d]=[e,i]};this.tick("start",null,b)}var f=new c;a.jstiming={Timer:c,load:f};try{var g=null;if(a.chrome&&a.ch...
<meta content='true' name='MSSmartTagsPreventParsing'/>
<meta content='blogger' name='generator'/>
<link href='http://www.blogger.com/favicon.ico' rel='icon' type='image/vnd.microsoft.icon'/>
<link href='http://googleblog.blogspot.com/2010/03/introducing-google-ad-innovations.html' rel='canonical'/>
<link rel="alternate" type="application/atom+xml" title="Official Google Blog - Atom" href="http://googleblog.blogspot.com/feeds/posts/default" />
t/test_case_data/lessig_blog.html
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="generator" content="Movable Type 4.32-en" />
<title>Announcing the hibernation of lessig.org/blog (from the blogs-deserve-a-sabbatical-too department) (Lessig Blog)</title>
<link rel="alternate" type="application/atom+xml" title="Atom" href="http://lessig.org/blog/atom.xml" />
<link rel="alternate" type="application/rss+xml" title="RSS 2.0" href="http://lessig.org/blog/index.xml" />
<link href="/style/lessig.css" rel="stylesheet" type="text/css" />