HTML-ListScraper
lib/HTML/ListScraper/Interactive.pm
=item text
Include the plain text in the output.
=item index
Include tag positions in the output.
=back
The returned values are essentially lines of XHTML: opening tags, text
with quoted entities, and closing tags. Tags are enclosed in angle
brackets. The returned values don't necessarily form a valid XML
fragment, though, because the input tags need not form a tree.
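
The line format described above can be illustrated with a minimal sketch. It is written in Python purely for illustration and is not the module's implementation: the `(kind, value)` token tuples and the helper names `render_lines` and `forms_tree` are assumptions introduced here. It shows one line per tag or text run, with text entities quoted, and how such a sequence can fail to be well-formed XML.

```python
from html import escape


def render_lines(tokens):
    """Render hypothetical (kind, value) tokens as one XHTML-like line each."""
    lines = []
    for kind, value in tokens:
        if kind == "open":
            lines.append(f"<{value}>")
        elif kind == "close":
            lines.append(f"</{value}>")
        else:
            # Plain text: quote &, < and > as entities.
            lines.append(escape(value, quote=False))
    return lines


def forms_tree(tokens):
    """Return True if every closing tag matches the most recently opened tag."""
    stack = []
    for kind, value in tokens:
        if kind == "open":
            stack.append(value)
        elif kind == "close":
            if not stack or stack.pop() != value:
                return False
    return not stack


# Overlapping tags: each line is fine on its own, but the tags do not nest.
tokens = [
    ("open", "b"),
    ("open", "i"),
    ("text", "R&D"),
    ("close", "b"),
    ("close", "i"),
]
lines = render_lines(tokens)
well_formed = forms_tree(tokens)
```

Here every element of `lines` is a valid tag or text line, yet `forms_tree` rejects the sequence because `</b>` arrives while `<i>` is still open.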
When C<index> is set, tag values start with the tag's index, followed
by a tab. Next, spaces show indentation. An opening tag not identified
as missing a closing tag increases indentation by 2 spaces; a closing
tag decreases it again. An opening tag with a missing closing tag is
output with '/' appended to its name. For the rules of associating
testdata/del.icio.us.html
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html id="delicious">
<head>
<title>del.icio.us</title>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
<meta name="robots" content="noarchive,nofollow,noindex"/>
<link rel="stylesheet" type="text/css" href="/delicious.css?v=61E-123"/>
<script type="text/javascript" src="/ui/static/lib.js?v=61E-123"></script>
<script type="text/javascript" src="/ui/static/delicious.js?v=61E-123"></script>
testdata/reddit.html
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html>
<head>
<meta http-equiv='Content-Type' content='text/html; charset=UTF-8' />
<title>reddit.com: what's new online</title>
<script src="/static/psrs.js" language="javascript" type="text/javascript"></script>
<script src="/static/reddit.js" language="javascript" type='text/javascript'></script>
<script language='javascript'>var logged = false </script>
<script language='javascript'> window.onload = init </script>
<link rel='stylesheet' href='/static/styles.css' type='text/css' />
<link rel='shortcut icon' href='/favicon.ico' type="image/x-icon" />