HTML-ListScraper
lib/HTML/ListScraper/Interactive.pm
}
$prev = $name;
$prev_index = $ft->index;
}
return wantarray ? @out : \@out;
}
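sub canonicalize_tags {
    my @out;

    foreach (@_) {
        my $ln = lc $_;

        # strip leading indentation and the opening angle bracket
        $ln =~ s/^\s*<//;

        # strip the closing bracket, an optional '/' before it
        # and any trailing whitespace (including the newline)
        $ln =~ s/\/?>\s*$//;

        if ($ln) {
            push @out, $ln;
        }
    }

    return wantarray ? @out : \@out;
}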
1;
__END__
=head1 NAME
HTML::ListScraper::Interactive - formatting data from L<HTML::ListScraper>
=head1 FUNCTIONS
=head2 format_tags
Formats a tag sequence to emphasize its tree-like structure. Takes 2
or 3 parameters: an L<HTML::ListScraper> object, an array reference
containing L<HTML::ListScraper::Tag> objects and an optional hash of
formatting options. C<format_tags> returns a list (an array reference
in scalar context) of formatted tag names and text.
The formatting options are:
=over
=item attr
Include the C<href> attribute in the output.
=item text
Include the plain text in the output.
=item index
Include tag positions in the output.
=back
The returned values are essentially XHTML lines: opening tags, text with
quoted entities and closing tags. Tags are enclosed in angle
brackets. The returned values don't necessarily form a valid XML
fragment, though, because the input tags need not form a
tree.
When C<index> is set, each tag value starts with the tag's index, followed
by a tab. Spaces then show indentation: an opening tag not identified
as missing a closing tag increases indentation by 2 spaces, and a closing
tag decreases it again. An opening tag with a missing closing tag is
output with '/' appended to its name. For the rules of associating
opening and closing tags, see C<HTML::ListScraper::shapeless>.
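
The indentation rule above can be sketched in isolation. This is a
simplified illustration, not the module's implementation: the
C<indent_tags> name and the hash-based tag records (C<name>,
C<missing_close>) are hypothetical stand-ins for
L<HTML::ListScraper::Tag> objects, and it assumes a closing tag is
printed at the same depth as its opening tag.

```perl
use strict;
use warnings;

# Hypothetical helper: each tag is a hash with a 'name' ('li' or '/li')
# and an optional 'missing_close' flag. Not part of HTML::ListScraper.
sub indent_tags {
    my @tags = @_;
    my @out;
    my $depth = 0;

    foreach my $tag (@tags) {
        my $name = $tag->{name};

        if ($name =~ m{^/}) {
            # closing tag: step back 2 spaces (never below 0), then print
            $depth -= 2 if $depth >= 2;
            push @out, (' ' x $depth) . "<$name>\n";
        }
        elsif ($tag->{missing_close}) {
            # opening tag without a closing tag: append '/', keep the depth
            push @out, (' ' x $depth) . "<$name/>\n";
        }
        else {
            # ordinary opening tag: print, then indent what follows
            push @out, (' ' x $depth) . "<$name>\n";
            $depth += 2;
        }
    }

    return wantarray ? @out : \@out;
}

my @lines = indent_tags(
    { name => 'ul' },
    { name => 'li' },
    { name => 'br', missing_close => 1 },
    { name => '/li' },
    { name => '/ul' },
);
print @lines;
```

Here the unclosed C<br> is shown as C<< <br/> >> at the depth of its
siblings, while C<ul> and C<li> each push later lines 2 spaces to the
right.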
When C<attr> is set, links are formatted without whitespace and
enclosed in double quotes. Double quotes in links are escaped, but no
other characters are, so the result may not be valid
HTML. When C<text> is set, the output text has normalized whitespace;
nodes containing only whitespace are dropped. A gap between adjacent
tag positions is displayed as an empty line. All values end with a
newline.
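
The link and text rules can be illustrated with two small helpers. Both
names (C<format_link>, C<format_text>) are hypothetical, and the sketch
makes two assumptions the text above does not settle: double quotes are
escaped with a backslash, and "normalized whitespace" means collapsing
runs of whitespace to one space and trimming the ends. Entity quoting
is left out.

```perl
use strict;
use warnings;

# Hypothetical helper: remove all whitespace from a link, escape double
# quotes (assumed: with a backslash) and wrap the result in double quotes.
sub format_link {
    my ($href) = @_;
    $href =~ s/\s+//g;
    $href =~ s/"/\\"/g;
    return qq{"$href"};
}

# Hypothetical helper: collapse whitespace runs, trim both ends and
# return undef for text that contained only whitespace.
sub format_text {
    my ($text) = @_;
    $text =~ s/\s+/ /g;
    $text =~ s/^ //;
    $text =~ s/ $//;
    return length($text) ? "$text\n" : undef;
}

my $link  = format_link(qq{ http://example.com/a "b" });
my $text  = format_text("  Hello \n  world ");
my $blank = format_text(" \n\t");
```

A caller would skip values for which C<format_text> returns C<undef>,
which corresponds to dropping whitespace-only nodes.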
=head2 canonicalize_tags
Undoes the formatting done by C<format_tags>. Takes a list of lines
such as those output by C<format_tags> when called without any
formatting options and converts them to a list of lowercase tag names
(an array reference in scalar context). Note that
C<canonicalize_tags> does not handle attributes, text lines or index
numbers.
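
A short round trip shows the effect. So that the example runs without
the module installed, it uses a local copy of the function body shown
in the code above; the input lines are made-up sample data in the shape
C<format_tags> is described as producing without options.

```perl
use strict;
use warnings;

# Local copy of canonicalize_tags, so the example is self-contained.
sub canonicalize_tags {
    my @out;

    foreach (@_) {
        my $ln = lc $_;
        $ln =~ s/^\s*<//;       # indentation and opening bracket
        $ln =~ s/\/?>\s*$//;    # optional '/', closing bracket, newline
        push @out, $ln if $ln;
    }

    return wantarray ? @out : \@out;
}

# Sample formatted lines (illustrative input, not real scraper output).
my @formatted = (
    "<UL>\n",
    "  <li>\n",
    "    <br/>\n",
    "  </li>\n",
    "</ul>\n",
    "\n",
);

my @names = canonicalize_tags(@formatted);
print join(' ', @names), "\n";    # prints "ul li br /li /ul"
```

Closing tags keep their leading slash, the '/' marker of an unclosed
tag is removed, and the empty line is skipped.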