HTML-ListScraper

lib/HTML/ListScraper/Interactive.pm

	}

	$prev = $name;
	$prev_index = $ft->index;
    }

    return wantarray ? @out : \@out;
}

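sub canonicalize_tags {
    my @out;
    foreach (@_) {
        # Normalize the line to lowercase.
        my $ln = lc $_;
        # Strip leading indentation and the opening angle bracket.
        $ln =~ s/^\s*<//;
        # Strip an optional '/', the closing bracket and trailing whitespace.
        $ln =~ s/\/?>\s*$//;

        if ($ln) {
            push @out, $ln;
        }
    }

    # Return a list in list context, an array reference in scalar context.
    return wantarray ? @out : \@out;
}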

1;

__END__

=head1 NAME

HTML::ListScraper::Interactive - formatting data from L<HTML::ListScraper>

=head1 FUNCTIONS

=head2 format_tags

Formats a tag sequence to emphasize its tree-like structure. Takes 2
or 3 parameters: an L<HTML::ListScraper> object, an array reference
containing L<HTML::ListScraper::Tag> objects and an optional hash with
formatting options. C<format_tags> returns a list (an array reference
if called in scalar context) of formatted tag names and text.

The formatting options are

=over

=item attr

Include the C<href> attribute in the output.

=item text

Include the plain text in the output.

=item index

Include tag positions in the output.

=back

The returned values are basically XHTML lines: opening tags, text with
quoted entities and closing tags. Tags are enclosed in angle
brackets. The returned values don't necessarily form a valid XML
fragment, though, e.g. because the input tags need not form a
tree.

When C<index> is set, each tag value starts with the tag's index,
followed by a tab. Spaces showing indentation come next. An opening
tag not identified as missing its closing tag increases indentation by
2 spaces; the closing tag decreases it again. An opening tag whose
closing tag is missing is output with '/' appended to its name. For
the rules of associating opening and closing tags, see
C<HTML::ListScraper::shapeless>.
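
The indentation rule can be illustrated with a small standalone sketch.
It is not the module's code: C<indent_tags> and its input format (pairs
of tag name and a flag saying whether a closing tag was found) are
hypothetical, and printing a closing tag at the same depth as its
opening tag is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of HTML::ListScraper::Interactive.
# Each element is [ $name, $has_close ]; closing tags start with '/'.
sub indent_tags {
    my @out;
    my $depth = 0;
    foreach my $t (@_) {
        my ($name, $has_close) = @$t;
        if ($name =~ m{^/}) {
            # A closing tag undoes the indentation of its opening tag
            # (assumed to be printed at the opening tag's depth).
            $depth -= 2 if $depth >= 2;
            push @out, (' ' x $depth) . "<$name>\n";
        } elsif ($has_close) {
            # An opening tag with a closing tag indents what follows.
            push @out, (' ' x $depth) . "<$name>\n";
            $depth += 2;
        } else {
            # An opening tag without a closing tag gets '/' appended
            # and leaves the indentation unchanged.
            push @out, (' ' x $depth) . "<$name/>\n";
        }
    }
    return @out;
}

my @lines = indent_tags(
    [ 'ul', 1 ], [ 'li', 1 ], [ 'br', 0 ], [ '/li', 1 ], [ '/ul', 1 ],
);
print @lines;
```

In this sketch C<br> stays at the current depth with a trailing '/',
while C<ul> and C<li> each add 2 spaces for the lines between them and
their closing tags.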

When C<attr> is set, links are formatted without whitespace and
enclosed in double quotes. Double quotes in links are escaped, but no
other characters are, which can also make the result invalid
HTML. When C<text> is set, the output text has normalized whitespace;
nodes containing only whitespace are dropped. A gap between adjacent
tag positions is displayed as an empty line. All values end with a
newline.

=head2 canonicalize_tags

Undoes the formatting done by C<format_tags>. Takes a list of lines
such as those output by C<format_tags> when called without any
formatting options and converts them to a list of lowercase tag names
(an array reference if called in scalar context). Note that
C<canonicalize_tags> does not handle attributes, text lines or index
numbers.
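
A minimal sketch of the conversion follows. To keep it runnable without
the module installed, it uses a local copy of the function body shown
in the source above; the sample input lines are made up for
illustration and assume the indented, newline-terminated format
described for C<format_tags>.

```perl
use strict;
use warnings;

# Local copy of the logic from the module source, for illustration only.
sub canonicalize_tags {
    my @out;
    foreach (@_) {
        my $ln = lc $_;
        $ln =~ s/^\s*<//;       # drop indentation and '<'
        $ln =~ s/\/?>\s*$//;    # drop optional '/', '>' and trailing whitespace
        push @out, $ln if $ln;
    }
    return wantarray ? @out : \@out;
}

# Made-up sample lines; "<BR/>" stands for a tag missing its closing tag.
my @formatted = ("<ul>\n", "  <li>\n", "    <BR/>\n", "  </li>\n", "</ul>\n");

my @names = canonicalize_tags(@formatted);   # list context
my $names = canonicalize_tags(@formatted);   # scalar context: array reference

print join(' ', @names), "\n";
```

The trailing '/' that marks a tag without a closing tag is stripped
along with the angle brackets, whereas the leading '/' of a closing tag
is kept as part of the name.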


