HTML-TagParser


=head1 NAME

HTML::TagParser - Yet another HTML document parser with DOM-like methods

=head1 SYNOPSIS

Parse an HTML file and print the text of its <title> element.

    my $html = HTML::TagParser->new( "index-j.html" );
    my $elem = $html->getElementsByTagName( "title" );
    print "<title>", $elem->innerText(), "</title>\n" if ref $elem;

Parse an HTML source string, print the action attribute of its first <form>
element, and find all input elements belonging to that form.

    my $src  = '<html><form action="hoge.cgi">...</form></html>';
    my $html = HTML::TagParser->new( $src );
    my $elem = $html->getElementsByTagName( "form" );
    print "<form action=\"", $elem->getAttribute("action"), "\">\n" if ref $elem;
    my @first_inputs = $elem->subTree()->getElementsByTagName( "input" );
    my $form = $first_inputs[0]->getParent();

Fetch an HTML file via HTTP and display all of its <a> elements and their attributes.

    my $url  = 'http://www.kawa.net/xp/index-e.html';
    my $html = HTML::TagParser->new( $url );
    my @list = $html->getElementsByTagName( "a" );
    foreach my $elem ( @list ) {
        my $tagname = $elem->tagName;
        my $attr = $elem->attributes;
        my $text = $elem->innerText;
        print "<$tagname";
        foreach my $key ( sort keys %$attr ) {
            print " $key=\"$attr->{$key}\"";
        }
        if ( $text eq "" ) {
            print " />\n";
        } else {
            print ">$text</$tagname>\n";
        }
    }

=head1 DESCRIPTION

HTML::TagParser is a pure Perl module which parses HTML/XHTML files.
It provides methods modeled on the DOM interface.
It does not require strict XHTML
because many HTML pages are not strictly well-formed.
For example, many pages use <br> elements instead of <br/>
and have <p> elements which are not closed.

=head1 METHODS

=head2 $html = HTML::TagParser->new();

This method constructs an empty instance of the C<HTML::TagParser> class.

=head2 $html = HTML::TagParser->new( $url );

If new() is called with a URL,
it fetches the HTML file from the remote web server, parses it,
and returns the resulting instance.
The L<URI::Fetch> module is required to fetch the file.

=head2 $html = HTML::TagParser->new( $file );

If new() is called with a filename,
it parses the local HTML file and returns the resulting instance.

=head2 $html = HTML::TagParser->new( "<html>...snip...</html>" );

If new() is called with a string of HTML source code,
it parses the string and returns the resulting instance.

=head2 $html->fetch( $url, %param );

This method fetches an HTML file from a remote web server and parses it.
The optional %param arguments are parameters for the L<URI::Fetch> module.

=head2 $html->open( $file );

This method parses a local HTML file.

=head2 $html->parse( $source );

This method parses a string of HTML source code.
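
As a minimal sketch of how the loading methods above fit together, the
following builds an empty instance and then feeds it a source string with
parse(). The HTML snippet and the id value C<msg> are made up for illustration.

    use strict;
    use warnings;
    use HTML::TagParser;

    # Start from an empty parser, then load a document into it.
    my $html = HTML::TagParser->new();
    $html->parse( '<html><body><p id="msg">hello</p></body></html>' );

    # Look the paragraph up by its id and print its text if it was found.
    my $elem = $html->getElementById( "msg" );
    print $elem->innerText(), "\n" if ref $elem;

Calling fetch( $url ) or open( $file ) on the empty instance in place of
parse() should load a remote or local document in the same way.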

=head2 $elem = $html->getElementById( $id );

This method returns the element whose id attribute is $id.

=head2 @elem = $html->getElementsByName( $name );

This method returns an array of elements whose name attribute is $name.
In scalar context, only the first element is returned.

=head2 @elem = $html->getElementsByTagName( $tagname );

This method returns an array of elements whose tagName is $tagname.
In scalar context, only the first element is returned.
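
The difference between the two calling contexts can be sketched as follows.
The list markup here is a made-up example, and the same pattern is assumed
to apply to the other getElementsBy... methods that document this behavior.

    use strict;
    use warnings;
    use HTML::TagParser;

    my $src  = '<html><ul><li>one</li><li>two</li></ul></html>';
    my $html = HTML::TagParser->new( $src );

    # List context: every matching element is returned.
    my @items = $html->getElementsByTagName( "li" );
    print "found ", scalar(@items), " li elements\n";

    # Scalar context: only the first matching element is returned.
    my $first = $html->getElementsByTagName( "li" );
    print "first: ", $first->innerText(), "\n" if ref $first;

Checking C<ref $first> before calling a method on it, as the SYNOPSIS does,
guards against the case where no element matches.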

=head2 @elem = $html->getElementsByClassName( $class );

This method returns an array of elements whose className is $class.
In scalar context, only the first element is returned.

=head2 @elem = $html->getElementsByAttribute( $attrname, $value );


