HTML-SimpleLinkExtor

lib/HTML/SimpleLinkExtor.pm

			next unless exists $AUTO_METHODS{ $tuple->[$index] };

			my $url = URI->new( $tuple->[$index + 1] );
			next unless ref $url;
			$tuple->[$index + 1] = $url->abs($base);
			}
		}
	}

=encoding utf8

=head1 NAME

HTML::SimpleLinkExtor - Extract links from HTML

=head1 SYNOPSIS

	use HTML::SimpleLinkExtor;

	my $extor = HTML::SimpleLinkExtor->new();
	$extor->parse_file($filename);
	#--or--
	$extor->parse($html);

	$extor->parse_file($other_file); # get more links

	$extor->clear_links; # reset the link list

	#extract all of the links
	@all_links   = $extor->links;

	#extract the img links
	@img_srcs    = $extor->img;

	#extract the frame links
	@frame_srcs  = $extor->frame;

	#extract the hrefs
	@area_hrefs  = $extor->area;
	@a_hrefs     = $extor->a;
	@base_hrefs  = $extor->base;
	@hrefs       = $extor->href;

	#extract the body background link
	@body_bg     = $extor->body;
	@background  = $extor->background;

	@links       = $extor->schemes( 'http' );

=head1 DESCRIPTION

This is a simple HTML link extractor designed for the person who does
not want to deal with the intricacies of C<HTML::Parser> or the
de-referencing needed to get links out of C<HTML::LinkExtor>.

You can extract all the links or some of the links (based on the HTML
tag name or attribute name). If a C<< <BASE HREF> >> tag is found,
all of the relative URLs will be resolved according to that reference.
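
The resolution step can be sketched on its own with the C<URI> module,
whose C<abs> method is the same call this module's code uses. The base
URL and the relative links below are made-up values for illustration,
not anything this module supplies.

```perl
use strict;
use warnings;
use URI;

# Hypothetical base URL and links, for illustration only
my $base     = 'http://www.example.com/docs/index.html';
my @relative = (
	'images/logo.png',             # relative to the base's directory
	'../about.html',               # climbs one directory
	'/top.html',                   # absolute path on the same host
	'http://other.example.org/x',  # already absolute, left unchanged
	);

# URI->new(...)->abs($base) resolves each link against the base
my @absolute = map { URI->new($_)->abs($base)->as_string } @relative;

print "$_\n" for @absolute;
# http://www.example.com/docs/images/logo.png
# http://www.example.com/about.html
# http://www.example.com/top.html
# http://other.example.org/x
```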

This module is a thin subclass of C<HTML::LinkExtor>, so it can
only parse what that module can handle.  Invalid HTML or XHTML may
cause problems.

If you parse multiple files, the link list grows and contains the
aggregate list of links for all of the files parsed. If you want to
reset the link list between files, use the C<clear_links> method.
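
A minimal sketch of that accumulate-then-reset pattern, assuming the
module and its dependencies are installed. The HTML snippets and the
example.com URLs are invented for illustration; only methods shown in
the SYNOPSIS are used.

```perl
use strict;
use warnings;
use HTML::SimpleLinkExtor;

my $extor = HTML::SimpleLinkExtor->new;

# Two separate documents parsed with the same object
$extor->parse( '<a href="http://www.example.com/one.html">One</a>' );
$extor->parse( '<img src="http://www.example.com/two.png">' );

# The list now holds the links from both parse() calls
my @both = $extor->links;

# Reset the list before moving on to an unrelated document
$extor->clear_links;
$extor->parse( '<a href="http://www.example.com/three.html">Three</a>' );

# Only the links found since clear_links
my @fresh = $extor->links;

print scalar(@both), " link(s) before the reset, ",
	scalar(@fresh), " after\n";
```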

=head2 Class Methods

=over

=item $extor = HTML::SimpleLinkExtor->new()

Create the link extractor object.

=item $extor = HTML::SimpleLinkExtor->new($base)

Create the link extractor object and resolve the relative URLs
according to the supplied base URL. The supplied base URL overrides
any other base URL found in the HTML.

=item $extor = HTML::SimpleLinkExtor->new('')

Create the link extractor object and do not resolve relative
links.
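
For the C<new($base)> form, a minimal sketch, assuming the module is
installed. The base URL and the HTML fragment are invented for
illustration.

```perl
use strict;
use warnings;
use HTML::SimpleLinkExtor;

# Hypothetical base URL; relative links are resolved against it
my $extor = HTML::SimpleLinkExtor->new( 'http://www.example.com/docs/' );

$extor->parse( '<a href="page.html">Page</a>' );

# Each link comes back resolved against the supplied base
my @links = $extor->links;
print "$_\n" for @links;
```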

=cut

sub new {
	my $class = shift;
	my $base  = shift;

	my $self = HTML::LinkExtor->new;
	bless $self, $class;

	$self->{'_SimpleLinkExtor_base'} = $base;
	$self->{'_ua'} = LWP::UserAgent->new;
	$self->_init_links;

	return $self;
	}

=item $extor->ua

Returns the internal user agent, an C<LWP::UserAgent> object.

=cut

sub ua { $_[0]->{_ua} }

=item HTML::SimpleLinkExtor->add_tags( TAG [, TAG ] )

C<HTML::SimpleLinkExtor> keeps an internal list of HTML tags (such as
'a' and 'img') that have URLs as values. If you run into another tag
that this module doesn't handle, please send it to me and I'll add it.
Until then you can add that tag to the internal list. This affects
the entire class, including previously created objects.

=cut

sub add_tags {


