libwww-perl

lwptut.pod

  my $browser = LWP::RobotUA->new('YourSuperBot/1.34', 'you@yoursite.com');
    # Your bot's name and your email address

  my $response = $browser->get($url);

But LWP::RobotUA adds these features:


=over

=item *

If the F<robots.txt> on C<$url>'s server forbids you from accessing
C<$url>, then the C<$browser> object (assuming it's of class LWP::RobotUA)
won't actually request it, but instead will give you back (in C<$response>) a 403 error
with a message "Forbidden by robots.txt".  That is, if you have this line:

  die "$url -- ", $response->status_line, "\nAborted"
   unless $response->is_success;

then the program would die with an error message like this:

  http://whatever.site.int/pith/x.html -- 403 Forbidden by robots.txt
  Aborted at whateverprogram.pl line 1234

=item *

If this C<$browser> object sees that it last talked to C<$url>'s server
too recently, it will pause (via C<sleep>) to avoid making too many
requests too often. By default it waits one minute between requests to
the same server, but you can change that with the C<<
$browser->delay( I<minutes> ) >> attribute.

For example, this code:

  $browser->delay( 7/60 );

...means that this browser will pause when it needs to avoid talking to
any given server more than once every 7 seconds.

=back
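
The two behaviors above can be seen together in a short script. This is
only a sketch: the ten-second delay is an arbitrary choice, and
C<explain_response> is a hypothetical helper written for this example,
not part of LWP. To keep the example free of network traffic, it feeds
the helper a hand-built response that mimics the "Forbidden by
robots.txt" case instead of making a real request.

  use strict;
  use warnings;
  use LWP::RobotUA;
  use HTTP::Response;

  # Hypothetical helper: turn a response into a one-line description.
  sub explain_response {
    my ($response) = @_;
    return 'fetched OK' if $response->is_success;
    return 'blocked by robots.txt'
      if $response->code == 403 && $response->message =~ /robots\.txt/;
    return 'failed: ' . $response->status_line;
  }

  my $browser = LWP::RobotUA->new('YourSuperBot/1.34', 'you@yoursite.com');
  $browser->delay( 10/60 );  # in minutes: at most one request per 10 seconds

  # A real program would call $browser->get($url) here. This stand-in
  # response has the same status and message as a robots.txt refusal.
  my $denied = HTTP::Response->new(403, 'Forbidden by robots.txt');
  print explain_response($denied), "\n";

Checking the status code and message this way lets a program treat a
F<robots.txt> refusal differently from an ordinary failure, instead of
simply dying.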

For more options and information, see L<the full documentation for
LWP::RobotUA|LWP::RobotUA>.





=for comment
 ##########################################################################

=head2 Using Proxies

In some cases, you will want to (or will have to) use proxies for
accessing certain sites and/or using certain protocols. This is most
commonly the case when your LWP program is running (or could be running)
on a machine that is behind a firewall.

To make a browser object use proxies that are defined in the usual
environment variables (C<HTTP_PROXY>, etc.), just call the C<env_proxy>
method on a user-agent object before you make any requests with it.
Specifically:

  use LWP::UserAgent;
  my $browser = LWP::UserAgent->new;
  
  # And before you go making any requests:
  $browser->env_proxy;

For more information on proxy parameters, see L<the LWP::UserAgent
documentation|LWP::UserAgent>, specifically the C<proxy>, C<env_proxy>,
and C<no_proxy> methods.
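
If the environment variables are not set, the C<proxy> and C<no_proxy>
methods let you configure proxies in code instead. Here is a minimal
sketch, assuming a made-up proxy at C<proxy.example.int:8080> and a
made-up internal host name; substitute your own values.

  use strict;
  use warnings;
  use LWP::UserAgent;

  my $browser = LWP::UserAgent->new;

  # Send HTTP and FTP requests through the (hypothetical) proxy...
  $browser->proxy( ['http', 'ftp'], 'http://proxy.example.int:8080/' );

  # ...but talk directly to these hosts and domains.
  $browser->no_proxy( 'localhost', 'intranet.example.int' );

  # Calling proxy() with just a scheme returns the current setting.
  print "HTTP proxy: ", $browser->proxy('http'), "\n";

As with C<env_proxy>, do this before making any requests with the
browser object.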



=for comment
 ##########################################################################

=head2 HTTP Authentication

Many web sites restrict access to documents by using "HTTP
Authentication". This isn't just any form of "enter your password"
restriction, but is a specific mechanism where the HTTP server sends the
browser an HTTP code that says "That document is part of a protected
'realm', and you can access it only if you re-request it and add some
special authorization headers to your request".

For example, the Unicode.org admins stop email-harvesting bots from
harvesting the contents of their mailing list archives, by protecting
them with HTTP Authentication, and then publicly stating the username
and password (at C<http://www.unicode.org/mail-arch/>) -- namely
username "unicode-ml" and password "unicode".  

Consider this URL, which is part of the protected area of that web
site:

  http://www.unicode.org/mail-arch/unicode-ml/y2002-m08/0067.html

If you access that with a browser, you'll get a prompt like "Enter
username and password for 'Unicode-MailList-Archives' at server
'www.unicode.org'".

In LWP, if you just request that URL, like this:

  use LWP;
  my $browser = LWP::UserAgent->new;

  my $url =
   'http://www.unicode.org/mail-arch/unicode-ml/y2002-m08/0067.html';
  my $response = $browser->get($url);

  die "Error: ", $response->header('WWW-Authenticate') || 'Error accessing',
    #  (the 'WWW-Authenticate' header names the auth scheme and realm)
    "\n ", $response->status_line, "\n at $url\n Aborting"
   unless $response->is_success;

Then you'll get this error:

  Error: Basic realm="Unicode-MailList-Archives"
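
One way to get past that error is to give the browser object the
publicly stated username and password before making the request, using
the C<credentials> method of LWP::UserAgent. This is a sketch that
assumes the server is reached on the default HTTP port (80) and that the
realm name is exactly the one shown in the error above.

  use strict;
  use warnings;
  use LWP::UserAgent;

  my $browser = LWP::UserAgent->new;

  # Arguments: host:port, realm name, then username and password.
  $browser->credentials(
    'www.unicode.org:80',
    'Unicode-MailList-Archives',
    'unicode-ml' => 'unicode',
  );

  # With credentials stored, a later $browser->get($url) can answer the
  # server's authentication challenge for that realm automatically.

The host-and-port string and the realm must both match what the server
reports, or the stored credentials will not be used.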


