- Add Gungho::Manual::FAQ
0.09003_03 Fri Nov 09 2007 [rev 258]
[POE]
- Note: Changes for POE engine contained in this release are relatively
critical. If you were having problems before, you probably should check
this release out.
- Be smarter about how dispatch() gets called. Now we do a more effective
invocation of the dispatch state so that we don't waste cycles just
trying to dispatch requests.
- Allow "0" setting in keepalive.keep_alive. This is a very important
parameter if you're using Gungho through a proxy. If you enable this
while under a proxy, PoCo::Client::Keepalive will think that you should
be using the cached connection to the proxy and so Gungho will lose all
parallelism.
- Allow setting the number of PoCo::Client::HTTP to be spawned via
client.spawn parameter. This is required if you're dealing with
relatively large amounts of URLs at once. Otherwise, PoCo::Client::HTTP
will tend to jam up after a while.
0.09003_02 Thu Nov 08 2007 [rev 258]
[Throttle]
- Fix Throttling to delegate throttling decisions. This allows you to
- Implement plugins
- Add G::Plugin::RequestTimer
- Add deps features in Makefile.PL
0.01 Sat 07 Apr 2007
- handle_response() now takes $request and $response all over
- Add send_request() in Gungho.pm, Gungho/Engine/POE.pm
- Add notes() in Gungho/Request.pm. Cloning is properly handled
0.01_04 Sat 07 Apr 2007
- Enable keepalive
0.01_03 Fri 06 Apr 2007
- Fix embarrassing documentation whoopla. As stated, no,
I'm not ashamed of stealing good code.
0.01_02 Fri 06 Apr 2007
- Add a new provider and small set of changes so that we can
use this in Plagger
- Use Class::Inspector to check if a module has been loaded
HTTP::Headers::Util: 0
HTTP::Parser: 0
HTTP::Status: 0
IO::Async::Buffer: 0
IO::Socket::INET: 0
MIME::Base64: 0
Net::DNS: 0
POE: 0.9999
POE::Component::Client::DNS: 0
POE::Component::Client::HTTP: 0.81
POE::Component::Client::Keepalive: 0
Path::Class::Dir: 0
Regexp::Common: 0
Sys::Hostname: 0
Time::HiRes: 0
URI: 0
WWW::RobotRules::Parser: 0
Web::Scraper::Config: 0
XML::LibXML: 0
YAML: 0
requires:
Exception::Class: 0
FindBin: 0
Getopt::Long: 0
HTTP::Request: 0
HTTP::Response: 0
HTTP::Status: 0
Log::Dispatch: 0
POE: 0.9999
POE::Component::Client::DNS: 0
POE::Component::Client::HTTP: 0.81
POE::Component::Client::Keepalive: 0
Path::Class: 0
Pod::Usage: 0
Regexp::Common: 0
Storable: 0
UNIVERSAL::isa: 0.06
UNIVERSAL::require: 0
URI: 0
perl: 5.8.0
tests: t/01_load.t t/02_config.t t/03_live/perl-proxy.t t/03_live/perl.t t/03_live/twitter.t t/99_kwalitee.t t/99_pod-coverage.t t/99_pod.t t/component/authentication/01_load.t t/component/cache/01_load.t t/component/robot_rules/01_load.t t/component...
version: 0.09006
Makefile.PL view on Meta::CPAN
name("Gungho");
all_from("lib/Gungho.pm");
no_index( directory => 'examples' );
# This used to be optional, but we're making it mandatory.
# You're still free to use other engines, but we're forcing
# you to install POE so we can go ahead with testing
requires('POE', '0.9999');
requires('POE::Component::Client::DNS');
requires('POE::Component::Client::Keepalive');
requires('POE::Component::Client::HTTP', '0.81');
requires("Best");
requires("Class::Accessor::Fast");
requires("Class::C3::Componentised");
requires("Class::Data::Inheritable");
requires("Class::Inspector");
requires("Config::Any");
requires("Data::Dumper");
requires("Event::Notify", '0.00004');
deps/Engine-POE.yaml view on Meta::CPAN
---
name: 'POE Engine'
default: 1
depends:
'POE': 0.9999
'POE::Component::Client::DNS': 0
'POE::Component::Client::Keepalive': 0
'POE::Component::Client::HTTP': 0.81
docs/ja/Gungho/Engine/POE.pod view on Meta::CPAN
config:
  loop_delay: 5
  client:
    spawn: 2
    agent:
      - AgentName1
      - AgentName2
    max_size: 16384
    follow_redirect: 2
    proxy: http://localhost:8080
  keepalive:
    keep_alive: 10
    max_open: 200
    max_per_host: 20
    timeout: 10
  dns:
    # disable: 1 If you want to disable DNS resolution by Gungho
=head1 DESCRIPTION
Gungho::Engine::POE is the module that runs Gungho on top of POE.
docs/ja/Gungho/Engine/POE.pod view on Meta::CPAN
ããã©ã«ãå¤ã¯ï¼ã§ãã
=head2 dns.disable
Set this value to 1 if you want to disable DNS resolution. In some environments
client-side DNS resolution does not work and a proxy along the route is
configured to perform it instead, which can cause problems. In that case, set
this item to 1.
The default value is 0.
=head2 keepalive
The C<keepalive> item is used to configure POE::Component::Client::Keepalive.
This setting can become important when you use Gungho in an environment that
goes through a proxy. Gungho::Engine::POE is built to reuse sockets that are
already connected. When you connect through a proxy, however, there is only one
target server, so every request ends up using the same socket and, as a result,
nothing runs in parallel (transparent proxies are not affected).
Gungho tries to apply this setting automatically when it detects that a proxy
is in use, but to specify it explicitly, write
    keepalive:
      keep_alive: 0
To change the other POE::Component::Client::Keepalive settings, configure them
as follows:
    keepalive:
      max_per_host: ....
      max_open: ...
      timeout: ...
If you set the C<keepalive.timeout> item, see
L<USING KEEPALIVE|USING KEEPALIVE>.
=head1 POE::Component::Client::HTTP AND DECODED CONTENTS
Since version 0.80, POE::Component::Client::HTTP may decode the fetched
response content into Perl Unicode on its own. When that happens, the data
actually handed to you may be normalized Unicode even if the HTTP header itself
looks like this:
Content-Type: text/html; charset=euc-jp
docs/ja/Gungho/Engine/POE.pod view on Meta::CPAN
user_agent: my_user_agent
engine:
module: POE
...
If this is not set, problems may arise when using components such as
RobotRules.
=head1 USING KEEPALIVE
Gungho::Engine::POE uses POE::Component::Client::Keepalive internally to
control its socket connections.
Apart from performance, most of the settings are not visible to the user, but
the C<timeout> setting can affect when the engine terminates. The reason is
that with a high C<timeout> value, POE itself cannot stop until those
connections are closed. This is normal behavior.
=head1 ENVIRONMENT VARIABLES
=head2 GUNGHO_ENGINE_POE_SKIP_DECODE_CONTENT
docs/ja/Gungho/Manual/FAQ.pod view on Meta::CPAN
ã ãã§ãã
=back
=head1 Q. "Fetching my requests seems to take a long time. What can I do about it?"
Many factors are involved, but start by checking the following points:
=head2 Is Your Data Set A Good Fit For Gungho?
Gungho uses an asynchronous engine and caches socket connections with
POE::Component::Client::Keepalive.
Because it works this way, you can expect good performance when crawling many
different hosts. Conversely, when crawling within a single host, for example,
performance may not improve as you expect unless you take care.
=head2 Gungho::Engine::POE And The loop_delay Setting
C<engine.config.loop_delay> specifies how long to wait on each loop. On each
loop, the Provider is checked for further requests to send, and those requests
docs/ja/Gungho/Manual/FAQ.pod view on Meta::CPAN
Running a crawler through a proxy is common practice, but you need to be
careful when doing so with Gungho::Engine::POE, because performance may drop.
When using a proxy, specify a configuration like the following:
    engine:
      module: POE
      config:
        keepalive:
          keep_alive: 0
This setting disables the POE::Component::Client::Keepalive instance used by
POE::Component::Client::HTTP. Keepalive is a module that reuses socket
connections once they have been established, but when you connect through a
proxy there is only one target server, so nothing can run in parallel at all.
With this setting, a new connection is made for every request instead.
=cut
docs/ja/Gungho/Manual/Install.pod view on Meta::CPAN
=head1 DEPENDENCIES
Gungho is built from a wide variety of modules, so it depends on many modules
published on CPAN. During installation, Gungho asks on screen whether each of
these modules is needed. For example, if you intend to use the POE engine,
answer "y" when asked whether to install the POE-related modules.
Sample output:
[POE Engine]
- POE::Component::Client::Keepalive ...missing
- POE::Component::Client::DNS ...missing
- POE::Component::Client::HTTP ...missing
- POE ...missing
==> Auto-install the 1 optional module(s) from CPAN? [y]
=cut
lib/Gungho/Engine/POE.pm view on Meta::CPAN
# $Id: /mirror/gungho/lib/Gungho/Engine/POE.pm 39017 2008-01-16T16:05:45.674472Z lestrrat $
#
# Copyright (c) 2007 Daisuke Maki <daisuke@endeworks.jp>
# All rights reserved.
package Gungho::Engine::POE;
use strict;
use warnings;
use base qw(Gungho::Engine);
use POE;
use POE::Component::Client::Keepalive;
use POE::Component::Client::HTTP;
__PACKAGE__->mk_accessors($_) for qw(alias loop_alarm loop_delay resolver clients);
use constant DEBUG => 0;
use constant UserAgentAlias => 'Gungho_Engine_POE_UserAgent_Alias';
use constant DnsResolverAlias => 'Gungho_Engine_POE_DnsResolver_Alias';
use constant SKIP_DECODE_CONTENT =>
exists $ENV{GUNGHO_ENGINE_POE_SKIP_DECODE_CONTENT} ? $ENV{GUNGHO_ENGINE_POE_SKIP_DECODE_CONTENT} : 1;
use constant FORCE_ENCODE_CONTENT =>
lib/Gungho/Engine/POE.pm view on Meta::CPAN
$self->loop_delay( $self->config->{loop_delay} ) if $self->config->{loop_delay};
$self->next::method(@_);
}
sub run
{
    my ($self, $c) = @_;
    my %config = %{ $self->config || {} };
    my $keepalive_config = delete $config{keepalive} || {};
    {
        my %defaults = (
            keep_alive   => 10,
            max_open     => 200,
            max_per_host => 5,
            timeout      => 10
        );
        while (my($key, $value) = each %defaults) {
            if (! defined $keepalive_config->{$key}) {
                $keepalive_config->{$key} = $value;
            }
        }
    }
    my $keepalive = POE::Component::Client::Keepalive->new(%$keepalive_config);
    my $client_config = delete $config{client} || {};
    foreach my $key (keys %$client_config) {
        if ($key =~ /^[a-z]/) { # ah, need to make this CamelCase
            my $camel = ucfirst($key);
            $camel =~ s/_(\w)/uc($1)/ge;
            $client_config->{$camel} = delete $client_config->{$key};
        }
    }
lib/Gungho/Engine/POE.pm view on Meta::CPAN
    if ($spawn < 1) { $spawn = 2 }
    for my $i ( 1 .. $spawn ) {
        my $alias = join('-', &UserAgentAlias, $i);
        push @{ $self->clients }, $alias;
        POE::Component::Client::HTTP->spawn(
            FollowRedirects   => 1,
            Agent             => $c->user_agent,
            Timeout           => 60,
            %$client_config,
            Alias             => $alias,
            ConnectionManager => $keepalive,
        );
    }
    POE::Session->create(
        heap => { CONTEXT => $c },
        object_states => [
            $self => {
                _start => '_poe_session_start',
                _stop  => '_poe_session_stop',
                map { ($_ => "_poe_$_") }
config:
  loop_delay: 5
  client:
    spawn: 2
    agent:
      - AgentName1
      - AgentName2
    max_size: 16384
    follow_redirect: 2
    proxy: http://localhost:8080
  keepalive:
    keep_alive: 10
    max_open: 200
    max_per_host: 20
    timeout: 10
  dns:
    # disable: 1 If you want to disable DNS resolution by Gungho
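In the run() excerpt shown earlier, lower-case keys under C<client> are rewritten to CamelCase before being handed to POE::Component::Client::HTTP. The following is a minimal standalone sketch of that renaming step only; C<camelize_keys> is a hypothetical helper name, not a Gungho function, and it returns a new hash rather than modifying the config in place.

```perl
use strict;
use warnings;

# Hypothetical helper: rename snake_case keys to CamelCase using the same
# ucfirst + s/_(\w)/uc($1)/ge steps that appear in the run() excerpt.
sub camelize_keys {
    my ($conf) = @_;
    my %out;
    for my $key (keys %$conf) {
        my $name = $key;
        if ($name =~ /^[a-z]/) {
            $name = ucfirst $name;
            $name =~ s/_(\w)/uc($1)/ge;
        }
        $out{$name} = $conf->{$key};
    }
    return \%out;
}

my $client = camelize_keys({
    max_size        => 16384,
    follow_redirect => 2,
    proxy           => 'http://localhost:8080',
});
print join(', ', sort keys %$client), "\n";
```

With this rule C<follow_redirect> becomes C<FollowRedirect>, which is not the same key as the C<FollowRedirects> default passed to spawn() in the excerpt, so it may be worth checking which spelling your configuration needs.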
=head1 DESCRIPTION
Gungho::Engine::POE brings the full power of POE to Gungho.
lib/Gungho/Engine/POE.pm view on Meta::CPAN
C<spawn> specifies the number of POE::Component::Client::HTTP sessions to start.
This will greatly affect your fetching speed, as PoCo::Client::HTTP tends to
start jamming up after a certain number of requests have been pushed onto
its queue.
If you feel like all of your other settings are correct but the actual
HTTP fetch is taking too long, try setting this number to something higher.
By default this is set to 2.
=head2 keepalive.keep_alive
Specifies the number of seconds to keep a connection in the Keepalive
connection manager.
This is an important option to tweak if you're using proxies. Even though
you might be accessing thousands of different URLs, POE will think that
you are in fact trying to connect to the same host because you're
accessing the same proxy.
Turn this to 0 if you are using a proxy.
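The run() excerpt shown earlier fills in missing keepalive settings with a C<defined> check, which is what lets an explicit C<keep_alive: 0> survive. Below is a small standalone sketch of that merge; C<merge_keepalive_defaults> is a hypothetical name used only for illustration, and the default values are copied from the excerpt.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Gungho): apply the defaults from the
# run() excerpt to a user-supplied keepalive config.
sub merge_keepalive_defaults {
    my ($user) = @_;
    my %merged   = %{ $user || {} };
    my %defaults = (
        keep_alive   => 10,
        max_open     => 200,
        max_per_host => 5,
        timeout      => 10,
    );
    for my $key (keys %defaults) {
        # Test with "defined" rather than truthiness so that 0 is kept.
        $merged{$key} = $defaults{$key} unless defined $merged{$key};
    }
    return \%merged;
}

my $conf = merge_keepalive_defaults({ keep_alive => 0 });
printf "%s=%s\n", $_, $conf->{$_} for sort keys %$conf;
```

Had the merge used C<||=> instead, a value of 0 would be treated as unset and silently replaced by the default of 10.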
=head1 POE::Component::Client::HTTP AND DECODED CONTENTS
lib/Gungho/Engine/POE.pm view on Meta::CPAN
to enable the workarounds:
GUNGHO_ENGINE_POE_SKIP_DECODE_CONTENT = 1
# or
GUNGHO_ENGINE_POE_FORCE_ENCODE_CONTENT = 1
See L<ENVIRONMENT VARIABLES|ENVIRONMENT VARIABLES> for details.
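In the source excerpt these flags are defined with C<use constant>, so their values are fixed when the module is compiled. The sketch below illustrates the consequence with a made-up variable, C<DEMO_SKIP_DECODE>, so that it runs without Gungho installed: the environment has to be set before the constant is defined, for example in the shell or in a C<BEGIN> block placed ahead of the C<use> line.

```perl
use strict;
use warnings;

# Set the variable at compile time, before the constant below is defined.
BEGIN { $ENV{DEMO_SKIP_DECODE} = 0 }

# Same shape as the SKIP_DECODE_CONTENT definition in the excerpt, but with
# a hypothetical variable name.
use constant SKIP_DECODE =>
    exists $ENV{DEMO_SKIP_DECODE} ? $ENV{DEMO_SKIP_DECODE} : 1;

# Changing the environment at run time no longer affects the constant.
$ENV{DEMO_SKIP_DECODE} = 1;

print "SKIP_DECODE is ", SKIP_DECODE, "\n";
```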
=head1 USING KEEPALIVE
Gungho::Engine::POE uses PoCo::Client::Keepalive to control the connections.
For the most part this has no visible effect on the user, but the "timeout"
parameter dictates exactly how long the component waits for a new connection,
which means that, after it finishes fetching all the requests, the engine
waits for that amount of time before terminating. This is NORMAL.
=head1 ENVIRONMENT VARIABLES
=head2 GUNGHO_ENGINE_POE_SKIP_DECODE_CONTENT
When set to a non-null value, this will install a new subroutine in
lib/Gungho/Manual/FAQ.pod view on Meta::CPAN
engine:
module: POE
=head1 Q. "My requests are being served slowly. What can I do?"
There are actually a number of things that may affect fetch speed.
=head2 Is Gungho The Right Crawler For Your Data Set?
Gungho uses an asynchronous engine, and with POE::Component::Client::Keepalive
it reuses the connections to the same host.
This kind of setup works great if you are accessing a lot of different hosts,
but could easily jam up if you are accessing, for example, a single host.
For such datasets, Gungho will be no more effective than a simple script
making repeated calls to LWP::UserAgent-E<gt>get().
=head2 Choosing The Right loop_delay With Gungho::Engine::POE
C<engine.config.loop_delay> specifies the number of seconds to wait between
lib/Gungho/Manual/FAQ.pod view on Meta::CPAN
Do note, however, that excessive enqueueing of requests is going to be a
problem regardless. You should at least keep a mental note of how many requests
you're sending to the POE queue, and throttle as necessary.
=head2 Considerations When Using A Proxy With Gungho::Engine::POE
Proxies are great, and can be used in crawler applications, but by default
they don't play nicely with Gungho's POE engine.
The short version of the remedy is: set engine.config.keepalive.keep_alive to 0
engine:
module: POE
config:
keepalive:
keep_alive: 0
Now the long explanation. Gungho::Engine::POE, and POE::Component::Client::HTTP
which is used internally, use a module called POE::Component::Client::Keepalive
to manage the connections, and to possibly reuse an already established
connection. However, when using a proxy, all the requests go through the given
proxy, so PoCo::Client::Keepalive will try to reuse the same connections for
all of the requests.
This is obviously a problem, because it will make the entire request set
go through the same connection -- and therefore you lose all parallelism.
To work around this problem, you need to disable PoCo::Client::Keepalive,
hence the above configuration.
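The effect described above can be pictured with a toy model. This sketch does not reproduce the internals of POE::Component::Client::Keepalive; it only counts how many distinct host:port endpoints a batch of requests would connect to, with and without a proxy. The C<distinct_endpoints> function and the example.com URLs are illustrative assumptions.

```perl
use strict;
use warnings;

# Toy model: the endpoint actually connected to is the proxy when one is
# configured, otherwise the host named in each URL.
sub distinct_endpoints {
    my ($urls, $proxy) = @_;
    my %seen;
    for my $url (@$urls) {
        my $target = defined $proxy ? $proxy : $url;
        my ($endpoint) = $target =~ m{^\w+://([^/]+)};
        $seen{$endpoint}++ if defined $endpoint;
    }
    return scalar keys %seen;
}

my @urls = map { "http://site$_.example.com/" } 1 .. 5;

print "direct:    ", distinct_endpoints(\@urls), " endpoint(s)\n";
print "via proxy: ", distinct_endpoints(\@urls, 'http://localhost:8080'), " endpoint(s)\n";
```

Five different sites collapse to a single endpoint once the proxy is in the path, which is why connection reuse keyed on the endpoint works against parallel fetching in that setup.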
=cut
lib/Gungho/Request/http.pm view on Meta::CPAN
# $Id: /mirror/gungho/lib/Gungho/Request/http.pm 2473 2007-09-04T07:08:58.221716Z lestrrat $
#
# Copyright (c) 2007 Daisuke Maki <daisuke@endeworks.jp>
# All rights reserved.
package Gungho::Request::http;
use strict;
use warnings;
use base qw(Gungho::Base);
__PACKAGE__->mk_accessors($_) for qw(peer_http_version send_te keep_alive);
my $CRLF = "\015\012";
sub new
{
my $class = shift;
$class->next::method(peer_http_version => "1.0", send_te => 0, @_);
}
sub prepare_request{}
lib/Gungho/Request/http.pm view on Meta::CPAN
    if ($given{te}) {
        push(@connection, "TE") unless grep lc($_) eq "te", @connection;
    } elsif ($self->send_te && zlib_ok()) {
        # gzip is less wanted since the Compress::Zlib interface for
        # it does not really allow chunked decoding to take place easily.
        push(@h2, "TE: deflate,gzip;q=0.3");
        push(@connection, "TE");
    }
    unless (grep lc($_) eq "close", @connection) {
        if ($self->keep_alive) {
            if ($peer_ver eq "1.0") {
                # from looking at Netscape's headers
                push(@h2, "Keep-Alive: 300");
                unshift(@connection, "Keep-Alive");
            }
        } else {
            push(@connection, "close") if $ver ge "1.1";
        }
    }
    push(@h2, "Connection: " . join(", ", @connection)) if @connection;