Gungho


Changes  view on Meta::CPAN

  - Add Gungho::Manual::FAQ

0.09003_03 Fri Nov 09 2007 [rev 258]
  [POE]
  - Note: Changes for POE engine contained in this release are relatively
    critical. If you were having problems before, you probably should check
    this release out.
  - Be smarter about how dispatch() gets called. Now we do a more effective
    invocation of the dispatch state so that we don't waste cycles just
    trying to dispatch requests.
  - Allow "0" setting in keepalive.keep_alive. This is a very important
    parameter if you're using Gungho through a proxy. If you enable this
    while under a proxy, PoCo::Client::Keepalive will think that you should
    be using the cached connection to the proxy and so Gungho will lose all
    parallelism.
  - Allow setting the number of PoCo::Client::HTTP to be spawned via
    client.spawn parameter. This is required if you're dealing with 
    relatively large amounts of URLs at once. Otherwise, PoCo::Client::HTTP
    will tend to jam up after a while.

0.09003_02 Thu Nov 08 2007 [rev 258]
  [Throttle]
  - Fix Throttling to delegate throttling decisions. This allows you to

Changes  view on Meta::CPAN

  - Implement plugins
  - Add G::Plugin::RequestTimer
  - Add deps features in Makefile.PL 

0.01 Sat 07 Apr 2007
  - handle_response() now takes $request and $response all over
  - Add send_request() in Gungho.pm, Gungho/Engine/POE.pm
  - Add notes() in Gungho/Request.pm. Cloning is properly handled

0.01_04 Sat 07 Apr 2007
  - Enable keepalive

0.01_03 Fri 06 Apr 2007
  - Fix embarrassing documentation whoopla. As stated, no,
    I'm not ashamed of stealing good code. 

0.01_02 Fri 06 Apr 2007
  - Add a new provider and small set of changes so that we can
    use this in Plagger
  - Use Class::Inspector to check if a module has been loaded

META.yml  view on Meta::CPAN

  HTTP::Headers::Util: 0
  HTTP::Parser: 0
  HTTP::Status: 0
  IO::Async::Buffer: 0
  IO::Socket::INET: 0
  MIME::Base64: 0
  Net::DNS: 0
  POE: 0.9999
  POE::Component::Client::DNS: 0
  POE::Component::Client::HTTP: 0.81
  POE::Component::Client::Keepalive: 0
  Path::Class::Dir: 0
  Regexp::Common: 0
  Sys::Hostname: 0
  Time::HiRes: 0
  URI: 0
  WWW::RobotRules::Parser: 0
  Web::Scraper::Config: 0
  XML::LibXML: 0
  YAML: 0
requires: 

META.yml  view on Meta::CPAN

  Exception::Class: 0
  FindBin: 0
  Getopt::Long: 0
  HTTP::Request: 0
  HTTP::Response: 0
  HTTP::Status: 0
  Log::Dispatch: 0
  POE: 0.9999
  POE::Component::Client::DNS: 0
  POE::Component::Client::HTTP: 0.81
  POE::Component::Client::Keepalive: 0
  Path::Class: 0
  Pod::Usage: 0
  Regexp::Common: 0
  Storable: 0
  UNIVERSAL::isa: 0.06
  UNIVERSAL::require: 0
  URI: 0
  perl: 5.8.0
tests: t/01_load.t t/02_config.t t/03_live/perl-proxy.t t/03_live/perl.t t/03_live/twitter.t t/99_kwalitee.t t/99_pod-coverage.t t/99_pod.t t/component/authentication/01_load.t t/component/cache/01_load.t t/component/robot_rules/01_load.t t/component...
version: 0.09006

Makefile.PL  view on Meta::CPAN

name("Gungho");
all_from("lib/Gungho.pm");

no_index( directory => 'examples' );

# This used to be optional, but we're making it mandatory.
# You're still free to use other engines, but we're forcing
# you to install POE so we can go ahead with testing
requires('POE', '0.9999');
requires('POE::Component::Client::DNS');
requires('POE::Component::Client::Keepalive');
requires('POE::Component::Client::HTTP', '0.81');

requires("Best");
requires("Class::Accessor::Fast");
requires("Class::C3::Componentised");
requires("Class::Data::Inheritable");
requires("Class::Inspector");
requires("Config::Any");
requires("Data::Dumper");
requires("Event::Notify", '0.00004');

deps/Engine-POE.yaml  view on Meta::CPAN

---
name: 'POE Engine'
default: 1
depends:
    'POE': 0.9999
    'POE::Component::Client::DNS': 0
    'POE::Component::Client::Keepalive': 0
    'POE::Component::Client::HTTP': 0.81

docs/ja/Gungho/Engine/POE.pod  view on Meta::CPAN

    config:
      loop_delay: 5 
      client:
        spawn: 2
        agent:
          - AgentName1
          - AgentName2
        max_size: 16384
        follow_redirect: 2
        proxy: http://localhost:8080
      keepalive:
        keep_alive: 10
        max_open: 200
        max_per_host: 20
        timeout: 10
      dns:
        # disable: 1 If you want to disable DNS resolution by Gungho

=head1 DESCRIPTION

Gungho::Engine::POE is a module for running Gungho on top of POE.

docs/ja/Gungho/Engine/POE.pod  view on Meta::CPAN

The default value is 2.

=head2 dns.disable

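Set this value to 1 if you want to disable DNS resolution by Gungho. In some
environments client-side DNS resolution fails even though a proxy along the
route is configured to resolve names, and this can cause problems. In that
case, set this option to 1.

The default value is 0.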
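
As a minimal sketch, assuming the engine configuration layout shown in the
synopsis of this document, disabling DNS resolution might look like this:

  engine:
    module: POE
    config:
      dns:
        disable: 1   # hand name resolution to the proxy instead of Gungho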

=head2 keepalive

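The C<keepalive> option is used to configure
POE::Component::Client::Keepalive.

This setting can become important when you run Gungho behind a proxy.
Gungho::Engine::POE is built to reuse open sockets internally, but when you
connect through a proxy there is only one target server, so every request
ends up using the same socket and parallel processing becomes impossible
(transparent proxies are not affected).

Gungho tries to apply this setting automatically when it detects that a proxy
is in use, but to specify it explicitly, write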

  keepalive:
    keep_alive: 0

To change the other POE::Component::Client::Keepalive settings, configure
them as follows:

  keepalive:
    max_per_host: ....
    max_open: ...
    timeout: ...

If you set C<keepalive.timeout>, see also
L<USING KEEPALIVE|USING KEEPALIVE>.

=head1 POE::Component::Client::HTTP AND DECODED CONTENTS

Since version 0.80, POE::Component::Client::HTTP sometimes decodes the fetched
response content into Perl Unicode on its own. When that happens, the data
you actually receive may be normalized Unicode even if the HTTP header itself
looks like this:

  Content-Type: text/html; charset=euc-jp

docs/ja/Gungho/Engine/POE.pod  view on Meta::CPAN

  user_agent: my_user_agent
  engine:
    module: POE
    ...

If you do not set this, problems may arise when you use components such as
RobotRules.

=head1 USING KEEPALIVE

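Gungho::Engine::POE uses POE::Component::Client::Keepalive internally to
control socket connections.

Most of its settings are not visible to the user apart from their effect on
performance, but the C<timeout> setting can affect how the engine terminates.
When the C<timeout> value is high, POE itself cannot stop until those
connections are closed. This is normal behavior.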

=head1 ENVIRONMENT VARIABLES

=head2 GUNGHO_ENGINE_POE_SKIP_DECODE_CONTENT

docs/ja/Gungho/Manual/FAQ.pod  view on Meta::CPAN

only.

=back

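=head1 Q. "Fetching requests seems to be taking a long time. What can I do about it?"

A variety of factors are involved, but start by checking the following points:

=head2 Is Your Data Set A Good Fit For Gungho?

Gungho uses an asynchronous engine and caches socket connections with
POE::Component::Client::Keepalive.

With this behavior you can expect good performance when crawling many
different hosts. Conversely, when crawling within a single host, for example,
performance may not improve as you expect unless you take care.

=head2 Gungho::Engine::POE And The loop_delay Setting

C<engine.config.loop_delay> specifies how long to wait on each loop. On every
loop, Gungho checks whether the Provider has requests to send next, and those requests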

docs/ja/Gungho/Manual/FAQ.pod  view on Meta::CPAN


Running a crawler through a proxy is common practice, but be aware that
performance may degrade when you do so with Gungho::Engine::POE.

When using a proxy, specify a configuration like the following:

  engine:
    module: POE
    config:
      keepalive:
        keep_alive: 0

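This setting disables the POE::Component::Client::Keepalive used by
POE::Component::Client::HTTP. Keepalive is a module that reuses socket
connections once they are established, but when you connect through a proxy
there is only one target server, so no parallel processing is possible at all.
With this setting, a new connection is made every time.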


=cut

docs/ja/Gungho/Manual/Install.pod  view on Meta::CPAN


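=head1 DEPENDENCIES

Gungho is built from a variety of modules, so it depends on many modules
published on CPAN. During installation, Gungho asks on screen whether each
group of these modules is needed. For example, if you will use the POE engine,
answer "y" when asked whether to install the POE-related modules.

  Example output: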
  [POE Engine]
  - POE::Component::Client::Keepalive ...missing
  - POE::Component::Client::DNS       ...missing
  - POE::Component::Client::HTTP      ...missing
  - POE                               ...missing
  ==> Auto-install the 1 optional module(s) from CPAN? [y] 

=cut

lib/Gungho/Engine/POE.pm  view on Meta::CPAN

# $Id: /mirror/gungho/lib/Gungho/Engine/POE.pm 39017 2008-01-16T16:05:45.674472Z lestrrat  $
#
# Copyright (c) 2007 Daisuke Maki <daisuke@endeworks.jp>
# All rights reserved.

package Gungho::Engine::POE;
use strict;
use warnings;
use base qw(Gungho::Engine);
use POE;
use POE::Component::Client::Keepalive;
use POE::Component::Client::HTTP;

__PACKAGE__->mk_accessors($_) for qw(alias loop_alarm loop_delay resolver clients);

use constant DEBUG => 0;
use constant UserAgentAlias => 'Gungho_Engine_POE_UserAgent_Alias';
use constant DnsResolverAlias => 'Gungho_Engine_POE_DnsResolver_Alias';
use constant SKIP_DECODE_CONTENT  =>
    exists $ENV{GUNGHO_ENGINE_POE_SKIP_DECODE_CONTENT} ?  $ENV{GUNGHO_ENGINE_POE_SKIP_DECODE_CONTENT} : 1;
use constant FORCE_ENCODE_CONTENT => 

lib/Gungho/Engine/POE.pm  view on Meta::CPAN

    $self->loop_delay( $self->config->{loop_delay} ) if $self->config->{loop_delay};
    $self->next::method(@_);
}

sub run
{
    my ($self, $c) = @_;

    my %config = %{ $self->config || {} };

    my $keepalive_config = delete $config{keepalive} || {};

    {
        my %defaults = (
            keep_alive   => 10,
            max_open     => 200,
            max_per_host => 5,
            timeout      => 10
        );
        while (my($key, $value) = each %defaults) {
            if (! defined $keepalive_config->{$key}) {
                $keepalive_config->{$key} = $value;
            }
        }
    }

    my $keepalive = POE::Component::Client::Keepalive->new(%$keepalive_config);

    my $client_config = delete $config{client} || {};
    foreach my $key (keys %$client_config) {
        if ($key =~ /^[a-z]/) { # ah, need to make this CamelCase
            my $camel = ucfirst($key);
            $camel =~ s/_(\w)/uc($1)/ge;
            $client_config->{$camel} = delete $client_config->{$key};
        }
    }

lib/Gungho/Engine/POE.pm  view on Meta::CPAN

    if ($spawn < 1) { $spawn = 2 }
    for my $i ( 1 .. $spawn ) {
        my $alias = join('-', &UserAgentAlias, $i);
        push @{ $self->clients }, $alias;
        POE::Component::Client::HTTP->spawn(
            FollowRedirects   => 1,
            Agent             => $c->user_agent,
            Timeout           => 60,
            %$client_config,
            Alias             => $alias,
            ConnectionManager => $keepalive,
        );
    }

    POE::Session->create(
        heap => { CONTEXT => $c },
        object_states => [
            $self => {
                _start => '_poe_session_start',
                _stop  => '_poe_session_stop',
                map { ($_ => "_poe_$_") }

lib/Gungho/Engine/POE.pm  view on Meta::CPAN

    config:
      loop_delay: 5 
      client:
        spawn: 2
        agent:
          - AgentName1
          - AgentName2
        max_size: 16384
        follow_redirect: 2
        proxy: http://localhost:8080
      keepalive:
        keep_alive: 10
        max_open: 200
        max_per_host: 20
        timeout: 10
      dns:
        # disable: 1 If you want to disable DNS resolution by Gungho


=head1 DESCRIPTION

Gungho::Engine::POE brings the full power of POE to Gungho.

lib/Gungho/Engine/POE.pm  view on Meta::CPAN

C<spawn> specifies the number of POE::Component::Client::HTTP sessions to start.
This will greatly affect your fetching speed, as PoCo::Client::HTTP tends to
start jamming up after a certain number of requests have been pushed onto
its queue.

If you feel like all of your other settings are correct but the actual
HTTP fetch is taking too long, try setting this number to something higher.

By default this is set to 2. 
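
As a rough sketch, assuming the configuration layout shown in the SYNOPSIS,
raising the number of sessions might look like this. The value 8 is an
arbitrary illustration, not a recommendation:

  engine:
    module: POE
    config:
      client:
        spawn: 8   # illustrative value; the default is 2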

=head2 keepalive.keep_alive

Specifies the number of seconds to keep a connection in the Keepalive
connection manager. 

This is an important option to tweak if you're using proxies. Even though
you might be accessing thousands of different URLs, POE will think that
you are in fact trying to connect to the same host because you're
accessing the same proxy.

Turn this to 0 if you are using a proxy.
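
A minimal sketch of a proxy setup, assuming the configuration layout shown in
the SYNOPSIS. The proxy URL is the placeholder used there, not a real
endpoint:

  engine:
    module: POE
    config:
      client:
        proxy: http://localhost:8080   # placeholder proxy address
      keepalive:
        keep_alive: 0                  # do not hold connections for reuse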

=head1 POE::Component::Client::HTTP AND DECODED CONTENTS

lib/Gungho/Engine/POE.pm  view on Meta::CPAN

to enable the workarounds:

  GUNGHO_ENGINE_POE_SKIP_DECODE_CONTENT = 1
  # or
  GUNGHO_ENGINE_POE_FORCE_ENCODE_CONTENT = 1

See L<ENVIRONMENT VARIABLES|ENVIRONMENT VARIABLES> for details.

=head1 USING KEEPALIVE

Gungho::Engine::POE uses PoCo::Client::Keepalive to control the connections.
For the most part this has no visible effect on the user, but the "timeout"
parameter dictates exactly how long the component waits for a new connection,
which means that, after it finishes fetching all the requests, the engine
waits for that amount of time before terminating. This is NORMAL.
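
If the wait at shutdown matters to you, one option is to lower the value. The
following is only a sketch, assuming the configuration layout shown in the
SYNOPSIS. The value 2 is an arbitrary example, and a timeout that is too low
may cause slow hosts to time out while connecting:

  engine:
    module: POE
    config:
      keepalive:
        timeout: 2   # illustrative value; the engine default is 10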

=head1 ENVIRONMENT VARIABLES

=head2 GUNGHO_ENGINE_POE_SKIP_DECODE_CONTENT

When set to a non-null value, this will install a new subroutine in

lib/Gungho/Manual/FAQ.pod  view on Meta::CPAN


  engine:
    module: POE

=head1 Q. "My requests are being served slowly. What can I do?"

There are actually a number of things that may affect fetch speed.

=head2 Is Gungho The Right Crawler For Your Data Set?

Gungho uses an asynchronous engine, and with POE::Component::Client::Keepalive
it reuses the connections to the same host. 

This kind of setup works great if you are accessing a lot of different hosts,
but could easily jam up if you are accessing, for example, a single host.
For such datasets, Gungho will be no more effective than a simple script
making repeated calls to LWP::UserAgent-E<gt>get().

=head2 Choosing The Right loop_delay With Gungho::Engine::POE

C<engine.config.loop_delay> specifies the number of seconds to wait between

lib/Gungho/Manual/FAQ.pod  view on Meta::CPAN


Do note, however, that excessive enqueueing of requests is going to be a
problem regardless. You should at least keep a mental note of how many requests
you're sending to the POE queue, and throttle as necessary.

=head2 Considerations When Using A Proxy With Gungho::Engine::POE

Proxies are great, and can be used in crawler applications, but by default
they don't play nicely with Gungho's POE engine.

The short version of the remedy is: Set engine.config.keepalive.keep_alive to 0

  engine:
    module: POE
    config:
      keepalive:
        keep_alive: 0

Now the long explanation. Gungho::Engine::POE, and POE::Component::Client::HTTP
which is used internally, use a module called POE::Component::Client::Keepalive
to manage the connections, and to reuse already established connections where
possible. However, when using a proxy, all the requests go through the given
proxy, so PoCo::Client::Keepalive will try to reuse the same connections for
all of the requests.

This is obviously a problem, because it makes the entire request set
go through the same connection, and therefore you lose all parallelism.

To work around this problem, you need to disable PoCo::Client::Keepalive,
hence the above configuration.

=cut

lib/Gungho/Request/http.pm  view on Meta::CPAN

# $Id: /mirror/gungho/lib/Gungho/Request/http.pm 2473 2007-09-04T07:08:58.221716Z lestrrat  $
#
# Copyright (c) 2007 Daisuke Maki <daisuke@endeworks.jp>
# All rights reserved.

package Gungho::Request::http;
use strict;
use warnings;
use base qw(Gungho::Base);

__PACKAGE__->mk_accessors($_) for qw(peer_http_version send_te keep_alive);

my $CRLF = "\015\012";

sub new
{
    my $class = shift;
    $class->next::method(peer_http_version => "1.0", send_te => 0, @_);
}

sub prepare_request{}

lib/Gungho/Request/http.pm  view on Meta::CPAN

    if ($given{te}) {
        push(@connection, "TE") unless grep lc($_) eq "te", @connection;
    } elsif ($self->send_te && zlib_ok()) {
        # gzip is less wanted since the Compress::Zlib interface for
        # it does not really allow chunked decoding to take place easily.
        push(@h2, "TE: deflate,gzip;q=0.3");
        push(@connection, "TE");
    }

    unless (grep lc($_) eq "close", @connection) {
        if ($self->keep_alive) {
            if ($peer_ver eq "1.0") {
                # from looking at Netscape's headers
                push(@h2, "Keep-Alive: 300");
                unshift(@connection, "Keep-Alive");
            }
        } else {
            push(@connection, "close") if $ver ge "1.1";
        }
    }
    push(@h2, "Connection: " . join(", ", @connection)) if @connection;


