App-Fetchware
lib/App/FetchwareX/HTMLPageSync.pm view on Meta::CPAN
# Please look up HTMLPageSync's documentation of its configuration file syntax at
# perldoc App::FetchwareX::HTMLPageSync, and only if its configuration file
# syntax is not malleable enough for your application should you resort to
# customizing fetchware's behavior. For extra flexible customization see perldoc
# App::FetchwareX::HTMLPageSync.
EOF
descriptions => {
page_name => <<EOA,
page_name simply names the HTML page the Fetchwarefile is responsible for
downloading, analyzing via optional callbacks, and copying to your
destination_directory.
EOA
html_page_url => <<EOA,
html_page_url is HTMLPageSync's lookup_url equivalent. It specifies an HTTP URL
that returns a page of HTML that can easily be parsed for links to download
later.
EOA
destination_directory => <<EOA,
destination_directory is the directory on your computer where you want the files
that you configure HTMLPageSync to parse to be copied to.
lib/App/FetchwareX/HTMLPageSync.pm view on Meta::CPAN
you run fetchware uninstall.
EOA
}
);
extension_name(__PACKAGE__);
opening_message(<<EOM);
HTMLPageSync's new command is not as sophisticated as Fetchware's. Unless you
only want to download images, you will have to get your hands dirty, and code up
some custom Perl callbacks to customize HTMLPageSync's behavior. However, it
will ask you quite nicely about the basic options, so if those are all you need,
then this command will successfully generate an HTMLPageSync Fetchwarefile for you.
After it lets you choose the easy options of page_name, html_page_url,
and destination_directory, it will give you an opportunity to modify the
user_agent string HTMLPageSync uses to avoid getting banned or having your
scraping stick out like a sore thumb in the target Web server's logs. Then,
you'll be asked about the advanced options. If you want them it will add generic
ones to the Fetchwarefile that you can then fill in later on when HTMLPageSync
asks you if you want to edit the generated Fetchwarefile manually. Finally,
lib/App/FetchwareX/HTMLPageSync.pm view on Meta::CPAN
page_name 'Cool Wallpapers';
html_page_url 'http://some-html-page-with-cool.urls';
destination_directory 'wallpapers';
# pretend to be firefox
user_agent 'Mozilla/5.0 (X11; Linux x86_64; rv:15.0) Gecko/20100101 Firefox/15.0.1';
# Customize the callbacks.
html_treebuilder_callback sub {
# Get one HTML::Element.
my $h = shift;
# Return true or false to indicate if this HTML::Element should be a
# download link.
if (something) {
return 'True';
} else {
return undef;
lib/App/FetchwareX/HTMLPageSync.pm view on Meta::CPAN
if ($link =~ /\.(pdf|doc|docx)$/) {
# ...return true...
return 'True';
} else {
# ...if not return false.
return undef; #false
}
}
};
=item B<7. Specify an optional download_links_callback>
C<download_links_callback> specifies an optional anonymous Perl subroutine
reference that will replace the default one that HTMLPageSync uses. The default
one removes the HTML::Element skin each download link is wrapped in, because of
the use of L<HTML::TreeBuilder>. This simply strips off the object-oriented crap
it's wrapped in, and turns it into a simple string scalar.
If you want to post process the download link in some other way, then just copy
and paste the code below into your Fetchwarefile, and add whatever other Perl
code you may need. This requires knowledge of the Perl programming language.
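As a minimal sketch of what such a post-processing callback could look like, the code below takes HTML::Element-like objects and returns plain URL strings, as described above. The assumption that the callback receives a list and returns a list is not confirmed by this excerpt, and C<My::FakeElement> is a hypothetical stand-in for HTML::Element so that the sketch runs without any CPAN modules. In a real Fetchwarefile the subroutine would be passed to C<download_links_callback> rather than stored in a variable.

```perl
use strict;
use warnings;

# Hypothetical stand-in for HTML::Element; only attr() is imitated.
package My::FakeElement;
sub new  { my ($class, %attr) = @_; return bless {%attr}, $class }
sub attr { my ($self, $name) = @_; return $self->{$name} }

package main;

# Assumed callback shape: element objects in, plain URL strings out.
my $download_links_callback = sub {
    my @elements = @_;
    my @urls;
    for my $element (@elements) {
        my $href = $element->attr('href');
        next unless defined $href;    # skip elements without an href
        $href =~ s/#.*\z//;           # extra post-processing: drop any fragment
        push @urls, $href;
    }
    return @urls;
};

my @urls = $download_links_callback->(
    My::FakeElement->new(href => 'http://example.com/a.pdf#page=2'),
    My::FakeElement->new(href => 'http://example.com/b.doc'),
    My::FakeElement->new(name => 'anchor-without-href'),
);
print "$_\n" for @urls;
```

The fragment stripping is only an illustration of "whatever other Perl code you may need"; any other string manipulation could go in its place.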
lib/App/FetchwareX/HTMLPageSync.pm view on Meta::CPAN
=head3 start() and end()
HTMLPageSync just imports start() and end() from App::Fetchware to take
advantage of their ability to manage a temporary directory.
=head3 lookup()
lookup() is overridden, and downloads the C<html_page_url>, which is the main
configuration option that HTMLPageSync uses. Then lookup() parses that
C<html_page_url>, and determines what the download urls should be. If the
C<html_treebuilder_callback> and C<download_links_callback> exist, then they are
called to customize lookup()'s default behavior. See their descriptions below.
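The order of operations described for lookup() can be sketched roughly as follows: parse the page, keep the links a filter callback accepts, then turn the resulting elements into strings. This is not HTMLPageSync's actual implementation. It needs HTML::TreeBuilder from CPAN, the HTML is a hypothetical inline page instead of a real download, and C<$keep> and C<$to_urls> are illustrative stand-ins for the two callbacks.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical page content standing in for what html_page_url would return.
my $html = <<'EOH';
<html><body>
<a href="http://example.com/a.pdf">A</a>
<a href="http://example.com/b.html">B</a>
<a href="http://example.com/c.docx">C</a>
</body></html>
EOH

# Stand-in for html_treebuilder_callback: a true value keeps the link.
my $keep = sub {
    my $h    = shift;
    my $href = $h->attr('href');
    return defined $href && $href =~ /\.(pdf|doc|docx)$/;
};

# Stand-in for download_links_callback: HTML::Element objects in, strings out.
my $to_urls = sub { return map { $_->attr('href') } @_ };

my $tree          = HTML::TreeBuilder->new_from_content($html);
my @elements      = $tree->look_down(_tag => 'a', $keep);
my @download_urls = $to_urls->(@elements);
$tree->delete;    # free the parse tree once the strings are extracted

print "$_\n" for @download_urls;
```

Passing the filter as a coderef criterion to look_down() lets HTML::TreeBuilder do the traversal, so the filter only ever sees one element at a time.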
=head3 download()
download() downloads the array ref of download links that lookup() returns.
=head3 verify()
verify() is overridden to do nothing.
lib/App/FetchwareX/HTMLPageSync.pm view on Meta::CPAN
As with the rest of App::Fetchware, App::Fetchware::Config does not return any
error codes; instead, all errors are die()'d if it's App::Fetchware::Config's
error, or croak()'d if it's the caller's fault. These exceptions are simple
strings, and are listed in the L</DIAGNOSTICS> section below.
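Because the exceptions are plain strings, they can be trapped with eval and inspected with a regular expression. The sketch below shows that pattern with a hypothetical helper, C<require_directory()>, which is not part of App::Fetchware and whose message is invented for illustration.

```perl
use strict;
use warnings;
use Carp qw(croak);

# Hypothetical helper, not part of App::Fetchware: it croak()s with a plain
# string when the caller passes bad input.
sub require_directory {
    my $dir = shift;
    croak "require_directory: no directory was given" unless defined $dir;
    return $dir;
}

# eval returns 1 only if the block finished without throwing.
my $ok    = eval { require_directory(undef); 1 };
my $error = $@;

if (not $ok) {
    # The exception is a simple string, so a regex is enough to inspect it.
    print "caught: $error" if $error =~ /no directory was given/;
}
```

Copying C<$@> into a lexical right after the eval avoids losing the message if later code runs another eval.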
=head1 CAVEATS
Certain features of App::FetchwareX::HTMLPageSync require knowledge of the Perl
programming language in order for you to make use of them. However, this is
limited to optional callbacks that are not needed for most uses. These features
are the C<html_treebuilder_callback> and C<download_links_callback> callbacks.
=head1 AUTHOR
David Yingling <deeelwy@gmail.com>
=head1 COPYRIGHT AND LICENSE
This software is copyright (c) 2016 by David Yingling.
This is free software; you can redistribute it and/or modify it under
lib/Test/Fetchware.pm view on Meta::CPAN
test end() with a simple C<ok(not -e $temp_dir, $test_name);>; instead, you should
use this testing subroutine. It tests if the specified $temp_dir still has a
locked C<'fetchware.sem'> fetchware semaphore file. If the file is not locked,
then end_ok() reports success, but if it cannot obtain a lock, end_ok() reports
failure simply using ok().
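The lock check itself can be sketched with flock() in non-blocking mode. This is an illustration of the technique and not end_ok()'s actual code: C<semaphore_is_unlocked()> is a hypothetical name, and opening the file in append mode (which creates it if missing) is an assumption made to keep the sketch short.

```perl
use strict;
use warnings;
use Fcntl qw(:flock);
use File::Spec;
use File::Temp qw(tempdir);

# Returns 1 when an exclusive, non-blocking lock on the semaphore file can
# be obtained, which is the condition the text describes as success.
sub semaphore_is_unlocked {
    my $temp_dir = shift;
    my $sem_path = File::Spec->catfile($temp_dir, 'fetchware.sem');
    open my $fh, '>>', $sem_path or return;    # cannot open: treat as failure
    my $got_lock = flock($fh, LOCK_EX | LOCK_NB);
    close $fh;                                 # closing releases the lock
    return $got_lock ? 1 : 0;
}

my $temp_dir = tempdir(CLEANUP => 1);
print semaphore_is_unlocked($temp_dir) ? "not locked\n" : "locked\n";
```

LOCK_NB makes flock() return immediately instead of waiting, so a lock held elsewhere shows up as a false return value.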
=head2 add_prefix_if_nonroot()
my $prefix = add_prefix_if_nonroot();
my $callbacks_return_value = add_prefix_if_nonroot(sub { a callback });
fetchware is designed to be run as root, and to install system software in
system directories requiring root privileges. But, fetchware is flexible enough
to let you specify where you want the software you're going to install to be
installed via the prefix configuration option. This subroutine when run creates
a temporary directory in File::Spec's tmpdir(), and then it directly runs
config() itself to create this config option for you.
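A rough sketch of that behavior, under several assumptions: C<set_config()> is a hypothetical stand-in for fetchware's config(), the subroutine is assumed to do nothing when run as root, and a supplied coderef is assumed to be handed the temporary directory path, which this excerpt does not confirm.

```perl
use strict;
use warnings;
use File::Spec;
use File::Temp qw(tempdir);

# Hypothetical stand-in for fetchware's config().
my %config;
sub set_config { my ($key, $value) = @_; $config{$key} = $value; return $value }

sub prefix_if_nonroot_sketch {
    my $callback = shift;
    return if $> == 0;    # effective UID 0 is root: assumed to need no prefix

    # Create the temporary prefix directory inside File::Spec's tmpdir().
    my $prefix = tempdir('fetchware-prefix-XXXXXX',
        DIR => File::Spec->tmpdir(), CLEANUP => 1);

    # With a coderef, let the caller decide what to do with the directory.
    return $callback->($prefix) if ref $callback eq 'CODE';

    # Otherwise set the prefix option directly.
    return set_config(prefix => $prefix);
}

my $prefix = prefix_if_nonroot_sketch();
print defined $prefix ? "prefix: $prefix\n" : "running as root, no prefix set\n";
```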
However, if you supply a coderef, add_prefix_if_nonroot() will call your
coderef instead of using config() directly. If your callback returns a scalar