App-Fetchware
lib/App/FetchwareX/HTMLPageSync.pm view on Meta::CPAN
our %EXPORT_TAGS = (
TESTING => [qw(
get_html_page_url
get_destination_directory
ask_about_keep_destination_directory
new
new_install
)]
);
our @EXPORT_OK = map {@{$_}} values %EXPORT_TAGS;
sub new {
my ($term, $page_name) = @_;
# Instantiate a new Fetchwarefile object for managing and generating a
# Fetchwarefile, which we'll write to a file for the user or use to
# build an associated Fetchware package.
my $now = localtime;
my $fetchwarefile = App::Fetchware::Fetchwarefile->new(
header => <<EOF,
use App::FetchwareX::HTMLPageSync;
# Auto generated $now by HTMLPageSync's fetchware new command.
# However, feel free to edit this file if HTMLPageSync's new command's
# autoconfiguration is not enough.
#
# Please look up HTMLPageSync's documentation of its configuration file syntax at
# perldoc App::FetchwareX::HTMLPageSync, and only if its configuration file
# syntax is not malleable enough for your application should you resort to
# customizing fetchware's behavior. For extra flexible customization see perldoc
# App::FetchwareX::HTMLPageSync.
EOF
descriptions => {
page_name => <<EOA,
page_name simply names the HTML page the Fetchwarefile is responsible for
downloading, analyzing via optional callbacks, and copying to your
destination_directory.
EOA
html_page_url => <<EOA,
html_page_url is HTMLPageSync's lookup_url equivalent. It specifies an HTTP URL
that returns a page of HTML that can be easily parsed for links to download
later.
EOA
destination_directory => <<EOA,
destination_directory is the directory on your computer where you want the files
that you configure HTMLPageSync to parse to be copied to.
EOA
user_agent => <<EOA,
user_agent, if specified, will be passed to HTTP::Tiny, the Perl HTTP library
Fetchware uses, where the library will lie to the Web server you are Web
scraping from to hopefully prevent the Web server from banning you, or updating
the page you want to scrape to use too much Javascript, which would prevent the
simple parser HTMLPageSync uses from working on the specified html_page_url.
EOA
html_treebuilder_callback => <<EOA,
html_treebuilder_callback allows you to specify a perl CODEREF that HTMLPageSync
will execute instead of its default callback that just looks for images.
It receives one parameter, which is an HTML::Element at the first C<a>,
anchor/link tag.
It must [return 'True';] to indicate that that link should be included in the
list of download links, or return false, [return undef], to indicate that that
link should not be included in the list of download links.
EOA
download_links_callback => <<EOA,
download_links_callback specifies an optional callback that will allow you to do
post processing of the list of downloaded urls. This is needed, because the
results of the html_treebuilder_callback are still HTML::Element objects that
need to be converted to just string download urls. That is what the default
C<download_links_callback> does.
It receives a list of all of the download HTML::Elements that
C<html_treebuilder_callback> returned true on. It is called only once, and
should return a list of string download links for download later by
HTMLPageSync.
EOA
keep_destination_directory => <<EOA,
keep_destination_directory is a boolean true or false configuration option that
when true prevents HTMLPageSync from deleting your destination_directory when
you run fetchware uninstall.
EOA
}
);
extension_name(__PACKAGE__);
opening_message(<<EOM);
HTMLPageSync's new command is not as sophisticated as Fetchware's. Unless you
only want to download images, you will have to get your hands dirty, and code up
some custom Perl callbacks to customize HTMLPageSync's behavior. However, it
will ask you quite nicely the basic options, so if those are all you need, then
this command will successfully generate a HTMLPageSync Fetchwarefile for you.
After it lets you choose the easy options of page_name, html_page_url,
and destination_directory, it will give you an opportunity to modify the
user_agent string HTMLPageSync uses to avoid getting banned or having your
scraping stick out like a sore thumb in the target Web server's logs. Then,
you'll be asked about the advanced options. If you want them it will add generic
ones to the Fetchwarefile that you can then fill in later on when HTMLPageSync
asks you if you want to edit the generated Fetchwarefile manually. Finally,
after your Fetchwarefile is generated HTMLPageSync will ask you if you would
like to install your generated Fetchwarefile to test it out.
EOM
# Ask the user for the basic configuration options.
$page_name = fetchwarefile_name(page_name => $page_name);
vmsg "Determined your page_name option to be [$page_name]";
$fetchwarefile->config_options(page_name => $page_name);
vmsg "Appended page_name [$page_name] configuration option to Fetchwarefile";
my $html_page_url = get_html_page_url($term);
vmsg "Asked user for html_page_url [$html_page_url].";
$fetchwarefile->config_options(html_page_url => $html_page_url);
vmsg "Appended html_page_url [$html_page_url] configuration option to Fetchwarefile";
my $destination_directory = get_destination_directory($term);
vmsg "Asked user for destination_directory [$destination_directory].";
$fetchwarefile->config_options(destination_directory => $destination_directory);
vmsg <<EOM;
Appended destination_directory [$destination_directory] configuration option to
your Fetchwarefile.
EOM
# Asks and sets the keep_destination_directory configuration option if the
# user wants to set it.
ask_about_keep_destination_directory($term, $fetchwarefile);
vmsg 'Prompting for other options that may be needed.';
my $other_options_hashref = prompt_for_other_options($term,
user_agent => {
prompt => <<EOP,
What user_agent configuration option would you like?
EOP
print_me => <<EOP
user_agent, if specified, will be passed to HTTP::Tiny, the Perl HTTP library
Fetchware uses, where the library will lie to the Web server you are Web
scraping from to hopefully prevent the Web server from banning you, or updating
the page you want to scrape to use too much Javascript, which would prevent the
simple parser HTMLPageSync uses from working on the specified html_page_url.
EOP
},
html_treebuilder_callback => {
prompt => <<EOP,
What html_treebuilder_callback configuration option would you like?
EOP
print_me => <<EOP,
html_treebuilder_callback allows you to specify a perl CODEREF that HTMLPageSync
will execute instead of its default callback that just looks for images.
It receives one parameter, which is an HTML::Element at the first C<a>,
anchor/link tag.
It must [return 'True';] to indicate that that link should be included in the
list of download links, or return false, [return undef], to indicate that that
link should not be included in the list of download links.
Because Term::UI's input is limited to just one line, please just press enter,
and a dummy value will go into your Fetchwarefile, where you can then replace
that dummy value with a proper Perl callback next, when Fetchware gives you the
option to edit your Fetchwarefile manually.
EOP
default => 'sub { my $h = shift; die "Dummy placeholder fill me in."; }',
},
download_links_callback => {
prompt => <<EOP,
What download_links_callback configuration option would you like?
EOP
print_me => <<EOP,
download_links_callback specifies an optional callback that will allow you to do
post processing of the list of downloaded urls. This is needed, because the
results of the html_treebuilder_callback are still HTML::Element objects that
need to be converted to just string download urls. That is what the default
C<download_links_callback> does.
It receives a list of all of the download HTML::Elements that
C<html_treebuilder_callback> returned true on. It is called only once, and
should return a list of string download links for download later by
HTMLPageSync.
Because Term::UI's input is limited to just one line, please just press enter,
and a dummy value will go into your Fetchwarefile, where you can then replace
that dummy value with a proper Perl callback next, when Fetchware gives you the
option to edit your Fetchwarefile manually.
EOP
default => 'sub { my @download_urls = @_; die "Dummy placeholder fill me in."; }',
},
);
vmsg 'User entered the following options.';
vmsg Dumper($other_options_hashref);
# Append all other options to the Fetchwarefile.
$fetchwarefile->config_options(%$other_options_hashref);
vmsg 'Appended all other options listed above to Fetchwarefile.';
my $edited_fetchwarefile = edit_manually($term, $fetchwarefile);
vmsg <<EOM;
Asked user if they would like to edit their generated Fetchwarefile manually.
EOM
# Generate Fetchwarefile.
# If edit_manually() did not modify the Fetchwarefile, then generate it.
if (blessed($edited_fetchwarefile)
and
$edited_fetchwarefile->isa('App::Fetchware::Fetchwarefile')) {
$fetchwarefile = $fetchwarefile->generate();
# If edit_manually() modified the Fetchwarefile, then do not generate it,
# and replace the Fetchwarefile object with the new string that represents
# the user's edited Fetchwarefile.
} else {
verify() is overridden to do nothing.
=head3 build()
build() is overridden to do nothing.
=head3 install()
install() takes its argument, which is an arrayref of the paths of the
files that were downloaded to the tempdir created by start(), and copies them to
the user's provided C<destination_directory>.
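As a rough illustration of the copy step described above, the following sketch copies every path in an arrayref into a destination directory using only core modules. The subroutine name C<copy_downloads> and its arguments are hypothetical; this is not HTMLPageSync's actual implementation.

```perl
use strict;
use warnings;
use File::Copy qw(copy);
use File::Path qw(make_path);
use File::Basename qw(basename);
use File::Spec;

# Hypothetical helper: copy each downloaded file into $destination_directory
# and return the list of paths that were written.
sub copy_downloads {
    my ($download_file_paths, $destination_directory) = @_;

    # Create the destination if it does not exist yet.
    make_path($destination_directory) unless -d $destination_directory;

    my @copied;
    for my $path (@$download_file_paths) {
        my $target
            = File::Spec->catfile($destination_directory, basename($path));
        copy($path, $target)
            or die "Failed to copy [$path] to [$target]: $!";
        push @copied, $target;
    }
    return @copied;
}
```

Dying on the first failed copy mirrors the exception-based error handling the rest of this documentation describes, but a real implementation may choose differently.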
=head3 end() and start()
HTMLPageSync just imports end() and start() from App::Fetchware to take
advantage of their ability to manage a temporary directory.
=head3 uninstall()
uninstall() recursively deletes your C<destination_directory> where it stores
whatever links you choose to download unless of course the
C<keep_destination_directory> configuration option is set to true.
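The behavior described above can be sketched with the core File::Path module. The helper name C<remove_destination> and its two arguments are assumptions made for this example only; the module's own uninstall() may be structured differently.

```perl
use strict;
use warnings;
use File::Path qw(remove_tree);

# Hypothetical helper: recursively delete $destination_directory unless the
# caller asked to keep it. Returns 1 if the directory is gone, 0 otherwise.
sub remove_destination {
    my ($destination_directory, $keep_destination_directory) = @_;

    # A true keep flag means the user's files must be left alone.
    return 0 if $keep_destination_directory;

    remove_tree($destination_directory);
    return -e $destination_directory ? 0 : 1;
}
```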
=head3 upgrade()
Determines if any looked up URLs have not been downloaded yet, and returns true
if that is the case.
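One way to picture this check is to compare the file name at the end of each looked-up URL against the files already present in the destination directory. The sketch below does exactly that under two stated assumptions: that a file's name equals the last path segment of its URL, and that URLs carry no query string. The helper name C<urls_not_yet_downloaded> is hypothetical, and upgrade() itself may use a different comparison.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper: return the URLs whose file name is not yet present in
# $destination_directory. An empty list means there is nothing new to fetch.
sub urls_not_yet_downloaded {
    my ($destination_directory, @download_urls) = @_;

    my @missing;
    for my $url (@download_urls) {
        # Assume the last path segment of the URL is the file name.
        my ($filename) = $url =~ m{([^/]+)$};
        next unless defined $filename;

        my $path = File::Spec->catfile($destination_directory, $filename);
        push @missing, $url unless -e $path;
    }
    return @missing;
}
```

An upgrade-style check would then be true whenever the returned list is not empty.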
=head2 App::FetchwareX::HTMLPageSync's Configuration Subroutines
Because HTMLPageSync is an App::Fetchware extension, it cannot just use the same
configuration subroutines that App::Fetchware uses. Instead, it must create its
own configuration subroutines with App::Fetchware::CreateConfigOptions. These
configuration subroutines are the configuration options that you use in your
Fetchwarefile.
=head3 page_name [MANDATORY]
HTMLPageSync's equivalent to App::Fetchware's C<program_name>. It's simply the
name of the page or what you want to download on that page.
=head3 html_page_url [MANDATORY]
HTMLPageSync's equivalent to App::Fetchware's C<lookup_url>, and is just as
mandatory. This is the url of the HTML page that will be downloaded and
processed.
=head3 destination_directory [MANDATORY]
This option is also mandatory, and it specifies the directory where the files
that you want to download are downloaded to.
=head3 user_agent [OPTIONAL]
This option is optional, and it allows you to have HTTP::Tiny pretend to be a
Web browser or perhaps a bot if you want to.
=head3 html_treebuilder_callback [OPTIONAL]
This optional option allows you to specify a perl C<CODEREF> that lookup() will
execute instead of its default callback that just looks for images.
It receives one parameter, which is an HTML::Element at the first C<a>,
anchor/link tag.
It must C<return 'True';> to indicate that that link should be included in the
list of download links, or return false, C<return undef>, to indicate that that
link should not be included in the list of download links.
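A minimal sketch of such a callback, assuming you want PDF links rather than images. It relies only on HTML::Element's C<attr()> method; the variable name and the C<.pdf> filter are illustrative choices, not part of HTMLPageSync.

```perl
use strict;
use warnings;

# Example callback: accept only links whose href ends in .pdf
# (case-insensitive). $element is expected to be the HTML::Element
# for one <a> tag.
my $html_treebuilder_callback = sub {
    my $element = shift;

    my $href = $element->attr('href');

    # An anchor without an href cannot be downloaded.
    return undef unless defined $href;

    return $href =~ /\.pdf$/i ? 'True' : undef;
};
```

In a Fetchwarefile the anonymous sub would be passed to the C<html_treebuilder_callback> option instead of being stored in a variable.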
=head3 download_links_callback [OPTIONAL]
This optional option specifies an optional callback that will allow you to do
post processing of the list of downloaded urls. This is needed, because the
results of the C<html_treebuilder_callback> are still HTML::Element objects that
need to be converted to just string download urls. That is what the default
C<download_links_callback> does.
It receives a list of all of the download HTML::Elements that
C<html_treebuilder_callback> returned true on. It is called only once, and
should return a list of string download links for download later by HTTP::Tiny
in download().
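The conversion described above can be sketched as follows. This example turns each element into its C<href> string, skips anchors without one, and drops duplicate URLs. The deduplication step is an addition for illustration, and the variable name is hypothetical; the default callback may do less.

```perl
use strict;
use warnings;

# Example callback: receives the HTML::Element objects that the tree callback
# accepted and returns plain URL strings.
my $download_links_callback = sub {
    my @elements = @_;

    my %seen;
    return grep { !$seen{$_}++ }      # keep the first copy of each URL
           grep { defined $_ }        # skip anchors that have no href
           map  { scalar $_->attr('href') } @elements;
};
```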
=head3 keep_destination_directory [OPTIONAL]
This optional option is a boolean true or false configuration option that
when true prevents HTMLPageSync from deleting your destination_directory when
you run fetchware uninstall.
Its default is false, so by default HTMLPageSync B<will> delete your files from
your C<destination_directory> unless you set this to true.
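Putting the options together, a small Fetchwarefile might look like the sketch below. It assumes the option-name-then-value syntax documented in perldoc App::FetchwareX::HTMLPageSync, and every value shown is a placeholder to replace with your own.

```perl
use App::FetchwareX::HTMLPageSync;

# The three mandatory options. All values are placeholders.
page_name 'Example Wallpapers';
html_page_url 'http://example.com/wallpapers.html';
destination_directory '/home/user/wallpapers';

# Optional: keep the downloaded files when fetchware uninstall is run.
keep_destination_directory 'True';
```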
=head1 ERRORS
As with the rest of App::Fetchware, App::FetchwareX::HTMLPageSync does not
return any error codes; instead, all errors are die()'d if it's
App::FetchwareX::HTMLPageSync's error, or croak()'d if it's the caller's fault.
These exceptions are simple strings, and are listed in the L</DIAGNOSTICS>
section below.
=head1 CAVEATS
Certain features of App::FetchwareX::HTMLPageSync require knowledge of the Perl
programming language in order for you to make use of them. However, this is
limited to optional callbacks that are not needed for most uses. These features
are the C<html_treebuilder_callback> and C<download_links_callback> callbacks.
=head1 AUTHOR
David Yingling <deeelwy@gmail.com>
=head1 COPYRIGHT AND LICENSE
This software is copyright (c) 2016 by David Yingling.
This is free software; you can redistribute it and/or modify it under
the same terms as the Perl 5 programming language system itself.
=cut
__END__