App-Fetchware


lib/App/FetchwareX/HTMLPageSync.pm  view on Meta::CPAN

        uninstall upgrade)]
;


# Use App::Fetchware::CreateConfigOptions to build our App::Fetchware
# configuration options for us. These are subroutines with correct prototypes to
# turn a perl code file into something that resembles a configuration file.
use App::Fetchware::CreateConfigOptions
    ONE => [qw(
        page_name
        html_page_url
        destination_directory
        user_agent
        html_treebuilder_callback
        download_links_callback
    )],
    BOOLEAN => [qw(keep_destination_directory)]
;


use Exporter 'import';
our %EXPORT_TAGS = (
    TESTING => [qw(
        get_html_page_url
        get_destination_directory
        ask_about_keep_destination_directory
        new
        new_install
    )]
);
our @EXPORT_OK = map {@{$_}} values %EXPORT_TAGS;





sub new {
    my ($term, $page_name) = @_;

    # Instantiate a new Fetchwarefile object for managing and generating a
    # Fetchwarefile, which we'll write to a file for the user or use to
    # build an associated Fetchware package.
    my $now = localtime;
    my $fetchwarefile = App::Fetchware::Fetchwarefile->new(
        header => <<EOF,
use App::FetchwareX::HTMLPageSync;
# Auto generated $now by HTMLPageSync's fetchware new command.
# However, feel free to edit this file if HTMLPageSync's new command's
# autoconfiguration is not enough.
# 
# Please look up HTMLPageSync's documentation of its configuration file syntax at
# perldoc App::FetchwareX::HTMLPageSync, and only if its configuration file
# syntax is not malleable enough for your application should you resort to
# customizing fetchware's behavior. For extra flexible customization see perldoc
# App::FetchwareX::HTMLPageSync.
EOF
        descriptions => {

            page_name => <<EOA,
page_name simply names the HTML page the Fetchwarefile is responsible for
downloading, analyzing via optional callbacks, and copying to your
destination_directory.
EOA
            html_page_url => <<EOA,
html_page_url is HTMLPageSync's lookup_url equivalent. It specifies an HTTP URL
that returns a page of HTML that can be easily parsed for links to download
later.
EOA
            destination_directory => <<EOA,
destination_directory is the directory on your computer where you want the files
that you configure HTMLPageSync to parse to be copied to.
EOA
            user_agent => <<EOA,
user_agent, if specified, will be passed to HTTP::Tiny, the Perl HTTP library
Fetchware uses, where the library will lie to the Web server you are Web
scraping from to hopefully prevent the Web server from banning you, or updating
the page you want to scrape to use too much JavaScript, which would prevent the
simple parser HTMLPageSync uses from working on the specified html_page_url.
EOA
            html_treebuilder_callback => <<EOA,
html_treebuilder_callback allows you to specify a perl CODEREF that HTMLPageSync
will execute instead of its default callback that just looks for images.

It receives one parameter, which is an HTML::Element at the first C<a>,
anchor/link tag.

It must [return 'True';] to indicate that that link should be included in the
list of download links, or return false, [return undef], to indicate that that
link should not be included in the list of download links.
EOA
            download_links_callback => <<EOA,
download_links_callback specifies an optional callback that will allow you to do
post processing of the list of downloaded urls. This is needed, because the
results of the html_treebuilder_callback are still HTML::Element objects that
need to be converted to just string download urls. That is what the default
C<download_links_callback> does.

It receives a list of all of the download HTML::Elements that
C<html_treebuilder_callback> returned true on. It is called only once, and
should return a list of string download links for download later by
HTMLPageSync.
EOA
            keep_destination_directory => <<EOA,
keep_destination_directory is a boolean true or false configuration option that
when true prevents HTMLPageSync from deleting your destination_directory when
you run fetchware uninstall.
EOA
        }
    );

    extension_name(__PACKAGE__);

    opening_message(<<EOM);
HTMLPageSync's new command is not as sophisticated as Fetchware's. Unless you
only want to download images, you will have to get your hands dirty, and code up
some custom Perl callbacks to customize HTMLPageSync's behavior. However, it
will ask you quite nicely for the basic options, so if those are all you need,
then this command will successfully generate an HTMLPageSync Fetchwarefile for
you.

After it lets you choose the easy options of page_name, html_page_url,
and destination_directory, it will give you an opportunity to modify the
user_agent string HTMLPageSync uses to avoid getting banned or having your
scraping stick out like a sore thumb in the target Web server's logs. Then,
you'll be asked about the advanced options. If you want them it will add generic
ones to the Fetchwarefile that you can then fill in later on when HTMLPageSync
asks you if you want to edit the generated Fetchwarefile manually.  Finally,
after your Fetchwarefile is generated HTMLPageSync will ask you if you would
like to install your generated Fetchwarefile to test it out.
EOM

    # Ask the user for the basic configuration options.
    $page_name = fetchwarefile_name(page_name => $page_name);
    vmsg "Determined your page_name option to be [$page_name]";

    $fetchwarefile->config_options(page_name => $page_name);
    vmsg "Appended page_name [$page_name] configuration option to Fetchwarefile";

    my $html_page_url = get_html_page_url($term);
    vmsg "Asked user for html_page_url [$html_page_url].";

    $fetchwarefile->config_options(html_page_url => $html_page_url);
    vmsg "Appended html_page_url [$html_page_url] configuration option to Fetchwarefile";

    my $destination_directory = get_destination_directory($term);
    vmsg "Asked user for destination_directory [$destination_directory].";

    $fetchwarefile->config_options(destination_directory => $destination_directory);
    vmsg <<EOM;
Appended destination_directory [$destination_directory] configuration option to
your Fetchwarefile.
EOM

    # Asks and sets the keep_destination_directory configuration option if the
    # user wants to set it.
    ask_about_keep_destination_directory($term, $fetchwarefile);

    vmsg 'Prompting for other options that may be needed.';
    my $other_options_hashref = prompt_for_other_options($term,
        user_agent => {
            prompt => <<EOP,
What user_agent configuration option would you like? 
EOP
            print_me => <<EOP
user_agent, if specified, will be passed to HTTP::Tiny, the Perl HTTP library
Fetchware uses, where the library will lie to the Web server you are Web
scraping from to hopefully prevent the Web server from banning you, or updating
the page you want to scrape to use too much JavaScript, which would prevent the
simple parser HTMLPageSync uses from working on the specified html_page_url.
EOP
        },
        html_treebuilder_callback => {
            prompt => <<EOP,
What html_treebuilder_callback configuration option would you like? 
EOP
            print_me => <<EOP,
html_treebuilder_callback allows you to specify a perl CODEREF that HTMLPageSync

lib/App/FetchwareX/HTMLPageSync.pm  view on Meta::CPAN

sub upgrade {
    my $download_path = shift; # $fetchware_package_path is not used in HTMLPageSync.

    # Get the listing of already downloaded file names. glob() needs a wildcard
    # to list the files inside destination_directory instead of just returning
    # the directory itself.
    my @installed_downloads = glob(config('destination_directory') . '/*');

    # Preprocess both @$download_path and @installed_downloads to ensure that
    # URL crap or differing full paths won't screw up the "comparisons". The
    # clever delete hashslice does the "comparisons" if you will.
    my @download_path_filenames
        = map { ( splitpath( ( uri_split($_) )[2] ) )[2] } @$download_path;
    my @installed_downloads_filenames = map { ( splitpath($_) )[2] }
        @installed_downloads;

    # Determine what files are in @$download_path, but not in
    # @installed_downloads. Each file name maps back to its full URL, so the
    # URLs that survive the delete are the ones that still need downloading.
    # Algo based on code from Perl Cookbook pg. 126.
    my %seen;
    @seen{@download_path_filenames} = @$download_path;
    delete @seen{@installed_downloads_filenames};

    my @new_urls_to_download = values %seen;

    if (@new_urls_to_download > 0) {
        # Alter $download_path in place to only list @new_urls_to_download.
        # That way download() only downloads the new URLs not the already
        # downloaded ones again. Assigning a new array ref to the lexical
        # would not be visible to the caller.
        @$download_path = @new_urls_to_download;

        return 'New URLs Found.';
    } else {
        return;
    }
}


1;

=pod

=head1 NAME

App::FetchwareX::HTMLPageSync - An App::Fetchware extension that downloads files based on an HTML page.

=head1 VERSION

version 1.016

=head1 SYNOPSIS

=head2 Example App::FetchwareX::HTMLPageSync Fetchwarefile.

    page_name 'Cool Wallpapers';

    html_page_url 'http://some-html-page-with-cool.urls';

    destination_directory 'wallpapers';

    # pretend to be firefox
    user_agent 'Mozilla/5.0 (X11; Linux x86_64; rv:15.0) Gecko/20100101 Firefox/15.0.1';

    # Customize the callbacks.
    html_treebuilder_callback sub {
        # Get one HTML::Element.
        my $h = shift;

        # Return true or false to indicate if this HTML::Element should be a
        # download link. Replace the something placeholder with your own test.
        if (something) {
            return 'True';
        } else {
            return undef;
        }
    };

    download_links_callback sub {
        my @download_urls = @_;

        my @wanted_download_urls;
        for my $link (@download_urls) {
            # Pick ones to keep.
            push @wanted_download_urls, $link;
        }

        return @wanted_download_urls;
    };

=head2 App::FetchwareX::HTMLPageSync App::Fetchware-like API.

    my $temp_dir = start();

    my $download_url = lookup();

    download($temp_dir, $download_url);

    verify($download_url, $package_path);

    unarchive($package_path);

    build($build_path);

    install();

    uninstall($build_path);

=head1 MOTIVATION

I want to automatically parse a Web page with links to wallpapers that I want
to download, and I want software to do it for me. That's where this
App::Fetchware extension comes in.

=head1 DESCRIPTION

App::FetchwareX::HTMLPageSync is an example App::Fetchware extension. It's not
a large extension, but instead is a simple one meant to show how easy it is to
extend App::Fetchware.

App::FetchwareX::HTMLPageSync parses the Web page you specify to create a list of
download links. Then it downloads those links, and installs them to your
C<destination_directory>.

In order to use App::FetchwareX::HTMLPageSync to help you mirror the download

lib/App/FetchwareX/HTMLPageSync.pm  view on Meta::CPAN


=item B<5. Specify an optional user_agent>

Many sites don't like bots downloading stuff from them and wasting their
bandwidth, and will even limit what you can do based on your user agent, which
is the HTTP standard's name for your browser. This option allows you to pretend
to be something other than HTMLPageSync's underlying library, L<HTTP::Tiny>.
Just copy and paste the example below, and paste what you want your user agent
to be between the single quotes C<'> as before.

    user_agent '';

And after pasting.

    user_agent 'Mozilla/5.0 (X11; Linux x86_64; rv:15.0) Gecko/20100101 Firefox/15.0.1';

=item B<6. Specify an optional html_treebuilder_callback>

C<html_treebuilder_callback> specifies an optional anonymous Perl subroutine
reference that will replace the default one that HTMLPageSync uses. The default
one limits the download to only image format links, which is flexible enough for
downloading wallpapers.

If you want to download something different, then paste the example below in
your Fetchwarefile.

    html_treebuilder_callback sub {
        # Get one HTML::Element.
        my $h = shift;

        # Return true or false to indicate if this HTML::Element should be a
        # download link. Replace the something placeholder with your own test.
        if (something) {
            return 'True';
        } else {
            return undef;
        }
    };

Then create a Perl anonymous subroutine C<CODEREF> that will be executed
instead of the default one. This requires knowledge of the Perl programming
language. The one below limits itself to only PDFs and MS Word documents.

    # Download pdfs and word documents only.
    html_treebuilder_callback sub {
        my $tag = shift;
        my $link = $tag->attr('href');
        if (defined $link) {
            # If the anchor tag links to a PDF or Word document...
            if ($link =~ /\.(pdf|doc|docx)$/) {
                # ...return true...
                return 'True';
            } else {
                # ...if not return false.
                return undef; #false
            }
        }
    };

=item B<7. Specify an optional download_links_callback>

C<download_links_callback> specifies an optional anonymous Perl subroutine
reference that will replace the default one that HTMLPageSync uses. The default
one removes the HTML::Element skin each download link is wrapped in, because of
the use of L<HTML::TreeBuilder>. This simply strips off the object-oriented crap
it's wrapped in, and turns it into a simple string scalar.

If you want to post process the download link in some other way, then just copy
and paste the code below into your Fetchwarefile, and add whatever other Perl
code you may need. This requires knowledge of the Perl programming language.

    download_links_callback sub {
        my @download_urls = @_;

        my @wanted_download_urls;
        for my $link (@download_urls) {
            # Pick ones to keep.
            push @wanted_download_urls, $link;
        }

        return @wanted_download_urls;
    };

=back

=head1 USING YOUR App::FetchwareX::HTMLPageSync FETCHWAREFILE WITH FETCHWARE

After you have
L<created your Fetchwarefile|/"CREATING A App::FetchwareX::HTMLPageSync FETCHWAREFILE">
as shown above you need to actually use the fetchware command line program to
install, upgrade, and uninstall your App::FetchwareX::HTMLPageSync Fetchwarefile.

Take note how fetchware's package management metaphor does not quite line up
with what App::FetchwareX::HTMLPageSync does. Why would an HTML page mirroring
script be installed, upgraded, or uninstalled? Well, HTMLPageSync simply adapts
fetchware's package management metaphor to its own environment, performing the
likely action when one of fetchware's behaviors is executed.

=over

=item B<new>

A C<fetchware new> will cause HTMLPageSync to ask the user a bunch of questions,
and help them create a new HTMLPageSync Fetchwarefile.

=item B<install>

A C<fetchware install> while using an HTMLPageSync Fetchwarefile causes fetchware
to download your C<html_page_url>, parse it, download any matching links, and
then copy them to your C<destination_directory> as you specify in your
Fetchwarefile.

=item B<upgrade>

A C<fetchware upgrade> will redownload the C<html_page_url>, parse it, and
compare the corresponding list of files to the list of files already downloaded,
and if any new files have been added, then they will be downloaded. New versions
of existing files are not supported. No timestamp checking is implemented
currently.

=item B<uninstall>

A C<fetchware uninstall> will cause fetchware to delete this fetchware package
from its database as well as recursively deleting everything inside your
C<destination_directory> as well as that directory itself. So when you uninstall
an HTMLPageSync fetchware package ensure that you really want to, because it will
delete whatever files it downloaded for you in the first place.

However, if you would like fetchware to preserve your C<destination_directory>,
you can set the boolean C<keep_destination_directory> configuration option to
true, like C<keep_destination_directory 'True';>, to keep HTMLPageSync from
deleting your destination directory.

=back
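
As a small illustration of the uninstall behavior described above, a
Fetchwarefile that keeps its downloads on C<fetchware uninstall> might look
like the sketch below. The page name, URL, and directory are placeholders
borrowed from the SYNOPSIS, not real values.

```perl
use App::FetchwareX::HTMLPageSync;

# Placeholder values; substitute your own page, URL, and directory.
page_name 'Cool Wallpapers';
html_page_url 'http://some-html-page-with-cool.urls';
destination_directory 'wallpapers';

# Keep the downloaded files when running fetchware uninstall.
keep_destination_directory 'True';
```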

=head1 HOW App::FetchwareX::HTMLPageSync OVERRIDES App::Fetchware

This section documents how App::FetchwareX::HTMLPageSync overrides
App::Fetchware's API, and is only interesting if you're debugging
App::FetchwareX::HTMLPageSync, or you're writing your own App::Fetchware
extension. If not, you don't need to know these details.

=head2 App::Fetchware API Subroutines

=head3 new()

HTMLPageSync overrides new(), and implements its own Q&A wizard interface
helping users create HTMLPageSync Fetchwarefiles.

=head3 new_install()

HTMLPageSync just inherits App::Fetchware's new_install(), which just asks the
user if they would like Fetchware to install the already generated
Fetchwarefile.

=head3 check_syntax()

check_syntax() is also overridden to check HTMLPageSync's own Fetchware-level
syntax.

=head3 start() and end()

HTMLPageSync just imports start() and end() from App::Fetchware to take
advantage of their ability to manage a temporary directory.

=head3 lookup()

lookup() is overridden, and downloads the C<html_page_url>, which is the main
configuration option that HTMLPageSync uses. Then lookup() parses that
C<html_page_url>, and determines what the download urls should be. If the
C<html_treebuilder_callback> and C<download_links_callback> exist, then they are
called to customize lookup()'s default behavior. See their descriptions below.

=head3 download()

download() downloads the array ref of download links that lookup() returns.

=head3 verify()

verify() is overridden to do nothing.

=head3 unarchive()

unarchive() is overridden to do nothing.

=head3 build()

build() is overridden to do nothing.

=head3 install()

install() takes its argument, which is an arrayref of the paths of the
files that were downloaded to the tempdir created by start(), and copies them to
the user's provided C<destination_directory>.
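
The copy step described above can be sketched with core modules only. This is
an illustration under stated assumptions, not HTMLPageSync's actual install():
the helper name C<copy_downloads> and its two parameters are hypothetical.

```perl
use strict;
use warnings;
use File::Copy qw(copy);
use File::Path qw(make_path);

# Hypothetical helper: copy each downloaded file into $destination_directory
# and return how many files were copied.
sub copy_downloads {
    my ($downloaded_paths, $destination_directory) = @_;

    # Create the destination directory if it does not exist yet.
    make_path($destination_directory) unless -d $destination_directory;

    for my $path (@$downloaded_paths) {
        # When the target is a directory, copy() keeps the source file name.
        copy($path, $destination_directory)
            or die "Failed to copy [$path] to [$destination_directory]: $!";
    }

    return scalar @$downloaded_paths;
}
```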

=head3 end() and start()

HTMLPageSync just imports end() and start() from App::Fetchware to take
advantage of their ability to manage a temporary directory.

=head3 uninstall()

uninstall() recursively deletes your C<destination_directory> where it stores
whatever links you choose to download unless of course the
C<keep_destination_directory> configuration option is set to true.
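
A minimal sketch of that logic, assuming core File::Path, is shown below. The
helper C<remove_destination> and its C<$keep> flag are hypothetical stand-ins
for uninstall() and the C<keep_destination_directory> option.

```perl
use strict;
use warnings;
use File::Path qw(remove_tree);

# Hypothetical helper: $keep stands in for keep_destination_directory.
# Returns 1 if the directory was deleted and 0 if it was kept.
sub remove_destination {
    my ($destination_directory, $keep) = @_;

    # Respect the user's wish to keep the downloaded files.
    return 0 if $keep;

    # Recursively delete the directory and everything inside it.
    remove_tree($destination_directory, { error => \my $errors });
    die "Failed to delete [$destination_directory]" if $errors && @$errors;

    return 1;
}
```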

=head3 upgrade()

Determines if any looked up URLs have not been downloaded yet, and returns true
if that is the case.
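
The comparison can be sketched as a hash-slice set difference keyed on file
names. This is only an illustration: the helper C<new_urls> is hypothetical,
and it uses core File::Basename rather than the functions upgrade() itself
calls.

```perl
use strict;
use warnings;
use File::Basename qw(basename);

# Hypothetical helper: return the URLs whose file name is not installed yet.
sub new_urls {
    my ($download_urls, $installed_paths) = @_;

    # Key each URL by its file name, ignoring any query string or fragment.
    my %seen;
    for my $url (@$download_urls) {
        (my $path = $url) =~ s/[?#].*//s;
        $seen{ basename($path) } = $url;
    }

    # The hash slice delete drops every file name that is already installed.
    delete @seen{ map { basename($_) } @$installed_paths };

    # Whatever is left has not been downloaded yet.
    return sort values %seen;
}
```

Keying on the file name rather than the full URL or path keeps a differing
directory prefix from defeating the comparison.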

=head2 App::FetchwareX::HTMLPageSync's Configuration Subroutines

Because HTMLPageSync is an App::Fetchware extension, it cannot just use the same
configuration subroutines that App::Fetchware uses. Instead, it must create its
own configuration subroutines with App::Fetchware::CreateConfigOptions. These
configuration subroutines are the configuration options that you use in your
App::Fetchware or App::Fetchware extension Fetchwarefile.

=head3 page_name [MANDATORY]

HTMLPageSync's equivalent to App::Fetchware's C<program_name>. It's simply the
name of the page or what you want to download on that page.

=head3 html_page_url [MANDATORY]

HTMLPageSync's equivalent to App::Fetchware's C<lookup_url>, and is just as
mandatory. This is the url of the HTML page that will be downloaded and
processed.

=head3 destination_directory [MANDATORY]

This option is also mandatory, and it specifies the directory where the files
that you want to download are downloaded to.

=head3 user_agent [OPTIONAL]

This option is optional, and it allows you to have HTTP::Tiny pretend to be a
Web browser or perhaps a bot if you want to.

=head3 html_treebuilder_callback [OPTIONAL]

This optional option allows you to specify a perl C<CODEREF> that lookup() will
execute instead of its default callback that just looks for images.

It receives one parameter, which is an HTML::Element at the first C<a>,
anchor/link tag.

It must C<return 'True';> to indicate that that link should be included in the
list of download links, or return false, C<return undef>, to indicate that that
link should not be included in the list of download links.
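
The true or false contract above can be sketched as a filter that accepts only
image links. This is not HTMLPageSync's actual default callback: the sub
C<image_filter>, the extension list, and the C<Local::FakeElement> class (a
tiny stand-in for HTML::Element so the sketch runs on its own) are all
assumptions.

```perl
use strict;
use warnings;

# Hypothetical filter in the style of html_treebuilder_callback: accept only
# links whose href ends in a common image extension.
sub image_filter {
    my $tag = shift;
    my $link = $tag->attr('href');

    # Anchor tags without an href can never be download links.
    return undef unless defined $link;

    return $link =~ /\.(?:jpe?g|png|gif|bmp|tiff?)$/i ? 'True' : undef;
}

# Minimal stand-in for HTML::Element that only supports attr().
package Local::FakeElement {
    sub new  { my ($class, %attr) = @_; return bless {%attr}, $class }
    sub attr { my ($self, $name) = @_; return $self->{$name} }
}
```

In a real Fetchwarefile the same body would go inside the anonymous sub given
to C<html_treebuilder_callback>.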

=head3 download_links_callback [OPTIONAL]

This option specifies an optional callback that allows you to post-process the
list of download urls. This is needed, because the results of the
C<html_treebuilder_callback> are still HTML::Element objects that need to be
converted to just string download urls. That is what the default
C<download_links_callback> does.

It receives a list of all of the download HTML::Elements that
C<html_treebuilder_callback> returned true on. It is called only once, and
should return a list of string download links for download later by HTTP::Tiny
in download().
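
A sketch of such a conversion, with duplicate removal added as an example of
post processing, is shown below. It is not the default callback's actual code:
C<links_to_urls> and the C<Local::FakeElement> stand-in for HTML::Element are
hypothetical names.

```perl
use strict;
use warnings;

# Hypothetical callback body: turn element objects into unique href strings,
# preserving the order in which they were found.
sub links_to_urls {
    my @elements = @_;

    my (%seen, @urls);
    for my $element (@elements) {
        my $href = $element->attr('href');

        # Skip elements without an href and urls that were already collected.
        next unless defined $href;
        push @urls, $href unless $seen{$href}++;
    }

    return @urls;
}

# Minimal stand-in for HTML::Element that only supports attr().
package Local::FakeElement {
    sub new  { my ($class, %attr) = @_; return bless {%attr}, $class }
    sub attr { my ($self, $name) = @_; return $self->{$name} }
}
```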

=head3 keep_destination_directory [OPTIONAL]

This optional option is a boolean true or false configuration option that
when true prevents HTMLPageSync from deleting your destination_directory when
you run fetchware uninstall.

Its default is false, so by default HTMLPageSync B<will> delete your files from
your C<destination_directory> unless you set this to true.

=head1 ERRORS

As with the rest of App::Fetchware, App::FetchwareX::HTMLPageSync does not
return any error codes; instead, all errors are die()'d if it's
App::FetchwareX::HTMLPageSync's error, or croak()'d if it's the caller's fault.
These exceptions are simple strings, and are listed in the L</DIAGNOSTICS>
section below.
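
Because the exceptions are plain strings, calling code can trap them with
C<eval>. The sketch below is generic Perl rather than part of HTMLPageSync; the
helper C<error_from> is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical wrapper: run one step and return its error string, or undef
# when the step succeeded.
sub error_from {
    my ($step) = @_;

    # eval traps the die() or croak(); the trailing 1 marks success.
    my $ok = eval { $step->(); 1 };

    return $ok ? undef : ($@ || 'Unknown error');
}
```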

=head1 CAVEATS

Certain features of App::FetchwareX::HTMLPageSync require knowledge of the Perl
programming language in order for you to make use of them. However, this is
limited to optional callbacks that are not needed for most uses. These features
are the C<html_treebuilder_callback> and C<download_links_callback> callbacks.

=head1 AUTHOR

David Yingling <deeelwy@gmail.com>

=head1 COPYRIGHT AND LICENSE

This software is copyright (c) 2016 by David Yingling.

This is free software; you can redistribute it and/or modify it under
the same terms as the Perl 5 programming language system itself.

=cut

__END__














###BUGALERT### Actually implement croak or more likely confess() support!!!




