DateTime results from the CPAN

App-jupiter

C<message> is the HTTP status code; this code could be the real status code from
the server (such as 404 for a "not found" status) or one generated by Jupiter
such that it matches the status message (such as 206 for a "partial content"
status when there aren't any recent entries in the feed). This is set when
updating the feeds in your cache.

C<title> is the site's title. When you update the feeds in your cache, it is
taken from the OPML file. That's how the feed can have a title even if the
download failed. When you generate the HTML, the feeds in the cache are parsed
and if a title is provided, it is stored in the JSON file and overrides the
title in the OPML file.

C<link> is the site's link for humans. When you generate the HTML, the feeds in
the cache are parsed and if a link is provided, it is stored in the JSON file.
If the OPML element contained a C<htmlURL> attribute, however, that takes
precedence. The reasoning is that when a podcast is hosted on a platform which
generates a link that you don't like and you know the link to the human-readable
blog elsewhere, use the C<htmlURL> attribute in the OPML file to override this.

C<last_modified> and C<etag> are two headers used for caching from the HTTP
response that cannot be changed by data in the feed.

If we run into problems downloading a feed, this setup allows us to still link
to the feeds that aren't working, using their correct names, and describing the
error we encountered.

=head2 Logging

Use the C<--log=LEVEL> to set the log level. Valid values for LEVEL are debug,
info, warn, error, and fatal.

=head1 LICENSE

GNU Affero General Public License

=head1 INSTALLATION

Using C<cpan>:

    cpan App::jupiter

Manual install:

    perl Makefile.PL
    make
    make install

=head2 Dependencies

To run Jupiter on Debian we need:

C<libmodern-perl-perl> for L<Modern::Perl>

C<libmojolicious-perl> for L<Mojo::Template>, L<Mojo::UserAgent>, L<Mojo::Log>,
L<Mojo::JSON>, and L<Mojo::Util>

C<libxml-libxml-perl> for L<XML::LibXML>

C<libfile-slurper-perl> for L<File::Slurper>

C<libdatetime-perl> for L<DateTime>

C<libdatetime-format-mail-perl> for L<DateTime::Format::Mail>

C<libdatetime-format-iso8601-perl> for L<DateTime::Format::ISO8601>

Unfortunately, L<Mojo::UserAgent::Role::Queued> isn't packaged for Debian.
Therefore, let's build it and install it as a Debian package.

    sudo apt-get install libmodule-build-tiny-perl
    sudo apt-get install dh-make-perl
    sudo dh-make-perl --build --cpan Mojo::UserAgent::Role::Queued
    dpkg --install libmojo-useragent-role-queued-perl_1.15-1_all.deb

To generate the C<README.md> from the source file, you need F<pod2markdown>
which you get in C<libpod-markdown-perl>.

=head1 FILES

There are a number of files in the F<share> directory which you can use as
starting points.

F<template.html> is the HTML template.

F<default.css> is a small CSS file used by F<template.html>.

F<personalize.js> is a small Javascript file used by F<template.html> used to
allow visitors to jump from one article to the next using C<J> and C<K>.

F<jupiter.png> is used by F<template.html> as the icon.

F<jupiter.svg> is used by F<template.html> as the logo.

F<feed.png> is used by F<template.html> as the icon for the feeds in the
sidebar.

F<feed.rss> is the feed template.

=head1 OPTIONS

HTML generation uses a template, C<template.html>. It is written for
C<Mojo::Template> and you can find it in the F<share> directory of your
distribution. The default templates use other files, such as the logo, the feed
icon, a CSS file, and a small Javascript snippet to enable navigation using the
C<J> and C<K> keys (see above).

You can specify a different HTML file to generate:

B<jupiter html> I<your.html feed.opml>

If you specify two HTML files, the first is the HTML file to generate and the
second is the template to use. Both must use the C<.html> extension.

B<jupiter html> I<your.html your-template.html feed.opml>

Feed generation uses a template, C<feed.rss>. It writes all the entries into a
file called C<feed.xml>. Again, the template is written for C<Mojo::Template>.

You can specify up to two XML, RSS or ATOM files. They must uses one of these
three extensions: C<.xml>, C<.rss>, or C<.atom>. The first is the name of the
feed to generate, the second is the template to use:

B<jupiter html> I<atom.xml template.xml planet.html template.html feed.opml>

In the above case, Planet Jupiter will write a feed called F<atom.xml> based on
F<template.xml> and a HTML file called F<planet.html> based on F<template.html>,
using the cached entries matching the feeds in F<feed.opml>.

=cut

use DateTime;
use DateTime::Format::Mail;
use DateTime::Format::ISO8601;
use File::Basename;
use File::Slurper qw(read_binary write_binary read_text write_text);
use List::Util qw(uniq min shuffle);
use Modern::Perl;
use Mojo::Log;
use Mojo::JSON qw(decode_json encode_json);
use Mojo::Template;
use Mojo::UserAgent;
use Pod::Simple::Text;
use XML::LibXML;
use Mojo::Util qw(slugify trim xml_escape html_unescape);
use File::ShareDir 'dist_file';

use vars qw($log);
our $log = Mojo::Log->new;

my $xpc = XML::LibXML::XPathContext->new;
$xpc->registerNs('atom', 'http://www.w3.org/2005/Atom');
$xpc->registerNs('html', 'http://www.w3.org/1999/xhtml');
$xpc->registerNs('dc', 'http://purl.org/dc/elements/1.1/');
$xpc->registerNs('itunes', 'http://www.itunes.com/dtds/podcast-1.0.dtd');

my $undefined_date = DateTime->from_epoch( epoch => 0 );

my (%wday, %month, $wday_re, $month_re);
%wday = qw (lun. Mon mar. Tue mer. Wed jeu. Thu ven. Fri sam. Sat dim. Sun);
%month = qw (janv. Jan fÃ©vr. Feb mars Mar avr. Apr mai May juin Jun
	     juil. Jul aoÃ»t Aug sept. Sep oct. Oct nov. Nov dÃ©c. Dec);
$wday_re = join('|', map { quotemeta } keys %wday) unless $wday_re;
$month_re = join('|', map { quotemeta } keys %month) unless $month_re;

# Our tests don't want to call main
__PACKAGE__->main unless caller;

sub main {
  my ($log_level) = grep /^--log=/, @ARGV;
  $log->level(substr($log_level, 6)) if $log_level;
  my ($command) = grep /^[a-z]+$/, @ARGV;
  $command ||= 'help';
  if ($command eq 'update') {
    update_cache(@ARGV);
  } elsif ($command eq 'html') {
    make_html(@ARGV);
  } else {
    my $parser = Pod::Simple::Text->new();
    $parser->parse_file($0);
  }
}

sub update_cache {
  my ($feeds, $files) = read_opml(@_);
  make_directories($feeds);
  load_feed_metadata($feeds, $files);
  my $ua = Mojo::UserAgent->new->with_roles('+Queued')
      ->max_redirects(3)
      ->max_active(5);
  make_promises($ua, $feeds);
  fetch_feeds($feeds);
  save_feed_metadata($feeds, $files);
  cleanup_cache($feeds);
}

sub make_promises {
  my $ua = shift;
  my $feeds = shift;
  for my $feed (@$feeds) {
    my $url = html_unescape $feed->{url}; # undo xml_escape for the request
    $ua->on(start => sub {
      my ($ua, $tx) = @_;
      $tx->req->headers->if_none_match($feed->{etag}) if ($feed->{etag});
      $tx->req->headers->if_modified_since($feed->{last_modified}) if ($feed->{last_modified});
    });
    $feed->{promise} = $ua->get_p($url)
	->catch(sub {
	  $feed->{message} = "@_";
	  $feed->{code} = 521;
	  # returning 0 in the case of an error is important
	  0; })
	# sleeping to stop blogger.com from blocking us
	->finally(sub { $log->debug($url); sleep 2; });
  }
}

script/jupiter view on Meta::CPAN

  add_data($feeds, $entries);   # extract data from the xml
  save_feed_metadata($feeds, $files); # save title and link for feeds
  $entries = limit($entries, 100);
  write_text(html_file(@_), apply_template(read_text(html_template_file(@_)), $globals, $feeds, $entries));
  write_text(feed_file(@_), apply_template(read_text(feed_template_file(@_)), $globals, $feeds, $entries));
}

sub html_file {
  my ($html) = grep /\.html$/, @_;
  return $html || 'index.html';
}

sub html_template_file {
  my ($html, $template) = grep /\.html$/, @_;
  $template ||= dist_file('App-jupiter', 'template.html');
  die "HTML template $template not found\n" unless -r $template;
  return $template;
}

sub feed_file {
  my ($feed) = grep /\.(xml|rss|atom)$/, @_;
  return $feed if $feed;
  return 'feed.xml';
}

sub feed_template_file {
  my ($feed, $template) = grep /\.(xml|rss|atom)$/, @_;
  return $template if $template;
  return dist_file('App-jupiter', 'feed.rss');
}

sub apply_template {
  my $mnt = Mojo::Template->new;
  return $mnt->render(@_);
}

=head1 TEMPLATES

The page template is called with three hash references: C<globals>, C<feeds>,
and C<entries>. The keys of these three hash references are documented below.
The values of these hashes are all I<escaped HTML> except where noted (dates and
file names, for example).

The technical details of how to write the templates are documented in the man
page for L<Mojo::Template>.

=head2 Globals

There are not many global keys.

B<date> is the the publication date of the HTML page, in ISO date format:
YYYY-MM-DD.

B<files> is the list of OPML files used.

=cut

sub globals {
  my $files = shift;
  my @time = gmtime;
  my $today = DateTime->now->ymd;
  return {date => $today, files => $files};
}

=head2 Writing templates for feeds

Feeds have the following keys available:

B<title> is the title of the feed.

B<url> is the URL of the feed (RSS or Atom). This is not the link to the site!

B<link> is the URL of the web page (HTML). This is the link to the site.

B<opml_file> is the file name where this feed is listed.

B<cache_dir> is the directory where this feed is cached.

B<message> is the HTTP status message or other warning or error that we got
while fetching the feed.

B<code> is the HTTP status code we got while fetching the feed.

B<doc> is the L<XML::LibXML::Document>. Could be either Atom or RSS!

=cut

# Creates list of feeds. Each feed is a hash with keys title, url, opml_file,
# cache_dir and cache_file.
sub read_opml {
  my (@feeds, @files);
  my @filters = map { decode(locale => substr($_, 1, -1)) } grep /^\/.*\/$/, @_;
  for my $file (grep /\.opml$/, @_) {
    my $doc = XML::LibXML->load_xml(location => $file); # this better have no errors!
    my @nodes = $doc->findnodes('//outline[./@xmlUrl]');
    my ($name, $path) = fileparse($file, '.opml', '.xml');
    for my $node (@nodes) {
      my $title = xml_escape $node->getAttribute('title');
      my $url = xml_escape $node->getAttribute('xmlUrl');
      next if @filters > 0 and not grep { $url =~ /$_/ or $title =~ /$_/ } @filters;
      my $link = xml_escape $node->getAttribute('htmlUrl');
      push @feeds, {
        title => $title,    # title in the OPML file
        url => $url,        # feed URL in the OPML file
        link => $link,      # web URL in the OPML file
        opml_file => $file,
        cache_dir => "$path/$name",
        cache_file => "$path/$name/" . slugify($url),
      };
    }
    warn "No feeds found in the OPML file $file\n" unless @nodes;
    push @files, { file => $file, path => $path, name => $name };
  }
  @feeds = shuffle @feeds;
  return \@feeds, \@files;
}

sub entries {
  my $feeds = shift;
  my $limit = shift;
  my $date = DateTime->now(time_zone => 'UTC')->subtract( days => 90 ); # compute once
  my $now =  DateTime->now(time_zone => 'UTC');
  my @entries;
  for my $feed (@$feeds) {
    next unless -r $feed->{cache_file};
    my $doc = eval { XML::LibXML->load_xml(recover => 2, location => $feed->{cache_file} )};
    if (not $doc) {
      $feed->{message} = xml_escape "Parsing error: $@";
      $feed->{code} = 422; # unprocessable
      next;
    }
    $feed->{doc} = $doc;
    my @nodes = $xpc->findnodes("/rss/channel/item | /atom:feed/atom:entry", $doc);
    if (not @nodes) {
      $feed->{message} = "Empty feed";
      $feed->{code} = 204; # no content
      next;
    }
    # if this is an Atom feed, we need to sort the entries ourselves (older entries at the end)
    my @candidates = map {
      my $entry = {};
      $entry->{element} = $_;
      $entry->{id} = id($_);
      $entry->{date} = updated($_) || $undefined_date;
      $entry;
    } @nodes;
    @candidates = grep { DateTime->compare($_->{date}, $now) <= 0 } @candidates;
    @candidates = unique(sort { DateTime->compare( $b->{date}, $a->{date} ) } @candidates);
    @candidates = @candidates[0 .. min($#candidates, $limit - 1)];
    # now that we have limited the candidates, let's add more metadata from the feed
    for my $entry (@candidates) {
      $entry->{feed} = $feed;
      # these two are already escaped
      $entry->{blog_title} = $feed->{title};
      $entry->{blog_url} = $feed->{url};
    }
    add_age_warning($feed, \@candidates, $date);
    push @entries, @candidates;
  }
  return \@entries;
}

sub add_age_warning {
  my $feed = shift;
  my $entries = shift;
  my $date = shift;
  # feed modification date is smaller than the date given
  my ($node) = $xpc->findnodes("/rss/channel | /atom:feed", $feed->{doc});
  my $feed_date = updated($node);
  if ($feed_date and DateTime->compare($feed_date, $date) == -1) {
    $feed->{message} = "No feed updates in 90 days";
    $feed->{code} = 206; # partial content
    return;
  } else {
    # or no entry found with a modification date equal or bigger than the date given
    for my $entry (@$entries) {
      return if DateTime->compare($entry->{date}, $date) >= 0;
    }
    $feed->{message} = "No entry newer than 90 days";
    $feed->{code} = 206; # partial content
  }
}

sub updated {
  my $node = shift;
  return unless $node;
  my @nodes = $xpc->findnodes('pubDate | atom:published | atom:updated', $node) or return;
  my $date = $nodes[0]->textContent;
  my $dt = eval { DateTime::Format::Mail->parse_datetime($date) }
  || eval { DateTime::Format::ISO8601->parse_datetime($date) }
  || eval { DateTime::Format::Mail->parse_datetime(french($date)) };
  return $dt;
}

sub french {
  my $date = shift;
  $date =~ s/^($wday_re)/$wday{$1}/;
  $date =~ s/\b($month_re)/$month{$1}/;
  return $date;
}

sub id {
  my $node = shift;
  return unless $node;
  my $id = $xpc->findvalue('guid | atom:id', $node); # id is mandatory for Atom
  $id ||= $node->findvalue('link'); # one of the following three is mandatory for RSS
  $id ||= $node->findvalue('title');
  $id ||= $node->findvalue('description');
  return $id;
}

sub unique {
  my %seen;
  my @unique;
  for my $node (@_) {
    next if $seen{$node->{id}};
    $seen{$node->{id}} = 1;
    push(@unique, $node);
  }
  return @unique;
}

sub limit {
  my $entries = shift;
  my $limit = shift;
  # we want the most recent entries overall
  @$entries = sort { DateTime->compare( $b->{date}, $a->{date} ) } unique(@$entries);
  return [@$entries[0 .. min($#$entries, $limit - 1)]];
}

=head2 Writing templates for entries

Entries have the following keys available:

B<title> is the title of the post.

B<link> is the URL to the post on the web (probably a HTML page).

B<blog_title> is the title of the site.

B<blog_link> is the URL for the site on the web (probably a HTML page).

B<blog_url> is the URL for the site's feed (RSS or Atom).

B<authors> are the authors (or the Dublin Core contributor), a list of strings.

B<date> is the publication date, as a DateTime object.

B<day> is the publication date, in ISO date format: YYYY-MM-DD, for the UTC
timezone. The UTC timezone is picked so that the day doesn't jump back and forth
when sorting entries by date.

B<content> is the full post content, as string or encoded HTML.

B<excerpt> is the post content, limited to 500 characters, with paragraph
separators instead of HTML elements, as HTML. It is not encoded because the idea
is that it only gets added to the HTML and not to the feed, and the HTML it
contains is very controlled (only the pilcrow sign inside a span to indicate
paragraph breaks).

B<categories> are the categories, a list of strings.

B<element> is for internal use only. It contains the L<XML::LibXML::Element>
object. This could be RSS or Atom!

B<feed> is for internal use only. It's a reference to the feed this entry
belongs to.

=cut

sub add_data {
  my $feeds = shift;
  my $entries = shift;
  # A note on the use of xml_escape: whenever we get data from the feed itself,
  # it needs to be escaped if it gets printed into the HTML. For example: the
  # feed contains a feed title of "Foo &amp; Bar". findvalue returns "Foo &
  # Bar". When the template inserts the title, however, we want "Foo &amp; Bar",
  # not "Foo & Bar". Thus: any text we get from the feed needs to be escaped
  # if there's a chance we're going to print it again.
  for my $feed (@$feeds) {
    next unless $feed->{doc};
    # title and url in the feed overrides title and xmlUrl set in the OPML (XML already escaped)
    $feed->{title} = xml_escape($xpc->findvalue('/rss/channel/title | /atom:feed/atom:title', $feed->{doc})) || $feed->{title} || "";
    $feed->{url} = xml_escape($xpc->findvalue('/atom:feed/atom:link[@rel="self"]/@href', $feed->{doc})) || $feed->{url} || "";
    # link in the feed does not override htmlUrl in the OPML (XML already escaped)
    $feed->{link} = $feed->{link} || xml_escape($xpc->findvalue('/rss/channel/link | /atom:feed/atom:link[@rel="alternate"][@type="text/html"]/@href', $feed->{doc})) || "";
    # if they just pasted a domain "foo" then "//foo" is a valid URL
    $feed->{link} = "//" . $feed->{link} unless $feed->{link} =~ /\/\//;
  }
  for my $entry (@$entries) {
    # copy from the feed (XML is already escaped)
    $entry->{blog_link} = $entry->{feed}->{link};
    $entry->{blog_title} = $entry->{feed}->{title};
    $entry->{blog_url} = $entry->{feed}->{url};
    # parse the elements
    my $element = $entry->{element};
    $entry->{title} = xml_escape $xpc->findvalue('title | atom:title', $element) || "Untitled";
    my @links = map { xml_escape $_->to_literal } map { $xpc->findnodes($_, $element) }
      # sorted by preferences!
      qw(link atom:link[@rel="alternate"][@type="text/html"]/@href atom:link/@href);
    $entry->{link} = shift(@links) || "";
    my @authors = map { without_email(xml_escape strip_html($_->to_literal)) } $xpc->findnodes(
      'author | atom:author/atom:name | atom:contributor/atom:name | dc:creator | dc:contributor', $element);
    @authors = map { without_email(xml_escape strip_html($_->to_literal)) } $xpc->findnodes(
      '/atom:feed/atom:author/atom:name | '
      . '/atom:feed/atom:contributor/atom:name | '
      . '/rss/channel/dc:creator | '
      . '/rss/channel/dc:contributor | '
      . '/rss/channel/webMaster ', $element) unless @authors;
    $entry->{authors} = @authors ? \@authors : undef; # key must exist in the hash
    if (DateTime->compare($entry->{date}, $undefined_date) == 0) {
      $entry->{day} =  "(no date found)";
    } else {
      $entry->{day} = $entry->{date}->clone->set_time_zone('UTC')->ymd; # operate on a clone
    }
    my @categories = map { xml_escape strip_html($_->to_literal) } $xpc->findnodes('category | atom:category/@term', $element);
    $entry->{categories} = @categories ? \@categories : undef; # key must exist in the hash
    $entry->{excerpt} = '';
    $entry->{content} = '';
    my @nodes = $xpc->findnodes('description[text()!=""] | atom:content[text()!=""]', $element);
    @nodes = $xpc->findnodes('summary[text()!=""] | atom:summary[text()!=""] | itunes:summary[text()!=""]', $element) unless @nodes;
    my $content = shift(@nodes);
    # The default is that the content is either plain text or escaped HTML,
    # which is what we want for RSS. For Atom feeds, type="xhtml" means that the
    # content is XHTML, so it needs to be escaped.
    if ($content) {
      my $is_xhtml = $content->hasAttribute("type") && $content->getAttribute("type") eq "xhtml";
      $entry->{excerpt} = excerpt($content->to_literal);
      for my $child ($content->childNodes()) {
	my $c = $child->toString();
	$c = xml_escape $c if $is_xhtml;
	$entry->{content} .= $c;
      }
    }
  }
}

sub excerpt {
  my $content = shift;
  return '(no excerpt)' unless $content;
  my $doc = eval { XML::LibXML->load_html(recover => 2, string => $content) };
  my $separator = "Â¶";
  for my $node ($doc->findnodes('//style | //script | //time | //*[contains(@class, "post-header")] | //*[contains(@class, "post-share-buttons")] | //*[contains(@class, "post-footer")] | //*[contains(@class, "comments")]'),
      ){
    $node->parentNode->removeChild($node);
  }
  for my $node ($doc->findnodes('//p | //br | //blockquote | //li | //td | //th | //div | //h1 | //h2 | //h3 | //h4 | //h5 | //h6')) {
    $node->appendTextNode($separator);
  }
  my $text = strip_html($doc->textContent());
  $text =~ s/(Â +|----+)/ /g;
  # collapse whitespace and trim
  $text =~ s/\s+/ /g;
  $text = trim $text;
  # replace paragraph repeats with their surrounding spaces
  $text =~ s/ ?Â¶( ?Â¶)* ?/Â¶/g;
  $text =~ s/^Â¶//;
  $text =~ s/Â¶$//;
  my $len = length($text);
  $text = substr($text, 0, 500);
  $text .= "â€¦" if $len > 500;
  $text = xml_escape $text;
  $text =~ s/Â¶/<span class="paragraph">Â¶ <\/span>/g;
  return $text;
}

# When there's a value that's supposed to be text but isn't, then we can try to
# turn it to HTML and from there to text... This is an ugly hack and I wish it
# wasn't necessary.
sub strip_html {
  my $str = shift;

( run in 1.598 second using v1.01-cache-2.11-cpan-39bf76dae61 )