updated results from the CPAN

App-jupiter


=head1 DESCRIPTION

Planet Jupiter is used to pull together the latest updates from a bunch of other
sites and display them on a single web page, the "river of news". The sites we
get our updates from are defined in an OPML file.

A river of news, according to Dave Winer, is a feed aggregator. New items appear
at the top and old items disappear at the bottom. When it's gone, it's gone.
There is no count of unread items. The goal is to fight the I<fear of missing
out> (FOMO).

Each item looks similar to every other: headline, link, an extract, maybe a date
and an author. Extracts contain but the beginning of the article's text; all
markup is removed; no images. The goal is to make the page easy to skim.
Scroll down until you find something interesting and follow the link to the
original article if you want to read it.

=head2 The OPML file

You B<need> an OPML file. It's an XML file linking to I<feeds>. Here's an
example listing just one feed. In order to add more, add more C<outline>
elements with the C<xmlUrl> attribute. The exact order and nesting does not
matter. People can I<import> these OPML files into their own feed readers and
thus it may make sense to spend a bit more effort in making it presentable.

    <opml version="2.0">
      <body>
        <outline title="Alex Schroeder"
                 xmlUrl="https://alexschroeder.ch/view/index.rss"/>
      </body>
    </opml>

=head2 Update the feeds in your cache

This is how you update the feeds in a file called C<feed.opml>. It downloads all
the feeds linked to in the OPML file and stores them in the cache directory.

    jupiter update feed.opml

The directory used to keep a copy of all the feeds in the OPML file has the same
name as the OPML file but without the .opml extension. In other words, if your
OPML file is called C<feed.opml> then the cache directory is called C<feed>.

This operation takes long because it requests an update from all the sites
listed in your OPML file. Don't run it too often or you'll annoy the site
owners.

The OPML file must use the .opml extension. You can update the feeds for
multiple OPML files in one go.

=head2 Adding just one feed

After a while, the list of feeds in your OPML starts getting unwieldy. When you
add a new feed, you might not want to fetch all of them. In this case, provide a
regular expression surrounded by slashes to the C<update> command:

    jupiter update feed.opml /example/

Assuming a feed with a URL or title that matches the regular expression is
listed in your OPML file, only that feed is going to get updated.

There is no need to escape slashes in the regular expression: C<//rss/> works
just fine. Beware shell escaping, however. Most likely, you need to surround the
regular expression with single quotes if it contains spaces:

    jupiter update feed.opml '/Halberds & Helmets/'

Notice how we assume that named entities such as C<&amp;> have already been
parsed into the appropriate strings.

=head2 Generate the HTML

This is how you generate the C<index.html> file based on the feeds of your
C<feed.opml>. It assumes that you have already updated all the feeds (see
above).

    jupiter html feed.opml

See L</OPTIONS> for ways to change how the HTML is generated.

=head2 Generate the RSS feed

This happens at the same time as when you generate the HTML. It takes all the
entries that are being added to the HTML and puts the into a feed.

See L</OPTIONS> for ways to change how the HTML is generated.

=head2 Why separate the two steps?

The first reason is that tinkering with the templates involves running the
program again and again, and you don't want to contact all the sites whenever
you update your templates.

The other reason is that it allows you to create subsets. For example, you can
fetch the feeds for three different OPML files:

    jupiter update osr.opml indie.opml other.opml

And then you can create three different HTML files:

    jupiter html osr.html osr.opml
    jupiter html indie.html indie.opml
    jupiter html rpg.html osr.opml indie.opml other.opml

For an example of how it might look, check out the setup for the planets I run.
L<https://alexschroeder.ch/cgit/planet/about/>

=head2 What about the JSON file?

There's a JSON file that gets generated and updated as you run Planet Jupiter.
It's name depends on the OPML files used. It records metadata for every feed in
the OPML file that isn't stored in the feeds themselves.

C<message> is the HTTP status message, or a similar message such as "No entry
newer than 90 days." This is set when updating the feeds in your cache.

C<message> is the HTTP status code; this code could be the real status code from
the server (such as 404 for a "not found" status) or one generated by Jupiter
such that it matches the status message (such as 206 for a "partial content"
status when there aren't any recent entries in the feed). This is set when
updating the feeds in your cache.

C<title> is the site's title. When you update the feeds in your cache, it is
taken from the OPML file. That's how the feed can have a title even if the
download failed. When you generate the HTML, the feeds in the cache are parsed
and if a title is provided, it is stored in the JSON file and overrides the
title in the OPML file.

C<link> is the site's link for humans. When you generate the HTML, the feeds in
the cache are parsed and if a link is provided, it is stored in the JSON file.
If the OPML element contained a C<htmlURL> attribute, however, that takes
precedence. The reasoning is that when a podcast is hosted on a platform which
generates a link that you don't like and you know the link to the human-readable
blog elsewhere, use the C<htmlURL> attribute in the OPML file to override this.

C<last_modified> and C<etag> are two headers used for caching from the HTTP
response that cannot be changed by data in the feed.

If we run into problems downloading a feed, this setup allows us to still link
to the feeds that aren't working, using their correct names, and describing the
error we encountered.

=head2 Logging

Use the C<--log=LEVEL> to set the log level. Valid values for LEVEL are debug,
info, warn, error, and fatal.

=head1 LICENSE

GNU Affero General Public License

=head1 INSTALLATION

Using C<cpan>:

    cpan App::jupiter

Manual install:

    perl Makefile.PL
    make
    make install

=head2 Dependencies

To run Jupiter on Debian we need:

C<libmodern-perl-perl> for L<Modern::Perl>

C<libmojolicious-perl> for L<Mojo::Template>, L<Mojo::UserAgent>, L<Mojo::Log>,

script/jupiter view on Meta::CPAN

B<doc> is the L<XML::LibXML::Document>. Could be either Atom or RSS!

=cut

# Creates list of feeds. Each feed is a hash with keys title, url, opml_file,
# cache_dir and cache_file.
sub read_opml {
  my (@feeds, @files);
  my @filters = map { decode(locale => substr($_, 1, -1)) } grep /^\/.*\/$/, @_;
  for my $file (grep /\.opml$/, @_) {
    my $doc = XML::LibXML->load_xml(location => $file); # this better have no errors!
    my @nodes = $doc->findnodes('//outline[./@xmlUrl]');
    my ($name, $path) = fileparse($file, '.opml', '.xml');
    for my $node (@nodes) {
      my $title = xml_escape $node->getAttribute('title');
      my $url = xml_escape $node->getAttribute('xmlUrl');
      next if @filters > 0 and not grep { $url =~ /$_/ or $title =~ /$_/ } @filters;
      my $link = xml_escape $node->getAttribute('htmlUrl');
      push @feeds, {
        title => $title,    # title in the OPML file
        url => $url,        # feed URL in the OPML file
        link => $link,      # web URL in the OPML file
        opml_file => $file,
        cache_dir => "$path/$name",
        cache_file => "$path/$name/" . slugify($url),
      };
    }
    warn "No feeds found in the OPML file $file\n" unless @nodes;
    push @files, { file => $file, path => $path, name => $name };
  }
  @feeds = shuffle @feeds;
  return \@feeds, \@files;
}

sub entries {
  my $feeds = shift;
  my $limit = shift;
  my $date = DateTime->now(time_zone => 'UTC')->subtract( days => 90 ); # compute once
  my $now =  DateTime->now(time_zone => 'UTC');
  my @entries;
  for my $feed (@$feeds) {
    next unless -r $feed->{cache_file};
    my $doc = eval { XML::LibXML->load_xml(recover => 2, location => $feed->{cache_file} )};
    if (not $doc) {
      $feed->{message} = xml_escape "Parsing error: $@";
      $feed->{code} = 422; # unprocessable
      next;
    }
    $feed->{doc} = $doc;
    my @nodes = $xpc->findnodes("/rss/channel/item | /atom:feed/atom:entry", $doc);
    if (not @nodes) {
      $feed->{message} = "Empty feed";
      $feed->{code} = 204; # no content
      next;
    }
    # if this is an Atom feed, we need to sort the entries ourselves (older entries at the end)
    my @candidates = map {
      my $entry = {};
      $entry->{element} = $_;
      $entry->{id} = id($_);
      $entry->{date} = updated($_) || $undefined_date;
      $entry;
    } @nodes;
    @candidates = grep { DateTime->compare($_->{date}, $now) <= 0 } @candidates;
    @candidates = unique(sort { DateTime->compare( $b->{date}, $a->{date} ) } @candidates);
    @candidates = @candidates[0 .. min($#candidates, $limit - 1)];
    # now that we have limited the candidates, let's add more metadata from the feed
    for my $entry (@candidates) {
      $entry->{feed} = $feed;
      # these two are already escaped
      $entry->{blog_title} = $feed->{title};
      $entry->{blog_url} = $feed->{url};
    }
    add_age_warning($feed, \@candidates, $date);
    push @entries, @candidates;
  }
  return \@entries;
}

sub add_age_warning {
  my $feed = shift;
  my $entries = shift;
  my $date = shift;
  # feed modification date is smaller than the date given
  my ($node) = $xpc->findnodes("/rss/channel | /atom:feed", $feed->{doc});
  my $feed_date = updated($node);
  if ($feed_date and DateTime->compare($feed_date, $date) == -1) {
    $feed->{message} = "No feed updates in 90 days";
    $feed->{code} = 206; # partial content
    return;
  } else {
    # or no entry found with a modification date equal or bigger than the date given
    for my $entry (@$entries) {
      return if DateTime->compare($entry->{date}, $date) >= 0;
    }
    $feed->{message} = "No entry newer than 90 days";
    $feed->{code} = 206; # partial content
  }
}

sub updated {
  my $node = shift;
  return unless $node;
  my @nodes = $xpc->findnodes('pubDate | atom:published | atom:updated', $node) or return;
  my $date = $nodes[0]->textContent;
  my $dt = eval { DateTime::Format::Mail->parse_datetime($date) }
  || eval { DateTime::Format::ISO8601->parse_datetime($date) }
  || eval { DateTime::Format::Mail->parse_datetime(french($date)) };
  return $dt;
}

sub french {
  my $date = shift;
  $date =~ s/^($wday_re)/$wday{$1}/;
  $date =~ s/\b($month_re)/$month{$1}/;
  return $date;
}

sub id {
  my $node = shift;
  return unless $node;
  my $id = $xpc->findvalue('guid | atom:id', $node); # id is mandatory for Atom
  $id ||= $node->findvalue('link'); # one of the following three is mandatory for RSS
  $id ||= $node->findvalue('title');
  $id ||= $node->findvalue('description');
  return $id;
}

sub unique {
  my %seen;
  my @unique;
  for my $node (@_) {
    next if $seen{$node->{id}};
    $seen{$node->{id}} = 1;
    push(@unique, $node);
  }
  return @unique;
}

sub limit {
  my $entries = shift;
  my $limit = shift;
  # we want the most recent entries overall
  @$entries = sort { DateTime->compare( $b->{date}, $a->{date} ) } unique(@$entries);
  return [@$entries[0 .. min($#$entries, $limit - 1)]];
}

=head2 Writing templates for entries

Entries have the following keys available:

B<title> is the title of the post.

B<link> is the URL to the post on the web (probably a HTML page).

B<blog_title> is the title of the site.

B<blog_link> is the URL for the site on the web (probably a HTML page).

B<blog_url> is the URL for the site's feed (RSS or Atom).

B<authors> are the authors (or the Dublin Core contributor), a list of strings.

B<date> is the publication date, as a DateTime object.

( run in 0.976 second using v1.01-cache-2.11-cpan-f56aa216473 )