formula results from the CPAN

Mail-SpamAssassin


=head1 NAME

Mail::SpamAssassin::Plugin::TxRep - Normalize scores with sender reputation records

=head1 SYNOPSIS

The TxRep (Reputation) plugin is designed as an improved replacement of the AWL
(Auto-Welcomelist) plugin. It adjusts the final message spam score by looking up 
and taking in consideration the reputation of the sender.

To try TxRep out, you B<have to> first disable the AWL plugin (if enabled), and
back up its database. AWL is loaded in v310.pre and can be disabled by
commenting out the loadplugin line:

 # loadplugin   Mail::SpamAssassin::Plugin::AWL

When AWL is not disabled, TxRep will refuse to run.

TxRep should be enabled by uncommenting the following line in v341.pre:

  loadplugin   Mail::SpamAssassin::Plugin::TxRep

Use the supplied 60_txreputation.cf file or add these lines to a .cf file:

 header         TXREP   eval:check_senders_reputation()
 describe       TXREP   Score normalizing based on sender's reputation
 tflags         TXREP   userconf noautolearn
 priority       TXREP   1000


=head1 DESCRIPTION

This plugin is intended to replace the former AWL - AutoWelcomeList. Although the
concept and the scope differ, the purpose remains the same - the normalizing of spam
score results based on previous sender's history. The name was intentionally changed
from "whitelist" to "reputation" to avoid any confusion, since the result score can
be adjusted in both directions.

The TxRep plugin keeps track of the average SpamAssassin score for senders.
Senders are tracked using multiple identificators, or their combinations: the  From:
email address, the originating IP and/or an originating block of IPs, sender's domain
name, the DKIM signature, and the HELO name. TxRep then uses the average score to reduce
the variability in scoring from message to message, and modifies the final score by
pushing the result towards the historical average. This improves the accuracy of
filtering for most email.

In comparison with the original AWL plugin, several conceptual changes were implemented
in TxRep:

1. B<Scoring> - at AWL, although it tracks the number of messages received from each
respective sender, when calculating the corrective score at a new message, it does
not take it in count in any way. So for example a sender who previously sent a single
ham message with the score of -5, and then sends a second one with the score of +10,
AWL will issue a corrective score bringing the score towards the -5. With the default
C<auto_welcomelist_factor> of 0.5, the resulting score would be only 2.5. And it would be
exactly the same even if the sender previously sent 1,000 messages with the average of
-5. TxRep tries to take the maximal advantage of the collected data, and adjusts the
final score not only with the mean reputation score stored in the database, but also
respecting the number of messages already seen from the sender. You can see the exact
formula in the section L</C<txrep_factor>>.

2. B<Learning> - AWL ignores any spam/ham learning. In fact it acts against it, which
often leads to a frustrating situation, where a user repeatedly tags all messages of a
given sender as spam (resp. ham), but at any new message from the sender, AWL will
adjust the score of the message back to the historical average which does B<not> include
the learned scores. This is now changed at TxRep, and every spam/ham learning will be
recorded in the reputation database, and hence taken in consideration at future email
from the respective sender. See the section L</"LEARNING SPAM / HAM"> for more details.

3. B<Auto-Learning> - in certain situations SpamAssassin may declare a message an
obvious spam resp. ham, and launch the auto-learning process, so that the message can be
re-evaluated. AWL, by design, did not perform any auto-learning adjustments. This plugin
will readjust the stored reputation by the value defined by L</C<txrep_learn_penalty>>
resp. L</C<txrep_learn_bonus>>. Auto-learning score thresholds may be tuned, or the
auto-learning completely disabled, through the setting L</C<txrep_autolearn>>.

4. B<Relearning> - messages that were wrongly learned or auto-learned, can be relearned.
Old reputations are removed from the database, and new ones added instead of them. The
relearning works better when message tracking is enabled through the
L</C<txrep_track_messages>> option. Without it, the relearned score is simply added to
the reputation, without removing the old ones.

5. B<Aging> - with AWL, any historical record of given sender has the same weight. It
means that changes in senders behavior, or modified SA rules may take long time, or
be virtually negated by the AWL normalization, especially at senders with high count
of past messages, and low recent frequency. It also turns to be particularly
counterproductive when the administrator detects new patterns in certain messages, and
applies new rules to better tag such messages as spam or ham. AWL will practically
eliminate the effect of the new rules, by adjusting the score back towards the (wrong)
historical average. Only setting the C<auto_welcomelist_factor> lower would help, but in
the same time it would also reduce the overall impact of AWL, and put doubts on its
purpose. TxRep, besides the L</C<txrep_factor>> (replacement of the C<auto_welcomelist_factor>),
introduces also the L</C<txrep_dilution_factor>> to help coping with this issue by
progressively reducing the impact of past records. More details can be found in the
description of the factor below.

6. B<Blocklisting and Welcomelisting> - when a welcomelisting or blocklisting was requested
through SpamAssassin's API, AWL adjusts the historical total score of the plain email
address without IP (and deleted records bound to an IP), but since during the reception 
new records with IP will be added, the blocklisted entry would cease acting during 
scanning. TxRep always uses the record of the plain email address without IP together 
with the one bound to an IP address, DKIM signature, or SPF pass (unless the weight 
factor for the EMAIL reputation is set to zero). AWL uses the score of 100 (resp. -100) 
for the blocklisting (resp. welcomelisting) purposes. TxRep increases the value 
proportionally to the weight factor of the EMAIL reputation. It is explained in details 
in the section L<BLOCKLISTING / WELCOMELISTING>. TxRep can blocklist or welcomelist also
IP addresses, domain names, and dotless HELO names.

7. B<Sender Identification> - AWL identifies a sender on the basis of the email address
used, and the originating IP address (better told its part defined by the mask setting).
The main purpose of this measure is to avoid assigning false good scores to spammers who
spoof known email addresses. The disadvantage appears at senders who send from frequently
changing locations or even when connecting through dynamical IP addresses that are not
within the block defined by the mask setting. Their score is difficult or sometimes
impossible to track. Another disadvantage is, for example, at a spammer persistently
sending spam from the same IP address, just under different email addresses. AWL will not
find his previous scores, unless he reuses the same email address again. TxRep uses several
identificators, and creates separate database entries for each of them. It tracks not only
the email/IP address combination like AWL, but also the standalone email address (regardless
of the originating IP), the standalone IP (regardless of email address used), the domain

lib/Mail/SpamAssassin/Plugin/TxRep.pm view on Meta::CPAN

    else {
      my $origip_obj = NetAddr::IP->new($origip . '/' . $mask_len);
      if (!defined $origip_obj) {                       # invalid IPv4 address
        dbg("TxRep: bad IPv4 address $origip");
      } else {
        $result =        $origip_obj->network->addr;
        $result =~s/(\.0){1,3}\z//;                     # truncate zero tail
      }
    }
  } elsif (index($origip, ':') >= 0 &&                            # triage
           $origip =~
           /^ [0-9a-f]{0,4} (?: : [0-9a-f]{0,4} | \. [0-9]{1,3} ){2,9} $/xsi) {
    # looks like an IPv6 address
    my $mask_len = $self->{conf}->{txrep_ipv6_mask_len};
    $mask_len = 48  if !defined $mask_len;
    my $origip_obj = NetAddr::IP->new6($origip . '/' . $mask_len);
    if (!defined $origip_obj) {                         # invalid IPv6 address
      dbg("TxRep: bad IPv6 address $origip");
    } else {
      $result = $origip_obj->network->full6;            # string in a canonical form
      $result =~ s/(:0000){1,7}\z/::/;                  # compress zero tail
    }
  } else {
    dbg("TxRep: bad IP address $origip");
  }
  if (defined $result && length($result) > 39) {        # just in case, keep under
    $result = substr($result,0,39);                     # the awl.ip field size
  }
# if (defined $result) {dbg("TxRep: IP masking %s -> %s", $origip || '?', $result || '?');}
  return $result;
}


###########################################################################
sub pack_addr {
###########################################################################
  my ($self, $addr, $origip) = @_;

  $addr = lc $addr;
  $addr =~ s/[\000\;\'\"\!\|]/_/gs;                     # paranoia

  if ( defined $origip) {$origip = $self->ip_to_awl_key($origip);}
  if (!defined $origip) {$origip = 'none';}
  if ( $self->{conf}->{txrep_welcomelist_out} &&
    defined $self->{pms}->{relays_internal} &&  @{$self->{pms}->{relays_internal}} &&
    (!defined $self->{pms}->{relays_external} || !@{$self->{pms}->{relays_external}})
    and $addr =~ /(?:[^\s\@]+)\@(?:[^\s\@]+)/) {
      $origip = 'WELCOMELIST_OUT';
  }
  return $addr . "|ip=" . $origip;
}


=head1 LEARNING SPAM / HAM

When SpamAssassin is told to learn (or relearn) a given message as spam or
ham, all reputations relevant to the message (email, email_ip, domain, ip, helo)
in both global and user storages will be updated using the C<txrep_learn_penalty>
respectively the C<rxrep_learn_bonus> values. The new reputation of given sender
property (email, domain,...) will be the respective result of one of the following
formulas:

   new_reputation = old_reputation + learn_penalty
   new_reputation = old_reputation - learn_bonus

The TxRep plugin currently does track each message individually, hence it
does not detect when you learn the message repeatedly. It will add/subtract
the penalty/bonus score each time the message is fed to the spam learner.

=cut

######################################################### plugin hook #####
sub learner_new {
###########################################################################
  my ($self) = @_;

  $self->{txKeepStoreTied} = undef;
  return $self;
}


######################################################### plugin hook #####
sub autolearn {
###########################################################################
  my ($self, $params) = @_;

  $self->{last_pms} = $params->{permsgstatus};
  return $self->{autolearn} = 1;
}


######################################################### plugin hook #####
sub learn_message {
###########################################################################
  my ($self, $params) = @_;
  return 0 unless (defined $params->{isspam});

  dbg("TxRep: learning a message");
  my $pms = ($self->{last_pms})? $self->{last_pms} : Mail::SpamAssassin::PerMsgStatus->new($self->{main}, $params->{msg});
  if (!defined $pms->{relays_internal} && !defined $pms->{relays_external}) {
    $pms->extract_message_metadata();
  }

  if ($params->{isspam})
        {$self->{learning} =      $self->{conf}->{txrep_learn_penalty};}
  else  {$self->{learning} = -1 * $self->{conf}->{txrep_learn_bonus};}

  my $ret = !$self->{learning} || $self->check_senders_reputation($pms);
  $self->{learning} = undef;
  return $ret;
}


######################################################### plugin hook #####
sub forget_message {
###########################################################################
  my ($self, $params) = @_;
  return 0 unless ($self->{conf}->{use_txrep});
  my $pms = ($self->{last_pms})? $self->{last_pms} : Mail::SpamAssassin::PerMsgStatus->new($self->{main}, $params->{msg});

  dbg("TxRep: forgetting a message");

( run in 0.587 second using v1.01-cache-2.11-cpan-9581c071862 )