AI-MaxEntropy
lib/AI/MaxEntropy.pm
use strict;
use warnings;
package AI::MaxEntropy;
use Algorithm::LBFGS;
use AI::MaxEntropy::Model;
use XSLoader;
our $VERSION = '0.20';
XSLoader::load('AI::MaxEntropy', $VERSION);
sub new {
    my $class = shift;
    my $self = {
        smoother  => {},
        algorithm => {},
        @_,
        samples   => [],
        x_bucket  => {},
        y_bucket  => {},
        x_list    => [],
        y_list    => [],
        x_num     => 0,
        y_num     => 0,
        f_num     => 0,
        af_num    => 0,
        f_freq    => [],
        f_map     => [],
        last_cut  => -1,
        _c        => {}
    };
    return bless $self, $class;
}
sub see {
    my ($self, $x, $y, $w) = @_;
    $w = 1 if not defined($w);
    my ($x1, $y1) = ([], undef);
    # preprocess if $x is a hash ref: each attr => value pair becomes an
    # "attr:value" string; an array ref value yields one "attr:element"
    # string per element
    $x = [
        map {
            my $attr = $_;
            ref($x->{$attr}) eq 'ARRAY'
                ? map { "$attr:$_" } @{$x->{$attr}}
                : "$attr:$x->{$attr}"
        } keys %$x
    ] if ref($x) eq 'HASH';
    # update af_num (-1 marks samples with differing numbers of active
    # features, which rules out GIS)
    $self->{af_num} = scalar(@$x) if $self->{af_num} == 0;
    $self->{af_num} = -1 if $self->{af_num} != scalar(@$x);
    # convert y from string to ID
    my $y_id = $self->{y_bucket}->{$y};
    # new y
    if (!defined($y_id)) {
        # update y_list, y_num, y_bucket, f_freq
        push @{$self->{y_list}}, $y;
        $self->{y_num} = scalar(@{$self->{y_list}});
        $y_id = $self->{y_num} - 1;
        $self->{y_bucket}->{$y} = $y_id;
        # a fresh frequency row for the new y, one zero per known x
        push @{$self->{f_freq}}, [map { 0 } (1 .. $self->{x_num})];
        # save ID
        $y1 = $y_id;
    }
    # old y
    else { $y1 = $y_id }
    # convert x from strings to IDs
    for my $x_str (@$x) {
        my $x_id = $self->{x_bucket}->{$x_str};
        # new x
        if (!defined($x_id)) {
            # update x_list, x_num, x_bucket, f_freq
            push @{$self->{x_list}}, $x_str;
            $self->{x_num} = scalar(@{$self->{x_list}});
            $x_id = $self->{x_num} - 1;
            $self->{x_bucket}->{$x_str} = $x_id;
            # append a zero column for the new x to every y's row
            push @{$self->{f_freq}->[$_]}, 0 for (0 .. $self->{y_num} - 1);
            # save ID
            push @$x1, $x_id;
        }
        # old x
        else { push @$x1, $x_id }
        # update f_freq
        $self->{f_freq}->[$y_id]->[$x_id] += $w;
    }
    # add the sample
    push @{$self->{samples}}, [$x1, $y1, $w];
    $self->{last_cut} = -1;
}
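For illustration, a minimal standalone sketch (not part of the module) of the attribute expansion that C<see> performs when C<$x> is a hash ref; the sample data here is hypothetical:

```perl
use strict;
use warnings;

# Sketch of see()'s hash-ref preprocessing: each attr => value pair
# becomes an "attr:value" string; array-ref values expand to one
# string per element (keys are sorted here only to fix the order).
my $x = { color => ['red', 'round'], size => 'big' };
my @flat = map {
    my $attr = $_;
    ref($x->{$attr}) eq 'ARRAY'
        ? map { "$attr:$_" } @{$x->{$attr}}
        : "$attr:$x->{$attr}"
} sort keys %$x;
print "@flat\n";    # color:red color:round size:big
```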
sub cut {
    my ($self, $t) = @_;
    $self->{f_num} = 0;
    for my $y (0 .. $self->{y_num} - 1) {
        for my $x (0 .. $self->{x_num} - 1) {
            if ($self->{f_freq}->[$y]->[$x] >= $t) {
                $self->{f_map}->[$y]->[$x] = $self->{f_num};
                $self->{f_num}++;
            }
            else { $self->{f_map}->[$y]->[$x] = -1 }
        }
will cut all features that occur less than one time.
=head2 learn

Learns a model from all the samples that the learner has seen so far and
returns an L<AI::MaxEntropy::Model> object, which can be used to make
predictions on unlabeled samples.

  ...
  my $model = $me->learn;
  print $model->predict(['x1', 'x2', ...]);
=head1 PROPERTIES

=head2 algorithm

This property enables the client program to choose among different
algorithms for learning the ME model and to set their parameters.

There are mainly three algorithms for learning ME models: GIS, IIS and
L-BFGS. This module implements two of them, namely L-BFGS and GIS.
L-BFGS provides full functionality, while GIS runs faster but is only
applicable in limited scenarios.

To use GIS, the following conditions must be satisfied:

  1. All samples have the same number of active features
  2. No feature has been cut
  3. No smoother is used (in fact, the property L</smoother> is simply
     ignored when the algorithm type equals 'gis')
The property C<algorithm> is supposed to be a hash ref, like

  {
      type => ...,
      progress_cb => ...,
      param_1 => ...,
      param_2 => ...,
      ...,
      param_n => ...
  }
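As a concrete sketch (the parameter values here are illustrative, not recommendations), a learner configured to use GIS with a tighter precision and verbose progress reporting might look like this:

```perl
use strict;
use warnings;
use AI::MaxEntropy;

# Hypothetical configuration: GIS with a tighter epsilon than the
# default 1e-3, and the built-in verbose progress callback.
my $me = AI::MaxEntropy->new(
    algorithm => {
        type        => 'gis',
        progress_cb => 'verbose',
        epsilon     => 1e-4
    }
);
```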
=head3 type

The entry C<type =E<gt> ...> specifies which algorithm is used for the
optimization. Valid values include:

  'lbfgs'    Limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS)
  'gis'      Generalized Iterative Scaling (GIS)

If omitted, C<'lbfgs'> is used by default.
=head3 progress_cb

The entry C<progress_cb =E<gt> ...> specifies the progress callback
subroutine, which is used to trace the progress of the algorithm.
The specified callback routine is called at each iteration of the
algorithm.

For L-BFGS, C<progress_cb> is passed directly to
L<Algorithm::LBFGS/fmin>; there, C<f(x)> is the negative log-likelihood
of the current lambda vector.

For GIS, C<progress_cb> is supposed to have a prototype like

  progress_cb(i, lambda, d_lambda, lambda_norm, d_lambda_norm)

C<i> is the iteration number, C<lambda> is an array ref containing the
current lambda vector, C<d_lambda> is an array ref containing the delta
of the lambda vector in the current iteration, and C<lambda_norm> and
C<d_lambda_norm> are the Euclidean norms of C<lambda> and C<d_lambda>
respectively.

For both L-BFGS and GIS, the client program can also pass the string
C<'verbose'> as C<progress_cb> to use a default progress callback,
which simply prints the progress on the screen.
C<progress_cb> can be omitted entirely if the client program does not
want to trace the progress.
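A custom GIS progress callback following the prototype above could be sketched like this (the formatting and the decision to ignore the lambda vectors themselves are this example's own choices):

```perl
use strict;
use warnings;
use AI::MaxEntropy;

# Hypothetical custom progress callback for GIS: print the iteration
# number, the norm of the lambda vector, and how far it moved.
my $me = AI::MaxEntropy->new(
    algorithm => {
        type        => 'gis',
        progress_cb => sub {
            my ($i, $lambda, $d_lambda, $lambda_norm, $d_lambda_norm) = @_;
            printf "iter %3d: |lambda| = %.6f, |d_lambda| = %.6f\n",
                $i, $lambda_norm, $d_lambda_norm;
        }
    }
);
```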
=head3 parameters

The remaining entries are parameters for the chosen algorithm.
Each parameter is assigned its default value when it is not given
explicitly.

For L-BFGS, the parameters are passed directly to the underlying
L<Algorithm::LBFGS> object; please refer to L<Algorithm::LBFGS/Parameters>
for details.

For GIS, there is only one parameter, C<epsilon>, which controls the
precision of the algorithm (similar to the C<epsilon> in
L<Algorithm::LBFGS>). Generally speaking, a smaller C<epsilon> produces
a more precise result. The default value of C<epsilon> is 1e-3.
=head2 smoother

A smoother is a solution to the over-fitting problem.
This property chooses which type of smoother the client program wants to
apply and sets its smoothing parameters.

Only one smoother has been implemented in this version of the module,
the Gaussian smoother. One can apply the Gaussian smoother as follows:

  my $me = AI::MaxEntropy->new(
      smoother => { type => 'gaussian', sigma => 0.6 }
  );

The parameter C<sigma> indicates the strength of smoothing.
Usually, sigma is a positive number no greater than 1.0.
The strength of smoothing grows as sigma approaches 0.
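To make the direction of the C<sigma> parameter concrete, here is a sketch contrasting two settings (the particular values 0.1 and 1.0 are illustrative only):

```perl
use strict;
use warnings;
use AI::MaxEntropy;

# Sigma close to 0 smooths aggressively...
my $strong = AI::MaxEntropy->new(
    smoother => { type => 'gaussian', sigma => 0.1 }
);

# ...while sigma near 1.0 smooths only lightly.
my $weak = AI::MaxEntropy->new(
    smoother => { type => 'gaussian', sigma => 1.0 }
);
```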
=head1 SEE ALSO
L<AI::MaxEntropy::Model>, L<AI::MaxEntropy::Util>