Algorithm-ExpectationMaximization

 view release on metacpan or  search on metacpan

examples/canned_example1.pl  view on Meta::CPAN

#  is for visualizing the raw data.  At the third location, the mask is for
#  visualizing the final clusters.  The mask shown below means that the symbolic data
#  for each data record is in the first column, and that the next three columns are
#  to be used for clustering.
my $mask = "N11";    

my $clusterer = Algorithm::ExpectationMaximization->new(
                                datafile            => $datafile,
                                mask                => $mask,
                                K                   => 3,
                                max_em_iterations   => 300,
                                seeding             => 'random',
                                terminal_output     => 1,
                                debug               => 0,
                );

$clusterer->read_data_from_file();

# For visualizing the raw data:
my $data_visualization_mask = "11";
$clusterer->visualize_data($data_visualization_mask);

examples/canned_example1.pl  view on Meta::CPAN

print "----------------------------------------------------\n\n";

# CLUSTER VISUALIZATION:

my $visualization_mask = "11"; 

$clusterer->visualize_clusters($visualization_mask);
$clusterer->visualize_distributions($visualization_mask);
$clusterer->plot_hardcopy_clusters($visualization_mask);
$clusterer->plot_hardcopy_distributions($visualization_mask);
$clusterer->display_fisher_quality_vs_iterations();
$clusterer->display_mdl_quality_vs_iterations();
my $estimated_priors = $clusterer->return_estimated_priors();
print "Estimated class priors: @$estimated_priors\n";

examples/canned_example2.pl  view on Meta::CPAN



my $datafile = "mydatafile2.dat";          # from param2.txt

my $mask = "N11";                      

my $clusterer = Algorithm::ExpectationMaximization->new(
                                datafile            => $datafile,
                                mask                => $mask,
                                K                   => 2,
                                max_em_iterations   => 300,
                                seeding             => 'random',
#                               seeding             => 'kmeans',
                                terminal_output     => 1,
                                debug               => 0,
                );

$clusterer->read_data_from_file();

my $data_visualization_mask = "11";
$clusterer->visualize_data($data_visualization_mask);

examples/canned_example2.pl  view on Meta::CPAN

print "----------------------------------------------------\n\n";

# VISUALIZATION:

my $visualization_mask = "11"; 

$clusterer->visualize_clusters($visualization_mask);
$clusterer->visualize_distributions($visualization_mask);
$clusterer->plot_hardcopy_clusters($visualization_mask);
$clusterer->plot_hardcopy_distributions($visualization_mask);
$clusterer->display_fisher_quality_vs_iterations();
$clusterer->display_mdl_quality_vs_iterations();
my $estimated_priors = $clusterer->return_estimated_priors();
print "Estimated class priors: @$estimated_priors\n";

examples/canned_example3.pl  view on Meta::CPAN



my $datafile = "mydatafile3.dat";  

my $mask = "N11";  

my $clusterer = Algorithm::ExpectationMaximization->new(
                                datafile            => $datafile,
                                mask                => $mask,
                                K                   => 3,
                                max_em_iterations   => 300,
                                seeding             => 'manual',
                                seed_tags           => ['a7', 'b3', 'c4'],
                                terminal_output     => 1,
                                debug               => 0,
                );

$clusterer->read_data_from_file();

my $data_visualization_mask = "11";
$clusterer->visualize_data($data_visualization_mask);

examples/canned_example3.pl  view on Meta::CPAN

print "----------------------------------------------------\n\n";

# VISUALIZATION:

my $visualization_mask = "11"; 

$clusterer->visualize_clusters($visualization_mask);
$clusterer->visualize_distributions($visualization_mask);
$clusterer->plot_hardcopy_clusters($visualization_mask);
$clusterer->plot_hardcopy_distributions($visualization_mask);
$clusterer->display_fisher_quality_vs_iterations();
$clusterer->display_mdl_quality_vs_iterations();
my $estimated_priors = $clusterer->return_estimated_priors();
print "Estimated class priors: @$estimated_priors\n";

examples/canned_example4.pl  view on Meta::CPAN

my $datafile = "mydatafile4.dat";  
#my $datafile = "sphericaldata.csv"; 


my $mask = "N111";  

my $clusterer = Algorithm::ExpectationMaximization->new(
                                datafile            => $datafile,
                                mask                => $mask,
                                K                   => 3,
                                max_em_iterations   => 300,
                                seeding             => 'random',
                                terminal_output     => 1,
                                debug               => 0,
                );

$clusterer->read_data_from_file();

my $data_visualization_mask = "111";
$clusterer->visualize_data($data_visualization_mask);
$clusterer->plot_hardcopy_data($data_visualization_mask);

examples/canned_example4.pl  view on Meta::CPAN

print "----------------------------------------------------\n\n";

# VISUALIZATION:

my $visualization_mask = "111"; 

$clusterer->visualize_clusters($visualization_mask);
$clusterer->visualize_distributions($visualization_mask);
$clusterer->plot_hardcopy_clusters($visualization_mask);
$clusterer->plot_hardcopy_distributions($visualization_mask);
$clusterer->display_fisher_quality_vs_iterations();
$clusterer->display_mdl_quality_vs_iterations();
my $estimated_priors = $clusterer->return_estimated_priors();
print "Estimated class priors: @$estimated_priors\n";

examples/canned_example5.pl  view on Meta::CPAN



my $datafile = "mydatafile5.dat";  

my $mask = "N111";  

my $clusterer = Algorithm::ExpectationMaximization->new(
                                datafile              => $datafile,
                                mask                  => $mask,
                                K                     => 3,
                                max_em_iterations     => 300,
                                seeding               => 'random',
                                terminal_output       => 1,
                                debug                 => 0,
                );

$clusterer->read_data_from_file();

my $data_visualization_mask = "111";
$clusterer->visualize_data($data_visualization_mask);
$clusterer->plot_hardcopy_data($data_visualization_mask);

examples/canned_example5.pl  view on Meta::CPAN

print "----------------------------------------------------\n\n";

# VISUALIZATION:

my $visualization_mask = "111"; 

$clusterer->visualize_clusters($visualization_mask);
$clusterer->visualize_distributions($visualization_mask);
$clusterer->plot_hardcopy_clusters($visualization_mask);
$clusterer->plot_hardcopy_distributions($visualization_mask);
$clusterer->display_fisher_quality_vs_iterations();
$clusterer->display_mdl_quality_vs_iterations();
my $estimated_priors = $clusterer->return_estimated_priors();
print "Estimated class priors: @$estimated_priors\n";

examples/canned_example6.pl  view on Meta::CPAN


my $datafile = "mydatafile7.dat";


my $mask = "N1";    

my $clusterer = Algorithm::ExpectationMaximization->new(
                                datafile            => $datafile,
                                mask                => $mask,
                                K                   => 2,
                                max_em_iterations   => 300,
                                seeding             => 'random',
                                terminal_output     => 1,
                                debug               => 0,
                );

$clusterer->read_data_from_file();

my $data_visualization_mask = "1";
$clusterer->visualize_data($data_visualization_mask);
$clusterer->plot_hardcopy_data($data_visualization_mask);

examples/canned_example6.pl  view on Meta::CPAN



# VISUALIZATION:

my $visualization_mask = "1"; 

$clusterer->visualize_clusters($visualization_mask);
$clusterer->visualize_distributions($visualization_mask);
$clusterer->plot_hardcopy_clusters($visualization_mask);
$clusterer->plot_hardcopy_distributions($visualization_mask);
$clusterer->display_fisher_quality_vs_iterations();
$clusterer->display_mdl_quality_vs_iterations();
my $estimated_priors = $clusterer->return_estimated_priors();
print "Estimated class priors: @$estimated_priors\n";

lib/Algorithm/ExpectationMaximization.pm  view on Meta::CPAN

    croak "\nYou have used a wrong name for a keyword argument " .
          "--- perhaps a misspelling\n" 
          if check_for_illegal_params(@params) == 0;
    bless {
        _datafile         =>   $args{datafile} || croak("datafile required"),
        _mask             =>   $args{mask}     || croak("mask required"),
        _K                =>   $args{K}       || croak("number of clusters required"),
        _terminal_output  =>   $args{terminal_output}   || 0,
        _seeding          =>   $args{seeding}           || 'random',
        _seed_tags        =>   $args{seed_tags}         || [],
        _max_em_iterations=>   $args{max_em_iterations} || 100,
        _class_priors     =>   $args{class_priors}      || [],
        _debug            =>   $args{debug}             || 0,
        _N                =>   0,
        _data             =>   {},
        _data_id_tags     =>   [],
        _clusters         =>   [],
        _cluster_centers  =>   [],
        _data_dimensions  =>   0,
        _cluster_normalizers            => [],
        _cluster_means                  => [],
        _cluster_covariances            => [],
        _class_labels_for_data          => {},
        _class_probs_at_each_data_point => {},
        _expected_class_probs           => {}, 
        _old_priors                     => [],
        _old_old_priors                 => [],
        _fisher_quality_vs_iteration    => [],
        _mdl_quality_vs_iterations      => [],
    }, $class;
}


sub read_data_from_file {
    my $self = shift;
    my $filename = $self->{_datafile};
    $self->read_data_from_file_csv() if $filename =~ /.csv$/;
    $self->read_data_from_file_dat() if $filename =~ /.dat$/;
}

lib/Algorithm/ExpectationMaximization.pm  view on Meta::CPAN

             "the number of clusters requested for the clusters to be " .
             "meaningful $!" 
                         if ( $self->{_N} < (2 * $self->{_K} ** 2) );
    }
}


# This is the heart of the module --- in the sense that this method implements the EM
# algorithm for the estimating the parameters of a Gaussian mixture model for the
# data. In the implementation shown below, we declare convergence for the EM
# algorithm when the change in the class priors over three iterations falls below a
# threshold.  The current value of this threshold, as can be seen in the function
# compare_array_floats(), is 0.00001.
sub EM {
    my $self = shift;
    $self->initialize_class_priors();
    for (my $em_iteration=0; $em_iteration < $self->{_max_em_iterations}; 
                                                             $em_iteration++) {
        if ($em_iteration == 0) {
            print "\nSeeding the EM algorithm with:\n";                     
            $self->display_seeding_stats();
            print "\nFinished displaying the seeding information\n";
            print "\nWill print out a dot for each iteration of EM:\n\n";
        }
        my $progress_indicator = $em_iteration % 5 == 0 ? $em_iteration : ".";
        print $progress_indicator;
        foreach my $data_id (@{$self->{_data_id_tags}}) {

lib/Algorithm/ExpectationMaximization.pm  view on Meta::CPAN

        foreach my $cluster_index(0..$self->{_K}-1) {      
            $self->{_class_priors}->[$cluster_index] = 
                 $self->{_cluster_normalizers}->[$cluster_index] / $self->{_N};
        }
        my @priors = @{$self->{_class_priors}};
        print "\nUpdated priors: @priors\n\n\n" if $self->{_debug};
        push @{$self->{_fisher_quality_vs_iteration}}, 
                                                 $self->clustering_quality_fisher();
        push @{$self->{_mdl_quality_vs_iteration}}, $self->clustering_quality_mdl();
        if ( ($em_iteration > 5 && $self->reached_convergence()) 
             || ($em_iteration ==  $self->{_max_em_iterations} - 1) ) {
            my @old_old_priors = @{$self->{_old_old_priors}};  
            my @old_priors = @{$self->{_old_priors}};
            print "\n\nPrevious to previous priors:       @old_old_priors\n";
            print "Previous priors:                   @old_priors\n";
            print "Current class priors:              @{$self->{_class_priors}}\n";
            print "\n\nCONVERGENCE ACHIEVED AT ITERATION $em_iteration\n\n"
                if $em_iteration < $self->{_max_em_iterations} - 1; 
            last;
        }
    }
    print "\n\n\n";
}

sub reached_convergence {
    my $self = shift;
    return 1 if compare_array_floats($self->{_old_old_priors}, 
                                     $self->{_old_priors}) 

lib/Algorithm/ExpectationMaximization.pm  view on Meta::CPAN

    my $self = shift;
    foreach my $cluster_index(0..$self->{_K}-1) {      
        print "\nSeeding for cluster $cluster_index:\n";
        my $mean = $self->{_cluster_means}->[$cluster_index];
        display_matrix("The mean is: ", $mean);        
        my $covariance = $self->{_cluster_covariances}->[$cluster_index];
        display_matrix("The covariance is: ", $covariance);
    }
}

sub display_fisher_quality_vs_iterations {
    my $self = shift;
    print "\n\nFisher Quality vs. Iterations: " .
                      "@{$self->{_fisher_quality_vs_iteration}}\n\n";
}

sub display_mdl_quality_vs_iterations {
    my $self = shift;
    print "\n\nMDL Quality vs. Iterations: @{$self->{_mdl_quality_vs_iteration}}\n\n";
}

sub find_prob_at_each_datapoint_for_given_mean_and_covar {
    my $self = shift;
    my $mean_vec_ref = shift;
    my $covar_ref = shift;
    foreach my $data_id (keys %{$self->{_data}}) {
        my $data_vec = Math::GSL::Matrix->new($self->{_data_dimensions},1);        

lib/Algorithm/ExpectationMaximization.pm  view on Meta::CPAN

    my $self = shift;
    my $K = $self->{_K};
    $self->cluster_for_fixed_K_single_smart_try($K);
    if ((defined $self->{_clusters}) && (defined $self->{_cluster_centers})){
        return ($self->{_clusters}, $self->{_cluster_centers});
    } else {
        die "kmeans clustering failed.";
    }
}

# Used by the kmeans algorithm for the initialization of the EM iterations.  We do
# initial kmeans cluster seeding by subjecting the data to principal components
# analysis in order to discover the direction of maximum variance in the data space.
# Subsequently, we try to find the K largest peaks along this direction.  The
# coordinates of these peaks serve as the seeds for the K clusters.
sub cluster_for_fixed_K_single_smart_try {
    my $self = shift;
    my $K = shift;
    print "Clustering for K = $K\n" if $self->{_terminal_output};
    my ($clusters, $cluster_centers) =
                              $self->cluster_for_given_K($K);

lib/Algorithm/ExpectationMaximization.pm  view on Meta::CPAN

        my ($min, $best_center_index) = minimum( \@dist_from_clust_centers );
        push @{$clusters[$best_center_index]}, $ele;
    }
    return \@clusters;
}    

# Used by the kmeans part of the code: This is the main routine that along with the
# update_cluster_centers() routine constitute the two key steps of the K-Means
# algorithm.  In most cases, the infinite while() loop will terminate automatically
# when the cluster assignments of the data points remain unchanged. For the sake of
# safety, we keep track of the number of iterations. If this number reaches 100, we
# exit the while() loop anyway.  In most cases, this limit will not be reached.
sub assign_data_to_clusters {
    my $self = shift;
    my $clusters = shift;
    my $K = shift;
    my $final_cluster_centers;
    my $iteration_index = 0;
    while (1) {
        my $new_clusters;
        my $assignment_changed_flag = 0;

lib/Algorithm/ExpectationMaximization.pm  view on Meta::CPAN

    }
    return ($mean, $variance);
}

sub check_for_illegal_params {
    my @params = @_;
    my @legal_params = qw / datafile
                            mask
                            K
                            terminal_output
                            max_em_iterations
                            seeding
                            class_priors
                            seed_tags
                            debug
                          /;
    my $found_match_flag;
    foreach my $param (@params) {
        foreach my $legal (@legal_params) {
            $found_match_flag = 0;
            if ($param eq $legal) {

lib/Algorithm/ExpectationMaximization.pm  view on Meta::CPAN

  my $mask = "N0111";

  #  Now construct an instance of the clusterer.  The parameter `K' controls the
  #  number of clusters.  Here is an example call to the constructor for instance
  #  creation:

  my $clusterer = Algorithm::ExpectationMaximization->new(
                                      datafile            => $datafile,
                                      mask                => $mask,
                                      K                   => 3,
                                      max_em_iterations   => 300,
                                      seeding             => 'random',
                                      terminal_output     => 1,
                  );
 
  #  Note the choice for `seeding'. The choice `random' means that the clusterer will
  #  randomly select `K' data points to serve as initial cluster centers.  Other
  #  possible choices for the constructor parameter `seeding' are `kmeans' and
  #  `manual'.  With the `kmeans' option for `seeding', the output of a K-means
  #  clusterer is used for the cluster seeds and the initial cluster covariances.  If
  #  you use the `manual' option for seeding, you must also specify the data elements

lib/Algorithm/ExpectationMaximization.pm  view on Meta::CPAN

  #  Here is an example of a call to the constructor when we choose the `manual'
  #  option for seeding the clusters and for specifying the data elements for
  #  seeding.  The data elements are specified by their tag names.  In this case,
  #  these names are `a26', `b53', and `c49':

  my $clusterer = Algorithm::ExpectationMaximization->new(
                                      datafile            => $datafile,
                                      mask                => $mask,
                                      class_priors        => [0.6, 0.2, 0.2],
                                      K                   => 3,
                                      max_em_iterations   => 300,
                                      seeding             => 'manual',
                                      seed_tags           => ['a26', 'b53', 'c49'],
                                      terminal_output     => 1,
                                    );

  #  This example call to the constructor also illustrates how you can inject class
  #  priors into the clustering process. The class priors are the prior probabilities
  #  of the class distributions in your dataset.  As explained later, injecting class
  #  priors in the manner shown above makes statistical sense only for the case of
  #  manual seeding.  When you do inject class priors, the order in which the priors

lib/Algorithm/ExpectationMaximization.pm  view on Meta::CPAN

data point, whose values dictate the selection of the Gaussian for that data point.
This point is explained in great detail in my on-line tutorial at
L<https://engineering.purdue.edu/kak/Tutorials/ExpectationMaximization.pdf>.

The EM algorithm in our context reduces to an iterative invocation of the following
steps: (1) Given the current guess for the means and the covariances of the different
Gaussians in our mixture model, use Bayes' Rule to update the posterior class
probabilities at each of the data points; (2) Using the updated posterior class
probabilities, first update the class priors; (3) Using the updated class priors,
update the class means and the class covariances; and go back to Step (1).  Ideally,
the iterations should terminate when the expected log-likelihood of the observed data
has reached a maximum and does not change with any further iterations.  The stopping
rule used in this module is the detection of no change over three consecutive
iterations in the values calculated for the priors.

This module provides three different choices for seeding the clusters: (1) random,
(2) kmeans, and (3) manual.  When random seeding is chosen, the algorithm randomly
selects C<K> data elements as cluster seeds.  That is, the data vectors associated
with these seeds are treated as initial guesses for the means of the Gaussian
distributions.  The covariances are then set to the values calculated from the entire
dataset with respect to the means corresponding to the seeds. With kmeans seeding, on
the other hand, the means and the covariances are set to whatever values are returned
by the kmeans algorithm.  And, when seeding is set to manual, you are allowed to
choose C<K> data elements --- by specifying their tag names --- for the seeds.  The

lib/Algorithm/ExpectationMaximization.pm  view on Meta::CPAN


A call to C<new()> constructs a new instance of the
C<Algorithm::ExpectationMaximization> class.  A typical form
of this call when you want to use random option for seeding
the algorithm is:

    my $clusterer = Algorithm::ExpectationMaximization->new(
                                datafile            => $datafile,
                                mask                => $mask,
                                K                   => 3,
                                max_em_iterations   => 300,
                                seeding             => 'random',
                                terminal_output     => 1,
                    );

where C<K> is the expected number of clusters and
C<max_em_iterations> the maximum number of EM iterations
that you want to allow until convergence is achieved.
Depending on your dataset and on the choice of the initial
seeds, the actual number of iterations used could be as few
as 10 and as many as reaching 300.  The output produced by
the algorithm shows the actual number of iterations used to
arrive at convergence.

The data file supplied through the C<datafile> option is
expected to contain entries in the following format

   c20  0  10.7087017086940  9.63528386251712  10.9512155258108  ...
   c7   0  12.8025925026787  10.6126270065785  10.5228482095349  ...
   b9   0  7.60118206283120  5.05889245193079  5.82841781759102  ...
   ....
   ....

lib/Algorithm/ExpectationMaximization.pm  view on Meta::CPAN

based seeding, just replace the option C<random> supplied
for C<seeding> by C<kmeans>.  You can also do manual seeding
by designating a specified set of data elements to serve as
cluster seeds.  The call to the constructor in this case
looks like

    my $clusterer = Algorithm::ExpectationMaximization->new(
                                datafile            => $datafile,
                                mask                => $mask,
                                K                   => 3,
                                max_em_iterations   => 300,
                                seeding             => 'manual',
                                seed_tags           => ['a26', 'b53', 'c49'],
                                terminal_output     => 1,
                    );

where the option C<seed_tags> is set to an anonymous array
of symbolic names associated with the data elements.

If you know the class priors, you can supply them through an
additional option to the constructor that looks like

lib/Algorithm/ExpectationMaximization.pm  view on Meta::CPAN

=item B<plot_hardcopy_distributions($visualization_mask):>

    $clusterer->plot_hardcopy_distributions($visualization_mask);

This method create a PNG file from the C<gnuplot> created
display of the clusters that correspond to the posterior
class probabilities exceeding a specified threshold. The
plotting functionality of C<gnuplot> is accessed through the
Perl wrappers provided by the C<Graphics::GnuplotIF> module.

=item B<display_fisher_quality_vs_iterations():>

    $clusterer->display_fisher_quality_vs_iterations();

This method measures the quality of clustering by
calculating the normalized average squared distance between
the cluster centers, the normalization being provided by the
average cluster covariance. See the Description for further
details.  In general, this measure is NOT useful for
overlapping clusters.

=item B<display_mdl_quality_vs_iterations():>

    $clusterer->display_mdl_quality_vs_iterations();

At the end of each iteration, this method measures the
quality of clustering my calculating its MDL (Minimum
Description Length).  As stated earlier in Description, the
MDL measure is a difference of a log-likelihood term for all
of the observed data and a model-complexity penalty term.
The smaller the value returned by this method, the better
the clustering.

=item B<return_estimated_priors():>



( run in 1.019 second using v1.01-cache-2.11-cpan-96521ef73a4 )