Algorithm-ExpectationMaximization


examples/data_generator.pl

use strict;
use Algorithm::ExpectationMaximization;

# The Parameter File:

# How the synthetic data is generated for clustering is
# controlled entirely by the input_parameter_file keyword in
# the function call shown below.  The class prior
# probabilities, the mean vectors, and the covariance matrix
# entries in the file must follow the syntax shown in the
# example param.txt file.  It is best to edit that file as
# needed for the purpose of data generation.

#my $parameter_file = "param1.txt";             #2D
#my $parameter_file = "param2.txt";             #2D     
#my $parameter_file = "param3.txt";             #2D
#my $parameter_file = "param4.txt";             #3D    
#my $parameter_file = "param5.txt";             #3D    
#my $parameter_file = "param6.txt";             #3D    
my $parameter_file = "param7.txt";             #1D    

lib/Algorithm/ExpectationMaximization.pm

        my $unscaled_peak_point =
                  $min + $peaks_index_hash->{$peak} * $scale * $delta;
        push @K_highest_peak_locations, $unscaled_peak_point
            if $k < $how_many;
        last if ++$k == $how_many;
    }
    return @K_highest_peak_locations;
}

# Used by the kmeans part of the code: This routine forms the initial clusters
# by assigning each data sample to whichever of the cluster centers produced
# by the previous routine is closest to it.
sub assign_data_to_clusters_initial {
    my $self = shift;
    my @cluster_centers = @{ shift @_ };
    my @clusters;
    foreach my $ele (@{$self->{_data_id_tags}}) {
        my @dist_from_clust_centers;
        foreach my $center (@cluster_centers) {
            push @dist_from_clust_centers, $self->distance($ele, $center);
        }
        my ($min, $best_center_index) = minimum( \@dist_from_clust_centers );
        push @{$clusters[$best_center_index]}, $ele;
    }
    return \@clusters;
}    

# Used by the kmeans part of the code: This is the main routine that, together
# with the update_cluster_centers() routine, implements the two key steps of
# the K-Means algorithm.  In most cases, the infinite while() loop will
# terminate automatically when the cluster assignments of the data points
# remain unchanged. For the sake of safety, we keep track of the number of
# iterations. If this number reaches 100, we exit the while() loop anyway.
# In most cases, this limit will not be reached.

lib/Algorithm/ExpectationMaximization.pm

        $iteration_index++;
        foreach my $cluster (@$clusters) {
            my $current_cluster_center = 
                          $cluster_centers->[$current_cluster_center_index];
            foreach my $ele (@$cluster) {
                my @dist_from_clust_centers;
                foreach my $center (@$cluster_centers) {
                    push @dist_from_clust_centers, 
                               $self->distance($ele, $center);
                }
                my ($min, $best_center_index) = 
                              minimum( \@dist_from_clust_centers );
                my $best_cluster_center = 
                                 $cluster_centers->[$best_center_index];
                if (vector_equal($current_cluster_center, 
                                         $best_cluster_center)){
                    push @{$new_clusters->[$current_cluster_center_index]}, 
                                  $ele;
                } else {
                    $assignment_changed_flag = 1;             
                    push @{$new_clusters->[$best_center_index]}, $ele;
                }
            }
            $current_cluster_center_index++;
        }
        # Now make sure that we still have K clusters since K is fixed:
        next if ((@$new_clusters != @$clusters) && ($iteration_index < 100));
        # Now make sure that none of the K clusters is an empty cluster:
        foreach my $newcluster (@$new_clusters) {
            $cluster_size_zero_condition = 1 if ((!defined $newcluster) 
                                             or  (@$newcluster == 0));

lib/Algorithm/ExpectationMaximization.pm

  #  probability distributions is written out to a file named posterior_prob_plot.png
  #  and the PNG image of the disjoint clusters to a file called cluster_plot.png.

  # SYNTHETIC DATA GENERATION:

  #  The module has been provided with a class method for generating multivariate
  #  data for experimenting with the EM algorithm.  The data generation is controlled
  #  by the contents of a parameter file that is supplied as an argument to the data
  #  generator method.  The priors, the means, and the covariance matrices in the
  #  parameter file must follow the syntax shown in the `param1.txt' file in the
  #  `examples' directory. It is best to edit a copy of this file for your
  #  synthetic data generation needs.

  my $parameter_file = "param1.txt";
  my $out_datafile = "mydatafile1.dat";
  Algorithm::ExpectationMaximization->cluster_data_generator(
                          input_parameter_file => $parameter_file,
                          output_datafile => $out_datafile,
                          total_number_of_data_points => $N );

  #  where the value of $N is the total number of data points you would like
  #  to see generated.

lib/Algorithm/ExpectationMaximization.pm

the clusters in your data.  What's amazing is that, despite the large number of
variables that must be optimized simultaneously, the EM algorithm will very likely
give you a good approximation to the right answer.

At its core, EM depends on the notion of unobserved data and the averaging of the
log-likelihood of the data actually observed over all admissible probabilities for
the unobserved data.  But what is unobserved data?  While in some cases where EM is
used, the unobserved data is literally the missing data, in others, it is something
that cannot be seen directly but that nonetheless is relevant to the data actually
observed. For the case of clustering multidimensional numerical data that can be
modeled as a Gaussian mixture, it turns out that the best way to think of the
unobserved data is in terms of a sequence of random variables, one for each observed
data point, whose values dictate the selection of the Gaussian for that data point.
This point is explained in great detail in my on-line tutorial at
L<https://engineering.purdue.edu/kak/Tutorials/ExpectationMaximization.pdf>.

The EM algorithm in our context reduces to an iterative invocation of the following
steps: (1) Given the current guess for the means and the covariances of the different
Gaussians in our mixture model, use Bayes' Rule to update the posterior class
probabilities at each of the data points; (2) Using the updated posterior class
probabilities, first update the class priors; (3) Using the updated class priors,
re-estimate the means and the covariances of the individual Gaussians; then
repeat from Step (1) until the estimates converge.
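
As an illustration of these three steps, the following sketch, which is
illustrative rather than the module's internal code, performs one EM iteration
for a 1-D two-Gaussian mixture:

  use strict;
  my @data   = (0.2, -0.4, 0.1, 4.8, 5.3, 5.1);
  my @priors = (0.5, 0.5);
  my @means  = (0.0, 5.0);
  my @vars   = (1.0, 1.0);

  sub gauss_pdf {
      my ($x, $mean, $var) = @_;
      my $pi = 4 * atan2(1, 1);
      return exp( -($x - $mean)**2 / (2 * $var) ) / sqrt(2 * $pi * $var);
  }

  # Step (1): Bayes' Rule yields the posterior probability of each
  # Gaussian at each data point.
  my @post;                 # $post[$i][$k] = P(Gaussian k | data point i)
  for my $i (0 .. $#data) {
      my @joint = map { $priors[$_] * gauss_pdf($data[$i], $means[$_], $vars[$_]) } 0 .. 1;
      my $total = $joint[0] + $joint[1];
      $post[$i] = [ map { $_ / $total } @joint ];
  }

  # Steps (2) and (3): update the priors, then the means and the
  # variances, each weighted by the posteriors just computed.
  for my $k (0 .. 1) {
      my $resp = 0;
      $resp += $post[$_][$k] for 0 .. $#data;
      $priors[$k] = $resp / @data;
      my $m = 0;
      $m += $post[$_][$k] * $data[$_] for 0 .. $#data;
      $means[$k] = $m / $resp;
      my $v = 0;
      $v += $post[$_][$k] * ($data[$_] - $means[$k])**2 for 0 .. $#data;
      $vars[$k] = $v / $resp;
  }
  printf "updated means: %.3f and %.3f\n", @means;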

lib/Algorithm/ExpectationMaximization.pm

file. The filenames for such clusters look like:

    posterior_prob_cluster1.txt
    posterior_prob_cluster2.txt
    ...

=head1 WHAT IF THE NUMBER OF CLUSTERS IS UNKNOWN?

The module constructor requires that you supply a value for the parameter C<K>, which
is the number of clusters you expect to see in the data.  But what if you do not have
a good value for C<K>?  Note that it is possible to search for the best C<K> by using
the two clustering quality criteria included in the module.  However, I have
intentionally not yet incorporated that feature in the module because it slows down
the execution of the code --- especially when the dimensionality of the data
increases.  However, nothing prevents you from writing a script --- along the lines
of the five "canned_example" scripts in the C<examples> directory --- that would use
the two clustering quality metrics for finding the best choice for C<K> for a given
dataset.  Obviously, you will now have to incorporate the call to the constructor in
a loop and check the value of the quality measures for each value of C<K>.
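
Such a script might be organized along the lines of the sketch below.  Treat it
as a sketch under assumptions: the constructor options shown (C<datafile>,
C<mask>, C<K>, C<seeding>) reflect my reading of this module's interface, and
C<run_em()> and C<clustering_quality()> are placeholder names; substitute the
actual fit and quality-metric calls from the canned examples.

  use strict;
  use Algorithm::ExpectationMaximization;

  my %quality_for_K;
  foreach my $K (2 .. 8) {
      my $clusterer = Algorithm::ExpectationMaximization->new(
                              datafile        => "mydatafile.dat",
                              mask            => "N11",   # adjust to your data
                              K               => $K,
                              seeding         => 'random',
                              terminal_output => 0,
                      );
      $clusterer->run_em();                                  # placeholder
      $quality_for_K{$K} = $clusterer->clustering_quality(); # placeholder
  }
  # Pick the K that scores best on the chosen criterion:
  my ($best_K) = sort { $quality_for_K{$b} <=> $quality_for_K{$a} }
                 keys %quality_for_K;
  print "best K by this criterion: $best_K\n";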

=head1 SOME RESULTS OBTAINED WITH THIS MODULE

If you would like to see some results that have been obtained with this module, check
out Section 7 of the report
L<https://engineering.purdue.edu/kak/Tutorials/ExpectationMaximization.pdf>.

=head1 THE C<examples> DIRECTORY

Becoming familiar with this directory is your best strategy for getting
comfortable with this module (and its future versions).  You are urged to
start by executing the following five example scripts:

=over 16

=item I<canned_example1.pl>

This example applies the EM algorithm to the data contained in the datafile
C<mydatafile.dat>.  The mixture data in the file corresponds to three overlapping

lib/Algorithm/ExpectationMaximization.pm

file.

=head1 CAVEATS

When you run the scripts in the C<examples> directory, your results will NOT
always look like what I have shown in the PNG image files in the directory.  As
mentioned earlier in the DESCRIPTION section, the EM algorithm, starting from
randomly chosen initial guesses for the cluster means, can get stuck in a local
maximum.

That raises an interesting question of how one judges the correctness of clustering
results when dealing with real experimental data.  For real data, the best approach
is to try the EM algorithm multiple times with all of the seeding options included in
this module.  It would be safe to say that, at least in low dimensional spaces and
with sufficient data, a majority of your runs should yield "correct" results.
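
In script form, that advice amounts to something like the sketch below.  As
with the earlier sketch, the seeding values and the C<run_em()> call are
assumptions to be checked against the constructor documentation and the canned
examples.

  use strict;
  use Algorithm::ExpectationMaximization;

  foreach my $seeding ('random', 'kmeans') {     # assumed seeding options
      for my $trial (1 .. 5) {
          srand($trial);                         # vary the random initialization
          my $clusterer = Algorithm::ExpectationMaximization->new(
                                  datafile        => "mydatafile.dat",
                                  mask            => "N11", # adjust to your data
                                  K               => 3,
                                  seeding         => $seeding,
                                  terminal_output => 0,
                          );
          $clusterer->run_em();                  # placeholder fit call
          # ... inspect or plot this run's clusters before the next trial
      }
  }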

Also bear in mind that a pure-Perl implementation is not meant for clustering
very large data files.  It is designed more for researching issues related to
EM-based approaches to clustering.


=head1 REQUIRED


