Algorithm-ExpectationMaximization

 view release on metacpan or  search on metacpan

lib/Algorithm/ExpectationMaximization.pm  view on Meta::CPAN

  #  priors into the clustering process. The class priors are the prior probabilities
  #  of the class distributions in your dataset.  As explained later, injecting class
  #  priors in the manner shown above makes statistical sense only for the case of
  #  manual seeding.  When you do inject class priors, the order in which the priors
  #  are expressed must correspond to the manually specified seeds for the clusters.

  #  After the invocation of the constructor, the following calls are mandatory
  #  for reasons that should be obvious from the names of the methods:

  $clusterer->read_data_from_file();
  srand(time);
  $clusterer->seed_the_clusters();
  $clusterer->EM();
  $clusterer->run_bayes_classifier();
  my $clusters = $clusterer->return_disjoint_clusters();

  #  where the call to `EM()' is the invocation of the expectation-maximization
  #  algorithm.  The call to `srand(time)' is to seed the pseudo random number
  #  generator afresh for each run of the cluster seeding procedure.  If you want to
  #  see repeatable results from one run to another of the algorithm with random
  #  seeding, you would obviously not invoke `srand(time)'.

  #  The call `run_bayes_classifier()' shown above carries out a disjoint clustering
  #  of all the data points using the naive Bayes' classifier. And the call
  #  `return_disjoint_clusters()' returns the clusters thus formed to you.  Once you
  #  have obtained access to the clusters in this manner, you can display them in
  #  your terminal window by

  foreach my $index (0..@$clusters-1) {
      print "Cluster $index (Naive Bayes):   @{$clusters->[$index]}\n\n"
  }

  #  If you would like to also see the clusters purely on the basis of the posterior
  #  class probabilities exceeding a threshold, call

  my $theta1 = 0.2;
  my $posterior_prob_clusters =
           $clusterer->return_clusters_with_posterior_probs_above_threshold($theta1);

  #  where you can obviously set the threshold $theta1 to any value you wish.  Note
  #  that now you may end up with clusters that overlap.  You can display them in
  #  your terminal window in the same manner as shown above for the naive Bayes'
  #  clusters.

  #  You can write the naive Bayes' clusters out to files, one cluster per file, by
  #  calling

  $clusterer->write_naive_bayes_clusters_to_files();  

  #  The clusters are placed in files with names like

         naive_bayes_cluster1.txt
         naive_bayes_cluster2.txt
         ...

  #  In the same manner, you can write out the posterior probability based possibly
  #  overlapping clusters to files by calling:

  $clusterer->write_posterior_prob_clusters_above_threshold_to_files($theta1);

  #  where the threshold $theta1 sets the probability threshold for deciding which
  #  data elements to place in a cluster.  These clusters are placed in files with
  #  names like

         posterior_prob_cluster1.txt
         posterior_prob_cluster2.txt
         ...

  # CLUSTER VISUALIZATION:

  #  You must first set the mask for cluster visualization. This mask tells the 
  #  module which 2D or 3D subspace of the original data space you wish to visualize 
  #  the clusters in:

  my $visualization_mask = "111";
  $clusterer->visualize_clusters($visualization_mask);
  $clusterer->visualize_distributions($visualization_mask);
  $clusterer->plot_hardcopy_clusters($visualization_mask);
  $clusterer->plot_hardcopy_distributions($visualization_mask);

  #  where the last two invocations are for writing out the PNG plots of the
  #  visualization displays to disk files.  The PNG image of the posterior
  #  probability distributions is written out to a file named posterior_prob_plot.png
  #  and the PNG image of the disjoint clusters to a file called cluster_plot.png.

  # SYNTHETIC DATA GENERATION:

  #  The module has been provided with a class method for generating multivariate
  #  data for experimenting with the EM algorithm.  The data generation is controlled
  #  by the contents of a parameter file that is supplied as an argument to the data
  #  generator method.  The priors, the means, and the covariance matrices in the
  #  parameter file must be according to the syntax shown in the `param1.txt' file in
  #  the `examples' directory. It is best to edit a copy of this file for your
  #  synthetic data generation needs.

  my $parameter_file = "param1.txt";
  my $out_datafile = "mydatafile1.dat";
  Algorithm::ExpectationMaximization->cluster_data_generator(
                          input_parameter_file => $parameter_file,
                          output_datafile => $out_datafile,
                          total_number_of_data_points => $N );

  #  where the value of $N is the total number of data points you would like to see
  #  generated for all of the Gaussians.  How this total number is divided up amongst
  #  the Gaussians is decided by the prior probabilities for the Gaussian components
  #  as declared in input parameter file.  The synthetic data may be visualized in a
  #  terminal window and the visualization written out as a PNG image to a diskfile
  #  by

  my $data_visualization_mask = "11";                                            
  $clusterer->visualize_data($data_visualization_mask);                          
  $clusterer->plot_hardcopy_data($data_visualization_mask);


=head1 CHANGES

Version 1.22 should work with data in CSV files.

Version 1.21 incorporates minor code clean up.  Overall, the module implementation
remains unchanged.

Version 1.2 allows the module to also be used for 1-D data.  The visualization code
for 1-D shows the clusters through their histograms.

Version 1.1 incorporates much cleanup of the documentation associated with the
module.  Both the top-level module documentation, especially the Description part,
and the comments embedded in the code were revised for better utilization of the
module.  The basic implementation code remains unchanged.


=head1 DESCRIPTION

B<Algorithm::ExpectationMaximization> is a I<perl5> module for the
Expectation-Maximization (EM) method of clustering numerical data that lends itself
to modeling as a Gaussian mixture.  Since the module is entirely in Perl (in the
sense that it is not a Perl wrapper around a C library that actually does the
clustering), the code in the module can easily be modified to experiment with several
aspects of EM.

Gaussian Mixture Modeling (GMM) is based on the assumption that the data consists of
C<K> Gaussian components, each characterized by its own mean vector and its own
covariance matrix.  Obviously, given observed data for clustering, we do not know
which of the C<K> Gaussian components was responsible for any of the data elements.
GMM also associates a prior probability with each Gaussian component.  In general,
these priors will also be unknown.  So the problem of clustering consists of
estimating the posterior class probability at each data element and also estimating
the class priors. Once these posterior class probabilities and the priors are
estimated with EM, we can use the naive Bayes' classifier to partition the data into
disjoint clusters.  Or, for "soft" clustering, we can find all the data elements that
belong to a Gaussian component on the basis of the posterior class probabilities at
the data elements exceeding a prescribed threshold.

If you do not mind the fact that it is possible for the EM algorithm to occasionally
get stuck in a local maximum and to, therefore, produce a wrong answer even when you
know the data to be perfectly multimodal Gaussian, EM is probably the most magical
approach to clustering multidimensional data.  Consider the case of clustering
three-dimensional data.  Each Gaussian cluster in 3D space is characterized by the
following 10 variables: the 6 unique elements of the C<3x3> covariance matrix (which
must be symmetric positive-definite), the 3 unique elements of the mean, and the
prior associated with the Gaussian. Now let's say you expect to see six Gaussians in
your data.  What that means is that you would want the values for 59 variables
(remember the unit-summation constraint on the class priors which reduces the overall
number of variables by one) to be estimated by the algorithm that seeks to discover
the clusters in your data.  What's amazing is that, despite the large number of
variables that must be optimized simultaneously, the EM algorithm will very likely

lib/Algorithm/ExpectationMaximization.pm  view on Meta::CPAN

MDL quality measure, the better the clustering of the data.

For our second measure of clustering quality, we use `trace( SW^-1 . SB)' where SW is
the within-class scatter matrix, more commonly denoted S_w, and SB the between-class
scatter matrix, more commonly denoted S_b (the underscore means subscript).  This
measure can be thought of as the normalized average distance between the clusters,
the normalization being provided by average cluster covariance SW^-1. Therefore, the
larger the value of this quality measure, the better the separation between the
clusters.  Since this measure has its roots in the Fisher linear discriminant
function, we incorporate the word C<fisher> in the name of the quality measure.
I<Note that this measure is good only when the clusters are disjoint.> When the
clusters exhibit significant overlap, the numbers produced by this quality measure
tend to be generally meaningless.

=head1 METHODS

The module provides the following methods for EM based
clustering, for cluster visualization, for data
visualization, and for the generation of data for testing a
clustering algorithm:

=over

=item B<new():>

A call to C<new()> constructs a new instance of the
C<Algorithm::ExpectationMaximization> class.  A typical form
of this call when you want to use random option for seeding
the algorithm is:

    my $clusterer = Algorithm::ExpectationMaximization->new(
                                datafile            => $datafile,
                                mask                => $mask,
                                K                   => 3,
                                max_em_iterations   => 300,
                                seeding             => 'random',
                                terminal_output     => 1,
                    );

where C<K> is the expected number of clusters and
C<max_em_iterations> the maximum number of EM iterations
that you want to allow until convergence is achieved.
Depending on your dataset and on the choice of the initial
seeds, the actual number of iterations used could be as few
as 10 and as many as reaching 300.  The output produced by
the algorithm shows the actual number of iterations used to
arrive at convergence.

The data file supplied through the C<datafile> option is
expected to contain entries in the following format

   c20  0  10.7087017086940  9.63528386251712  10.9512155258108  ...
   c7   0  12.8025925026787  10.6126270065785  10.5228482095349  ...
   b9   0  7.60118206283120  5.05889245193079  5.82841781759102  ...
   ....
   ....

where the first column contains the symbolic ID tag for each
data record and the rest of the columns the numerical
information.  As to which columns are actually used for
clustering is decided by the string value of the mask.  For
example, if we wanted to cluster on the basis of the entries
in just the 3rd, the 4th, and the 5th columns above, the
mask value would be C<N0111> where the character C<N>
indicates that the ID tag is in the first column, the
character C<0> that the second column is to be ignored, and
the C<1>'s that follow that the 3rd, the 4th, and the 5th
columns are to be used for clustering.

If instead of random seeding, you wish to use the kmeans
based seeding, just replace the option C<random> supplied
for C<seeding> by C<kmeans>.  You can also do manual seeding
by designating a specified set of data elements to serve as
cluster seeds.  The call to the constructor in this case
looks like

    my $clusterer = Algorithm::ExpectationMaximization->new(
                                datafile            => $datafile,
                                mask                => $mask,
                                K                   => 3,
                                max_em_iterations   => 300,
                                seeding             => 'manual',
                                seed_tags           => ['a26', 'b53', 'c49'],
                                terminal_output     => 1,
                    );

where the option C<seed_tags> is set to an anonymous array
of symbolic names associated with the data elements.

If you know the class priors, you can supply them through an
additional option to the constructor that looks like

    class_priors    => [0.6, 0.2, 0.2],

for the case of C<K> equal to 3.  B<In general, this would
be a useful thing to do only for the case of manual
seeding.> If you go for manual seeding, the order in which
the priors are expressed should correspond to the order of
the manually chosen tags supplied through the C<seed_tags>
option.

Note that the parameter C<terminal_output> is boolean; when
not supplied in the call to C<new()> it defaults to 0.  When
set, this parameter displays useful information in the
window of the terminal screen in which you invoke the
algorithm.

=item B<read_data_from_file():>

    $clusterer->read_data_from_file()

This is a required call after the constructor is invoked. As
you would expect, this call reads in the data for
clustering.

=item B<seed_the_clusters():>

    $clusterer->seed_the_clusters();

This is also a required call.  It processes the option you
supplied for C<seeding> in the constructor call to choose

lib/Algorithm/ExpectationMaximization.pm  view on Meta::CPAN

C<K> disjoint clusters using the naive Bayes' classifier.

=item B<return_disjoint_clusters():>

    my $clusters = $clusterer->return_disjoint_clusters();

This allows you to access the clusters obtained with the
application of the naive Bayes' classifier in your own
scripts.  If, say, you wanted to see the data records placed
in each cluster, you could subsequently invoke the following
loop in your own script:

    foreach my $index (0..@$clusters-1) {
        print "Cluster $index (Naive Bayes):   @{$clusters->[$index]}\n\n"
    }

where C<$clusters> holds the array reference returned by the
call to C<return_disjoint_clusters()>.

=item B<write_naive_bayes_clusters_to_files():>

    $clusterer->write_naive_bayes_clusters_to_files();

This method writes the clusters obtained by applying the
naive Bayes' classifier to disk files, one cluster per
file.  What is written out to each file consists of the
symbolic names of the data records that belong to the
cluster corresponding to that file.  The clusters are placed
in files with names like

    naive_bayes_cluster1.txt
    naive_bayes_cluster2.txt
    ...

=item B<return_clusters_with_posterior_probs_above_threshold($theta1):>

    my $theta1 = 0.2;
    my $posterior_prob_clusters =
       $clusterer->return_clusters_with_posterior_probs_above_threshold($theta1);

This method returns a reference to an array of C<K>
anonymous arrays, each consisting of the symbolic names for
the data records where the posterior class probability
exceeds the threshold as specified by C<$theta1>.
Subsequently, you can access each of these
posterior-probability based clusters through a loop
construct such as

    foreach my $index (0..@$posterior_prob_clusters-1) {
        print "Cluster $index (based on posterior probs exceeding $theta1): " .
              "@{$posterior_prob_clusters->[$index]}\n\n"
    }

=item B<write_posterior_prob_clusters_above_threshold_to_files($theta1):>

    $clusterer->write_posterior_prob_clusters_above_threshold_to_files($theta1);

This call writes out the posterior-probability based soft
clusters to disk files.  As in the previous method, the
threshold C<$theta1> sets the probability threshold for
deciding which data elements belong to a cluster.  These
clusters are placed in files with names like

    posterior_prob_cluster1.txt
    posterior_prob_cluster2.txt
    ...

=item B<return_individual_class_distributions_above_given_threshold($theta):>

    my $theta2 = 0.00001;
    my $class_distributions =
      $clusterer->return_individual_class_distributions_above_given_threshold($theta2);

This is the method to call if you wish to see the individual
Gaussians in your own script. The method returns a reference
to an array of anonymous arrays, with each anonymous array
representing data membership in each Gaussian.  Only those
data points are included in each Gaussian where the
probability exceeds the threshold C<$theta2>. Note that the
larger the covariance and the higher the data
dimensionality, the smaller this threshold must be for you
to see any of the data points in a Gaussian.  After you have
accessed the Gaussian mixture in this manner, you can
display the data membership in each Gaussian through the
following sort of a loop:

    foreach my $index (0..@$class_distributions-1) {
        print "Gaussian Distribution $index (only shows data elements " .
              "whose probabilities exceed the threshold $theta2:  " .
              "@{$class_distributions->[$index]}\n\n"
    }

=item B<visualize_clusters($visualization_mask):>

    my $visualization_mask = "11";
    $clusterer->visualize_clusters($visualization_mask);

The visualization mask here does not have to be identical to
the one used for clustering, but must be a subset of that
mask.  This is convenient for visualizing the clusters in
two- or three-dimensional subspaces of the original space.
The subset is specified by placing `0's in the positions
corresponding to the dimensions you do NOT want to see
through visualization.  Depending on the mask used, this
method creates a 2D or a 3D scatter plot of the clusters
obtained through the naive Bayes' classification rule.

=item B<visualize_distributions($visualization_mask):>

    $clusterer->visualize_distributions($visualization_mask);

This is the method to call if you want to visualize the soft
clustering corresponding to the posterior class
probabilities exceeding the threshold specified in the call
to
C<return_clusters_with_posterior_probs_above_threshold($theta1)>.
Again, depending on the visualization mask used, you will
see either a 2D plot or a 3D scatter plot.

=item B<plot_hardcopy_clusters($visualization_mask):>

lib/Algorithm/ExpectationMaximization.pm  view on Meta::CPAN

the priors to be used for the Gaussians, their means, and their covariance matrices.
The format of the information contained in the parameter file must be as shown in the
file C<param1.txt> provided in the C<examples> directory.  It will be easiest for you
to just edit a copy of this file for your data generation needs.  In addition to the
format of the parameter file, the main constraint you need to observe in specifying
the parameters is that the dimensionality of the covariance matrices must correspond
to the dimensionality of the mean vectors.  The multivariate random numbers are
generated by calling the C<Math::Random> module.  As you would expect, this module
requires that the covariance matrices you specify in your parameter file be symmetric
and positive definite.  Should the covariances in your parameter file not obey this
condition, the C<Math::Random> module will let you know.

=item B<visualize_data($data_visualization_mask):>

    $clusterer->visualize_data($data_visualization_mask);                          

This is the method to call if you want to visualize the data
you plan to cluster with the EM algorithm.  You'd need to
specify argument mask in a manner similar to the
visualization of the clusters, as explained earlier.

=item B<plot_hardcopy_data($data_visualization_mask):>

    $clusterer->plot_hardcopy_data($data_visualization_mask); 

This method creates a PNG file that can be used to print out
a hardcopy of the data in different 2D and 3D subspaces of
the data space. The visualization mask is used to select the
subspace for the PNG image.

=back

=head1 HOW THE CLUSTERS ARE OUTPUT

This module produces two different types of clusters: the "hard" clusters and the
"soft" clusters.  The hard clusters correspond to the naive Bayes' classification of
the data points on the basis of the Gaussian distributions and the class priors
estimated by the EM algorithm. Such clusters partition the data into disjoint
subsets.  On the other hand, the soft clusters correspond to the posterior class
probabilities calculated by the EM algorithm.  A data element belongs to a cluster if
its posterior probability for that Gaussian exceeds a threshold.

After the EM algorithm has finished running, the hard clusters are created by
invoking the method C<run_bayes_classifier()> on an instance of the module and then
made user-accessible by calling C<return_disjoint_clusters()>.  These clusters may
then be displayed in a terminal window by dereferencing each element of the array
whose reference is returned b C<return_disjoint_clusters()>.  The hard clusters can
be written out to disk files by invoking C<write_naive_bayes_clusters_to_files()>.
This method writes out the clusters to files, one cluster per file.  What is written
out to each file consists of the symbolic names of the data records that belong to
the cluster corresponding to that file.  The clusters are placed in files with names
like

    naive_bayes_cluster1.txt
    naive_bayes_cluster2.txt
    ...

The soft clusters on the other hand are created by calling
C<return_clusters_with_posterior_probs_above_threshold($theta1)>
on an instance of the module, where the argument C<$theta1>
is the threshold for deciding whether a data element belongs
in a soft cluster.  The posterior class probability at a
data element must exceed the threshold for the element to
belong to the corresponding cluster.  The soft cluster can
be written out to disk files by calling
C<write_posterior_prob_clusters_above_threshold_to_files($theta1)>.
As with the hard clusters, each cluster is placed in a separate
file. The filenames for such clusters look like:

    posterior_prob_cluster1.txt
    posterior_prob_cluster2.txt
    ...

=head1 WHAT IF THE NUMBER OF CLUSTERS IS UNKNOWN?

The module constructor requires that you supply a value for the parameter C<K>, which
is the number of clusters you expect to see in the data.  But what if you do not have
a good value for C<K>?  Note that it is possible to search for the best C<K> by using
the two clustering quality criteria included in the module.  However, I have
intentionally not yet incorporated that feature in the module because it slows down
the execution of the code --- especially when the dimensionality of the data
increases.  However, nothing prevents you from writing a script --- along the lines
of the five "canned_example" scripts in the C<examples> directory --- that would use
the two clustering quality metrics for finding the best choice for C<K> for a given
dataset.  Obviously, you will now have to incorporate the call to the constructor in
a loop and check the value of the quality measures for each value of C<K>.

=head1 SOME RESULTS OBTAINED WITH THIS MODULE

If you would like to see some results that have been obtained with this module, check
out Section 7 of the report
L<https://engineering.purdue.edu/kak/Tutorials/ExpectationMaximization.pdf>.

=head1 THE C<examples> DIRECTORY

Becoming familiar with this directory should be your best
strategy to become comfortable with this module (and its
future versions).  You are urged to start by executing the
following five example scripts:

=over 16

=item I<canned_example1.pl>

This example applies the EM algorithm to the data contained in the datafile
C<mydatafile.dat>.  The mixture data in the file corresponds to three overlapping
Gaussian components in a star-shaped pattern.  The EM based clustering for this data
is shown in the files C<save_example_1_cluster_plot.png> and
C<save_example_1_posterior_prob_plot.png>, the former displaying the hard clusters
obtained by using the naive Bayes' classifier and the latter showing the soft
clusters obtained on the basis of the posterior class probabilities at the data
points.  

=item I<canned_example2.pl>

The datafile used in this example is C<mydatafile2.dat>.  This mixture data
corresponds to two well-separated relatively isotropic Gaussians.  EM based clustering for this
data is shown in the files C<save_example_2_cluster_plot.png> and
C<save_example_2_posterior_prob_plot.png>, the former displaying the hard clusters
obtained by using the naive Bayes' classifier and the latter showing the soft
clusters obtained by using the posterior class probabilities at the data points.



( run in 2.269 seconds using v1.01-cache-2.11-cpan-5b529ec07f3 )