Algorithm-ExpectationMaximization


MANIFEST  view on Meta::CPAN

examples/canned_example1.pl
examples/canned_example2.pl
examples/canned_example3.pl
examples/canned_example4.pl
examples/canned_example5.pl
examples/canned_example6.pl
examples/cleanup_directory.pl
examples/cluster_plot.png
examples/data_generator.pl
examples/data_scatter_plot.png
examples/datafile_1d.txt
examples/mydatafile1.dat
examples/mydatafile2.dat
examples/mydatafile3.dat
examples/mydatafile4.dat
examples/mydatafile5.dat
examples/mydatafile6.dat
examples/mydatafile7.dat
examples/sphericaldata.csv
examples/param1.txt
examples/param2.txt
examples/param3.txt
examples/param4.txt
examples/param5.txt
examples/param6.txt
examples/param7.txt
examples/posterior_prob_plot.png
examples/README
examples/save_example_1_cluster_plot.png
examples/save_example_1_posterior_prob_plot.png
examples/save_example_2_cluster_plot.png
examples/save_example_2_posterior_prob_plot.png
examples/save_example_3_cluster_plot.png
examples/save_example_3_posterior_prob_plot.png
examples/save_example_4_cluster_plot.png
examples/save_example_4_posterior_prob_plot.png
examples/save_example_5_cluster_plot.png
examples/save_example_5_posterior_prob_plot.png
examples/save_example_6_cluster_plot.png
examples/save_example_6_posterior_prob_plot.png
lib/Algorithm/ExpectationMaximization.pm
Makefile.PL
MANIFEST			This list of files
README
t/test.t
META.yml                                 Module YAML meta-data (added by MakeMaker)
META.json                                Module JSON meta-data (added by MakeMaker)

examples/README  view on Meta::CPAN



1)  canned_example1.pl           

          This example illustrates 2D clustering of co-located but
          overlapping clusters with different covariances.

          Unless your run gets trapped in a local maximum, your results
          should look like those shown in the following image files:

             save_example_1_cluster_plot.png          (for hard clustering)

             save_example_1_posterior_prob_plot.png   (for soft clustering)

          If you are using a Linux machine, you can display these image
          files with the 'display' utility.



2)  canned_example2.pl           

          This example illustrates 2D clustering involving non-overlapping
          clusters.

          Unless your run gets trapped in a local maximum, your results
          should look like those shown in the following image files:

             save_example_2_cluster_plot.png          (for hard clustering)

             save_example_2_posterior_prob_plot.png   (for soft clustering)



3)  canned_example3.pl           

          This example illustrates 2D clustering involving overlapping
          clusters whose means are at different locations.

          Unless your run gets trapped in a local maximum, your results
          should look like those shown in the following image files:

             save_example_3_cluster_plot.png          (for hard clustering)

             save_example_3_posterior_prob_plot.png   (for soft clustering)



4)  canned_example4.pl           

          This example illustrates 3D clustering involving non-overlapping
          clusters.

          Unless your run gets trapped in a local maximum, your results
          should look like those shown in the following image files:

             save_example_4_cluster_plot.png          (for hard clustering)

             save_example_4_posterior_prob_plot.png   (for soft clustering)



5)  canned_example5.pl           

          This example illustrates 3D clustering involving overlapping
          clusters.

          Unless your run gets trapped in a local maximum, your results
          should look like those shown in the following image files:

             save_example_5_cluster_plot.png          (for hard clustering)

             save_example_5_posterior_prob_plot.png   (for soft clustering)


6)  canned_example6.pl

          This example was added in Version 1.2 to illustrate clustering
          of 1-D data.

          Unless your run gets trapped in a local maximum, your results
          should look like those shown in the following image files:

             save_example_6_cluster_plot.png          (for hard clustering)

             save_example_6_posterior_prob_plot.png   (for soft clustering)
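
The difference between the hard and the soft clustering shown in the image
files above can be sketched in a few lines of Perl.  This is a minimal
illustration, not the module's own code: the posterior probabilities, the
0.25 threshold, and the helper names hard_assignment() and soft_membership()
are all assumptions made up for this example.

```perl
use strict;
use warnings;

# Hypothetical posterior probabilities for four data points and three
# clusters; each row sums to 1.  The numbers are invented for illustration.
my %posteriors = (
    p1 => [0.90, 0.08, 0.02],
    p2 => [0.40, 0.35, 0.25],
    p3 => [0.05, 0.15, 0.80],
    p4 => [0.30, 0.60, 0.10],
);

# Hard clustering: assign each point to the single most probable cluster.
sub hard_assignment {
    my ($probs) = @_;
    my $best = 0;
    for my $k (1 .. $#$probs) {
        $best = $k if $probs->[$k] > $probs->[$best];
    }
    return $best;
}

# Soft clustering: a point belongs to every cluster whose posterior
# probability meets the threshold, so it may appear in several clusters.
sub soft_membership {
    my ($probs, $threshold) = @_;
    return [ grep { $probs->[$_] >= $threshold } 0 .. $#$probs ];
}

for my $id (sort keys %posteriors) {
    my $hard = hard_assignment($posteriors{$id});
    my $soft = soft_membership($posteriors{$id}, 0.25);
    print "$id  hard: $hard  soft: @$soft\n";
}
```

A point such as p2, whose probabilities are spread fairly evenly, ends up in
one hard cluster but in several soft clusters.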



========================================================================

Support scripts in the `examples' directory:


1)  For generating the data for experiments with clustering

lib/Algorithm/ExpectationMaximization.pm  view on Meta::CPAN

                print OUTPUT "$i $histogram[$i]\n";        
            }
#            $arg_string .= "\"$temp_file\" using 1:2 ti col smooth frequency with boxes lc $cindex, ";
            $arg_string .= "\"$temp_file\" using 2:xtic(1) ti col smooth frequency with boxes lc $cindex, ";
            close OUTPUT;
        }
    }
    # Strip the trailing comma and spaces left by the last appended plot entry.
    $arg_string =~ s/,[ ]+$//;
    if ($visualization_data_field_width > 2) {
        $plot->gnuplot_cmd( 'set terminal png color',
                            'set output "cluster_plot.png"');
        $plot->gnuplot_cmd( "splot $arg_string" );
    } elsif ($visualization_data_field_width == 2) {
        $plot->gnuplot_cmd('set terminal png',
                           'set output "cluster_plot.png"');
        $plot->gnuplot_cmd( "plot $arg_string" );
    } elsif ($visualization_data_field_width == 1) {
        $plot->gnuplot_cmd('set terminal png',
                           'set output "cluster_plot.png"');
        $plot->gnuplot_cmd( "plot $arg_string" );
    }
}

# This method is for the visualization of the posterior class distributions.  In
# other words, this method allows us to see the soft clustering produced by the EM
# algorithm.  While much of the gnuplot logic here is the same as in the
# visualize_clusters() method, there are significant differences in how the data is
# pooled for the purpose of display.
sub visualize_distributions {

lib/Algorithm/ExpectationMaximization.pm  view on Meta::CPAN

                print OUTPUT "$i $histogram[$i]\n";        
            }
            $arg_string .= "\"$temp_file\" using 2:xtic(1) ti col smooth frequency with boxes lc $cindex, ";
            close OUTPUT;
        }
    }
    # Strip the trailing comma and spaces left by the last appended plot entry.
    $arg_string =~ s/,[ ]+$//;

    if ($visualization_data_field_width > 2) {
        $plot->gnuplot_cmd( 'set terminal png',
                            'set output "posterior_prob_plot.png"');
        $plot->gnuplot_cmd( "splot $arg_string" );
    } elsif ($visualization_data_field_width == 2) {
        $plot->gnuplot_cmd( 'set terminal png',
                            'set output "posterior_prob_plot.png"');
        $plot->gnuplot_cmd( "plot $arg_string" );
    } elsif ($visualization_data_field_width == 1) {
        $plot->gnuplot_cmd( 'set terminal png',
                            'set output "posterior_prob_plot.png"');
        $plot->gnuplot_cmd( "plot $arg_string" );
    }
}

#  The method shown below should be called only AFTER you have called the method
#  read_data_from_file().  visualize_data() is meant for the visualization of the
#  original data in its various 1D, 2D, or 3D subspaces.
sub visualize_data {
    my $self = shift;
    my $v_mask = shift || die "visualization mask missing";

lib/Algorithm/ExpectationMaximization.pm  view on Meta::CPAN

        foreach my $i (0..@all_data-1) {
            $histogram[int( ($all_data[$i] - $minval) / $delta )]++;
        }
        foreach my $i (0..@histogram-1) {
            print OUTPUT "$i $histogram[$i]\n";        
        }
        $arg_string = "\"$temp_file\" using 2:xtic(1) ti col smooth frequency with boxes lc rgb 'green'";
        close OUTPUT;
    }
    if ($visualization_data_field_width > 2) {
        $plot->gnuplot_cmd( 'set terminal png',
                            'set output "data_scatter_plot.png"');
        $plot->gnuplot_cmd( "splot $arg_string" );
    } elsif ($visualization_data_field_width == 2) {
        $plot->gnuplot_cmd( 'set terminal png',
                            'set output "data_scatter_plot.png"');
        $plot->gnuplot_cmd( "plot $arg_string" );
    } elsif ($visualization_data_field_width == 1) {
        $plot->gnuplot_cmd( 'set terminal png',
                            'set output "data_scatter_plot.png"');
        $plot->gnuplot_cmd( "plot $arg_string" );
    }
}
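
The 1-D branch above bins the data into a histogram before handing it to
gnuplot.  The following is a minimal standalone sketch of that binning step,
not the module's code: the function name build_histogram(), the bin count,
and the sample data are assumptions.  It also shows two defensive choices
that the excerpt does not make visible: bins start at zero rather than
undefined, and the maximum value is clamped into the last bin.

```perl
use strict;
use warnings;
use List::Util qw(min max);

# Bin a list of numbers into $num_bins equal-width bins.
sub build_histogram {
    my ($data, $num_bins) = @_;
    my ($minval, $maxval) = (min(@$data), max(@$data));
    my $delta = ($maxval - $minval) / $num_bins;
    my @histogram = (0) x $num_bins;       # every bin starts at zero
    foreach my $value (@$data) {
        # If all values are equal, $delta is 0; put everything in bin 0.
        my $bin = $delta > 0 ? int( ($value - $minval) / $delta ) : 0;
        # The maximum value would otherwise fall one past the last bin.
        $bin = $num_bins - 1 if $bin >= $num_bins;
        $histogram[$bin]++;
    }
    return @histogram;
}

my @data = (0.1, 0.4, 0.5, 1.2, 1.9, 2.0);
my @histogram = build_histogram(\@data, 4);
print "$_ $histogram[$_]\n" for 0 .. $#histogram;
```

Each printed line has the same "bin-index count" layout that the excerpt
writes to its temporary file.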


###################  Generating Synthetic Data for Clustering  ###################

#  The data generated corresponds to a multivariate distribution.  The mean and the
#  covariance of each Gaussian in the distribution are specified individually in a
#  parameter file. The parameter file must also state the prior probabilities to be

lib/Algorithm/ExpectationMaximization.pm  view on Meta::CPAN

  #  the clusters in:

  my $visualization_mask = "111";
  $clusterer->visualize_clusters($visualization_mask);
  $clusterer->visualize_distributions($visualization_mask);
  $clusterer->plot_hardcopy_clusters($visualization_mask);
  $clusterer->plot_hardcopy_distributions($visualization_mask);

  #  where the last two invocations are for writing out the PNG plots of the
  #  visualization displays to disk files.  The PNG image of the posterior
  #  probability distributions is written out to a file named posterior_prob_plot.png
  #  and the PNG image of the disjoint clusters to a file called cluster_plot.png.
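
A sketch of how a visualization mask such as "111" might select coordinates.
It assumes that the mask has one character per data coordinate and that a
'1' keeps the coordinate while a '0' drops it.  The helper apply_mask() is
hypothetical and is not a method of this module.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the module: keep the coordinates of a
# record whose mask character is '1'.
sub apply_mask {
    my ($record, $mask) = @_;
    my @bits = split //, $mask;
    die "mask length must equal the number of coordinates\n"
        unless @bits == @$record;
    return [ map { $record->[$_] } grep { $bits[$_] eq '1' } 0 .. $#bits ];
}

my @record = (3.2, -1.5, 7.0);
print "@{ apply_mask(\@record, '111') }\n";   # all three coordinates
print "@{ apply_mask(\@record, '101') }\n";   # first and third coordinates
```

Under this assumption, a mask with exactly two 1s projects 3D data onto a 2D
subspace for plotting.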

  # SYNTHETIC DATA GENERATION:

  #  The module has been provided with a class method for generating multivariate
  #  data for experimenting with the EM algorithm.  The data generation is controlled
  #  by the contents of a parameter file that is supplied as an argument to the data
  #  generator method.  The priors, the means, and the covariance matrices in the
  #  parameter file must follow the syntax shown in the `param1.txt' file in
  #  the `examples' directory. It is best to edit a copy of this file for your
  #  synthetic data generation needs.
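
To make the roles of the priors and the means concrete, here is a minimal
sketch of drawing samples from a Gaussian mixture.  It is deliberately
simplified to 1-D data with standard deviations instead of covariance
matrices, it does not read a parameter file, and it is not the module's data
generator.  The helper names and all parameter values are assumptions.

```perl
use strict;
use warnings;

# Draw one sample from N(mean, sd^2) using the Box-Muller transform.
sub gaussian_sample {
    my ($mean, $sd) = @_;
    my $u1 = 1 - rand();          # in (0, 1], so log() is defined
    my $u2 = rand();
    my $z  = sqrt(-2 * log($u1)) * cos(2 * 3.14159265358979 * $u2);
    return $mean + $sd * $z;
}

# Pick a component index according to the prior probabilities.
sub pick_component {
    my ($priors) = @_;
    my $r = rand();
    my $cumulative = 0;
    for my $k (0 .. $#$priors) {
        $cumulative += $priors->[$k];
        return $k if $r < $cumulative;
    }
    return $#$priors;             # guard against rounding in the priors
}

# Illustrative parameters (not taken from any param*.txt file).
my @priors = (0.3, 0.7);
my @means  = (0.0, 5.0);
my @sds    = (1.0, 0.5);

srand(42);                        # fixed seed so that runs are repeatable
my @samples;
for (1 .. 10) {
    my $k = pick_component(\@priors);
    push @samples, [ $k, gaussian_sample($means[$k], $sds[$k]) ];
}
printf "component %d  value %.3f\n", @$_ for @samples;
```

Each sample first chooses a component in proportion to its prior and then
draws a value from that component's Gaussian.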

lib/Algorithm/ExpectationMaximization.pm  view on Meta::CPAN

future versions).  You are urged to start by executing the
following six example scripts:

=over 16

=item I<canned_example1.pl>

This example applies the EM algorithm to the data contained in the datafile
C<mydatafile1.dat>.  The mixture data in the file corresponds to three overlapping
Gaussian components in a star-shaped pattern.  The EM based clustering for this data
is shown in the files C<save_example_1_cluster_plot.png> and
C<save_example_1_posterior_prob_plot.png>, the former displaying the hard clusters
obtained by using the naive Bayes' classifier and the latter showing the soft
clusters obtained on the basis of the posterior class probabilities at the data
points.  

=item I<canned_example2.pl>

The datafile used in this example is C<mydatafile2.dat>.  This mixture data
corresponds to two well-separated relatively isotropic Gaussians.  EM based clustering for this
data is shown in the files C<save_example_2_cluster_plot.png> and
C<save_example_2_posterior_prob_plot.png>, the former displaying the hard clusters
obtained by using the naive Bayes' classifier and the latter showing the soft
clusters obtained by using the posterior class probabilities at the data points.

=item I<canned_example3.pl>

Like the first example, this example again involves three Gaussians, but now their
means are not co-located.  Additionally, we now seed the clusters manually by
specifying three selected data points as the initial guesses for the cluster means.
The datafile used for this example is C<mydatafile3.dat>.  The EM based clustering
for this data is shown in the files C<save_example_3_cluster_plot.png> and
C<save_example_3_posterior_prob_plot.png>, the former displaying the hard clusters
obtained by using the naive Bayes' classifier and the latter showing the soft
clusters obtained on the basis of the posterior class probabilities at the data
points.

=item I<canned_example4.pl>

Whereas the three previous examples demonstrated EM based clustering of 2D data, we
now present an example of clustering in 3D.  The datafile used in this example is
C<mydatafile4.dat>.  This mixture data corresponds to three well-separated but highly
anisotropic Gaussians. The EM derived clustering for this data is shown in the files
C<save_example_4_cluster_plot.png> and C<save_example_4_posterior_prob_plot.png>, the
former displaying the hard clusters obtained by using the naive Bayes' classifier and
the latter showing the soft clusters obtained on the basis of the posterior class
probabilities at the data points.

You may also wish to run this example on the data in a CSV file in the C<examples>
directory. The name of the file is C<sphericaldata.csv>.  

=item I<canned_example5.pl>

We again demonstrate clustering in 3D but now we have one Gaussian cluster that
"cuts" through the other two Gaussian clusters.  The datafile used in this example is
C<mydatafile5.dat>.  The three Gaussians in this case are highly overlapping and
highly anisotropic.  The EM derived clustering for this data is shown in the files
C<save_example_5_cluster_plot.png> and C<save_example_5_posterior_prob_plot.png>, the
former displaying the hard clusters obtained by using the naive Bayes' classifier and
the latter showing the soft clusters obtained through the posterior class
probabilities at the data points.

=item I<canned_example6.pl>

This example, added in Version 1.2, demonstrates the use of this module for 1-D data.
In order to visualize the clusters for the 1-D case, we show them through their
respective histograms.  The datafile used in this example is C<mydatafile7.dat>.  The
data consists of two overlapping Gaussians.  The EM derived clustering for this data
is shown in the files C<save_example_6_cluster_plot.png> and
C<save_example_6_posterior_prob_plot.png>, the former displaying the hard clusters
obtained by using the naive Bayes' classifier and the latter showing the soft
clusters obtained through the posterior class probabilities at the data points.

=back
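
The "posterior class probabilities at the data points" mentioned in each
example can be sketched for the 1-D case as follows.  This is an illustrative
calculation under assumed mixture parameters, not the module's
implementation.  The function names gaussian_pdf() and posteriors() are
hypothetical.

```perl
use strict;
use warnings;

# Density of a 1-D Gaussian with the given mean and variance.
sub gaussian_pdf {
    my ($x, $mean, $var) = @_;
    return exp( -($x - $mean)**2 / (2 * $var) ) / sqrt(2 * 3.14159265358979 * $var);
}

# Posterior class probabilities at $x for a mixture described by
# array refs of priors, means, and variances.
sub posteriors {
    my ($x, $priors, $means, $vars) = @_;
    my @weighted = map { $priors->[$_] * gaussian_pdf($x, $means->[$_], $vars->[$_]) }
                   0 .. $#$priors;
    my $total = 0;
    $total += $_ for @weighted;
    return map { $_ / $total } @weighted;
}

# A point halfway between two equally weighted, equally wide Gaussians.
my @post = posteriors(1.0, [0.5, 0.5], [0.0, 2.0], [1.0, 1.0]);
printf "%.3f %.3f\n", @post;      # prints "0.500 0.500"
```

The sketch does not guard against a point so far from every component that
all densities underflow to zero.
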

Going through the six examples listed above will make you familiar with how to make
the calls to the clustering and the visualization methods.  The C<examples> directory
also includes several parameter files with names like

    param1.txt


