view release on metacpan or search on metacpan
examples/mydatafile6.dat
examples/mydatafile7.dat
examples/sphericaldata.csv
examples/param1.txt
examples/param2.txt
examples/param3.txt
examples/param4.txt
examples/param5.txt
examples/param6.txt
examples/param7.txt
examples/posterior_prob_plot.png
examples/README
examples/save_example_1_cluster_plot.png
examples/save_example_1_posterior_prob_plot.png
examples/save_example_2_cluster_plot.png
examples/save_example_2_posterior_prob_plot.png
examples/save_example_3_cluster_plot.png
examples/save_example_3_posterior_prob_plot.png
examples/save_example_4_cluster_plot.png
examples/save_example_4_posterior_prob_plot.png
examples/save_example_5_cluster_plot.png
examples/save_example_5_posterior_prob_plot.png
examples/save_example_6_cluster_plot.png
examples/save_example_6_posterior_prob_plot.png
lib/Algorithm/ExpectationMaximization.pm
Makefile.PL
MANIFEST This list of files
README
t/test.t
META.yml Module YAML meta-data (added by MakeMaker)
META.json Module JSON meta-data (added by MakeMaker)
examples/README view on Meta::CPAN
1) canned_example1.pl
This example illustrates 2D clustering of co-located but
overloapping clusters with different covariances.
Unless your run gets trapped in a local maximum, your results
should look like those shown in the following image files:
save_example_1_cluster_plot.png (for hard clustering)
save_example_1_posterior_prob_plot.png (for soft clustering)
If you are using a Linux machine, you can display these image
files with the 'display' utility.
2) canned_example2.pl
This example illustrates 2D clustering involving non-overlapping
clusters.
Unless your run gets trapped in a local maximum, your results
should look like those shown in the following image files:
save_example_2_cluster_plot.png (for hard clustering)
save_example_2_posterior_prob_plot.png (for soft clustering)
3) canned_example3.pl
This example illustrates 2D clustering involving overlapping
clusters whose means are at different locations.
Unless your run gets trapped in a local maximum, your results
should look like those shown in the following image files:
save_example_3_cluster_plot.png (for hard clustering)
save_example_3_posterior_prob_plot.png (for soft clustering)
4) canned_example4.pl
This example illustrates 3D clustering involving non-overlapping
clusters.
Unless your run gets trapped in a local maximum, your results
should look like those shown in the following image files:
save_example_4_cluster_plot.png (for hard clustering)
save_example_4_posterior_prob_plot.png (for soft clustering)
5) canned_example5.pl
This example illustrates 3D clustering involving overlapping
clusters.
Unless your run gets trapped in a local maximum, your results
should look like those shown in the following image files:
save_example_5_cluster_plot.png (for hard clustering)
save_example_5_posterior_prob_plot.png (for soft clustering)
6) canned_example6.pl
This example was added in Version 1.2 to illustrate clustering
of 1-D data.
Unless your run gets trapped in a local maximum, your results
should look like those shown in the following image files:
save_example_6_cluster_plot.png (for hard clustering)
save_example_6_posterior_prob_plot.png (for soft clustering)
========================================================================
Support scripts in the `examples' directory:
1) For generating the data for experiments with clustering
examples/canned_example1.pl view on Meta::CPAN
# Once you have the clusters in your own top-level script,
# you can now examine the contents of the clusters by the
# following sort of code:
print "\n\nDisjoint clusters obtained with Naive Bayes' classifier:\n\n";
foreach my $index (0..@$clusters-1) {
print "Cluster $index (Naive Bayes): @{$clusters->[$index]}\n\n"
}
print "----------------------------------------------------\n\n";
my $theta1 = 0.2;
print "Possibly overlapping clusters based on posterior probabilities " .
"exceeding the threshold $theta1:\n\n";
my $posterior_prob_clusters =
$clusterer->return_clusters_with_posterior_probs_above_threshold($theta1);
foreach my $index (0..@$posterior_prob_clusters-1) {
print "Cluster $index (based on posterior probs exceeding $theta1): " .
"@{$posterior_prob_clusters->[$index]}\n\n"
}
$clusterer->write_posterior_prob_clusters_above_threshold_to_files($theta1);
print "\n----------------------------------------------------\n\n";
my $theta2 = 0.00001;
print "Showing the data element membership in each Gaussian. Only those " .
"data points are included in each Gaussian where the probability " .
"exceeds the threshold $theta2. Note that the larger the covariance " .
"and the higher the data dimensionality, the smaller this threshold " .
"must be for you to see any of the data points in a Gaussian: \n\n";
my $class_distributions =
$clusterer->return_individual_class_distributions_above_given_threshold($theta2);
examples/canned_example2.pl view on Meta::CPAN
# Once you have the clusters in your own top-level script,
# you can now examine the contents of the clusters by the
# following sort of code:
print "\n\nDisjoint clusters obtained with Naive Bayes' classifier:\n\n";
foreach my $index (0..@$clusters-1) {
print "Cluster $index (Naive Bayes): @{$clusters->[$index]}\n\n"
}
print "----------------------------------------------------\n\n";
my $theta1 = 0.2;
print "Possibly overlapping clusters based on posterior probabilities " .
"exceeding the threshold $theta1:\n\n";
my $posterior_prob_clusters =
$clusterer->return_clusters_with_posterior_probs_above_threshold($theta1);
foreach my $index (0..@$posterior_prob_clusters-1) {
print "Cluster $index (based on posterior probs exceeding $theta1): " .
"@{$posterior_prob_clusters->[$index]}\n\n"
}
$clusterer->write_posterior_prob_clusters_above_threshold_to_files($theta1);
print "\n----------------------------------------------------\n\n";
my $theta2 = 0.00001;
print "Showing the data element membership in each Gaussian. Only those " .
"data points are included in each Gaussian where the probability " .
"exceeds the threshold $theta2. Note that the larger the covariance " .
"and the higher the data dimensionality, the smaller this threshold " .
"must be for you to see any of the data points in a Gaussian: \n\n";
my $class_distributions =
$clusterer->return_individual_class_distributions_above_given_threshold($theta2);
examples/canned_example3.pl view on Meta::CPAN
# Once you have the clusters in your own top-level script,
# you can now examine the contents of the clusters by the
# following sort of code:
print "\n\nDisjoint clusters obtained with Naive Bayes' classifier:\n\n";
foreach my $index (0..@$clusters-1) {
print "Cluster $index (Naive Bayes): @{$clusters->[$index]}\n\n"
}
print "----------------------------------------------------\n\n";
my $theta1 = 0.2;
print "Possibly overlapping clusters based on posterior probabilities " .
"exceeding the threshold $theta1:\n\n";
my $posterior_prob_clusters =
$clusterer->return_clusters_with_posterior_probs_above_threshold($theta1);
foreach my $index (0..@$posterior_prob_clusters-1) {
print "Cluster $index (based on posterior probs exceeding $theta1): " .
"@{$posterior_prob_clusters->[$index]}\n\n"
}
$clusterer->write_posterior_prob_clusters_above_threshold_to_files($theta1);
print "\n----------------------------------------------------\n\n";
my $theta2 = 0.00001;
print "Showing the data element membership in each Gaussian. Only those " .
"data points are included in each Gaussian where the probability " .
"exceeds the threshold $theta2. Note that the larger the covariance " .
"and the higher the data dimensionality, the smaller this threshold " .
"must be for you to see any of the data points in a Gaussian: \n\n";
my $class_distributions =
$clusterer->return_individual_class_distributions_above_given_threshold($theta2);
examples/canned_example4.pl view on Meta::CPAN
# Once you have the clusters in your own top-level script,
# you can now examine the contents of the clusters by the
# following sort of code:
print "\n\nDisjoint clusters obtained with Naive Bayes' classifier:\n\n";
foreach my $index (0..@$clusters-1) {
print "Cluster $index (Naive Bayes): @{$clusters->[$index]}\n\n"
}
print "----------------------------------------------------\n\n";
my $theta1 = 0.2;
print "Possibly overlapping clusters based on posterior probabilities " .
"exceeding the threshold $theta1:\n\n";
my $posterior_prob_clusters =
$clusterer->return_clusters_with_posterior_probs_above_threshold($theta1);
foreach my $index (0..@$posterior_prob_clusters-1) {
print "Cluster $index (based on posterior probs exceeding $theta1): " .
"@{$posterior_prob_clusters->[$index]}\n\n"
}
$clusterer->write_posterior_prob_clusters_above_threshold_to_files($theta1);
print "\n----------------------------------------------------\n\n";
my $theta2 = 0.00001;
print "Showing the data element membership in each Gaussian. Only those " .
"data points are included in each Gaussian where the probability " .
"exceeds the threshold $theta2. Note that the larger the covariance " .
"and the higher the data dimensionality, the smaller this threshold " .
"must be for you to see any of the data points in a Gaussian: \n\n";
my $class_distributions =
$clusterer->return_individual_class_distributions_above_given_threshold($theta2);
examples/canned_example5.pl view on Meta::CPAN
# Once you have the clusters in your own top-level script,
# you can now examine the contents of the clusters by the
# following sort of code:
print "\n\nDisjoint clusters obtained with Naive Bayes' classifier:\n\n";
foreach my $index (0..@$clusters-1) {
print "Cluster $index (Naive Bayes): @{$clusters->[$index]}\n\n"
}
print "----------------------------------------------------\n\n";
my $theta1 = 0.2;
print "Possibly overlapping clusters based on posterior probabilities " .
"exceeding the threshold $theta1:\n\n";
my $posterior_prob_clusters =
$clusterer->return_clusters_with_posterior_probs_above_threshold($theta1);
foreach my $index (0..@$posterior_prob_clusters-1) {
print "Cluster $index (based on posterior probs exceeding $theta1): " .
"@{$posterior_prob_clusters->[$index]}\n\n"
}
$clusterer->write_posterior_prob_clusters_above_threshold_to_files($theta1);
print "\n----------------------------------------------------\n\n";
my $theta2 = 0.00001;
print "Showing the data element membership in each Gaussian. Only those " .
"data points are included in each Gaussian where the probability " .
"exceeds the threshold $theta2. Note that the larger the covariance " .
"and the higher the data dimensionality, the smaller this threshold " .
"must be for you to see any of the data points in a Gaussian: \n\n";
my $class_distributions =
$clusterer->return_individual_class_distributions_above_given_threshold($theta2);
examples/canned_example6.pl view on Meta::CPAN
# Once you have the clusters in your own top-level script,
# you can now examine the contents of the clusters by the
# following sort of code:
print "\n\nDisjoint clusters obtained with Naive Bayes' classifier:\n\n";
foreach my $index (0..@$clusters-1) {
print "Cluster $index (Naive Bayes): @{$clusters->[$index]}\n\n"
}
print "----------------------------------------------------\n\n";
my $theta1 = 0.2;
print "Possibly overlapping clusters based on posterior probabilities " .
"exceeding the threshold $theta1:\n\n";
my $posterior_prob_clusters =
$clusterer->return_clusters_with_posterior_probs_above_threshold($theta1);
foreach my $index (0..@$posterior_prob_clusters-1) {
print "Cluster $index (based on posterior probs exceeding $theta1): " .
"@{$posterior_prob_clusters->[$index]}\n\n"
}
$clusterer->write_posterior_prob_clusters_above_threshold_to_files($theta1);
print "\n----------------------------------------------------\n\n";
my $theta2 = 0.00001;
print "Showing the data element membership in each Gaussian. Only those " .
"data points are included in each Gaussian where the probability " .
"exceeds the threshold $theta2. Note that the larger the covariance " .
"and the higher the data dimensionality, the smaller this threshold " .
"must be for you to see any of the data points in a Gaussian: \n\n";
my $class_distributions =
$clusterer->return_individual_class_distributions_above_given_threshold($theta2);
lib/Algorithm/ExpectationMaximization.pm view on Meta::CPAN
}
$self->{_cluster_normalizers} = [];
if ($self->{_debug}) {
print "\n\nDisplaying prob of a data point vis-a-vis each class:\n\n";
foreach my $data_id (sort keys %{$self->{_data}}) {
my $class_probs_at_a_point =
$self->{_class_probs_at_each_data_point}->{$data_id};
print "Class probs for $data_id: @$class_probs_at_a_point\n"
}
}
# Calculate prob(C_i | x) which is the posterior prob of class
# considered as a r.v. to be C_i at a given point x. For a given
# x, the sum of such probabilities over all C_i must add up to 1:
$self->find_expected_classes_at_each_datapoint();
if ($self->{_debug}) {
print "\n\nDisplaying expected class probs at each data point:\n\n";
foreach my $data_id (sort keys %{$self->{_expected_class_probs}}) {
my $expected_classes_at_a_point =
$self->{_expected_class_probs}->{$data_id};
print "Expected classes $data_id: @$expected_classes_at_a_point\n";
}
lib/Algorithm/ExpectationMaximization.pm view on Meta::CPAN
$self->classify_all_data_tuples_bayes($self->{_cluster_means},
$self->{_cluster_covariances});
}
# Should NOT be called before run_bayes_classifier is run
sub return_disjoint_clusters {
my $self = shift;
return $self->{_clusters};
}
sub return_clusters_with_posterior_probs_above_threshold {
my $self = shift;
my $theta = shift;
my @class_distributions;
foreach my $cluster_index (0..$self->{_K}-1) {
push @class_distributions, [];
}
foreach my $data_tag (@{$self->{_data_id_tags}}) {
foreach my $cluster_index (0..$self->{_K}-1) {
push @{$class_distributions[$cluster_index]}, $data_tag
if $self->{_expected_class_probs}->{$data_tag}->[$cluster_index]
lib/Algorithm/ExpectationMaximization.pm view on Meta::CPAN
$datavec_minus_mean ) );
} else {
my @data_minus_mean = $datavec_minus_mean->as_list;
my $data_minus_mean_val = $data_minus_mean[0];
my @covar_as_matrix = $covariance->as_list;
my $covar_val = $covar_as_matrix[0];
$log_likely = -0.5 * ($data_minus_mean_val ** 2) / $covar_val;
}
my $posterior_log_likely = $log_likely +
log( $self->{_class_priors}[$cluster_index] );
push @log_likelihoods, $posterior_log_likely;
}
my ($minlikely, $maxlikely) = minmax(\@log_likelihoods);
my $cluster_index_for_data_point =
get_index_at_value( $maxlikely, \@log_likelihoods );
return $cluster_index_for_data_point;
}
sub find_cluster_means_and_covariances {
my $clusters = shift;
my $data_dimensions = find_data_dimensionality($clusters);
lib/Algorithm/ExpectationMaximization.pm view on Meta::CPAN
print "Writing cluster $i to file $filename\n"
if $self->{_terminal_output};
open FILEHANDLE, "| sort > $filename" or die "Unable to open file: $!";
foreach my $ele (@{$clusters[$i-1]}) {
print FILEHANDLE "$ele\n";
}
close FILEHANDLE;
}
}
sub write_posterior_prob_clusters_above_threshold_to_files {
my $self = shift;
my $theta = shift;
my @class_distributions;
foreach my $cluster_index (0..$self->{_K}-1) {
push @class_distributions, [];
}
foreach my $data_tag (@{$self->{_data_id_tags}}) {
foreach my $cluster_index (0..$self->{_K}-1) {
push @{$class_distributions[$cluster_index]}, $data_tag
if $self->{_expected_class_probs}->{$data_tag}->[$cluster_index]
> $theta;
}
}
unlink glob "posterior_prob_cluster*.txt";
foreach my $i (1..@class_distributions) {
my $filename = "posterior_prob_cluster" . $i . ".txt";
print "Writing posterior prob cluster $i to file $filename\n"
if $self->{_terminal_output};
open FILEHANDLE, "| sort > $filename" or die "Unable to open file: $!";
foreach my $ele (@{$class_distributions[$i-1]}) {
print FILEHANDLE "$ele\n";
}
close FILEHANDLE;
}
}
sub DESTROY {
lib/Algorithm/ExpectationMaximization.pm view on Meta::CPAN
$plot->gnuplot_cmd('set terminal png',
'set output "cluster_plot.png"');
$plot->gnuplot_cmd( "plot $arg_string" );
} elsif ($visualization_data_field_width == 1) {
$plot->gnuplot_cmd('set terminal png',
'set output "cluster_plot.png"');
$plot->gnuplot_cmd( "plot $arg_string" );
}
}
# This method is for the visualization of the posterior class distributions. In
# other words, this method allows us to see the soft clustering produced by the EM
# algorithm. While much of the gnuplot logic here is the same as in the
# visualize_clusters() method, there are significant differences in how the data is
# pooled for the purpose of display.
sub visualize_distributions {
my $self = shift;
my $v_mask;
my $pause_time;
if (@_ == 1) {
$v_mask = shift || die "visualization mask missing";
lib/Algorithm/ExpectationMaximization.pm view on Meta::CPAN
$plot = Graphics::GnuplotIF->new( persist => 1 );
} else {
$plot = Graphics::GnuplotIF->new();
}
my $arg_string = "";
if ($visualization_data_field_width > 2) {
$plot->gnuplot_cmd( "set noclip" );
$plot->gnuplot_cmd( "set pointsize 2" );
foreach my $i (0..$self->{_K}-1) {
my $j = $i + 1;
$arg_string .= "\"$temp_file\" index $i using 1:2:3 title \"Cluster $i (based on posterior probs)\" with points lt $j pt $j, ";
}
} elsif ($visualization_data_field_width == 2) {
$plot->gnuplot_cmd( "set noclip" );
$plot->gnuplot_cmd( "set pointsize 2" );
foreach my $i (0..$self->{_K}-1) {
my $j = $i + 1;
$arg_string .= "\"$temp_file\" index $i using 1:2 title \"Cluster $i (based on posterior probs)\" with points lt $j pt $j, ";
my $ellipse_filename = "__contour2_" . $i . ".dat";
$arg_string .= "\"$ellipse_filename\" with line lt $j title \"\", ";
}
} elsif ($visualization_data_field_width == 1 ) {
open INPUT, "$temp_file" or die "Unable to open a temp file in this directory: $!";
my @all_data = <INPUT>;
close INPUT;
@all_data = map {chomp $_; $_ =~ /\d/ ? $_ : "SEPERATOR" } @all_data;
my $all_joined_data = join ':', @all_data;
my @separated = split /:SEPERATOR:SEPERATOR/, $all_joined_data;
lib/Algorithm/ExpectationMaximization.pm view on Meta::CPAN
$plot = Graphics::GnuplotIF->new( persist => 1 );
} else {
$plot = Graphics::GnuplotIF->new();
}
my $arg_string = "";
if ($visualization_data_field_width > 2) {
$plot->gnuplot_cmd( "set noclip" );
$plot->gnuplot_cmd( "set pointsize 2" );
foreach my $i (0..$self->{_K}-1) {
my $j = $i + 1;
$arg_string .= "\"$temp_file\" index $i using 1:2:3 title \"Cluster $i (based on posterior probs)\" with points lt $j pt $j, ";
}
} elsif ($visualization_data_field_width == 2) {
$plot->gnuplot_cmd( "set noclip" );
$plot->gnuplot_cmd( "set pointsize 2" );
foreach my $i (0..$self->{_K}-1) {
my $j = $i + 1;
$arg_string .= "\"$temp_file\" index $i using 1:2 title \"Cluster $i (based on posterior probs)\" with points lt $j pt $j, ";
my $ellipse_filename = "__contour2_" . $i . ".dat";
$arg_string .= "\"$ellipse_filename\" with line lt $j title \"\", ";
}
} elsif ($visualization_data_field_width == 1 ) {
open INPUT, "$temp_file" or die "Unable to open a temp file in this directory: $!";
my @all_data = <INPUT>;
close INPUT;
@all_data = map {chomp $_; $_ =~ /\d/ ? $_ : "SEPERATOR" } @all_data;
my $all_joined_data = join ':', @all_data;
my @separated = split /:SEPERATOR:SEPERATOR/, $all_joined_data;
lib/Algorithm/ExpectationMaximization.pm view on Meta::CPAN
}
$arg_string .= "\"$temp_file\" using 2:xtic(1) ti col smooth frequency with boxes lc $cindex, ";
close OUTPUT;
}
}
$arg_string = $arg_string =~ /^(.*),[ ]+$/;
$arg_string = $1;
if ($visualization_data_field_width > 2) {
$plot->gnuplot_cmd( 'set terminal png',
'set output "posterior_prob_plot.png"');
$plot->gnuplot_cmd( "splot $arg_string" );
} elsif ($visualization_data_field_width == 2) {
$plot->gnuplot_cmd( 'set terminal png',
'set output "posterior_prob_plot.png"');
$plot->gnuplot_cmd( "plot $arg_string" );
} elsif ($visualization_data_field_width == 1) {
$plot->gnuplot_cmd( 'set terminal png',
'set output "posterior_prob_plot.png"');
$plot->gnuplot_cmd( "plot $arg_string" );
}
}
# The method shown below should be called only AFTER you have called the method
# read_data_from_file(). The visualize_data() is meant for the visualization of the
# original data in its various 2D or 3D subspaces.
sub visualize_data {
my $self = shift;
my $v_mask = shift || die "visualization mask missing";
lib/Algorithm/ExpectationMaximization.pm view on Meta::CPAN
# The call `run_bayes_classifier()' shown above carries out a disjoint clustering
# of all the data points using the naive Bayes' classifier. And the call
# `return_disjoint_clusters()' returns the clusters thus formed to you. Once you
# have obtained access to the clusters in this manner, you can display them in
# your terminal window by
foreach my $index (0..@$clusters-1) {
print "Cluster $index (Naive Bayes): @{$clusters->[$index]}\n\n"
}
# If you would like to also see the clusters purely on the basis of the posterior
# class probabilities exceeding a threshold, call
my $theta1 = 0.2;
my $posterior_prob_clusters =
$clusterer->return_clusters_with_posterior_probs_above_threshold($theta1);
# where you can obviously set the threshold $theta1 to any value you wish. Note
# that now you may end up with clusters that overlap. You can display them in
# your terminal window in the same manner as shown above for the naive Bayes'
# clusters.
# You can write the naive Bayes' clusters out to files, one cluster per file, by
# calling
$clusterer->write_naive_bayes_clusters_to_files();
# The clusters are placed in files with names like
naive_bayes_cluster1.txt
naive_bayes_cluster2.txt
...
# In the same manner, you can write out the posterior probability based possibly
# overlapping clusters to files by calling:
$clusterer->write_posterior_prob_clusters_above_threshold_to_files($theta1);
# where the threshold $theta1 sets the probability threshold for deciding which
# data elements to place in a cluster. These clusters are placed in files with
# names like
posterior_prob_cluster1.txt
posterior_prob_cluster2.txt
...
# CLUSTER VISUALIZATION:
# You must first set the mask for cluster visualization. This mask tells the
# module which 2D or 3D subspace of the original data space you wish to visualize
# the clusters in:
my $visualization_mask = "111";
$clusterer->visualize_clusters($visualization_mask);
$clusterer->visualize_distributions($visualization_mask);
$clusterer->plot_hardcopy_clusters($visualization_mask);
$clusterer->plot_hardcopy_distributions($visualization_mask);
# where the last two invocations are for writing out the PNG plots of the
# visualization displays to disk files. The PNG image of the posterior
# probability distributions is written out to a file named posterior_prob_plot.png
# and the PNG image of the disjoint clusters to a file called cluster_plot.png.
# SYNTHETIC DATA GENERATION:
# The module has been provided with a class method for generating multivariate
# data for experimenting with the EM algorithm. The data generation is controlled
# by the contents of a parameter file that is supplied as an argument to the data
# generator method. The priors, the means, and the covariance matrices in the
# parameter file must be according to the syntax shown in the `param1.txt' file in
# the `examples' directory. It is best to edit a copy of this file for your
lib/Algorithm/ExpectationMaximization.pm view on Meta::CPAN
sense that it is not a Perl wrapper around a C library that actually does the
clustering), the code in the module can easily be modified to experiment with several
aspects of EM.
Gaussian Mixture Modeling (GMM) is based on the assumption that the data consists of
C<K> Gaussian components, each characterized by its own mean vector and its own
covariance matrix. Obviously, given observed data for clustering, we do not know
which of the C<K> Gaussian components was responsible for any of the data elements.
GMM also associates a prior probability with each Gaussian component. In general,
these priors will also be unknown. So the problem of clustering consists of
estimating the posterior class probability at each data element and also estimating
the class priors. Once these posterior class probabilities and the priors are
estimated with EM, we can use the naive Bayes' classifier to partition the data into
disjoint clusters. Or, for "soft" clustering, we can find all the data elements that
belong to a Gaussian component on the basis of the posterior class probabilities at
the data elements exceeding a prescribed threshold.
If you do not mind the fact that it is possible for the EM algorithm to occasionally
get stuck in a local maximum and to, therefore, produce a wrong answer even when you
know the data to be perfectly multimodal Gaussian, EM is probably the most magical
approach to clustering multidimensional data. Consider the case of clustering
three-dimensional data. Each Gaussian cluster in 3D space is characterized by the
following 10 variables: the 6 unique elements of the C<3x3> covariance matrix (which
must be symmetric positive-definite), the 3 unique elements of the mean, and the
prior associated with the Gaussian. Now let's say you expect to see six Gaussians in
lib/Algorithm/ExpectationMaximization.pm view on Meta::CPAN
that cannot be seen directly but that nonetheless is relevant to the data actually
observed. For the case of clustering multidimensional numerical data that can be
modeled as a Gaussian mixture, it turns out that the best way to think of the
unobserved data is in terms of a sequence of random variables, one for each observed
data point, whose values dictate the selection of the Gaussian for that data point.
This point is explained in great detail in my on-line tutorial at
L<https://engineering.purdue.edu/kak/Tutorials/ExpectationMaximization.pdf>.
The EM algorithm in our context reduces to an iterative invocation of the following
steps: (1) Given the current guess for the means and the covariances of the different
Gaussians in our mixture model, use Bayes' Rule to update the posterior class
probabilities at each of the data points; (2) Using the updated posterior class
probabilities, first update the class priors; (3) Using the updated class priors,
update the class means and the class covariances; and go back to Step (1). Ideally,
the iterations should terminate when the expected log-likelihood of the observed data
has reached a maximum and does not change with any further iterations. The stopping
rule used in this module is the detection of no change over three consecutive
iterations in the values calculated for the priors.
This module provides three different choices for seeding the clusters: (1) random,
(2) kmeans, and (3) manual. When random seeding is chosen, the algorithm randomly
selects C<K> data elements as cluster seeds. That is, the data vectors associated
lib/Algorithm/ExpectationMaximization.pm view on Meta::CPAN
This is the workhorse of the module, as you would expect.
The means, the covariances, and the priors estimated by this
method are stored in instance variables that are subsequently
accessed by other methods for the purpose of displaying the
clusters, the probability distributions, etc.
=item B<run_bayes_classifier():>
$clusterer->run_bayes_classifier();
Using the posterior probability distributions estimated by
the C<EM()> method, this method partitions the data into the
C<K> disjoint clusters using the naive Bayes' classifier.
=item B<return_disjoint_clusters():>
my $clusters = $clusterer->return_disjoint_clusters();
This allows you to access the clusters obtained with the
application of the naive Bayes' classifier in your own
scripts. If, say, you wanted to see the data records placed
lib/Algorithm/ExpectationMaximization.pm view on Meta::CPAN
naive Bayes' classifier to disk files, one cluster per
file. What is written out to each file consists of the
symbolic names of the data records that belong to the
cluster corresponding to that file. The clusters are placed
in files with names like
naive_bayes_cluster1.txt
naive_bayes_cluster2.txt
...
=item B<return_clusters_with_posterior_probs_above_threshold($theta1):>
my $theta1 = 0.2;
my $posterior_prob_clusters =
$clusterer->return_clusters_with_posterior_probs_above_threshold($theta1);
This method returns a reference to an array of C<K>
anonymous arrays, each consisting of the symbolic names for
the data records where the posterior class probability
exceeds the threshold as specified by C<$theta1>.
Subsequently, you can access each of these
posterior-probability based clusters through a loop
construct such as
foreach my $index (0..@$posterior_prob_clusters-1) {
print "Cluster $index (based on posterior probs exceeding $theta1): " .
"@{$posterior_prob_clusters->[$index]}\n\n"
}
=item B<write_posterior_prob_clusters_above_threshold_to_files($theta1):>
$clusterer->write_posterior_prob_clusters_above_threshold_to_files($theta1);
This call writes out the posterior-probability based soft
clusters to disk files. As in the previous method, the
threshold C<$theta1> sets the probability threshold for
deciding which data elements belong to a cluster. These
clusters are placed in files with names like
posterior_prob_cluster1.txt
posterior_prob_cluster2.txt
...
=item B<return_individual_class_distributions_above_given_threshold($theta):>
my $theta2 = 0.00001;
my $class_distributions =
$clusterer->return_individual_class_distributions_above_given_threshold($theta2);
This is the method to call if you wish to see the individual
Gaussians in your own script. The method returns a reference
lib/Algorithm/ExpectationMaximization.pm view on Meta::CPAN
corresponding to the dimensions you do NOT want to see
through visualization. Depending on the mask used, this
method creates a 2D or a 3D scatter plot of the clusters
obtained through the naive Bayes' classification rule.
=item B<visualize_distributions($visualization_mask):>
$clusterer->visualize_distributions($visualization_mask);
This is the method to call if you want to visualize the soft
clustering corresponding to the posterior class
probabilities exceeding the threshold specified in the call
to
C<return_clusters_with_posterior_probs_above_threshold($theta1)>.
Again, depending on the visualization mask used, you will
see either a 2D plot or a 3D scatter plot.
=item B<plot_hardcopy_clusters($visualization_mask):>
$clusterer->plot_hardcopy_clusters($visualization_mask);
This method create a PNG file from the C<gnuplot> created
display of the naive Bayes' clusters obtained from the data.
The plotting functionality of C<gnuplot> is accessed through
the Perl wrappers provided by the C<Graphics::GnuplotIF>
module.
=item B<plot_hardcopy_distributions($visualization_mask):>
$clusterer->plot_hardcopy_distributions($visualization_mask);
This method create a PNG file from the C<gnuplot> created
display of the clusters that correspond to the posterior
class probabilities exceeding a specified threshold. The
plotting functionality of C<gnuplot> is accessed through the
Perl wrappers provided by the C<Graphics::GnuplotIF> module.
=item B<display_fisher_quality_vs_iterations():>
$clusterer->display_fisher_quality_vs_iterations();
This method measures the quality of clustering by
calculating the normalized average squared distance between
lib/Algorithm/ExpectationMaximization.pm view on Meta::CPAN
subspace for the PNG image.
=back
=head1 HOW THE CLUSTERS ARE OUTPUT
This module produces two different types of clusters: the "hard" clusters and the
"soft" clusters. The hard clusters correspond to the naive Bayes' classification of
the data points on the basis of the Gaussian distributions and the class priors
estimated by the EM algorithm. Such clusters partition the data into disjoint
subsets. On the other hand, the soft clusters correspond to the posterior class
probabilities calculated by the EM algorithm. A data element belongs to a cluster if
its posterior probability for that Gaussian exceeds a threshold.
After the EM algorithm has finished running, the hard clusters are created by
invoking the method C<run_bayes_classifier()> on an instance of the module and then
made user-accessible by calling C<return_disjoint_clusters()>. These clusters may
then be displayed in a terminal window by dereferencing each element of the array
whose reference is returned b C<return_disjoint_clusters()>. The hard clusters can
be written out to disk files by invoking C<write_naive_bayes_clusters_to_files()>.
This method writes out the clusters to files, one cluster per file. What is written
out to each file consists of the symbolic names of the data records that belong to
the cluster corresponding to that file. The clusters are placed in files with names
like
naive_bayes_cluster1.txt
naive_bayes_cluster2.txt
...
The soft clusters on the other hand are created by calling
C<return_clusters_with_posterior_probs_above_threshold($theta1)>
on an instance of the module, where the argument C<$theta1>
is the threshold for deciding whether a data element belongs
in a soft cluster. The posterior class probability at a
data element must exceed the threshold for the element to
belong to the corresponding cluster. The soft cluster can
be written out to disk files by calling
C<write_posterior_prob_clusters_above_threshold_to_files($theta1)>.
As with the hard clusters, each cluster is placed in a separate
file. The filenames for such clusters look like:
posterior_prob_cluster1.txt
posterior_prob_cluster2.txt
...
=head1 WHAT IF THE NUMBER OF CLUSTERS IS UNKNOWN?
The module constructor requires that you supply a value for the parameter C<K>, which
is the number of clusters you expect to see in the data. But what if you do not have
a good value for C<K>? Note that it is possible to search for the best C<K> by using
the two clustering quality criteria included in the module. However, I have
intentionally not yet incorporated that feature in the module because it slows down
the execution of the code --- especially when the dimensionality of the data
lib/Algorithm/ExpectationMaximization.pm view on Meta::CPAN
following five example scripts:
=over 16
=item I<canned_example1.pl>
This example applies the EM algorithm to the data contained in the datafile
C<mydatafile.dat>. The mixture data in the file corresponds to three overlapping
Gaussian components in a star-shaped pattern. The EM based clustering for this data
is shown in the files C<save_example_1_cluster_plot.png> and
C<save_example_1_posterior_prob_plot.png>, the former displaying the hard clusters
obtained by using the naive Bayes' classifier and the latter showing the soft
clusters obtained on the basis of the posterior class probabilities at the data
points.
=item I<canned_example2.pl>
The datafile used in this example is C<mydatafile2.dat>. This mixture data
corresponds to two well-separated relatively isotropic Gaussians. EM based clustering for this
data is shown in the files C<save_example_2_cluster_plot.png> and
C<save_example_2_posterior_prob_plot.png>, the former displaying the hard clusters
obtained by using the naive Bayes' classifier and the latter showing the soft
clusters obtained by using the posterior class probabilities at the data points.
=item I<canned_example3.pl>
Like the first example, this example again involves three Gaussians, but now their
means are not co-located. Additionally, we now seed the clusters manually by
specifying three selected data points as the initial guesses for the cluster means.
The datafile used for this example is C<mydatafile3.dat>. The EM based clustering
for this data is shown in the files C<save_example_3_cluster_plot.png> and
C<save_example_3_posterior_prob_plot.png>, the former displaying the hard clusters
obtained by using the naive Bayes' classifier and the latter showing the soft
clusters obtained on the basis of the posterior class probabilities at the data
points.
=item I<canned_example4.pl>
Whereas the three previous examples demonstrated EM based clustering of 2D data, we
now present an example of clustering in 3D. The datafile used in this example is
C<mydatafile4.dat>. This mixture data corresponds to three well-separated but highly
anisotropic Gaussians. The EM derived clustering for this data is shown in the files
C<save_example_4_cluster_plot.png> and C<save_example_4_posterior_prob_plot.png>, the
former displaying the hard clusters obtained by using the naive Bayes' classifier and
the latter showing the soft clusters obtained on the basis of the posterior class
probabilities at the data points.
You may also wish to run this example on the data in a CSV file in the C<examples>
directory. The name of the file is C<sphericaldata.csv>.
=item I<canned_example5.pl>
We again demonstrate clustering in 3D but now we have one Gaussian cluster that
"cuts" through the other two Gaussian clusters. The datafile used in this example is
C<mydatafile5.dat>. The three Gaussians in this case are highly overlapping and
highly anisotropic. The EM derived clustering for this data is shown in the files
C<save_example_5_cluster_plot.png> and C<save_example_5_posterior_prob_plot.png>, the
former displaying the hard clusters obtained by using the naive Bayes' classifier and
the latter showing the soft clusters obtained through the posterior class
probabilities at the data points.
=item I<canned_example6.pl>
This example, added in Version 1.2, demonstrates the use of this module for 1-D data.
In order to visualize the clusters for the 1-D case, we show them through their
respective histograms. The datafile used in this example is C<mydatafile7.dat>. The
data consists of two overlapping Gaussians. The EM derived clustering for this data
is shown in the files C<save_example_6_cluster_plot.png> and
C<save_example_6_posterior_prob_plot.png>, the former displaying the hard clusters
obtained by using the naive Bayes' classifier and the latter showing the soft
clusters obtained through the posterior class probabilities at the data points.
=back
Going through the six examples listed above will make you familiar with how to make
the calls to the clustering and the visualization methods. The C<examples> directory
also includes several parameter files with names like
param1.txt
param2.txt
param3.txt