Algorithm-ExpectationMaximization
lib/Algorithm/ExpectationMaximization.pm
sub seed_the_clusters {
    my $self = shift;
    if ($self->{_seeding} eq 'random') {
        my @covariances;
        my @means;
        my @all_tags = @{$self->{_data_id_tags}};
        my @seed_tags;
        # Randomly pick K data tags to serve as the cluster seeds:
        foreach my $i (0..$self->{_K}-1) {
            push @seed_tags, $all_tags[int rand( $self->{_N} )];
        }
        print "Random Seeding: Randomly selected seeding tags are @seed_tags\n\n";
        # The seed means are the data vectors at the seed tags; the seed
        # covariances are computed from the entire dataset with respect to
        # those means:
        my ($seed_means, $seed_covars) =
            $self->find_seed_centered_covariances(\@seed_tags);
        $self->{_cluster_means} = $seed_means;
        $self->{_cluster_covariances} = $seed_covars;
    } elsif ($self->{_seeding} eq 'kmeans') {
        # With kmeans seeding, the means and the covariances come from a
        # K-means run over the data:
        $self->kmeans();
        my $clusters = $self->{_clusters};
        my @dataclusters;
        foreach my $index (0..@$clusters-1) {
            push @dataclusters, [];
        }
        # ... (the rest of this subroutine is elided in this excerpt) ...
my $clusterer = Algorithm::ExpectationMaximization->new(
    datafile          => $datafile,
    mask              => $mask,
    K                 => 3,
    max_em_iterations => 300,
    seeding           => 'random',
    terminal_output   => 1,
);
# Note the choice for `seeding'. The choice `random' means that the clusterer will
# randomly select `K' data points to serve as initial cluster centers. Other
# possible choices for the constructor parameter `seeding' are `kmeans' and
# `manual'. With the `kmeans' option for `seeding', the output of a K-means
# clusterer is used for the cluster seeds and the initial cluster covariances. If
# you use the `manual' option for seeding, you must also specify the data elements
# to use for seeding the clusters.
# Here is an example of a call to the constructor when we choose the `manual'
# option for seeding the clusters and for specifying the data elements for
# seeding. The data elements are specified by their tag names. In this case,
# these names are `a26', `b53', and `c49':
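# (The call below is a sketch supplied to complete this excerpt: apart from
# `seeding' and `seed_tags', the parameters mirror the earlier call, and
# `seed_tags' is the assumed name of the parameter carrying the tag names
# of the seed data elements.)
my $clusterer = Algorithm::ExpectationMaximization->new(
    datafile          => $datafile,
    mask              => $mask,
    K                 => 3,
    max_em_iterations => 300,
    seeding           => 'manual',
    seed_tags         => ['a26', 'b53', 'c49'],
    terminal_output   => 1,
);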
give you a good approximation to the right answer.
At its core, EM depends on the notion of unobserved data and the averaging of the
log-likelihood of the data actually observed over all admissible probabilities for
the unobserved data. But what is unobserved data? While in some cases where EM is
used, the unobserved data is literally the missing data, in others, it is something
that cannot be seen directly but that nonetheless is relevant to the data actually
observed. For the case of clustering multidimensional numerical data that can be
modeled as a Gaussian mixture, it turns out that the best way to think of the
unobserved data is in terms of a sequence of random variables, one for each observed
data point, whose values dictate the selection of the Gaussian for that data point.
This point is explained in great detail in my on-line tutorial at
L<https://engineering.purdue.edu/kak/Tutorials/ExpectationMaximization.pdf>.
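In the standard notation for Gaussian mixtures (not notation taken from this
module), the observed points are modeled by the mixture density

    p(x_i) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)

and the unobserved datum attached to each x_i is a label z_i \in \{1,\dots,K\}
with P(z_i = k) = \pi_k and p(x_i \mid z_i = k) = \mathcal{N}(x_i \mid \mu_k,
\Sigma_k); EM averages the complete-data log-likelihood over the posterior
distribution of these labels.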
The EM algorithm in our context reduces to an iterative invocation of the
following steps: (1) given the current guesses for the means and the covariances
of the different Gaussians in our mixture model, use Bayes' Rule to update the
posterior class probabilities at each of the data points; (2) using the updated
posterior class probabilities, update the class priors; (3) using the updated
class priors, update the class means and the class covariances; then go back to
Step (1). Ideally, the iterations should terminate when the expected
log-likelihood of the observed data has reached a maximum and does not change
with any further iterations. The stopping rule used in this module is the
detection of no change, over three consecutive iterations, in the values
calculated for the priors.
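To make the three steps concrete, here is a minimal, self-contained Perl sketch
of one round of these updates for a one-dimensional, two-component Gaussian
mixture. It illustrates the steps described above, not this module's internals;
every name in it is made up for the illustration.

    use strict;
    use warnings;

    my $PI = 4 * atan2(1, 1);

    my @data   = (1.0, 1.2, 0.8, 5.1, 4.9, 5.3);  # observed 1-D points
    my @means  = (0.0, 4.0);                      # current guesses for the means
    my @vars   = (1.0, 1.0);                      # current guesses for the variances
    my @priors = (0.5, 0.5);                      # current guesses for the class priors

    sub gaussian {
        my ($x, $mean, $var) = @_;
        return exp(-($x - $mean) ** 2 / (2 * $var)) / sqrt(2 * $PI * $var);
    }

    # Step (1): Bayes' Rule turns the current Gaussians and priors into
    # posterior class probabilities at each data point.
    my @posterior;                                # $posterior[$i][$k]
    for my $i (0 .. $#data) {
        my @joint = map { $priors[$_] * gaussian($data[$i], $means[$_], $vars[$_]) } 0 .. 1;
        my $total = $joint[0] + $joint[1];
        $posterior[$i] = [ map { $_ / $total } @joint ];
    }

    # Step (2): the updated prior of each class is the average posterior
    # probability of that class over all the data points.
    for my $k (0 .. 1) {
        my $sum = 0;
        $sum += $posterior[$_][$k] for 0 .. $#data;
        $priors[$k] = $sum / @data;
    }

    # Step (3): posterior-weighted updates of the means and the variances;
    # a full run would now go back to Step (1) and iterate to convergence.
    for my $k (0 .. 1) {
        my ($wsum, $msum) = (0, 0);
        for my $i (0 .. $#data) {
            $wsum += $posterior[$i][$k];
            $msum += $posterior[$i][$k] * $data[$i];
        }
        $means[$k] = $msum / $wsum;
        my $vsum = 0;
        $vsum += $posterior[$_][$k] * ($data[$_] - $means[$k]) ** 2 for 0 .. $#data;
        $vars[$k] = $vsum / $wsum;
    }

    printf "priors: %.3f %.3f   means: %.3f %.3f\n", @priors, @means;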
This module provides three different choices for seeding the clusters: (1) random,
(2) kmeans, and (3) manual. When random seeding is chosen, the algorithm randomly
selects C<K> data elements as cluster seeds. That is, the data vectors associated
with these seeds are treated as initial guesses for the means of the Gaussian
distributions. The covariances are then set to the values calculated from the entire
dataset with respect to the means corresponding to the seeds. With kmeans seeding, on
the other hand, the means and the covariances are set to whatever values are returned
by the kmeans algorithm. And, when seeding is set to manual, you are allowed to
choose C<K> data elements --- by specifying their tag names --- for the seeds. The
rest of the EM initialization for the manual mode is the same as for the random mode.
The algorithm allows for the initial priors to be specified for the manual mode of
seeding.
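For C<K = 3>, for instance, the manual-seeding constructor call could carry the
initial priors along with the seed tags. This is a sketch: C<class_priors> is
an assumed name for the constructor parameter, and the C<K> priors must sum
to 1.

    seeding      => 'manual',
    seed_tags    => ['a26', 'b53', 'c49'],
    class_priors => [0.6, 0.2, 0.2],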
you plan to cluster with the EM algorithm. You'd need to
specify argument mask in a manner similar to the
visualization of the clusters, as explained earlier.
=item B<plot_hardcopy_data($data_visualization_mask):>
$clusterer->plot_hardcopy_data($data_visualization_mask);
This method creates a PNG file that can be used to print out
a hardcopy of the data in different 2D and 3D subspaces of
the data space. The visualization mask is used to select the
subspace for the PNG image.
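For example, assuming 3D data, a mask of C<"111"> would retain all three
coordinates in the hardcopy, while C<"110"> would project the data onto the
first two; this is a sketch whose mask semantics are assumed to follow the
visualization masks described earlier.

    $clusterer->plot_hardcopy_data("111");   # all three data coordinates
    $clusterer->plot_hardcopy_data("110");   # 2D projection onto the first two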
=back
=head1 HOW THE CLUSTERS ARE OUTPUT
This module produces two different types of clusters: the "hard" clusters and the
"soft" clusters. The hard clusters correspond to the naive Bayes' classification of
the data points on the basis of the Gaussian distributions and the class priors
estimated by the EM algorithm. Such clusters partition the data into disjoint
subsets.
corresponds to two well-separated, relatively isotropic Gaussians. The EM-based clustering for this
data is shown in the files C<save_example_2_cluster_plot.png> and
C<save_example_2_posterior_prob_plot.png>, the former displaying the hard clusters
obtained by using the naive Bayes' classifier and the latter showing the soft
clusters obtained by using the posterior class probabilities at the data points.
=item I<canned_example3.pl>
Like the first example, this example again involves three Gaussians, but now their
means are not co-located. Additionally, we now seed the clusters manually by
specifying three selected data points as the initial guesses for the cluster means.
The datafile used for this example is C<mydatafile3.dat>. The EM-based clustering
for this data is shown in the files C<save_example_3_cluster_plot.png> and
C<save_example_3_posterior_prob_plot.png>, the former displaying the hard clusters
obtained by using the naive Bayes' classifier and the latter showing the soft
clusters obtained on the basis of the posterior class probabilities at the data
points.
=item I<canned_example4.pl>
Whereas the three previous examples demonstrated EM-based clustering of 2D data, we