Algorithm-ExpectationMaximization


README

This module implements the Expectation-Maximization (EM)
algorithm for clustering, with particular regard to its
ability to identify the individual Gaussians in data that
can be modeled as a Gaussian mixture.

This module requires the following three modules:

   Math::Random
   Graphics::GnuplotIF
   Math::GSL::Matrix

the first for generating multivariate random numbers, the
second for visualizing the clusters, and the third for
access to the Perl wrappers for the matrix-handling portion
of the GNU Scientific Library, which is used for the
algebraic manipulation of the covariance matrices of the
Gaussians.

For installation, do the usual

    perl Makefile.PL
    make
    make test
    make install

examples/param6.txt

# Comment lines must begin with #
#
# IMPORTANT:  Only one blank space allowed between consecutive numbers 
# in each line

priors 0.25 0.25 0.25 0.25

cluster

5 0 0.0

3 0 0
0 60 0
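#
# (Assumed reading of the layout above, based on this excerpt alone: the
# `priors' line gives the prior probability of each cluster, and each
# `cluster' block lists the cluster's mean vector followed by the rows
# of its covariance matrix.)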

examples/param7.txt

# Comment lines must begin with #
#
# IMPORTANT:  Only one blank space allowed between consecutive numbers 
# in each line

priors 0.5 0.5

cluster

5.0

3

lib/Algorithm/ExpectationMaximization.pm

    # For each Gaussian in d dimensions you need d(d+1)/2 parameters for the
    # covariance matrix, d mean value parameters, and one prior probability
    # parameter.  So  $L = K[1 + d + d(d+1)/2] - 1  where the final '1' that
    # is subtracted is to account for the normalization on the class priors.
    my $L = (0.5 * $self->{_K} * 
             ($self->{_data_dimensions}**2 + 3*$self->{_data_dimensions} + 2) ) - 1;
    my $model_complexity_penalty = 0.5 * $L * log( $self->{_N} );
    my $mdl_criterion = -1 * $log_likelihood + $model_complexity_penalty;
    return $mdl_criterion;
}

# For our second measure of clustering quality, we use `trace( SW^-1 . SB)' where SW
# is the within-class scatter matrix, more commonly denoted S_w, and SB the
# between-class scatter matrix, more commonly denoted S_b (the underscore means
# subscript).  This measure can be thought of as the normalized average distance
# between the clusters, the normalization being provided by the inverse SW^-1 of
# the average cluster covariance.  Therefore, the larger the value of this quality measure, the
# better the separation between the clusters.  Since this measure has its roots in
# the Fisher linear discriminant function, we incorporate the word 'fisher' in the
# name of the quality measure.  Note that this measure is good only when the clusters
# are disjoint.  When the clusters exhibit significant overlap, the numbers produced
# by this quality measure tend to be generally meaningless.  As an extreme case,

lib/Algorithm/ExpectationMaximization.pm

    my @v_mask = split //, $v_mask;
    my $visualization_mask_width = @v_mask;
    my $original_data_mask = $self->{_mask};
    my @mask = split //, $original_data_mask;
    my $data_field_width = scalar grep {$_ eq '1'} @mask;    
    die "\n\nABORTED: The width of the visualization mask (including " .
          "all its 1s and 0s) must equal the width of the original mask " .
          "used for reading the data file (counting only the 1's)"
          if $visualization_mask_width != $data_field_width;
    my $visualization_data_field_width = scalar grep {$_ eq '1'} @v_mask;
    # The following section is for the superimposed one-Mahalanobis-distance-unit 
    # ellipses that are shown only for 2D plots:
    if ($visualization_data_field_width == 2) {
        foreach my $cluster_index (0..$self->{_K}-1) {
            my $contour_filename = "__contour_" . $cluster_index . ".dat";
            my $mean = $self->{_cluster_means}->[$cluster_index];
            my $covariance = $self->{_cluster_covariances}->[$cluster_index];
            my ($mux,$muy) = $mean->as_list();
            my ($varx,$sigmaxy) = $covariance->row(0)->as_list();
            my ($sigmayx,$vary) = $covariance->row(1)->as_list();
            die "Your covariance matrix does not look right" 

lib/Algorithm/ExpectationMaximization.pm

=head1 SYNOPSIS

  use Algorithm::ExpectationMaximization;

  #  First name the data file:

  my $datafile = "mydatafile.csv";

  #  Next, set the mask to indicate which columns of the datafile to use for
  #  clustering and which column contains a symbolic ID for each data record. For
  #  example, if the symbolic name is in the first column, you want the second column
  #  to be ignored, and you want the next three columns to be used for 3D clustering:

  my $mask = "N0111";

  #  Now construct an instance of the clusterer.  The parameter `K' controls the
  #  number of clusters.  Here is an example call to the constructor for instance
  #  creation:

  my $clusterer = Algorithm::ExpectationMaximization->new(
                                      datafile            => $datafile,
                                      mask                => $mask,
                                      K                   => 3,
                                      seeding             => 'random',
                  );
lib/Algorithm/ExpectationMaximization.pm

L<https://engineering.purdue.edu/kak/Tutorials/ExpectationMaximization.pdf>.

The EM algorithm in our context reduces to an iterative invocation of the following
steps: (1) Given the current guesses for the means and the covariances of the
different Gaussians in our mixture model, use Bayes' Rule to update the posterior
class probabilities at each of the data points; (2) using the updated posterior class
probabilities, update the class priors; (3) using the updated class priors, update
the class means and the class covariances; then go back to Step (1).  Ideally, the
iterations should terminate when the expected log-likelihood of the observed data has
reached a maximum and does not change with any further iterations.  The stopping rule
used in this module is the detection of no change, over three consecutive iterations,
in the values calculated for the priors.
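
To make these three steps concrete, here is a minimal, self-contained Perl sketch of
one possible EM loop for a two-component, one-dimensional Gaussian mixture.  It is
only an illustration of the update equations, not how this module organizes its code;
the data values and the starting guesses are made up, and a fixed iteration count
stands in for the module's priors-based stopping rule:

  use strict;
  use warnings;
  use List::Util qw(sum);
  use constant PI => 4 * atan2(1, 1);

  # made-up 1D data drawn from two well-separated lumps:
  my @x = (1.0, 1.2, 0.8, 1.1, 4.9, 5.2, 5.0, 4.8);

  # initial guesses for the two Gaussians and their priors:
  my @mu    = (0.5, 6.0);
  my @var   = (1.0, 1.0);
  my @prior = (0.5, 0.5);

  sub gauss {                                  # 1D Gaussian density
      my ($x, $mu, $var) = @_;
      return exp( -($x - $mu)**2 / (2 * $var) ) / sqrt(2 * PI * $var);
  }

  for my $iter (1 .. 20) {
      # Step (1): Bayes' Rule gives the posterior class probabilities
      my @post;                                # $post[$i][$k]
      for my $i (0 .. $#x) {
          my @unnorm = map { $prior[$_] * gauss($x[$i], $mu[$_], $var[$_]) } 0, 1;
          my $total  = $unnorm[0] + $unnorm[1];
          $post[$i]  = [ map { $_ / $total } @unnorm ];
      }
      # Step (2): update the class priors
      my @nk = map { my $k = $_; sum(map { $post[$_][$k] } 0 .. $#x) } 0, 1;
      @prior = map { $nk[$_] / @x } 0, 1;
      # Step (3): update the class means and variances
      for my $k (0, 1) {
          $mu[$k]  = sum(map { $post[$_][$k] * $x[$_] } 0 .. $#x) / $nk[$k];
          $var[$k] = sum(map { $post[$_][$k] * ($x[$_] - $mu[$k])**2 } 0 .. $#x)
                     / $nk[$k];
      }
  }
  printf "means = (%.2f, %.2f)  variances = (%.2f, %.2f)  priors = (%.2f, %.2f)\n",
         @mu, @var, @prior;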

This module provides three different choices for seeding the clusters: (1) random,
(2) kmeans, and (3) manual.  When random seeding is chosen, the algorithm randomly
selects C<K> data elements as cluster seeds.  That is, the data vectors associated
with these seeds are treated as initial guesses for the means of the Gaussian
distributions.  The covariances are then set to the values calculated from the entire
dataset with respect to the means corresponding to the seeds.  With kmeans seeding,
on the other hand, the means and the covariances are set to whatever values are
returned by the kmeans algorithm.  And, when seeding is set to manual, you are
allowed to designate a specific set of data elements to serve as cluster seeds.

lib/Algorithm/ExpectationMaximization.pm

One measure of the quality of clustering provided by this module is based on the
Minimum Description Length (MDL) principle of Rissanen (J. Rissanen: "Modeling by
Shortest Data Description," Automatica, 1978, and "A Universal Prior for Integers and
Estimation by Minimum Description Length," Annals of Statistics, 1983).  The MDL
criterion is the difference of a log-likelihood term for all of the observed data and
a model-complexity penalty term.  In general, both the log-likelihood and the
model-complexity terms increase as the number of clusters increases.  The form of the
MDL criterion in this module uses for the penalty term the Bayesian Information
Criterion (BIC) of G. Schwarz, "Estimating the Dimension of a Model," The Annals of
Statistics, 1978.  In general, the smaller the value of the MDL quality measure, the
better the clustering of the data.
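
Written out, the criterion that the module computes (see the code excerpt earlier on
this page) has the following shape; in this sketch the log-likelihood, the cluster
count, the data dimensionality, and the sample count are all placeholder values:

  use strict;
  use warnings;

  # Number of independent model parameters for K Gaussians in d dimensions:
  # d(d+1)/2 covariance terms and d mean terms per Gaussian, plus K class
  # priors that must sum to 1 (hence the final '- 1'):
  #     L = K * (1 + d + d(d+1)/2) - 1 = K * (d**2 + 3*d + 2)/2 - 1
  sub mdl_criterion {
      my ($log_likelihood, $K, $d, $N) = @_;
      my $L = 0.5 * $K * ($d**2 + 3 * $d + 2) - 1;
      return -$log_likelihood + 0.5 * $L * log($N);    # BIC-style penalty
  }

  # placeholder values: 3 clusters, 2D data, 100 samples, made-up likelihood
  printf "MDL = %.2f\n", mdl_criterion(-450.7, 3, 2, 100);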

For our second measure of clustering quality, we use `trace( SW^-1 . SB)' where SW is
the within-class scatter matrix, more commonly denoted S_w, and SB the between-class
scatter matrix, more commonly denoted S_b (the underscore means subscript).  This
measure can be thought of as the normalized average distance between the clusters,
the normalization being provided by the inverse SW^-1 of the average cluster
covariance.  Therefore, the
larger the value of this quality measure, the better the separation between the
clusters.  Since this measure has its roots in the Fisher linear discriminant
function, we incorporate the word C<fisher> in the name of the quality measure.
I<Note that this measure is good only when the clusters are disjoint.> When the
clusters exhibit significant overlap, the numbers produced by this quality measure
tend to be generally meaningless.
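
For two-dimensional data the measure is easy to write out by hand.  The sketch below
inverts a 2x2 SW explicitly and computes trace( SW^-1 . SB ); the two scatter
matrices are made-up numbers, and the module itself relies on Math::GSL::Matrix for
its matrix algebra rather than on hand-rolled 2x2 formulas:

  use strict;
  use warnings;

  # trace( SW^-1 . SB ) for 2x2 matrices given as [ [a, b], [c, d] ]:
  sub fisher_quality {
      my ($SW, $SB) = @_;
      my ($a, $b) = @{ $SW->[0] };
      my ($c, $d) = @{ $SW->[1] };
      my $det = $a * $d - $b * $c;
      die "SW is singular" if $det == 0;
      # closed-form inverse of a 2x2 matrix:
      my @inv = ( [  $d / $det, -$b / $det ],
                  [ -$c / $det,  $a / $det ] );
      # only the diagonal of (SW^-1 . SB) is needed for the trace:
      my $t00 = $inv[0][0] * $SB->[0][0] + $inv[0][1] * $SB->[1][0];
      my $t11 = $inv[1][0] * $SB->[0][1] + $inv[1][1] * $SB->[1][1];
      return $t00 + $t11;
  }

  # made-up scatter matrices: tight clusters (small SW), well separated (large SB)
  my $SW = [ [ 1.0, 0.1 ], [ 0.1, 1.5 ] ];
  my $SB = [ [ 9.0, 2.0 ], [ 2.0, 6.0 ] ];
  printf "fisher quality = %.3f\n", fisher_quality($SW, $SB);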

lib/Algorithm/ExpectationMaximization.pm

   ....

where the first column contains the symbolic ID tag for each
data record and the rest of the columns the numerical
information.  Which columns are actually used for
clustering is decided by the string value of the mask.  For
example, if we wanted to cluster on the basis of the entries
in just the 3rd, the 4th, and the 5th columns above, the
mask value would be C<N0111> where the character C<N>
indicates that the ID tag is in the first column, the
character C<0> that the second column is to be ignored, and
the C<1>'s that follow that the 3rd, the 4th, and the 5th
columns are to be used for clustering.
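
As an illustration of this convention (the records below
are invented for this example), a data file meant for the
mask C<N0111> might begin like this, with the symbolic ID
tag in the first column, an ignored value in the second,
and the three clustering coordinates in the remaining
three columns:

   a26,1,0.92,4.31,3.07
   b53,1,0.45,3.22,6.34
   c49,1,5.12,4.91,3.86
   ...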

If instead of random seeding, you wish to use the kmeans
based seeding, just replace the option C<random> supplied
for C<seeding> by C<kmeans>.  You can also do manual seeding
by designating a specified set of data elements to serve as
cluster seeds.  The call to the constructor in this case
looks like
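
(The excerpt is cut off at this point; the sketch below is
only a guess at the shape of such a call, with the
seed-selection parameter name and the tag values assumed
for illustration.)

  my $clusterer = Algorithm::ExpectationMaximization->new(
                      datafile   => $datafile,
                      mask       => $mask,
                      K          => 3,
                      seeding    => 'manual',
                      # assumed parameter name and made-up tags:
                      seed_tags  => ['a26', 'b53', 'c49'],
                  );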

lib/Algorithm/ExpectationMaximization.pm



=head1 REQUIRED

This module requires the following three modules:

   Math::Random
   Graphics::GnuplotIF
   Math::GSL::Matrix

the first for generating the multivariate random numbers, the second for the
visualization of the clusters, and the last for access to the Perl wrappers for the
GNU Scientific Library.  The C<Matrix> module of this library is used for various
algebraic operations on the covariance matrices of the Gaussians.

=head1 EXPORT

None by design.


=head1 BUGS


