Algorithm-ExpectationMaximization

lib/Algorithm/ExpectationMaximization.pm view on Meta::CPAN

=pod

=head1 NAME

Algorithm::ExpectationMaximization -- A Perl module for clustering numerical
multi-dimensional data with the Expectation-Maximization algorithm.

=head1 SYNOPSIS

  use Algorithm::ExpectationMaximization;

  #  First name the data file:

  my $datafile = "mydatafile.csv";

  #  Next, set the mask to indicate which columns of the datafile to use for
  #  clustering and which column contains a symbolic ID for each data record. For
  #  example, if the symbolic name is in the first column, you want the second column
  #  to be ignored, and you want the next three columns to be used for 3D clustering:

  my $mask = "N0111";

  #  Now construct an instance of the clusterer.  The parameter `K' controls the
  #  number of clusters.  Here is an example call to the constructor for instance
  #  creation:

  my $clusterer = Algorithm::ExpectationMaximization->new(
                                      datafile            => $datafile,
                                      mask                => $mask,
                                      K                   => 3,
                                      max_em_iterations   => 300,
                                      seeding             => 'random',
                                      terminal_output     => 1,
                  );
 
  #  Note the choice for `seeding'. The choice `random' means that the clusterer will
  #  randomly select `K' data points to serve as initial cluster centers.  Other
  #  possible choices for the constructor parameter `seeding' are `kmeans' and
  #  `manual'.  With the `kmeans' option for `seeding', the output of a K-means
  #  clusterer is used for the cluster seeds and the initial cluster covariances.  If
  #  you use the `manual' option for seeding, you must also specify the data elements
  #  to use for seeding the clusters.

  #  Here is an example of a call to the constructor when we choose the `manual'
  #  option for seeding the clusters and for specifying the data elements for
  #  seeding.  The data elements are specified by their tag names.  In this case,
  #  these names are `a26', `b53', and `c49':

  my $clusterer = Algorithm::ExpectationMaximization->new(
                                      datafile            => $datafile,
                                      mask                => $mask,
                                      class_priors        => [0.6, 0.2, 0.2],
                                      K                   => 3,
                                      max_em_iterations   => 300,
                                      seeding             => 'manual',
                                      seed_tags           => ['a26', 'b53', 'c49'],
                                      terminal_output     => 1,
                                    );

  #  This example call to the constructor also illustrates how you can inject class
  #  priors into the clustering process. The class priors are the prior probabilities
  #  of the class distributions in your dataset.  As explained later, injecting class
  #  priors in the manner shown above makes statistical sense only for the case of
  #  manual seeding.  When you do inject class priors, the order in which the priors
  #  are expressed must correspond to the manually specified seeds for the clusters.

  #  After the invocation of the constructor, the following calls are mandatory
  #  for reasons that should be obvious from the names of the methods:

  $clusterer->read_data_from_file();
  srand(time);
  $clusterer->seed_the_clusters();
  $clusterer->EM();
  $clusterer->run_bayes_classifier();
  my $clusters = $clusterer->return_disjoint_clusters();

  #  where the call to `EM()' is the invocation of the expectation-maximization
  #  algorithm.  The call to `srand(time)' is to seed the pseudo random number
  #  generator afresh for each run of the cluster seeding procedure.  If you want to
  #  see repeatable results from one run to another of the algorithm with random
  #  seeding, you would obviously not invoke `srand(time)'.

  #  The call `run_bayes_classifier()' shown above carries out a disjoint clustering
  #  of all the data points using the naive Bayes' classifier. And the call
  #  `return_disjoint_clusters()' returns the clusters thus formed to you.  Once you
  #  have obtained access to the clusters in this manner, you can display them in
  #  your terminal window by

  foreach my $index (0 .. $#$clusters) {
      print "Cluster $index (Naive Bayes):   @{$clusters->[$index]}\n\n";
  }

  #  If you would like to also see the clusters purely on the basis of the posterior
  #  class probabilities exceeding a threshold, call

  my $theta1 = 0.2;
  my $posterior_prob_clusters =
           $clusterer->return_clusters_with_posterior_probs_above_threshold($theta1);

  #  where you can obviously set the threshold $theta1 to any value you wish.  Note
  #  that now you may end up with clusters that overlap.  You can display them in
  #  your terminal window in the same manner as shown above for the naive Bayes'
  #  clusters.
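
  #  As an illustration only (this is not the module's own code), the following
  #  sketch shows why thresholded clusters can overlap: a data element joins every
  #  cluster for which its posterior probability exceeds the threshold.  The helper
  #  `clusters_above_threshold' and the %posteriors hash are hypothetical.

  sub clusters_above_threshold {
      my ($posteriors, $K, $theta) = @_;
      my @overlapping;
      push @overlapping, [] for 1 .. $K;          # one (initially empty) list per cluster
      for my $tag (sort keys %$posteriors) {
          for my $k (0 .. $K - 1) {
              push @{ $overlapping[$k] }, $tag
                  if $posteriors->{$tag}[$k] > $theta;
          }
      }
      return \@overlapping;
  }

  my %posteriors  = ( a26 => [0.75, 0.25], b53 => [0.10, 0.90] );   # made-up values
  my $overlapping = clusters_above_threshold(\%posteriors, 2, 0.2);

  #  With a threshold of 0.2, `a26' lands in both clusters and `b53' only in the
  #  second one.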

  #  You can write the naive Bayes' clusters out to files, one cluster per file, by
  #  calling

  $clusterer->write_naive_bayes_clusters_to_files();  

  #  The clusters are placed in files with names like

         naive_bayes_cluster1.txt
         naive_bayes_cluster2.txt
         ...

  #  In the same manner, you can write out the posterior-probability-based, possibly
  #  overlapping clusters to files by calling:

  $clusterer->write_posterior_prob_clusters_above_threshold_to_files($theta1);

  #  where the threshold $theta1 sets the probability threshold for deciding which

information.  Which columns are actually used for
clustering is decided by the string value of the mask.  For
example, if we wanted to cluster on the basis of the entries
in just the 3rd, the 4th, and the 5th columns above, the
mask value would be C<N0111> where the character C<N>
indicates that the ID tag is in the first column, the
character C<0> that the second column is to be ignored, and
the C<1>'s that follow that the 3rd, the 4th, and the 5th
columns are to be used for clustering.
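
As a rough illustration of how such a mask can be
interpreted, the following standalone sketch applies a mask
string to one comma-separated record.  The helper
C<apply_mask()> and the sample record are hypothetical and
are not part of this module's API; the sketch also assumes a
plain comma-separated record with no quoted fields.

    use strict;
    use warnings;

    # Hypothetical helper, not part of Algorithm::ExpectationMaximization.
    # Splits one CSV record and returns (tag, [selected fields]) per the mask.
    sub apply_mask {
        my ($mask, $record) = @_;
        my @flags  = split //, $mask;
        my @fields = split /,/, $record;
        die "mask and record differ in length\n" unless @flags == @fields;
        my ($tag, @selected);
        for my $i (0 .. $#flags) {
            if    ($flags[$i] eq 'N') { $tag = $fields[$i] }
            elsif ($flags[$i] eq '1') { push @selected, $fields[$i] }
            # a '0' flag means the column is ignored
        }
        return ($tag, \@selected);
    }

    my ($tag, $coords) = apply_mask('N0111', 'a26,ignored,1.5,2.5,3.5');
    print "$tag => @$coords\n";    # prints "a26 => 1.5 2.5 3.5"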

If, instead of random seeding, you wish to use kmeans-based
seeding, just replace the option C<random> supplied
for C<seeding> by C<kmeans>.  You can also do manual seeding
by designating a specified set of data elements to serve as
cluster seeds.  The call to the constructor in this case
looks like

    my $clusterer = Algorithm::ExpectationMaximization->new(
                                datafile            => $datafile,
                                mask                => $mask,
                                K                   => 3,
                                max_em_iterations   => 300,
                                seeding             => 'manual',
                                seed_tags           => ['a26', 'b53', 'c49'],
                                terminal_output     => 1,
                    );

where the option C<seed_tags> is set to an anonymous array
of symbolic names associated with the data elements.

If you know the class priors, you can supply them through an
additional option to the constructor that looks like

    class_priors    => [0.6, 0.2, 0.2],

for the case of C<K> equal to 3.  B<In general, this would
be a useful thing to do only for the case of manual
seeding.> If you go for manual seeding, the order in which
the priors are expressed should correspond to the order of
the manually chosen tags supplied through the C<seed_tags>
option.
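
Since the priors are probabilities, one per cluster, a quick
sanity check before calling the constructor can catch a
mismatch early.  The sketch below is only an illustration;
the helper C<check_priors()> is hypothetical, and whether the
module performs a similar check internally is not stated
here.

    use strict;
    use warnings;
    use List::Util qw(sum);

    # Hypothetical helper: verifies there is one prior per seed tag, each
    # prior lies in [0, 1], and the priors sum to 1 within a small tolerance.
    sub check_priors {
        my ($priors, $seed_tags) = @_;
        return 0 unless @$priors && @$priors == @$seed_tags;
        return 0 if grep { $_ < 0 || $_ > 1 } @$priors;
        return abs(sum(@$priors) - 1) < 1e-9 ? 1 : 0;
    }

    my @seed_tags    = ('a26', 'b53', 'c49');
    my @class_priors = (0.6, 0.2, 0.2);    # same order as @seed_tags
    print check_priors(\@class_priors, \@seed_tags) ? "ok\n" : "mismatch\n";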

Note that the parameter C<terminal_output> is boolean; when
not supplied in the call to C<new()>, it defaults to 0.  When
set, it makes the module print useful information to the
terminal window in which you invoke the algorithm.

=item B<read_data_from_file():>

    $clusterer->read_data_from_file()

This is a required call after the constructor is invoked. As
you would expect, this call reads in the data for
clustering.

=item B<seed_the_clusters():>

    $clusterer->seed_the_clusters();

This is also a required call.  It processes the option you
supplied for C<seeding> in the constructor call to choose
the data elements for seeding the C<K> clusters.
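
For the C<random> option, the idea of choosing C<K> distinct
data elements can be sketched as follows.  This is not the
module's own code; the helper C<pick_random_seeds()> and the
list of tags are made up for illustration.  Calling
C<srand()> with a fixed value, rather than C<srand(time)>,
is what should make the selection repeatable from one run to
the next.

    use strict;
    use warnings;
    use List::Util qw(shuffle);

    # Hypothetical helper: returns K distinct tags chosen at random.
    sub pick_random_seeds {
        my ($K, @tags) = @_;
        die "need at least $K data elements\n" if @tags < $K;
        my @shuffled = shuffle(@tags);
        return @shuffled[0 .. $K - 1];
    }

    srand(42);    # fixed seed for repeatable choices; use srand(time) to vary
    my @seeds = pick_random_seeds(3, qw(a26 a27 b53 b54 c49 c50));
    print "Seed tags: @seeds\n";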

=item B<EM():>

    $clusterer->EM();

This is the workhorse of the module, as you would expect.
The means, the covariances, and the priors estimated by this
method are stored in instance variables that are subsequently
accessed by other methods for the purpose of displaying the
clusters, the probability distributions, etc.
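
To make concrete what the means, the covariances, and the
priors refer to, here is a deliberately small
one-dimensional sketch of the expectation-maximization
iteration for a Gaussian mixture.  It is an independent
illustration and not the module's implementation, which
works with multi-dimensional data and covariance matrices;
the function C<em_1d()>, its argument layout, and the sample
data are all assumptions.

    use strict;
    use warnings;

    # Gaussian density for scalar $x with mean $m and variance $v.
    sub gauss {
        my ($x, $m, $v) = @_;
        return exp(-($x - $m) ** 2 / (2 * $v))
             / sqrt(2 * 3.141592653589793 * $v);
    }

    # Illustrative 1-D EM (hypothetical helper).  $means, $vars, and $priors
    # are array refs holding the initial guesses; they are updated in place.
    sub em_1d {
        my ($data, $means, $vars, $priors, $iterations) = @_;
        my $K = @$means;
        for (1 .. $iterations) {
            # E-step: posterior probability of each cluster for each point
            my @post;
            for my $x (@$data) {
                my @w;
                for my $k (0 .. $K - 1) {
                    push @w, $priors->[$k] * gauss($x, $means->[$k], $vars->[$k]);
                }
                my $total = 0;
                $total += $_ for @w;
                push @post, [ map { $_ / $total } @w ];
            }
            # M-step: re-estimate the prior, mean, and variance of each cluster
            for my $k (0 .. $K - 1) {
                my ($nk, $sum, $sq) = (0, 0, 0);
                for my $i (0 .. $#$data) {
                    $nk  += $post[$i][$k];
                    $sum += $post[$i][$k] * $data->[$i];
                }
                my $mean = $sum / $nk;
                $sq += $post[$_][$k] * ($data->[$_] - $mean) ** 2 for 0 .. $#$data;
                $priors->[$k] = $nk / scalar(@$data);
                $means->[$k]  = $mean;
                $vars->[$k]   = $sq / $nk + 1e-6;   # small floor avoids zero variance
            }
        }
        return ($means, $vars, $priors);
    }

    my @data = (1.0, 1.2, 0.8, 1.1, 5.0, 5.2, 4.9, 5.1);
    my ($m, $v, $p) = em_1d(\@data, [0, 6], [1, 1], [0.5, 0.5], 50);
    printf "cluster %d: mean=%.2f var=%.3f prior=%.2f\n",
        $_, $m->[$_], $v->[$_], $p->[$_] for 0 .. 1;

With two well-separated groups as in this sample, the
estimated means should settle near the centers of the two
groups and the priors near their relative sizes.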

=item B<run_bayes_classifier():>

    $clusterer->run_bayes_classifier();

Using the posterior probability distributions estimated by
the C<EM()> method, this method partitions the data into
C<K> disjoint clusters with the naive Bayes' classifier.
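
Such a disjoint partitioning amounts to assigning each data
element to the single cluster for which its posterior
probability is largest.  The following sketch shows that
rule on made-up posterior values; the helper
C<assign_to_best_cluster()> and the C<%posteriors> hash are
hypothetical and do not reflect the module's internal data
structures.

    use strict;
    use warnings;

    # Hypothetical helper: %$posteriors maps a tag to an array ref of K
    # posterior probabilities; returns an array ref of K clusters of tags.
    sub assign_to_best_cluster {
        my ($posteriors, $K) = @_;
        my @clusters;
        push @clusters, [] for 1 .. $K;
        for my $tag (sort keys %$posteriors) {
            my $probs = $posteriors->{$tag};
            my $best  = 0;
            for my $k (1 .. $K - 1) {
                $best = $k if $probs->[$k] > $probs->[$best];
            }
            push @{ $clusters[$best] }, $tag;
        }
        return \@clusters;
    }

    my %posteriors = (
        a26 => [0.90, 0.07, 0.03],
        b53 => [0.10, 0.85, 0.05],
        c49 => [0.25, 0.30, 0.45],
    );
    my $clusters = assign_to_best_cluster(\%posteriors, 3);
    print "Cluster $_: @{ $clusters->[$_] }\n" for 0 .. $#$clusters;

Because every element goes to exactly one cluster, the
resulting clusters cannot overlap.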

=item B<return_disjoint_clusters():>

    my $clusters = $clusterer->return_disjoint_clusters();

This allows you to access the clusters obtained with the
application of the naive Bayes' classifier in your own
scripts.  If, say, you wanted to see the data records placed
in each cluster, you could subsequently invoke the following
loop in your own script:

    foreach my $index (0 .. $#$clusters) {
        print "Cluster $index (Naive Bayes):   @{$clusters->[$index]}\n\n";
    }

where C<$clusters> holds the array reference returned by the
call to C<return_disjoint_clusters()>.

=item B<write_naive_bayes_clusters_to_files():>

    $clusterer->write_naive_bayes_clusters_to_files();

This method writes the clusters obtained by applying the
naive Bayes' classifier to disk files, one cluster per
file.  What is written out to each file consists of the
symbolic names of the data records that belong to the
cluster corresponding to that file.  The clusters are placed
in files with names like

    naive_bayes_cluster1.txt
    naive_bayes_cluster2.txt
    ...

=item B<return_clusters_with_posterior_probs_above_threshold($theta1):>

    my $theta1 = 0.2;
    my $posterior_prob_clusters =
       $clusterer->return_clusters_with_posterior_probs_above_threshold($theta1);

This method returns a reference to an array of C<K>
