Algorithm-KMeans

lib/Algorithm/KMeans.pm  view on Meta::CPAN


  # You must first set the mask for cluster visualization. This mask tells the module
  # which 2D or 3D subspace of the original data space you wish to visualize the
  # clusters in:

  my $visualization_mask = "111";
  $clusterer->visualize_clusters($visualization_mask);


  # SYNTHETIC DATA GENERATION:

  # The module provides a class method for generating multivariate data for
  # experimenting with clustering.  The data generation is controlled by the
  # contents of the parameter file that is supplied as an argument to the data
  # generator method.  The mean and covariance matrix entries in the parameter file
  # must follow the syntax shown in the param.txt file in the examples directory.
  # It is best to edit that file as needed:

  my $parameter_file = "param.txt";
  my $out_datafile = "mydatafile.dat";
  my $N = 60;    # number of points per cluster; 60 is only an example value
  Algorithm::KMeans->cluster_data_generator(
                          input_parameter_file => $parameter_file,
                          output_datafile => $out_datafile,
                          number_data_points_per_cluster => $N );

=head1 CHANGES

Version 2.05 removes the restriction on the version of Perl that is required.  This
is based on Srezic's recommendation.  He had no problem building and testing the
previous version with Perl 5.8.9.  Version 2.05 also includes a small augmentation of
the code in the method C<read_data_from_file_csv()> for guarding against user errors
in the specification of the mask that tells the module which columns of the data file
are to be used for clustering.

Version 2.04 allows you to use CSV data files for clustering.

Version 2.03 incorporates minor code cleanup.  The main implementation of the module
remains unchanged.

Version 2.02 downshifts the version of Perl that is required for this module.  The
module should work with versions 5.10 and higher of Perl.  The implementation code
for the module remains unchanged.

Version 2.01 removes many errors in the documentation. The changes made to the module
in Version 2.0 were not reflected properly in the documentation page for that
version.  The implementation code remains unchanged.

Version 2.0 includes significant additional functionality: (1) You now have the
option to cluster using the Mahalanobis distance metric (the default is the Euclidean
metric); and (2) With the two C<which_cluster> methods that have been added to the
module, you can now determine the best cluster for a new data sample after you have
created the clusters with the previously available data.  Finding the best cluster
for a new data sample can be done using either the Euclidean metric or the
Mahalanobis metric.

Version 1.40 includes a C<smart> option for seeding the clusters.  This option,
supplied through the constructor parameter C<cluster_seeding>, means that the
clusterer will (1) Subject the data to principal components analysis in order to
determine the maximum variance direction; (2) Project the data onto this direction;
(3) Find peaks in a smoothed histogram of the projected points; and (4) Use the
locations of the highest peaks as initial guesses for the cluster centers.  If you
don't want to use this option, set C<cluster_seeding> to C<random>. That should work
as in the previous version of the module.
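
To give a rough feel for the four steps of the C<smart> option, here is a minimal
sketch for 2D data.  It is not the module's implementation: the data, the bin count,
the three-bin smoothing window, and all variable names are assumptions chosen for
illustration, and the closed-form principal direction used here is valid only in two
dimensions.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Illustrative 2D data (made up): two groups spread along the x-axis.
my @data = ( [1.0, 0.1], [1.2, -0.1], [0.8, 0.0], [1.1, 0.2],
             [9.0, 0.1], [9.2, -0.2], [8.8, 0.0], [9.1, 0.1] );

# (1) Maximum-variance direction of the 2x2 covariance matrix (closed form in 2D).
my $n      = scalar @data;
my $mean_x = sum( map { $_->[0] } @data ) / $n;
my $mean_y = sum( map { $_->[1] } @data ) / $n;
my ($cxx, $cyy, $cxy) = (0, 0, 0);
for my $p (@data) {
    my ($dx, $dy) = ($p->[0] - $mean_x, $p->[1] - $mean_y);
    $cxx += $dx * $dx / $n;
    $cyy += $dy * $dy / $n;
    $cxy += $dx * $dy / $n;
}
my $theta = 0.5 * atan2(2 * $cxy, $cxx - $cyy);
my @dir   = ( cos($theta), sin($theta) );

# (2) Project the mean-centered data onto that direction.
my @proj = map { ($_->[0] - $mean_x) * $dir[0] + ($_->[1] - $mean_y) * $dir[1] } @data;

# (3) Histogram of the projections, smoothed with a 3-bin moving average.
my $bins = 9;
my ($lo, $hi) = (sort { $a <=> $b } @proj)[0, -1];
my $width = ($hi - $lo) / $bins;
my @hist  = (0) x $bins;
for my $v (@proj) {
    my $idx = int( ($v - $lo) / $width );
    $idx = $bins - 1 if $idx >= $bins;      # the largest value goes in the last bin
    $hist[$idx]++;
}
my @smooth = map {
    my $i = $_;
    my @w = grep { $_ >= 0 && $_ < $bins } ($i - 1, $i, $i + 1);
    sum( map { $hist[$_] } @w ) / scalar(@w);
} 0 .. $bins - 1;

# (4) Local maxima of the smoothed histogram, highest first; keep at most K of them.
my $K = 2;
my @peaks = grep {
    my $i = $_;
    $smooth[$i] > 0
        && ($i == 0         || $smooth[$i] >= $smooth[$i - 1])
        && ($i == $bins - 1 || $smooth[$i] >= $smooth[$i + 1]);
} 0 .. $bins - 1;
@peaks = sort { $smooth[$b] <=> $smooth[$a] || $a <=> $b } @peaks;
splice @peaks, $K if @peaks > $K;

# Map each peak (a bin center on the projection axis) back into the data space.
my @seeds = map {
    my $t = $lo + ($_ + 0.5) * $width;
    [ $mean_x + $t * $dir[0], $mean_y + $t * $dir[1] ];
} @peaks;
printf "seed: (%.2f, %.2f)\n", @$_ for @seeds;
```

With this toy data the two seeds land near the two groups.  A real implementation
would also have to handle plateaus in the histogram and data in more than two
dimensions.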

Version 1.30 includes a bug fix for the case when the datafile contains empty lines,
that is, lines with no data records.  Another bug fix in Version 1.30 deals with the
case when you want the module to figure out how many clusters to form (this is the
C<K=0> option in the constructor call) and the number of data records is close to the
minimum.

Version 1.21 includes fixes to handle the possibility that, when clustering the data
for a fixed number of clusters, a cluster may become empty during iterative
calculation of cluster assignments of the data elements and the updating of the
cluster centers.  The code changes are in the C<assign_data_to_clusters()> and
C<update_cluster_centers()> subroutines.

Version 1.20 includes an option to normalize the data with respect to its variability
along the different coordinates before clustering is carried out.  

Version 1.1.1 allows for range limiting the values of C<K> to search through.  C<K>
stands for the number of clusters to form.  This version also declares the module
dependencies in the C<Makefile.PL> file.

Version 1.1 is an object-oriented version of the implementation presented in
version 1.0.  The current version should lend itself more easily to code extension.
You could, for example, create your own class by subclassing from the class presented
here and, in your subclass, use your own criteria for the similarity distance between
the data points and for the QoC (Quality of Clustering) metric, and possibly a
different rule for stopping the iterations.  Version 1.1 also allows you to directly
access the clusters formed and the cluster centers in your calling script.


=head1 SPECIAL USAGE NOTE

If your own scripts directly accessed the clusters produced by the older versions of
this module, you will need to change your code if you wish to use Version 2.0 or
higher.  Instead of returning arrays of clusters and cluster centers,
Versions 2.0 and higher return hashes. This change was made necessary by the logic
required for implementing the two new C<which_cluster> methods that were introduced
in Version 2.0.  These methods return the best cluster for a new data sample from the
clusters you created using the existing data.  One of the C<which_cluster> methods is
based on the Euclidean metric for finding the cluster that is closest to the new data
sample, and the other on the Mahalanobis metric.  Another point of incompatibility
with the previous versions is that you must now explicitly set the C<cluster_seeding>
parameter in the call to the constructor to either C<random> or C<smart>.  Starting
with Version 2.0, this parameter has no default value.
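
To make the change concrete, the sketch below walks over cluster and cluster-center
hashes of the kind described above and then picks the closest center for a new sample
using the Euclidean metric.  The hash layout (cluster names such as C<cluster0> as
keys, array references as values), the data values, and the helper
C<nearest_cluster> are assumptions made for illustration; consult the method
documentation for the exact structures the module returns.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical stand-ins for the two hash references produced by a clustering run.
my $clusters_hash = {
    cluster0 => [ 'a1', 'a2', 'a3' ],
    cluster1 => [ 'b1', 'b2' ],
};
my $cluster_centers_hash = {
    cluster0 => [ 1.0, 1.0, 1.0 ],
    cluster1 => [ 5.0, 5.0, 5.0 ],
};

# Code that used to loop over an array of clusters now loops over hash keys.
for my $name (sort keys %$clusters_hash) {
    my $center = join ', ', @{ $cluster_centers_hash->{$name} };
    print "$name (center: $center): @{ $clusters_hash->{$name} }\n";
}

# Return the name of the cluster whose center is closest (Euclidean) to $sample.
sub nearest_cluster {
    my ($centers, $sample) = @_;
    my ($best, $best_d);
    for my $name (sort keys %$centers) {
        my $c = $centers->{$name};
        my $d = sqrt( sum( map { ($sample->[$_] - $c->[$_]) ** 2 } 0 .. $#$sample ) );
        ($best, $best_d) = ($name, $d) if !defined $best_d || $d < $best_d;
    }
    return $best;
}

my $which = nearest_cluster($cluster_centers_hash, [ 4.2, 4.9, 5.3 ]);
print "new sample belongs to $which\n";
```

A Mahalanobis version would additionally need the covariance matrix of each cluster,
which this sketch leaves out.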


=head1 DESCRIPTION

Clustering with K-Means takes place iteratively and involves two steps: (1)
assignment of data samples to clusters on the basis of how far the data samples are
from the cluster centers; and (2) recalculation of the cluster centers (and of the
cluster covariances if you are using the Mahalanobis distance metric for clustering).
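
The two steps can be sketched in plain Perl as follows.  This is a minimal
illustration that uses the Euclidean metric and hand-picked initial centers; the
subroutine names (C<assign_to_centers>, C<recompute_centers>, C<simple_kmeans>) and
the toy data are hypothetical and do not reflect the module's internal code.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Squared Euclidean distance between two points given as array refs.
sub sq_dist {
    my ($p, $q) = @_;
    return sum( map { ($p->[$_] - $q->[$_]) ** 2 } 0 .. $#$p );
}

# Step 1: assign every sample to its nearest center.
# Returns an array ref holding one cluster index per sample.
sub assign_to_centers {
    my ($data, $centers) = @_;
    my @labels;
    for my $point (@$data) {
        my ($best, $best_d);
        for my $k (0 .. $#$centers) {
            my $d = sq_dist($point, $centers->[$k]);
            ($best, $best_d) = ($k, $d) if !defined $best_d || $d < $best_d;
        }
        push @labels, $best;
    }
    return \@labels;
}

# Step 2: recompute each center as the mean of its assigned samples.
# A center with no samples is left where it was (one simple policy among several).
sub recompute_centers {
    my ($data, $labels, $centers) = @_;
    my @new = map { [@$_] } @$centers;
    for my $k (0 .. $#$centers) {
        my @members = map { $data->[$_] } grep { $labels->[$_] == $k } 0 .. $#$data;
        next unless @members;
        my $dim = scalar @{ $members[0] };
        $new[$k] = [ map { my $j = $_; sum( map { $_->[$j] } @members ) / scalar(@members) }
                     0 .. $dim - 1 ];
    }
    return \@new;
}

# Alternate the two steps until the assignments stop changing.
sub simple_kmeans {
    my ($data, $centers, $max_iter) = @_;
    my $labels = [];
    for my $iter (1 .. $max_iter) {
        my $new_labels = assign_to_centers($data, $centers);
        last if @$labels && "@$new_labels" eq "@$labels";
        $labels  = $new_labels;
        $centers = recompute_centers($data, $labels, $centers);
    }
    return ($labels, $centers);
}

my @data = ( [1, 1], [1, 2], [2, 1], [8, 8], [9, 8], [8, 9] );
my ($labels, $centers) = simple_kmeans(\@data, [ [0, 0], [10, 10] ], 20);
print "labels: @$labels\n";
printf "center %d: (%.3f, %.3f)\n", $_, @{ $centers->[$_] } for 0 .. $#$centers;
```

Clustering with the Mahalanobis metric would also re-estimate a covariance matrix per
cluster in step 2 and use it in the distance computation of step 1.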

Before the two-step iteration can begin, the cluster centers must be initialized.
How this initialization is carried out is important.  The module
gives you two very different ways for carrying out this initialization.  One option,
called the C<smart> option, consists of subjecting the data to principal components
analysis to discover the direction of maximum variance in the data space.  The data
points are then projected on to this direction and a histogram constructed from the


=head1 BUGS

Please notify the author if you encounter any bugs.  When sending email, please place
the string 'KMeans' in the subject line.

=head1 INSTALLATION

Download the archive from CPAN into any directory of your choice.  Unpack the archive
with a command that on a Linux machine would look like:

    tar zxvf Algorithm-KMeans-2.05.tar.gz

This will create an installation directory for you whose name will be
C<Algorithm-KMeans-2.05>.  Enter this directory and execute the following commands
for a standard install of the module if you have root privileges:

    perl Makefile.PL
    make
    make test
    sudo make install

If you do not have root privileges, you can carry out a non-standard install of the
module in any directory of your choice with:

    perl Makefile.PL prefix=/some/other/directory/
    make
    make test
    make install

With a non-standard install, you may also have to set your PERL5LIB environment
variable so that Perl can find this module and the other modules it requires.  How
you do that depends on the platform you are working on.  To install this module on a
Linux machine on which I use tcsh for the shell, I set the PERL5LIB environment
variable by

    setenv PERL5LIB /some/other/directory/lib64/perl5/:/some/other/directory/share/perl5/

If I used bash, I'd need to declare:

    export PERL5LIB=/some/other/directory/lib64/perl5/:/some/other/directory/share/perl5/
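
As an alternative to setting PERL5LIB in the shell, a script can add the same
directories to its own module search path with the core C<lib> pragma.  The sketch
below simply reuses the placeholder directories shown above; substitute the
directories of your own install.

```perl
use strict;
use warnings;

# Prepend the non-standard install locations to @INC for this script only.
use lib '/some/other/directory/lib64/perl5/',
        '/some/other/directory/share/perl5/';

# Show where Perl will now look for modules.
print "$_\n" for @INC;
```

This affects only the script that contains the C<use lib> line, whereas PERL5LIB
applies to every Perl program started from that shell.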


=head1 THANKS

I thank Slaven for pointing out that I needed to downshift the required version of Perl
for this module.  Fortunately, I had access to an old machine still running Perl
5.10.1.  Version 2.02 was based on my testing the module on that machine.

I added two C<which_cluster> methods in Version 2.0 as a result of an email from
Jerome White who expressed a need for such methods in order to determine the best
cluster for a new data record after you have successfully clustered your existing
data.  Thanks Jerome for your feedback!

It was an email from Nadeem Bulsara that prompted me to create Version 1.40 of this
module.  Working with Version 1.30, Nadeem noticed that occasionally the module would
produce variable clustering results on the same dataset.  I believe that this
variability was caused (at least partly) by the purely random mode that was used in
Version 1.30 for the seeding of the cluster centers.  Version 1.40 now includes a
C<smart> mode. With the new mode the clusterer uses a PCA (Principal Components
Analysis) of the data to make good guesses for the cluster centers.  However,
depending on how the data is jumbled up, it is possible that the new mode will not
produce uniformly good results in all cases.  So you can still use the old mode by
setting C<cluster_seeding> to C<random> in the constructor.  Thanks Nadeem for your
feedback!

Version 1.30 resulted from Martin Kalin reporting problems with a very small data
set. Thanks Martin!

Version 1.21 came about in response to the problems encountered by Luis Fernando
D'Haro with version 1.20.  Although the module would yield the clusters for some of
its runs, more often than not it would abort with an "empty cluster" message for his
data.  Luis Fernando also suggested other improvements (such as clustering directly
from the contents of a hash) that I intend to make in future versions of this module.
Thanks Luis Fernando!

Chad Aeschliman was kind enough to test out the interface of this module and to give
suggestions for its improvement.  His key slogan: "If you cannot figure out how to
use a module in under 10 minutes, it's not going to be used."  That should explain
the longish Synopsis included here.

=head1 AUTHOR

Avinash Kak, kak@purdue.edu

If you send email, please place the string "KMeans" in your subject line to get past
my spam filter.

=head1 COPYRIGHT

This library is free software; you can redistribute it and/or modify it under the
same terms as Perl itself.

 Copyright 2014 Avinash Kak

=cut
