Algorithm-KMeans
lib/Algorithm/KMeans.pm
cleanup_directory.pl
The examples directory also includes a parameter file, C<param.txt>, for generating
synthetic data for clustering. Just edit this file if you would like to generate
your own multivariate data for clustering. The parameter file is for the 3D case,
but you can generate data with any dimensionality through appropriate entries in the
parameter file.
=head1 EXPORT
None by design.
=head1 CAVEATS
K-Means based clustering usually does not work well when the clusters are strongly
overlapping and when the variability along the different dimensions differs from
cluster to cluster. The module does give you the ability to
normalize the variability in your data with the constructor option
C<do_variance_normalization>. However, as described elsewhere, this may actually
reduce the performance of the clusterer if the data variability along a direction is
more a result of the separation between the means than because of intra-cluster
variability. For better clustering with difficult-to-cluster data, you could try
using the author's C<Algorithm::ExpectationMaximization> module.
=head1 BUGS
Please notify the author if you encounter any bugs. When sending email, please place
the string 'KMeans' in the subject line.
=head1 INSTALLATION
Download the archive from CPAN into any directory of your choice. Unpack the archive
with a command that on a Linux machine would look like:
    tar zxvf Algorithm-KMeans-2.05.tar.gz
This will create an installation directory for you whose name will be
C<Algorithm-KMeans-2.05>. Enter this directory and execute the following commands
for a standard install of the module if you have root privileges:
    perl Makefile.PL
    make
    make test
    sudo make install
If you do not have root privileges, you can carry out a non-standard install of the
module in any directory of your choice by:
    perl Makefile.PL prefix=/some/other/directory/
    make
    make test
    make install
With a non-standard install, you may also have to set your PERL5LIB environment
variable so that this module can find the other modules it requires. How you do that
depends on the platform you are working on. To install this module on
a Linux machine on which I use tcsh for the shell, I set the PERL5LIB environment
variable by
    setenv PERL5LIB /some/other/directory/lib64/perl5/:/some/other/directory/share/perl5/
If I used bash, I'd need to declare:
    export PERL5LIB=/some/other/directory/lib64/perl5/:/some/other/directory/share/perl5/
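If you would rather not overwrite an existing PERL5LIB, the same setting can be sketched as a small bash helper. This is only an illustration and not part of the module: the function name C<prepend_perl5lib> is hypothetical, and the C<lib64/perl5> and C<share/perl5> subdirectories are assumed from the example above, so your platform may use different ones.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with the module): prepend the library
# directories of a non-standard install prefix to PERL5LIB.
# Assumption: the prefix uses the lib64/perl5 and share/perl5 layout shown
# in the text; adjust the list for your platform.
prepend_perl5lib() {
    local prefix="${1%/}"            # drop one trailing slash, if any
    local sub dir
    # Prepending share/perl5 first leaves lib64/perl5 at the front.
    for sub in share/perl5 lib64/perl5; do
        dir="$prefix/$sub"
        [ -d "$dir" ] || continue    # skip directories that do not exist
        case ":${PERL5LIB:-}:" in
            *":$dir:"*) ;;           # already listed, leave it alone
            *) PERL5LIB="$dir${PERL5LIB:+:$PERL5LIB}" ;;
        esac
    done
    export PERL5LIB
}

# Example call with the placeholder prefix used in this section.
prepend_perl5lib /some/other/directory/
```

Because the helper keeps whatever PERL5LIB already contains and skips missing or duplicate directories, it is safe to call more than once from a shell startup file. Afterwards, C<perl -MAlgorithm::KMeans -e 1> should exit silently if Perl can find the module.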
=head1 THANKS
I thank Slaven for pointing out that I needed to downshift the required version of Perl
for this module. Fortunately, I had access to an old machine still running Perl
5.10.1. The current version, 2.02, is based on my testing the module on that machine.
I added two C<which_cluster> methods in Version 2.0 as a result of an email from
Jerome White who expressed a need for such methods in order to determine the best
cluster for a new data record after you have successfully clustered your existing
data. Thanks Jerome for your feedback!
It was an email from Nadeem Bulsara that prompted me to create Version 1.40 of this
module. Working with Version 1.30, Nadeem noticed that occasionally the module would
produce variable clustering results on the same dataset. I believe that this
variability was caused (at least partly) by the purely random mode that was used in
Version 1.30 for the seeding of the cluster centers. Version 1.40 now includes a
C<smart> mode. With the new mode the clusterer uses a PCA (Principal Components
Analysis) of the data to make good guesses for the cluster centers. However,
depending on how the data is jumbled up, it is possible that the new mode will not
produce uniformly good results in all cases. So you can still use the old mode by
setting C<cluster_seeding> to C<random> in the constructor. Thanks Nadeem for your
feedback!
Version 1.30 resulted from Martin Kalin reporting problems with a very small data
set. Thanks Martin!
Version 1.21 came about in response to the problems encountered by Luis Fernando
D'Haro with version 1.20. Although the module would yield the clusters for some of
its runs, more often than not the module would abort with an "empty cluster"
message for his data. Luis Fernando has also suggested other improvements (such as
clustering directly from the contents of a hash) that I intend to make in future
versions of this module. Thanks Luis Fernando.
Chad Aeschliman was kind enough to test out the interface of this module and to give
suggestions for its improvement. His key slogan: "If you cannot figure out how to
use a module in under 10 minutes, it's not going to be used." That should explain
the longish Synopsis included here.
=head1 AUTHOR
Avinash Kak, kak@purdue.edu
If you send email, please place the string "KMeans" in your subject line to get past
my spam filter.
=head1 COPYRIGHT
This library is free software; you can redistribute it and/or modify it under the
same terms as Perl itself.
Copyright 2014 Avinash Kak
=cut