Algorithm-KMeans
lib/Algorithm/KMeans.pm view on Meta::CPAN
module, you can now determine the best cluster for a new data sample after you have
created the clusters with the previously available data. Finding the best cluster
for a new data sample can be done using either the Euclidean metric or the
Mahalanobis metric.
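To illustrate how the two metrics can disagree about the best cluster for a new sample, here is a minimal sketch in Python. It is not the module's code or API: the function names, the two toy clusters, and the restriction to 2-D data (so the covariance inverse can be written out by hand) are all assumptions made for this example.

```python
import math

def mean(points):
    # Component-wise mean of a list of equal-length points.
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

def covariance_2d(points, mu):
    # Sample covariance of 2-D points, returned as (sxx, sxy, syy).
    n = len(points)
    sxx = sum((p[0] - mu[0]) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - mu[1]) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mu[0]) * (p[1] - mu[1]) for p in points) / (n - 1)
    return sxx, sxy, syy

def euclidean(x, mu):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, mu)))

def mahalanobis_2d(x, mu, cov):
    # sqrt(d^T * inverse(cov) * d), with the 2x2 inverse written out.
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    dx, dy = x[0] - mu[0], x[1] - mu[1]
    return math.sqrt((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)

def best_cluster(x, clusters, metric="euclidean"):
    # Return the name of the cluster whose center is closest to x.
    best_name, best_dist = None, float("inf")
    for name, pts in clusters.items():
        mu = mean(pts)
        if metric == "euclidean":
            dist = euclidean(x, mu)
        else:
            dist = mahalanobis_2d(x, mu, covariance_2d(pts, mu))
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name

# Hypothetical clusters: A is wide and flat, B is small and round.
clusters = {
    "A": [(-8, 0), (-4, 1), (0, 0), (4, -1), (8, 0)],
    "B": [(9, 3), (10, 4), (11, 3), (10, 2), (10, 3)],
}
new_sample = (7, 1)
print(best_cluster(new_sample, clusters, "euclidean"))
print(best_cluster(new_sample, clusters, "mahalanobis"))
```

In this toy data the new sample is nearer to the center of B in plain Euclidean terms, but it lies well within the spread of the elongated cluster A, so the Mahalanobis metric, which scales distance by each cluster's covariance, assigns it to A.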
Version 1.40 includes a C<smart> option for seeding the clusters. This option,
supplied through the constructor parameter C<cluster_seeding>, means that the
clusterer will (1) subject the data to principal components analysis in order to
determine the maximum variance direction; (2) project the data onto this direction;
(3) find peaks in a smoothed histogram of the projected points; and (4) use the
locations of the highest peaks as initial guesses for the cluster centers. If you
don't want to use this option, set C<cluster_seeding> to C<random>. Seeding should
then work as it did in the previous version of the module.
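The four seeding steps can be sketched roughly as follows. This is an illustrative Python sketch, not the module's Perl implementation: the function names, the use of power iteration to find the maximum variance direction, the bin count, the moving-average smoothing, and the way a peak location is mapped back into data space are all assumptions made for this example.

```python
import math

def principal_direction(points, iterations=200):
    # Step 1: mean and unit vector of maximum variance, found by power
    # iteration on the sample covariance matrix (an assumed technique).
    n, dim = len(points), len(points[0])
    mu = [sum(p[i] for p in points) / n for i in range(dim)]
    centered = [[p[i] - mu[i] for i in range(dim)] for p in points]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(dim)]
           for i in range(dim)]
    v = [1.0] * dim
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return mu, v

def smoothed_histogram(values, bins=20, window=3):
    # Step 3a: histogram of the projections, smoothed by a moving average
    # (windows are truncated at the two ends of the histogram).
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    for t in values:
        counts[min(int((t - lo) / width), bins - 1)] += 1
    half = window // 2
    smooth = []
    for k in range(bins):
        nbrs = counts[max(0, k - half):k + half + 1]
        smooth.append(sum(nbrs) / len(nbrs))
    bin_centers = [lo + (k + 0.5) * width for k in range(bins)]
    return bin_centers, smooth

def seed_centers(points, K, bins=20):
    mu, v = principal_direction(points)
    dim = len(v)
    # Step 2: project each centered point onto the principal direction.
    proj = [sum((p[i] - mu[i]) * v[i] for i in range(dim)) for p in points]
    bin_centers, smooth = smoothed_histogram(proj, bins)
    # Step 3b: local maxima of the smoothed histogram.
    peaks = [k for k in range(bins)
             if smooth[k] > 0
             and (k == 0 or smooth[k] >= smooth[k - 1])
             and (k == bins - 1 or smooth[k] > smooth[k + 1])]
    # Step 4: the K highest peaks, mapped back onto the principal axis.
    peaks.sort(key=lambda k: smooth[k], reverse=True)
    return [[mu[i] + bin_centers[k] * v[i] for i in range(dim)]
            for k in peaks[:K]]

# Hypothetical data: two small blobs centered near (0, 0) and (10, 0).
offsets = [(x, y) for x in (-1, 0, 0, 1) for y in (-0.5, 0.5)]
points = offsets + [(x + 10, y) for x, y in offsets]
seeds = sorted(seed_centers(points, K=2))
print(seeds)
```

For the toy data above, the two seeds land close to the two blobs, which is the point of the approach: a deterministic starting guess derived from the data rather than a random draw. The sketch does not guard against degenerate inputs such as identical points, and, as the text notes for the module itself, peaks along a single projection direction will not separate every dataset well.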
Version 1.30 includes a bug fix for the case when the datafile contains empty lines,
that is, lines with no data records. Another bug fix in Version 1.30 deals with the
case when you want the module to figure out how many clusters to form (this is the
C<K=0> option in the constructor call) and the number of data records is close to the
minimum.
Version 1.21 includes fixes to handle the possibility that, when clustering the data
lib/Algorithm/KMeans.pm view on Meta::CPAN
Jerome White who expressed a need for such methods in order to determine the best
cluster for a new data record after you have successfully clustered your existing
data. Thanks Jerome for your feedback!
It was an email from Nadeem Bulsara that prompted me to create Version 1.40 of this
module. Working with Version 1.30, Nadeem noticed that occasionally the module would
produce variable clustering results on the same dataset. I believe that this
variability was caused (at least partly) by the purely random mode that was used in
Version 1.30 for the seeding of the cluster centers. Version 1.40 now includes a
C<smart> mode. With the new mode the clusterer uses a PCA (Principal Components
Analysis) of the data to make good guesses for the cluster centers. However,
depending on how the data is jumbled up, it is possible that the new mode will not
produce uniformly good results in all cases. So you can still use the old mode by
setting C<cluster_seeding> to C<random> in the constructor. Thanks Nadeem for your
feedback!
Version 1.30 resulted from Martin Kalin reporting problems with a very small data
set. Thanks Martin!
Version 1.21 came about in response to the problems encountered by Luis Fernando
D'Haro with version 1.20. Although the module would yield the clusters for some of