Algorithm-KMeans

 view release on metacpan or  search on metacpan

lib/Algorithm/KMeans.pm  view on Meta::CPAN

the data points and for the QoC (Quality of Clustering) metric, and, possibly a
different rule to stop the iterations.  Version 1.1 also allows you to directly
access the clusters formed and the cluster centers in your calling script.


=head1 SPECIAL USAGE NOTE

If you were directly accessing in your own scripts the clusters produced by the older
versions of this module, you'd need to make changes to your code if you wish to use
Version 2.0 or higher.  Instead of returning arrays of clusters and cluster centers,
Versions 2.0 and higher return hashes. This change was made necessary by the logic
required for implementing the two new C<which_cluster> methods that were introduced
in Version 2.0.  These methods return the best cluster for a new data sample from the
clusters you created using the existing data.  One of the C<which_cluster> methods is
based on the Euclidean metric for finding the cluster that is closest to the new data
sample, and the other on the Mahalanobis metric.  Another point of incompatibility
with the previous versions is that you must now explicitly set the C<cluster_seeding>
parameter in the call to the constructor to either C<random> or C<smart>.  This
parameter does not have a default associated with it starting with Version 2.0.


=head1 DESCRIPTION

Clustering with K-Means takes place iteratively and involves two steps: 1) assignment
of data samples to clusters on the basis of how far the data samples are from the
cluster centers; and 2) Recalculation of the cluster centers (and cluster covariances
if you are using the Mahalanobis distance metric for clustering).

Obviously, before the two-step approach can proceed, we need to initialize the the
cluster centers.  How this initialization is carried out is important.  The module
gives you two very different ways for carrying out this initialization.  One option,
called the C<smart> option, consists of subjecting the data to principal components
analysis to discover the direction of maximum variance in the data space.  The data
points are then projected on to this direction and a histogram constructed from the
projections.  Centers of the smoothed histogram are used to seed the clustering
operation.  The other option is to choose the cluster centers purely randomly.  You
get the first option if you set C<cluster_seeding> to C<smart> in the constructor,
and you get the second option if you set it to C<random>.

How to specify the number of clusters, C<K>, is one of the most vexing issues in any
approach to clustering.  In some case, we can set C<K> on the basis of prior
knowledge.  But, more often than not, no such prior knowledge is available.  When the
programmer does not explicitly specify a value for C<K>, the approach taken in the
current implementation is to try all possible values between 2 and some largest
possible value that makes statistical sense.  We then choose that value for C<K>
which yields the best value for the QoC (Quality of Clustering) metric.  It is
generally believed that the largest value for C<K> should not exceed C<sqrt(N/2)>
where C<N> is the number of data samples to be clustered.

What to use for the QoC metric is obviously a critical issue unto itself.  In the
current implementation, the value of QoC is the ratio of the average radius of the
clusters and the average distance between the cluster centers.

Every iterative algorithm requires a stopping criterion.  The criterion implemented
here is that we stop iterations when there is no re-assignment of the data points
during the assignment step.

Ordinarily, the output produced by a K-Means clusterer will correspond to a local
minimum for the QoC values, as opposed to a global minimum.  The current
implementation protects against that when the module constructor is called with the
C<random> option for C<cluster_seeding> by trying different randomly selected initial
cluster centers and then selecting the one that gives the best overall QoC value.

A K-Means clusterer will generally produce good results if the overlap between the
clusters is minimal and if each cluster exhibits variability that is uniform in all
directions.  When the data variability is different along the different directions in
the data space, the results you obtain with a K-Means clusterer may be improved by
first normalizing the data appropriately, as can be done in this module when you set
the C<do_variance_normalization> option in the constructor.  However, as pointed out
elsewhere in this documentation, such normalization may actually decrease the
performance of the clusterer if the overall data variability along any dimension is
more a result of separation between the means than a consequence of intra-cluster
variability.


=head1 METHODS

The module provides the following methods for clustering, for cluster visualization,
for data visualization, for the generation of data for testing a clustering
algorithm, and for determining the cluster membership of a new data sample:

=over 4

=item B<new():>

    my $clusterer = Algorithm::KMeans->new(datafile        => $datafile,
                                           mask            => $mask,
                                           K               => $K,
                                           cluster_seeding => 'random',     # also try 'smart'
                                           use_mahalanobis_metric => 1,     # also try '0'
                                           terminal_output => 1,     
                                           write_clusters_to_files => 1,
                                          );

A call to C<new()> constructs a new instance of the C<Algorithm::KMeans> class.  When
C<$K> is a non-zero positive integer, the module will construct exactly that many
clusters.  However, when C<$K> is 0, the module will find the best number of clusters
to partition the data into.  As explained in the Description, setting
C<cluster_seeding> to C<smart> causes PCA (principal components analysis) to be used
for discovering the best choices for the initial cluster centers.  If you want purely
random decisions to be made for the initial choices for the cluster centers, set
C<cluster_seeding> to C<random>.

The data file is expected to contain entries in the following format

   c20  0  10.7087017086940  9.63528386251712  10.9512155258108  ...
   c7   0  12.8025925026787  10.6126270065785  10.5228482095349  ...
   b9   0  7.60118206283120  5.05889245193079  5.82841781759102  ...
   ....
   ....

where the first column contains the symbolic ID tag for each data record and the rest
of the columns the numerical information.  As to which columns are actually used for
clustering is decided by the string value of the mask.  For example, if we wanted to
cluster on the basis of the entries in just the 3rd, the 4th, and the 5th columns
above, the mask value would be C<N0111> where the character C<N> indicates that the
ID tag is in the first column, the character C<0> that the second column is to be
ignored, and the C<1>'s that follow that the 3rd, the 4th, and the 5th columns are to
be used for clustering.

If you wish for the clusterer to search through a C<(Kmin,Kmax)> range of values for
C<K>, the constructor should be called in the following fashion:



( run in 2.424 seconds using v1.01-cache-2.11-cpan-13bb782fe5a )