Algorithm-KMeans
view release on metacpan or search on metacpan
lib/Algorithm/KMeans.pm view on Meta::CPAN
Every iterative algorithm requires a stopping criterion. The criterion implemented
here is that we stop iterations when there is no re-assignment of the data points
during the assignment step.
Ordinarily, the output produced by a K-Means clusterer will correspond to a local
minimum for the QoC values, as opposed to a global minimum. The current
implementation protects against that when the module constructor is called with the
C<random> option for C<cluster_seeding> by trying different randomly selected initial
cluster centers and then selecting the one that gives the best overall QoC value.
A K-Means clusterer will generally produce good results if the overlap between the
clusters is minimal and if each cluster exhibits variability that is uniform in all
directions. When the data variability is different along the different directions in
the data space, the results you obtain with a K-Means clusterer may be improved by
first normalizing the data appropriately, as can be done in this module when you set
the C<do_variance_normalization> option in the constructor. However, as pointed out
elsewhere in this documentation, such normalization may actually decrease the
performance of the clusterer if the overall data variability along any dimension is
more a result of separation between the means than a consequence of intra-cluster
variability.
=head1 METHODS
The module provides the following methods for clustering, for cluster visualization,
for data visualization, for the generation of data for testing a clustering
algorithm, and for determining the cluster membership of a new data sample:
=over 4
=item B<new():>
my $clusterer = Algorithm::KMeans->new(datafile => $datafile,
mask => $mask,
K => $K,
cluster_seeding => 'random', # also try 'smart'
use_mahalanobis_metric => 1, # also try '0'
terminal_output => 1,
write_clusters_to_files => 1,
);
A call to C<new()> constructs a new instance of the C<Algorithm::KMeans> class. When
C<$K> is a non-zero positive integer, the module will construct exactly that many
clusters. However, when C<$K> is 0, the module will find the best number of clusters
to partition the data into. As explained in the Description, setting
C<cluster_seeding> to C<smart> causes PCA (principal components analysis) to be used
for discovering the best choices for the initial cluster centers. If you want purely
random decisions to be made for the initial choices for the cluster centers, set
C<cluster_seeding> to C<random>.
The data file is expected to contain entries in the following format
c20 0 10.7087017086940 9.63528386251712 10.9512155258108 ...
c7 0 12.8025925026787 10.6126270065785 10.5228482095349 ...
b9 0 7.60118206283120 5.05889245193079 5.82841781759102 ...
....
....
where the first column contains the symbolic ID tag for each data record and the rest
of the columns the numerical information. As to which columns are actually used for
clustering is decided by the string value of the mask. For example, if we wanted to
cluster on the basis of the entries in just the 3rd, the 4th, and the 5th columns
above, the mask value would be C<N0111> where the character C<N> indicates that the
ID tag is in the first column, the character C<0> that the second column is to be
ignored, and the C<1>'s that follow that the 3rd, the 4th, and the 5th columns are to
be used for clustering.
If you wish for the clusterer to search through a C<(Kmin,Kmax)> range of values for
C<K>, the constructor should be called in the following fashion:
my $clusterer = Algorithm::KMeans->new(datafile => $datafile,
mask => $mask,
Kmin => 3,
Kmax => 10,
cluster_seeding => 'smart', # try 'random' also
terminal_output => 1,
);
where obviously you can choose any reasonable values for C<Kmin> and C<Kmax>. If you
choose a value for C<Kmax> that is statistically too large, the module will let you
know. Again, you may choose C<random> for C<cluster_seeding>, the default value being
C<smart>.
If you believe that the variability of the data is very different along the different
dimensions of the data space, you may get better clustering by first normalizing the
data coordinates by the standard-deviations along those directions. When you set the
constructor option C<do_variance_normalization> as shown below, the module uses the
overall data standard-deviation along a direction for the normalization in that
direction. (As mentioned elsewhere in the documentation, such a normalization could
backfire on you if the data variability along a dimension is more a result of the
separation between the means than a consequence of the intra-cluster variability.):
my $clusterer = Algorithm::KMeans->new( datafile => $datafile,
mask => "N111",
K => 2,
cluster_seeding => 'smart', # try 'random' also
terminal_output => 1,
do_variance_normalization => 1,
);
=back
=head2 Constructor Parameters
=over 8
=item C<datafile>:
This parameter names the data file that contains the multidimensional data records
you want the module to cluster.
=item C<mask>:
This parameter supplies the mask to be applied to the columns of your data file. See
the explanation in Synopsis for what this mask looks like.
=item C<K>:
This parameter supplies the number of clusters you are looking for. If you set this
option to 0, that means that you want the module to search for the best value for
C<K>. (Keep in mind the fact that searching for the best C<K> may take a long time
( run in 2.705 seconds using v1.01-cache-2.11-cpan-140bd7fdf52 )