Algorithm-KMeans
view release on metacpan or search on metacpan
lib/Algorithm/KMeans.pm view on Meta::CPAN
die "\n\nThe constructor options you have chosen do not work with the data. Try\n" .
"turning off the Mahalanobis option if you are using it.\n"
unless defined $clusters;
$self->{_clusters} = $clusters;
$self->{_cluster_centers} = $cluster_centers;
$self->{_QoC_values}->{"$K"} = $QoC;
if ($self->{_terminal_output}) {
print "\nDisplaying final clusters for best K (= $K) :\n";
display_clusters( $clusters );
$self->display_cluster_centers( $clusters );
print "\nQoC value (the smaller, the better): $QoC\n";
}
}
# For the smart try, we do initial cluster seeding by subjecting the data to
# principal components analysis in order to discover the direction of maximum
# variance in the data space. Subsequently, we try to find the K largest peaks along
# this direction. The coordinates of these peaks serve as the seeds for the K
# clusters.
sub cluster_for_fixed_K_single_smart_try {
my $self = shift;
my $K = shift;
my @all_data_ids = @{$self->{_data_id_tags}};
print "Clustering for K = $K\n" if $self->{_terminal_output};
my ($clusters, $cluster_centers);
eval {
($clusters, $cluster_centers) = $self->cluster_for_given_K($K);
};
if ($@) {
die "In cluster_for_fixed_K_single_smart_try: insufficient data for clustering " .
"with $self->{_K} clusters --- $@";
}
my $QoC = $self->cluster_quality( $clusters, $cluster_centers );
$self->{_clusters} = $clusters;
$self->{_cluster_centers} = $cluster_centers;
$self->{_QoC_values}->{"$K"} = $QoC;
if ($self->{_terminal_output}) {
print "\nDisplaying final clusters for best K (= $K) :\n";
display_clusters( $clusters );
$self->display_cluster_centers( $clusters );
print "\nQoC value (the smaller, the better): $QoC\n";
}
}
# The following subroutine is the top-level routine to call if you want the system to
# figure out on its own what value to use for K, the number of clusters. If you call
# the KMeans constructor with the K=0 option, the method below will try every
# possible value of K between 2 and the maximum possible depending on the number of
# data points available. For example, if the number of data points is 10,000, it will
# try all possible values of K between 2 and 70. For how the maximum value is set for
# K, see the comments made under Description. Note also how this method makes 20
# different tries for each value of K as a defense against the problem of the final
# result corresponding to some local minimum in the values of the QoC metric. Out of
# these 20 tries for each K, it retains the clusters and the cluster centers for only
# that try that yields the smallest value for the QoC metric. After estimating the
# "best" QoC values for all possible K in this manner, it then finds the K for which
# the QoC is the minimum. This is taken to be the best value for K. Finally, the
# output clusters are written out to separate files.
#
# If the KMeans constructor is invoked with the (Kmin, Kmax) options, then, instead
# of iterating through 2 and the maximum permissible value for K, the iterations are
# carried out only between Kmin and Kmax.
sub iterate_through_K {
my $self = shift;
my @all_data_ids = @{$self->{_data_id_tags}};
my $N = $self->{_N};
croak "You need more than 8 data samples. The number of data points must satisfy " .
"the relation N > 2xK**2 where K is the number of clusters. The smallest " .
"value for K is 2.\n" if $N <= 8;
my $K_statistical_max = int( sqrt( $N / 2.0 ) );
my $Kmin = $self->{_K_min} eq 'unknown'
? 2
: $self->{_K_min};
my $Kmax = $self->{_K_max} eq 'unknown'
? int( sqrt( $N / 2.0 ) )
: $self->{_K_max};
croak "\n\nYour Kmax value is too high. Ideally, it should not exceed sqrt(N/2) " .
"where N is the number of data points" if $Kmax > $K_statistical_max;
print "Value of Kmax is: $Kmax\n" if $self->{_terminal_output};
my @QoC_values;
my @array_of_clusters;
my @array_of_cluster_centers;
foreach my $K ($Kmin..$Kmax) {
my $QoC;
my $clusters;
my $cluster_centers;
print "Clustering for K = $K\n" if $self->{_terminal_output};
if ($self->{_cluster_seeding} eq 'random') {
foreach my $trial (1..21) {
print ". ";
my ($new_clusters, $new_cluster_centers);
if ($self->{_use_mahalanobis_metric}) {
eval {
($new_clusters, $new_cluster_centers) = $self->cluster_for_given_K($K);
};
next if $@;
} else {
($new_clusters, $new_cluster_centers) = $self->cluster_for_given_K($K);
}
my $newQoC = $self->cluster_quality( $new_clusters, $new_cluster_centers );
if ( (!defined $QoC) || ($newQoC < $QoC) ) {
$QoC = $newQoC;
$clusters = deep_copy_AoA( $new_clusters );
$cluster_centers = deep_copy_AoA( $new_cluster_centers );
}
}
print "\n";
} elsif ($self->{_cluster_seeding} eq 'smart') {
eval {
($clusters, $cluster_centers) = $self->cluster_for_given_K($K);
};
if ($@) {
$Kmax = $K - 1;
last;
}
$QoC = $self->cluster_quality($clusters,$cluster_centers);
} else {
die "You must either choose 'smart' for cluster_seeding or 'random'. " .
"Fix your constructor call."
}
push @QoC_values, $QoC;
lib/Algorithm/KMeans.pm view on Meta::CPAN
if $k < $how_many;
last if ++$k == $how_many;
}
return @K_highest_peak_locations;
}
# The purpose of this routine is to form initial clusters by assigning the data
# samples to the initial clusters formed by the previous routine on the basis of the
# best proximity of the data samples to the different cluster centers.
sub assign_data_to_clusters_initial {
my $self = shift;
my @cluster_centers = @{ shift @_ };
my @clusters;
foreach my $ele (@{$self->{_data_id_tags}}) {
my $best_cluster;
my @dist_from_clust_centers;
foreach my $center (@cluster_centers) {
push @dist_from_clust_centers, $self->distance($ele, $center);
}
my ($min, $best_center_index) = minimum( \@dist_from_clust_centers );
push @{$clusters[$best_center_index]}, $ele;
}
return \@clusters;
}
sub assign_data_to_clusters_initial_mahalanobis {
my $self = shift;
my @cluster_centers = @{ shift @_ };
my @clusters;
foreach my $ele (@{$self->{_data_id_tags}}) {
my $best_cluster;
my @dist_from_clust_centers;
foreach my $center (@cluster_centers) {
push @dist_from_clust_centers, $self->distance($ele, $center);
}
my ($min, $best_center_index) = minimum( \@dist_from_clust_centers );
push @{$clusters[$best_center_index]}, $ele if defined $best_center_index;
}
# Since a cluster center may not correspond to any particular sample, it is possible
# for one of the elements of the array @clusters to be null using the above
# strategy for populating the initial clusters. Let's say there are five cluster
# centers in the array @cluster_centers. The $best_center_index may populate the
# the elements of the array @clusters for the indices 0, 1, 2, 4, which would leave
# $clusters[3] as undefined. So, in what follows, we must first check if all of
# the elements of @clusters are defined.
my @determinants;
foreach my $cluster(@clusters) {
die "The clustering program started with bad initialization. Please start over"
unless defined $cluster;
my $covariance = $self->estimate_cluster_covariance($cluster);
my $determinant = $covariance->det();
push @determinants, $determinant;
}
return [\@clusters, \@determinants];
}
# This is the main routine that along with the update_cluster_centers() routine
# constitute the two key steps of the K-Means algorithm. In most cases, the infinite
# while() loop will terminate automatically when the cluster assignments of the data
# points remain unchanged. For the sake of safety, we keep track of the number of
# iterations. If this number reaches 100, we exit the while() loop anyway. In most
# cases, this limit will not be reached.
sub assign_data_to_clusters {
my $self = shift;
my $clusters = shift;
my $K = shift;
my $final_cluster_centers;
my $iteration_index = 0;
while (1) {
my $new_clusters;
my $assignment_changed_flag = 0;
my $current_cluster_center_index = 0;
my $cluster_size_zero_condition = 0;
my $how_many = @$clusters;
my $cluster_centers = $self->update_cluster_centers( deep_copy_AoA_with_nulls( $clusters ) );
$iteration_index++;
foreach my $cluster (@$clusters) {
my $current_cluster_center = $cluster_centers->[$current_cluster_center_index];
foreach my $ele (@$cluster) {
my @dist_from_clust_centers;
foreach my $center (@$cluster_centers) {
push @dist_from_clust_centers,
$self->distance($ele, $center);
}
my ($min, $best_center_index) = minimum( \@dist_from_clust_centers );
my $best_cluster_center = $cluster_centers->[$best_center_index];
if (vector_equal($current_cluster_center, $best_cluster_center)){
push @{$new_clusters->[$current_cluster_center_index]}, $ele;
} else {
$assignment_changed_flag = 1;
push @{$new_clusters->[$best_center_index]}, $ele;
}
}
$current_cluster_center_index++;
}
next if ((@$new_clusters != @$clusters) && ($iteration_index < 100));
# Now make sure that none of the K clusters is an empty cluster:
foreach my $newcluster (@$new_clusters) {
$cluster_size_zero_condition = 1 if ((!defined $newcluster) or (@$newcluster == 0));
}
push @$new_clusters, (undef) x ($K - @$new_clusters) if @$new_clusters < $K;
# During clustering for a fixed K, should a cluster inadvertently
# become empty, steal a member from the largest cluster to hopefully
# spawn a new cluster:
my $largest_cluster;
foreach my $local_cluster (@$new_clusters) {
next if !defined $local_cluster;
$largest_cluster = $local_cluster if !defined $largest_cluster;
if (@$local_cluster > @$largest_cluster) {
$largest_cluster = $local_cluster;
}
}
foreach my $local_cluster (@$new_clusters) {
if ( (!defined $local_cluster) || (@$local_cluster == 0) ) {
push @$local_cluster, pop @$largest_cluster;
}
}
next if (($cluster_size_zero_condition) && ($iteration_index < 100));
last if $iteration_index == 100;
$clusters = deep_copy_AoA( $new_clusters );
last if $assignment_changed_flag == 0;
lib/Algorithm/KMeans.pm view on Meta::CPAN
is based on Srezic's recommendation. He had no problem building and testing the
previous version with Perl 5.8.9. Version 2.05 also includes a small augmentation of
the code in the method C<read_data_from_file_csv()> for guarding against user errors
in the specification of the mask that tells the module which columns of the data file
are to be used for clustering.
Version 2.04 allows you to use CSV data files for clustering.
Version 2.03 incorporates minor code cleanup. The main implementation of the module
remains unchanged.
Version 2.02 downshifts the version of Perl that is required for this module. The
module should work with versions 5.10 and higher of Perl. The implementation code
for the module remains unchanged.
Version 2.01 removes many errors in the documentation. The changes made to the module
in Version 2.0 were not reflected properly in the documentation page for that
version. The implementation code remains unchanged.
Version 2.0 includes significant additional functionality: (1) You now have the
option to cluster using the Mahalanobis distance metric (the default is the Euclidean
metric); and (2) With the two C<which_cluster> methods that have been added to the
module, you can now determine the best cluster for a new data sample after you have
created the clusters with the previously available data. Finding the best cluster
for a new data sample can be done using either the Euclidean metric or the
Mahalanobis metric.
Version 1.40 includes a C<smart> option for seeding the clusters. This option,
supplied through the constructor parameter C<cluster_seeding>, means that the
clusterer will (1) Subject the data to principal components analysis in order to
determine the maximum variance direction; (2) Project the data onto this direction;
(3) Find peaks in a smoothed histogram of the projected points; and (4) Use the
locations of the highest peaks as initial guesses for the cluster centers. If you
don't want to use this option, set C<cluster_seeding> to C<random>. That should work
as in the previous version of the module.
Version 1.30 includes a bug fix for the case when the datafile contains empty lines,
that is, lines with no data records. Another bug fix in Version 1.30 deals with the
case when you want the module to figure out how many clusters to form (this is the
C<K=0> option in the constructor call) and the number of data records is close to the
minimum.
Version 1.21 includes fixes to handle the possibility that, when clustering the data
for a fixed number of clusters, a cluster may become empty during iterative
calculation of cluster assignments of the data elements and the updating of the
cluster centers. The code changes are in the C<assign_data_to_clusters()> and
C<update_cluster_centers()> subroutines.
Version 1.20 includes an option to normalize the data with respect to its variability
along the different coordinates before clustering is carried out.
Version 1.1.1 allows for range limiting the values of C<K> to search through. C<K>
stands for the number of clusters to form. This version also declares the module
dependencies in the C<Makefile.PL> file.
Version 1.1 is a an object-oriented version of the implementation presented in
version 1.0. The current version should lend itself more easily to code extension.
You could, for example, create your own class by subclassing from the class presented
here and, in your subclass, use your own criteria for the similarity distance between
the data points and for the QoC (Quality of Clustering) metric, and, possibly a
different rule to stop the iterations. Version 1.1 also allows you to directly
access the clusters formed and the cluster centers in your calling script.
=head1 SPECIAL USAGE NOTE
If you were directly accessing in your own scripts the clusters produced by the older
versions of this module, you'd need to make changes to your code if you wish to use
Version 2.0 or higher. Instead of returning arrays of clusters and cluster centers,
Versions 2.0 and higher return hashes. This change was made necessary by the logic
required for implementing the two new C<which_cluster> methods that were introduced
in Version 2.0. These methods return the best cluster for a new data sample from the
clusters you created using the existing data. One of the C<which_cluster> methods is
based on the Euclidean metric for finding the cluster that is closest to the new data
sample, and the other on the Mahalanobis metric. Another point of incompatibility
with the previous versions is that you must now explicitly set the C<cluster_seeding>
parameter in the call to the constructor to either C<random> or C<smart>. This
parameter does not have a default associated with it starting with Version 2.0.
=head1 DESCRIPTION
Clustering with K-Means takes place iteratively and involves two steps: 1) assignment
of data samples to clusters on the basis of how far the data samples are from the
cluster centers; and 2) Recalculation of the cluster centers (and cluster covariances
if you are using the Mahalanobis distance metric for clustering).
Obviously, before the two-step approach can proceed, we need to initialize the the
cluster centers. How this initialization is carried out is important. The module
gives you two very different ways for carrying out this initialization. One option,
called the C<smart> option, consists of subjecting the data to principal components
analysis to discover the direction of maximum variance in the data space. The data
points are then projected on to this direction and a histogram constructed from the
projections. Centers of the smoothed histogram are used to seed the clustering
operation. The other option is to choose the cluster centers purely randomly. You
get the first option if you set C<cluster_seeding> to C<smart> in the constructor,
and you get the second option if you set it to C<random>.
How to specify the number of clusters, C<K>, is one of the most vexing issues in any
approach to clustering. In some case, we can set C<K> on the basis of prior
knowledge. But, more often than not, no such prior knowledge is available. When the
programmer does not explicitly specify a value for C<K>, the approach taken in the
current implementation is to try all possible values between 2 and some largest
possible value that makes statistical sense. We then choose that value for C<K>
which yields the best value for the QoC (Quality of Clustering) metric. It is
generally believed that the largest value for C<K> should not exceed C<sqrt(N/2)>
where C<N> is the number of data samples to be clustered.
What to use for the QoC metric is obviously a critical issue unto itself. In the
current implementation, the value of QoC is the ratio of the average radius of the
clusters and the average distance between the cluster centers.
Every iterative algorithm requires a stopping criterion. The criterion implemented
here is that we stop iterations when there is no re-assignment of the data points
during the assignment step.
Ordinarily, the output produced by a K-Means clusterer will correspond to a local
minimum for the QoC values, as opposed to a global minimum. The current
implementation protects against that when the module constructor is called with the
C<random> option for C<cluster_seeding> by trying different randomly selected initial
cluster centers and then selecting the one that gives the best overall QoC value.
A K-Means clusterer will generally produce good results if the overlap between the
clusters is minimal and if each cluster exhibits variability that is uniform in all
directions. When the data variability is different along the different directions in
the data space, the results you obtain with a K-Means clusterer may be improved by
first normalizing the data appropriately, as can be done in this module when you set
the C<do_variance_normalization> option in the constructor. However, as pointed out
elsewhere in this documentation, such normalization may actually decrease the
performance of the clusterer if the overall data variability along any dimension is
more a result of separation between the means than a consequence of intra-cluster
variability.
=head1 METHODS
The module provides the following methods for clustering, for cluster visualization,
for data visualization, for the generation of data for testing a clustering
algorithm, and for determining the cluster membership of a new data sample:
=over 4
=item B<new():>
my $clusterer = Algorithm::KMeans->new(datafile => $datafile,
mask => $mask,
K => $K,
cluster_seeding => 'random', # also try 'smart'
use_mahalanobis_metric => 1, # also try '0'
terminal_output => 1,
write_clusters_to_files => 1,
);
A call to C<new()> constructs a new instance of the C<Algorithm::KMeans> class. When
C<$K> is a non-zero positive integer, the module will construct exactly that many
clusters. However, when C<$K> is 0, the module will find the best number of clusters
to partition the data into. As explained in the Description, setting
C<cluster_seeding> to C<smart> causes PCA (principal components analysis) to be used
for discovering the best choices for the initial cluster centers. If you want purely
random decisions to be made for the initial choices for the cluster centers, set
C<cluster_seeding> to C<random>.
The data file is expected to contain entries in the following format
c20 0 10.7087017086940 9.63528386251712 10.9512155258108 ...
c7 0 12.8025925026787 10.6126270065785 10.5228482095349 ...
b9 0 7.60118206283120 5.05889245193079 5.82841781759102 ...
....
....
where the first column contains the symbolic ID tag for each data record and the rest
of the columns the numerical information. As to which columns are actually used for
clustering is decided by the string value of the mask. For example, if we wanted to
cluster on the basis of the entries in just the 3rd, the 4th, and the 5th columns
( run in 0.664 second using v1.01-cache-2.11-cpan-96521ef73a4 )