Algorithm-KMeans

 view release on metacpan or  search on metacpan

lib/Algorithm/KMeans.pm  view on Meta::CPAN

# K, see the comments made under Description.  Note also how this method makes 20
# different tries for each value of K as a defense against the problem of the final
# result corresponding to some local minimum in the values of the QoC metric.  Out of
# these 20 tries for each K, it retains the clusters and the cluster centers for only
# that try that yields the smallest value for the QoC metric.  After estimating the
# "best" QoC values for all possible K in this manner, it then finds the K for which
# the QoC is the minimum.  This is taken to be the best value for K.  Finally, the
# output clusters are written out to separate files.
#
# If the KMeans constructor is invoked with the (Kmin, Kmax) options, then, instead
# of iterating through 2 and the maximum permissible value for K, the iterations are
# carried out only between Kmin and Kmax.
sub iterate_through_K {
    my $self = shift;
    my @all_data_ids = @{$self->{_data_id_tags}};
    my $N = $self->{_N};
    croak "You need more than 8 data samples. The number of data points must satisfy " .
          "the relation N > 2xK**2 where K is the number of clusters.  The smallest " .
          "value for K is 2.\n"  if $N <= 8;
    my $K_statistical_max = int( sqrt( $N / 2.0 ) );
    my $Kmin = $self->{_K_min} eq 'unknown' 

lib/Algorithm/KMeans.pm  view on Meta::CPAN

        my $determinant = $covariance->det();
        push @determinants, $determinant;
    }
    return [\@clusters, \@determinants];
}    

# This is the main routine that along with the update_cluster_centers() routine
# constitute the two key steps of the K-Means algorithm.  In most cases, the infinite
# while() loop will terminate automatically when the cluster assignments of the data
# points remain unchanged. For the sake of safety, we keep track of the number of
# iterations. If this number reaches 100, we exit the while() loop anyway.  In most
# cases, this limit will not be reached.
sub assign_data_to_clusters {
    my $self = shift;
    my $clusters = shift;
    my $K = shift;
    my $final_cluster_centers;
    my $iteration_index = 0;
    while (1) {
        my $new_clusters;
        my $assignment_changed_flag = 0;

lib/Algorithm/KMeans.pm  view on Meta::CPAN


Version 1.1.1 allows for range limiting the values of C<K> to search through.  C<K>
stands for the number of clusters to form.  This version also declares the module
dependencies in the C<Makefile.PL> file.

Version 1.1 is a an object-oriented version of the implementation presented in
version 1.0.  The current version should lend itself more easily to code extension.
You could, for example, create your own class by subclassing from the class presented
here and, in your subclass, use your own criteria for the similarity distance between
the data points and for the QoC (Quality of Clustering) metric, and, possibly a
different rule to stop the iterations.  Version 1.1 also allows you to directly
access the clusters formed and the cluster centers in your calling script.


=head1 SPECIAL USAGE NOTE

If you were directly accessing in your own scripts the clusters produced by the older
versions of this module, you'd need to make changes to your code if you wish to use
Version 2.0 or higher.  Instead of returning arrays of clusters and cluster centers,
Versions 2.0 and higher return hashes. This change was made necessary by the logic
required for implementing the two new C<which_cluster> methods that were introduced

lib/Algorithm/KMeans.pm  view on Meta::CPAN

possible value that makes statistical sense.  We then choose that value for C<K>
which yields the best value for the QoC (Quality of Clustering) metric.  It is
generally believed that the largest value for C<K> should not exceed C<sqrt(N/2)>
where C<N> is the number of data samples to be clustered.

What to use for the QoC metric is obviously a critical issue unto itself.  In the
current implementation, the value of QoC is the ratio of the average radius of the
clusters and the average distance between the cluster centers.

Every iterative algorithm requires a stopping criterion.  The criterion implemented
here is that we stop iterations when there is no re-assignment of the data points
during the assignment step.

Ordinarily, the output produced by a K-Means clusterer will correspond to a local
minimum for the QoC values, as opposed to a global minimum.  The current
implementation protects against that when the module constructor is called with the
C<random> option for C<cluster_seeding> by trying different randomly selected initial
cluster centers and then selecting the one that gives the best overall QoC value.

A K-Means clusterer will generally produce good results if the overlap between the
clusters is minimal and if each cluster exhibits variability that is uniform in all



( run in 0.858 second using v1.01-cache-2.11-cpan-71847e10f99 )