Algorithm-KMeans


MANIFEST

examples/cleanup_directory.pl
examples/cluster_after_data_normalization.pl
examples/cluster_and_visualize.pl
examples/cluster_and_visualize_with_data_visualization.pl
examples/data_generator.pl
examples/which_cluster_for_new_data.pl
examples/find_best_K_and_cluster.pl
examples/find_best_K_in_range_and_cluster.pl
examples/sphericaldata.csv
examples/mydatafile1.dat
examples/mydatafile2.dat
examples/mydatafile3.dat
examples/param.txt
examples/param2.txt
examples/param3.txt
examples/README
lib/Algorithm/KMeans.pm
Makefile.PL

examples/README


    The more time you spend with this script, the more comfortable you will
    become with the use of this module. The script file contains a large
    comment block that talks about the SIX LOCATIONS in the script where
    you have to make decisions about how to use the module.



2)  For clustering with an unknown value for K:

       find_best_K_and_cluster.pl

    This script demonstrates the `K=0' option in the constructor, which causes
    the module to search for the best K for your data.  Since this search
    is virtually unbounded --- limited only by the size of your data file ---
    it may take a long time to run this script on a large data file.  Hence the
    next script.
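    Assembled from the module's synopsis elsewhere in this distribution, the
    K=0 workflow looks roughly like this (the datafile and mask values are
    placeholders for your own data):

    ```perl
    use strict;
    use warnings;
    use Algorithm::KMeans;

    # Sketch only: datafile and mask below are placeholder values.
    my $clusterer = Algorithm::KMeans->new(
        datafile        => "mydatafile1.dat",
        mask            => "N111",
        K               => 0,            # 0 asks the module to find the best K
        cluster_seeding => 'random',     # or 'smart'
        terminal_output => 1,
    );
    my ($clusters_hash, $cluster_centers_hash) = $clusterer->kmeans();
    my $K_best = $clusterer->get_K_best();   # best K found by the search
    $clusterer->show_QoC_values();           # QoC for each K tried
    ```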



3)  If your datafile is too large, you may need to limit the range of values
    of K that are searched through:

       find_best_K_in_range_and_cluster.pl
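    A range-bound search uses the Kmin and Kmax constructor options described
    in this module's documentation; the sketch below uses arbitrary example
    values for the range:

    ```perl
    use strict;
    use warnings;
    use Algorithm::KMeans;

    # Sketch only: Kmin/Kmax values and the datafile are example placeholders.
    my $clusterer = Algorithm::KMeans->new(
        datafile        => "mydatafile1.dat",
        mask            => "N111",
        Kmin            => 3,             # search for the best K starts here
        Kmax            => 10,            # and ends here
        cluster_seeding => 'random',
        terminal_output => 1,
    );
    my ($clusters_hash, $cluster_centers_hash) = $clusterer->kmeans();
    ```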



4)  If you also want to include data normalization (it may reduce the
    performance of the clusterer in some cases):

       cluster_after_data_normalization.pl



examples/cluster_and_visualize.pl

##      2) Next, choose the data mask to apply to the columns of the data file.  The
##           position of the letter `N' in the mask indicates the column that
##           contains a symbolic name for each data record.  If the symbolic name for
##           each data record is in the first column and you want to cluster 3D data
##           that is in the next three columns, your data mask will be N111.  On the
##           other hand, if for the same data file, you want to carry out 2D
##           clustering on the last two columns, your data mask will be N011.
##
##      3) Next, you need to decide how many clusters you want the program to return.
##           If you want the program to figure out on its own how many clusters to 
##           partition the data into, see the script find_best_K_and_cluster.pl in this
##           directory.
##
##      4) Next you need to decide whether you want `random' seeding or `smart'
##           seeding.  Bear in mind that `smart' seeding may produce worse results
##           than `random' seeding, depending on how the data clusters are actually
##           distributed.  
##
##      5) Next you need to decide whether or not you want to use the Mahalanobis
##           distance metric for clustering.  The default is the Euclidean metric.
##
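The decisions described in the comment block above map onto constructor options
that appear in this module's synopsis; a sketch (with placeholder datafile and
mask values) might look like:

```perl
use strict;
use warnings;
use Algorithm::KMeans;

# Sketch of the decisions described above; filenames and values are placeholders.
my $clusterer = Algorithm::KMeans->new(
    datafile                => "mydata.dat",  # 1) your data file
    mask                    => "N111",        # 2) N marks the symbolic-name column
    K                       => 3,             # 3) number of clusters wanted
    cluster_seeding         => 'smart',       # 4) 'smart' or 'random'
    use_mahalanobis_metric  => 1,             # 5) default is Euclidean (0)
    terminal_output         => 1,
    write_clusters_to_files => 1,
);
my ($clusters_hash, $cluster_centers_hash) = $clusterer->kmeans();
```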

examples/data_generator.pl


use strict;
use Algorithm::KMeans;

# The Parameter File:

# How the synthetic data is generated for clustering is
# controlled entirely by the input_parameter_file keyword in
# the function call shown below.  The mean vector and
# covariance matrix entries in the file must follow the
# syntax shown in the example param.txt file.  It is best to
# edit this file as needed for the purpose of data
# generation.

#my $parameter_file = "param.txt";
#my $parameter_file = "param3.txt";
my $parameter_file = "param2.txt";
#my $out_datafile = "mydatafile2.dat";
my $out_datafile = "mydatafile3.dat";

Algorithm::KMeans->cluster_data_generator( 

examples/find_best_K_and_cluster.pl

#!/usr/bin/perl -w

#use lib '../blib/lib', '../blib/arch';

##  find_best_K_and_cluster.pl


##  IMPORTANT:  Read the 6 point customization of a script like this in the file:
##
##                       cluster_and_visualize.pl



##  This script is a demonstration of the constructor option:
##
##                           K => 0
##
##  for an unbounded search for the best K --- unbounded to the extent permitted by
##  the number of data records in your data file.  Recall K is the number of clusters
##  in your data.  By its very nature, unbounded search for the best K could take
##  more time than you have patience for if your data file is large.  In such cases,
##  you could try range bounded search as in the script: 
##
##                    find_best_K_in_range_and_cluster.pl




use strict;
use Algorithm::KMeans;

my $datafile = "mydatafile1.dat";                  # contains 3 clusters, 3D data
#my $datafile = "mydatafile2.dat";                   # contains 2 clusters, 3D data

examples/find_best_K_and_cluster.pl

    print "\n$cluster_id   =>   @{$clusters_hash->{$cluster_id}}\n";
}

print "\nDisplaying cluster centers in the terminal window:\n";
foreach my $cluster_id (sort keys %{$cluster_centers_hash}) {
    print "\n$cluster_id   =>   @{$cluster_centers_hash->{$cluster_id}}\n";
}

# ACCESSING THE BEST VALUE OF K FOUND:

my $K_best = $clusterer->get_K_best();
$clusterer->show_QoC_values();


# VISUALIZATION:

# Visualization mask:

# In most cases, you would not change the value of the mask between clustering and
# visualization.  But, if you are clustering multi-dimensional data and you wish to
# visualize the projection of the data on each plane separately, you can do so by

examples/find_best_K_in_range_and_cluster.pl

#!/usr/bin/perl -w

#use lib '../blib/lib', '../blib/arch';


##  find_best_K_in_range_and_cluster.pl


##  IMPORTANT:  Read the 6 point customization of a script like this in the
##              file:
##                            cluster_and_visualize.pl


##  This script is a demonstration of the constructor options:
##
##                    Kmin     and     Kmax
##
##  for a range-bound search for the best K.  Recall K is the number of clusters
##  in your data.


use strict;
use Algorithm::KMeans;

#my $datafile = "mydatafile1.dat";                 # contains 3 clusters, 3D data
my $datafile = "sphericaldata.csv";               # contains 3 clusters, 3D data

my $mask = "N111";

examples/find_best_K_in_range_and_cluster.pl

    print "\n$cluster_id   =>   @{$clusters_hash->{$cluster_id}}\n";
}

print "\nDisplaying cluster centers in the terminal window:\n";
foreach my $cluster_id (sort keys %{$cluster_centers_hash}) {
    print "\n$cluster_id   =>   @{$cluster_centers_hash->{$cluster_id}}\n";
}

# ACCESSING THE BEST VALUE OF K FOUND:

my $K_best = $clusterer->get_K_best();
$clusterer->show_QoC_values();


# VISUALIZATION:

# Visualization mask:
# In most cases, you would not change the value of the mask between clustering and
# visualization.  But, if you are clustering multi-dimensional data and you wish to
# visualize the projection of the data on each plane separately, you can do so by
# changing the value of the visualization mask.  The number of on bits in the

lib/Algorithm/KMeans.pm

        _K_min                    =>   $args{Kmin} || 'unknown',
        _K_max                    =>   $args{Kmax} || 'unknown',
        _cluster_seeding          =>   $args{cluster_seeding} || croak("must choose smart or random ".
                                                                       "for cluster seeding"),
        _var_normalize            =>   $args{do_variance_normalization} || 0,
        _use_mahalanobis_metric   =>   $args{use_mahalanobis_metric} || 0,  
        _clusters_2_files         =>   $args{write_clusters_to_files} || 0,
        _terminal_output          =>   $args{terminal_output} || 0,
        _debug                    =>   $args{debug} || 0,
        _N                        =>   0,
        _K_best                   =>   'unknown',
        _original_data            =>   {},
        _data                     =>   {},
        _data_id_tags             =>   [],
        _QoC_values               =>   {},
        _clusters                 =>   [],
        _cluster_centers          =>   [],
        _clusters_hash            =>   {},
        _cluster_centers_hash     =>   {},
        _cluster_covariances_hash =>   {},
        _data_dimensions          =>   0,

lib/Algorithm/KMeans.pm

            $cluster_centers = deep_copy_AoA( $new_cluster_centers );
        } 
    }
    die "\n\nThe constructor options you have chosen do not work with the data.  Try\n" .
        "turning off the Mahalanobis option if you are using it.\n"
        unless defined $clusters;
    $self->{_clusters} = $clusters;
    $self->{_cluster_centers} = $cluster_centers;  
    $self->{_QoC_values}->{"$K"} = $QoC; 
    if ($self->{_terminal_output}) {
        print "\nDisplaying final clusters for best K (= $K) :\n";
        display_clusters( $clusters );
        $self->display_cluster_centers( $clusters );
        print "\nQoC value (the smaller, the better): $QoC\n";
    }
}

# For the smart try, we do initial cluster seeding by subjecting the data to
# principal components analysis in order to discover the direction of maximum
# variance in the data space.  Subsequently, we try to find the K largest peaks along
# this direction.  The coordinates of these peaks serve as the seeds for the K

lib/Algorithm/KMeans.pm

    };
    if ($@) {
        die "In cluster_for_fixed_K_single_smart_try:  insufficient data for clustering " .
            "with $self->{_K} clusters --- $@";
    }
    my $QoC = $self->cluster_quality( $clusters, $cluster_centers );
    $self->{_clusters} = $clusters;
    $self->{_cluster_centers} = $cluster_centers;  
    $self->{_QoC_values}->{"$K"} = $QoC; 
    if ($self->{_terminal_output}) {
        print "\nDisplaying final clusters for best K (= $K) :\n";
        display_clusters( $clusters );
        $self->display_cluster_centers( $clusters );
        print "\nQoC value (the smaller, the better): $QoC\n";
    }
}

# The following subroutine is the top-level routine to call if you want the system to
# figure out on its own what value to use for K, the number of clusters.  If you call
# the KMeans constructor with the K=0 option, the method below will try every
# possible value of K between 2 and the maximum possible depending on the number of
# data points available. For example, if the number of data points is 10,000, it will
# try all possible values of K between 2 and 70. For how the maximum value is set for
# K, see the comments made under Description.  Note also how this method makes 20
# different tries for each value of K as a defense against the problem of the final
# result corresponding to some local minimum in the values of the QoC metric.  Out of
# these 20 tries for each K, it retains the clusters and the cluster centers for only
# that try that yields the smallest value for the QoC metric.  After estimating the
# "best" QoC values for all possible K in this manner, it then finds the K for which
# the QoC is the minimum.  This is taken to be the best value for K.  Finally, the
# output clusters are written out to separate files.
#
# If the KMeans constructor is invoked with the (Kmin, Kmax) options, then, instead
# of iterating through 2 and the maximum permissible value for K, the iterations are
# carried out only between Kmin and Kmax.
sub iterate_through_K {
    my $self = shift;
    my @all_data_ids = @{$self->{_data_id_tags}};
    my $N = $self->{_N};
    croak "You need more than 8 data samples. The number of data points must satisfy " .

lib/Algorithm/KMeans.pm

            $QoC = $self->cluster_quality($clusters,$cluster_centers);
        } else {
            die "You must either choose 'smart' for cluster_seeding or 'random'. " .
                "Fix your constructor call." 
        }
        push @QoC_values, $QoC;
        push @array_of_clusters, $clusters;
        push @array_of_cluster_centers, $cluster_centers;
    }
    my ($min, $max) = minmax( \@QoC_values );
    my $K_best_relative_to_Kmin = get_index_at_value($min, \@QoC_values );
    my $K_best = $K_best_relative_to_Kmin + $Kmin;
    if ($self->{_terminal_output}) {
        print "\nDisplaying final clusters for best K (= $K_best) :\n";
        display_clusters( $array_of_clusters[$K_best_relative_to_Kmin] );
        $self->display_cluster_centers($array_of_clusters[$K_best_relative_to_Kmin]);
        print "\nBest clustering achieved for K=$K_best with QoC = $min\n" if defined $min;
        my @printableQoC = grep {$_} @QoC_values;
        print "\nQoC values array (the smaller the value, the better it is) for different " . 
              "K starting with K=$Kmin:  @printableQoC\n";
    }
    $self->{_K_best} = $K_best;
    foreach my $i (0..@QoC_values-1) {
        my $k = $i + $Kmin;
        $self->{_QoC_values}->{"$k"} = $QoC_values[$i]; 
    }
    $self->{_clusters} = $array_of_clusters[$K_best_relative_to_Kmin];
    $self->{_cluster_centers} =  
                $array_of_cluster_centers[$K_best_relative_to_Kmin];
}

# This is the function to call if you already know what value you want to use for K,
# the number of expected clusters.  The purpose of this function is to do the
# initialization of the cluster centers and to carry out the initial assignment of
# the data to the clusters with the initial cluster centers.  The initialization
# consists of 3 steps: 1) Construct a random sequence of K integers between 0 and N-1
# where N is the number of data points to be clustered; 2) Call
# get_initial_cluster_centers() to index into the data array with the random integers
# to get a list of K data points that would serve as the initial cluster centers; and

lib/Algorithm/KMeans.pm

        my $unscaled_peak_point = $min + $peaks_index_hash->{$peak} * $scale * $delta;
        push @K_highest_peak_locations, $unscaled_peak_point
            if $k < $how_many;
        last if ++$k == $how_many;
    }
    return @K_highest_peak_locations;
}

# The purpose of this routine is to form initial clusters by assigning the data
# samples to the initial clusters formed by the previous routine on the basis of the
# best proximity of the data samples to the different cluster centers.
sub assign_data_to_clusters_initial {
    my $self = shift;
    my @cluster_centers = @{ shift @_ };
    my @clusters;
    foreach my $ele (@{$self->{_data_id_tags}}) {
        my $best_cluster;
        my @dist_from_clust_centers;
        foreach my $center (@cluster_centers) {
            push @dist_from_clust_centers, $self->distance($ele, $center);
        }
        my ($min, $best_center_index) = minimum( \@dist_from_clust_centers );
        push @{$clusters[$best_center_index]}, $ele;
    }
    return \@clusters;
}    

sub assign_data_to_clusters_initial_mahalanobis {
    my $self = shift;
    my @cluster_centers = @{ shift @_ };
    my @clusters;
    foreach my $ele (@{$self->{_data_id_tags}}) {
        my $best_cluster;
        my @dist_from_clust_centers;
        foreach my $center (@cluster_centers) {
            push @dist_from_clust_centers, $self->distance($ele, $center);
        }
        my ($min, $best_center_index) = minimum( \@dist_from_clust_centers );
        push @{$clusters[$best_center_index]}, $ele if defined $best_center_index;
    }
    # Since a cluster center may not correspond to any particular sample, it is possible
    # for one of the elements of the array @clusters to be null using the above 
    # strategy for populating the initial clusters.  Let's say there are five cluster
    # centers in the array @cluster_centers.  The $best_center_index may populate the
    # elements of the array @clusters for the indices 0, 1, 2, 4, which would leave
    # $clusters[3] as undefined.  So, in what follows, we must first check if all of
    # the elements of @clusters are defined.
    my @determinants;
    foreach my $cluster(@clusters) {
        die "The clustering program started with bad initialization.  Please start over" 
            unless defined $cluster;
        my $covariance = $self->estimate_cluster_covariance($cluster);
        my $determinant = $covariance->det();
        push @determinants, $determinant;

lib/Algorithm/KMeans.pm

        my $cluster_centers = $self->update_cluster_centers( deep_copy_AoA_with_nulls( $clusters ) );
        $iteration_index++;
        foreach my $cluster (@$clusters) {
            my $current_cluster_center = $cluster_centers->[$current_cluster_center_index];
            foreach my $ele (@$cluster) {
                my @dist_from_clust_centers;
                foreach my $center (@$cluster_centers) {
                    push @dist_from_clust_centers, 
                               $self->distance($ele, $center);
                }
                my ($min, $best_center_index) = minimum( \@dist_from_clust_centers );
                my $best_cluster_center = $cluster_centers->[$best_center_index];
                if (vector_equal($current_cluster_center, $best_cluster_center)){
                    push @{$new_clusters->[$current_cluster_center_index]}, $ele;
                } else {
                    $assignment_changed_flag = 1;             
                    push @{$new_clusters->[$best_center_index]}, $ele;
                }
            }
            $current_cluster_center_index++;
        }
        next if ((@$new_clusters != @$clusters) && ($iteration_index < 100));
        # Now make sure that none of the K clusters is an empty cluster:
        foreach my $newcluster (@$new_clusters) {
            $cluster_size_zero_condition = 1 if ((!defined $newcluster) or  (@$newcluster == 0));
        }
        push @$new_clusters, (undef) x ($K - @$new_clusters) if @$new_clusters < $K;

lib/Algorithm/KMeans.pm

                foreach my $i (0..@$clusters-1) {
                    my $center = $cluster_centers->[$i];
                    my $covariance = $cluster_covariances->[$i];
                    my $maha_distance;
                    eval {
                        $maha_distance = $self->distance_mahalanobis($ele, $center, $covariance);
                    };
                    next if $@;
                    push @mahalanobis_dist_from_clust_centers, $maha_distance; 
                }
                my ($min, $best_center_index) = minimum( \@mahalanobis_dist_from_clust_centers );
                die "The Mahalanobis metric may not be appropriate for the data" 
                    unless defined $best_center_index;
                my $best_cluster_center = $cluster_centers->[$best_center_index];
                if (vector_equal($current_cluster_center, $best_cluster_center)){
                    push @{$new_clusters->[$current_cluster_center_index]}, $ele;
                } else {
                    $assignment_changed_flag = 1;             
                    push @{$new_clusters->[$best_center_index]}, $ele;
                }
            }
            $current_cluster_center_index++;
        }
        next if ((@$new_clusters != @$clusters) && ($iteration_index < 100));
        # Now make sure that none of the K clusters is an empty cluster:
        foreach my $newcluster (@$new_clusters) {
            $cluster_size_zero_condition = 1 if ((!defined $newcluster) or  (@$newcluster == 0));
        }
        push @$new_clusters, (undef) x ($K - @$new_clusters) if @$new_clusters < $K;

lib/Algorithm/KMeans.pm

        my $filename = "cluster" . $i . ".txt";
        print "\nWriting cluster $i to file $filename\n" if $self->{_terminal_output};
        open my $fh, "|-", "sort > $filename" or die "Unable to open file: $!";
        foreach my $ele (@{$clusters[$i]}) {
            print $fh "$ele ";
        }
        close $fh;
    }
}

sub get_K_best {
    my $self = shift;
    croak "You need to run the clusterer with K=0 option " .
          "before you can call this method" if $self->{_K_best} eq 'unknown';
    print "\nThe best value of K: $self->{_K_best}\n" if $self->{_terminal_output};
    return $self->{_K_best};
}

sub show_QoC_values {
    my $self = shift;
    croak "\nYou need to run the clusterer with K=0 option before you can call this method" 
                            if $self->{_K_best} eq 'unknown';
    print "\nShown below are K on the left and the QoC values on the right (the smaller " .
          "the QoC, the better it is)\n";
    foreach my $key (sort keys %{$self->{_QoC_values}} ) {
        print " $key  =>  $self->{_QoC_values}->{$key}\n" if defined $self->{_QoC_values}->{$key};
    }
}

sub DESTROY {
    unlink "__temp_" . basename($_[0]->{_datafile});
    unlink "__temp_data_" . basename($_[0]->{_datafile});

lib/Algorithm/KMeans.pm

                                          terminal_output => 1,
                                          do_variance_normalization => 1,
                                          write_clusters_to_files => 1,
                                        );

  # But bear in mind that such data normalization may actually decrease the
  # performance of the clusterer if the variability in the data is more a result of
  # the separation between the means than a consequence of intra-cluster variance.
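  A plain-Perl illustration (not the module's internal code) of what
  per-dimension variance normalization does: each coordinate is divided by
  that dimension's standard deviation so all dimensions contribute comparably
  to distance computations.

  ```perl
  use strict;
  use warnings;

  # Illustration only: divide every coordinate by its column's std deviation.
  sub variance_normalize {
      my @data = @_;                       # list of array refs, one per sample
      my $dims = scalar @{ $data[0] };
      my @normalized = map { [ @$_ ] } @data;   # deep-ish copy of the rows
      for my $d (0 .. $dims - 1) {
          my @col  = map { $_->[$d] } @data;
          my $mean = 0; $mean += $_ for @col; $mean /= @col;
          my $var  = 0; $var  += ($_ - $mean) ** 2 for @col; $var /= @col;
          my $std  = sqrt($var) || 1;      # guard against zero variance
          $_->[$d] /= $std for @normalized;
      }
      return @normalized;
  }
  ```

  Note the caveat above: if a dimension's variability comes mostly from the
  separation between cluster means, dividing it out can blur the clusters.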

  # Set K to 0 if you want the module to figure out the optimum number of clusters
  # from the data. (It is best to run this option with the terminal_output set to 1
  # so that you can see the different values of QoC for the different K):

  my $clusterer = Algorithm::KMeans->new( datafile => $datafile,
                                          mask     => $mask,
                                          K        => 0,
                                          cluster_seeding => 'random',    # or 'smart'
                                          terminal_output => 1,
                                          write_clusters_to_files => 1,
                                        );

lib/Algorithm/KMeans.pm

  $clusterer->visualize_clusters($visualization_mask);


  # SYNTHETIC DATA GENERATION:

  # The module has been provided with a class method for generating multivariate data
  # for experimenting with clustering.  The data generation is controlled by the
  # contents of the parameter file that is supplied as an argument to the data
  # generator method.  The mean and covariance matrix entries in the parameter file
  # must be according to the syntax shown in the param.txt file in the examples
  # directory. It is best to edit this file as needed:

  my $parameter_file = "param.txt";
  my $out_datafile = "mydatafile.dat";
  Algorithm::KMeans->cluster_data_generator(
                          input_parameter_file => $parameter_file,
                          output_datafile => $out_datafile,
                          number_data_points_per_cluster => $N );

=head1 CHANGES

lib/Algorithm/KMeans.pm

module should work with versions 5.10 and higher of Perl.  The implementation code
for the module remains unchanged.

Version 2.01 removes many errors in the documentation. The changes made to the module
in Version 2.0 were not reflected properly in the documentation page for that
version.  The implementation code remains unchanged.

Version 2.0 includes significant additional functionality: (1) You now have the
option to cluster using the Mahalanobis distance metric (the default is the Euclidean
metric); and (2) With the two C<which_cluster> methods that have been added to the
module, you can now determine the best cluster for a new data sample after you have
created the clusters with the previously available data.  Finding the best cluster
for a new data sample can be done using either the Euclidean metric or the
Mahalanobis metric.

Version 1.40 includes a C<smart> option for seeding the clusters.  This option,
supplied through the constructor parameter C<cluster_seeding>, means that the
clusterer will (1) Subject the data to principal components analysis in order to
determine the maximum variance direction; (2) Project the data onto this direction;
(3) Find peaks in a smoothed histogram of the projected points; and (4) Use the
locations of the highest peaks as initial guesses for the cluster centers.  If you
don't want to use this option, set C<cluster_seeding> to C<random>. That should work

lib/Algorithm/KMeans.pm

access the clusters formed and the cluster centers in your calling script.


=head1 SPECIAL USAGE NOTE

If you were directly accessing in your own scripts the clusters produced by the older
versions of this module, you'd need to make changes to your code if you wish to use
Version 2.0 or higher.  Instead of returning arrays of clusters and cluster centers,
Versions 2.0 and higher return hashes. This change was made necessary by the logic
required for implementing the two new C<which_cluster> methods that were introduced
in Version 2.0.  These methods return the best cluster for a new data sample from the
clusters you created using the existing data.  One of the C<which_cluster> methods is
based on the Euclidean metric for finding the cluster that is closest to the new data
sample, and the other on the Mahalanobis metric.  Another point of incompatibility
with the previous versions is that you must now explicitly set the C<cluster_seeding>
parameter in the call to the constructor to either C<random> or C<smart>.  This
parameter does not have a default associated with it starting with Version 2.0.


=head1 DESCRIPTION

lib/Algorithm/KMeans.pm

operation.  The other option is to choose the cluster centers purely randomly.  You
get the first option if you set C<cluster_seeding> to C<smart> in the constructor,
and you get the second option if you set it to C<random>.

How to specify the number of clusters, C<K>, is one of the most vexing issues in any
approach to clustering.  In some cases, we can set C<K> on the basis of prior
knowledge.  But, more often than not, no such prior knowledge is available.  When the
programmer does not explicitly specify a value for C<K>, the approach taken in the
current implementation is to try all possible values between 2 and some largest
possible value that makes statistical sense.  We then choose that value for C<K>
which yields the best value for the QoC (Quality of Clustering) metric.  It is
generally believed that the largest value for C<K> should not exceed C<sqrt(N/2)>
where C<N> is the number of data samples to be clustered.
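The bound stated above is easy to compute; the sketch below (plain Perl, not
module code) reproduces the figure given in the implementation comments, where
10,000 data points yield a maximum K of 70.

```perl
use strict;
use warnings;

# Largest K tried for N data samples, per the sqrt(N/2) rule stated above.
sub max_K_for {
    my ($N) = @_;
    return int( sqrt( $N / 2 ) );   # int() truncates, i.e. floor for positives
}
```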

What to use for the QoC metric is obviously a critical issue unto itself.  In the
current implementation, the value of QoC is the ratio of the average radius of the
clusters to the average distance between the cluster centers.
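A plain-Perl sketch of the ratio described above; the exact averaging details
below are an assumption for illustration and may differ from the module's
internal computation.  A small QoC means tight clusters with well-separated
centers.

```perl
use strict;
use warnings;

sub euclidean {
    my ($a, $b) = @_;
    my $sum = 0;
    $sum += ($a->[$_] - $b->[$_]) ** 2 for 0 .. $#$a;
    return sqrt($sum);
}

# QoC sketch: average cluster radius divided by average inter-center distance.
sub qoc {
    my ($clusters, $centers) = @_;      # parallel array refs
    my ($radius_sum, $n) = (0, scalar @$clusters);
    for my $i (0 .. $n - 1) {
        my $r = 0;
        $r += euclidean($_, $centers->[$i]) for @{ $clusters->[$i] };
        $radius_sum += $r / @{ $clusters->[$i] };
    }
    my ($dist_sum, $pairs) = (0, 0);
    for my $i (0 .. $n - 2) {
        for my $j ($i + 1 .. $n - 1) {
            $dist_sum += euclidean($centers->[$i], $centers->[$j]);
            $pairs++;
        }
    }
    return ($radius_sum / $n) / ($dist_sum / $pairs);
}
```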

Every iterative algorithm requires a stopping criterion.  The criterion implemented
here is that we stop iterations when there is no re-assignment of the data points
during the assignment step.
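The stopping criterion above can be seen in a minimal plain-Perl k-means loop
(an illustration, not the module's code): iterate until an assignment pass
moves no data point to a different cluster.

```perl
use strict;
use warnings;

# Tiny 1-D k-means: stops when an assignment pass changes nothing.
sub kmeans_1d {
    my ($points, @centers) = @_;
    my @assignment = (-1) x @$points;
    my $changed = 1;
    while ($changed) {
        $changed = 0;
        # assignment step: each point goes to its nearest center
        for my $i (0 .. $#$points) {
            my ($best, $best_dist) = (0, abs($points->[$i] - $centers[0]));
            for my $k (1 .. $#centers) {
                my $d = abs($points->[$i] - $centers[$k]);
                ($best, $best_dist) = ($k, $d) if $d < $best_dist;
            }
            if ($assignment[$i] != $best) {
                $assignment[$i] = $best;
                $changed = 1;              # a point moved: keep iterating
            }
        }
        # update step: recompute each center as the mean of its members
        for my $k (0 .. $#centers) {
            my @members = grep { $assignment[$_] == $k } 0 .. $#$points;
            next unless @members;
            my $sum = 0; $sum += $points->[$_] for @members;
            $centers[$k] = $sum / @members;
        }
    }
    return (\@assignment, \@centers);
}
```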

Ordinarily, the output produced by a K-Means clusterer will correspond to a local
minimum for the QoC values, as opposed to a global minimum.  The current
implementation protects against that when the module constructor is called with the
C<random> option for C<cluster_seeding> by trying different randomly selected initial
cluster centers and then selecting the one that gives the best overall QoC value.

A K-Means clusterer will generally produce good results if the overlap between the
clusters is minimal and if each cluster exhibits variability that is uniform in all
directions.  When the data variability is different along the different directions in
the data space, the results you obtain with a K-Means clusterer may be improved by
first normalizing the data appropriately, as can be done in this module when you set
the C<do_variance_normalization> option in the constructor.  However, as pointed out
elsewhere in this documentation, such normalization may actually decrease the
performance of the clusterer if the overall data variability along any dimension is
more a result of separation between the means than a consequence of intra-cluster

lib/Algorithm/KMeans.pm

                                           mask            => $mask,
                                           K               => $K,
                                           cluster_seeding => 'random',     # also try 'smart'
                                           use_mahalanobis_metric => 1,     # also try '0'
                                           terminal_output => 1,     
                                           write_clusters_to_files => 1,
                                          );

A call to C<new()> constructs a new instance of the C<Algorithm::KMeans> class.  When
C<$K> is a non-zero positive integer, the module will construct exactly that many
clusters.  However, when C<$K> is 0, the module will find the best number of clusters
to partition the data into.  As explained in the Description, setting
C<cluster_seeding> to C<smart> causes PCA (principal components analysis) to be used
for discovering the best choices for the initial cluster centers.  If you want purely
random decisions to be made for the initial choices for the cluster centers, set
C<cluster_seeding> to C<random>.

The data file is expected to contain entries in the following format

   c20  0  10.7087017086940  9.63528386251712  10.9512155258108  ...
   c7   0  12.8025925026787  10.6126270065785  10.5228482095349  ...
   b9   0  7.60118206283120  5.05889245193079  5.82841781759102  ...
   ....
   ....
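Based on the mask description in this document, column selection can be
pictured with a small plain-Perl helper (illustrative only; the module's own
parsing may differ): `N` marks the symbolic-name column, `1` keeps a column
for clustering, and `0` drops it.

```perl
use strict;
use warnings;

# Illustrative mask application: returns the record's name and kept columns.
sub apply_mask {
    my ($mask, $record) = @_;
    my @mask_chars = split //, $mask;
    my @fields     = split ' ', $record;
    my ($name, @values);
    for my $i (0 .. $#mask_chars) {
        if    ($mask_chars[$i] eq 'N') { $name = $fields[$i] }
        elsif ($mask_chars[$i] eq '1') { push @values, $fields[$i] }
    }
    return ($name, \@values);
}
```

For example, mask `N011` on a four-column record keeps the name from column
one and the numbers from the last two columns.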

lib/Algorithm/KMeans.pm

you want the module to cluster.

=item C<mask>:

This parameter supplies the mask to be applied to the columns of your data file. See
the explanation in Synopsis for what this mask looks like.

=item C<K>:

This parameter supplies the number of clusters you are looking for.  If you set this
option to 0, that means that you want the module to search for the best value for
C<K>.  (Keep in mind the fact that searching for the best C<K> may take a long time
for large data files.)

=item C<Kmin>:

If you supply an integer value for C<Kmin>, the search for the best C<K> will begin
with that value.

=item C<Kmax>:

If you supply an integer value for C<Kmax>, the search for the best C<K> will end at
that value.

=item C<cluster_seeding>:

This parameter must be set to either C<random> or C<smart>.  Depending on your data,
you may get superior clustering with the C<random> option.  The choice C<smart> means
that the clusterer will (1) subject the data to principal components analysis to
determine the maximum variance direction; (2) project the data onto this direction;
(3) find peaks in a smoothed histogram of the projected points; and (4) use the
locations of the highest peaks as seeds for cluster centers.  If the C<smart> option

lib/Algorithm/KMeans.pm


    or 

    my ($clusters_hash, $cluster_centers_hash) = $clusterer->kmeans();

The first call above works solely by side-effect.  The second call also returns the
clusters and the cluster centers. See the C<cluster_and_visualize.pl> script in the
C<examples> directory for how you can in your own code extract the clusters and the
cluster centers from the variables C<$clusters_hash> and C<$cluster_centers_hash>.

=item B<get_K_best()>

    $clusterer->get_K_best();

This call makes sense only if you supply either the C<K=0> option to the constructor,
or if you specify values for the C<Kmin> and C<Kmax> options. The C<K=0> and the
C<(Kmin,Kmax)> options cause the module to determine the best value for C<K>.
Remember, C<K> is the number of clusters the data is partitioned into.

=item B<show_QoC_values()>

    $clusterer->show_QoC_values();

presents a table with C<K> values in the left column and the corresponding QoC
(Quality-of-Clustering) values in the right column.  Note that this call makes sense
only if you either supply the C<K=0> option to the constructor, or if you specify
values for the C<Kmin> and C<Kmax> options.

lib/Algorithm/KMeans.pm

        cluster_and_visualize.pl

The more time you spend with this script, the more comfortable you will become with
the use of this module. The script file contains a large comment block that mentions
six locations in the script where you have to make decisions about how to use the
module.

See the following script if you do not know what value to use for C<K> for clustering
your data:

        find_best_K_and_cluster.pl

This script uses the C<K=0> option in the constructor that causes the module to
search for the best C<K> for your data.  Since this search is virtually unbounded ---
limited only by the number of samples in your data file --- the script may take a
long time to run for a large data file.  Hence the next script.

If your datafile is too large, you may need to limit the range of values of C<K>
that are searched through, as in the following script:

        find_best_K_in_range_and_cluster.pl

If you also want to include data normalization (it may reduce the performance of the
clusterer in some cases), see the following script:

        cluster_after_data_normalization.pl

When you include the data normalization step and you would like to visualize the data
before and after normalization, see the following script:

        cluster_and_visualize_with_data_visualization.pl

lib/Algorithm/KMeans.pm

    export PERL5LIB=/some/other/directory/lib64/perl5/:/some/other/directory/share/perl5/


=head1 THANKS

I thank Slaven for pointing out that I needed to downshift the required version of Perl
for this module.  Fortunately, I had access to an old machine still running Perl
5.10.1.  The current version, 2.02, is based on my testing the module on that machine.

I added two C<which_cluster> methods in Version 2.0 as a result of an email from
Jerome White who expressed a need for such methods in order to determine the best
cluster for a new data record after you have successfully clustered your existing
data.  Thanks Jerome for your feedback!

It was an email from Nadeem Bulsara that prompted me to create Version 1.40 of this
module.  Working with Version 1.30, Nadeem noticed that occasionally the module would
produce variable clustering results on the same dataset.  I believe that this
variability was caused (at least partly) by the purely random mode that was used in
Version 1.30 for the seeding of the cluster centers.  Version 1.40 now includes a
C<smart> mode. With the new mode the clusterer uses a PCA (Principal Components
Analysis) of the data to make good guesses for the cluster centers.  However,


