Algorithm-KMeans
examples/README
5) When you include the data normalization step and you would like to
visualize the data before and after normalization:
cluster_and_visualize_with_data_visualization.pl*
6) After you are done clustering, let's say you want to find the cluster
membership of a new data element. To see how you can do that, see the
script
which_cluster_for_new_data.pl
As written, the script gives you two answers for which cluster the new
data element belongs to. One uses the Euclidean metric to calculate the
distances between the new data element and the cluster centers, and the
other uses the Mahalanobis metric. If the clusters
are strongly elliptical in shape, you are likely to get better results
examples/cluster_and_visualize.pl
## 4) Next you need to decide whether you want `random' seeding or `smart'
## seeding. Bear in mind that `smart' seeding may produce worse results
## than `random' seeding, depending on how the data clusters are actually
## distributed.
##
## 5) Next you need to decide whether or not you want to use the Mahalanobis
## distance metric for clustering. The default is the Euclidean metric.
##
## 6) Finally, you need to choose a mask for visualization. The reason the
## visualization mask is set independently of the data mask that was
## specified in Step 2 is this: Let's say your datafile has 8 columns and
## you are choosing to cluster the data records using 4 of those.
## Subsequently, you may want to visually examine the quality of clustering
## by examining some 2D or 3D subspace of the 4-dimensional space
## used for clustering.
use strict;
use Algorithm::KMeans;
examples/which_cluster_for_new_data.pl
#!/usr/bin/perl -w
#use lib '../blib/lib', '../blib/arch';
## which_cluster_for_new_data.pl
## Let's say that after you are done with the clustering of your data, you have a
## new data element and you want to find out which cluster it belongs to.
## This script demonstrates how you can do that by making calls to the following
## two methods of the module:
##
## which_cluster_for_new_data_element()
##
## which_cluster_for_new_data_element_mahalanobis()
##
## Both these methods do the same thing except that the latter uses the
## Mahalanobis metric to measure the distance between the new data element
lib/Algorithm/KMeans.pm
my $best_cluster;
my @dist_from_clust_centers;
foreach my $center (@cluster_centers) {
push @dist_from_clust_centers, $self->distance($ele, $center);
}
my ($min, $best_center_index) = minimum( \@dist_from_clust_centers );
push @{$clusters[$best_center_index]}, $ele if defined $best_center_index;
}
# Since a cluster center may not correspond to any particular sample, it is possible
# for one of the elements of the array @clusters to be undefined using the above
# strategy for populating the initial clusters. Let's say there are five cluster
# centers in the array @cluster_centers. The $best_center_index may populate the
# elements of the array @clusters for the indices 0, 1, 2, 4, which would leave
# $clusters[3] as undefined. So, in what follows, we must first check if all of
# the elements of @clusters are defined.
my @determinants;
foreach my $cluster(@clusters) {
die "The clustering program started with bad initialization. Please start over"
unless defined $cluster;
my $covariance = $self->estimate_cluster_covariance($cluster);
my $determinant = $covariance->det();
lib/Algorithm/KMeans.pm
# Next, set the mask to indicate which columns of the datafile to use for
# clustering and which column contains a symbolic ID for each data record. For
# example, if the symbolic name is in the first column, you want the second column
# to be ignored, and you want the next three columns to be used for 3D clustering,
# you'd set the mask to:
my $mask = "N0111";
# Now construct an instance of the clusterer. The parameter K controls the number
# of clusters. If you know how many clusters you want (let's say 3), call
my $clusterer = Algorithm::KMeans->new( datafile => $datafile,
mask => $mask,
K => 3,
cluster_seeding => 'random',
terminal_output => 1,
write_clusters_to_files => 1,
);
# By default, this constructor call will set you up for clustering based on
lib/Algorithm/KMeans.pm
produces bizarre results, try C<random>.
=item C<use_mahalanobis_metric>:
When set to 1, this option causes Mahalanobis distances to be used for clustering.
The default is 0, in which case the module uses Euclidean distances. In general,
Mahalanobis-distance-based clustering will fail if your data resides on a
lower-dimensional hyperplane in the data space, if you seek too many clusters, or if
you do not have a sufficient number of samples in your data file. To compute
Mahalanobis distances, the module requires the cluster covariance matrices to be
non-singular. (Let's say your
data dimensionality is C<D> and the module is considering a cluster that has only
C<d> samples in it where C<d> is less than C<D>. In this case, the covariance matrix
will be singular since its rank will not exceed C<d>. For the covariance matrix to
be non-singular, it must be of full rank, that is, its rank must be C<D>.)
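The rank argument above can be checked numerically. The following is a minimal sketch in plain Perl, not the module's own code: the helpers C<covariance_matrix> and C<det3> are hypothetical names introduced here for illustration, and the sample points are made up. It computes the sample covariance of a 3-dimensional cluster holding only two samples, and of one holding four samples that span all three dimensions, and prints both determinants.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Algorithm::KMeans): sample covariance
# matrix of a list of equal-length points, returned as a reference to rows.
sub covariance_matrix {
    my @samples = @_;
    my $n   = scalar @samples;
    my $dim = scalar @{ $samples[0] };

    # Per-dimension mean.
    my @mean = (0) x $dim;
    for my $s (@samples) {
        $mean[$_] += $s->[$_] for 0 .. $dim - 1;
    }
    $_ /= $n for @mean;

    # Sum of outer products of the centered samples, divided by n - 1.
    my @cov = map { [ (0) x $dim ] } 1 .. $dim;
    for my $s (@samples) {
        for my $i (0 .. $dim - 1) {
            for my $j (0 .. $dim - 1) {
                $cov[$i][$j] += ($s->[$i] - $mean[$i]) * ($s->[$j] - $mean[$j]);
            }
        }
    }
    for my $row (@cov) {
        $_ /= ($n - 1) for @$row;
    }
    return \@cov;
}

# Hypothetical helper: determinant of a 3x3 matrix by cofactor expansion
# along the first row.
sub det3 {
    my ($m) = @_;
    return $m->[0][0] * ($m->[1][1] * $m->[2][2] - $m->[1][2] * $m->[2][1])
         - $m->[0][1] * ($m->[1][0] * $m->[2][2] - $m->[1][2] * $m->[2][0])
         + $m->[0][2] * ($m->[1][0] * $m->[2][1] - $m->[1][1] * $m->[2][0]);
}

# D = 3 dimensions, but only d = 2 samples in the cluster.
my $too_few = covariance_matrix([1, 2, 3], [2, 4, 7]);

# Four samples that span all three dimensions.
my $enough = covariance_matrix([0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]);

printf "determinant with 2 samples: %g\n", det3($too_few);
printf "determinant with 4 samples: %g\n", det3($enough);
```

With two samples the determinant is zero, so the matrix cannot be inverted and no Mahalanobis distance can be computed for that cluster; with four well-spread samples the determinant is positive.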
=item C<do_variance_normalization>:
When set, the module will first normalize the data variance along the different
dimensions of the data space before attempting clustering. Depending on your data,
this option may or may not result in better clustering.
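As a rough illustration of what variance normalization does, here is a minimal sketch in plain Perl. It assumes that normalization means dividing each coordinate by the sample standard deviation of its column; the module's internal procedure may differ, and C<normalize_variance> is a hypothetical helper name, not part of the module's API.

```perl
use strict;
use warnings;

# Hypothetical helper (an assumption, not the module's internal code):
# rescale every column of a data set to unit sample variance.
sub normalize_variance {
    my @records = @_;
    my $n       = scalar @records;
    my $dim     = scalar @{ $records[0] };
    my @scaled  = map { [ @$_ ] } @records;    # copy so the input is untouched

    for my $col (0 .. $dim - 1) {
        my $mean = 0;
        $mean += $_->[$col] for @records;
        $mean /= $n;

        my $var = 0;
        $var += ($_->[$col] - $mean) ** 2 for @records;
        $var /= ($n - 1);

        next if $var == 0;    # a constant column has no scale to remove

        my $sd = sqrt($var);
        $_->[$col] /= $sd for @scaled;
    }
    return @scaled;
}

# Made-up records: the second column has a far larger spread than the first
# and would otherwise dominate Euclidean distances.
my @data       = ([1, 100], [2, 300], [3, 500]);
my @normalized = normalize_variance(@data);

printf "%.3f %.3f\n", @$_ for @normalized;
```

After rescaling, both columns have the same spread, so neither dominates the distance calculation merely because of its units. Whether that helps depends on whether the original scale differences carried real information about the clusters.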
lib/Algorithm/KMeans.pm
If you also want to include data normalization (it may reduce the performance of the
clusterer in some cases), see the following script:
cluster_after_data_normalization.pl
When you include the data normalization step and you would like to visualize the data
before and after normalization, see the following script:
cluster_and_visualize_with_data_visualization.pl*
After you are done clustering, let's say you want to find the cluster membership of a
new data sample. To see how you can do that, see the script:
which_cluster_for_new_data.pl
This script returns two answers for which cluster a new data sample belongs to: one
using the Euclidean metric to calculate the distances between the new data sample and
the cluster centers, and the other using the Mahalanobis metric. If the clusters are
strongly elliptical in shape, you are likely to get better results with the
Mahalanobis metric. (To see that you can get two different answers using the two
different distance metrics, run the C<which_cluster_for_new_data.pl> script on the
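How the two metrics can disagree for the same new sample can be sketched without the module at all. The following plain-Perl example is an illustration under stated assumptions: the two clusters, their centers, and their covariance matrices are made up, the helpers C<euclidean>, C<mahalanobis>, and C<closest> are hypothetical, and nothing here calls the module's own methods. Cluster A is strongly elongated along the first axis; cluster B is small and round.

```perl
use strict;
use warnings;

# Euclidean distance between two 2D points.
sub euclidean {
    my ($x, $c) = @_;
    return sqrt( ($x->[0] - $c->[0]) ** 2 + ($x->[1] - $c->[1]) ** 2 );
}

# Mahalanobis distance in 2D: sqrt( (x-c)^T S^{-1} (x-c) ), with the 2x2
# covariance matrix S inverted in closed form.
sub mahalanobis {
    my ($x, $c, $cov) = @_;
    my ($s00, $s01) = @{ $cov->[0] };
    my ($s10, $s11) = @{ $cov->[1] };
    my $det = $s00 * $s11 - $s01 * $s10;
    die "covariance matrix is singular\n" if $det == 0;

    my ($dx, $dy) = ($x->[0] - $c->[0], $x->[1] - $c->[1]);
    # S^{-1} = (1/det) * [[s11, -s01], [-s10, s00]]
    my $q = ( $dx * ( $s11 * $dx - $s01 * $dy )
            + $dy * ( -$s10 * $dx + $s00 * $dy ) ) / $det;
    return sqrt($q);
}

# Made-up clusters for illustration only.
my @clusters = (
    { name => 'A', center => [0, 0], cov => [ [16,   0], [0, 0.25] ] },
    { name => 'B', center => [6, 0], cov => [ [0.25, 0], [0, 0.25] ] },
);
my $new = [4, 0];

# Name of the cluster that minimizes the supplied distance function.
sub closest {
    my ($dist) = @_;
    my ($best) = sort { $dist->($a) <=> $dist->($b) } @clusters;
    return $best->{name};
}

my $by_euclid = closest( sub { euclidean($new, $_[0]{center}) } );
my $by_mahal  = closest( sub { mahalanobis($new, $_[0]{center}, $_[0]{cov}) } );

print "Euclidean: $by_euclid, Mahalanobis: $by_mahal\n";
```

The new sample lies closer to the center of B in Euclidean terms (distance 2 versus 4), but it is only one standard deviation from A along A's long axis while being four standard deviations from B, so the Mahalanobis metric assigns it to A.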