Algorithm-KMeans

 view release on metacpan or  search on metacpan

examples/cluster_and_visualize.pl  view on Meta::CPAN



# Mask: (For emphasis, this is a slightly more detailed repetition of the comment
# made above in Item 2)

# The mask tells the module which columns of the data file are are to be used for
# clustering, which columns are to be ignored, and which column contains a symbolic
# ID tag for a data point.  If the ID tag is in the first column and you are
# clustering 3D data in a file that has just four columns, the mask would be "N111".
# Note the first character in the mask in this case is `N' for "Name".  If, on the
# other hand, you wanted to ignore the first data coordinate (which is in the second
# column of the data file) for clustering, the mask would be "N011".  The symbolic ID
# can be in any column --- you just have to place the character `N' at the right
# place:


my $mask = "N111";         # for mydatafile1.dat, mydatafile2.dat, and sphericaldata.csv 
#my $mask = "N011";        # for mydatafile1.dat --- use all only last two cols
#my $mask = "N100";        # for mydatafile1.dat --- use only the first coordinate
#my $mask = "N11";         # for mydatafile3.dat

lib/Algorithm/KMeans.pm  view on Meta::CPAN

}

#  It makes sense to call visualize_data() only AFTER you have called the method
#  read_data_from_file().
#
#  The visualize_data() is meant for the visualization of the original data in its
#  various 2D or 3D subspaces.  The method can also be used to visualize the normed
#  data in a similar manner.  Recall the normed data is the original data after each
#  data dimension is normalized by the standard-deviation along that dimension.
#
#  Whether you see the original data or the normed data depends on the second
#  argument supplied in the method call.  It must be either the string 'original' or
#  the string 'normed'.
sub visualize_data {
    my $self = shift;
    my $v_mask = shift || croak "visualization mask missing";
    my $datatype = shift;    # must be either 'original' or 'normed'
    croak "\n\nABORTED: You called visualize_data() for normed data " .
          "but without first turning on data normalization in the " .
          "in the KMeans constructor"
          if ($datatype eq 'normed') && ! $self->{_var_normalize}; 

lib/Algorithm/KMeans.pm  view on Meta::CPAN

  # In all cases, you'd obviously begin with

  use Algorithm::KMeans;

  # You'd then name the data file:

  my $datafile = "mydatafile.csv";

  # Next, set the mask to indicate which columns of the datafile to use for
  # clustering and which column contains a symbolic ID for each data record. For
  # example, if the symbolic name is in the first column, you want the second column
  # to be ignored, and you want the next three columns to be used for 3D clustering,
  # you'd set the mask to:

  my $mask = "N0111";

  # Now construct an instance of the clusterer.  The parameter K controls the number
  # of clusters.  If you know how many clusters you want (let's say 3), call

  my $clusterer = Algorithm::KMeans->new( datafile        => $datafile,
                                          mask            => $mask,

lib/Algorithm/KMeans.pm  view on Meta::CPAN

                                          mask            => $mask,
                                          K               => 3,
                                          cluster_seeding => 'random',
                                          use_mahalanobis_metric => 1,
                                          terminal_output => 1,
                                          write_clusters_to_files => 1,
                                        );

  # For both constructor calls shown above, you can use smart seeding of the clusters
  # by changing 'random' to 'smart' for the cluster_seeding option.  See the
  # explanation of smart seeding in the Methods section of this documentation.

  # If your data is such that its variability along the different dimensions of the
  # data space is significantly different, you may get better clustering if you first
  # normalize your data by setting the constructor parameter
  # do_variance_normalization as shown below:

  my $clusterer = Algorithm::KMeans->new( datafile => $datafile,
                                          mask     => $mask,
                                          K        => 3,
                                          cluster_seeding => 'smart',    # or 'random'

lib/Algorithm/KMeans.pm  view on Meta::CPAN


Obviously, before the two-step approach can proceed, we need to initialize the the
cluster centers.  How this initialization is carried out is important.  The module
gives you two very different ways for carrying out this initialization.  One option,
called the C<smart> option, consists of subjecting the data to principal components
analysis to discover the direction of maximum variance in the data space.  The data
points are then projected on to this direction and a histogram constructed from the
projections.  Centers of the smoothed histogram are used to seed the clustering
operation.  The other option is to choose the cluster centers purely randomly.  You
get the first option if you set C<cluster_seeding> to C<smart> in the constructor,
and you get the second option if you set it to C<random>.

How to specify the number of clusters, C<K>, is one of the most vexing issues in any
approach to clustering.  In some case, we can set C<K> on the basis of prior
knowledge.  But, more often than not, no such prior knowledge is available.  When the
programmer does not explicitly specify a value for C<K>, the approach taken in the
current implementation is to try all possible values between 2 and some largest
possible value that makes statistical sense.  We then choose that value for C<K>
which yields the best value for the QoC (Quality of Clustering) metric.  It is
generally believed that the largest value for C<K> should not exceed C<sqrt(N/2)>
where C<N> is the number of data samples to be clustered.

lib/Algorithm/KMeans.pm  view on Meta::CPAN

   c7   0  12.8025925026787  10.6126270065785  10.5228482095349  ...
   b9   0  7.60118206283120  5.05889245193079  5.82841781759102  ...
   ....
   ....

where the first column contains the symbolic ID tag for each data record and the rest
of the columns the numerical information.  As to which columns are actually used for
clustering is decided by the string value of the mask.  For example, if we wanted to
cluster on the basis of the entries in just the 3rd, the 4th, and the 5th columns
above, the mask value would be C<N0111> where the character C<N> indicates that the
ID tag is in the first column, the character C<0> that the second column is to be
ignored, and the C<1>'s that follow that the 3rd, the 4th, and the 5th columns are to
be used for clustering.

If you wish for the clusterer to search through a C<(Kmin,Kmax)> range of values for
C<K>, the constructor should be called in the following fashion:

    my $clusterer = Algorithm::KMeans->new(datafile => $datafile,
                                           mask     => $mask,
                                           Kmin     => 3,
                                           Kmax     => 10,

lib/Algorithm/KMeans.pm  view on Meta::CPAN

    $clusterer->read_data_from_file()

=item B<kmeans()>

    $clusterer->kmeans();

    or 

    my ($clusters_hash, $cluster_centers_hash) = $clusterer->kmeans();

The first call above works solely by side-effect.  The second call also returns the
clusters and the cluster centers. See the C<cluster_and_visualize.pl> script in the
C<examples> directory for how you can in your own code extract the clusters and the
cluster centers from the variables C<$clusters_hash> and C<$cluster_centers_hash>.

=item B<get_K_best()>

    $clusterer->get_K_best();

This call makes sense only if you supply either the C<K=0> option to the constructor,
or if you specify values for the C<Kmin> and C<Kmax> options. The C<K=0> and the

lib/Algorithm/KMeans.pm  view on Meta::CPAN

The visualization mask here does not have to be identical to the one used for
clustering, but must be a subset of that mask.  This is convenient for visualizing
the clusters in two- or three-dimensional subspaces of the original space.

=item B<visualize_data()>

    $clusterer->visualize_data($visualization_mask, 'original');

    $clusterer->visualize_data($visualization_mask, 'normed');

This method requires a second argument and, as shown, it must be either the string
C<original> or the string C<normed>, the former for the visualization of the raw data
and the latter for the visualization of the data after its different dimensions are
normalized by the standard-deviations along those directions.  If you call the method
with the second argument set to C<normed>, but do so without turning on the
C<do_variance_normalization> option in the KMeans constructor, it will let you know.


=item  B<which_cluster_for_new_data_element()>

If you wish to determine the cluster membership of a new data sample after you have
created the clusters with the existing data samples, you would need to call this
method. The C<which_cluster_for_new_data.pl> script in the C<examples> directory
shows how to use this method.



( run in 0.713 second using v1.01-cache-2.11-cpan-39bf76dae61 )