
Algorithm-KMeans

##
##
##      2) Next, choose the data mask to apply to the columns of the data file.  The
##           position of the letter `N' in the mask indicates the column that
##           contains a symbolic name for each data record.  If the symbolic name for
##           each data record is in the first column and you want to cluster 3D data
##           that is in the next three columns, your data mask will be N111.  On the
##           other hand, if for the same data file, you want to carry out 2D
##           clustering on the last two columns, your data mask will be N011.
##
##      3) Next, you need to decide how many clusters you want the program to return.
##           If you want the program to figure out on its own how many clusters to 
##           partition the data into, see the script find_best_K_and_cluster.pl in this
##           directory.
##
##      4) Next you need to decide whether you want `random' seeding or `smart'
##           seeding.  Bear in mind that `smart' seeding may produce worse results
##           than `random' seeding, depending on how the data clusters are actually
##           distributed.  
##
##      5) Next you need to decide whether or not you want to use the Mahalanobis
##           distance metric for clustering.  The default is the Euclidean metric.
##
##      6) Finally, you need to choose a mask for visualization.  Here is one reason
##           why the visualization mask is set independently of the data mask
##           that was specified in Step 2: Let's say your datafile has 8 columns and
##           you are choosing to cluster the data records using 4 of those.
##           Subsequently, you may want to visually examine the quality of clustering
##           by examining some 2D or 3D subspace of the 4-dimensional space
##           used for clustering

lib/Algorithm/KMeans.pm

The data file is expected to contain entries in the following format:

   c20  0  10.7087017086940  9.63528386251712  10.9512155258108  ...
   c7   0  12.8025925026787  10.6126270065785  10.5228482095349  ...
   b9   0  7.60118206283120  5.05889245193079  5.82841781759102  ...
   ....
   ....

where the first column contains the symbolic ID tag for each data record and the rest
of the columns the numerical information.  Which columns are actually used for
clustering is decided by the string value of the mask.  For example, if we wanted to
cluster on the basis of the entries in just the 3rd, the 4th, and the 5th columns
above, the mask value would be C<N0111> where the character C<N> indicates that the
ID tag is in the first column, the character C<0> that the second column is to be
ignored, and the C<1>'s that follow that the 3rd, the 4th, and the 5th columns are to
be used for clustering.
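
To make the mask convention concrete, here is a minimal sketch of how a mask such as C<N0111> could be applied to one record. The helper C<apply_mask> is hypothetical and is not part of Algorithm::KMeans; it only illustrates the column selection described above, assuming whitespace-separated fields and a mask made up of the characters C<N>, C<0>, and C<1>.

```perl
use strict;
use warnings;

# Hypothetical helper (NOT part of Algorithm::KMeans): split one record
# into its symbolic ID tag and the numeric fields selected by the mask.
sub apply_mask {
    my ($mask, $line) = @_;
    my @flags  = split //, $mask;     # one mask character per column
    my @fields = split ' ', $line;    # whitespace-separated columns
    die "record has fewer fields than the mask has characters\n"
        if @fields < @flags;
    my ($tag, @coords);
    for my $i (0 .. $#flags) {
        if    ($flags[$i] eq 'N') { $tag = $fields[$i] }          # ID tag column
        elsif ($flags[$i] eq '1') { push @coords, $fields[$i] }   # used for clustering
        elsif ($flags[$i] ne '0') {                               # '0' means ignore
            die "unexpected mask character '$flags[$i]'\n";
        }
    }
    return ($tag, \@coords);
}

my $record = 'c20  0  10.7087017086940  9.63528386251712  10.9512155258108';
my ($tag, $coords) = apply_mask('N0111', $record);
print "$tag: @$coords\n";
# prints "c20: 10.7087017086940 9.63528386251712 10.9512155258108"
```

Under the same assumptions, calling the helper with the mask C<N0011> on this record would keep only the last two of the five columns shown.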

If you wish for the clusterer to search through a C<(Kmin,Kmax)> range of values for
C<K>, the constructor should be called in the following fashion:

    my $clusterer = Algorithm::KMeans->new(datafile => $datafile,
