Algorithm-KMeans
lib/Algorithm/KMeans.pm
my $clusterer = Algorithm::KMeans->new( datafile                => $datafile,
                                        mask                    => $mask,
                                        K                       => 3,
                                        cluster_seeding         => 'random',
                                        use_mahalanobis_metric  => 1,
                                        terminal_output         => 1,
                                        write_clusters_to_files => 1,
                                      );
# For both constructor calls shown above, you can use smart seeding of the clusters
# by changing 'random' to 'smart' for the cluster_seeding option. See the
# explanation of smart seeding in the Methods section of this documentation.
# If your data is such that its variability along the different dimensions of the
# data space is significantly different, you may get better clustering if you first
# normalize your data by setting the constructor parameter
# do_variance_normalization as shown below:
my $clusterer = Algorithm::KMeans->new( datafile                  => $datafile,
                                        mask                      => $mask,
                                        K                         => 3,
                                        cluster_seeding           => 'smart',    # or 'random'
                                        terminal_output           => 1,
                                        do_variance_normalization => 1,
                                        write_clusters_to_files   => 1,
                                      );
# But bear in mind that such data normalization may actually decrease the
# performance of the clusterer if the variability in the data is more a result of
# the separation between the means than a consequence of intra-cluster variance.
# Set K to 0 if you want the module to figure out the optimum number of clusters
# from the data. (It is best to run this option with terminal_output set to 1
# so that you can see the QoC value for each value of K):
my $clusterer = Algorithm::KMeans->new( datafile                => $datafile,
                                        mask                    => $mask,
                                        K                       => 0,
                                        cluster_seeding         => 'random',    # or 'smart'
                                        terminal_output         => 1,
                                        write_clusters_to_files => 1,
                                      );
# Although not shown above, you can obviously set the 'do_variance_normalization'
# flag here also if you wish.
# For very large data files, setting K to 0 will result in searching through too
# many values for K. For such cases, you can limit the range of K values to search
# through by setting Kmin and Kmax:
my $clusterer = Algorithm::KMeans->new( datafile                => $datafile,
                                        mask                    => "N111",
                                        Kmin                    => 3,
                                        Kmax                    => 10,
                                        cluster_seeding         => 'random',    # or 'smart'
                                        terminal_output         => 1,
                                        write_clusters_to_files => 1,
                                      );
# FOR ALL CASES ABOVE, YOU'D NEED TO MAKE THE FOLLOWING CALLS ON THE CLUSTERER
# INSTANCE TO ACTUALLY CLUSTER THE DATA:
$clusterer->read_data_from_file();
$clusterer->kmeans();
# If you want to directly access the clusters and the cluster centers in your own
# top-level script, replace the above two statements with:
$clusterer->read_data_from_file();
my ($clusters_hash, $cluster_centers_hash) = $clusterer->kmeans();
# You can subsequently access the clusters directly in your own code, as in:
foreach my $cluster_id (sort keys %{$clusters_hash}) {
    print "\n$cluster_id => @{$clusters_hash->{$cluster_id}}\n";
}
foreach my $cluster_id (sort keys %{$cluster_centers_hash}) {
    print "\n$cluster_id => @{$cluster_centers_hash->{$cluster_id}}\n";
}
# CLUSTER VISUALIZATION:
# You must first set the mask for cluster visualization. This mask tells the module
# which 2D or 3D subspace of the original data space you wish to visualize the
# clusters in:
my $visualization_mask = "111";
$clusterer->visualize_clusters($visualization_mask);
# SYNTHETIC DATA GENERATION:
# The module has been provided with a class method for generating multivariate data
# for experimenting with clustering. The data generation is controlled by the
# contents of the parameter file that is supplied as an argument to the data
# generator method. The mean and covariance matrix entries in the parameter file
# must follow the syntax shown in the param.txt file in the examples
# directory. It is best to edit this file as needed:
my $parameter_file = "param.txt";
my $out_datafile = "mydatafile.dat";
Algorithm::KMeans->cluster_data_generator(
                        input_parameter_file           => $parameter_file,
                        output_datafile                => $out_datafile,
                        number_data_points_per_cluster => $N );
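As a rough illustration of what the do_variance_normalization option discussed earlier in this synopsis is meant to achieve, the standalone sketch below scales each data dimension by its standard deviation, which is one common way to equalize variability across dimensions. It does not use the module's internals and may differ from what the module actually does; the subroutine name normalize_by_stddev and the sample data are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Algorithm::KMeans): divide each coordinate
# by the standard deviation of its dimension so that every dimension ends up
# with unit variance.
sub normalize_by_stddev {
    my @points = @_;                            # array refs, one per data record
    my $n    = scalar @points;
    my $last = $#{ $points[0] };
    my @normalized = map { [ @$_ ] } @points;   # copy so the input is untouched
    for my $d (0 .. $last) {
        my $mean = 0;
        $mean += $_->[$d] / $n for @points;
        my $var = 0;
        $var += ($_->[$d] - $mean) ** 2 / $n for @points;
        my $sd = sqrt($var);
        next if $sd == 0;                       # constant dimension: leave as is
        $_->[$d] /= $sd for @normalized;
    }
    return @normalized;
}

# The second dimension varies on a scale 100 times larger than the first.
my @data   = ( [1, 100], [2, 300], [3, 200] );
my @scaled = normalize_by_stddev(@data);
print join(" ", map { sprintf("%.3f", $_) } @$_), "\n" for @scaled;
```

After scaling, both dimensions contribute comparably to a distance computation. As the synopsis warns, this can hurt when most of the variability comes from the separation between the cluster means.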
=head1 CHANGES
Version 2.05 removes the restriction on the version of Perl that is required. This
is based on Srezic's recommendation. He had no problem building and testing the
previous version with Perl 5.8.9. Version 2.05 also includes a small augmentation of
the code in the method C<read_data_from_file_csv()> for guarding against user errors
in the specification of the mask that tells the module which columns of the data file
are to be used for clustering.
Version 2.04 allows you to use CSV data files for clustering.
Version 2.03 incorporates minor code cleanup. The main implementation of the module
remains unchanged.
Version 2.02 downshifts the version of Perl that is required for this module. The
module should work with versions 5.10 and higher of Perl. The implementation code
for the module remains unchanged.
Version 2.01 removes many errors in the documentation. The changes made to the module
in Version 2.0 were not reflected properly in the documentation page for that
version. The implementation code remains unchanged.
Version 2.0 includes significant additional functionality: (1) You now have the
option to cluster using the Mahalanobis distance metric (the default is the Euclidean
metric); and (2) With the two C<which_cluster> methods that have been added to the
module, you can now determine the best cluster for a new data sample after you have
created the clusters with the previously available data. Finding the best cluster
for a new data sample can be done using either the Euclidean metric or the
Mahalanobis metric.
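To make the difference between the two metrics concrete, here is a minimal sketch for two-dimensional data. It is written independently of the module: the subroutine names euclidean_distance and mahalanobis_distance_2d are hypothetical, and the covariance matrix is supplied by hand rather than estimated from a cluster.

```perl
use strict;
use warnings;

# Euclidean distance between two points given as array refs.
sub euclidean_distance {
    my ($x, $y) = @_;
    my $sum = 0;
    $sum += ($x->[$_] - $y->[$_]) ** 2 for 0 .. $#$x;
    return sqrt($sum);
}

# Mahalanobis distance of a 2D point from a cluster mean, given the
# cluster's 2x2 covariance matrix as [[s00, s01], [s10, s11]].
sub mahalanobis_distance_2d {
    my ($x, $mean, $cov) = @_;
    my ($s00, $s01, $s10, $s11) = ( @{ $cov->[0] }, @{ $cov->[1] } );
    my $det = $s00 * $s11 - $s01 * $s10;
    die "covariance matrix is singular\n" if $det == 0;
    my @diff = ( $x->[0] - $mean->[0], $x->[1] - $mean->[1] );
    # Quadratic form diff^T * inverse(cov) * diff, with the 2x2 inverse expanded.
    my $q = (   $s11 * $diff[0] ** 2
              - ($s01 + $s10) * $diff[0] * $diff[1]
              + $s00 * $diff[1] ** 2 ) / $det;
    return sqrt($q);
}

# A cluster centered at the origin that is elongated along the first axis.
my $mean = [0, 0];
my $cov  = [ [4, 0], [0, 1] ];
for my $point ( [2, 0], [0, 2] ) {
    printf "point (%s): Euclidean %.2f, Mahalanobis %.2f\n",
        join(", ", @$point),
        euclidean_distance($point, $mean),
        mahalanobis_distance_2d($point, $mean, $cov);
}
```

Both sample points are equally far from the mean in the Euclidean sense, but the Mahalanobis distance treats the point lying along the elongated axis as closer, because the cluster spreads more in that direction.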
Version 1.40 includes a C<smart> option for seeding the clusters. This option,
supplied through the constructor parameter C<cluster_seeding>, means that the
clusterer will (1) Subject the data to principal components analysis in order to
determine the maximum variance direction; (2) Project the data onto this direction;
(3) Find peaks in a smoothed histogram of the projected points; and (4) Use the
locations of the highest peaks as initial guesses for the cluster centers. If you
don't want to use this option, set C<cluster_seeding> to C<random>. That should work
as in the previous version of the module.
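The four steps of smart seeding can be sketched in plain Perl as follows. This is an illustrative approximation, not the module's code: it uses power iteration in place of a full principal components analysis, a simple (1, 2, 1) smoothing kernel, and a made-up rule for turning a histogram peak into a seed point. The subroutine names principal_direction and smart_seeds, the bin count, and the sample data are all hypothetical.

```perl
use strict;
use warnings;

# Step (1): estimate the maximum-variance direction by power iteration on
# the covariance matrix (a simple stand-in for full PCA).
sub principal_direction {
    my @points = @_;
    my $n    = scalar @points;
    my $last = $#{ $points[0] };
    my @mean = (0) x ($last + 1);
    for my $p (@points) {
        $mean[$_] += $p->[$_] / $n for 0 .. $last;
    }
    my @cov = map { [ (0) x ($last + 1) ] } 0 .. $last;
    for my $p (@points) {
        for my $i (0 .. $last) {
            $cov[$i][$_] += ($p->[$i] - $mean[$i]) * ($p->[$_] - $mean[$_]) / $n
                for 0 .. $last;
        }
    }
    # Caveat: a start vector orthogonal to the dominant direction would fail.
    my @v = (1) x ($last + 1);
    for my $iter (1 .. 100) {
        my @w;
        for my $i (0 .. $last) {
            $w[$i] = 0;
            $w[$i] += $cov[$i][$_] * $v[$_] for 0 .. $last;
        }
        my $norm = 0;
        $norm += $_ ** 2 for @w;
        $norm = sqrt($norm);
        last if $norm == 0;
        @v = map { $_ / $norm } @w;
    }
    return @v;
}

# Steps (2)-(4): project, build and smooth a histogram, pick the K highest peaks.
sub smart_seeds {
    my ($k, $bins, @points) = @_;
    my @dir = principal_direction(@points);
    my @proj;
    for my $p (@points) {
        my $s = 0;
        $s += $p->[$_] * $dir[$_] for 0 .. $#dir;
        push @proj, $s;
    }
    my ($min, $max) = (sort { $a <=> $b } @proj)[0, -1];
    my $width = ($max - $min) / $bins || 1;
    my @hist = (0) x $bins;
    for my $x (@proj) {
        my $bin = int( ($x - $min) / $width );
        $bin = $bins - 1 if $bin >= $bins;
        $hist[$bin]++;
    }
    # Smooth with a (1, 2, 1) / 4 kernel; bins outside the range count as 0.
    my @smooth;
    for my $i (0 .. $bins - 1) {
        my $left  = $i > 0         ? $hist[$i - 1] : 0;
        my $right = $i < $bins - 1 ? $hist[$i + 1] : 0;
        $smooth[$i] = ($left + 2 * $hist[$i] + $right) / 4;
    }
    # A peak is a bin with a positive smoothed count that is at least as high
    # as its left neighbor and strictly higher than its right neighbor.
    my @peaks;
    for my $i (0 .. $bins - 1) {
        my $left  = $i > 0         ? $smooth[$i - 1] : 0;
        my $right = $i < $bins - 1 ? $smooth[$i + 1] : 0;
        push @peaks, $i
            if $smooth[$i] > 0 && $smooth[$i] >= $left && $smooth[$i] > $right;
    }
    my @best = sort { $smooth[$b] <=> $smooth[$a] } @peaks;
    splice @best, $k if @best > $k;
    # Assumption: seed each cluster with the data point whose projection
    # lies closest to the center of a peak bin.
    my @seeds;
    for my $bin (@best) {
        my $center = $min + ($bin + 0.5) * $width;
        my ($nearest) = sort { abs($proj[$a] - $center) <=> abs($proj[$b] - $center) }
                        0 .. $#proj;
        push @seeds, $points[$nearest];
    }
    return @seeds;
}

# Two well-separated groups of 2D points.
my @data = ( [0, 0], [0, 1], [1, 0], [1, 1],
             [10, 10], [10, 11], [11, 10], [11, 11] );
print "seed: @$_\n" for smart_seeds(2, 10, @data);
```

With the sample data, the projection onto the dominant direction separates the two groups into opposite ends of the histogram, so each group contributes one seed.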
Version 1.30 includes a bug fix for the case when the datafile contains empty lines,
that is, lines with no data records. Another bug fix in Version 1.30 deals with the
case when you want the module to figure out how many clusters to form (this is the
C<K=0> option in the constructor call) and the number of data records is close to the
minimum.