Algorithm-LinearManifoldDataClusterer
lib/Algorithm/LinearManifoldDataClusterer.pm view on Meta::CPAN
$squared_sum += ($ele1[$i] - $ele2[$i])**2;
}
my $dist = sqrt $squared_sum;
return $dist;
}
sub write_clusters_to_files {
my $self = shift;
my $clusters = shift;
my @clusters = @$clusters;
unlink glob "cluster*.txt";
foreach my $i (0..@clusters-1) {
my $filename = "cluster" . $i . ".txt";
print "\nWriting cluster $i to file $filename\n" if $self->{_terminal_output};
open FILEHANDLE, "| sort > $filename" or die "Unable to open file: $!";
foreach my $ele (@{$clusters[$i]}) {
print FILEHANDLE "$ele ";
}
close FILEHANDLE;
}
}
sub DESTROY {
my $filename = basename($_[0]->{_datafile});
$filename =~ s/\.\w+$/\.txt/;
unlink "__temp_" . $filename;
}
################################## Visualization Code ###################################
sub add_point_coords {
my $self = shift;
my @arr_of_ids = @{shift @_}; # array of data element names
my @result;
my $data_dimensionality = $self->{_data_dimensions};
foreach my $i (0..$data_dimensionality-1) {
$result[$i] = 0.0;
}
foreach my $id (@arr_of_ids) {
my $ele = $self->{_data_hash}->{$id};
my $i = 0;
foreach my $component (@$ele) {
$result[$i] += $component;
$i++;
}
}
return \@result;
}
# This is the main module version:
sub visualize_clusters_on_sphere {
my $self = shift;
my $visualization_msg = shift;
my $clusters = deep_copy_AoA(shift);
my $hardcopy_format = shift;
my $pause_time = shift;
my $d = $self->{_data_dimensions};
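# Use the basename so the temp file lands in the current directory and
# matches the name that DESTROY removes:
my $temp_file = "__temp_" . basename($self->{_datafile});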
$temp_file =~ s/\.\w+$/\.txt/;
unlink $temp_file if -e $temp_file;
open OUTPUT, ">$temp_file"
or die "Unable to open a temp file in this directory: $!";
my @all_tags = "A".."Z";
my @retagged_clusters;
foreach my $cluster (@$clusters) {
my $label = shift @all_tags;
my @retagged_cluster =
map {$_ =~ s/^(\w+?)_(\w+)/$label . "_$2 @{$self->{_data_hash}->{$_}}"/e;$_} @$cluster;
push @retagged_clusters, \@retagged_cluster;
}
my %clusters;
foreach my $cluster (@retagged_clusters) {
foreach my $record (@$cluster) {
my @splits = grep $_, split /\s+/, $record;
$splits[0] =~ /(\w+?)_.*/;
my $primary_cluster_label = $1;
my @coords = @splits[1..$d];
push @{$clusters{$primary_cluster_label}}, \@coords;
}
}
foreach my $key (sort {"\L$a" cmp "\L$b"} keys %clusters) {
map {print OUTPUT "$_"} map {"@$_\n"} @{$clusters{$key}};
print OUTPUT "\n\n";
}
my @sorted_cluster_keys = sort {"\L$a" cmp "\L$b"} keys %clusters;
close OUTPUT;
my $plot;
unless (defined $pause_time) {
$plot = Graphics::GnuplotIF->new( persist => 1 );
} else {
$plot = Graphics::GnuplotIF->new();
}
my $arg_string = "";
$plot->gnuplot_cmd( "set hidden3d" ) unless $self->{_show_hidden_in_3D_plots};
$plot->gnuplot_cmd( "set title \"$visualization_msg\"" );
$plot->gnuplot_cmd( "set noclip" );
$plot->gnuplot_cmd( "set pointsize 2" );
$plot->gnuplot_cmd( "set parametric" );
$plot->gnuplot_cmd( "set size ratio 1" );
$plot->gnuplot_cmd( "set xlabel \"X\"" );
$plot->gnuplot_cmd( "set ylabel \"Y\"" );
$plot->gnuplot_cmd( "set zlabel \"Z\"" );
if ($hardcopy_format) {
$plot->gnuplot_cmd( "set terminal png" );
my $image_file_name = "$visualization_msg\.$hardcopy_format";
$plot->gnuplot_cmd( "set output \"$image_file_name\"" );
$plot->gnuplot_cmd( "unset hidden3d" );
}
# set the range for azimuth angles:
$plot->gnuplot_cmd( "set urange [0:2*pi]" );
# set the range for the elevation angles:
$plot->gnuplot_cmd( "set vrange [-pi/2:pi/2]" );
# Parametric functions for the sphere
# $plot->gnuplot_cmd( "r=1" );
if ($self->{_scale_factor}) {
$plot->gnuplot_cmd( "r=$self->{_scale_factor}" );
} else {
$plot->gnuplot_cmd( "r=1" );
}
$plot->gnuplot_cmd( "fx(v,u) = r*cos(v)*cos(u)" );
$plot->gnuplot_cmd( "fy(v,u) = r*cos(v)*sin(u)" );
$plot->gnuplot_cmd( "fz(v) = r*sin(v)" );
my $sphere_arg_str = "fx(v,u),fy(v,u),fz(v) notitle with lines lt 0,";
foreach my $i (0..scalar(keys %clusters)-1) {
my $j = $i + 1;
# The following statement puts the titles on the data points
$arg_string .= "\"$temp_file\" index $i using 1:2:3 title \"$sorted_cluster_keys[$i] \" with points lt $j pt $j, ";
}
$arg_string =~ s/,[ ]+$//;    # drop the trailing comma left by the loop above
$plot->gnuplot_cmd( "splot $sphere_arg_str $arg_string" );
$plot->gnuplot_pause( $pause_time ) if defined $pause_time;
}
################################### Support Routines ########################################
sub cluster_split {
my $cluster = shift;
my $how_many = shift;
my @cluster_fragments;
foreach my $i (0..$how_many-1) {
$cluster_fragments[$i] = [];
}
my $delta = int( scalar(@$cluster) / $how_many );
my $j = 0;
lib/Algorithm/LinearManifoldDataClusterer.pm view on Meta::CPAN
}
}
push @covariances, $cluster_covariance;
}
if ($self->{_debug}) {
foreach my $i (0..$K-1) {
print "\n\nCluster center: @{$cluster_centers[$i]}\n";
print "\nCovariance:\n";
foreach my $j (0..1) {
foreach my $k (0..1) {
print "$covariances[$i]->[$j]->[$k] ";
}
print "\n";
}
}
}
my @data_dump;
foreach my $i (0..$K-1) {
my @m = @{shift @cluster_centers};
my @covar = @{shift @covariances};
my @new_data = Math::Random::random_multivariate_normal($N, @m, @covar);
if ($self->{_debug}) {
print "\nThe points for cluster $i:\n";
map { print "@$_ "; } @new_data;
print "\n\n";
}
my @wrapped_data;
foreach my $d (@new_data) {
my $wrapped_d;
if ($d->[0] >= 360.0) {
$wrapped_d->[0] = $d->[0] - 360.0;
} elsif ($d->[0] < 0) {
$wrapped_d->[0] = 360.0 - abs($d->[0]);
}
if ($d->[1] >= 90.0) {
$wrapped_d->[0] = POSIX::fmod($d->[0] + 180.0, 360);
$wrapped_d->[1] = 180.0 - $d->[1];
} elsif ($d->[1] < -90.0) {
$wrapped_d->[0] = POSIX::fmod($d->[0] + 180, 360);
$wrapped_d->[1] = -180.0 - $d->[1];
}
$wrapped_d->[0] = $d->[0] unless defined $wrapped_d->[0];
$wrapped_d->[1] = $d->[1] unless defined $wrapped_d->[1];
push @wrapped_data, $wrapped_d;
}
if ($self->{_debug}) {
print "\nThe wrapped points for cluster $i:\n";
map { print "@$_ "; } @wrapped_data;
print "\n\n";
}
my $label = $point_labels[$i];
my $j = 0;
@new_data = map {unshift @$_, $label."_".$j; $j++; $_} @wrapped_data;
push @data_dump, @new_data;
}
if ($self->{_debug}) {
print "\n\nThe labeled points for clusters:\n";
map { print "@$_\n"; } @data_dump;
}
fisher_yates_shuffle( \@data_dump );
open OUTPUT, ">$output_file" or die "Unable to open $output_file for writing: $!";
my $total_num_of_points = $N * $K;
print "Total number of data points that will be written out to the file: $total_num_of_points\n"
if $self->{_debug};
foreach my $ele (@data_dump) {
my ($x,$y,$z);
my $label = $ele->[0];
my $azimuth = $ele->[1];
my $elevation = $ele->[2];
$x = cos($elevation) * cos($azimuth);
$y = cos($elevation) * sin($azimuth);
$z = sin($elevation);
my $csv_str = join ",", ($label,$x,$y,$z);
print OUTPUT "$csv_str\n";
}
print "\n\n";
print "Data written out to file $output_file\n" if $self->{_debug};
close OUTPUT;
}
# This version for the embedded class for data generation
sub visualize_data_on_sphere {
my $self = shift;
my $datafile = shift;
my $filename = File::Basename::basename($datafile);
my $temp_file = "__temp_" . $filename;
$temp_file =~ s/\.\w+$/\.txt/;
unlink $temp_file if -e $temp_file;
open OUTPUT, ">$temp_file"
or die "Unable to open a temp file in this directory: $!";
# Open the file at the path supplied by the caller, not just its basename:
open INPUT, "<", $datafile or die "Unable to open $datafile: $!";
local $/ = undef;
my @all_records = split /\s+/, <INPUT>;
close INPUT;
my %clusters;
foreach my $record (@all_records) {
my @splits = split /,/, $record;
my $record_name = shift @splits;
$record_name =~ /(\w+?)_.*/;
my $primary_cluster_label = $1;
push @{$clusters{$primary_cluster_label}}, \@splits;
}
foreach my $key (sort {"\L$a" cmp "\L$b"} keys %clusters) {
map {print OUTPUT "$_"} map {"@$_\n"} @{$clusters{$key}};
print OUTPUT "\n\n";
}
my @sorted_cluster_keys = sort {"\L$a" cmp "\L$b"} keys %clusters;
close OUTPUT;
my $plot = Graphics::GnuplotIF->new( persist => 1 );
my $arg_string = "";
$plot->gnuplot_cmd( "set noclip" );
$plot->gnuplot_cmd( "set hidden3d" ) unless $self->{_show_hidden_in_3D_plots};
$plot->gnuplot_cmd( "set pointsize 2" );
$plot->gnuplot_cmd( "set parametric" );
$plot->gnuplot_cmd( "set size ratio 1" );
$plot->gnuplot_cmd( "set xlabel \"X\"" );
$plot->gnuplot_cmd( "set ylabel \"Y\"" );
$plot->gnuplot_cmd( "set zlabel \"Z\"" );
# set the range for azimuth angles:
$plot->gnuplot_cmd( "set urange [0:2*pi]" );
# set the range for the elevation angles:
$plot->gnuplot_cmd( "set vrange [-pi/2:pi/2]" );
# Parametric functions for the sphere
$plot->gnuplot_cmd( "r=1" );
$plot->gnuplot_cmd( "fx(v,u) = r*cos(v)*cos(u)" );
$plot->gnuplot_cmd( "fy(v,u) = r*cos(v)*sin(u)" );
$plot->gnuplot_cmd( "fz(v) = r*sin(v)" );
my $sphere_arg_str = "fx(v,u),fy(v,u),fz(v) notitle with lines lt 0,";
foreach my $i (0..scalar(keys %clusters)-1) {
my $j = $i + 1;
# The following statement puts the titles on the data points
$arg_string .= "\"$temp_file\" index $i using 1:2:3 title \"$sorted_cluster_keys[$i] \" with points lt $j pt $j, ";
}
$arg_string =~ s/,[ ]+$//;    # drop the trailing comma left by the loop above
# $plot->gnuplot_cmd( "splot $arg_string" );
$plot->gnuplot_cmd( "splot $sphere_arg_str $arg_string" );
}
sub DESTROY {
use File::Basename;
my $filename = basename($_[0]->{_output_file});
$filename =~ s/\.\w+$/\.txt/;
unlink "__temp_" . $filename;
}
# from perl docs:
sub fisher_yates_shuffle {
my $arr = shift;
return unless @$arr;    # must not be empty, or --$i below never reaches zero
my $i = @$arr;
while (--$i) {
my $j = int rand( $i + 1 );
@$arr[$i, $j] = @$arr[$j, $i];
}
}
1;
=pod
=head1 NAME
Algorithm::LinearManifoldDataClusterer - for clustering data that resides on a
low-dimensional manifold in a high-dimensional measurement space
=head1 SYNOPSIS
# You'd begin with:
lib/Algorithm/LinearManifoldDataClusterer.pm view on Meta::CPAN
=over 8
=item C<output_file>:
The numeric values are generated using a bivariate Gaussian distribution whose two
independent variables are the azimuth and the elevation angles on the surface of a
unit sphere. The mean of each cluster is chosen randomly and its covariance is set in
proportion to the value supplied for the C<cluster_width> parameter.
=item C<cluster_width>:
This parameter controls the spread of each cluster on the surface of the unit sphere.
=item C<total_number_of_samples_needed>:
As its name implies, this parameter specifies the total number of data samples that
will be written out to the output file, provided this number is divisible by the
number of clusters you asked for. If the divisibility condition is not satisfied,
the number of data samples actually written out will be as close as possible to the
number you specify, subject to the condition that an equal number of samples is
created for each cluster.
=item C<number_of_clusters_on_sphere>:
Again, as its name implies, this parameter specifies the number of clusters that will
be produced on the surface of a unit sphere.
=item C<show_hidden_in_3D_plots>:
This parameter is important for the visualization of the clusters and it controls
whether you will see the generated data on the back side of the sphere. If the
clusters are not too spread out, you can set this parameter to 0 and see all the
clusters at once. However, when the clusters are spread out, it can be visually
confusing to see the data on the back side of the sphere. Note that no matter how
you set this parameter, you can interact with the 3D plot of the data and rotate it
with your mouse pointer to see all of the data that is generated.
=back
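The divisibility rule described for C<total_number_of_samples_needed> can be illustrated with a little arithmetic. The following is only a sketch: the helper name C<samples_actually_written> is hypothetical and not part of the module, and it assumes that the per-cluster count is obtained by rounding down, which is consistent with the example file name 3_clusters_on_a_sphere_498_samples.csv but is not confirmed by the documentation.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT part of the module. Assumes the number of
# samples per cluster is the requested total divided by the number of
# clusters, rounded down.
sub samples_actually_written {
    my ($total_requested, $number_of_clusters) = @_;
    die "need at least one cluster\n" unless $number_of_clusters > 0;
    my $per_cluster = int( $total_requested / $number_of_clusters );
    # Every cluster gets the same number of samples, so the total
    # written is a multiple of the number of clusters.
    return ( $per_cluster, $per_cluster * $number_of_clusters );
}

my ($per_cluster, $written) = samples_actually_written(500, 3);
print "per cluster: $per_cluster, total written: $written\n";
# prints "per cluster: 166, total written: 498"
```

Under this assumption, asking for 500 samples in 3 clusters yields 498 samples, whereas asking for 3000 yields exactly 3000.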
=over 4
=item B<gen_data_and_write_to_csv()>:
$training_data_gen->gen_data_and_write_to_csv();
As its name implies, this method generates the data according to the parameters
specified in the DataGenerator constructor.
=item B<visualize_data_on_sphere()>:
$training_data_gen->visualize_data_on_sphere($output_file);
You can use this method to visualize the clusters produced by the data generator.
Since the clusters are located at randomly selected points on a unit sphere, looking
at the output visually lets you quickly decide whether to keep what the data generator
has produced or to reject it and try again.
=back
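Each generated sample is an (azimuth, elevation) pair that the generator converts to a point on the unit sphere before writing it to the CSV file. The sketch below reproduces that conversion in isolation so you can check that the resulting coordinates have unit norm. The helper name C<sphere_point> is hypothetical, and the sketch assumes that both angles are supplied in radians, since that is what Perl's C<cos> and C<sin> expect.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT part of the module: map an (azimuth, elevation)
# pair, both assumed to be in radians, to Cartesian coordinates on the
# unit sphere.
sub sphere_point {
    my ($azimuth, $elevation) = @_;
    my $x = cos($elevation) * cos($azimuth);
    my $y = cos($elevation) * sin($azimuth);
    my $z = sin($elevation);
    return ($x, $y, $z);
}

my $pi = 4 * atan2(1, 1);
# 45 degrees of azimuth and 30 degrees of elevation, expressed in radians
my ($x, $y, $z) = sphere_point( $pi / 4, $pi / 6 );
printf "x=%.4f y=%.4f z=%.4f squared norm=%.4f\n",
       $x, $y, $z, $x**2 + $y**2 + $z**2;
```

Whatever angles you pass in, the squared norm should equal 1 up to floating-point rounding, which is what places every sample on the surface of the sphere.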
=head1 HOW THE CLUSTERS ARE OUTPUT
When the option C<terminal_output> is set in the constructor of the
C<LinearManifoldDataClusterer> class, the clusters are displayed on the terminal
screen.
And, when the option C<write_clusters_to_files> is set in the same constructor, the
module dumps the clusters in files named
cluster0.txt
cluster1.txt
cluster2.txt
...
...
in the directory in which you execute the module. The number of such files will
equal the number of clusters formed. All such existing files in the directory are
destroyed before any fresh ones are created. Each cluster file contains the symbolic
tags of the data samples in that cluster.
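If you want to post-process the cluster files in your own script, a sketch along the following lines may be a useful starting point. The helper name C<read_cluster_files> is hypothetical and not part of the module; the sketch assumes that the files follow the clusterN.txt naming shown above and that the symbolic tags inside each file are separated by whitespace.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT part of the module: read every cluster<N>.txt
# file in a directory and return a hash ref that maps the cluster index
# to an array ref of the sample tags found in that file.
sub read_cluster_files {
    my ($dir) = @_;
    my %clusters;
    foreach my $file ( glob "$dir/cluster*.txt" ) {
        # Skip anything that does not match the cluster<N>.txt pattern
        my ($index) = $file =~ /cluster(\d+)\.txt$/ or next;
        open my $fh, '<', $file or die "Unable to open $file: $!";
        local $/ = undef;                       # slurp the whole file
        my $contents = <$fh>;
        close $fh;
        # Assumption: tags are separated by whitespace
        $clusters{$index} = [ grep { length } split /\s+/, ( $contents // '' ) ];
    }
    return \%clusters;
}

my $clusters = read_cluster_files('.');
foreach my $i ( sort { $a <=> $b } keys %$clusters ) {
    printf "cluster %d has %d samples\n", $i, scalar @{ $clusters->{$i} };
}
```

Run it from the directory in which the module wrote its output. Note that the directory name is passed to C<glob>, so a path containing spaces would need extra quoting.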
Assuming that the data dimensionality is 3, if you have set the constructor parameter
C<visualize_each_iteration>, the module will deposit in your directory printable PNG
files that are point plots of the different clusters in the different iterations of
the algorithm. Such printable files are also generated for the initial clusters at
the beginning of the iterations and for the final clusters in Phase 1 of the
algorithm. You will also see in your directory a PNG file for the clustering result
produced by graph partitioning in Phase 2.
Also, as mentioned previously, a call to C<linear_manifold_clusterer()> in your own
code will return the clusters to you directly.
=head1 REQUIRED
This module requires the following modules:
List::Util
File::Basename
Math::Random
Graphics::GnuplotIF
Math::GSL::Matrix
POSIX
=head1 THE C<examples> DIRECTORY
The C<examples> directory contains the following four scripts that you might want to
play with in order to become familiar with the module:
example1.pl
example2.pl
example3.pl
example4.pl
These scripts demonstrate linear-manifold based clustering on the 3-dimensional data
in the following three CSV files:
3_clusters_on_a_sphere_498_samples.csv (used in example1.pl and example4.pl)
3_clusters_on_a_sphere_3000_samples.csv (used in example2.pl)